The Complete Guide to Protein Structure Prediction: Methods, Tools & Best Practices

Protein structure prediction has been transformed more completely in the last five years than in the previous five decades. AlphaFold changed what’s possible — but it didn’t replace every tool that came before it, and it introduced new questions about how to use and trust predicted structures. This guide covers the full landscape: why structure matters, how prediction methods work, which tool to use for your situation, and how to go from sequence to a structure ready for docking or MD simulation.

Why protein structure matters

A protein’s function is determined almost entirely by its three-dimensional shape. The amino acid sequence encodes this shape, but the sequence alone — a string of letters — tells you little about what the protein does, where it binds, how it moves, or how to design a drug that modulates it. The 3D structure is where biological function becomes interpretable.

Drug discovery. A protein structure reveals the binding pocket, the cavity where a drug molecule can insert and interfere with function. Without a structure, structure-based virtual screening and molecular docking cannot begin.

Understanding disease mechanisms. Disease mutations often work by distorting protein structure or disrupting binding interfaces. Seeing the structural consequence of a mutation at atomic resolution can reveal the mechanism.

Protein engineering. Designing enzymes with new activities, antibodies with improved binding, or thermostable variants for industrial applications all require knowing where the active site is and how the protein folds.

Understanding protein-protein interactions. How proteins recognize and bind each other, and how to disrupt those interactions therapeutically, is far easier to work out from structure. Complex structure prediction is one of the most active areas of current research.

For most of molecular biology’s history, determining a protein structure required years of difficult experimental work — crystallography, NMR, or cryo-EM. Computational prediction offers a faster route, though with important caveats about accuracy, confidence, and appropriate use.

The protein folding problem

Proteins are linear chains of amino acids that spontaneously fold into specific three-dimensional shapes in water. The sequence determines the structure — but predicting that structure from sequence alone is hard enough that it was considered one of the grand challenges of biology for over fifty years.

The difficulty comes from the size of conformational space, an argument known as Levinthal's paradox. A protein of 100 amino acids has an astronomically large number of possible conformations. If it sampled them all randomly, it would take far longer than the age of the universe to find the folded state. Yet most proteins fold in microseconds to seconds. Evolution has found sequences that fold reliably and quickly to specific functional shapes, and the information for doing so is encoded in the sequence.
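To make the scale of that argument concrete, here is a back-of-the-envelope calculation. The three inputs (three conformations per residue, one conformation sampled every 0.1 picoseconds, a 100-residue chain) are illustrative assumptions rather than measured values; the conclusion holds for any reasonable choice.

```python
# Back-of-the-envelope Levinthal estimate.
# All inputs below are illustrative assumptions, not measured values.
RESIDUES = 100
STATES_PER_RESIDUE = 3        # assumed conformations available to each residue
SAMPLES_PER_SECOND = 1e13     # assumed sampling rate (one conformation per 0.1 ps)
SECONDS_PER_YEAR = 3.156e7
AGE_OF_UNIVERSE_YEARS = 1.38e10

# Total number of chain conformations if residues vary independently.
conformations = STATES_PER_RESIDUE ** RESIDUES

# Time needed to visit every conformation once.
search_years = conformations / SAMPLES_PER_SECOND / SECONDS_PER_YEAR

print(f"conformations:            {conformations:.2e}")
print(f"exhaustive search:        {search_years:.2e} years")
print(f"ratio to age of universe: {search_years / AGE_OF_UNIVERSE_YEARS:.2e}")
```

Under these assumptions the exhaustive search takes on the order of 10^27 years, which is why folding cannot be a random search.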

Anfinsen's refolding experiments in the early 1960s, recognized with a Nobel Prize in 1972, showed that a denatured protein can refold to its native state without any additional information. This established that the sequence contains all the information needed for folding. What took more than half a century longer was learning how to extract that information computationally.

The four levels of protein structure
Primary — the amino acid sequence. Secondary — local folding patterns (alpha helices, beta sheets). Tertiary — the complete 3D fold of a single protein chain. Quaternary — the arrangement of multiple chains in a complex. Structure prediction typically targets tertiary structure; multimer prediction addresses quaternary.

A brief history of prediction methods

Understanding how prediction methods evolved helps explain why different tools exist and when each is appropriate. The field moved through several distinct paradigms:

1970s–1990s: Physics-based methods
Early approaches tried to predict structure by minimizing energy functions — finding the conformation of lowest energy. These worked reasonably for small peptides but scaled poorly to full proteins. Rosetta emerged from this tradition, combining energy functions with fragment-based sampling.
1990s–2010s: Homology modeling
The key insight: if a protein sequence is similar to one with a known structure, the unknown protein probably has a similar fold. Homology modeling uses an experimentally determined structure as a template. Programs like MODELLER and Swiss-Model automate this. Limited to proteins with detectable homologs in the PDB.
2010s: Co-evolution and contact prediction
Residues that are spatially close in 3D structure tend to co-evolve — when one mutates, the other compensates. Mining multiple sequence alignments for these co-evolutionary signals allowed prediction of residue-residue contacts, which constrain the fold. This was the conceptual foundation AlphaFold would build on.
2020–2021: AlphaFold2, the paradigm shift
AlphaFold2 achieved accuracy approaching experimental resolution for many proteins, winning CASP14 by a margin that shocked the field. By combining multiple sequence alignments with attention-based neural networks trained on the PDB, it learned to predict structure with unprecedented accuracy — largely solving single-chain structure prediction for proteins with detectable homologs.
2022–present: Generalization and complexes
ESMFold and other language model-based approaches showed that MSAs aren’t always necessary — protein language models trained on sequences alone can predict structure at competitive accuracy. AlphaFold3 extended prediction to protein-DNA, protein-RNA, protein-ligand, and protein-protein complexes, dramatically expanding what’s structurally predictable.
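The co-evolution idea from the 2010s can be illustrated with mutual information between alignment columns, the simplest covariation score. This is a minimal sketch on a made-up four-sequence alignment; practical contact predictors used more elaborate statistics (for example direct coupling analysis) to correct for indirect correlations and phylogenetic bias, which this example does not attempt.

```python
import math
from collections import Counter

def mutual_information(msa, i, j):
    """Mutual information (in bits) between alignment columns i and j."""
    n = len(msa)
    col_i = Counter(seq[i] for seq in msa)
    col_j = Counter(seq[j] for seq in msa)
    pairs = Counter((seq[i], seq[j]) for seq in msa)
    mi = 0.0
    for (a, b), count in pairs.items():
        p_ab = count / n
        # Compare the observed pair frequency with what independence predicts.
        mi += p_ab * math.log2(p_ab / ((col_i[a] / n) * (col_j[b] / n)))
    return mi

# Hypothetical toy alignment: columns 1 and 3 change together
# (K pairs with E, R pairs with D); column 2 varies independently.
toy_msa = [
    "AKLEG",
    "AKVEG",
    "ARLDG",
    "ARVDG",
]

coupled = mutual_information(toy_msa, 1, 3)      # covarying pair
uncoupled = mutual_information(toy_msa, 1, 2)    # independent pair
print(coupled, uncoupled)  # 1.0 0.0
```

A high score between two columns suggests the residues may be in contact; a map of such scores is what constrained the fold in pre-AlphaFold contact prediction.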

AlphaFold2 and AlphaFold3

AlphaFold2 is the tool that changed the field. Released as open source by DeepMind in 2021, it uses a deep neural network whose core module, the Evoformer, processes multiple sequence alignments together with residue-pair features; a structure module then outputs 3D coordinates for every heavy atom in the protein. The network was trained on structures from the PDB. Its predictions for most globular, well-folded proteins are close enough to crystal structures to be directly useful for structural biology and drug discovery.

The AlphaFold Protein Structure Database, maintained jointly by DeepMind and EMBL-EBI, provides pre-computed structures for over 200 million proteins — essentially the entire known protein universe. For most standard proteins, the structure you need is already there waiting to be downloaded rather than predicted.

Access AlphaFold2 predictions: alphafold.ebi.ac.uk for the database. For sequences not in the database, run your own predictions via ColabFold, which runs as a notebook on Google Colab (colab.research.google.com).
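Database lookups can also be scripted. The sketch below assumes the database exposes a JSON endpoint at `/api/prediction/<accession>` and that each returned entry carries `pdbUrl` and `paeDocUrl` fields; both are assumptions to verify against the current AlphaFold Database API documentation, since paths and field names can change between releases. The helper names are hypothetical.

```python
import json
import urllib.request

# Assumed endpoint pattern for the AlphaFold Database prediction API.
AFDB_API = "https://alphafold.ebi.ac.uk/api/prediction/{accession}"

def prediction_api_url(accession):
    """Build the (assumed) API URL for a UniProt accession."""
    return AFDB_API.format(accession=accession.strip().upper())

def extract_file_urls(entries):
    """Pick download links out of the parsed JSON response.

    Assumes the response is a list of entries with 'pdbUrl' and
    'paeDocUrl' keys; a missing key comes back as None.
    """
    if not entries:
        return None
    first = entries[0]
    return {"pdb": first.get("pdbUrl"), "pae": first.get("paeDocUrl")}

def fetch_entry(accession, timeout=30):
    """Query the database over the network and return the file links."""
    url = prediction_api_url(accession)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return extract_file_urls(json.load(response))
```

Calling `fetch_entry("P69905")` (human hemoglobin alpha, used here only as an example accession) would return the links to download, provided the endpoint behaves as assumed.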

AlphaFold3

AlphaFold3, released in 2024, extended the AlphaFold framework to predict structures of molecular complexes: not just single proteins, but proteins bound to DNA, RNA, small molecule ligands, and other proteins. It replaces AlphaFold2's structure module with a diffusion-based generator and the Evoformer with a simpler trunk called the Pairformer. Because diffusion is a sampling process, it can generate multiple plausible structures rather than a single prediction.

AlphaFold3 is accessible through the AlphaFold Server (alphafoldserver.com) for research use, though with some query limitations. It represents the current state-of-the-art for protein-ligand complex prediction, with accuracy competitive with docking for many systems — though it doesn’t yet replace molecular docking for large-scale screening.

AlphaFold3 license restrictions
AlphaFold3’s model weights are available under a license that restricts commercial use and requires permission for some applications. The web server is available for non-commercial research. If your work has any commercial application, check the license carefully before using AF3 predictions.

ESMFold and protein language models

ESMFold, released by Meta AI in 2022, takes a fundamentally different approach from AlphaFold2. Instead of requiring multiple sequence alignments — searching databases for evolutionary relatives — it uses a protein language model (ESM-2) trained on sequences alone, then feeds the representations from that model into a structure prediction head.

The practical implication is speed. ESMFold is far faster than AlphaFold2, typically by one to two orders of magnitude depending on sequence length, because it skips the computationally expensive MSA step. For a typical protein, ESMFold runs in seconds rather than minutes. This makes it the preferred choice for large-scale screening, for example predicting structures for thousands of sequences in a pipeline.

The tradeoff is accuracy. For proteins with many sequence relatives (good MSA coverage), AlphaFold2 is significantly more accurate than ESMFold. For “orphan” proteins with few or no detectable homologs — where MSA-based methods struggle anyway — the accuracy gap narrows considerably. ESMFold is accessible at esmatlas.com and via an API for batch predictions.

Tool | Method | Speed | Accuracy | Complexes? | Access
AlphaFold2 | MSA + Evoformer | Minutes–hours | Best for single chains | Multimer only | Free (DB + ColabFold)
AlphaFold3 | Diffusion + MSA | Minutes | Best for complexes | Yes (full) | Free (server, limits)
ESMFold | Language model | Seconds | Good (less than AF2) | No | Free (API)
RoseTTAFold2 | MSA + SE(3) network | Minutes | Excellent | RoseTTAFold2NA for nucleic acids | Free (server/local)
Swiss-Model | Homology modeling | Seconds–minutes | Good when template exists | Limited | Free (web server)
MODELLER | Homology modeling | Minutes | Good, more control | Yes (manual) | Free (academic)

Homology modeling — when it still matters

With AlphaFold2 available for nearly every known protein, homology modeling might seem obsolete. It isn’t — but its role has changed. There are several situations where homology modeling remains the better or necessary choice:

When you need a specific conformation. AlphaFold2 predicts a single “consensus” structure with no ligands or cofactors present, and you cannot ask it for a particular state. If you need a model in the active conformation, the ligand-bound conformation, or a specific mutant conformation, homology modeling using a template in the desired state can be more appropriate than trusting AlphaFold’s prediction.

When experimental templates provide information AlphaFold misses. A high-resolution crystal structure of a close homolog in complex with a relevant ligand contains information that AlphaFold cannot access from sequence alone — the conformation of the binding site under the influence of that specific ligand. Using that structure as a homology template preserves this conformational detail.

When you need highly controlled model building. Tools like MODELLER give you explicit control over every step of the modeling process — which template to use, how to align sequences, how to handle insertions and deletions. This level of control matters for tasks like predicting the effect of a specific mutation or building a model optimized for a particular downstream application.

When target sequence identity to a known structure exceeds ~50%. In this range, homology modeling consistently produces high-quality models, and the template provides a reliable backbone geometry that pure sequence-based methods can struggle with for specific loop regions.
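The ~50% rule of thumb above needs a sequence identity figure. The sketch below computes percent identity from two already-aligned sequences, counting only columns where neither has a gap. That denominator is one common convention among several, so values from other tools may differ slightly; the two sequences are invented for illustration.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over aligned columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # skip insertion/deletion columns
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0

# Hypothetical aligned target and template sequences.
target = "MKT-AYIAKQR"
template = "MKTLAYLAKER"

identity = percent_identity(target, template)
print(f"{identity:.1f}% identity")  # 80.0% identity
```

Here 8 of the 10 gap-free columns match, so this template would fall in the range where homology modeling is a reasonable choice.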

Understanding pLDDT scores and PAE

AlphaFold doesn’t just predict structure — it also predicts its own confidence. Understanding these confidence metrics is essential for using predicted structures appropriately. Trusting a low-confidence prediction as if it were a crystal structure is one of the most common — and consequential — mistakes in computational structural biology.

pLDDT (predicted Local Distance Difference Test)

pLDDT is a per-residue confidence score ranging from 0 to 100. It estimates how well the predicted local environment of a residue would agree with an experimental structure, if one existed. It is not a measure of whether the protein folds: a disordered region often has low pLDDT not because AlphaFold failed, but because the region has no single stable conformation to predict.

pLDDT confidence scale (per-residue interpretation)
90–100: Very high confidence. Backbone and side chains reliable. Use directly for docking, MD, structure analysis.
70–90: Confident. Backbone positions reliable, side chains less so. Good for most downstream applications with care.
50–70: Low confidence. May represent a disordered or flexible region. Backbone uncertain. Treat with caution; do not dock to low-pLDDT sites.
Below 50: Very low. Likely intrinsically disordered. Position is essentially a guess. Do not use for any structural analysis.
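The scale above can be applied programmatically. AlphaFold2 PDB files store pLDDT in the B-factor column, so a minimal sketch can read it from the fixed-column ATOM records without extra libraries. The band labels mirror the scale above; the `synthetic_ca_record` helper is hypothetical and only builds dummy records so the example runs without a downloaded file.

```python
def plddt_per_residue(pdb_text):
    """Map (chain, residue number) to pLDDT, read from CA-atom B-factors."""
    scores = {}
    for line in pdb_text.splitlines():
        # PDB fixed columns: atom name 13-16, chain 22, resSeq 23-26, B-factor 61-66.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            scores[(line[21], int(line[22:26]))] = float(line[60:66])
    return scores

def confidence_band(plddt):
    """Classify a pLDDT value using the bands from the scale above."""
    if plddt >= 90:
        return "very high"
    if plddt >= 70:
        return "confident"
    if plddt >= 50:
        return "low"
    return "very low"

def synthetic_ca_record(serial, resname, resnum, plddt):
    """Build a fixed-column ATOM record with dummy coordinates (demo only)."""
    return (f"ATOM  {serial:>5}  CA  {resname} A{resnum:>4}    "
            f"{0.0:8.3f}{0.0:8.3f}{0.0:8.3f}{1.0:6.2f}{plddt:6.2f}")

demo_pdb = "\n".join([
    synthetic_ca_record(1, "MET", 1, 95.2),
    synthetic_ca_record(2, "LYS", 2, 74.0),
    synthetic_ca_record(3, "GLY", 3, 42.5),
])

bands = {key: confidence_band(score)
         for key, score in plddt_per_residue(demo_pdb).items()}
print(bands)
```

For a real model, pass the contents of the downloaded file instead of `demo_pdb`, for example `plddt_per_residue(open("model.pdb").read())` with your own file path.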

PAE (Predicted Aligned Error)

PAE is a matrix showing AlphaFold’s confidence in the relative position of every pair of residues. Each cell (i, j) contains the expected error in angstroms for residue j’s position when the structure is aligned on residue i. PAE is most important for interpreting multi-domain proteins and complexes.

A low PAE value between two regions means AlphaFold is confident they are in the correct relative orientation — they form a well-predicted interface or domain. A high PAE value between two regions means their relative orientation is uncertain — even if each region is well-predicted individually, how they sit relative to each other is not reliable. This is the crucial check for multi-domain proteins: low within-domain PAE with high inter-domain PAE means the individual domains are well-predicted but their arrangement is uncertain.

The most important thing most people don’t check
The PAE plot is more important than pLDDT for multi-domain proteins. A protein can have excellent pLDDT throughout yet have high inter-domain PAE — meaning each domain is well predicted but their relative orientation is uncertain. Using such a structure for docking or MD without checking PAE means you may be simulating a physically implausible domain arrangement. Always look at the PAE matrix for any protein with more than one domain.
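The inter-domain check can be reduced to a few averages over blocks of the PAE matrix. This is a minimal sketch assuming the matrix is already loaded as a nested list (rows = aligned residue, columns = scored residue) and that you know the domain boundaries; the 4-residue matrix and its values are invented for illustration.

```python
def mean_pae(pae, rows, cols):
    """Mean PAE over the block with aligned residues in rows, scored residues in cols."""
    values = [pae[i][j] for i in rows for j in cols]
    return sum(values) / len(values)

def domain_pae_summary(pae, domain_a, domain_b):
    """Within-domain and between-domain mean PAE for two residue index ranges."""
    return {
        "within_a": mean_pae(pae, domain_a, domain_a),
        "within_b": mean_pae(pae, domain_b, domain_b),
        # PAE is not symmetric, so average both off-diagonal blocks.
        "between": (mean_pae(pae, domain_a, domain_b)
                    + mean_pae(pae, domain_b, domain_a)) / 2,
    }

# Hypothetical toy matrix in angstroms: residues 0-1 form one domain,
# residues 2-3 another. Each domain is confident; their arrangement is not.
toy_pae = [
    [0.5, 2.0, 20.0, 22.0],
    [2.0, 0.5, 21.0, 23.0],
    [19.0, 20.0, 0.5, 1.5],
    [21.0, 24.0, 1.5, 0.5],
]

summary = domain_pae_summary(toy_pae, range(0, 2), range(2, 4))
print(summary)  # {'within_a': 1.25, 'within_b': 1.0, 'between': 21.25}
```

Low within-domain values next to a high between-domain value is exactly the pattern described above: trust each domain, not their relative placement.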

How to choose the right method

Your situation | Recommended approach
Protein is in the AlphaFold Database | Download from alphafold.ebi.ac.uk (fastest, no compute needed)
Novel sequence, not in AFDB, need highest accuracy | AlphaFold2 via ColabFold (free, Google Colab)
Need structure fast, accuracy slightly less critical | ESMFold (seconds per sequence via web API)
Need protein-ligand complex structure | AlphaFold3 server (best current option for complexes)
Need protein-protein or protein-DNA complex | AlphaFold-Multimer (via ColabFold) or AlphaFold3
Close homolog (> 50% identity) with known structure | Swiss-Model for speed, MODELLER for control
Need specific conformation (active, ligand-bound) | Homology modeling using template in desired state
Predicting thousands of sequences in a pipeline | ESMFold API or ColabFold batch mode
Protein is intrinsically disordered throughout | No tool predicts reliably; consider IDP-specific tools or NMR

The complete structure prediction workflow

From sequence to a validated, publication-ready structure model — here is the standard workflow, regardless of which prediction tool you use:

1. Check the AlphaFold Database first
Before running any prediction, search alphafold.ebi.ac.uk with your UniProt accession or sequence. If a pre-computed model exists and your protein is relatively well-studied, use it rather than re-predicting — the AFDB models are produced with the full AlphaFold2 pipeline including deep MSAs, which is hard to replicate with free compute. Download the PDB file and the confidence JSON.
2. Run the prediction (if needed)
For novel sequences: run ColabFold on Google Colab for AlphaFold2-quality predictions at no cost. For speed: use ESMFold via the API. For complexes: use the AlphaFold3 server. Always request multiple models (typically 5) and compare them — agreement between models is a strong confidence signal.
3. Assess confidence with pLDDT and PAE
Visualize pLDDT in PyMOL (color by B-factor, which AF2 uses to store pLDDT values). Identify low-confidence regions. Check the PAE matrix — especially if your protein has multiple domains. High-confidence regions (pLDDT > 70) can be used for downstream analysis; low-confidence regions should be treated as placeholder positions.
4. Run structure quality assessment
Submit the model to MolProbity (molprobity.biochem.duke.edu) for stereochemical validation — Ramachandran plot analysis, rotamer outliers, clashscore. For homology models, also calculate DOPE or QMEAN scores. A good predicted structure should have a MolProbity score comparable to crystal structures of similar resolution.
5. Compare to experimental data where available
If any experimental data exists — even partial — compare it to your model. Mutagenesis data pointing to key binding residues, HDX-MS data showing flexible regions, SAXS envelope data — all of these can validate or contradict the predicted structure. Predicted structures that agree with independent experimental observations are significantly more trustworthy.
6. Prepare for downstream use
For docking: remove low-confidence terminal regions, add hydrogens, assign protonation states. For MD: run energy minimization to relieve any geometry issues introduced by the prediction; consider running a short simulation to allow the structure to relax before production. The preparation tutorial on this site covers both workflows in detail.
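The terminal-trimming part of the last step can be sketched as follows. It takes a per-residue pLDDT list and returns the index range that remains after removing low-confidence residues from both ends. The cutoff of 70 follows the confidence bands used in this guide; the scores are invented, and whether an internal low-confidence loop should also be removed is a separate judgment this sketch leaves alone.

```python
def confident_core(plddt, threshold=70.0):
    """Return (start, end) with end exclusive, trimming low-confidence termini only."""
    start = 0
    while start < len(plddt) and plddt[start] < threshold:
        start += 1
    end = len(plddt)
    while end > start and plddt[end - 1] < threshold:
        end -= 1
    return start, end

# Hypothetical per-residue scores: floppy termini, one internal flexible loop.
scores = [31.0, 44.5, 68.0, 88.2, 93.1, 55.0, 91.4, 72.3, 49.9, 35.2]

start, end = confident_core(scores)
print(start, end)          # 3 8
print(scores[start:end])   # [88.2, 93.1, 55.0, 91.4, 72.3]
```

Only the ends are trimmed, so the internal residue at 55.0 is kept; cutting it out would break the chain and is rarely what a docking or MD setup wants.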

Using predicted structures downstream

A predicted structure is a starting point, not an endpoint. How you use it depends on what question you’re asking — and on how confident the prediction is in the regions that matter for your question.

For molecular docking: AlphaFold structures work well as docking receptors when the binding site has high pLDDT (above 70) and the PAE between the binding site and the rest of the protein is low. The main caveat is that AlphaFold2 predicts the protein without its ligand, so the pocket may differ from the ligand-bound conformation. For targets with known induced-fit behavior, ensemble docking or MD-based receptor relaxation helps address this.

For MD simulation: AlphaFold structures generally simulate well once properly prepared. The key step is energy minimization — AF2 models sometimes have small geometric imperfections (slightly non-ideal bond lengths or angles) that need to be resolved before dynamics can start. Low-pLDDT disordered regions will rapidly deviate from their predicted positions in MD, which is expected behavior — they’re genuinely flexible.

For structure-based drug design: Use AlphaFold structures as a starting point for hypothesis generation. Experimental validation of any binding site identified from a predicted structure remains important — the binding site geometry in the predicted structure may not perfectly match the true bound conformation.

The cross-pillar connection
The three content pillars on this site form a natural sequential workflow. Structure prediction (this pillar) gives you the 3D structure. Molecular docking uses that structure to predict how ligands bind. Molecular dynamics validates binding stability and calculates binding free energy. Most serious computational drug discovery projects use all three in sequence — predicted structure → docking screen → MD validation.

Next steps

This guide has covered the landscape of protein structure prediction: why it matters, how the methods evolved, what the current tools are and when to use each, how to interpret confidence metrics, and what the full workflow looks like. The tool-specific tutorials on this site go deeper on each tool and use case.

The one-paragraph summary

Protein structure prediction has been transformed by deep learning, and AlphaFold2 has made structure models available for nearly every known protein sequence, with confidence that varies by protein and by region. For most researchers, the workflow is: check the AlphaFold Database first, assess confidence with pLDDT and PAE, run quality checks with MolProbity, and prepare the structure for downstream use. Homology modeling remains valuable for specific conformational states and highly controlled modeling. ESMFold is the right choice when speed matters more than top accuracy. AlphaFold3 is the current best option for molecular complexes. In all cases, a predicted structure is a hypothesis, one that must be interpreted in light of its confidence scores and validated against any available experimental data.
