The Complete Guide to Protein Structure Prediction: Methods, Tools & Best Practices
Protein structure prediction has been transformed more completely in the last five years than in the previous five decades. AlphaFold changed what’s possible — but it didn’t replace every tool that came before it, and it introduced new questions about how to use and trust predicted structures. This guide covers the full landscape: why structure matters, how prediction methods work, which tool to use for your situation, and how to go from sequence to a structure ready for docking or MD simulation.
Why protein structure matters
A protein’s function is determined almost entirely by its three-dimensional shape. The amino acid sequence encodes this shape, but the sequence alone — a string of letters — tells you little about what the protein does, where it binds, how it moves, or how to design a drug that modulates it. The 3D structure is where biological function becomes interpretable.
For most of molecular biology’s history, determining a protein structure required years of difficult experimental work — crystallography, NMR, or cryo-EM. Computational prediction offers a faster route, though with important caveats about accuracy, confidence, and appropriate use.
The protein folding problem
Proteins are linear chains of amino acids that spontaneously fold into specific three-dimensional shapes in water. The sequence determines the structure — but predicting that structure from sequence alone is hard enough that it was considered one of the grand challenges of biology for over fifty years.
The difficulty comes from the conformational space problem, often called Levinthal’s paradox. A protein of 100 amino acids has an astronomically large number of possible conformations. If it sampled them all randomly, it would take longer than the age of the universe to find the folded state. Yet real proteins typically fold on timescales of microseconds to seconds. Evolution has found sequences that fold reliably and quickly to specific functional shapes, and the information for doing so is encoded in the sequence.
Anfinsen’s refolding experiments on ribonuclease A in the early 1960s showed that a denatured protein can refold to its native state without any additional information, which established that the sequence contains all the information needed for folding, at least for small globular proteins. What took more than 50 years was learning how to extract that information computationally.
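The scale of the conformational space problem is easy to check with a back-of-envelope calculation. The sketch below uses purely illustrative assumptions (three backbone states per residue and one conformation sampled every 0.1 picoseconds); neither number comes from a measurement, and changing them shifts the result without changing the conclusion.

```python
# Back-of-envelope Levinthal estimate. Every input here is an illustrative assumption.
RESIDUES = 100
STATES_PER_RESIDUE = 3        # assumed: coarse number of backbone states per residue
SAMPLES_PER_SECOND = 1e13     # assumed: one conformation tried every 0.1 picosecond
SECONDS_PER_YEAR = 3.156e7
AGE_OF_UNIVERSE_YEARS = 1.38e10


def exhaustive_search_years(residues, states_per_residue, samples_per_second):
    """Years needed to visit every conformation once by random search."""
    conformations = states_per_residue ** residues
    seconds = conformations / samples_per_second
    return seconds / SECONDS_PER_YEAR


years = exhaustive_search_years(RESIDUES, STATES_PER_RESIDUE, SAMPLES_PER_SECOND)
print(f"conformations:            {float(STATES_PER_RESIDUE ** RESIDUES):.2e}")
print(f"exhaustive search time:   {years:.2e} years")
print(f"multiples of universe age: {years / AGE_OF_UNIVERSE_YEARS:.2e}")
```

Under these assumptions the search time comes out on the order of 10^27 years, which is why folding cannot be a random search.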
A brief history of prediction methods
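Understanding how prediction methods evolved helps explain why different tools exist and when each is appropriate. The field moved through several distinct paradigms, from template-based homology modeling to the MSA-driven deep learning and protein language models covered in the sections below.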
AlphaFold2 and AlphaFold3
AlphaFold2 is the tool that changed the field. After its performance at CASP14 in 2020, DeepMind published and open-sourced it in 2021. Its network pairs an attention-based trunk called the Evoformer with a structure module. It was trained on experimental structures from the PDB and takes a multiple sequence alignment of the target as input to predict 3D coordinates for every heavy (non-hydrogen) atom in a protein. Its predictions for most globular, well-folded proteins are close enough to crystal structures to be directly useful for structural biology and drug discovery.
The AlphaFold Protein Structure Database, maintained jointly by DeepMind and EMBL-EBI, provides pre-computed structures for over 200 million proteins — essentially the entire known protein universe. For most standard proteins, the structure you need is already there waiting to be downloaded rather than predicted.
Access AlphaFold2 predictions: alphafold.ebi.ac.uk for the database, or run your own predictions via ColabFold (colab.research.google.com) for sequences not in the database.
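Downloading a pre-computed model can be scripted. The following is a minimal sketch using only the Python standard library. It assumes the AlphaFold Database REST endpoint at `/api/prediction/<UniProt accession>` and a `pdbUrl` field in its JSON response; both should be checked against the current database documentation, since file versions and field names can change. The function names are hypothetical.

```python
import json
import re
import urllib.request

# Assumption: public AFDB REST endpoint; verify against the current AFDB API docs.
AFDB_API = "https://alphafold.ebi.ac.uk/api/prediction/{accession}"

# UniProt accession pattern as given in the UniProt help pages.
UNIPROT_RE = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
)


def afdb_api_url(accession):
    """Build the metadata URL for one UniProt accession, validating its format."""
    acc = accession.strip().upper()
    if not UNIPROT_RE.fullmatch(acc):
        raise ValueError(f"not a UniProt accession: {accession!r}")
    return AFDB_API.format(accession=acc)


def download_afdb_model(accession, out_path, timeout=30):
    """Fetch the entry metadata, then save the PDB file it points to."""
    with urllib.request.urlopen(afdb_api_url(accession), timeout=timeout) as resp:
        entries = json.load(resp)
    if not entries:
        raise LookupError(f"no AFDB entry for {accession}")
    pdb_url = entries[0]["pdbUrl"]  # assumed field name in the JSON response
    with urllib.request.urlopen(pdb_url, timeout=timeout) as resp:
        data = resp.read()
    with open(out_path, "wb") as handle:
        handle.write(data)
    return pdb_url
```

A call such as `download_afdb_model("P69905", "model.pdb")` would then save the model for that accession, provided the entry exists and the network is reachable.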
AlphaFold3
AlphaFold3, released in 2024, extended the AlphaFold framework to predict structures of molecular complexes: not just single proteins, but proteins bound to DNA, RNA, small molecule ligands, and other proteins. It replaces AlphaFold2’s structure module with a diffusion-based module (and the Evoformer with a simpler trunk called the Pairformer), which allows it to generate multiple plausible structure samples rather than a single prediction.
AlphaFold3 is accessible through the AlphaFold Server (alphafoldserver.com) for non-commercial research use, though with daily job limits and a restricted set of supported ligands. It represents the current state of the art for protein-ligand complex prediction, with published benchmarks reporting pose accuracy that matches or exceeds classical docking on many systems, though it doesn’t yet replace molecular docking for large-scale screening.
ESMFold and protein language models
ESMFold, released by Meta AI in 2022, takes a fundamentally different approach from AlphaFold2. Instead of requiring multiple sequence alignments — searching databases for evolutionary relatives — it uses a protein language model (ESM-2) trained on sequences alone, then feeds the representations from that model into a structure prediction head.
The practical implication is speed. ESMFold can be more than an order of magnitude faster than AlphaFold2 because it skips the computationally expensive MSA step, with the largest gains on short sequences. For a typical protein, ESMFold runs in seconds rather than minutes. This makes it the preferred choice for large-scale screening, such as predicting structures for thousands of sequences in a pipeline.
The tradeoff is accuracy. For proteins with many sequence relatives (good MSA coverage), AlphaFold2 is significantly more accurate than ESMFold. For “orphan” proteins with few or no detectable homologs — where MSA-based methods struggle anyway — the accuracy gap narrows considerably. ESMFold is accessible at esmatlas.com and via an API for batch predictions.
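For batch use, a request to the ESMFold API can be scripted along the following lines. This is a sketch only: the endpoint URL and the convention of sending the raw sequence as the POST body are assumptions based on the publicly documented ESM Atlas API and may have changed, so check the current documentation before relying on it. The helper names are hypothetical.

```python
import urllib.request

# Assumption: ESM Atlas folding endpoint; confirm it is still current before use.
ESMFOLD_URL = "https://api.esmatlas.com/foldSequence/v1/pdb/"

STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def clean_sequence(raw):
    """Drop FASTA headers and whitespace, uppercase, and reject non-standard letters."""
    lines = [line.strip() for line in raw.strip().splitlines()]
    seq = "".join(line for line in lines if not line.startswith(">")).upper()
    unexpected = sorted(set(seq) - STANDARD_AMINO_ACIDS)
    if not seq:
        raise ValueError("empty sequence")
    if unexpected:
        raise ValueError(f"non-standard residues: {''.join(unexpected)}")
    return seq


def fold_with_esmfold(raw_sequence, timeout=120):
    """POST one sequence and return the predicted structure as PDB text."""
    body = clean_sequence(raw_sequence).encode("ascii")
    request = urllib.request.Request(ESMFOLD_URL, data=body, method="POST")
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

Validating the sequence locally before sending it avoids spending a request on input the server would reject, which matters when looping over thousands of sequences.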
| Tool | Method | Speed | Accuracy | Complexes? | Access |
|---|---|---|---|---|---|
| AlphaFold2 | MSA + Evoformer | Minutes–hours | Best for single chains | Multimer only | Free (DB + ColabFold) |
| AlphaFold3 | Diffusion + MSA | Minutes | Best for complexes | Yes — full | Free (server, limits) |
| ESMFold | Language model | Seconds | Good (less than AF2) | No | Free (API) |
| RoseTTAFold2 | MSA + SE(3) network | Minutes | Excellent | RoseTTAFold2NA for nucleic acids | Free (server/local) |
| Swiss-Model | Homology modeling | Seconds–minutes | Good when template exists | Limited | Free (web server) |
| MODELLER | Homology modeling | Minutes | Good — more control | Yes (manual) | Free (academic) |
Homology modeling — when it still matters
With AlphaFold2 available for nearly every known protein, homology modeling might seem obsolete. It isn’t — but its role has changed. There are several situations where homology modeling remains the better or necessary choice:
When you need a specific conformation. AlphaFold2 predicts essentially one static structure for a given input and offers no direct control over which functional state it represents. The result is often assumed to resemble the apo (unbound) state, but that is not guaranteed. If you need a model in the active conformation, the ligand-bound conformation, or a specific mutant conformation, homology modeling using a template in the desired state can be more appropriate than trusting AlphaFold’s single prediction.
When experimental templates provide information AlphaFold misses. A high-resolution crystal structure of a close homolog in complex with a relevant ligand contains information that AlphaFold cannot access from sequence alone — the conformation of the binding site under the influence of that specific ligand. Using that structure as a homology template preserves this conformational detail.
When you need highly controlled model building. Tools like MODELLER give you explicit control over every step of the modeling process — which template to use, how to align sequences, how to handle insertions and deletions. This level of control matters for tasks like predicting the effect of a specific mutation or building a model optimized for a particular downstream application.
When target sequence identity to a known structure exceeds ~50%. In this range, homology modeling consistently produces high-quality models, and the template provides a reliable backbone geometry that pure sequence-based methods can struggle with for specific loop regions.
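The ~50% identity rule of thumb can be checked directly from a pairwise alignment. The sketch below assumes the two sequences are already aligned (equal length, `-` for gaps) and counts identity over columns where both sequences have a residue. Other denominators are in common use, such as the length of the shorter sequence, so the exact percentage depends on that convention.

```python
def percent_identity(aligned_target, aligned_template, gap="-"):
    """Percent identical residues over alignment columns without a gap."""
    if len(aligned_target) != len(aligned_template):
        raise ValueError("aligned sequences must have the same length")
    matches = 0
    compared = 0
    for a, b in zip(aligned_target.upper(), aligned_template.upper()):
        if a == gap or b == gap:
            continue  # skip insertions and deletions
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def template_is_close_homolog(aligned_target, aligned_template, threshold=50.0):
    """True when identity exceeds the threshold used in this guide (~50%)."""
    return percent_identity(aligned_target, aligned_template) > threshold


identity = percent_identity("ACDE-FG", "ACDQWFG")
print(f"{identity:.1f}% identity")
```

In this toy alignment six columns are compared and five match, so the template clears the 50% threshold.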
Understanding pLDDT scores and PAE
AlphaFold doesn’t just predict structure — it also predicts its own confidence. Understanding these confidence metrics is essential for using predicted structures appropriately. Trusting a low-confidence prediction as if it were a crystal structure is one of the most common — and consequential — mistakes in computational structural biology.
pLDDT (predicted Local Distance Difference Test)
pLDDT is a per-residue confidence score ranging from 0 to 100. It estimates how well the predicted local environment of a residue would agree with an experimental structure, if one existed (formally, it is the predicted lDDT-Cα score). It is not a measure of whether the protein folds: a disordered region will often have low pLDDT not because AlphaFold failed, but because the region has no single stable conformation to predict. Low pLDDT can also mark a region that is structured but poorly predicted, so treat it as a flag for closer inspection rather than proof of disorder.
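AlphaFold model files store pLDDT in the B-factor column of each atom record, so the scores can be read with a few lines of fixed-column parsing. The sketch below reads one value per residue from the Cα atoms of a PDB-format file and groups residues into the confidence bands used by the AlphaFold Database (above 90 very high, 70 to 90 confident, 50 to 70 low, below 50 very low). The function names are hypothetical, and the parser deliberately ignores alternate locations and insertion codes.

```python
def plddt_per_residue(pdb_text):
    """Map (chain, residue number) to pLDDT, read from the B-factor of CA atoms."""
    scores = {}
    for line in pdb_text.splitlines():
        # PDB fixed columns: atom name 13-16, chain 22, resSeq 23-26, B-factor 61-66
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            chain = line[21]
            residue_number = int(line[22:26])
            scores[(chain, residue_number)] = float(line[60:66])
    return scores


def confidence_band(plddt):
    """Name of the AlphaFold Database confidence band for one pLDDT value."""
    if plddt > 90:
        return "very high"
    if plddt > 70:
        return "confident"
    if plddt > 50:
        return "low"
    return "very low"


def band_counts(scores):
    """Count residues per confidence band."""
    counts = {"very high": 0, "confident": 0, "low": 0, "very low": 0}
    for value in scores.values():
        counts[confidence_band(value)] += 1
    return counts
```

Typical use is `band_counts(plddt_per_residue(open("model.pdb").read()))`, followed by a closer look at where the low and very low residues fall relative to the region you care about.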
PAE (Predicted Aligned Error)
PAE is a matrix showing AlphaFold’s confidence in the relative position of every pair of residues. Each cell (i, j) contains the expected error in angstroms for residue j’s position when the structure is aligned on residue i. PAE is most important for interpreting multi-domain proteins and complexes.
A low PAE value between two regions means AlphaFold is confident they are in the correct relative orientation — they form a well-predicted interface or domain. A high PAE value between two regions means their relative orientation is uncertain — even if each region is well-predicted individually, how they sit relative to each other is not reliable. This is the crucial check for multi-domain proteins: low within-domain PAE with high inter-domain PAE means the individual domains are well-predicted but their arrangement is uncertain.
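The within-domain versus inter-domain comparison can be made quantitative by averaging blocks of the PAE matrix. The sketch below takes the matrix as a list of rows plus two lists of 0-based residue indices. The domain boundaries are something you supply yourself, and the function names are hypothetical. Because PAE is not symmetric, the inter-domain value averages both directions.

```python
def mean_pae_block(pae, rows, cols):
    """Mean of pae[i][j] over all i in rows and j in cols (0-based indices)."""
    values = [pae[i][j] for i in rows for j in cols]
    if not values:
        raise ValueError("empty residue selection")
    return sum(values) / len(values)


def domain_pae_summary(pae, domain_a, domain_b):
    """Mean PAE within each domain and between the two domains, in angstroms."""
    a, b = list(domain_a), list(domain_b)
    between = 0.5 * (mean_pae_block(pae, a, b) + mean_pae_block(pae, b, a))
    return {
        "within_a": mean_pae_block(pae, a, a),
        "within_b": mean_pae_block(pae, b, b),
        "between": between,
    }


# Toy 4-residue matrix: two tight 2-residue "domains" with an uncertain arrangement.
toy_pae = [
    [0, 1, 20, 22],
    [1, 0, 21, 23],
    [18, 19, 0, 2],
    [20, 21, 2, 0],
]
summary = domain_pae_summary(toy_pae, range(0, 2), range(2, 4))
print(summary)
```

In the toy matrix both domains have a mean internal PAE of 1 Å or less while the inter-domain mean is about 20 Å, which is the pattern that signals well-predicted domains in an unreliable relative arrangement.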
How to choose the right method
| Your situation | Recommended approach |
|---|---|
| Protein is in the AlphaFold Database | Download from alphafold.ebi.ac.uk — fastest, no compute needed |
| Novel sequence, not in AFDB, need highest accuracy | AlphaFold2 via ColabFold (free, Google Colab) |
| Need structure fast, accuracy slightly less critical | ESMFold — seconds per sequence via web API |
| Need protein-ligand complex structure | AlphaFold3 server — best current option for complexes |
| Need protein-protein or protein-DNA complex | AlphaFold-Multimer (via ColabFold) or AlphaFold3 |
| Close homolog (> 50% identity) with known structure | Swiss-Model for speed, MODELLER for control |
| Need specific conformation (active, ligand-bound) | Homology modeling using template in desired state |
| Predicting thousands of sequences in a pipeline | ESMFold API or ColabFold batch mode |
| Protein is intrinsically disordered throughout | No tool predicts reliably — consider IDP-specific tools or NMR |
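For use inside a pipeline, the table above can be encoded as a small decision function. The sketch below is one possible encoding: the parameter names, the order in which the rules are checked, and the cutoff of 1000 sequences for "thousands" are all assumptions made for illustration, not part of any tool.

```python
def recommend_method(*, in_afdb=False, disordered=False, n_sequences=1,
                     complex_type=None, needs_specific_state=False,
                     template_identity=None, speed_over_accuracy=False):
    """Return the recommendation from the table; rule priority is an assumption."""
    if disordered:
        return "No tool predicts reliably; consider IDP-specific tools or NMR"
    if needs_specific_state:
        return "Homology modeling using a template in the desired state"
    if complex_type == "ligand":
        return "AlphaFold3 server"
    if complex_type in ("protein", "dna"):
        return "AlphaFold-Multimer (via ColabFold) or AlphaFold3"
    if n_sequences >= 1000:  # assumed cutoff for a high-throughput pipeline
        return "ESMFold API or ColabFold batch mode"
    if in_afdb:
        return "Download from the AlphaFold Database"
    if template_identity is not None and template_identity > 50:
        return "Swiss-Model for speed, MODELLER for control"
    if speed_over_accuracy:
        return "ESMFold"
    return "AlphaFold2 via ColabFold"


print(recommend_method(in_afdb=True))
print(recommend_method(complex_type="ligand"))
print(recommend_method(template_identity=62.0))
```

Real projects often match more than one row, so treat the returned string as a starting suggestion and adjust the rule order to your own priorities.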
The complete structure prediction workflow
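From sequence to a validated, publication-ready structure model, the standard workflow is the same regardless of which prediction tool you use: obtain the model (checking the AlphaFold Database first), assess confidence with pLDDT and PAE, run geometric quality checks with a tool such as MolProbity, and prepare the structure for downstream use.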
Using predicted structures downstream
A predicted structure is a starting point, not an endpoint. How you use it depends on what question you’re asking — and on how confident the prediction is in the regions that matter for your question.
For molecular docking: AlphaFold structures work well as docking receptors when the binding site has high pLDDT (above 70) and the PAE between the binding site and the rest of the protein is low. The main caveat is that AlphaFold models the protein without its ligand, so the predicted pocket may differ from the ligand-bound conformation. For targets with known induced-fit behavior, ensemble docking or MD-based receptor relaxation helps address this.
For MD simulation: AlphaFold structures generally simulate well once properly prepared. The key step is energy minimization — AF2 models sometimes have small geometric imperfections (slightly non-ideal bond lengths or angles) that need to be resolved before dynamics can start. Low-pLDDT disordered regions will rapidly deviate from their predicted positions in MD, which is expected behavior — they’re genuinely flexible.
For structure-based drug design: Use AlphaFold structures as a starting point for hypothesis generation. Experimental validation of any binding site identified from a predicted structure remains important — the binding site geometry in the predicted structure may not perfectly match the true bound conformation.
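The docking criteria above (binding-site pLDDT above 70 and low PAE between the site and the rest of the protein) can be turned into a quick pre-check. In the sketch below the 70 cutoff comes from this guide, while the 5 Å PAE cutoff is an assumed placeholder for "low" that you should tune for your target. Inputs are a per-residue pLDDT list, a square PAE matrix, and 0-based binding-site residue indices; the function name is hypothetical.

```python
def binding_site_check(plddt, pae, site, min_plddt=70.0, max_pae=5.0):
    """Report whether a binding site meets the pLDDT and PAE criteria.

    max_pae=5.0 is an assumed threshold, not a published standard.
    """
    site = sorted(set(site))
    if not site:
        raise ValueError("binding site selection is empty")
    site_set = set(site)
    rest = [i for i in range(len(plddt)) if i not in site_set]
    lowest_plddt = min(plddt[i] for i in site)
    if rest:
        # PAE is not symmetric, so include both site->rest and rest->site entries.
        values = [pae[i][j] for i in site for j in rest]
        values += [pae[j][i] for i in site for j in rest]
        mean_pae = sum(values) / len(values)
    else:
        mean_pae = 0.0
    return {
        "lowest_site_plddt": lowest_plddt,
        "mean_site_to_rest_pae": mean_pae,
        "ok": lowest_plddt > min_plddt and mean_pae <= max_pae,
    }


toy_plddt = [95.0, 90.0, 85.0, 40.0]
toy_pae = [[0.0 if i == j else 2.0 for j in range(4)] for i in range(4)]
print(binding_site_check(toy_plddt, toy_pae, site=[0, 1]))
print(binding_site_check(toy_plddt, toy_pae, site=[3]))
```

Using the lowest pLDDT in the site rather than the mean is a deliberately conservative choice: a single poorly predicted residue in the pocket can be enough to distort docking poses.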
Next steps
This guide has covered the landscape of protein structure prediction — why it matters, how the methods evolved, what the current tools are and when to use each, how to interpret confidence metrics, and what the full workflow looks like. The tutorials linked below go deeper on each specific tool and use case.
The one-paragraph summary
Protein structure prediction has been transformed by deep learning, and AlphaFold2 has made high-quality structure models available for essentially every known protein. For most researchers, the workflow is: check the AlphaFold Database first, assess confidence with pLDDT and PAE, run quality checks with MolProbity, and prepare the structure for downstream use. Homology modeling remains valuable for specific conformational states and highly controlled modeling. ESMFold is the right choice when speed matters more than top accuracy. AlphaFold3 is the current best option for molecular complexes. In all cases, a predicted structure is a hypothesis — one that must be interpreted in light of its confidence scores and validated against any available experimental data.