How to Use AlphaFold2 with ColabFold: Predict a Protein Structure for Free (2026 Guide)

ColabFold brings AlphaFold2-quality structure prediction to any researcher with a Google account — no installation, no HPC cluster, no cost. This tutorial walks through every step from opening the notebook to downloading and interpreting your results.

  • Step 1: Open notebook
  • Step 2: Enter sequence
  • Step 3: Choose settings
  • Step 4: Run prediction
  • Steps 5–6: Download + interpret

What ColabFold is and how it relates to AlphaFold2

AlphaFold2 is the deep learning model developed by DeepMind that transformed protein structure prediction. Its model weights are publicly available, but running it yourself takes substantial setup: installing dependencies, downloading multi-terabyte sequence databases, and securing access to a GPU. For most researchers, that is a real barrier.

ColabFold, developed by the Steinegger lab at Seoul National University, solves this by packaging AlphaFold2 into a Google Colab notebook that runs entirely in your browser. It replaces AlphaFold2’s MSA generation step — which normally requires downloading massive databases — with a fast remote search against MMseqs2 servers. The result is a prediction pipeline that’s nearly as accurate as the original, takes 15–30 minutes per protein, and requires nothing more than a Google account.

Check the AlphaFold Database first
Before running ColabFold, go to alphafold.ebi.ac.uk and search your protein’s name or UniProt ID. If a pre-computed model already exists — and for most proteins in UniProt it does — download it directly. You get the same quality prediction in seconds rather than waiting for ColabFold to run.
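
The database check can also be scripted. The sketch below is a minimal example, assuming the public AlphaFold Database prediction endpoint at alphafold.ebi.ac.uk/api/prediction/ and assuming each entry in its JSON response carries a pdbUrl field. The function names are hypothetical helpers, not part of any official client.

```python
import json
import urllib.error
import urllib.request

# Assumed public AFDB endpoint; the accession is appended to it.
AFDB_API = "https://alphafold.ebi.ac.uk/api/prediction/"


def afdb_api_url(accession: str) -> str:
    """Build the prediction-API URL for a UniProt accession."""
    return AFDB_API + accession.strip().upper()


def pick_pdb_url(entries):
    """Return the first PDB download URL in a parsed response, or None.

    Assumes each entry is a dict with a 'pdbUrl' key.
    """
    for entry in entries:
        url = entry.get("pdbUrl")
        if url:
            return url
    return None


def find_existing_model(accession: str, timeout: float = 30.0):
    """Query AFDB and return a PDB URL if a model exists, else None."""
    try:
        with urllib.request.urlopen(afdb_api_url(accession), timeout=timeout) as resp:
            return pick_pdb_url(json.load(resp))
    except urllib.error.HTTPError:
        # Simplification: any HTTP error is treated as "no model found".
        return None
```

Calling find_existing_model("P04637") would return a download URL when a model exists; if it returns None, continue with ColabFold.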

Before you start

You need three things before beginning:

  • A Google account — Colab runs in your Google account’s compute allocation. Free tier works for most proteins; longer sequences or multiple models may benefit from Colab Pro.
  • Your protein sequence in FASTA format — the amino acid sequence starting with a >header line, followed by the sequence. Get this from UniProt, NCBI, or your own sequencing data.
  • A stable internet connection — the notebook connects to external MSA servers and the Colab runtime. Dropped connections mid-run require restarting from the beginning.
What sequence length can ColabFold handle?
ColabFold handles proteins up to around 1,400 residues on a standard Colab T4 GPU. For sequences above 1,000 residues, prediction time increases significantly — expect 45–90 minutes. For very large proteins or complexes, use Colab Pro (A100 GPU) or run ColabFold locally if you have access to institutional GPU resources.
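
The length thresholds above can be turned into a quick pre-check. This is only a sketch of the rule of thumb stated in this guide; the cutoffs are approximate and the function name is hypothetical.

```python
def length_guidance(n_residues: int) -> str:
    """Map a sequence length to the rough guidance given in this guide."""
    if n_residues <= 0:
        raise ValueError("sequence length must be positive")
    if n_residues <= 1000:
        return "standard T4 run"
    if n_residues <= 1400:
        return "T4 feasible; expect 45-90 minutes"
    return "use Colab Pro (A100) or a local install"


print(length_guidance(393))
print(length_guidance(1210))
```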

Step 1 — Open the ColabFold notebook

Open AlphaFold2 in Google Colab

Go to colab.research.google.com. In the search bar, type “ColabFold AlphaFold2” — you’ll find the official notebook published by the Steinegger lab. Alternatively, go directly to the ColabFold GitHub page at github.com/sokrypton/ColabFold and click the “Open in Colab” badge next to the AlphaFold2 notebook.

Once the notebook opens, sign into your Google account if prompted. You’ll see a notebook with several grey code cells — these are the steps that run in sequence. You don’t need to understand the code. You only interact with the form fields at the top of the first cell.

Before running anything, connect to a runtime with a GPU: go to Runtime → Change runtime type → T4 GPU, then click Save. Without a GPU, predictions run on CPU and take many hours. With a GPU, they complete in 15–30 minutes.
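
To confirm the runtime really has a GPU before starting a long run, you can paste a short check into a scratch cell. This sketch assumes the nvidia-smi tool is on the PATH when a GPU runtime is attached; gpu_name is a hypothetical helper.

```python
import shutil
import subprocess


def gpu_name():
    """Return the GPU name reported by nvidia-smi, or None if no GPU is visible."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return None
    lines = result.stdout.strip().splitlines()
    return lines[0].strip() if lines else None


name = gpu_name()
print(name if name else "No GPU detected: change the runtime type first")
```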

Step 2 — Enter your sequence

Paste your protein sequence

In the first code cell, you’ll see a form field labelled query_sequence. Paste your sequence here. You can paste it as plain amino acid sequence (just the letters) or in FASTA format with a header line — ColabFold handles both.

Below the sequence field is a jobname field. Give your job a descriptive name — something like P53_human_TP53 or EGFR_kinase_domain. This name is used to label your output files, so it’s worth being specific. Avoid spaces and special characters; use underscores.
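
If you generate job names from database descriptions, a small helper can enforce the underscore-only convention. This is a sketch; safe_jobname is a hypothetical function and ColabFold may apply its own cleaning as well.

```python
import re


def safe_jobname(name: str) -> str:
    """Replace spaces and special characters with single underscores."""
    cleaned = re.sub(r"[^A-Za-z0-9_]+", "_", name.strip())
    cleaned = re.sub(r"_+", "_", cleaned).strip("_")
    if not cleaned:
        raise ValueError("job name has no usable characters")
    return cleaned


print(safe_jobname("EGFR kinase domain"))
```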

Only include the protein sequence — no DNA, no spaces
The sequence field accepts one-letter amino acid codes only (ACDEFGHIKLMNPQRSTVWY). Remove any signal peptides, tags, or non-standard characters before pasting. Spaces, numbers, and non-amino-acid characters will cause the run to fail. If your sequence comes from a genomic record, make sure you’re using the translated protein sequence, not the DNA sequence.
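
The rules above are easy to check before pasting. The sketch below is for a single chain: it strips a FASTA header and whitespace, rejects characters outside the 20 standard codes, and flags input that looks like DNA. The nucleotide test is a simple heuristic and an assumption of this example, so a very short real peptide made only of A, C, G, T, and N would be rejected too.

```python
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")
NUCLEOTIDE_LETTERS = set("ACGTN")


def clean_sequence(text: str) -> str:
    """Return a bare, uppercase protein sequence or raise ValueError."""
    lines = [line.strip() for line in text.strip().splitlines()]
    body = [line for line in lines if line and not line.startswith(">")]
    seq = "".join("".join(body).split()).upper()
    if not seq:
        raise ValueError("no sequence found")
    bad = sorted(set(seq) - VALID_RESIDUES)
    if bad:
        raise ValueError("invalid characters: " + "".join(bad))
    if set(seq) <= NUCLEOTIDE_LETTERS:
        # Heuristic only: every letter is also a nucleotide code.
        raise ValueError("sequence looks like DNA; paste the translated protein")
    return seq


print(clean_sequence(">example header\nMKT AYI\nAKQR\n"))
```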

Step 3 — Choose your settings

Configure MSA mode and model options

Below the sequence input you’ll find several dropdown menus controlling how the prediction runs. Most can be left at their defaults — but MSA mode is worth understanding:

  • Default: mmseqs2_uniref_env
    Searches UniRef plus environmental sequences. Leave this selected unless you have a specific reason to change it.
  • Alternative: mmseqs2_uniref
    Searches UniRef only. Faster, but misses environmental sequences. Use when you want a quicker result and your protein is well-studied.
  • Use carefully: single_sequence
    No MSA: the model predicts from the sequence alone, as ESMFold does. Very fast but significantly less accurate. Only use if database searches consistently fail.
  • Advanced: custom
    Upload your own MSA file. Only relevant if you’ve pre-computed an MSA using jackhmmer or another tool and want to use it directly.

Other settings worth knowing

  • num_relax — how many of the 5 predicted models to run through AMBER energy relaxation. Set to 1 for the top model. Setting to 5 gives you relaxed versions of all models but takes longer.
  • num_models — ColabFold generates 5 models by default and ranks them by predicted confidence. Keep this at 5 for publication work; reduce to 1 or 2 for quick exploratory predictions.
  • use_ptm — leave checked. This enables the pTM (predicted TM-score) confidence metric used for ranking models.
  • use_dropout — leave unchecked for standard predictions. Only check this if you want to generate diverse ensemble structures by introducing stochastic variation.
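
If you want to record the settings used for each run, the options above can be captured in a small structure. This is a note-keeping sketch only, not the ColabFold API: RunSettings and both presets are hypothetical, and skipping relaxation (num_relax=0) in the exploratory preset is an assumption of this example.

```python
from dataclasses import asdict, dataclass

MSA_MODES = ("mmseqs2_uniref_env", "mmseqs2_uniref", "single_sequence", "custom")


@dataclass(frozen=True)
class RunSettings:
    """Mirror of the notebook form fields discussed in this guide."""
    msa_mode: str = "mmseqs2_uniref_env"
    num_models: int = 5
    num_relax: int = 1
    use_ptm: bool = True
    use_dropout: bool = False

    def __post_init__(self):
        if self.msa_mode not in MSA_MODES:
            raise ValueError(f"unknown msa_mode: {self.msa_mode}")
        if not 1 <= self.num_models <= 5:
            raise ValueError("num_models must be between 1 and 5")
        if not 0 <= self.num_relax <= self.num_models:
            raise ValueError("num_relax must be between 0 and num_models")


PUBLICATION = RunSettings()                          # defaults from this guide
EXPLORATORY = RunSettings(num_models=2, num_relax=0)  # quick look, no relaxation

print(asdict(EXPLORATORY))
```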

Step 4 — Run the prediction

Run all cells and monitor progress

Go to Runtime → Run all (or press Ctrl+F9 / Cmd+F9). The notebook runs its cells sequentially, and you’ll see output appear below each cell as it completes.

The run proceeds through these phases — you’ll see progress messages in the cell output:

  • Installation (1–3 min) — ColabFold installs its dependencies. Only needed on first run or after runtime reset.
  • MSA generation (2–5 min) — your sequence is sent to the MMseqs2 servers and the MSA is built. You’ll see a count of sequences found.
  • Structure prediction (10–25 min) — AlphaFold2 runs through its 5 models. Progress bars appear for each.
  • Relaxation (2–5 min) — the top model(s) are refined with AMBER energy minimization.
  • Output generation — results are compiled and displayed inline in the notebook.
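
Adding up the phase estimates gives a rough wall-clock budget for a first run. The numbers below are simply the ranges listed above; the names are illustrative.

```python
# (min, max) minutes per phase, taken from the list above.
PHASES = {
    "installation": (1, 3),
    "msa_generation": (2, 5),
    "structure_prediction": (10, 25),
    "relaxation": (2, 5),
}


def total_range(phases):
    """Sum the lower and upper bounds across all phases."""
    low = sum(lo for lo, _ in phases.values())
    high = sum(hi for _, hi in phases.values())
    return low, high


low, high = total_range(PHASES)
print(f"Expect roughly {low}-{high} minutes end to end")
```

The upper bound is a worst case in which every phase runs slow at once, so typical runs finish well inside it.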
Don’t close the browser tab while running
Google Colab disconnects if the browser is inactive or the tab is closed. If the runtime disconnects mid-run, you’ll need to restart from the beginning. Keep the tab open and your screen active during prediction. For long runs on large proteins, check your Colab runtime periodically: free-tier sessions are capped at roughly 12 hours and can be disconnected sooner when idle.

When the run completes, the notebook displays a 3D visualization of the top-ranked model directly in the browser, colored by pLDDT confidence. Blue regions are high confidence; yellow and orange are moderate; red is low. This inline view is useful for a quick sanity check but is not publication quality — download the files for proper visualization in PyMOL or VMD.

Step 5 — Download your results

Download the output files

At the end of the notebook, a download cell compresses all outputs into a ZIP file and provides a download link. Click it to save everything to your computer. The ZIP contains:

  • *_relaxed_rank_001*.pdb
    The top-ranked, AMBER-relaxed structure. This is the file to use for docking, MD simulation, or structural analysis. “rank_001” refers to AlphaFold’s ranking by predicted confidence, which is an estimate rather than a measured accuracy.
  • *_unrelaxed_rank_*.pdb
    All 5 predicted models before AMBER relaxation. Useful for comparing model agreement: if all 5 look similar, the prediction is robust.
  • *_scores_rank_001*.json
    Per-residue pLDDT scores and the PAE matrix in JSON format. Essential for assessing confidence, so keep it alongside the PDB file.
  • *_coverage.png
    MSA coverage plot showing how many sequences were found for each position. Poor coverage at a region predicts lower accuracy there.
  • *_pae.png
    Predicted Aligned Error heatmap — essential for multi-domain proteins. Dark blue squares indicate confident relative positioning between regions.
  • *_plddt.png
    Per-residue pLDDT plot — a quick visual overview of which parts of your protein are confidently predicted.
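
Once the ZIP is on disk, the key files can be located programmatically. The sketch below assumes file names that follow the patterns listed above, with the rank written as either rank_1 or rank_001; both helper functions are hypothetical.

```python
import re
import zipfile
from pathlib import Path

# Matches "_rank_1" or "_rank_001" but not "_rank_010" or "_rank_2".
RANK_ONE = re.compile(r"_rank_0*1(?=[_.])")


def extract_results(zip_path, out_dir):
    """Unpack the ColabFold result ZIP and return the output directory."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(out)
    return out


def top_ranked_files(result_dir):
    """Find the rank-1 relaxed PDB, unrelaxed PDB, and scores JSON."""
    found = {"relaxed_pdb": None, "unrelaxed_pdb": None, "scores_json": None}
    for path in sorted(Path(result_dir).rglob("*")):
        name = path.name
        if not RANK_ONE.search(name):
            continue
        if name.endswith(".pdb") and "_unrelaxed_" in name:
            key = "unrelaxed_pdb"
        elif name.endswith(".pdb") and "_relaxed_" in name:
            key = "relaxed_pdb"
        elif name.endswith(".json") and "_scores_" in name:
            key = "scores_json"
        else:
            continue
        if found[key] is None:
            found[key] = path
    return found
```

A value of None for relaxed_pdb usually means relaxation was skipped, in which case the unrelaxed rank-1 model is the file to inspect.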

Step 6 — Understand and visualize your outputs

With your files downloaded, the most important next step is assessing confidence before doing anything else with the structure.

Reading the pLDDT score

Open the _plddt.png file first. This plot shows confidence per residue — x-axis is residue number, y-axis is pLDDT score. Here’s how to read it:

  • 90–100: Very high confidence (blue in PyMOL). Backbone and side chains reliable. Use directly for docking, MD, or structural analysis.
  • 70–90: Good confidence (teal/cyan). Backbone positions reliable. Suitable for most downstream applications.
  • 50–70: Low confidence (yellow). Flexible or disordered region. Backbone uncertain; do not rely on it for binding site analysis.
  • Below 50: Very low confidence (red/orange). Often intrinsically disordered, or a region the model could not place. Position is unreliable; exclude from structural analysis.
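
The same bands can be applied numerically to the scores file. This sketch assumes the scores JSON stores the per-residue values as a list under a "plddt" key; the function names and the 70 cutoff default are illustrative choices based on the bands above.

```python
import json
from collections import Counter

BANDS = ("very high", "good", "low", "very low")


def plddt_band(score: float) -> str:
    """Classify one pLDDT value using the bands in this guide."""
    if score >= 90:
        return "very high"
    if score >= 70:
        return "good"
    if score >= 50:
        return "low"
    return "very low"


def summarize_plddt(scores):
    """Return the fraction of residues falling in each band."""
    if not scores:
        raise ValueError("no pLDDT scores given")
    counts = Counter(plddt_band(s) for s in scores)
    return {band: counts.get(band, 0) / len(scores) for band in BANDS}


def low_confidence_residues(scores, cutoff: float = 70.0):
    """Return 1-based residue numbers with pLDDT below the cutoff."""
    return [i + 1 for i, s in enumerate(scores) if s < cutoff]


def load_plddt(json_path):
    """Read per-residue pLDDT from a scores file (assumes a 'plddt' key)."""
    with open(json_path) as handle:
        return json.load(handle)["plddt"]
```

For example, low_confidence_residues(load_plddt(path)) lists the residues to treat with caution before any binding site analysis.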

Visualizing in PyMOL

Load the top-ranked relaxed PDB into PyMOL and color by the pLDDT values stored in the B-factor column:

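PyMOL command line
# Load the AlphaFold prediction
# (replace the file name with the actual relaxed rank_001 file from your ZIP)
load jobname_relaxed_rank_001.pdb

# Color by pLDDT (stored in B-factor column)
# The palette runs from minimum to maximum, so red comes first:
# red = low confidence (pLDDT 50 or below), blue = high confidence
spectrum b, red_orange_yellow_cyan_blue, minimum=50, maximum=100

# Show as cartoon
show cartoon
hide lines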

Reading the PAE plot

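Open the _pae.png file. This is a square matrix in which each cell (i,j) shows AlphaFold’s expected error, in ångströms, in the relative position of residues i and j. Dark blue means low expected error (confident). Paler colors, shading toward red in ColabFold’s plot, mean higher expected error (uncertain).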

For a single-domain protein, you want to see a uniformly dark blue square — high confidence everywhere. For a multi-domain protein, look for two or more dark blue squares along the diagonal (confident within each domain) with a lighter off-diagonal region (uncertain relative domain orientation). That pattern tells you the individual domains are well-predicted but their arrangement in space is uncertain.
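
The block pattern described above can be quantified by averaging the PAE inside and between domains. This sketch takes the matrix as a list of lists, such as the PAE matrix stored in the scores JSON, and assumes you supply the domain boundaries yourself; the boundaries and function names are hypothetical.

```python
def mean_block(pae, rows, cols):
    """Average the PAE values over one rectangular block of the matrix."""
    values = [pae[i][j] for i in rows for j in cols]
    return sum(values) / len(values)


def domain_pae_summary(pae, domains):
    """Mean PAE for every pair of domains.

    domains maps a name to (start, end), 1-based and inclusive.
    """
    ranges = {name: range(start - 1, end) for name, (start, end) in domains.items()}
    return {
        (a, b): mean_block(pae, ranges[a], ranges[b])
        for a in ranges
        for b in ranges
    }


# Toy 4-residue example: two confident domains, uncertain relative placement.
toy_pae = [
    [2.0, 2.0, 20.0, 20.0],
    [2.0, 2.0, 20.0, 20.0],
    [20.0, 20.0, 2.0, 2.0],
    [20.0, 20.0, 2.0, 2.0],
]
summary = domain_pae_summary(toy_pae, {"N": (1, 2), "C": (3, 4)})
for pair, value in summary.items():
    print(pair, value)
```

Low values on the same-domain pairs combined with high values on the cross-domain pairs is the numerical counterpart of well-predicted domains whose relative orientation is uncertain.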

Common problems and fixes

  • Runtime disconnected mid-run — reconnect and run all cells again from the beginning. The MSA step will re-run from scratch. Consider switching to Colab Pro for longer proteins.
  • Sequence too long — out of memory error — switch to a Colab Pro runtime with an A100 GPU, or trim your sequence to the domain of interest rather than the full protein.
  • MSA search returns very few sequences — this is expected for orphan proteins. The prediction will still run but confidence will be lower. Check the coverage PNG — if most positions have fewer than 10 sequences, consider ESMFold as an alternative.
  • All 5 models look completely different: low model agreement usually means the protein is disordered, the MSA has poor coverage, or the protein has multiple stable conformations. Don’t pick one arbitrarily; investigate the pLDDT and PAE before proceeding.
  • Download link doesn’t appear — scroll to the very last cell and run it manually by clicking the play button on that cell. The download cell sometimes doesn’t auto-execute.

Quick checklist

  • Checked the AlphaFold Database first and no model was already available
  • GPU runtime selected in Colab (T4 or better)
  • Sequence pasted without spaces, numbers, or non-amino-acid characters
  • MSA mode set to mmseqs2_uniref_env (default)
  • Run completed with all 5 models generated
  • ZIP downloaded, containing PDB files and JSON confidence scores
  • pLDDT plot checked: binding site and key regions have pLDDT > 70
  • PAE plot checked (especially important for multi-domain proteins)

ColabFold in one paragraph

ColabFold makes AlphaFold2-quality structure prediction accessible to any researcher with a Google account. Open the notebook, paste your sequence, run all cells, and download the results — the entire active process takes about five minutes. The remaining 15–25 minutes is compute time you spend doing something else. The outputs that matter most are the top-ranked relaxed PDB file and the JSON confidence scores — always check pLDDT and PAE before using any predicted structure for docking, MD simulation, or publication.
