How to Download and Parse AlphaFold Structures with Python and BioPython

Every AlphaFold prediction is one API call away. Python makes it trivial to download hundreds of structures, extract their per-residue confidence scores, and filter out unreliable regions — automating what would otherwise be hours of manual work through the AlphaFold Database web interface.

Understanding pLDDT and where it lives in the PDB file

AlphaFold assigns every residue a per-residue confidence score called pLDDT (predicted Local Distance Difference Test), ranging from 0 to 100. Higher is more confident. The critical fact for parsing: pLDDT values are stored in the B-factor column of the AlphaFold PDB file — the same column that experimental crystal structures use for thermal displacement parameters. BioPython reads this column with atom.get_bfactor().
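To see what that looks like on disk, here is a single ATOM record with made-up coordinates and score (the column layout, however, is the standard fixed-column PDB format): the temperature-factor field occupies columns 61–66, and that is where AlphaFold writes pLDDT.

```python
# One ATOM record with hypothetical values; the column positions follow
# the standard PDB fixed-column format. AlphaFold stores pLDDT where
# experimental structures store the B-factor: columns 61-66.
atom_line = (
    "ATOM      2  CA  MET A   1    "   # record, serial, atom, residue, chain
    " -12.285  10.492   3.197"         # x, y, z coordinates (cols 31-54)
    "  1.00 45.62           C"         # occupancy, pLDDT (cols 61-66), element
)

# Python slicing is 0-indexed, so columns 61-66 are indices 60:66
plddt = float(atom_line[60:66])
print(plddt)   # 45.62
```

This is exactly the value BioPython hands back from atom.get_bfactor(), so you rarely need to slice columns yourself; the snippet just shows where the number lives.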

pLDDT range | Confidence level | What it means for your analysis
> 90   | Very high | Backbone and side chains reliable. Safe to use directly for docking or MD input.
70–90  | Confident | Backbone reliable. Side chains may be less accurate. Suitable for most analyses.
50–70  | Low       | May represent flexible or disordered regions. Treat with caution.
< 50   | Very low  | Likely intrinsically disordered. Do not use for structural analysis or docking.
Cross-pillar connection — PyMOL
The same B-factor column used here for filtering in Python is what the PyMOL tutorial on this site uses for confidence-coloring: spectrum b, blue_cyan_yellow_orange_red, minimum=50, maximum=100. Once you’ve filtered to high-confidence residues in Python and saved a new PDB, you can load that filtered structure directly into PyMOL for visualization.

Downloading a single AlphaFold structure

The AlphaFold Database provides a simple REST API. Given a UniProt accession ID, the PDB file is available at a predictable URL — no authentication or API key required.

https://alphafold.ebi.ac.uk/files/AF-P04637-F1-model_v4.pdb
Base URL — same for every structure
AF-{UniProtID}-F1 — replace with your ID
-model_v4.pdb — latest model version
Python
import requests
import os

def download_alphafold(uniprot_id, outdir=".", version=4):
    """
    Download an AlphaFold structure PDB from the AFDB API.
    Returns the local filepath on success, None on failure.
    """
    url = (
        f"https://alphafold.ebi.ac.uk/files/"
        f"AF-{uniprot_id}-F1-model_v{version}.pdb"
    )
    response = requests.get(url, timeout=30)

    if response.status_code != 200:
        print(f"Not found: {uniprot_id} (HTTP {response.status_code})")
        return None

    os.makedirs(outdir, exist_ok=True)
    filepath = os.path.join(outdir, f"AF-{uniprot_id}.pdb")
    with open(filepath, "w") as f:
        f.write(response.text)

    print(f"Downloaded: {filepath}")
    return filepath

# Download p53 (human TP53, UniProt P04637)
filepath = download_alphafold("P04637", outdir="./af_structures")
# Downloaded: ./af_structures/AF-P04637.pdb
Finding the UniProt ID for your protein
Go to uniprot.org, search for your protein by name or gene, and copy the accession ID from the URL or the entry header — it looks like P04637 for human p53 or P00533 for EGFR. The AFDB covers most reviewed human proteins and a large fraction of the proteomes of major model organisms. If your protein isn’t in the AFDB, the API returns a 404 — check with response.status_code before trying to parse the result.
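If you only want to know whether an accession is in the AFDB, you can check with a lightweight HEAD request against the same URL instead of downloading the whole file. This is a generic sketch, not an official AFDB client; the helper names afdb_url and in_afdb are invented here:

```python
import requests

def afdb_url(uniprot_id, version=4):
    """Build the per-structure AFDB PDB URL for a UniProt accession."""
    return (f"https://alphafold.ebi.ac.uk/files/"
            f"AF-{uniprot_id}-F1-model_v{version}.pdb")

def in_afdb(uniprot_id):
    """Return True if the AFDB serves a model for this accession.
    HEAD transfers only headers, so no file body is downloaded."""
    r = requests.head(afdb_url(uniprot_id), timeout=10, allow_redirects=True)
    return r.status_code == 200
```

A pre-flight check like this is handy before a long batch run: you can drop missing accessions from the target list up front rather than handling 404s mid-loop.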

Batch downloading multiple structures

Python
import time

# List of UniProt IDs to download
targets = [
    "P04637",   # TP53 — human p53
    "P00533",   # EGFR — epidermal growth factor receptor
    "P35222",   # CTNNB1 — beta-catenin
    "Q9Y6Q9",   # TP53BP2 — ASPP2
]

downloaded = {}
for uid in targets:
    path = download_alphafold(uid, outdir="./af_structures")
    if path:
        downloaded[uid] = path
    time.sleep(0.5)    # be polite to the API — 500 ms between requests

print(f"\nSuccessfully downloaded {len(downloaded)}/{len(targets)} structures")
Rate limiting — add a sleep between requests
The AFDB API is free and open but not designed for bulk scripted access. Adding a time.sleep(0.5) between requests is courteous and prevents your IP from being temporarily blocked. For very large downloads (thousands of structures), consider using the AFDB bulk download service at alphafold.ebi.ac.uk/download, which provides pre-packaged tar archives by proteome, rather than calling the per-structure API endpoint repeatedly.
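Long batch runs also hit transient network errors. A retry wrapper with exponential backoff is a generic pattern (not an AFDB requirement; the function names below are invented for illustration) that keeps one flaky request from killing the whole batch while still giving up immediately on a permanent 404:

```python
import time
import requests

def backoff_delays(max_tries=3, base_delay=1.0):
    """Exponential backoff schedule: base, 2x base, 4x base, ..."""
    return [base_delay * 2 ** i for i in range(max_tries)]

def download_with_retry(url, max_tries=3, base_delay=1.0):
    """GET a URL, retrying transient failures with exponential backoff.
    Returns the response text, or None on a 404 or exhausted retries."""
    for delay in backoff_delays(max_tries, base_delay):
        try:
            r = requests.get(url, timeout=30)
            if r.status_code == 200:
                return r.text
            if r.status_code == 404:      # permanent: not in the AFDB
                return None
        except requests.RequestException:
            pass                          # connection error or timeout: retry
        time.sleep(delay)
    return None
```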

Parsing the structure and extracting pLDDT

Once downloaded, parse the PDB normally with BioPython’s PDBParser and read pLDDT from the B-factor column of each alpha carbon atom:

Python
from Bio.PDB import PDBParser
import numpy as np

parser = PDBParser(QUIET=True)
structure = parser.get_structure("p53", "./af_structures/AF-P04637.pdb")
model = structure[0]
chain = model["A"]    # AFDB entries are single-chain monomer predictions, always chain "A"

# Extract per-residue pLDDT from B-factor column
plddt_data = []
for residue in chain.get_residues():
    if residue.id[0] != " ":    # skip HETATM records
        continue
    try:
        ca = residue["CA"]
        plddt_data.append({
            "resi":   residue.id[1],
            "resn":   residue.get_resname(),
            "plddt":  ca.get_bfactor(),
        })
    except KeyError:
        pass   # residue missing CA — rare but possible

# Summary statistics
plddts = np.array([d["plddt"] for d in plddt_data])
print(f"Residues:       {len(plddts)}")
print(f"Mean pLDDT:     {plddts.mean():.1f}")
print(f"Median pLDDT:   {np.median(plddts):.1f}")
print(f"Below 70 (low): {(plddts < 70).sum()} residues")
print(f"Above 90 (VH):  {(plddts > 90).sum()} residues")

# Residues:       393
# Mean pLDDT:     71.4
# Median pLDDT:   76.2
# Below 70 (low): 128 residues
# Above 90 (VH):  89 residues
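Summary statistics tell you how much of the model is uncertain, but not where. A small helper (the name low_confidence_segments is invented here) groups consecutive low-confidence residues into segments, which is typically how disordered regions show up; the demo scores below are made up, not real p53 values:

```python
def low_confidence_segments(plddt_data, threshold=70.0):
    """Group consecutive residues with pLDDT below `threshold` into
    (start, end) residue-number pairs. Expects dicts with "resi" and
    "plddt" keys, ordered by residue number (as built in the loop above)."""
    segments, start, prev = [], None, None
    for d in plddt_data:
        if d["plddt"] < threshold:
            if start is None:          # opening a new low-confidence run
                start = d["resi"]
            prev = d["resi"]
        elif start is not None:        # run just ended: record it
            segments.append((start, prev))
            start = None
    if start is not None:              # run extends to the last residue
        segments.append((start, prev))
    return segments

# Toy example with made-up scores for residues 1-8:
demo = [{"resi": i, "plddt": p}
        for i, p in enumerate([95, 92, 60, 55, 88, 40, 45, 91], start=1)]
print(low_confidence_segments(demo))   # [(3, 4), (6, 7)]
```

Run on the real plddt_data list from the previous block, this gives you the residue ranges to inspect in PyMOL or to exclude from a docking box.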

Filtering residues by confidence threshold

For docking and MD simulations, a common practice is to remove residues with pLDDT below 70 before using the structure. Disordered tails and loops (pLDDT < 50) can distort docking box calculations, and low-confidence regions can start an MD simulation from physically unrealistic conformations:

Python
import pandas as pd

# Load pLDDT data into a DataFrame for easy filtering
df = pd.DataFrame(plddt_data)

# Inspect the confidence distribution
print(df["plddt"].describe().round(1))

# Classify each residue
def classify_plddt(score):
    if score >= 90: return "very_high"
    if score >= 70: return "confident"
    if score >= 50: return "low"
    return "very_low"

df["confidence"] = df["plddt"].apply(classify_plddt)
print(df["confidence"].value_counts())

# Get residue numbers with pLDDT above threshold
threshold = 70
high_conf_resis = df[df["plddt"] >= threshold]["resi"].tolist()
print(f"\nResidues above pLDDT {threshold}: {len(high_conf_resis)}")

# Check binding site residues specifically
binding_site = [175, 248, 249, 273, 282]   # p53 DNA-binding domain key residues
site_df = df[df["resi"].isin(binding_site)][["resi", "resn", "plddt", "confidence"]]
print("\nBinding site pLDDT:")
print(site_df.to_string(index=False))
print(f"Mean binding site pLDDT: {site_df['plddt'].mean():.1f}")

Saving the filtered structure to a new PDB file

BioPython’s PDBIO with a custom Select subclass lets you write only the residues that pass the confidence filter — producing a clean, high-confidence PDB ready for docking or MD input:

Python
from Bio.PDB import PDBIO, Select

class HighConfidenceSelect(Select):
    """Keep only residues with pLDDT at or above the threshold."""
    def __init__(self, min_plddt=70.0):
        self.min_plddt = min_plddt

    def accept_residue(self, residue):
        if residue.id[0] != " ":    # always keep HETATM
            return True
        try:
            plddt = residue["CA"].get_bfactor()
            return plddt >= self.min_plddt
        except KeyError:
            return False

# Save a structure with only pLDDT ≥ 70 residues
io = PDBIO()
io.set_structure(structure)
io.save("AF-P04637_confident.pdb", HighConfidenceSelect(min_plddt=70))

# Save a very high confidence only version (≥ 90) for the core domain
io.save("AF-P04637_very_high.pdb", HighConfidenceSelect(min_plddt=90))

print("Saved filtered structures")

Complete pipeline — download to filtered PDB

Combining all steps into a single reusable function that takes a UniProt ID and returns a filtered PDB file, plus a DataFrame of pLDDT values for reporting:

Python — complete pipeline
import requests, os, time
import numpy as np
import pandas as pd
from Bio.PDB import PDBParser, PDBIO, Select

class HighConfidenceSelect(Select):
    def __init__(self, min_plddt=70.0):
        self.min_plddt = min_plddt
    def accept_residue(self, residue):
        if residue.id[0] != " ": return True
        try: return residue["CA"].get_bfactor() >= self.min_plddt
        except KeyError: return False

def process_alphafold(uniprot_id, outdir=".", min_plddt=70.0):
    """
    Download, parse, filter, and save an AlphaFold structure.
    Returns (filtered_pdb_path, plddt_dataframe).
    """
    # 1. Download
    url = f"https://alphafold.ebi.ac.uk/files/AF-{uniprot_id}-F1-model_v4.pdb"
    r = requests.get(url, timeout=30)
    if r.status_code != 200:
        raise ValueError(f"Could not download {uniprot_id}: HTTP {r.status_code}")

    raw_path = os.path.join(outdir, f"AF-{uniprot_id}_raw.pdb")
    os.makedirs(outdir, exist_ok=True)
    with open(raw_path, "w") as f: f.write(r.text)

    # 2. Parse and extract pLDDT
    parser = PDBParser(QUIET=True)
    structure = parser.get_structure(uniprot_id, raw_path)
    rows = []
    for res in structure[0]["A"].get_residues():
        if res.id[0] != " ": continue
        try: rows.append({"resi": res.id[1], "resn": res.get_resname(),
                          "plddt": res["CA"].get_bfactor()})
        except KeyError: pass
    df = pd.DataFrame(rows)

    # 3. Save filtered structure
    filtered_path = os.path.join(outdir, f"AF-{uniprot_id}_plddt{min_plddt:.0f}.pdb")
    io = PDBIO()
    io.set_structure(structure)
    io.save(filtered_path, HighConfidenceSelect(min_plddt))

    n_kept = (df["plddt"] >= min_plddt).sum()
    print(f"{uniprot_id}: {len(df)} total residues, {n_kept} kept (pLDDT≥{min_plddt}), "
          f"mean={df['plddt'].mean():.1f}")
    return filtered_path, df

# Run the full pipeline for several proteins
proteins = ["P04637", "P00533", "P35222"]
all_results = {}
for uid in proteins:
    path, df = process_alphafold(uid, outdir="./af_filtered", min_plddt=70)
    all_results[uid] = {"path": path, "df": df}
    time.sleep(0.5)

# Export a summary CSV across all proteins
summary = pd.DataFrame([{
    "uniprot_id": uid,
    "total_residues": len(v["df"]),
    "mean_plddt": v["df"]["plddt"].mean().round(1),
    "pct_above_70": (v["df"]["plddt"] >= 70).mean().round(3) * 100,
    "filtered_pdb": v["path"],
} for uid, v in all_results.items()])
summary.to_csv("alphafold_summary.csv", index=False)
print(summary.to_string(index=False))
What to report in your methods section
When using an AlphaFold structure for docking or MD in a publication, include: the UniProt accession ID, the model version downloaded (v4), the mean pLDDT of the binding site residues, and the confidence threshold used for filtering. Example: “The AlphaFold2 model of p53 (UniProt P04637, model v4) was used for docking. Binding site residues (R175, R248, R249, R273, R282) had a mean pLDDT of 87.4. Residues with pLDDT below 70 were removed prior to grid box definition.”

AlphaFold parsing in one paragraph

Download AlphaFold structures from https://alphafold.ebi.ac.uk/files/AF-{UniProtID}-F1-model_v4.pdb with Python’s requests library — no API key needed. Parse with PDBParser(QUIET=True) exactly like any other PDB file. Extract per-residue pLDDT from the B-factor column using residue["CA"].get_bfactor(). Filter to confident residues (pLDDT ≥ 70) using a custom Select subclass in PDBIO, and save the filtered structure as a new PDB. For batch work, wrap the whole pipeline in a function, iterate over UniProt IDs with a 500 ms sleep between requests, and export a summary DataFrame. The filtered PDB files are then ready for docking with AutoDock Vina or MD preparation with GROMACS.
