How to Calculate RMSD and Align Protein Structures with Python and BioPython

11 min read Intermediate — basic BioPython assumed

RMSD is the standard metric for quantifying how similar two protein structures are. BioPython’s Superimposer class calculates it in three lines — but the lines before those three, where you select the right atoms to compare, are what make the result meaningful or misleading.

What RMSD measures

RMSD (root mean square deviation) measures the average distance between equivalent atom pairs in two structures after optimal superposition. A value of 0 means the structures are identical. Higher values mean larger average displacements between equivalent positions.

< 1 Å

Excellent agreement — same protein in same or very similar conformation. Typical for high-quality AlphaFold predictions vs crystal structures.

1–2 Å

Good. Close homologs or same protein with minor conformational differences (e.g., apo vs ligand-bound).

2–4 Å

Moderate. Significant conformational change or moderate sequence divergence. Conserved core still present.

> 4 Å

Large. Very different conformations, distant homologs, or wrong atom pairing. Investigate before reporting.

What constitutes a “good” RMSD depends entirely on what you’re comparing. An RMSD of 1.2 Å between an AlphaFold prediction and the experimental structure is excellent. The same value between two conformational states of the same protein after a large domain movement may be suspiciously low, suggesting the alignment focused only on the stable core.

The Superimposer class

BioPython’s Superimposer takes two lists of Atom objects — the reference and mobile atoms — finds the rotation and translation that minimizes RMSD between them, and then either reports the RMSD or applies the transformation to move the mobile structure. The two lists must be the same length and in the same order: atom i in list A is paired with atom i in list B.

Python — core Superimposer usage

from Bio.PDB import Superimposer

sup = Superimposer()

# Set reference atoms and mobile atoms (same length, same order)
sup.set_atoms(reference_atoms, mobile_atoms)

# After set_atoms(), these are populated:
print(f"RMSD: {sup.rms:.3f} Å")          # the RMSD value
print(sup.rotran)                          # (rotation_matrix, translation_vector)

# Apply the transformation to move the mobile structure
sup.apply(mobile_structure.get_atoms())    # modifies coordinates in-place

set_atoms vs apply — two separate steps

set_atoms() calculates the optimal superposition and RMSD but does not move the mobile structure. apply() then applies that transformation to any atom list you pass — typically all atoms of the mobile structure, not just the ones used for alignment. This separation lets you align on a subset (e.g., backbone only) but transform the entire structure including side chains and ligands.

Selecting equivalent atoms correctly

The quality of your RMSD calculation depends almost entirely on which atoms you pair. The two most common choices are Cα atoms (one per residue, robust) and backbone atoms (N, Cα, C, O — more atoms, more precise). The key requirement is that every atom in list A has a genuinely equivalent atom in list B at the same position.

Cα atoms only

name == “CA”

One atom per residue. Most common choice for overall structural comparison. Robust to missing side chain atoms.

Use for: global fold comparison, homolog comparison

Backbone (N, Cα, C, O)

name in {N, CA, C, O}

Four atoms per residue. More precise than Cα only. Standard for comparing protein conformational states.

Use for: apo vs bound, conformational change

All heavy atoms

element != “H”

Includes side chains. More sensitive to local differences but requires identical residue completeness in both structures.

Use for: comparing crystal structures of same protein

Specific region

resi in [100..200]

Align on a defined domain or secondary structure element. Best when one region moves relative to another.

Use for: domain movement, binding site comparison

Basic RMSD calculation — two structures

Python

from Bio.PDB import PDBParser, Superimposer

def load_structure(path, name):
    parser = PDBParser(QUIET=True)
    return parser.get_structure(name, path)

reference = load_structure("experimental.pdb", "ref")
mobile    = load_structure("alphafold.pdb",    "mob")

# Extract Cα atoms from chain A of each structure
def get_ca_atoms(structure, chain_id="A"):
    chain = structure[0][chain_id]
    return [
        res["CA"]
        for res in chain.get_residues()
        if res.id[0] == " " and "CA" in res
    ]

ref_atoms = get_ca_atoms(reference)
mob_atoms = get_ca_atoms(mobile)

# Structures must have the same number of residues to pair
if len(ref_atoms) != len(mob_atoms):
    print(f"Length mismatch: {len(ref_atoms)} vs {len(mob_atoms)} Cα atoms")
    # Trim to the shorter sequence
    n = min(len(ref_atoms), len(mob_atoms))
    ref_atoms, mob_atoms = ref_atoms[:n], mob_atoms[:n]

# Calculate RMSD and superimpose
sup = Superimposer()
sup.set_atoms(ref_atoms, mob_atoms)

print(f"Cα RMSD: {sup.rms:.3f} Å over {len(ref_atoms)} residues")
# Cα RMSD: 0.847 Å over 283 residues

Length mismatch is the most common error

Superimposer requires exactly the same number of atoms in both lists. If your two structures have different numbers of residues — a common situation when comparing an AlphaFold model to a crystal structure with missing loops — you must explicitly pair only equivalent residues before calling set_atoms(). The simplest approach is pairing by residue number, shown below.

Pairing by residue number when lengths differ

Python

def get_paired_ca_atoms(struct1, struct2, chain_id="A"):
    """Return paired Cα atoms for residues present in both structures."""
    chain1 = struct1[0][chain_id]
    chain2 = struct2[0][chain_id]

    # Build dicts: residue_number → CA atom
    ca1 = {res.id[1]: res["CA"] for res in chain1.get_residues()
           if res.id[0] == " " and "CA" in res}
    ca2 = {res.id[1]: res["CA"] for res in chain2.get_residues()
           if res.id[0] == " " and "CA" in res}

    # Keep only residues present in both
    common = sorted(set(ca1.keys()) & set(ca2.keys()))
    return [ca1[r] for r in common], [ca2[r] for r in common]

ref_atoms, mob_atoms = get_paired_ca_atoms(reference, mobile)
sup = Superimposer()
sup.set_atoms(ref_atoms, mob_atoms)
print(f"RMSD: {sup.rms:.3f} Å over {len(ref_atoms)} paired Cα atoms")

Domain-specific alignment

When comparing multi-domain proteins where one domain moves relative to another — kinases switching between open and closed conformations, for example — aligning on the whole protein produces an RMSD that obscures the interesting conformational change. Align on the stable domain instead:

Python

# Define the stable domain by residue number range
stable_domain = range(50, 200)   # residues 50–199 are the stable core

def get_domain_ca(structure, residue_range, chain_id="A"):
    chain = structure[0][chain_id]
    return [
        res["CA"]
        for res in chain.get_residues()
        if res.id[0] == " "
        and res.id[1] in residue_range
        and "CA" in res
    ]

# Align on the stable domain
ref_core = get_domain_ca(reference, stable_domain)
mob_core = get_domain_ca(mobile,    stable_domain)

sup = Superimposer()
sup.set_atoms(ref_core, mob_core)
print(f"Core RMSD (residues 50-199): {sup.rms:.3f} Å")

# Apply transformation to ALL atoms of the mobile structure
sup.apply(mobile.get_atoms())

# Now calculate RMSD for the mobile domain WITHOUT realigning
from Bio.PDB import Superimposer
import numpy as np

mobile_domain = range(200, 350)
ref_mob = get_domain_ca(reference, mobile_domain)
mob_mob = get_domain_ca(mobile,    mobile_domain)

# Calculate RMSD manually (without re-aligning) to see how much the domain moved
coords_ref = np.array([a.get_coord() for a in ref_mob])
coords_mob = np.array([a.get_coord() for a in mob_mob])
domain_rmsd = np.sqrt(np.mean(np.sum((coords_ref - coords_mob)**2, axis=1)))
print(f"Mobile domain RMSD (residues 200-349): {domain_rmsd:.3f} Å")

Comparing multiple models

Python

import pandas as pd
import glob

# Load the reference structure once
reference = load_structure("experimental.pdb", "ref")
ref_atoms, _ = get_paired_ca_atoms(reference, reference)  # self-pair for reference

results = []
for pdb_path in sorted(glob.glob("./models/*.pdb")):
    model_name = pdb_path.split("/")[-1].replace(".pdb", "")
    mobile = load_structure(pdb_path, model_name)

    ref_paired, mob_paired = get_paired_ca_atoms(reference, mobile)
    if len(ref_paired) < 10:
        continue    # skip if too few paired residues

    sup = Superimposer()
    sup.set_atoms(ref_paired, mob_paired)

    results.append({
        "model":          model_name,
        "paired_residues": len(ref_paired),
        "rmsd_a":         round(sup.rms, 3),
    })

df = pd.DataFrame(results).sort_values("rmsd_a")
print(df.to_string(index=False))
df.to_csv("rmsd_comparison.csv", index=False)

Saving aligned structures

Python

from Bio.PDB import PDBIO

# Superimpose (apply moves the mobile structure's coordinates)
sup = Superimposer()
sup.set_atoms(ref_atoms, mob_atoms)
sup.apply(mobile.get_atoms())   # modifies mobile coordinates in-place

# Save the aligned mobile structure
io = PDBIO()
io.set_structure(mobile)
io.save("mobile_aligned.pdb")
print(f"Saved aligned structure — RMSD was {sup.rms:.3f} Å")

# Save both structures to the same PDB file as separate models
# (useful for loading the overlay directly into PyMOL)
with open("overlay.pdb", "w") as f:
    io.set_structure(reference)
    io.save(f)
    io.set_structure(mobile)
    io.save(f)

Load the overlay directly into PyMOL

After saving, load the aligned structures into PyMOL with load experimental.pdb and load mobile_aligned.pdb — they’ll already be in the same coordinate frame, so no alignment step in PyMOL is needed. Color them differently with color slate, experimental and color salmon, mobile_aligned for an immediate publication-quality overlay figure.

Quick reference

Operation	Code
Create Superimposer	`sup = Superimposer()`
Calculate RMSD (no move)	`sup.set_atoms(ref, mob)` → `sup.rms`
Apply superposition	`sup.apply(mobile.get_atoms())`
Get rotation matrix	`rot, tran = sup.rotran`
Cα atoms from chain	`[r["CA"] for r in chain.get_residues() if r.id[0]==" " and "CA" in r]`
Pair by residue number	Build `{res_num: atom}` dicts, find common keys with `set & set`
Manual RMSD (no alignment)	`np.sqrt(np.mean(np.sum((coords1 - coords2)**2, axis=1)))`
Save aligned structure	`PDBIO().set_structure(mobile); io.save("aligned.pdb")`

BioPython RMSD in one paragraph

Create a Superimposer(), call set_atoms(reference_atoms, mobile_atoms) with two equal-length lists of paired Atom objects, then read sup.rms for the RMSD value. Call sup.apply(mobile.get_atoms()) to physically move the mobile structure. The critical step before all of this is atom pairing — build dictionaries keyed by residue number and find the intersection to handle structures with different numbers of residues. For multi-domain proteins, align on the stable domain and then measure RMSD of the mobile domain without re-aligning to quantify the conformational change. Save aligned structures with PDBIO for direct loading into PyMOL — no re-alignment needed because the coordinates are already superimposed.

How to Calculate RMSD and Align Protein Structures with Python and BioPython

What RMSD measures

The Superimposer class

Selecting equivalent atoms correctly

Basic RMSD calculation — two structures

Pairing by residue number when lengths differ

Domain-specific alignment

Comparing multiple models

Saving aligned structures

Quick reference

BioPython RMSD in one paragraph

How to Analyze Protein-Ligand Interactions over an MD Trajectory with MDAnalysis

MDAnalysis Tutorial for Beginners: Loading and Working with MD Trajectories in Python

How to Calculate RMSD and RMSF from MD Simulations with MDAnalysis

Python for Structural Biology: The Complete Guide to BioPython and MDAnalysis

How to Find and Analyze Protein-Protein Interfaces with BioPython

Python for Structural Biology Cheat Sheet: BioPython and MDAnalysis Essential Commands

Leave a Reply Cancel reply

What RMSD measures

The Superimposer class

Selecting equivalent atoms correctly

Basic RMSD calculation — two structures

Pairing by residue number when lengths differ

Domain-specific alignment

Comparing multiple models

Saving aligned structures

Quick reference

BioPython RMSD in one paragraph

Similar Posts

Leave a Reply Cancel reply