How to Calculate RMSD and Align Protein Structures with Python and BioPython
RMSD is the standard metric for quantifying how similar two protein structures are. BioPython’s Superimposer class calculates it in three lines — but the lines before those three, where you select the right atoms to compare, are what make the result meaningful or misleading.
What RMSD measures
RMSD (root mean square deviation) measures the average distance between equivalent atom pairs in two structures after optimal superposition. A value of 0 means the structures are identical. Higher values mean larger average displacements between equivalent positions.
What constitutes a “good” RMSD depends entirely on what you’re comparing. An RMSD of 1.2 Å between an AlphaFold prediction and the experimental structure is excellent. The same value between two conformational states of the same protein after a large domain movement may be suspiciously low, suggesting the alignment focused only on the stable core.
The Superimposer class
BioPython’s Superimposer takes two lists of Atom objects — the reference and mobile atoms — finds the rotation and translation that minimizes RMSD between them, and then either reports the RMSD or applies the transformation to move the mobile structure. The two lists must be the same length and in the same order: atom i in list A is paired with atom i in list B.
from Bio.PDB import Superimposer
sup = Superimposer()
# Set reference atoms and mobile atoms (same length, same order)
sup.set_atoms(reference_atoms, mobile_atoms)
# After set_atoms(), these are populated:
print(f"RMSD: {sup.rms:.3f} Å") # the RMSD value
print(sup.rotran) # (rotation_matrix, translation_vector)
# Apply the transformation to move the mobile structure
sup.apply(mobile_structure.get_atoms()) # modifies coordinates in-place
set_atoms() calculates the optimal superposition and RMSD but does not move the mobile structure. apply() then applies that transformation to any atom list you pass — typically all atoms of the mobile structure, not just the ones used for alignment. This separation lets you align on a subset (e.g., backbone only) but transform the entire structure including side chains and ligands.
Selecting equivalent atoms correctly
The quality of your RMSD calculation depends almost entirely on which atoms you pair. The two most common choices are Cα atoms (one per residue, robust) and backbone atoms (N, Cα, C, O — more atoms, more precise). The key requirement is that every atom in list A has a genuinely equivalent atom in list B at the same position.
Basic RMSD calculation — two structures
from Bio.PDB import PDBParser, Superimposer
def load_structure(path, name):
parser = PDBParser(QUIET=True)
return parser.get_structure(name, path)
reference = load_structure("experimental.pdb", "ref")
mobile = load_structure("alphafold.pdb", "mob")
# Extract Cα atoms from chain A of each structure
def get_ca_atoms(structure, chain_id="A"):
chain = structure[0][chain_id]
return [
res["CA"]
for res in chain.get_residues()
if res.id[0] == " " and "CA" in res
]
ref_atoms = get_ca_atoms(reference)
mob_atoms = get_ca_atoms(mobile)
# Structures must have the same number of residues to pair
if len(ref_atoms) != len(mob_atoms):
print(f"Length mismatch: {len(ref_atoms)} vs {len(mob_atoms)} Cα atoms")
# Trim to the shorter sequence
n = min(len(ref_atoms), len(mob_atoms))
ref_atoms, mob_atoms = ref_atoms[:n], mob_atoms[:n]
# Calculate RMSD and superimpose
sup = Superimposer()
sup.set_atoms(ref_atoms, mob_atoms)
print(f"Cα RMSD: {sup.rms:.3f} Å over {len(ref_atoms)} residues")
# Cα RMSD: 0.847 Å over 283 residues
Superimposer requires exactly the same number of atoms in both lists. If your two structures have different numbers of residues — a common situation when comparing an AlphaFold model to a crystal structure with missing loops — you must explicitly pair only equivalent residues before calling set_atoms(). The simplest approach is pairing by residue number, shown below.
Pairing by residue number when lengths differ
def get_paired_ca_atoms(struct1, struct2, chain_id="A"):
"""Return paired Cα atoms for residues present in both structures."""
chain1 = struct1[0][chain_id]
chain2 = struct2[0][chain_id]
# Build dicts: residue_number → CA atom
ca1 = {res.id[1]: res["CA"] for res in chain1.get_residues()
if res.id[0] == " " and "CA" in res}
ca2 = {res.id[1]: res["CA"] for res in chain2.get_residues()
if res.id[0] == " " and "CA" in res}
# Keep only residues present in both
common = sorted(set(ca1.keys()) & set(ca2.keys()))
return [ca1[r] for r in common], [ca2[r] for r in common]
ref_atoms, mob_atoms = get_paired_ca_atoms(reference, mobile)
sup = Superimposer()
sup.set_atoms(ref_atoms, mob_atoms)
print(f"RMSD: {sup.rms:.3f} Å over {len(ref_atoms)} paired Cα atoms")
Domain-specific alignment
When comparing multi-domain proteins where one domain moves relative to another — kinases switching between open and closed conformations, for example — aligning on the whole protein produces an RMSD that obscures the interesting conformational change. Align on the stable domain instead:
# Define the stable domain by residue number range
stable_domain = range(50, 200) # residues 50–199 are the stable core
def get_domain_ca(structure, residue_range, chain_id="A"):
chain = structure[0][chain_id]
return [
res["CA"]
for res in chain.get_residues()
if res.id[0] == " "
and res.id[1] in residue_range
and "CA" in res
]
# Align on the stable domain
ref_core = get_domain_ca(reference, stable_domain)
mob_core = get_domain_ca(mobile, stable_domain)
sup = Superimposer()
sup.set_atoms(ref_core, mob_core)
print(f"Core RMSD (residues 50-199): {sup.rms:.3f} Å")
# Apply transformation to ALL atoms of the mobile structure
sup.apply(mobile.get_atoms())
# Now calculate RMSD for the mobile domain WITHOUT realigning
from Bio.PDB import Superimposer
import numpy as np
mobile_domain = range(200, 350)
ref_mob = get_domain_ca(reference, mobile_domain)
mob_mob = get_domain_ca(mobile, mobile_domain)
# Calculate RMSD manually (without re-aligning) to see how much the domain moved
coords_ref = np.array([a.get_coord() for a in ref_mob])
coords_mob = np.array([a.get_coord() for a in mob_mob])
domain_rmsd = np.sqrt(np.mean(np.sum((coords_ref - coords_mob)**2, axis=1)))
print(f"Mobile domain RMSD (residues 200-349): {domain_rmsd:.3f} Å")
Comparing multiple models
import pandas as pd
import glob
# Load the reference structure once
reference = load_structure("experimental.pdb", "ref")
ref_atoms, _ = get_paired_ca_atoms(reference, reference) # self-pair for reference
results = []
for pdb_path in sorted(glob.glob("./models/*.pdb")):
model_name = pdb_path.split("/")[-1].replace(".pdb", "")
mobile = load_structure(pdb_path, model_name)
ref_paired, mob_paired = get_paired_ca_atoms(reference, mobile)
if len(ref_paired) < 10:
continue # skip if too few paired residues
sup = Superimposer()
sup.set_atoms(ref_paired, mob_paired)
results.append({
"model": model_name,
"paired_residues": len(ref_paired),
"rmsd_a": round(sup.rms, 3),
})
df = pd.DataFrame(results).sort_values("rmsd_a")
print(df.to_string(index=False))
df.to_csv("rmsd_comparison.csv", index=False)
Saving aligned structures
from Bio.PDB import PDBIO
# Superimpose (apply moves the mobile structure's coordinates)
sup = Superimposer()
sup.set_atoms(ref_atoms, mob_atoms)
sup.apply(mobile.get_atoms()) # modifies mobile coordinates in-place
# Save the aligned mobile structure
io = PDBIO()
io.set_structure(mobile)
io.save("mobile_aligned.pdb")
print(f"Saved aligned structure — RMSD was {sup.rms:.3f} Å")
# Save both structures to the same PDB file as separate models
# (useful for loading the overlay directly into PyMOL)
with open("overlay.pdb", "w") as f:
io.set_structure(reference)
io.save(f)
io.set_structure(mobile)
io.save(f)
load experimental.pdb and load mobile_aligned.pdb — they’ll already be in the same coordinate frame, so no alignment step in PyMOL is needed. Color them differently with color slate, experimental and color salmon, mobile_aligned for an immediate publication-quality overlay figure.
Quick reference
| Operation | Code |
|---|---|
| Create Superimposer | sup = Superimposer() |
| Calculate RMSD (no move) | sup.set_atoms(ref, mob) → sup.rms |
| Apply superposition | sup.apply(mobile.get_atoms()) |
| Get rotation matrix | rot, tran = sup.rotran |
| Cα atoms from chain | [r["CA"] for r in chain.get_residues() if r.id[0]==" " and "CA" in r] |
| Pair by residue number | Build {res_num: atom} dicts, find common keys with set & set |
| Manual RMSD (no alignment) | np.sqrt(np.mean(np.sum((coords1 - coords2)**2, axis=1))) |
| Save aligned structure | PDBIO().set_structure(mobile); io.save("aligned.pdb") |
BioPython RMSD in one paragraph
Create a Superimposer(), call set_atoms(reference_atoms, mobile_atoms) with two equal-length lists of paired Atom objects, then read sup.rms for the RMSD value. Call sup.apply(mobile.get_atoms()) to physically move the mobile structure. The critical step before all of this is atom pairing — build dictionaries keyed by residue number and find the intersection to handle structures with different numbers of residues. For multi-domain proteins, align on the stable domain and then measure RMSD of the mobile domain without re-aligning to quantify the conformational change. Save aligned structures with PDBIO for direct loading into PyMOL — no re-alignment needed because the coordinates are already superimposed.