Common BioPython and MDAnalysis Mistakes and How to Fix Them
Most BioPython and MDAnalysis errors are silent — the code runs without crashing but produces wrong residue counts, inflated RMSD values, or empty selections. These are the eight mistakes that appear most often in help threads, and the fixes that prevent all of them.
The AlphaFold Database and the RCSB PDB now distribute many structures in mmCIF format (.cif). Passing a CIF file to PDBParser produces a cryptic error or parses exactly zero atoms without any useful warning. The reverse — passing a .pdb file to MMCIFParser — behaves the same way.
# ✗ Wrong — CIF file passed to PDB parser
parser = PDBParser(QUIET=True)
structure = parser.get_structure("p", "AF-P04637.cif") # fails silently
# ✓ Right — match parser to file format
from Bio.PDB import MMCIFParser
parser = MMCIFParser(QUIET=True)
structure = parser.get_structure("p", "AF-P04637.cif")
# Auto-detect by extension
def load_structure(path, name="s"):
from Bio.PDB import PDBParser, MMCIFParser
parser = MMCIFParser(QUIET=True) if path.endswith((".cif", ".mmcif")) \
else PDBParser(QUIET=True)
return parser.get_structure(name, path)
PDBParser for .pdb and .ent files; use MMCIFParser for .cif and .mmcif files. When downloading from the RCSB or AFDB programmatically, check the file extension before choosing the parser, or standardize on one format in your download function.
MDAnalysis never raises an error for a selection that matches zero atoms — it silently returns an empty AtomGroup. A typo in your residue name, a wrong chain identifier, or using BioPython-style syntax instead of MDAnalysis syntax all produce zero-atom groups that cascade into division by zero, nan results, or wrong analysis without any obvious error message.
# ✗ Common mistakes — all return empty groups silently
lig = u.select_atoms("resname LIG ") # trailing space — no match
ca = u.select_atoms("name C-alpha") # wrong atom name (BioPython style)
ch = u.select_atoms("chain A") # wrong keyword — should be "segid"
# ✓ Correct MDAnalysis syntax
lig = u.select_atoms("resname LIG") # no trailing space
ca = u.select_atoms("name CA") # MDAnalysis uses PDB atom names
ch = u.select_atoms("segid A") # segid, not chain
# ✓ Always check your selection returned atoms
sel = u.select_atoms("resname LIG")
if sel.n_atoms == 0:
raise ValueError(f"Selection matched 0 atoms — check resname spelling")
print(f"Selected {sel.n_atoms} atoms")
# ✓ Inspect what residue names exist in the system
print(set(u.atoms.resnames)) # shows all residue names as loaded
PDB files use author residue numbering, which is often not sequential from 1. Residues can start at 0, at 100, or skip numbers entirely at chain breaks. BioPython uses the residue number from the PDB file directly — so chain[100] means “the residue labelled 100 in the PDB,” not “the 100th residue.” Researchers who assume sequential numbering get wrong residues without a KeyError if the number happens to exist, or a KeyError if it doesn’t.
# ✗ Assuming 1-based sequential numbering
binding_site_residue = chain[100] # wrong if PDB starts at 0 or has gaps
# ✓ Always inspect actual residue numbers first
resids = [res.id[1] for res in chain.get_residues() if res.id[0] == " "]
print(f"Residue numbers: {resids[:10]} … {resids[-5:]}")
# ✓ Or build a dict keyed by residue number for safe access
res_dict = {res.id[1]: res for res in chain.get_residues() if res.id[0] == " "}
if 100 in res_dict:
r = res_dict[100]
else:
print("Residue 100 not found — check the actual numbering")
This is the most consequential silent error in MD analysis. RMSD and RMSF measure structural deviation from a reference position — but a protein in a periodic boundary simulation also translates and rotates as a rigid body. Without alignment, these whole-protein motions are included in the RMSD, inflating values dramatically and making flexible loops appear rigid (because everything moves together) and stable cores appear flexible.
# ✗ Wrong — RMSD calculated on raw, unaligned trajectory
rmsd_analysis = rms.RMSD(u, select="backbone").run()
# Values include whole-protein translation — meaningless
# ✓ Always align first
from MDAnalysis.analysis import align, rms
# Step 1: align (modifies coordinates in the Universe in-place)
align.AlignTraj(u, u, select="backbone", in_memory=True).run()
# Step 2: THEN calculate RMSD on the aligned Universe
rmsd_analysis = rms.RMSD(u, select="backbone").run()
AlignTraj.
Using in_memory=True in AlignTraj loads the entire aligned trajectory into RAM. For a 100 ns simulation of a 100,000-atom system with 10,000 frames, this can require 100+ GB of memory. The result is either an out-of-memory crash or severe performance degradation as the system swaps to disk.
# ✗ In-memory alignment of a large system — may crash
align.AlignTraj(u, u, select="backbone", in_memory=True).run()
# ✓ Write aligned trajectory to disk instead
align.AlignTraj(u, u, select="backbone",
filename="aligned.xtc").run()
# Then reload the aligned trajectory
u_aligned = mda.Universe("topology.gro", "aligned.xtc")
# ✓ For analysis, use step to skip frames
rmsd_analysis = rms.RMSD(u, select="backbone").run(step=5)
# Analyzes every 5th frame — 5× less memory, usually sufficient
# ✓ Estimate memory before loading
n_atoms = u.atoms.n_atoms
n_frames = len(u.trajectory)
mem_gb = n_atoms * n_frames * 3 * 4 / 1e9 # float32 coordinates
print(f"Estimated in-memory size: {mem_gb:.1f} GB")
MDAnalysis analysis classes expect a Universe as input, not an AtomGroup. Beginners who select protein atoms first and then pass the selection to RMSD() or AlignTraj() get a TypeError or AttributeError that isn’t obviously explained by the error message.
# ✗ Wrong — passing an AtomGroup where a Universe is expected
protein = u.select_atoms("protein")
rmsd_analysis = rms.RMSD(protein, select="backbone").run()
# TypeError: 'AtomGroup' object has no attribute 'trajectory'
# ✓ Pass the Universe — use select= to restrict which atoms are measured
rmsd_analysis = rms.RMSD(u, select="backbone").run()
# The Universe u contains the trajectory; select= restricts the atoms used
# ✓ Rule of thumb:
# Universe (u) → analysis classes (RMSD, RMSF, HydrogenBondAnalysis)
# AtomGroup → positions, iteration, distances, writing
BioPython’s chain.get_residues() returns all residues including HETATM records — water molecules (HOH), ligands, ions, and crystallographic additives. A loop expecting only amino acids silently processes water molecules, causing wrong amino acid counts, KeyErrors when looking for CA atoms (waters don’t have one), and incorrect pLDDT extraction from AlphaFold structures.
# ✗ Wrong — includes waters and ligands
for res in chain.get_residues():
ca = res["CA"] # KeyError for HOH, ligands, ions
# ✓ Filter by het_flag — " " means standard amino acid
for res in chain.get_residues():
if res.id[0] != " ": # skip HETATM (waters, ligands)
continue
if "CA" not in res: # skip residues missing the alpha carbon
continue
ca = res["CA"]
# ... your analysis ...
# ✓ Or build a filtered list once
aa_residues = [r for r in chain.get_residues()
if r.id[0] == " " and "CA" in r]
print(f"Standard amino acid residues: {len(aa_residues)}")
Two opposite errors from the same mistake. Calling distance-based selections like "protein and around 5 resname LIG" inside a loop is intentional and correct — it updates each frame. But calling fixed selections like "resid 100" inside a loop is accidentally correct (same atoms every time) but drastically slow — it re-parses the selection string on every single frame, which can add minutes to long trajectory analysis.
# ✗ Slow — fixed selection re-parsed every frame
for ts in u.trajectory:
active_site = u.select_atoms("resid 85 100 173") # parsed 10,000 times
# ✓ Fast — select once before the loop, reuse the AtomGroup
active_site = u.select_atoms("resid 85 100 173") # parsed once
for ts in u.trajectory:
positions = active_site.positions # positions update automatically
# ✓ Distance-based selections SHOULD stay inside the loop
lig = u.select_atoms("resname LIG") # fixed — pre-select
for ts in u.trajectory:
# Dynamic: which protein atoms are near the ligand THIS frame?
pocket = u.select_atoms("protein and around 5 resname LIG")
around, sphlayer) since those depend on current frame coordinates.
Quick reference
| Symptom | Library | Fix |
|---|---|---|
| Parser error or zero atoms from CIF file | BioPython | MMCIFParser() for .cif, PDBParser() for .pdb |
| Empty AtomGroup, no error raised | MDAnalysis | Check n_atoms == 0; inspect set(u.atoms.resnames) |
| KeyError on chain[100] | BioPython | Print actual residue numbers with get_residues() |
| RMSD > 5 Å for a stable protein | MDAnalysis | Run AlignTraj before RMSD/RMSF calculation |
| MemoryError or system swap on analysis | MDAnalysis | Write aligned trajectory to disk; use step=N |
| TypeError: no attribute ‘trajectory’ | MDAnalysis | Pass Universe u, not an AtomGroup, to analysis classes |
| KeyError on res[“CA”] in residue loop | BioPython | Filter res.id[0] == ” ” to skip HETATM records |
| Analysis much slower than expected | MDAnalysis | Pre-select fixed atoms before the trajectory loop |
print(set(u.atoms.resnames)) to confirm what residue names actually exist in your system, print(sel.n_atoms) immediately after every selection to confirm it matched atoms, and print(len(u.trajectory), u.trajectory.totaltime) to confirm the trajectory loaded correctly. Most BioPython and MDAnalysis bugs reveal themselves within these three print statements.
The pattern behind most BioPython and MDAnalysis bugs
Six of these eight problems are silent — the code runs without crashing but produces wrong results. BioPython silently parses nothing when given the wrong file format. MDAnalysis silently returns an empty AtomGroup. Unaligned RMSD values are silently inflated. The defence is simple: add explicit checks after every critical operation. Check that the parser returned a structure with atoms. Check that every selection matched more than zero atoms. Check that RMSD values look physically reasonable before proceeding. These three habits catch 90% of structural biology Python bugs before they reach the analysis stage.