Common BioPython and MDAnalysis Mistakes and How to Fix Them

Common BioPython and MDAnalysis Mistakes and How to Fix Them

Most BioPython and MDAnalysis errors are silent — the code runs without crashing but produces wrong residue counts, inflated RMSD values, or empty selections. These are the eight mistakes that appear most often in help threads, and the fixes that prevent all of them.

01
BioPython — very common
Using PDBParser on an mmCIF file, or MMCIFParser on a PDB file
BioPython

The AlphaFold Database and the RCSB PDB now distribute many structures in mmCIF format (.cif). Passing a CIF file to PDBParser produces a cryptic error or parses exactly zero atoms without any useful warning. The reverse — passing a .pdb file to MMCIFParser — behaves the same way.

Python — wrong and right
# ✗ Wrong — CIF file passed to PDB parser
parser = PDBParser(QUIET=True)
structure = parser.get_structure("p", "AF-P04637.cif")  # fails silently

# ✓ Right — match parser to file format
from Bio.PDB import MMCIFParser
parser = MMCIFParser(QUIET=True)
structure = parser.get_structure("p", "AF-P04637.cif")

# Auto-detect by extension
def load_structure(path, name="s"):
    from Bio.PDB import PDBParser, MMCIFParser
    parser = MMCIFParser(QUIET=True) if path.endswith((".cif", ".mmcif")) \
             else PDBParser(QUIET=True)
    return parser.get_structure(name, path)
The fix
Use PDBParser for .pdb and .ent files; use MMCIFParser for .cif and .mmcif files. When downloading from the RCSB or AFDB programmatically, check the file extension before choosing the parser, or standardize on one format in your download function.
02
MDAnalysis — most common error
Atom selection syntax errors — getting an empty AtomGroup with no warning
MDAnalysis

MDAnalysis never raises an error for a selection that matches zero atoms — it silently returns an empty AtomGroup. A typo in your residue name, a wrong chain identifier, or using BioPython-style syntax instead of MDAnalysis syntax all produce zero-atom groups that cascade into division by zero, nan results, or wrong analysis without any obvious error message.

Python
# ✗ Common mistakes — all return empty groups silently
lig = u.select_atoms("resname LIG ")      # trailing space — no match
ca  = u.select_atoms("name C-alpha")      # wrong atom name (BioPython style)
ch  = u.select_atoms("chain A")           # wrong keyword — should be "segid"

# ✓ Correct MDAnalysis syntax
lig = u.select_atoms("resname LIG")       # no trailing space
ca  = u.select_atoms("name CA")           # MDAnalysis uses PDB atom names
ch  = u.select_atoms("segid A")           # segid, not chain

# ✓ Always check your selection returned atoms
sel = u.select_atoms("resname LIG")
if sel.n_atoms == 0:
    raise ValueError(f"Selection matched 0 atoms — check resname spelling")
print(f"Selected {sel.n_atoms} atoms")

# ✓ Inspect what residue names exist in the system
print(set(u.atoms.resnames))   # shows all residue names as loaded
03
BioPython — silent wrong results
Off-by-one residue numbering — KeyError or wrong residue selected
BioPython

PDB files use author residue numbering, which is often not sequential from 1. Residues can start at 0, at 100, or skip numbers entirely at chain breaks. BioPython uses the residue number from the PDB file directly — so chain[100] means “the residue labelled 100 in the PDB,” not “the 100th residue.” Researchers who assume sequential numbering get wrong residues without a KeyError if the number happens to exist, or a KeyError if it doesn’t.

Python
# ✗ Assuming 1-based sequential numbering
binding_site_residue = chain[100]   # wrong if PDB starts at 0 or has gaps

# ✓ Always inspect actual residue numbers first
resids = [res.id[1] for res in chain.get_residues() if res.id[0] == " "]
print(f"Residue numbers: {resids[:10]} … {resids[-5:]}")

# ✓ Or build a dict keyed by residue number for safe access
res_dict = {res.id[1]: res for res in chain.get_residues() if res.id[0] == " "}
if 100 in res_dict:
    r = res_dict[100]
else:
    print("Residue 100 not found — check the actual numbering")
04
MDAnalysis — inflated values, wrong science
Calculating RMSD or RMSF without aligning the trajectory first
MDAnalysis

This is the most consequential silent error in MD analysis. RMSD and RMSF measure structural deviation from a reference position — but a protein in a periodic boundary simulation also translates and rotates as a rigid body. Without alignment, these whole-protein motions are included in the RMSD, inflating values dramatically and making flexible loops appear rigid (because everything moves together) and stable cores appear flexible.

Python
# ✗ Wrong — RMSD calculated on raw, unaligned trajectory
rmsd_analysis = rms.RMSD(u, select="backbone").run()
# Values include whole-protein translation — meaningless

# ✓ Always align first
from MDAnalysis.analysis import align, rms

# Step 1: align (modifies coordinates in the Universe in-place)
align.AlignTraj(u, u, select="backbone", in_memory=True).run()

# Step 2: THEN calculate RMSD on the aligned Universe
rmsd_analysis = rms.RMSD(u, select="backbone").run()
Diagnosis
If your backbone RMSD plot shows values above 5 Å for a protein that you expect to be stable, or if RMSD rises steadily without plateauing for a short simulation, you likely forgot to align. Recalculate after running AlignTraj.
05
MDAnalysis — crashes or swaps to disk
Memory errors when loading large trajectories with in_memory=True
MDAnalysis

Using in_memory=True in AlignTraj loads the entire aligned trajectory into RAM. For a 100 ns simulation of a 100,000-atom system with 10,000 frames, this can require 100+ GB of memory. The result is either an out-of-memory crash or severe performance degradation as the system swaps to disk.

Python
# ✗ In-memory alignment of a large system — may crash
align.AlignTraj(u, u, select="backbone", in_memory=True).run()

# ✓ Write aligned trajectory to disk instead
align.AlignTraj(u, u, select="backbone",
                filename="aligned.xtc").run()
# Then reload the aligned trajectory
u_aligned = mda.Universe("topology.gro", "aligned.xtc")

# ✓ For analysis, use step to skip frames
rmsd_analysis = rms.RMSD(u, select="backbone").run(step=5)
# Analyzes every 5th frame — 5× less memory, usually sufficient

# ✓ Estimate memory before loading
n_atoms  = u.atoms.n_atoms
n_frames = len(u.trajectory)
mem_gb   = n_atoms * n_frames * 3 * 4 / 1e9  # float32 coordinates
print(f"Estimated in-memory size: {mem_gb:.1f} GB")
06
MDAnalysis — TypeError or AttributeError
Mixing up Universe and AtomGroup — passing the wrong object to analysis functions
MDAnalysis

MDAnalysis analysis classes expect a Universe as input, not an AtomGroup. Beginners who select protein atoms first and then pass the selection to RMSD() or AlignTraj() get a TypeError or AttributeError that isn’t obviously explained by the error message.

Python
# ✗ Wrong — passing an AtomGroup where a Universe is expected
protein = u.select_atoms("protein")
rmsd_analysis = rms.RMSD(protein, select="backbone").run()
# TypeError: 'AtomGroup' object has no attribute 'trajectory'

# ✓ Pass the Universe — use select= to restrict which atoms are measured
rmsd_analysis = rms.RMSD(u, select="backbone").run()
# The Universe u contains the trajectory; select= restricts the atoms used

# ✓ Rule of thumb:
#   Universe (u)      → analysis classes (RMSD, RMSF, HydrogenBondAnalysis)
#   AtomGroup         → positions, iteration, distances, writing
07
BioPython — wrong residue counts
Iterating residues and getting HETATM records — waters and ligands in your loop
BioPython

BioPython’s chain.get_residues() returns all residues including HETATM records — water molecules (HOH), ligands, ions, and crystallographic additives. A loop expecting only amino acids silently processes water molecules, causing wrong amino acid counts, KeyErrors when looking for CA atoms (waters don’t have one), and incorrect pLDDT extraction from AlphaFold structures.

Python
# ✗ Wrong — includes waters and ligands
for res in chain.get_residues():
    ca = res["CA"]   # KeyError for HOH, ligands, ions

# ✓ Filter by het_flag — " " means standard amino acid
for res in chain.get_residues():
    if res.id[0] != " ":   # skip HETATM (waters, ligands)
        continue
    if "CA" not in res:    # skip residues missing the alpha carbon
        continue
    ca = res["CA"]
    # ... your analysis ...

# ✓ Or build a filtered list once
aa_residues = [r for r in chain.get_residues()
               if r.id[0] == " " and "CA" in r]
print(f"Standard amino acid residues: {len(aa_residues)}")
08
MDAnalysis — slow code or wrong contact lists
Calling select_atoms() inside a trajectory loop — slow and wrong for fixed selections
Both

Two opposite errors from the same mistake. Calling distance-based selections like "protein and around 5 resname LIG" inside a loop is intentional and correct — it updates each frame. But calling fixed selections like "resid 100" inside a loop is accidentally correct (same atoms every time) but drastically slow — it re-parses the selection string on every single frame, which can add minutes to long trajectory analysis.

Python
# ✗ Slow — fixed selection re-parsed every frame
for ts in u.trajectory:
    active_site = u.select_atoms("resid 85 100 173")   # parsed 10,000 times

# ✓ Fast — select once before the loop, reuse the AtomGroup
active_site = u.select_atoms("resid 85 100 173")       # parsed once
for ts in u.trajectory:
    positions = active_site.positions   # positions update automatically

# ✓ Distance-based selections SHOULD stay inside the loop
lig = u.select_atoms("resname LIG")    # fixed — pre-select
for ts in u.trajectory:
    # Dynamic: which protein atoms are near the ligand THIS frame?
    pocket = u.select_atoms("protein and around 5 resname LIG")
Rule of thumb
Pre-select anything that doesn’t change between frames (residue IDs, atom names, chain identifiers). Keep inside the loop only selections that use spatial criteria (around, sphlayer) since those depend on current frame coordinates.

Quick reference

SymptomLibraryFix
Parser error or zero atoms from CIF fileBioPythonMMCIFParser() for .cif, PDBParser() for .pdb
Empty AtomGroup, no error raisedMDAnalysisCheck n_atoms == 0; inspect set(u.atoms.resnames)
KeyError on chain[100]BioPythonPrint actual residue numbers with get_residues()
RMSD > 5 Å for a stable proteinMDAnalysisRun AlignTraj before RMSD/RMSF calculation
MemoryError or system swap on analysisMDAnalysisWrite aligned trajectory to disk; use step=N
TypeError: no attribute ‘trajectory’MDAnalysisPass Universe u, not an AtomGroup, to analysis classes
KeyError on res[“CA”] in residue loopBioPythonFilter res.id[0] == ” ” to skip HETATM records
Analysis much slower than expectedMDAnalysisPre-select fixed atoms before the trajectory loop
The universal debugging step
Before any analysis, run these three diagnostic checks: print(set(u.atoms.resnames)) to confirm what residue names actually exist in your system, print(sel.n_atoms) immediately after every selection to confirm it matched atoms, and print(len(u.trajectory), u.trajectory.totaltime) to confirm the trajectory loaded correctly. Most BioPython and MDAnalysis bugs reveal themselves within these three print statements.

The pattern behind most BioPython and MDAnalysis bugs

Six of these eight problems are silent — the code runs without crashing but produces wrong results. BioPython silently parses nothing when given the wrong file format. MDAnalysis silently returns an empty AtomGroup. Unaligned RMSD values are silently inflated. The defence is simple: add explicit checks after every critical operation. Check that the parser returned a structure with atoms. Check that every selection matched more than zero atoms. Check that RMSD values look physically reasonable before proceeding. These three habits catch 90% of structural biology Python bugs before they reach the analysis stage.

Last updated on

Similar Posts

Leave a Reply

Your email address will not be published. Required fields are marked *