Python for Structural Biology Cheat Sheet: BioPython and MDAnalysis Essential Commands

Python for Structural Biology Cheat Sheet: BioPython and MDAnalysis Essential Commands

Every essential command for BioPython and MDAnalysis — organized by task, not alphabetically. Bookmark this and keep it open while you code.

Environment setup
CommandWhat it does
conda create -n structbio python=3.10 -yCreate dedicated conda environment
conda activate structbioActivate before every session
conda install -c conda-forge biopython mdanalysis numpy pandas matplotlib -yInstall all essential packages at once
conda install -c conda-forge pymol-open-source -yAdd PyMOL to the same environment
conda env export > structbio.ymlSave environment spec for reproducibility
conda env create -f structbio.ymlRecreate environment from saved spec
BioPython — loading structures
CommandWhat it does
Parsers
PDBParser(QUIET=True)Parser for .pdb and .ent files
MMCIFParser(QUIET=True)Parser for .cif and .mmcif files — use for AlphaFold and newer PDB downloads
parser.get_structure(“name”, “file.pdb”)Load a structure — returns a Structure object
Downloading
PDBList().retrieve_pdb_file(“4XYZ”, pdir=”.”)Download PDB file from RCSB
requests.get(f”https://alphafold.ebi.ac.uk/files/AF-{uid}-F1-model_v4.pdb”)Download AlphaFold structure via API — no auth required
BioPython — navigation and access
CommandWhat it does
SMCRA hierarchy — Structure → Model → Chain → Residue → Atom
structure[0]First model (almost always the only one for X-ray / AF)
model[“A”]Chain A by identifier string
chain[100]Residue by PDB author number — not zero-indexed
residue[“CA”]Alpha carbon atom by name
structure[0][“A”][185][“OG”]Full path to a specific atom in one expression
Iteration
model.get_chains()Iterate all chains
chain.get_residues()Iterate all residues including HETATM — filter with res.id[0]==” “
residue.get_atoms()Iterate all atoms in a residue
structure.get_atoms()Iterate every atom in the entire structure
Residue identity
residue.idTuple (het_flag, seq_num, icode) — ” ” = standard AA, “W” = water
residue.id[0] == ” “True for standard amino acids — use to filter HETATM
residue.get_resname()Three-letter residue name e.g. “GLU”
residue.id[1]Residue sequence number from PDB
Atom properties
atom.get_coord()NumPy array [x, y, z] in ångströms
atom.get_bfactor()B-factor — holds pLDDT in AlphaFold structures
atom.get_name()Atom name string e.g. “CA”, “OG”, “NZ”
atom.elementElement symbol e.g. “C”, “N”, “O”
atom1 – atom2Distance in ångströms between two atoms
BioPython — calculations
CommandWhat it does
Neighbor search — use instead of nested loops
NeighborSearch(atom_list)Build k-d tree from list of Atom objects
ns.search(coords, radius, level=”R”)Find residues within radius of a point — “A”=atoms, “R”=residues
Structure superposition
Superimposer()Create superimposer object
sup.set_atoms(ref_atoms, mob_atoms)Calculate optimal superposition and RMSD — does not move anything
sup.rmsRMSD value in ångströms after set_atoms()
sup.apply(mobile.get_atoms())Apply transformation to move mobile structure in-place
Protein analysis (sequence-based)
ProteinAnalysis(sequence_str)Create analysis object from one-letter sequence string
analysis.molecular_weight()Molecular weight in Daltons
analysis.isoelectric_point()Theoretical isoelectric point (pH)
analysis.gravy()GRAVY hydrophobicity score — negative = soluble
analysis.instability_index()Predicted stability — <40 stable, ≥40 unstable
analysis.secondary_structure_fraction()(helix, turn, sheet) fractions — Chou-Fasman, sequence-based
BioPython — file I/O
CommandWhat it does
PDBIO().set_structure(structure); io.save(“out.pdb”)Write whole structure to PDB file
io.save(“chain_A.pdb”, ChainASelect())Write subset using a Select subclass — override accept_residue()
PPBuilder().build_peptides(structure)Extract polypeptide sequences from structure — returns list of Polypeptide objects
str(pp.get_sequence())Get sequence string from a Polypeptide object
MDAnalysis — loading
CommandWhat it does
mda.Universe(“topology.gro”, “traj.xtc”)Load GROMACS topology + trajectory
mda.Universe(“topology.prmtop”, “traj.nc”)Load AMBER topology + trajectory
mda.Universe(“topology.psf”, “traj.dcd”)Load NAMD/CHARMM topology + trajectory
mda.Universe(“protein.pdb”)Single structure with one frame — no trajectory
mda.Universe(“top.gro”, [“r1.xtc”, “r2.xtc”])Concatenate multiple trajectory files
Trajectory info
u.atoms.n_atomsTotal atom count
len(u.trajectory)Number of frames
u.trajectory.dtTimestep in picoseconds
u.trajectory.totaltimeTotal simulation time in picoseconds
u.trajectory[N]Seek to frame N
MDAnalysis — selections
Selection stringWhat it selects
Molecular type keywords
“protein”All protein residues
“backbone”Protein backbone: N, CA, C, O
“water”Water molecules (TIP3P, SPC etc.)
“not protein and not resname SOL NA CL”Ligands and non-standard residues only
Residue and atom identity
“resname GLU”All glutamate residues
“resid 100:200”Residues 100 through 200
“resid 85 100 173”Specific residues by ID (space-separated list)
“name CA”Alpha carbon atoms only
“name CA CB”Multiple atom names
“not type H”Heavy atoms only (no hydrogens)
Spatial and boolean
“protein and around 5 resname LIG”Protein atoms within 5 Å of ligand — updates each frame
“(resname ASP GLU) or (resname LYS ARG)”Charged residues — parentheses group OR clauses
“protein and not water”Boolean NOT — everything protein that isn’t water
AtomGroup properties
ag.positionsNumPy array (n_atoms, 3) — coordinates at current frame
ag.n_atomsNumber of atoms — check this is > 0 after any selection
ag.residsArray of residue numbers for each atom
ag.resnamesArray of residue names for each atom
ag.center_of_mass()Center of mass [x, y, z]
ag.radius_of_gyration()Radius of gyration in ångströms
MDAnalysis — trajectory analysis
CommandWhat it does
Alignment (run before RMSD/RMSF)
align.AlignTraj(u, u, select=”backbone”, in_memory=True).run()Align trajectory on backbone — required before RMSD/RMSF
align.AlignTraj(u, u, select=”backbone”, filename=”aligned.xtc”).run()Align and write to disk — use for large systems
RMSD and RMSF
rms.RMSD(u, select=”backbone”).run()RMSD over trajectory — results in .results.rmsd[:, 2] (Å)
rms.RMSD(u, select=”backbone”).run(start=50)Skip equilibration frames — start from frame 50
rms.RMSF(ca_atoms).run()Per-residue RMSF — results in .results.rmsf (Å)
H-bond analysis
HBA(universe=u).run()All intra-system H-bonds (requires H atoms in topology)
HBA(universe=u, between=[[“protein”], [“resname LIG”]]).run()Protein-ligand H-bonds only
hba.results.hbondsArray of (frame, donor_idx, H_idx, acceptor_idx, dist, angle)
Distance calculation
distance_array(ag1.positions, ag2.positions)Fast pairwise distance matrix — returns (n1, n2) NumPy array
np.linalg.norm(atom1.positions[0] – atom2.positions[0])Distance between two specific atoms at current frame
Trajectory iteration
for ts in u.trajectory:Iterate every frame — positions update automatically
for ts in u.trajectory[::5]:Every 5th frame (stride) — faster for large trajectories
for ts in u.trajectory[50:]:From frame 50 onward — skip equilibration
ts.frame, ts.time, ts.dimensionsCurrent frame index, time (ps), box dimensions [a,b,c,α,β,γ]
MDAnalysis — writing output
CommandWhat it does
protein.write(“protein_only.pdb”)Write current frame of selection to PDB
with mda.Writer(“out.xtc”, n_atoms) as w: w.write(ag)Write trajectory frames to XTC file
for ts in u.trajectory[::10]: w.write(protein)Write every 10th frame — reduces file size
u.trajectory[50]; protein.write(“frame50.pdb”)Write a specific frame as PDB for PyMOL
Cross-library workflows
AlphaFold pLDDT extraction (BioPython)
from Bio.PDB import PDBParser parser = PDBParser(QUIET=True) struct = parser.get_structure(“af”, “AF-P04637-F1-model_v4.pdb”) plddt = {res.id[1]: res[“CA”].get_bfactor() for res in struct[0][“A”].get_residues() if res.id[0] == ” “ and “CA” in res} # plddt = {resid: score, …} — filter: {k: v for k, v in plddt.items() if v >= 70}
Standard MD analysis setup (MDAnalysis)
import MDAnalysis as mda from MDAnalysis.analysis import align, rms u = mda.Universe(“topology.gro”, “md_vis.xtc”) align.AlignTraj(u, u, select=“backbone”, in_memory=True).run() # align first rmsd_data = rms.RMSD(u, select=“backbone”).run().results.rmsd[:, 2] rmsf_data = rms.RMSF(u.select_atoms(“backbone and name CA”)).run().results.rmsf
Interface residues — BioPython NeighborSearch
from Bio.PDB import NeighborSearch ns_b = NeighborSearch(list(model[“B”].get_atoms())) iface_a = {res_a for res_a in model[“A”].get_residues() if res_a.id[0] == ” “ for atom in res_a.get_atoms() if ns_b.search(atom.get_coord(), 5.0, level=“R”)}
Protein-ligand contact occupancy — MDAnalysis
from MDAnalysis.lib.distances import distance_array from collections import Counter lig = u.select_atoms(“resname LIG”) protein = u.select_atoms(“protein”) counts = Counter() for ts in u.trajectory: mask = (distance_array(lig.positions, protein.positions) < 5.0).any(axis=0) counts.update(protein.atoms[mask].resids) # occupancy_pct = {r: c/len(u.trajectory)*100 for r, c in counts.items()}
RMSD comparison — BioPython Superimposer
from Bio.PDB import Superimposer sup = Superimposer() # Build equal-length lists of paired Cα atoms ca_ref = [chain_ref[r][“CA”] for r in common_resids] ca_mob = [chain_mob[r][“CA”] for r in common_resids] sup.set_atoms(ca_ref, ca_mob) print(f“RMSD: {sup.rms:.3f} Å over {len(ca_ref)} Cα atoms”) sup.apply(mobile.get_atoms()) # move mobile to aligned position
Three checks before any analysis
(1) After every BioPython parse: print(len(list(structure.get_atoms()))) — confirm atoms loaded. (2) After every MDAnalysis selection: assert sel.n_atoms > 0, "Empty selection" — catch silent mismatches. (3) After alignment: check RMSD is physically reasonable (backbone RMSD < 5 Å for a stable protein) before running RMSF or H-bond analysis.
Last updated on

Similar Posts

Leave a Reply

Your email address will not be published. Required fields are marked *