BioPython Tutorial for Beginners: Parsing PDB Files and Working with Protein Structures
BioPython’s structure module turns a PDB file — which is just a formatted text file — into a Python object you can query, iterate, and calculate with. This tutorial covers everything you need to go from a PDB accession code to meaningful per-residue data in under 20 lines of code.
Installing BioPython
If you followed the environment setup tutorial, BioPython is already installed in your structbio conda environment. If not, install it now:
# With your structbio environment active:
conda install -c conda-forge biopython -y
# Verify
python -c "import Bio; print(Bio.__version__)"
1.84
Loading a PDB file
BioPython provides two main parsers for structure files: PDBParser for the classic .pdb format and MMCIFParser for the modern .cif format. PDBParser is the right starting point — virtually every PDB download, AlphaFold structure, and GROMACS output uses the PDB format.
from Bio.PDB import PDBParser, PDBList
# Option A: parse a local PDB file
parser = PDBParser(QUIET=True) # QUIET=True suppresses non-fatal warnings
structure = parser.get_structure("my_protein", "protein.pdb")
# Option B: download directly from the RCSB PDB
pdbl = PDBList()
pdbl.retrieve_pdb_file("4XYZ", file_type="pdb", pdir=".")
structure = parser.get_structure("4XYZ", "./4x/pdb4xyz.ent")
# Option C: load an AlphaFold structure
structure = parser.get_structure("af_p53", "AF-P04637-F1-model_v4.pdb")
print(structure.id) # "my_protein" — the name you gave it
print(structure.header) # dict of PDB header information
QUIET=True suppresses these warnings so they don’t flood your output. If you’re debugging a parsing problem, set QUIET=False temporarily to see what’s being flagged.
The Structure → Model → Chain → Residue → Atom hierarchy
Every BioPython structure object follows a strict five-level hierarchy. Understanding this is the single most important concept in Bio.PDB — every access pattern and iteration follows it.
Each level is indexed differently: models by integer, chains by string ID, residues by their residue number (as an integer for standard residues), and atoms by atom name string. The model index is almost always 0 for X-ray crystal structures — NMR structures that contain multiple models are the exception.
# Navigate directly to a specific atom
model = structure[0] # first model
chain = model["A"] # chain A
residue = chain[185] # residue number 185
atom = residue["CA"] # alpha carbon
# Or chain navigation in one line
ca_185 = structure[0]["A"][185]["CA"]
# What type is each object?
print(type(structure)) # Bio.PDB.Structure.Structure
print(type(chain)) # Bio.PDB.Chain.Chain
print(type(residue)) # Bio.PDB.Residue.Residue
print(type(atom)) # Bio.PDB.Atom.Atom
Accessing chains
model = structure[0]
# List all chains in the structure
for chain in model.get_chains():
n_residues = len(list(chain.get_residues()))
print(f"Chain {chain.id}: {n_residues} residues")
# Chain A: 283 residues
# Chain B: 278 residues
# Access a specific chain
chain_a = model["A"]
chain_b = model["B"]
# Count only standard amino acid residues (exclude HETATM)
aa_residues = [r for r in chain_a.get_residues()
if r.get_id()[0] == " "] # het flag " " = standard residue
print(f"Standard residues in chain A: {len(aa_residues)}")
(het_flag, sequence_number, insertion_code). Standard amino acids have het_flag = " " (a space). HETATM records (ligands, waters, ions) have het_flag = "H_LIG" or "W" for water. The insertion code is usually " " but can be a letter like "A" for residues 100A. When iterating residues and getting unexpected results, print residue.get_id() to inspect all three parts.
Accessing and iterating residues
chain = structure[0]["A"]
# Iterate over all residues
for res in chain.get_residues():
het, seq, icode = res.get_id()
print(res.get_resname(), seq, het)
# Get residue name (3-letter code)
res = chain[100]
print(res.get_resname()) # e.g. "GLU"
print(res.get_id()[1]) # sequence number: 100
# Filter: only standard amino acids
aa_only = [r for r in chain.get_residues() if r.id[0] == " "]
# Filter: only heteroatoms (ligands, waters)
hetatm = [r for r in chain.get_residues() if r.id[0] != " "]
# Get sequence as one-letter codes
from Bio.PDB.Polypeptide import three_to_one
sequence = "".join(
three_to_one(r.get_resname())
for r in aa_only
if r.get_resname() in three_to_one.__self__._three_to_one
)
print(sequence[:20]) # First 20 residues as single letters
Accessing atoms and coordinates
residue = structure[0]["A"][185]
# List all atoms in a residue
for atom in residue.get_atoms():
print(atom.get_name(), atom.get_coord())
# Access specific atoms by name
ca = residue["CA"] # alpha carbon
cb = residue["CB"] # beta carbon (not in Gly)
n = residue["N"] # backbone nitrogen
c = residue["C"] # backbone carbonyl carbon
# Get 3D coordinates as a NumPy-compatible array
coords = ca.get_coord() # returns array([x, y, z])
x, y, z = coords
print(f"CA position: ({x:.3f}, {y:.3f}, {z:.3f}) Å")
# Get B-factor — this holds pLDDT in AlphaFold structures
bfactor = ca.get_bfactor()
print(f"B-factor / pLDDT: {bfactor:.2f}")
# Get element
print(ca.element) # "C"
Calculating distances between atoms
BioPython’s Vector objects support direct subtraction and the minus operator returns the distance. For pairwise distance calculations across many atoms, NeighborSearch is much faster than nested loops.
from Bio.PDB import NeighborSearch
import numpy as np
chain = structure[0]["A"]
# Distance between two specific atoms
atom1 = structure[0]["A"][185]["OG"] # Ser 185 OG
atom2 = structure[0]["A"][273]["NE2"] # His 273 NE2
distance = atom1 - atom2 # BioPython overloads minus for distance
print(f"Distance: {distance:.2f} Å")
# Find all atoms within 5 Å of a target atom using NeighborSearch
all_atoms = list(structure.get_atoms())
ns = NeighborSearch(all_atoms)
target_atom = structure[0]["A"][300]["CA"]
nearby_atoms = ns.search(target_atom.get_coord(), 5.0, level="A")
print(f"Atoms within 5 Å: {len(nearby_atoms)}")
# Find all residues within 5 Å (level="R")
nearby_residues = ns.search(target_atom.get_coord(), 5.0, level="R")
for res in nearby_residues:
print(res.get_resname(), res.get_id()[1])
NeighborSearch uses a k-d tree internally and runs in O(n log n), making it orders of magnitude faster for any contact analysis involving more than a few dozen atoms.
Calculating protein properties
BioPython’s ProteinAnalysis class works from a sequence string and computes a wide range of physicochemical properties:
from Bio.SeqUtils.ProtParam import ProteinAnalysis
# Build a sequence string from the structure
from Bio.PDB.Polypeptide import PPBuilder
ppb = PPBuilder()
sequence_str = ""
for pp in ppb.build_peptides(structure):
sequence_str += str(pp.get_sequence())
# Analyze the sequence
analysis = ProteinAnalysis(sequence_str)
print(f"Length: {len(sequence_str)} residues")
print(f"Molecular weight: {analysis.molecular_weight():.1f} Da")
print(f"Isoelectric point: {analysis.isoelectric_point():.2f}")
print(f"GRAVY score: {analysis.gravy():.3f}")
print(f"Instability index: {analysis.instability_index():.1f}")
# Amino acid composition as a dictionary
composition = analysis.get_amino_acids_percent()
for aa, pct in sorted(composition.items(), key=lambda x: -x[1])[:5]:
print(f" {aa}: {pct*100:.1f}%")
Writing modified structures to PDB
After modifying a structure — removing water molecules, extracting a single chain, or trimming to a binding site — write it back to a PDB file with PDBIO:
from Bio.PDB import PDBIO, Select
# Write the whole structure
io = PDBIO()
io.set_structure(structure)
io.save("output.pdb")
# Write only chain A using a custom Select class
class ChainASelect(Select):
def accept_chain(self, chain):
return chain.id == "A"
io.save("chain_A_only.pdb", ChainASelect())
# Write only non-water HETATM residues (i.e. remove solvent)
class NoWaterSelect(Select):
def accept_residue(self, residue):
return residue.id[0] != "W" # W = water het flag
io.save("no_water.pdb", NoWaterSelect())
| Method | Returns |
|---|---|
| Structure-level | |
| structure.get_models() | Iterator over Model objects |
| structure.get_chains() | Iterator over all Chain objects |
| structure.get_residues() | Iterator over all Residue objects |
| structure.get_atoms() | Iterator over every Atom object |
| Atom-level | |
| atom.get_coord() | NumPy array [x, y, z] in ångströms |
| atom.get_bfactor() | B-factor value (pLDDT for AlphaFold) |
| atom.get_name() | Atom name string e.g. “CA”, “OG” |
| atom.element | Element symbol e.g. “C”, “N”, “O” |
| atom1 – atom2 | Distance in ångströms between two atoms |
| Residue-level | |
| residue.get_resname() | Three-letter residue name e.g. “GLU” |
| residue.get_id() | Tuple (het_flag, seq_num, icode) |
| residue.get_atoms() | Iterator over Atom objects in residue |
| residue.is_disordered() | True if residue has alternate conformations |
BioPython PDB parsing in one paragraph
Load any PDB file with PDBParser(QUIET=True).get_structure("name", "file.pdb"). Navigate the five-level hierarchy — Structure → Model → Chain → Residue → Atom — using bracket indexing: structure[0]["A"][185]["CA"]. Iterate over any level with the corresponding get_chains(), get_residues(), or get_atoms() generator. Filter standard residues from HETATM records using the het flag: residue.id[0] == " " for standard amino acids. Get atom coordinates with atom.get_coord() and B-factors (pLDDT in AlphaFold files) with atom.get_bfactor(). Find nearby atoms efficiently with NeighborSearch — never nested loops. Write modified structures back to PDB with PDBIO and a custom Select class.