BioPython Tutorial for Beginners: Parsing PDB Files and Working with Protein Structures

BioPython Tutorial for Beginners: Parsing PDB Files and Working with Protein Structures

BioPython’s structure module turns a PDB file — which is just a formatted text file — into a Python object you can query, iterate, and calculate with. This tutorial covers everything you need to go from a PDB accession code to meaningful per-residue data in under 20 lines of code.

Installing BioPython

If you followed the environment setup tutorial, BioPython is already installed in your structbio conda environment. If not, install it now:

Terminal
# With your structbio environment active:
conda install -c conda-forge biopython -y

# Verify
python -c "import Bio; print(Bio.__version__)"
1.84

Loading a PDB file

BioPython provides two main parsers for structure files: PDBParser for the classic .pdb format and MMCIFParser for the modern .cif format. PDBParser is the right starting point — virtually every PDB download, AlphaFold structure, and GROMACS output uses the PDB format.

Python
from Bio.PDB import PDBParser, PDBList

# Option A: parse a local PDB file
parser = PDBParser(QUIET=True)   # QUIET=True suppresses non-fatal warnings
structure = parser.get_structure("my_protein", "protein.pdb")

# Option B: download directly from the RCSB PDB
pdbl = PDBList()
pdbl.retrieve_pdb_file("4XYZ", file_type="pdb", pdir=".")
structure = parser.get_structure("4XYZ", "./4x/pdb4xyz.ent")

# Option C: load an AlphaFold structure
structure = parser.get_structure("af_p53", "AF-P04637-F1-model_v4.pdb")

print(structure.id)          # "my_protein" — the name you gave it
print(structure.header)      # dict of PDB header information
QUIET=True — always use it
PDB files frequently contain minor formatting inconsistencies that BioPython reports as warnings — discontinuous chains, non-standard residues, missing atoms. Most of these are harmless. Setting QUIET=True suppresses these warnings so they don’t flood your output. If you’re debugging a parsing problem, set QUIET=False temporarily to see what’s being flagged.

The Structure → Model → Chain → Residue → Atom hierarchy

Every BioPython structure object follows a strict five-level hierarchy. Understanding this is the single most important concept in Bio.PDB — every access pattern and iteration follows it.

BioPython SMCRA hierarchy
Structure
structure
The whole file
Model
structure[0]
One conformer (NMR: many; X-ray: one)
Chain
model[“A”]
One polypeptide chain
Residue
chain[100]
One amino acid or ligand
Atom
residue[“CA”]
One atom with 3D coordinates

Each level is indexed differently: models by integer, chains by string ID, residues by their residue number (as an integer for standard residues), and atoms by atom name string. The model index is almost always 0 for X-ray crystal structures — NMR structures that contain multiple models are the exception.

Python
# Navigate directly to a specific atom
model   = structure[0]           # first model
chain   = model["A"]             # chain A
residue = chain[185]             # residue number 185
atom    = residue["CA"]           # alpha carbon

# Or chain navigation in one line
ca_185 = structure[0]["A"][185]["CA"]

# What type is each object?
print(type(structure))   # Bio.PDB.Structure.Structure
print(type(chain))       # Bio.PDB.Chain.Chain
print(type(residue))     # Bio.PDB.Residue.Residue
print(type(atom))        # Bio.PDB.Atom.Atom

Accessing chains

Python
model = structure[0]

# List all chains in the structure
for chain in model.get_chains():
    n_residues = len(list(chain.get_residues()))
    print(f"Chain {chain.id}: {n_residues} residues")
# Chain A: 283 residues
# Chain B: 278 residues

# Access a specific chain
chain_a = model["A"]
chain_b = model["B"]

# Count only standard amino acid residues (exclude HETATM)
aa_residues = [r for r in chain_a.get_residues()
               if r.get_id()[0] == " "]    # het flag " " = standard residue
print(f"Standard residues in chain A: {len(aa_residues)}")
The residue ID tuple — the most common confusion in BioPython
Every residue has a three-part ID: (het_flag, sequence_number, insertion_code). Standard amino acids have het_flag = " " (a space). HETATM records (ligands, waters, ions) have het_flag = "H_LIG" or "W" for water. The insertion code is usually " " but can be a letter like "A" for residues 100A. When iterating residues and getting unexpected results, print residue.get_id() to inspect all three parts.

Accessing and iterating residues

Python
chain = structure[0]["A"]

# Iterate over all residues
for res in chain.get_residues():
    het, seq, icode = res.get_id()
    print(res.get_resname(), seq, het)

# Get residue name (3-letter code)
res = chain[100]
print(res.get_resname())      # e.g. "GLU"
print(res.get_id()[1])        # sequence number: 100

# Filter: only standard amino acids
aa_only = [r for r in chain.get_residues() if r.id[0] == " "]

# Filter: only heteroatoms (ligands, waters)
hetatm  = [r for r in chain.get_residues() if r.id[0] != " "]

# Get sequence as one-letter codes
from Bio.PDB.Polypeptide import three_to_one
sequence = "".join(
    three_to_one(r.get_resname())
    for r in aa_only
    if r.get_resname() in three_to_one.__self__._three_to_one
)
print(sequence[:20])           # First 20 residues as single letters

Accessing atoms and coordinates

Python
residue = structure[0]["A"][185]

# List all atoms in a residue
for atom in residue.get_atoms():
    print(atom.get_name(), atom.get_coord())

# Access specific atoms by name
ca  = residue["CA"]    # alpha carbon
cb  = residue["CB"]    # beta carbon (not in Gly)
n   = residue["N"]     # backbone nitrogen
c   = residue["C"]     # backbone carbonyl carbon

# Get 3D coordinates as a NumPy-compatible array
coords = ca.get_coord()        # returns array([x, y, z])
x, y, z = coords
print(f"CA position: ({x:.3f}, {y:.3f}, {z:.3f}) Å")

# Get B-factor — this holds pLDDT in AlphaFold structures
bfactor = ca.get_bfactor()
print(f"B-factor / pLDDT: {bfactor:.2f}")

# Get element
print(ca.element)              # "C"

Calculating distances between atoms

BioPython’s Vector objects support direct subtraction and the minus operator returns the distance. For pairwise distance calculations across many atoms, NeighborSearch is much faster than nested loops.

Python
from Bio.PDB import NeighborSearch
import numpy as np

chain = structure[0]["A"]

# Distance between two specific atoms
atom1 = structure[0]["A"][185]["OG"]   # Ser 185 OG
atom2 = structure[0]["A"][273]["NE2"]  # His 273 NE2
distance = atom1 - atom2                  # BioPython overloads minus for distance
print(f"Distance: {distance:.2f} Å")

# Find all atoms within 5 Å of a target atom using NeighborSearch
all_atoms = list(structure.get_atoms())
ns = NeighborSearch(all_atoms)

target_atom = structure[0]["A"][300]["CA"]
nearby_atoms = ns.search(target_atom.get_coord(), 5.0, level="A")
print(f"Atoms within 5 Å: {len(nearby_atoms)}")

# Find all residues within 5 Å (level="R")
nearby_residues = ns.search(target_atom.get_coord(), 5.0, level="R")
for res in nearby_residues:
    print(res.get_resname(), res.get_id()[1])
Use NeighborSearch for contact analysis — never nested loops
A naive nested loop to find all atom pairs within 5 Å of each other has O(n²) complexity — for a 5,000-atom protein that’s 25 million comparisons. NeighborSearch uses a k-d tree internally and runs in O(n log n), making it orders of magnitude faster for any contact analysis involving more than a few dozen atoms.

Calculating protein properties

BioPython’s ProteinAnalysis class works from a sequence string and computes a wide range of physicochemical properties:

Python
from Bio.SeqUtils.ProtParam import ProteinAnalysis

# Build a sequence string from the structure
from Bio.PDB.Polypeptide import PPBuilder
ppb = PPBuilder()
sequence_str = ""
for pp in ppb.build_peptides(structure):
    sequence_str += str(pp.get_sequence())

# Analyze the sequence
analysis = ProteinAnalysis(sequence_str)

print(f"Length:              {len(sequence_str)} residues")
print(f"Molecular weight:    {analysis.molecular_weight():.1f} Da")
print(f"Isoelectric point:   {analysis.isoelectric_point():.2f}")
print(f"GRAVY score:         {analysis.gravy():.3f}")
print(f"Instability index:   {analysis.instability_index():.1f}")

# Amino acid composition as a dictionary
composition = analysis.get_amino_acids_percent()
for aa, pct in sorted(composition.items(), key=lambda x: -x[1])[:5]:
    print(f"  {aa}: {pct*100:.1f}%")

Writing modified structures to PDB

After modifying a structure — removing water molecules, extracting a single chain, or trimming to a binding site — write it back to a PDB file with PDBIO:

Python
from Bio.PDB import PDBIO, Select

# Write the whole structure
io = PDBIO()
io.set_structure(structure)
io.save("output.pdb")

# Write only chain A using a custom Select class
class ChainASelect(Select):
    def accept_chain(self, chain):
        return chain.id == "A"

io.save("chain_A_only.pdb", ChainASelect())

# Write only non-water HETATM residues (i.e. remove solvent)
class NoWaterSelect(Select):
    def accept_residue(self, residue):
        return residue.id[0] != "W"   # W = water het flag

io.save("no_water.pdb", NoWaterSelect())
MethodReturns
Structure-level
structure.get_models()Iterator over Model objects
structure.get_chains()Iterator over all Chain objects
structure.get_residues()Iterator over all Residue objects
structure.get_atoms()Iterator over every Atom object
Atom-level
atom.get_coord()NumPy array [x, y, z] in ångströms
atom.get_bfactor()B-factor value (pLDDT for AlphaFold)
atom.get_name()Atom name string e.g. “CA”, “OG”
atom.elementElement symbol e.g. “C”, “N”, “O”
atom1 – atom2Distance in ångströms between two atoms
Residue-level
residue.get_resname()Three-letter residue name e.g. “GLU”
residue.get_id()Tuple (het_flag, seq_num, icode)
residue.get_atoms()Iterator over Atom objects in residue
residue.is_disordered()True if residue has alternate conformations

BioPython PDB parsing in one paragraph

Load any PDB file with PDBParser(QUIET=True).get_structure("name", "file.pdb"). Navigate the five-level hierarchy — Structure → Model → Chain → Residue → Atom — using bracket indexing: structure[0]["A"][185]["CA"]. Iterate over any level with the corresponding get_chains(), get_residues(), or get_atoms() generator. Filter standard residues from HETATM records using the het flag: residue.id[0] == " " for standard amino acids. Get atom coordinates with atom.get_coord() and B-factors (pLDDT in AlphaFold files) with atom.get_bfactor(). Find nearby atoms efficiently with NeighborSearch — never nested loops. Write modified structures back to PDB with PDBIO and a custom Select class.

Last updated on

Similar Posts

Leave a Reply

Your email address will not be published. Required fields are marked *