BioPython Tutorial for Beginners: Parsing PDB Files and Working with Protein Structures

12 min read Beginner — basic Python assumed

BioPython’s structure module turns a PDB file — which is just a formatted text file — into a Python object you can query, iterate, and calculate with. This tutorial covers everything you need to go from a PDB accession code to meaningful per-residue data in under 20 lines of code.

Installing BioPython

If you followed the environment setup tutorial, BioPython is already installed in your structbio conda environment. If not, install it now:

Terminal

# With your structbio environment active:
conda install -c conda-forge biopython -y

# Verify
python -c "import Bio; print(Bio.__version__)"
1.84

Loading a PDB file

BioPython provides two main parsers for structure files: PDBParser for the classic .pdb format and MMCIFParser for the modern .cif format. PDBParser is the right starting point — virtually every PDB download, AlphaFold structure, and GROMACS output uses the PDB format.

Python

from Bio.PDB import PDBParser, PDBList

# Option A: parse a local PDB file
parser = PDBParser(QUIET=True)   # QUIET=True suppresses non-fatal warnings
structure = parser.get_structure("my_protein", "protein.pdb")

# Option B: download directly from the RCSB PDB
pdbl = PDBList()
pdbl.retrieve_pdb_file("4XYZ", file_type="pdb", pdir=".")
structure = parser.get_structure("4XYZ", "./4x/pdb4xyz.ent")

# Option C: load an AlphaFold structure
structure = parser.get_structure("af_p53", "AF-P04637-F1-model_v4.pdb")

print(structure.id)          # "my_protein" — the name you gave it
print(structure.header)      # dict of PDB header information

QUIET=True — always use it

PDB files frequently contain minor formatting inconsistencies that BioPython reports as warnings — discontinuous chains, non-standard residues, missing atoms. Most of these are harmless. Setting QUIET=True suppresses these warnings so they don’t flood your output. If you’re debugging a parsing problem, set QUIET=False temporarily to see what’s being flagged.

The Structure → Model → Chain → Residue → Atom hierarchy

Every BioPython structure object follows a strict five-level hierarchy. Understanding this is the single most important concept in Bio.PDB — every access pattern and iteration follows it.

BioPython SMCRA hierarchy

Structure

structure

The whole file

Model

structure[0]

One conformer (NMR: many; X-ray: one)

Chain

model[“A”]

One polypeptide chain

Residue

chain[100]

One amino acid or ligand

Atom

residue[“CA”]

One atom with 3D coordinates

Each level is indexed differently: models by integer, chains by string ID, residues by their residue number (as an integer for standard residues), and atoms by atom name string. The model index is almost always 0 for X-ray crystal structures — NMR structures that contain multiple models are the exception.

Python

# Navigate directly to a specific atom
model   = structure[0]           # first model
chain   = model["A"]             # chain A
residue = chain[185]             # residue number 185
atom    = residue["CA"]           # alpha carbon

# Or chain navigation in one line
ca_185 = structure[0]["A"][185]["CA"]

# What type is each object?
print(type(structure))   # Bio.PDB.Structure.Structure
print(type(chain))       # Bio.PDB.Chain.Chain
print(type(residue))     # Bio.PDB.Residue.Residue
print(type(atom))        # Bio.PDB.Atom.Atom

Accessing chains

Python

model = structure[0]

# List all chains in the structure
for chain in model.get_chains():
    n_residues = len(list(chain.get_residues()))
    print(f"Chain {chain.id}: {n_residues} residues")
# Chain A: 283 residues
# Chain B: 278 residues

# Access a specific chain
chain_a = model["A"]
chain_b = model["B"]

# Count only standard amino acid residues (exclude HETATM)
aa_residues = [r for r in chain_a.get_residues()
               if r.get_id()[0] == " "]    # het flag " " = standard residue
print(f"Standard residues in chain A: {len(aa_residues)}")

The residue ID tuple — the most common confusion in BioPython

Every residue has a three-part ID: (het_flag, sequence_number, insertion_code). Standard amino acids have het_flag = " " (a space). HETATM records (ligands, waters, ions) have het_flag = "H_LIG" or "W" for water. The insertion code is usually " " but can be a letter like "A" for residues 100A. When iterating residues and getting unexpected results, print residue.get_id() to inspect all three parts.

Accessing and iterating residues

Python

chain = structure[0]["A"]

# Iterate over all residues
for res in chain.get_residues():
    het, seq, icode = res.get_id()
    print(res.get_resname(), seq, het)

# Get residue name (3-letter code)
res = chain[100]
print(res.get_resname())      # e.g. "GLU"
print(res.get_id()[1])        # sequence number: 100

# Filter: only standard amino acids
aa_only = [r for r in chain.get_residues() if r.id[0] == " "]

# Filter: only heteroatoms (ligands, waters)
hetatm  = [r for r in chain.get_residues() if r.id[0] != " "]

# Get sequence as one-letter codes
from Bio.PDB.Polypeptide import three_to_one
sequence = "".join(
    three_to_one(r.get_resname())
    for r in aa_only
    if r.get_resname() in three_to_one.__self__._three_to_one
)
print(sequence[:20])           # First 20 residues as single letters

Accessing atoms and coordinates

Python

residue = structure[0]["A"][185]

# List all atoms in a residue
for atom in residue.get_atoms():
    print(atom.get_name(), atom.get_coord())

# Access specific atoms by name
ca  = residue["CA"]    # alpha carbon
cb  = residue["CB"]    # beta carbon (not in Gly)
n   = residue["N"]     # backbone nitrogen
c   = residue["C"]     # backbone carbonyl carbon

# Get 3D coordinates as a NumPy-compatible array
coords = ca.get_coord()        # returns array([x, y, z])
x, y, z = coords
print(f"CA position: ({x:.3f}, {y:.3f}, {z:.3f}) Å")

# Get B-factor — this holds pLDDT in AlphaFold structures
bfactor = ca.get_bfactor()
print(f"B-factor / pLDDT: {bfactor:.2f}")

# Get element
print(ca.element)              # "C"

Calculating distances between atoms

BioPython’s Vector objects support direct subtraction and the minus operator returns the distance. For pairwise distance calculations across many atoms, NeighborSearch is much faster than nested loops.

Python

from Bio.PDB import NeighborSearch
import numpy as np

chain = structure[0]["A"]

# Distance between two specific atoms
atom1 = structure[0]["A"][185]["OG"]   # Ser 185 OG
atom2 = structure[0]["A"][273]["NE2"]  # His 273 NE2
distance = atom1 - atom2                  # BioPython overloads minus for distance
print(f"Distance: {distance:.2f} Å")

# Find all atoms within 5 Å of a target atom using NeighborSearch
all_atoms = list(structure.get_atoms())
ns = NeighborSearch(all_atoms)

target_atom = structure[0]["A"][300]["CA"]
nearby_atoms = ns.search(target_atom.get_coord(), 5.0, level="A")
print(f"Atoms within 5 Å: {len(nearby_atoms)}")

# Find all residues within 5 Å (level="R")
nearby_residues = ns.search(target_atom.get_coord(), 5.0, level="R")
for res in nearby_residues:
    print(res.get_resname(), res.get_id()[1])

Use NeighborSearch for contact analysis — never nested loops

A naive nested loop to find all atom pairs within 5 Å of each other has O(n²) complexity — for a 5,000-atom protein that’s 25 million comparisons. NeighborSearch uses a k-d tree internally and runs in O(n log n), making it orders of magnitude faster for any contact analysis involving more than a few dozen atoms.

Calculating protein properties

BioPython’s ProteinAnalysis class works from a sequence string and computes a wide range of physicochemical properties:

Python

from Bio.SeqUtils.ProtParam import ProteinAnalysis

# Build a sequence string from the structure
from Bio.PDB.Polypeptide import PPBuilder
ppb = PPBuilder()
sequence_str = ""
for pp in ppb.build_peptides(structure):
    sequence_str += str(pp.get_sequence())

# Analyze the sequence
analysis = ProteinAnalysis(sequence_str)

print(f"Length:              {len(sequence_str)} residues")
print(f"Molecular weight:    {analysis.molecular_weight():.1f} Da")
print(f"Isoelectric point:   {analysis.isoelectric_point():.2f}")
print(f"GRAVY score:         {analysis.gravy():.3f}")
print(f"Instability index:   {analysis.instability_index():.1f}")

# Amino acid composition as a dictionary
composition = analysis.get_amino_acids_percent()
for aa, pct in sorted(composition.items(), key=lambda x: -x[1])[:5]:
    print(f"  {aa}: {pct*100:.1f}%")

Writing modified structures to PDB

After modifying a structure — removing water molecules, extracting a single chain, or trimming to a binding site — write it back to a PDB file with PDBIO:

Python

from Bio.PDB import PDBIO, Select

# Write the whole structure
io = PDBIO()
io.set_structure(structure)
io.save("output.pdb")

# Write only chain A using a custom Select class
class ChainASelect(Select):
    def accept_chain(self, chain):
        return chain.id == "A"

io.save("chain_A_only.pdb", ChainASelect())

# Write only non-water HETATM residues (i.e. remove solvent)
class NoWaterSelect(Select):
    def accept_residue(self, residue):
        return residue.id[0] != "W"   # W = water het flag

io.save("no_water.pdb", NoWaterSelect())

Method	Returns
Structure-level
structure.get_models()	Iterator over Model objects
structure.get_chains()	Iterator over all Chain objects
structure.get_residues()	Iterator over all Residue objects
structure.get_atoms()	Iterator over every Atom object
Atom-level
atom.get_coord()	NumPy array [x, y, z] in ångströms
atom.get_bfactor()	B-factor value (pLDDT for AlphaFold)
atom.get_name()	Atom name string e.g. “CA”, “OG”
atom.element	Element symbol e.g. “C”, “N”, “O”
atom1 – atom2	Distance in ångströms between two atoms
Residue-level
residue.get_resname()	Three-letter residue name e.g. “GLU”
residue.get_id()	Tuple (het_flag, seq_num, icode)
residue.get_atoms()	Iterator over Atom objects in residue
residue.is_disordered()	True if residue has alternate conformations

BioPython PDB parsing in one paragraph

Load any PDB file with PDBParser(QUIET=True).get_structure("name", "file.pdb"). Navigate the five-level hierarchy — Structure → Model → Chain → Residue → Atom — using bracket indexing: structure[0]["A"][185]["CA"]. Iterate over any level with the corresponding get_chains(), get_residues(), or get_atoms() generator. Filter standard residues from HETATM records using the het flag: residue.id[0] == " " for standard amino acids. Get atom coordinates with atom.get_coord() and B-factors (pLDDT in AlphaFold files) with atom.get_bfactor(). Find nearby atoms efficiently with NeighborSearch — never nested loops. Write modified structures back to PDB with PDBIO and a custom Select class.

BioPython Tutorial for Beginners: Parsing PDB Files and Working with Protein Structures

Installing BioPython

Loading a PDB file

The Structure → Model → Chain → Residue → Atom hierarchy

Accessing chains

Accessing and iterating residues

Accessing atoms and coordinates

Calculating distances between atoms

Calculating protein properties

Writing modified structures to PDB

BioPython PDB parsing in one paragraph

Python for Structural Biology: The Complete Guide to BioPython and MDAnalysis

How to Calculate RMSD and RMSF from MD Simulations with MDAnalysis

Common BioPython and MDAnalysis Mistakes and How to Fix Them

How to Set Up a Python Environment for Structural Biology (2026)

How to Find and Analyze Protein-Protein Interfaces with BioPython

How to Calculate Protein Properties with BioPython: Molecular Weight, GRAVY and Secondary Structure

Leave a Reply Cancel reply

Installing BioPython

Loading a PDB file

The Structure → Model → Chain → Residue → Atom hierarchy

Accessing chains

Accessing and iterating residues

Accessing atoms and coordinates

Calculating distances between atoms

Calculating protein properties

Writing modified structures to PDB

BioPython PDB parsing in one paragraph

Similar Posts

Leave a Reply Cancel reply