How to Analyze Protein-Ligand Interactions over an MD Trajectory with MDAnalysis
Running a protein-ligand MD simulation is only half the work. The analysis — does the ligand stay in the pocket? Which contacts persist? Which key distances are maintained? — is what produces the data that goes into the paper. This tutorial covers the complete analysis workflow in Python.
The standard protein-ligand MD analysis workflow follows a clear sequence:
- Align the trajectory on the protein backbone so the binding site stays fixed — without this, all distance measurements include whole-protein drift.
- Track ligand RMSD relative to the binding site — the primary stability metric. Rising RMSD means the ligand is drifting or leaving.
- Count contacts between the ligand and binding site residues each frame — gives a per-frame count and identifies which residues interact.
- Track key atomic distances — specific donor-acceptor pairs, H-bond distances, and any chemically important atom pair distances over the full trajectory.
- Calculate occupancy for each contact — the fraction of frames with a contact present. Report contacts above 60% occupancy as persistent.
Loading the system and selecting the ligand
import MDAnalysis as mda
from MDAnalysis.analysis import align, rms
import numpy as np
import pandas as pd
u = mda.Universe("protein_ligand.gro", "trajectory.xtc")
# Inspect what heteroatom residue names are present
hetatm = u.select_atoms("not protein and not resname SOL NA CL")
print("Non-solvent resnames:", set(hetatm.resnames))
# Non-solvent resnames: {'LIG'}
# Define key selections — do this once before the loop
ligand = u.select_atoms("resname LIG")
protein = u.select_atoms("protein")
pocket = u.select_atoms("protein and around 5 resname LIG") # dynamic
print(f"Ligand atoms: {ligand.n_atoms}")
print(f"Protein atoms: {protein.n_atoms}")
print(f"Frames: {len(u.trajectory)}")
# Align trajectory on protein backbone (critical — do before all analysis)
align.AlignTraj(u, u, select="backbone", in_memory=True).run()
LIG, MOL, UNK, or a custom 3-letter code from the parameterization step. If you’re not sure, run set(u.select_atoms("not protein and not resname SOL NA CL MG").resnames) to list all non-standard residue names in the system.
Tracking ligand RMSD relative to the binding site
Ligand RMSD measures how much the ligand moves relative to the binding site — the clearest indicator of whether the ligand is staying bound. The key is to calculate RMSD after aligning on the protein, not on the ligand itself, so the reference frame is the binding site rather than the ligand’s own starting position.
# Ligand RMSD — after backbone alignment, binding site is the reference
lig_rmsd_vals = []
times = []
for ts in u.trajectory:
# RMSD of ligand heavy atoms vs their position in frame 0
if ts.frame == 0:
ref_positions = ligand.positions.copy()
lig_rmsd = np.sqrt(np.mean(np.sum(
(ligand.positions - ref_positions) ** 2, axis=1
)))
lig_rmsd_vals.append(lig_rmsd)
times.append(ts.time / 1000) # ps → ns
lig_rmsd_arr = np.array(lig_rmsd_vals)
print(f"Mean ligand RMSD: {lig_rmsd_arr.mean():.2f} ± {lig_rmsd_arr.std():.2f} Å")
print(f"Max ligand RMSD: {lig_rmsd_arr.max():.2f} Å at {times[lig_rmsd_arr.argmax()]:.1f} ns")
# Alternatively use MDAnalysis RMSD class (cleaner for multiple selections)
rmsd_analysis = rms.RMSD(u, select="resname LIG", ref_frame=0).run()
lig_rmsd_arr = rmsd_analysis.results.rmsd[:, 2]
- RMSD plateaus at 1–2 Å from frame 0
- No sustained upward drift
- May fluctuate but returns to baseline
- Mean RMSD low (< 2 Å for rigid ligand)
- RMSD increases steadily without plateau
- Sudden jump to > 5 Å indicates dissociation
- Contact count drops to zero simultaneously
- Check this frame in PyMOL to confirm
Contact analysis over time
Counting how many protein atoms are within a cutoff distance of the ligand each frame gives a quick read on binding stability — a maintained contact count means the ligand is staying in the pocket, while a dropping count indicates partial or full dissociation.
from MDAnalysis.lib.distances import distance_array
cutoff = 5.0 # Å
contact_counts = []
contact_residues_per_frame = []
for ts in u.trajectory:
# Fast pairwise distance calculation between ligand and protein atoms
dists = distance_array(ligand.positions, protein.positions)
# Contact = any ligand atom within cutoff of any protein atom
in_contact = (dists < cutoff).any(axis=0)
contact_atoms = protein.atoms[in_contact]
# Count and record
contact_counts.append(in_contact.sum())
# Track which residues are in contact this frame
contact_residues_per_frame.append(set(contact_atoms.resids))
contact_counts = np.array(contact_counts)
print(f"Mean protein atoms in contact: {contact_counts.mean():.1f}")
print(f"Frames with 0 contacts: {(contact_counts == 0).sum()} ({(contact_counts==0).mean()*100:.1f}%)")
distance_array from MDAnalysis.lib.distances uses optimized C code for pairwise distance computation — it’s much faster than writing nested Python loops. For a ligand with 30 atoms and a binding site with 200 atoms, this function computes all 6,000 distances in microseconds versus potentially seconds with pure Python.
Key atomic distances over the trajectory
Beyond contact counts, tracking the distance between specific atom pairs — a key H-bond donor-acceptor pair, a salt bridge, a π-stacking interaction — shows whether individual important contacts are maintained throughout the simulation.
# Define key atom pairs to track
# Each tuple: (selection_string_atom1, selection_string_atom2, label)
key_pairs = [
("resname LIG and name O3", "resid 185 and name OG", "LIG-O3···Ser185-OG"),
("resname LIG and name N1", "resid 91 and name NZ", "LIG-N1···Lys91-NZ"),
("resname LIG and name O1", "resid 273 and name NH1", "LIG-O1···Arg273-NH1"),
]
# Track each pair over all frames
distance_traces = {label: [] for _, _, label in key_pairs}
for ts in u.trajectory:
for sel1, sel2, label in key_pairs:
atom1 = u.select_atoms(sel1)
atom2 = u.select_atoms(sel2)
if len(atom1) == 0 or len(atom2) == 0:
distance_traces[label].append(np.nan)
else:
dist = np.linalg.norm(atom1.positions[0] - atom2.positions[0])
distance_traces[label].append(dist)
# Convert to arrays and report
for label, trace in distance_traces.items():
arr = np.array(trace)
mean_d = np.nanmean(arr)
occ = (arr < 3.5).mean() * 100 # % frames within H-bond distance
print(f"{label}: mean={mean_d:.2f} Å, occ(≤3.5Å)={occ:.1f}%")
u.select_atoms() inside a trajectory loop re-parses the selection string every frame. For fixed atom pairs (not distance-based selections), select before the loop and store the AtomGroup objects. This can speed up a large trajectory analysis by 10× or more:atom1 = u.select_atoms("resname LIG and name O3") # before loopfor ts in u.trajectory: dist = np.linalg.norm(atom1.positions[0] - atom2.positions[0])
Calculating and plotting contact occupancy
import matplotlib.pyplot as plt
from collections import Counter
n_frames = len(u.trajectory)
# Count how many frames each residue was in contact with the ligand
residue_counts = Counter()
for frame_residues in contact_residues_per_frame:
residue_counts.update(frame_residues)
# Build occupancy DataFrame
occ_data = []
for resid, count in residue_counts.items():
resname = u.select_atoms(f"resid {resid}").resnames[0]
occ_data.append({
"resid": resid,
"resname": resname,
"occupancy": count / n_frames * 100,
})
df_occ = (pd.DataFrame(occ_data)
.sort_values("occupancy", ascending=False)
.reset_index(drop=True))
persistent = df_occ[df_occ["occupancy"] >= 60]
print(f"Persistent contacts (≥60%): {len(persistent)}")
print(persistent[["resname", "resid", "occupancy"]].to_string(index=False))
# Export
df_occ.to_csv("contact_occupancy.csv", index=False)
Complete analysis pipeline
import MDAnalysis as mda
from MDAnalysis.analysis import align
from MDAnalysis.lib.distances import distance_array
from collections import Counter
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
u = mda.Universe("protein_ligand.gro", "trajectory.xtc")
align.AlignTraj(u, u, select="backbone", in_memory=True).run()
ligand = u.select_atoms("resname LIG")
protein = u.select_atoms("protein")
# Pre-select key atom pairs
key_atoms = [
(u.select_atoms("resname LIG and name O3"),
u.select_atoms("resid 185 and name OG"), "O3···Ser185"),
(u.select_atoms("resname LIG and name N1"),
u.select_atoms("resid 91 and name NZ"), "N1···Lys91"),
]
times, lig_rmsd, contact_n = [], [], []
dist_traces = {label: [] for _, _, label in key_atoms}
res_counts = Counter()
ref_pos = None
for ts in u.trajectory:
if ref_pos is None: ref_pos = ligand.positions.copy()
times.append(ts.time / 1000)
lig_rmsd.append(np.sqrt(np.mean(np.sum(
(ligand.positions - ref_pos)**2, axis=1))))
dists = distance_array(ligand.positions, protein.positions)
mask = (dists < 5.0).any(axis=0)
contact_n.append(mask.sum())
res_counts.update(protein.atoms[mask].resids)
for a1, a2, label in key_atoms:
dist_traces[label].append(np.linalg.norm(
a1.positions[0] - a2.positions[0]))
times = np.array(times); lig_rmsd = np.array(lig_rmsd)
n_frames = len(u.trajectory)
# Figure
fig, axes = plt.subplots(3, 1, figsize=(11, 10), sharex=True)
axes[0].plot(times, lig_rmsd, "#0d6e54", lw=1.1)
axes[0].set_ylabel("Ligand RMSD (Å)")
axes[0].set_title("(A) Ligand RMSD", fontweight="bold", loc="left")
axes[1].plot(times, contact_n, "#185fa5", lw=1.0, alpha=0.8)
axes[1].set_ylabel("Protein atoms in contact")
axes[1].set_title("(B) Binding site contacts (≤5 Å)", fontweight="bold", loc="left")
colors = ["#0d6e54", "#ba7517", "#534ab7"]
for (label, trace), col in zip(dist_traces.items(), colors):
axes[2].plot(times, trace, color=col, lw=0.9, alpha=0.8, label=label)
axes[2].axhline(3.5, color="gray", ls="--", lw=0.7, label="3.5 Å H-bond cutoff")
axes[2].set_ylabel("Distance (Å)")
axes[2].set_xlabel("Time (ns)")
axes[2].set_title("(C) Key interaction distances", fontweight="bold", loc="left")
axes[2].legend(fontsize=8)
plt.tight_layout()
plt.savefig("protein_ligand_analysis.png", dpi=300, bbox_inches="tight")
plt.close()
# Occupancy CSV
occ = [(f"res{r}", c / n_frames * 100) for r, c in res_counts.most_common()]
pd.DataFrame(occ, columns=["residue", "occupancy_pct"]).to_csv(
"contact_occupancy.csv", index=False)
print("Analysis complete. Saved figures and CSVs.")
Protein-ligand analysis in one paragraph
Align the trajectory on the protein backbone first, then select the ligand by residue name and the protein separately. Track ligand RMSD by comparing positions each frame to frame 0 — plateaued RMSD means the ligand is bound, rising RMSD means it’s leaving. Count binding site contacts per frame with distance_array — fast, vectorized, no Python loops needed. Track specific key atom pair distances pre-selected before the loop for maximum speed. Calculate per-residue contact occupancy as frames-in-contact divided by total frames, sorted by occupancy, and report contacts above 60% as persistent. The three-panel figure — RMSD, contact count, key distances — is the standard protein-ligand MD summary figure for any drug discovery or structural biology paper.