How to Analyze Protein-Ligand Interactions over an MD Trajectory with MDAnalysis

11 min read Intermediate — basic MDAnalysis assumed

Running a protein-ligand MD simulation is only half the work. The analysis — does the ligand stay in the pocket? Which contacts persist? Which key distances are maintained? — is what produces the data that goes into the paper. This tutorial covers the complete analysis workflow in Python.

The standard protein-ligand MD analysis workflow follows a clear sequence:

Align the trajectory on the protein backbone so the binding site stays fixed — without this, all distance measurements include whole-protein drift.
Track ligand RMSD relative to the binding site — the primary stability metric. Rising RMSD means the ligand is drifting or leaving.
Count contacts between the ligand and binding site residues each frame — gives a per-frame count and identifies which residues interact.
Track key atomic distances — specific donor-acceptor pairs, H-bond distances, and any chemically important atom pair distances over the full trajectory.
Calculate occupancy for each contact — the fraction of frames with a contact present. Report contacts above 60% occupancy as persistent.

Loading the system and selecting the ligand

Python

import MDAnalysis as mda
from MDAnalysis.analysis import align, rms
import numpy as np
import pandas as pd

u = mda.Universe("protein_ligand.gro", "trajectory.xtc")

# Inspect what heteroatom residue names are present
hetatm = u.select_atoms("not protein and not resname SOL NA CL")
print("Non-solvent resnames:", set(hetatm.resnames))
# Non-solvent resnames: {'LIG'}

# Define key selections — do this once before the loop
ligand  = u.select_atoms("resname LIG")
protein = u.select_atoms("protein")
pocket  = u.select_atoms("protein and around 5 resname LIG")  # dynamic

print(f"Ligand atoms:  {ligand.n_atoms}")
print(f"Protein atoms: {protein.n_atoms}")
print(f"Frames:        {len(u.trajectory)}")

# Align trajectory on protein backbone (critical — do before all analysis)
align.AlignTraj(u, u, select="backbone", in_memory=True).run()

Finding your ligand’s residue name

GROMACS assigns residue names from the force field parameterization. Common ligand names: LIG, MOL, UNK, or a custom 3-letter code from the parameterization step. If you’re not sure, run set(u.select_atoms("not protein and not resname SOL NA CL MG").resnames) to list all non-standard residue names in the system.

Tracking ligand RMSD relative to the binding site

Ligand RMSD measures how much the ligand moves relative to the binding site — the clearest indicator of whether the ligand is staying bound. The key is to calculate RMSD after aligning on the protein, not on the ligand itself, so the reference frame is the binding site rather than the ligand’s own starting position.

Python

# Ligand RMSD — after backbone alignment, binding site is the reference
lig_rmsd_vals = []
times = []

for ts in u.trajectory:
    # RMSD of ligand heavy atoms vs their position in frame 0
    if ts.frame == 0:
        ref_positions = ligand.positions.copy()
    lig_rmsd = np.sqrt(np.mean(np.sum(
        (ligand.positions - ref_positions) ** 2, axis=1
    )))
    lig_rmsd_vals.append(lig_rmsd)
    times.append(ts.time / 1000)   # ps → ns

lig_rmsd_arr = np.array(lig_rmsd_vals)
print(f"Mean ligand RMSD: {lig_rmsd_arr.mean():.2f} ± {lig_rmsd_arr.std():.2f} Å")
print(f"Max ligand RMSD:  {lig_rmsd_arr.max():.2f} Å at {times[lig_rmsd_arr.argmax()]:.1f} ns")

# Alternatively use MDAnalysis RMSD class (cleaner for multiple selections)
rmsd_analysis = rms.RMSD(u, select="resname LIG", ref_frame=0).run()
lig_rmsd_arr = rmsd_analysis.results.rmsd[:, 2]

Ligand staying bound

RMSD plateaus at 1–2 Å from frame 0
No sustained upward drift
May fluctuate but returns to baseline
Mean RMSD low (< 2 Å for rigid ligand)

Ligand leaving the pocket

RMSD increases steadily without plateau
Sudden jump to > 5 Å indicates dissociation
Contact count drops to zero simultaneously
Check this frame in PyMOL to confirm

Contact analysis over time

Counting how many protein atoms are within a cutoff distance of the ligand each frame gives a quick read on binding stability — a maintained contact count means the ligand is staying in the pocket, while a dropping count indicates partial or full dissociation.

Python

from MDAnalysis.lib.distances import distance_array

cutoff = 5.0   # Å
contact_counts = []
contact_residues_per_frame = []

for ts in u.trajectory:
    # Fast pairwise distance calculation between ligand and protein atoms
    dists = distance_array(ligand.positions, protein.positions)
    # Contact = any ligand atom within cutoff of any protein atom
    in_contact = (dists < cutoff).any(axis=0)
    contact_atoms = protein.atoms[in_contact]

    # Count and record
    contact_counts.append(in_contact.sum())

    # Track which residues are in contact this frame
    contact_residues_per_frame.append(set(contact_atoms.resids))

contact_counts = np.array(contact_counts)
print(f"Mean protein atoms in contact: {contact_counts.mean():.1f}")
print(f"Frames with 0 contacts: {(contact_counts == 0).sum()} ({(contact_counts==0).mean()*100:.1f}%)")

Use distance_array for fast pairwise calculations

MDAnalysis’s distance_array from MDAnalysis.lib.distances uses optimized C code for pairwise distance computation — it’s much faster than writing nested Python loops. For a ligand with 30 atoms and a binding site with 200 atoms, this function computes all 6,000 distances in microseconds versus potentially seconds with pure Python.

Key atomic distances over the trajectory

Beyond contact counts, tracking the distance between specific atom pairs — a key H-bond donor-acceptor pair, a salt bridge, a π-stacking interaction — shows whether individual important contacts are maintained throughout the simulation.

Python

# Define key atom pairs to track
# Each tuple: (selection_string_atom1, selection_string_atom2, label)
key_pairs = [
    ("resname LIG and name O3",    "resid 185 and name OG",   "LIG-O3···Ser185-OG"),
    ("resname LIG and name N1",    "resid 91 and name NZ",    "LIG-N1···Lys91-NZ"),
    ("resname LIG and name O1",    "resid 273 and name NH1",  "LIG-O1···Arg273-NH1"),
]

# Track each pair over all frames
distance_traces = {label: [] for _, _, label in key_pairs}

for ts in u.trajectory:
    for sel1, sel2, label in key_pairs:
        atom1 = u.select_atoms(sel1)
        atom2 = u.select_atoms(sel2)
        if len(atom1) == 0 or len(atom2) == 0:
            distance_traces[label].append(np.nan)
        else:
            dist = np.linalg.norm(atom1.positions[0] - atom2.positions[0])
            distance_traces[label].append(dist)

# Convert to arrays and report
for label, trace in distance_traces.items():
    arr = np.array(trace)
    mean_d = np.nanmean(arr)
    occ    = (arr < 3.5).mean() * 100   # % frames within H-bond distance
    print(f"{label}: mean={mean_d:.2f} Å, occ(≤3.5Å)={occ:.1f}%")

Pre-select atoms outside the loop for speed

Calling u.select_atoms() inside a trajectory loop re-parses the selection string every frame. For fixed atom pairs (not distance-based selections), select before the loop and store the AtomGroup objects. This can speed up a large trajectory analysis by 10× or more:

atom1 = u.select_atoms("resname LIG and name O3") # before loop
for ts in u.trajectory: dist = np.linalg.norm(atom1.positions[0] - atom2.positions[0])

Calculating and plotting contact occupancy

Python

import matplotlib.pyplot as plt
from collections import Counter

n_frames = len(u.trajectory)

# Count how many frames each residue was in contact with the ligand
residue_counts = Counter()
for frame_residues in contact_residues_per_frame:
    residue_counts.update(frame_residues)

# Build occupancy DataFrame
occ_data = []
for resid, count in residue_counts.items():
    resname = u.select_atoms(f"resid {resid}").resnames[0]
    occ_data.append({
        "resid":    resid,
        "resname":  resname,
        "occupancy": count / n_frames * 100,
    })

df_occ = (pd.DataFrame(occ_data)
          .sort_values("occupancy", ascending=False)
          .reset_index(drop=True))

persistent = df_occ[df_occ["occupancy"] >= 60]
print(f"Persistent contacts (≥60%): {len(persistent)}")
print(persistent[["resname", "resid", "occupancy"]].to_string(index=False))

# Export
df_occ.to_csv("contact_occupancy.csv", index=False)

Complete analysis pipeline

protein_ligand_analysis.py — full pipeline with publication plots

Aligned trajectory → RMSD + contacts + distances → figures + CSV

import MDAnalysis as mda
from MDAnalysis.analysis import align
from MDAnalysis.lib.distances import distance_array
from collections import Counter
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

u = mda.Universe("protein_ligand.gro", "trajectory.xtc")
align.AlignTraj(u, u, select="backbone", in_memory=True).run()

ligand  = u.select_atoms("resname LIG")
protein = u.select_atoms("protein")

# Pre-select key atom pairs
key_atoms = [
    (u.select_atoms("resname LIG and name O3"),
     u.select_atoms("resid 185 and name OG"), "O3···Ser185"),
    (u.select_atoms("resname LIG and name N1"),
     u.select_atoms("resid 91 and name NZ"),  "N1···Lys91"),
]

times, lig_rmsd, contact_n = [], [], []
dist_traces = {label: [] for _, _, label in key_atoms}
res_counts = Counter()
ref_pos = None

for ts in u.trajectory:
    if ref_pos is None: ref_pos = ligand.positions.copy()
    times.append(ts.time / 1000)
    lig_rmsd.append(np.sqrt(np.mean(np.sum(
        (ligand.positions - ref_pos)**2, axis=1))))
    dists = distance_array(ligand.positions, protein.positions)
    mask  = (dists < 5.0).any(axis=0)
    contact_n.append(mask.sum())
    res_counts.update(protein.atoms[mask].resids)
    for a1, a2, label in key_atoms:
        dist_traces[label].append(np.linalg.norm(
            a1.positions[0] - a2.positions[0]))

times = np.array(times); lig_rmsd = np.array(lig_rmsd)
n_frames = len(u.trajectory)

# Figure
fig, axes = plt.subplots(3, 1, figsize=(11, 10), sharex=True)

axes[0].plot(times, lig_rmsd, "#0d6e54", lw=1.1)
axes[0].set_ylabel("Ligand RMSD (Å)")
axes[0].set_title("(A) Ligand RMSD", fontweight="bold", loc="left")

axes[1].plot(times, contact_n, "#185fa5", lw=1.0, alpha=0.8)
axes[1].set_ylabel("Protein atoms in contact")
axes[1].set_title("(B) Binding site contacts (≤5 Å)", fontweight="bold", loc="left")

colors = ["#0d6e54", "#ba7517", "#534ab7"]
for (label, trace), col in zip(dist_traces.items(), colors):
    axes[2].plot(times, trace, color=col, lw=0.9, alpha=0.8, label=label)
axes[2].axhline(3.5, color="gray", ls="--", lw=0.7, label="3.5 Å H-bond cutoff")
axes[2].set_ylabel("Distance (Å)")
axes[2].set_xlabel("Time (ns)")
axes[2].set_title("(C) Key interaction distances", fontweight="bold", loc="left")
axes[2].legend(fontsize=8)

plt.tight_layout()
plt.savefig("protein_ligand_analysis.png", dpi=300, bbox_inches="tight")
plt.close()

# Occupancy CSV
occ = [(f"res{r}", c / n_frames * 100) for r, c in res_counts.most_common()]
pd.DataFrame(occ, columns=["residue", "occupancy_pct"]).to_csv(
    "contact_occupancy.csv", index=False)
print("Analysis complete. Saved figures and CSVs.")

Protein-ligand analysis in one paragraph

Align the trajectory on the protein backbone first, then select the ligand by residue name and the protein separately. Track ligand RMSD by comparing positions each frame to frame 0 — plateaued RMSD means the ligand is bound, rising RMSD means it’s leaving. Count binding site contacts per frame with distance_array — fast, vectorized, no Python loops needed. Track specific key atom pair distances pre-selected before the loop for maximum speed. Calculate per-residue contact occupancy as frames-in-contact divided by total frames, sorted by occupancy, and report contacts above 60% as persistent. The three-panel figure — RMSD, contact count, key distances — is the standard protein-ligand MD summary figure for any drug discovery or structural biology paper.

How to Analyze Protein-Ligand Interactions over an MD Trajectory with MDAnalysis

Loading the system and selecting the ligand

Tracking ligand RMSD relative to the binding site

Contact analysis over time

Key atomic distances over the trajectory

Calculating and plotting contact occupancy

Complete analysis pipeline

Protein-ligand analysis in one paragraph

How to Find and Analyze Protein-Protein Interfaces with BioPython

How to Calculate RMSD and RMSF from MD Simulations with MDAnalysis

How to Parse AutoDock Vina Docking Results with Python: From PDBQT to Ranked Hit List

Python for Structural Biology: The Complete Guide to BioPython and MDAnalysis

BioPython Tutorial for Beginners: Parsing PDB Files and Working with Protein Structures

MDAnalysis Tutorial for Beginners: Loading and Working with MD Trajectories in Python

Leave a Reply Cancel reply

Loading the system and selecting the ligand

Tracking ligand RMSD relative to the binding site

Contact analysis over time

Key atomic distances over the trajectory

Calculating and plotting contact occupancy

Complete analysis pipeline

Protein-ligand analysis in one paragraph

Similar Posts

Leave a Reply Cancel reply