How to Calculate RMSD and RMSF from MD Simulations with MDAnalysis

How to Calculate RMSD and RMSF from MD Simulations with MDAnalysis

RMSD and RMSF are the two most reported metrics from MD simulations — the first tells you whether the simulation is stable, the second tells you which parts of the protein are flexible. This tutorial covers both calculations in Python with MDAnalysis, how to plot them clearly, and how to interpret what you see.

RMSD vs RMSF — what each measures

RMSD
Root Mean Square Deviation
  • One value per frame — a time series
  • Deviation of the whole structure (or selection) from a reference
  • Tells you: has the simulation drifted from the starting structure?
  • High and rising = structure still changing
  • Plateau = structure has converged (equilibrated)
  • Reference is almost always the first frame or the crystal structure
RMSF
Root Mean Square Fluctuation
  • One value per atom (or residue) — a per-position profile
  • Fluctuation of each atom around its average position over the simulation
  • Tells you: which parts of the protein are flexible or rigid?
  • High RMSF = flexible loop or disordered region
  • Low RMSF = rigid secondary structure or constrained core
  • Related to the experimental B-factor in crystal structures

Both metrics are calculated over the full trajectory (or the equilibrated portion of it) and both require that the trajectory first be aligned to a reference structure — otherwise translational and rotational motion of the whole protein inflates the values.

Step 1 — aligning the trajectory

Aligning removes overall protein translation and rotation from the trajectory before calculating RMSD or RMSF. MDAnalysis’s AlignTraj class does this in one call and can write an aligned trajectory to disk or hold it in memory:

Python
import MDAnalysis as mda
from MDAnalysis.analysis import align

# Load your system
u = mda.Universe("topology.gro", "trajectory.xtc")

# Align on backbone atoms, using frame 0 as the reference
aligner = align.AlignTraj(
    u,                      # mobile (the Universe to align)
    u,                      # reference (same Universe → aligns to first frame)
    select="backbone",     # atoms used for alignment
    in_memory=True         # hold aligned trajectory in RAM (fast for analysis)
).run()

# Or align to an external reference structure
ref = mda.Universe("reference.pdb")
aligner = align.AlignTraj(u, ref, select="backbone", in_memory=True).run()

# Write aligned trajectory to disk (for very large systems)
aligner_disk = align.AlignTraj(
    u, u, select="backbone",
    filename="aligned.xtc"
).run()
Always align before calculating RMSD or RMSF
Skipping alignment is the most common RMSD mistake. Without alignment, the metric includes translational drift of the protein’s center of mass and overall rotation — which in a typical periodic boundary simulation can be much larger than the actual structural changes you’re trying to measure. The values become meaningless. Always run AlignTraj first.

Calculating RMSD over the trajectory

Python
from MDAnalysis.analysis import rms
import numpy as np

# Calculate backbone RMSD relative to frame 0
rmsd_analysis = rms.RMSD(
    u,
    select="backbone",     # atoms to calculate RMSD over
    ref_frame=0            # reference frame (default: 0)
).run()

# Results are stored in rmsd_analysis.results.rmsd
# Shape: (n_frames, 3) — columns: [frame, time_ps, rmsd_Å]
data = rmsd_analysis.results.rmsd
frames    = data[:, 0]
times_ns  = data[:, 1] / 1000    # ps → ns
rmsd_vals = data[:, 2]

print(f"Mean backbone RMSD: {rmsd_vals.mean():.2f} ± {rmsd_vals.std():.2f} Å")
print(f"Final RMSD:         {rmsd_vals[-1]:.2f} Å")
print(f"Max RMSD:           {rmsd_vals.max():.2f} Å at frame {rmsd_vals.argmax()}")

# Calculate RMSD for multiple selections simultaneously
rmsd_multi = rms.RMSD(
    u,
    select="backbone",
    groupselections=[
        "backbone",               # full backbone
        "resid 1:100 and backbone",  # N-terminal domain
        "resid 101:200 and backbone", # C-terminal domain
    ]
).run()
# Each groupselection adds a column to results.rmsd

Calculating per-residue RMSF

Python
# Calculate RMSF — run on the ALIGNED trajectory
ca_atoms = u.select_atoms("backbone and name CA")

rmsf_analysis = rms.RMSF(ca_atoms).run()

# rmsf_analysis.results.rmsf: shape (n_atoms,) — one value per atom
rmsf_vals = rmsf_analysis.results.rmsf
resids    = ca_atoms.resids     # residue numbers corresponding to each RMSF value
resnames  = ca_atoms.resnames   # residue names

print(f"Mean RMSF:  {rmsf_vals.mean():.2f} Å")
print(f"Max RMSF:   {rmsf_vals.max():.2f} Å (residue {resids[rmsf_vals.argmax()]})")

# Find the most flexible residues
import pandas as pd
df = pd.DataFrame({"resid": resids, "resname": resnames, "rmsf": rmsf_vals})
top_flex = df.nlargest(10, "rmsf")
print("\nMost flexible residues:")
print(top_flex.to_string(index=False))
RMSF must be calculated on an already-aligned trajectory
rms.RMSF measures fluctuation around the average position — but the average position is only meaningful if the protein isn’t also translating and rotating. Run AlignTraj with in_memory=True before running RMSF, and make sure the RMSF analysis uses the same Universe that was aligned. If you see implausibly large RMSF values (more than 5–10 Å for structured regions), alignment was likely missed.

Choosing the right atom selection

The atom selection defines what you’re measuring structural change in. The right choice depends on what question you’re answering:

  • Backbone RMSD ("backbone") — most common. Captures overall fold stability without being affected by side chain rotamers. Use for the headline convergence plot.
  • Cα RMSD ("name CA") — equivalent to backbone for most purposes; sometimes preferred for direct comparison with PyMOL alignment RMSD values.
  • All heavy atoms ("protein and not type H") — includes side chains. Higher values than backbone RMSD; more sensitive to local conformational changes.
  • Binding site only ("resid 85 100 173 and backbone") — RMSD of just the pocket residues. Essential when asking whether the binding site is stable enough for docking or drug design.
  • Ligand RMSD ("resname LIG") — measures ligand drift within the binding site. A rising ligand RMSD indicates the ligand is leaving the pocket.

Plotting with matplotlib

Python — publication-ready RMSD + RMSF figure
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec

fig = plt.figure(figsize=(12, 8))
gs  = gridspec.GridSpec(2, 1, hspace=0.4)

# ── Panel A: RMSD over time ────────────────────────────────
ax1 = fig.add_subplot(gs[0])
ax1.plot(times_ns, rmsd_vals, color="#0d6e54", linewidth=1.2, alpha=0.85)
ax1.axhline(rmsd_vals[len(rmsd_vals)//2:].mean(),  # equilibrated mean
            color="#0d6e54", linestyle="--", linewidth=0.8, alpha=0.5,
            label=f"Equilibrated mean: {rmsd_vals[len(rmsd_vals)//2:].mean():.2f} Å")
ax1.set_xlabel("Time (ns)", fontsize=11)
ax1.set_ylabel("Backbone RMSD (Å)", fontsize=11)
ax1.set_title("(A) Backbone RMSD over simulation", fontsize=12, fontweight="bold", loc="left")
ax1.legend(fontsize=9)

# ── Panel B: RMSF per residue ──────────────────────────────
ax2 = fig.add_subplot(gs[1])
ax2.plot(resids, rmsf_vals, color="#185fa5", linewidth=1.1)
ax2.fill_between(resids, rmsf_vals, alpha=0.15, color="#185fa5")
ax2.axhline(2.0, color="#ba7517", linestyle="--", linewidth=0.8,
            label="2 Å threshold (high flexibility)")
ax2.set_xlabel("Residue number", fontsize=11)
ax2.set_ylabel("RMSF (Å)", fontsize=11)
ax2.set_title("(B) Per-residue RMSF (Cα atoms)", fontsize=12, fontweight="bold", loc="left")
ax2.legend(fontsize=9)

plt.savefig("rmsd_rmsf.png", dpi=300, bbox_inches="tight")
plt.close()
print("Saved rmsd_rmsf.png")

Interpreting the results

What you seeWhat it meansAction
RMSD plateaus at < 2 Å Excellent convergence. Protein maintained its starting fold throughout the simulation. Use full trajectory for RMSF and further analysis.
RMSD plateaus at 2–4 Å Reasonable for a flexible protein or one undergoing conformational change. Check if the plateau is stable. Identify which region contributes most to RMSD. Consider domain-specific analysis.
RMSD still rising at simulation end Simulation hasn’t converged. The protein is still relaxing or changing conformation. Extend the simulation. Discard data before plateau for equilibrium analysis.
RMSD jumps suddenly then stays high Structural event — possibly unfolding, large loop rearrangement, or ligand dissociation. Inspect the trajectory visually at the jump frame in PyMOL or VMD.
RMSF < 1 Å for secondary structure Helices and sheets are rigid and well-defined. High-quality simulation of structured regions. These regions are reliable for docking grid definition or interface analysis.
RMSF peaks at loops and termini Expected and normal — loops are intrinsically flexible. Terminal residues are usually the most mobile. If using the structure for docking, trim disordered termini before grid preparation.
Discard the equilibration period before calculating RMSF
The first portion of any MD simulation is equilibration — the system is still relaxing from the starting structure toward a thermodynamically meaningful ensemble. Including this period in RMSF calculations inflates values. A simple approach: if your RMSD plot shows convergence at frame N, run RMSF analysis only from frame N onward using rms.RMSF(atoms).run(start=N). For most simulations, discarding the first 20–30% of frames is a reasonable default.

Comparing RMSF to AlphaFold pLDDT

When your simulation was seeded from an AlphaFold structure, comparing per-residue RMSF with per-residue pLDDT is a powerful validation: regions that AlphaFold predicted with low confidence (low pLDDT) should generally show high RMSF in simulation, since both reflect intrinsic disorder. If they don’t agree, it’s worth investigating why.

Cross-pillar connection — AlphaFold
The pLDDT values are extracted from the AlphaFold PDB B-factor column using BioPython, as covered in the AlphaFold parsing tutorial. The RMSF values come from this tutorial’s MDAnalysis workflow. Combining them requires matching by residue number — both approaches already use the same residue numbering, so the merge is straightforward.
Python — RMSF vs pLDDT comparison
from Bio.PDB import PDBParser
import pandas as pd
import matplotlib.pyplot as plt

# Extract pLDDT from the AlphaFold PDB (B-factor column)
parser = PDBParser(QUIET=True)
af_structure = parser.get_structure("af", "alphafold_model.pdb")
plddt_data = []
for res in af_structure[0]["A"].get_residues():
    if res.id[0] == " " and "CA" in res:
        plddt_data.append({"resid": res.id[1], "plddt": res["CA"].get_bfactor()})
df_plddt = pd.DataFrame(plddt_data)

# Build RMSF DataFrame from MDAnalysis results
df_rmsf = pd.DataFrame({"resid": resids, "rmsf": rmsf_vals})

# Merge on residue number
df = df_rmsf.merge(df_plddt, on="resid")

# Plot RMSF (left axis) and pLDDT (right axis) vs residue
fig, ax1 = plt.subplots(figsize=(12, 4))
ax2 = ax1.twinx()

ax1.plot(df["resid"], df["rmsf"],  color="#185fa5", linewidth=1.2, label="RMSF (MD)")
ax2.plot(df["resid"], df["plddt"], color="#993c1d", linewidth=1.0,
         linestyle="--", alpha=0.7, label="pLDDT (AlphaFold)")

ax1.set_xlabel("Residue number")
ax1.set_ylabel("RMSF (Å)", color="#185fa5")
ax2.set_ylabel("pLDDT", color="#993c1d")
ax2.invert_yaxis()   # invert pLDDT so low confidence aligns with high RMSF

lines1, labels1 = ax1.get_legend_handles_labels()
lines2, labels2 = ax2.get_legend_handles_labels()
ax1.legend(lines1 + lines2, labels1 + labels2, fontsize=9)
plt.title("RMSF vs AlphaFold pLDDT confidence")
plt.tight_layout()
plt.savefig("rmsf_vs_plddt.png", dpi=300, bbox_inches="tight")
plt.close()

# Pearson correlation: should be strongly negative (low pLDDT ↔ high RMSF)
corr = df["rmsf"].corr(df["plddt"])
print(f"RMSF vs pLDDT Pearson r: {corr:.3f}")
# Expected: strongly negative (e.g. -0.60 to -0.80)

Reporting values for publication

When reporting RMSD and RMSF in a paper, include the following — reviewers at journals like JCTC, JPCB, and Structure specifically look for this information:

  • Reference structure — what frame was used as the reference for RMSD calculation (typically frame 0 or the experimental structure)
  • Atom selection — backbone, Cα, or all heavy atoms; which chains were included
  • Equilibration period discarded — how many frames or nanoseconds were excluded from RMSF analysis
  • Mean ± SD RMSD over the equilibrated portion: e.g., “Mean backbone RMSD of 1.84 ± 0.21 Å over the final 50 ns”
  • Most flexible regions from the RMSF profile by residue number, with interpretation

RMSD and RMSF in one paragraph

Always align the trajectory first with align.AlignTraj(u, u, select="backbone", in_memory=True).run() — skipping this inflates both metrics with whole-protein translation and rotation. Then calculate RMSD with rms.RMSD(u, select="backbone").run() — results are in results.rmsd[:, 2] — and RMSF with rms.RMSF(ca_atoms).run() on the aligned Universe. Plot RMSD as a time series to assess convergence; plot RMSF as a per-residue profile to identify flexible loops and rigid secondary structures. Discard the equilibration period (first 20–30% of frames) before calculating RMSF. For AlphaFold-seeded simulations, compare RMSF with pLDDT — the negative correlation validates both the simulation and the confidence prediction.

Similar Posts

Leave a Reply

Your email address will not be published. Required fields are marked *