How to Calculate RMSD and RMSF from MD Simulations with MDAnalysis
RMSD and RMSF are the two most reported metrics from MD simulations — the first tells you whether the simulation is stable, the second tells you which parts of the protein are flexible. This tutorial covers both calculations in Python with MDAnalysis, how to plot them clearly, and how to interpret what you see.
RMSD vs RMSF — what each measures
- One value per frame — a time series
- Deviation of the whole structure (or selection) from a reference
- Tells you: has the simulation drifted from the starting structure?
- High and rising = structure still changing
- Plateau = structure has converged (equilibrated)
- Reference is almost always the first frame or the crystal structure
- One value per atom (or residue) — a per-position profile
- Fluctuation of each atom around its average position over the simulation
- Tells you: which parts of the protein are flexible or rigid?
- High RMSF = flexible loop or disordered region
- Low RMSF = rigid secondary structure or constrained core
- Related to the experimental B-factor in crystal structures
Both metrics are calculated over the full trajectory (or the equilibrated portion of it) and both require that the trajectory first be aligned to a reference structure — otherwise translational and rotational motion of the whole protein inflates the values.
Step 1 — aligning the trajectory
Aligning removes overall protein translation and rotation from the trajectory before calculating RMSD or RMSF. MDAnalysis’s AlignTraj class does this in one call and can write an aligned trajectory to disk or hold it in memory:
import MDAnalysis as mda
from MDAnalysis.analysis import align
# Load your system
u = mda.Universe("topology.gro", "trajectory.xtc")
# Align on backbone atoms, using frame 0 as the reference
aligner = align.AlignTraj(
u, # mobile (the Universe to align)
u, # reference (same Universe → aligns to first frame)
select="backbone", # atoms used for alignment
in_memory=True # hold aligned trajectory in RAM (fast for analysis)
).run()
# Or align to an external reference structure
ref = mda.Universe("reference.pdb")
aligner = align.AlignTraj(u, ref, select="backbone", in_memory=True).run()
# Write aligned trajectory to disk (for very large systems)
aligner_disk = align.AlignTraj(
u, u, select="backbone",
filename="aligned.xtc"
).run()
AlignTraj first.
Calculating RMSD over the trajectory
from MDAnalysis.analysis import rms
import numpy as np
# Calculate backbone RMSD relative to frame 0
rmsd_analysis = rms.RMSD(
u,
select="backbone", # atoms to calculate RMSD over
ref_frame=0 # reference frame (default: 0)
).run()
# Results are stored in rmsd_analysis.results.rmsd
# Shape: (n_frames, 3) — columns: [frame, time_ps, rmsd_Å]
data = rmsd_analysis.results.rmsd
frames = data[:, 0]
times_ns = data[:, 1] / 1000 # ps → ns
rmsd_vals = data[:, 2]
print(f"Mean backbone RMSD: {rmsd_vals.mean():.2f} ± {rmsd_vals.std():.2f} Å")
print(f"Final RMSD: {rmsd_vals[-1]:.2f} Å")
print(f"Max RMSD: {rmsd_vals.max():.2f} Å at frame {rmsd_vals.argmax()}")
# Calculate RMSD for multiple selections simultaneously
rmsd_multi = rms.RMSD(
u,
select="backbone",
groupselections=[
"backbone", # full backbone
"resid 1:100 and backbone", # N-terminal domain
"resid 101:200 and backbone", # C-terminal domain
]
).run()
# Each groupselection adds a column to results.rmsd
Calculating per-residue RMSF
# Calculate RMSF — run on the ALIGNED trajectory
ca_atoms = u.select_atoms("backbone and name CA")
rmsf_analysis = rms.RMSF(ca_atoms).run()
# rmsf_analysis.results.rmsf: shape (n_atoms,) — one value per atom
rmsf_vals = rmsf_analysis.results.rmsf
resids = ca_atoms.resids # residue numbers corresponding to each RMSF value
resnames = ca_atoms.resnames # residue names
print(f"Mean RMSF: {rmsf_vals.mean():.2f} Å")
print(f"Max RMSF: {rmsf_vals.max():.2f} Å (residue {resids[rmsf_vals.argmax()]})")
# Find the most flexible residues
import pandas as pd
df = pd.DataFrame({"resid": resids, "resname": resnames, "rmsf": rmsf_vals})
top_flex = df.nlargest(10, "rmsf")
print("\nMost flexible residues:")
print(top_flex.to_string(index=False))
rms.RMSF measures fluctuation around the average position — but the average position is only meaningful if the protein isn’t also translating and rotating. Run AlignTraj with in_memory=True before running RMSF, and make sure the RMSF analysis uses the same Universe that was aligned. If you see implausibly large RMSF values (more than 5–10 Å for structured regions), alignment was likely missed.
Choosing the right atom selection
The atom selection defines what you’re measuring structural change in. The right choice depends on what question you’re answering:
- Backbone RMSD (
"backbone") — most common. Captures overall fold stability without being affected by side chain rotamers. Use for the headline convergence plot. - Cα RMSD (
"name CA") — equivalent to backbone for most purposes; sometimes preferred for direct comparison with PyMOL alignment RMSD values. - All heavy atoms (
"protein and not type H") — includes side chains. Higher values than backbone RMSD; more sensitive to local conformational changes. - Binding site only (
"resid 85 100 173 and backbone") — RMSD of just the pocket residues. Essential when asking whether the binding site is stable enough for docking or drug design. - Ligand RMSD (
"resname LIG") — measures ligand drift within the binding site. A rising ligand RMSD indicates the ligand is leaving the pocket.
Plotting with matplotlib
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
fig = plt.figure(figsize=(12, 8))
gs = gridspec.GridSpec(2, 1, hspace=0.4)
# ── Panel A: RMSD over time ────────────────────────────────
ax1 = fig.add_subplot(gs[0])
ax1.plot(times_ns, rmsd_vals, color="#0d6e54", linewidth=1.2, alpha=0.85)
ax1.axhline(rmsd_vals[len(rmsd_vals)//2:].mean(), # equilibrated mean
color="#0d6e54", linestyle="--", linewidth=0.8, alpha=0.5,
label=f"Equilibrated mean: {rmsd_vals[len(rmsd_vals)//2:].mean():.2f} Å")
ax1.set_xlabel("Time (ns)", fontsize=11)
ax1.set_ylabel("Backbone RMSD (Å)", fontsize=11)
ax1.set_title("(A) Backbone RMSD over simulation", fontsize=12, fontweight="bold", loc="left")
ax1.legend(fontsize=9)
# ── Panel B: RMSF per residue ──────────────────────────────
ax2 = fig.add_subplot(gs[1])
ax2.plot(resids, rmsf_vals, color="#185fa5", linewidth=1.1)
ax2.fill_between(resids, rmsf_vals, alpha=0.15, color="#185fa5")
ax2.axhline(2.0, color="#ba7517", linestyle="--", linewidth=0.8,
label="2 Å threshold (high flexibility)")
ax2.set_xlabel("Residue number", fontsize=11)
ax2.set_ylabel("RMSF (Å)", fontsize=11)
ax2.set_title("(B) Per-residue RMSF (Cα atoms)", fontsize=12, fontweight="bold", loc="left")
ax2.legend(fontsize=9)
plt.savefig("rmsd_rmsf.png", dpi=300, bbox_inches="tight")
plt.close()
print("Saved rmsd_rmsf.png")
Interpreting the results
| What you see | What it means | Action |
|---|---|---|
| RMSD plateaus at < 2 Å | Excellent convergence. Protein maintained its starting fold throughout the simulation. | Use full trajectory for RMSF and further analysis. |
| RMSD plateaus at 2–4 Å | Reasonable for a flexible protein or one undergoing conformational change. Check if the plateau is stable. | Identify which region contributes most to RMSD. Consider domain-specific analysis. |
| RMSD still rising at simulation end | Simulation hasn’t converged. The protein is still relaxing or changing conformation. | Extend the simulation. Discard data before plateau for equilibrium analysis. |
| RMSD jumps suddenly then stays high | Structural event — possibly unfolding, large loop rearrangement, or ligand dissociation. | Inspect the trajectory visually at the jump frame in PyMOL or VMD. |
| RMSF < 1 Å for secondary structure | Helices and sheets are rigid and well-defined. High-quality simulation of structured regions. | These regions are reliable for docking grid definition or interface analysis. |
| RMSF peaks at loops and termini | Expected and normal — loops are intrinsically flexible. Terminal residues are usually the most mobile. | If using the structure for docking, trim disordered termini before grid preparation. |
rms.RMSF(atoms).run(start=N). For most simulations, discarding the first 20–30% of frames is a reasonable default.
Comparing RMSF to AlphaFold pLDDT
When your simulation was seeded from an AlphaFold structure, comparing per-residue RMSF with per-residue pLDDT is a powerful validation: regions that AlphaFold predicted with low confidence (low pLDDT) should generally show high RMSF in simulation, since both reflect intrinsic disorder. If they don’t agree, it’s worth investigating why.
from Bio.PDB import PDBParser
import pandas as pd
import matplotlib.pyplot as plt
# Extract pLDDT from the AlphaFold PDB (B-factor column)
parser = PDBParser(QUIET=True)
af_structure = parser.get_structure("af", "alphafold_model.pdb")
plddt_data = []
for res in af_structure[0]["A"].get_residues():
if res.id[0] == " " and "CA" in res:
plddt_data.append({"resid": res.id[1], "plddt": res["CA"].get_bfactor()})
df_plddt = pd.DataFrame(plddt_data)
# Build RMSF DataFrame from MDAnalysis results
df_rmsf = pd.DataFrame({"resid": resids, "rmsf": rmsf_vals})
# Merge on residue number
df = df_rmsf.merge(df_plddt, on="resid")
# Plot RMSF (left axis) and pLDDT (right axis) vs residue
fig, ax1 = plt.subplots(figsize=(12, 4))
ax2 = ax1.twinx()
ax1.plot(df["resid"], df["rmsf"], color="#185fa5", linewidth=1.2, label="RMSF (MD)")
ax2.plot(df["resid"], df["plddt"], color="#993c1d", linewidth=1.0,
linestyle="--", alpha=0.7, label="pLDDT (AlphaFold)")
ax1.set_xlabel("Residue number")
ax1.set_ylabel("RMSF (Å)", color="#185fa5")
ax2.set_ylabel("pLDDT", color="#993c1d")
ax2.invert_yaxis() # invert pLDDT so low confidence aligns with high RMSF
lines1, labels1 = ax1.get_legend_handles_labels()
lines2, labels2 = ax2.get_legend_handles_labels()
ax1.legend(lines1 + lines2, labels1 + labels2, fontsize=9)
plt.title("RMSF vs AlphaFold pLDDT confidence")
plt.tight_layout()
plt.savefig("rmsf_vs_plddt.png", dpi=300, bbox_inches="tight")
plt.close()
# Pearson correlation: should be strongly negative (low pLDDT ↔ high RMSF)
corr = df["rmsf"].corr(df["plddt"])
print(f"RMSF vs pLDDT Pearson r: {corr:.3f}")
# Expected: strongly negative (e.g. -0.60 to -0.80)
Reporting values for publication
When reporting RMSD and RMSF in a paper, include the following — reviewers at journals like JCTC, JPCB, and Structure specifically look for this information:
- Reference structure — what frame was used as the reference for RMSD calculation (typically frame 0 or the experimental structure)
- Atom selection — backbone, Cα, or all heavy atoms; which chains were included
- Equilibration period discarded — how many frames or nanoseconds were excluded from RMSF analysis
- Mean ± SD RMSD over the equilibrated portion: e.g., “Mean backbone RMSD of 1.84 ± 0.21 Å over the final 50 ns”
- Most flexible regions from the RMSF profile by residue number, with interpretation
RMSD and RMSF in one paragraph
Always align the trajectory first with align.AlignTraj(u, u, select="backbone", in_memory=True).run() — skipping this inflates both metrics with whole-protein translation and rotation. Then calculate RMSD with rms.RMSD(u, select="backbone").run() — results are in results.rmsd[:, 2] — and RMSF with rms.RMSF(ca_atoms).run() on the aligned Universe. Plot RMSD as a time series to assess convergence; plot RMSF as a per-residue profile to identify flexible loops and rigid secondary structures. Discard the equilibration period (first 20–30% of frames) before calculating RMSF. For AlphaFold-seeded simulations, compare RMSF with pLDDT — the negative correlation validates both the simulation and the confidence prediction.