How to Align Protein Structures in PyMOL: align, super and cealign Explained
Structure alignment is one of the most frequent tasks in structural biology — comparing homologs, validating predictions against experiment, measuring conformational change. PyMOL has three alignment commands, and choosing the wrong one gives misleading RMSD values. This guide explains when each is appropriate and how to use all three correctly.
The three alignment commands at a glance
align:
- Uses sequence alignment first
- Best for close homologs (>30% identity)
- Iteratively rejects outlier residues
- Fastest of the three
- Can fail with low sequence similarity
super:
- Uses structural superposition directly
- Good for distant homologs
- Works even with low sequence identity
- Slower than align, faster than cealign
- More accurate when sequences diverged
cealign:
- CE (Combinatorial Extension) algorithm
- Finds structurally similar subregions
- Best for structural analogs: similar folds with little or no sequence similarity
- Slowest (computationally intensive)
- Robust when others fail completely
A common mistake is using align for structurally similar but sequentially divergent proteins and getting an artificially high RMSD, because the sequence alignment misfires and pairs the wrong secondary structure elements. When in doubt between align and super, use super. It handles a wider range of cases correctly.
align — sequence-guided alignment
align first performs a sequence alignment between the mobile and reference structures, maps the sequence-aligned residues onto each other, then minimizes the RMSD by rotating and translating the mobile structure. After the initial fit, it iteratively removes the worst-fitting residue pairs and recalculates — the reported RMSD reflects the well-fitting core rather than the whole protein.
# Basic alignment: mobile structure aligns to reference
align mobile, reference
# Output example:
Executive: RMSD = 0.742 (197 to 197 atoms)
Executive: RMSD = 0.598 (183 to 183 atoms) after outlier rejection
# Align specific chains
align mobile and chain A, reference and chain A
# Align and store the result object (for examining the alignment)
align mobile, reference, object=alignment_result
# Report the RMSD without moving the mobile structure
# (cycles=0 also turns off outlier rejection, so every matched pair counts)
align mobile, reference, cycles=0, transform=0
PyMOL reports two RMSD values: the initial alignment RMSD (all matched pairs) and the refined RMSD after outlier rejection. The exact wording of the log differs between PyMOL versions, but the refined value is the one typically reported in publications. The number of atoms used appears in parentheses. A much lower atom count after refinement means many poorly matching pairs were rejected, which can indicate that the alignment is unreliable.
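For scripted work, the same numbers can be read from the command's return value instead of the log. The sketch below assumes the seven-element list that `cmd.align` returns in current PyMOL releases (refined RMSD, refined atom count, refinement cycles, initial RMSD, initial atom count, raw score, residues aligned). The helper name `summarize_alignment` and the 30% warning threshold are illustrative assumptions, not PyMOL conventions.

```python
def summarize_alignment(result, max_rejected_fraction=0.30):
    """Summarize an align-style result list.

    Assumed layout: [rmsd_refined, atoms_refined, cycles,
                     rmsd_initial, atoms_initial, raw_score, residues_aligned]
    """
    rmsd_refined, atoms_refined, _cycles, rmsd_initial, atoms_initial = result[:5]
    atoms_rejected = atoms_initial - atoms_refined
    rejected_fraction = atoms_rejected / atoms_initial if atoms_initial else 0.0
    return {
        "rmsd_initial": rmsd_initial,
        "rmsd_refined": rmsd_refined,
        "atoms_rejected": atoms_rejected,
        "rejected_fraction": rejected_fraction,
        # Heuristic flag only: a large rejected share hints at a shaky alignment
        "suspicious": rejected_fraction > max_rejected_fraction,
    }


# RMSD values and atom counts come from the output example above; the cycle
# count, score and residue count are placeholder values.
example = [0.598, 183, 5, 0.742, 197, 0.0, 0]
summary = summarize_alignment(example)
print(summary["atoms_rejected"], summary["suspicious"])  # prints: 14 False
```

Inside PyMOL the call would look like `summarize_alignment(cmd.align("mobile", "reference"))`.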
super — structure-guided alignment
super skips the sequence alignment step entirely and works directly from structural coordinates. It identifies structurally equivalent residue pairs by comparing local geometry — backbone angles, secondary structure patterns — rather than sequence similarity. This makes it reliable across a much wider range of sequence identities.
# Structure-based alignment — preferred for distant homologs
super mobile, reference
# Output example:
Executive: RMSD = 1.243 (201 to 201 atoms)
Executive: RMSD = 0.891 (187 to 187 atoms) after outlier rejection
# super on backbone atoms only (faster, cleaner RMSD)
super mobile and backbone, reference and backbone
# super on specific chains
super mobile and chain B, reference and chain A
When in doubt, use super. It handles close homologs as well as align does, and it handles distant homologs significantly better. align is clearly preferable only when you specifically need sequence-position-based pairing, for example when comparing the same protein in two conformational states, where every residue should be paired by position.
cealign — fold topology alignment
cealign implements the Combinatorial Extension (CE) algorithm, which searches for the longest continuous path of structurally similar fragments between two proteins. It is the most computationally intensive of the three but is designed for the hardest cases: proteins with similar folds but very different sequences. Note that CE extends its alignment path in sequence order, so it expects the shared secondary structure elements to appear in the same order in both chains; rearranged topologies such as circular permutations can still defeat it.
# CE algorithm alignment — for distantly related or topologically similar proteins
cealign reference, mobile
# Note: argument order is reversed from align and super!
# cealign REFERENCE, MOBILE (not mobile, reference)
# CE alignment, storing the resulting alignment as a named object
cealign reference, mobile, object=cealign_result
Unlike align and super, which take mobile, reference, cealign takes reference, mobile. This is a common source of confusion: if you run cealign and the wrong structure moves, swap the arguments.
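In scripts, one way to avoid the mix-up is to hide the ordering behind a single function. The sketch below is one possible wrapper, not a PyMOL feature: `superpose` is a hypothetical name, and `cmd` is passed in as an argument (inside PyMOL it would be the module from `from pymol import cmd`) so the function can also be tried with a stand-in object. It assumes `cmd.align` and `cmd.super` return a list whose first entry is the RMSD, and that `cmd.cealign` returns a dictionary with an `"RMSD"` key.

```python
def superpose(cmd, mobile, reference, method="super"):
    """Superpose `mobile` onto `reference`, hiding the argument-order difference.

    Returns the RMSD reported by the chosen command.
    """
    if method == "cealign":
        # cealign takes the fixed (target) structure first
        result = cmd.cealign(reference, mobile)
        return result["RMSD"]
    if method in ("align", "super"):
        # align and super take the mobile structure first
        result = getattr(cmd, method)(mobile, reference)
        return result[0]
    raise ValueError(f"unknown method: {method!r}")
```

With this in place, `superpose(cmd, "mobile", "reference", method="cealign")` always moves the first-named structure, whichever command runs underneath.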
When to use each — decision guide
| Situation | Best choice |
|---|---|
| Same protein, two conformational states (apo vs bound) | align |
| Close homologs, >40% sequence identity | align or super |
| Moderate homologs, 20–40% identity | super |
| Distant homologs, <20% sequence identity | super or cealign |
| align or super produces poor superposition | cealign |
| AlphaFold model vs experimental homolog | super |
| Comparing two AlphaFold models | super |
| Structurally analogous proteins, no detectable sequence similarity | cealign |
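The table can also be written down as a small lookup function, which is handy when a batch script has to pick a command for many pairs. This is only a sketch of the rows above: the function name `choose_command` and its parameters are assumptions, the AlphaFold rows are folded into the default, and the identity thresholds are rules of thumb rather than hard limits.

```python
def choose_command(identity=None, same_protein=False, previous_failed=False):
    """Return the suggested PyMOL command(s) for a pair of structures.

    identity: percent sequence identity (0-100), or None if unknown.
    """
    if previous_failed:
        # align or super already gave a poor superposition
        return "cealign"
    if same_protein:
        # e.g. apo vs bound state: pair residues by sequence position
        return "align"
    if identity is None:
        # the guide's default when in doubt
        return "super"
    if identity > 40:
        return "align or super"
    if identity >= 20:
        return "super"
    return "super or cealign"


print(choose_command(identity=30))  # prints: super
```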
Interpreting RMSD values
RMSD (root mean square deviation) is the square root of the mean squared distance between equivalent atom pairs after alignment, so large deviations weigh more heavily than they would in a plain average. Lower values mean the structures are more similar. What counts as “good” depends entirely on context: comparing a crystal structure to an AlphaFold prediction is different from comparing two conformational states of the same protein.
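To make the definition concrete, here is a minimal pure-Python sketch of the calculation on already-paired, already-superposed coordinates. It is not how PyMOL computes the value internally; the function name `rmsd` and the toy coordinates are illustrative.

```python
import math


def rmsd(coords_a, coords_b):
    """Root mean square deviation between two equal-length lists of (x, y, z)."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


# Three atom pairs 1 Å apart and one pair 3 Å apart
a = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (3, 0, 0)]
b = [(0, 1, 0), (1, 1, 0), (2, 1, 0), (3, 3, 0)]
print(round(rmsd(a, b), 3))  # prints: 1.732
```

The mean distance for these four pairs is 1.5 Å, but the RMSD is about 1.73 Å: the single 3 Å pair pulls the value up. This is why a few badly placed loops can dominate an RMSD and why the post-rejection value is usually the more informative one.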
Aligning multiple structures
When comparing several structures simultaneously — a set of crystal structures in different conformations, or five AlphaFold models — align them all to a single reference:
# Load multiple structures
fetch 4XYZ 5ABC 6DEF 7GHI
# Align all to 4XYZ as reference
super 5ABC, 4XYZ
super 6DEF, 4XYZ
super 7GHI, 4XYZ
# Color each structure distinctly to compare
color slate, 4XYZ
color salmon, 5ABC
color palegreen, 6DEF
color lightorange, 7GHI
# Show all as cartoons
as cartoon
zoom
For an automated multi-structure alignment in a script, use a Python loop within PyMOL. This is practical when aligning more than four or five structures:
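from pymol import cmd

structures = ["5ABC", "6DEF", "7GHI", "8JKL"]
reference = "4XYZ"
cmd.fetch(reference)  # the reference must be loaded before anything is aligned to it
for s in structures:
    cmd.fetch(s)
    cmd.super(s, reference)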
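If you also want the numbers, the loop can collect them as it goes. A minimal sketch, assuming `cmd.super` returns an align-style list whose first two entries are the refined RMSD and the refined atom count. `superpose_all` and `print_report` are hypothetical helper names, and `cmd` is passed in as an argument so the functions can be tried outside PyMOL with a stand-in object.

```python
def superpose_all(cmd, mobiles, reference):
    """Superpose each object in `mobiles` onto `reference` with cmd.super.

    Returns {name: (rmsd, n_atoms)}, ordered from lowest to highest RMSD.
    """
    results = {}
    for name in mobiles:
        result = cmd.super(name, reference)
        results[name] = (result[0], result[1])
    return dict(sorted(results.items(), key=lambda item: item[1][0]))


def print_report(results, reference):
    """Print one line per superposed object."""
    for name, (rmsd, n_atoms) in results.items():
        print(f"{name} -> {reference}: RMSD {rmsd:.3f} over {n_atoms} atoms")
```

Inside PyMOL, after the structures are loaded, this could be called as `print_report(superpose_all(cmd, ["5ABC", "6DEF", "7GHI"], "4XYZ"), "4XYZ")`. Sorting by RMSD puts the structures closest to the reference first.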
Aligning specific domains or regions
When two proteins share one well-conserved domain but differ in others — kinases with conserved catalytic cores and variable regulatory domains, for example — aligning on the whole protein produces a poor overall superposition that obscures the conserved region. Aligning on the specific domain of interest is almost always more informative.
# Align only on the kinase domain (residues 200-450)
super mobile and resi 200-450, reference and resi 200-450
# Align on specific chain of a complex
super mobile and chain A, reference and chain A
# Align on backbone of secondary structure elements only
super mobile and ss h and backbone, reference and ss h and backbone
# Align on a named selection defined earlier
select core, resi 50-150 and name CA
super mobile and core, reference and core
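Residue numbering often differs between homologs, so the same `resi` range does not necessarily cover the same domain in both structures. One way to keep this manageable is to build each selection string separately. The helper below is a sketch with a hypothetical name (`region_selection`), and the residue range used for the reference is a made-up example.

```python
def region_selection(obj, first, last, chain=None, atom_name=None):
    """Build a PyMOL selection string for a residue range of one object."""
    parts = [obj]
    if chain is not None:
        parts.append(f"chain {chain}")
    parts.append(f"resi {first}-{last}")
    if atom_name is not None:
        parts.append(f"name {atom_name}")
    return " and ".join(parts)


# Same domain, different numbering in the two structures (illustrative ranges)
mobile_sel = region_selection("mobile", 200, 450, chain="A", atom_name="CA")
reference_sel = region_selection("reference", 185, 435, chain="A", atom_name="CA")
print(f"super {mobile_sel}, {reference_sel}")
```

The two strings can be pasted into the command line as shown, or passed directly to `cmd.super(mobile_sel, reference_sel)` in a script.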
Calculating RMSD without moving structures
Sometimes you want the RMSD value without actually repositioning any structure — when the structures are already in a meaningful orientation, or when you want to measure deviation between pre-aligned models.
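# RMSD of backbone atoms without moving anything
# (rms_cur pairs atoms by identifier, so chain, residue and atom names must match in both selections)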
rms_cur mobile and backbone, reference and backbone
# RMSD of Cα atoms specifically
rms_cur mobile and name CA, reference and name CA
# align without applying the transformation: nothing moves, but the value reported
# is the best-fit RMSD, not the RMSD of the current positions
align mobile, reference, transform=0
Alignment in one paragraph
Use super as your default alignment command — it works reliably across a wider range of sequence identities than align and is faster than cealign. Use align when you specifically need sequence-position-based pairing, such as comparing two conformational states of the same protein. Use cealign when the other two produce obviously wrong superpositions, remembering that its argument order is reversed. Always report the post-outlier-rejection RMSD and the number of atoms used. When comparing multi-domain proteins, align on the domain of interest rather than the whole structure to get a meaningful superposition of the region that matters.