How to Align Protein Structures in PyMOL: align, super and cealign Explained

Structure alignment is one of the most frequent tasks in structural biology — comparing homologs, validating predictions against experiment, measuring conformational change. PyMOL has three alignment commands, and choosing the wrong one gives misleading RMSD values. This guide explains when each is appropriate and how to use all three correctly.

The three alignment commands at a glance

align
Sequence-guided
  • Uses sequence alignment first
  • Best for close homologs (>30% identity)
  • Iteratively rejects outlier residues
  • Fastest of the three
  • Can fail with low sequence similarity

super
Structure-guided
  • Uses structural superposition directly
  • Good for distant homologs
  • Works even with low sequence identity
  • Slower than align, faster than cealign
  • More accurate when sequences have diverged

cealign
Topology alignment
  • CE (Combinatorial Extension) algorithm
  • Finds structurally similar subregions
  • Best for proteins that share a fold but have little sequence similarity
  • Slowest: intensive computation
  • Robust when others fail completely

The most common mistake
Using align on proteins that are structurally similar but divergent in sequence, and getting an artificially high RMSD because the sequence alignment misfires and pairs the wrong secondary structure elements. When in doubt between align and super, use super: it handles a wider range of cases correctly.

align — sequence-guided alignment

align first performs a sequence alignment between the mobile and reference structures, maps the sequence-aligned residues onto each other, then minimizes the RMSD by rotating and translating the mobile structure. After the initial fit, it iteratively removes the worst-fitting residue pairs and recalculates — the reported RMSD reflects the well-fitting core rather than the whole protein.
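The outlier-rejection step can be sketched in a few lines of plain Python. This is a simplified illustration, not PyMOL's implementation: it works on a fixed list of hypothetical pair distances and does not refit the structures between cycles, which align does. The defaults `cycles=5` and `cutoff=2.0` mirror align's documented defaults; the function names are made up for this example.

```python
import math

def rms(distances):
    """Root mean square of a list of pair distances (in Å)."""
    return math.sqrt(sum(d * d for d in distances) / len(distances))

def reject_outliers(distances, cycles=5, cutoff=2.0):
    """Drop pairs farther apart than cutoff x the current RMSD, up to `cycles` times.

    Simplification: the superposition is held fixed, so only the list shrinks.
    """
    kept = list(distances)
    for _ in range(cycles):
        limit = cutoff * rms(kept)
        survivors = [d for d in kept if d <= limit]
        if len(survivors) == len(kept):
            break  # nothing was rejected, so further cycles change nothing
        kept = survivors
    return kept, rms(kept)

# Hypothetical distances: a well-fitting core plus two residues in a floppy loop
pair_distances = [0.4, 0.5, 0.6, 0.5, 0.7, 0.4, 6.0, 8.0]
core, refined = reject_outliers(pair_distances)
print(f"before: {rms(pair_distances):.2f} Å over {len(pair_distances)} pairs")
print(f"after:  {refined:.2f} Å over {len(core)} pairs")
```

The two loop residues dominate the initial value. Once they are rejected, the RMSD describes only the six core pairs, which is why the atom count matters as much as the RMSD itself.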

PyMOL command line
# Basic alignment: mobile structure aligns to reference
align mobile, reference

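# Output example (simplified; PyMOL also prints per-cycle rejection messages, and the exact wording varies by version):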
 Executive: RMSD =    0.742 (197 to 197 atoms)
 Executive: RMSD =    0.598 (183 to 183 atoms) after outlier rejection

# Align specific chains
align mobile and chain A, reference and chain A

# Align and store the result object (for examining the alignment)
align mobile, reference, object=alignment_result

# Get the RMSD without moving the mobile structure (transform=0).
# cycles=0 also turns off outlier rejection, so the value covers all matched pairs.
align mobile, reference, cycles=0, transform=0

PyMOL reports two RMSD values: the initial RMSD over all matched pairs and the refined RMSD after outlier rejection. The refined value is the one usually reported in publications. The number of atoms used appears in parentheses. A much lower atom count after refinement means many poorly matching regions were rejected, so the low RMSD describes only part of the structure and the alignment may be unreliable.
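If you run the alignment through the Python API instead, both sets of numbers come back as values rather than printed text. According to the PyMOL documentation, `cmd.align` returns a 7-item tuple: RMSD after refinement, atoms aligned after refinement, refinement cycles, RMSD before refinement, atoms aligned before refinement, raw alignment score, and residues aligned. The sketch below unpacks such a tuple and flags how much of the initial pairing survived. The helper names and the example tuple are hypothetical, and treating `cmd.super` as returning the same layout is an assumption.

```python
def summarize_alignment(result):
    """Unpack the 7-item tuple returned by cmd.align (layout per the PyMOL docs)."""
    (rmsd_after, atoms_after, cycles_run,
     rmsd_before, atoms_before, score, residues) = result
    fraction_kept = atoms_after / atoms_before if atoms_before else 0.0
    return {
        "rmsd": rmsd_after,
        "atoms": atoms_after,
        "rmsd_before": rmsd_before,
        "atoms_before": atoms_before,
        "fraction_kept": fraction_kept,
    }

def report(summary, atom_label="atoms"):
    """Format the refined RMSD together with the atom counts behind it."""
    return (
        f"RMSD = {summary['rmsd']:.2f} Å over {summary['atoms']} {atom_label} "
        f"({summary['fraction_kept']:.0%} of {summary['atoms_before']} initial pairs kept)"
    )

# Hypothetical tuple, shaped like the return value of cmd.align("mobile", "reference")
example = (0.598, 183, 5, 0.742, 197, 512.0, 201)
summary = summarize_alignment(example)
print(report(summary))
```

Inside PyMOL you would pass the real return value, for example `summarize_alignment(cmd.align("mobile", "reference"))`. A low `fraction_kept` is the signal to inspect the superposition before trusting the number.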

super — structure-guided alignment

super skips the sequence alignment step and works directly from structural coordinates. It pairs residues with a sequence-independent, structure-based dynamic programming alignment, then runs the same kind of outlier-rejection refinement cycles as align. This makes it reliable across a much wider range of sequence identities.

PyMOL command line
# Structure-based alignment — preferred for distant homologs
super mobile, reference

# Output example (simplified; exact wording varies by version):
 Executive: RMSD =    1.243 (201 to 201 atoms)
 Executive: RMSD =    0.891 (187 to 187 atoms) after outlier rejection

# super on backbone atoms only (faster, cleaner RMSD)
super mobile and backbone, reference and backbone

# super on specific chains (the two chain IDs need not match)
super mobile and chain B, reference and chain A

super is usually the right default
If you’re unsure which alignment command to use, start with super. It handles close homologs as well as align does, and it handles distant homologs significantly better. The only cases where align is clearly preferred are when you specifically need sequence-position-based alignment — for example, when comparing the same protein in two conformational states where every residue should be paired by position.

cealign — fold topology alignment

cealign implements the Combinatorial Extension (CE) algorithm, which pairs short structurally similar fragments and extends them into the longest continuous path between the two proteins. It is the most computationally intensive of the three and is designed for the hardest cases: proteins with similar folds but very different sequences. Because CE builds its path in sequence order, it is less suited to proteins whose secondary structure elements are connected in a different order.

PyMOL command line
# CE algorithm alignment — for distantly related or topologically similar proteins
cealign reference, mobile

# Note: argument order is reversed from align and super!
# cealign REFERENCE, MOBILE (not mobile, reference)

# CE alignment, storing the residue pairing in a named alignment object
cealign reference, mobile, object=cealign_result

cealign reverses the argument order
Unlike align and super which take mobile, reference, cealign takes reference, mobile. This is a common source of confusion — if you run cealign and the wrong structure moves, swap the arguments.
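One way to stop thinking about the argument order is a thin wrapper that always takes (mobile, reference) and swaps them for cealign. The sketch below is a hypothetical helper, not part of PyMOL. It takes the `cmd` module as a parameter so it can be read and tested outside PyMOL, and it relies on the documented return values: a tuple starting with (RMSD, atom count) for `cmd.align`, assumed to be the same for `cmd.super`, and a dict with `RMSD` and `alignment_length` keys for `cmd.cealign`.

```python
def superpose(cmd, mobile, reference, method="super"):
    """Superpose `mobile` onto `reference`, hiding cealign's reversed order.

    Returns (rmsd, count). Note the count is atoms for align/super
    but aligned residues for cealign.
    """
    if method == "cealign":
        # cealign takes the target (reference) first, then the mobile structure
        result = cmd.cealign(reference, mobile)
        return result["RMSD"], result["alignment_length"]
    if method in ("align", "super"):
        # align and super take the mobile structure first
        result = getattr(cmd, method)(mobile, reference)
        return result[0], result[1]
    raise ValueError(f"unknown method: {method!r}")

# Inside PyMOL (object names are placeholders):
#   from pymol import cmd
#   rmsd, count = superpose(cmd, "mobile", "reference", method="cealign")
```

With this in place the mobile structure is always the first name you pass, whichever algorithm runs underneath.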

When to use each — decision guide

Situation | Best choice
Same protein, two conformational states (apo vs bound) | align
Close homologs, >40% sequence identity | align or super
Moderate homologs, 20–40% identity | super
Distant homologs, <20% sequence identity | super or cealign
align or super produces poor superposition | cealign
AlphaFold model vs experimental homolog | super
Comparing two AlphaFold models | super
Structurally analogous proteins (similar fold, unrelated sequence) | cealign
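The identity-based rows of this guide can be written down as a small lookup, which is handy when scripting comparisons over many pairs. This is only a sketch of the thresholds above. The function name is hypothetical, and the handling of the exact boundaries (40% counted as "moderate", 20% as "moderate") is an assumption, since the guide does not say.

```python
def suggest_commands(identity_percent, same_protein=False):
    """Return candidate PyMOL alignment commands in order of preference."""
    if same_protein:
        return ("align",)             # pair every residue by position
    if identity_percent > 40:
        return ("align", "super")     # close homologs
    if identity_percent >= 20:
        return ("super",)             # moderate homologs
    return ("super", "cealign")       # distant homologs

print(suggest_commands(55))
print(suggest_commands(30))
print(suggest_commands(12))
print(suggest_commands(100, same_protein=True))
```

Whatever the lookup suggests, a visibly poor superposition is still the cue to fall back to cealign.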

Interpreting RMSD values

RMSD (root mean square deviation) is the square root of the mean squared distance between equivalent atom pairs after alignment, so large deviations count for more than they would in a simple average. Lower values mean the structures are more similar. What counts as “good” depends on the context: comparing a crystal structure with an AlphaFold prediction is different from comparing two conformational states of the same protein.
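Because RMSD is a root mean square rather than a plain mean, a few badly placed atoms raise it more than they raise the average distance. A minimal pure-Python sketch with made-up coordinates shows the difference. The function names are illustrative and are not PyMOL API.

```python
import math

def pair_distances(coords_a, coords_b):
    """Distance between each pair of equivalent atoms, given (x, y, z) tuples."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    return [math.dist(a, b) for a, b in zip(coords_a, coords_b)]

def rmsd(coords_a, coords_b):
    """Square root of the mean squared pair distance."""
    d = pair_distances(coords_a, coords_b)
    return math.sqrt(sum(x * x for x in d) / len(d))

def mean_distance(coords_a, coords_b):
    """Plain average of the pair distances, for comparison."""
    d = pair_distances(coords_a, coords_b)
    return sum(d) / len(d)

# Made-up coordinates: two atoms are 1 Å apart, the third is 4 Å apart
structure_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
structure_b = [(0.0, 1.0, 0.0), (1.0, 1.0, 0.0), (2.0, 4.0, 0.0)]

print(f"mean distance: {mean_distance(structure_a, structure_b):.2f} Å")
print(f"RMSD:          {rmsd(structure_a, structure_b):.2f} Å")
```

The single 4 Å pair pulls the RMSD above the mean distance, which is why one flexible loop can inflate an otherwise excellent fit.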

RMSD interpretation — Cα backbone
< 1.0 Å
Excellent. Structures nearly identical — same protein in same or very similar conformation, or high-quality model vs crystal structure.
1–2 Å
Good. Close homologs or same protein with some conformational flexibility. Typical for comparing a ligand-free vs ligand-bound structure.
2–4 Å
Moderate. Significant conformational differences or moderate sequence divergence. Meaningful similarities remain in conserved core.
> 4 Å
Large. Structures are very different. May indicate wrong alignment, unrelated folds, or large domain rearrangements. Investigate before reporting.

Always report how many atoms were used
An RMSD of 0.6 Å over 80 atoms is very different from 0.6 Å over 300 atoms. When reporting RMSD in a paper, always include the number of aligned residues or atoms: “RMSD = 0.74 Å over 183 Cα atoms.” PyMOL prints this automatically — use the number from the final (post-outlier-rejection) line.

Aligning multiple structures

When comparing several structures simultaneously — a set of crystal structures in different conformations, or five AlphaFold models — align them all to a single reference:

PyMOL command line
# Load multiple structures
fetch 4XYZ 5ABC 6DEF 7GHI

# Align all to 4XYZ as reference
super 5ABC, 4XYZ
super 6DEF, 4XYZ
super 7GHI, 4XYZ

# Color each structure distinctly to compare
color slate,    4XYZ
color salmon,   5ABC
color palegreen, 6DEF
color lightorange, 7GHI

# Show all as cartoons
as cartoon
zoom

For an automated multi-structure alignment in a script, use a Python loop within PyMOL. This is practical when aligning more than four or five structures:

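PyMOL Python (save as a .py file and load it with run, or wrap it in python ... python end inside a .pml)
from pymol import cmd

structures = ["5ABC", "6DEF", "7GHI", "8JKL"]
reference  = "4XYZ"

# Load the reference first so super has a target to fit onto
cmd.fetch(reference)

for s in structures:
    cmd.fetch(s)
    cmd.super(s, reference)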
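When the loop covers many structures, it helps to collect the RMSD and atom count for each one instead of reading them off the log. The sketch below is one hedged way to do that. `superpose_all` and `format_table` are hypothetical helpers, and `fit` is any callable that takes (mobile, reference) and returns a tuple starting with (RMSD, atom count), which is the documented layout for `cmd.align` and assumed here for `cmd.super`.

```python
def superpose_all(names, reference, fit):
    """Fit every structure in `names` onto `reference`; return rows sorted by RMSD."""
    rows = []
    for name in names:
        result = fit(name, reference)
        rows.append((name, result[0], result[1]))
    rows.sort(key=lambda row: row[1])  # best fit first
    return rows

def format_table(rows):
    """Render (name, rmsd, atoms) rows as plain text lines."""
    lines = [f"{'structure':<12}{'RMSD (Å)':>10}{'atoms':>8}"]
    for name, rmsd, atoms in rows:
        lines.append(f"{name:<12}{rmsd:>10.2f}{atoms:>8d}")
    return "\n".join(lines)

# Inside PyMOL, after fetching the structures (IDs are the placeholders used above):
#   from pymol import cmd
#   rows = superpose_all(["5ABC", "6DEF", "7GHI", "8JKL"], "4XYZ", cmd.super)
#   print(format_table(rows))
```

Sorting by RMSD puts the closest structure first, and keeping the atom count in the same row makes it harder to quote an RMSD without it.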

Aligning specific domains or regions

When two proteins share one well-conserved domain but differ in others — kinases with conserved catalytic cores and variable regulatory domains, for example — aligning on the whole protein produces a poor overall superposition that obscures the conserved region. Aligning on the specific domain of interest is almost always more informative.

PyMOL command line
# Align only on the kinase domain (residues 200-450)
super mobile and resi 200-450, reference and resi 200-450

# Align on specific chain of a complex
super mobile and chain A, reference and chain A

# Align on the backbone of helices and strands only
super mobile and ss h+s and backbone, reference and ss h+s and backbone

# Align on a named selection defined earlier
select core, resi 50-150 and name CA
super mobile and core, reference and core
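Homologous domains rarely share residue numbering, so in scripts it is convenient to build the two selection strings separately. The helper below is a hypothetical sketch that assembles standard PyMOL selection syntax (`chain`, `resi`, `name`). The default restriction to Cα atoms and the second residue range in the usage note are assumptions made for this example.

```python
def region(obj, first, last, chain=None, atoms="CA"):
    """Build a PyMOL selection string for a residue range of one object."""
    parts = [obj]
    if chain:
        parts.append(f"chain {chain}")
    parts.append(f"resi {first}-{last}")
    if atoms:
        parts.append(f"name {atoms}")
    return " and ".join(parts)

print(region("mobile", 200, 450))
print(region("reference", 185, 437, chain="A"))
```

Inside PyMOL the two strings go straight into the alignment call, for example `cmd.super(region("mobile", 200, 450), region("reference", 185, 437, chain="A"))`.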

Calculating RMSD without moving structures

Sometimes you want the RMSD value without actually repositioning any structure — when the structures are already in a meaningful orientation, or when you want to measure deviation between pre-aligned models.

PyMOL command line
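# RMSD of backbone atoms without moving anything
# (rms_cur pairs atoms by matching chain, residue and atom identifiers,
#  so it suits two models of the same protein)
rms_cur mobile and backbone, reference and backbone

# RMSD of Cα atoms specifically
rms_cur mobile and name CA, reference and name CA

# align without transformation (RMSD only, no movement);
# pairs come from the sequence alignment and outlier rejection still applies
align mobile, reference, transform=0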

Alignment in one paragraph

Use super as your default alignment command — it works reliably across a wider range of sequence identities than align and is faster than cealign. Use align when you specifically need sequence-position-based pairing, such as comparing two conformational states of the same protein. Use cealign when the other two produce obviously wrong superpositions, remembering that its argument order is reversed. Always report the post-outlier-rejection RMSD and the number of atoms used. When comparing multi-domain proteins, align on the domain of interest rather than the whole structure to get a meaningful superposition of the region that matters.
