How to Prepare a Protein for Molecular Docking: Complete Step-by-Step Guide

Protein preparation is where most docking projects succeed or fail — and where most beginners spend the most time confused. A raw PDB file cannot go straight into AutoDock Vina. This guide walks through every step from downloading the structure to a verified, docking-ready PDBQT file.

What protein preparation actually involves

When you download a protein structure from the RCSB Protein Data Bank, what you get is a raw crystallographic snapshot. It is missing information that docking software needs, and it contains things that will break your docking run if left in. Preparation fixes both problems.

Here is what needs to happen before a protein is ready for AutoDock Vina:

  • Step 1: Download from PDB
  • Step 2: Clean in PyMOL
  • Step 3: Add hydrogens
  • Step 4: Generate PDBQT
  • Step 5: Verify

Each step matters. Skipping or rushing any one of them is the most common cause of bad docking results — not the docking algorithm itself. The old saying in computational chemistry applies here: garbage in, garbage out.

What this guide uses as an example
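This tutorial uses 3HTB as the working example throughout and treats it as a crystal structure of HIV-1 protease with a known inhibitor, a classic, well-characterised docking benchmark target. Confirm the title, chain IDs, and ligand codes on the entry's RCSB page before running any command, because the residue names and residue numbers used in later steps are specific to the entry. You can follow along with any PDB structure by substituting the PDB ID wherever 3HTB appears.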

Tools you need

You need three programs. All are free for academic use:

  • PyMOL — for visual inspection and cleaning the structure. The open-source version is available via conda; the educational version is free at pymol.org for students and academics.
  • AutoDockTools (ADT) — for assigning Gasteiger charges and generating the PDBQT file. Part of the MGLTools package, available free at ccsb.scripps.edu/mgltools.
  • AutoDock Vina — installed in the previous tutorial. You won’t run a docking calculation here, but you’ll need it installed to verify the output file format.

Install PyMOL via conda if you don’t have it:

Terminal
conda activate docking
conda install -c conda-forge pymol-open-source

Step 1 — Download the structure from the PDB

Go to rcsb.org and search for your target. For this tutorial, search for 3HTB. On the structure page, click Download Files → PDB Format. Save the file as 3HTB.pdb in a dedicated working folder, something like ~/docking/3HTB/.

You can also download directly from the command line:

Terminal
mkdir -p ~/docking/3HTB && cd ~/docking/3HTB
wget https://files.rcsb.org/download/3HTB.pdb

Before touching anything else, open the PDB file in a text editor and read the REMARK and HEADER sections at the top. These tell you the resolution of the structure, what organism it comes from, and crucially — what ligands and cofactors are present. You need to know this before you start removing things.

What to look for in the REMARK section
The header records near the top of a PDB file contain essential context: resolution (REMARK 2; lower is better, so aim for structures under 2.5 Å), R-factor (REMARK 3; a quality indicator, ideally below 0.25), missing residues (REMARK 465; gaps in the structure that might affect your binding site), and the HET and HETNAM records, which name the heteroatom groups (ligands, ions, cofactors) whose atoms appear further down as HETATM records. Waters are also stored as HETATM records, with residue name HOH.
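As a rough sketch, the same header fields can be pulled out with a few lines of Python instead of reading the file by eye. The function name summarize_pdb_header is made up for this example, and the parsing assumes the standard wwPDB fixed-width layout with three-character residue names; treat it as a starting point rather than a complete parser.

```python
import re


def summarize_pdb_header(pdb_text):
    """Collect resolution, missing-residue count, and HET codes from PDB header records."""
    summary = {"resolution": None, "missing_residues": 0, "het_codes": []}
    for line in pdb_text.splitlines():
        if line.startswith("REMARK   2 RESOLUTION."):
            # Typical line: "REMARK   2 RESOLUTION.    1.81 ANGSTROMS."
            match = re.search(r"(\d+\.\d+)\s+ANGSTROMS", line)
            if match:
                summary["resolution"] = float(match.group(1))
        elif line.startswith("REMARK 465"):
            # Data rows look like "REMARK 465     MET A     1"; explanatory rows do not match.
            if re.match(r"REMARK 465\s+(\d+\s+)?[A-Z0-9]{3}\s+\S\s+-?\d+", line):
                summary["missing_residues"] += 1
        elif line.startswith("HET   "):
            # The heteroatom group code sits in columns 8-10 of a HET record.
            code = line[7:10].strip()
            if code and code not in summary["het_codes"]:
                summary["het_codes"].append(code)
    return summary
```

Calling summarize_pdb_header(open("3HTB.pdb").read()) returns a small dictionary you can print before deciding what to remove in Step 2.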

Step 2 — Inspect and clean in PyMOL

Open PyMOL and load your structure:

PyMOL command line
load 3HTB.pdb

Spend two minutes looking at the structure before doing anything. Use the mouse to rotate it. Identify the binding site — in 3HTB, it’s clearly visible as a cavity where the co-crystallized inhibitor sits. Note the following before you start removing things:

  • Is there a co-crystallized ligand? (Yes — 3HTB has the inhibitor ARQ bound. You’ll remove it.)
  • Are there multiple chains? (3HTB is a homodimer with chains A and B — you need both for this target.)
  • Are there structural water molecules near the binding site that might be important? (Advanced — ignore for now.)
  • Are there metal ions or cofactors that are biologically relevant? (Check the literature before removing these.)

Remove water molecules

PDB files include crystallographic water molecules as HETATM records with residue name HOH. For standard docking, these are removed:

PyMOL command line
remove resn HOH

Remove the co-crystallized ligand

The bound inhibitor must be removed — you’re docking your own ligand into the empty pocket. In 3HTB the inhibitor residue name is ARQ:

PyMOL command line
remove resn ARQ

If you don’t know the residue name of the ligand in your structure, list all heteroatoms first:

PyMOL command line
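# Collect each heteroatom residue name once instead of printing one line per atom
stored.het = set()
iterate hetatm, stored.het.add(resn)
print(stored.het)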
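If PyMOL is not at hand, a similar listing can be produced straight from the PDB text. This is a minimal sketch: count_hetero_residues is a hypothetical helper name, and it relies only on the fixed-width HETATM layout (residue name in columns 18-20).

```python
from collections import Counter


def count_hetero_residues(pdb_text):
    """Count HETATM atoms per residue name using the fixed-width PDB columns."""
    counts = Counter()
    for line in pdb_text.splitlines():
        if line.startswith("HETATM"):
            counts[line[17:20].strip()] += 1
    return counts
```

Printing count_hetero_residues(open("3HTB.pdb").read()).most_common() shows every heteroatom group with its atom count, which makes waters, ligands, and small additives easy to tell apart.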

Remove any other unwanted heteroatoms

Check if any other HETATM records remain — buffer molecules, cryoprotectants, or crystallization additives that are artifacts of the experimental conditions rather than biology. Common ones include SO4 (sulfate), GOL (glycerol), PEG (polyethylene glycol), and EDO (ethanediol). Remove any that are not biologically relevant:

PyMOL command line
remove resn SO4+GOL+EDO

Save the cleaned structure

PyMOL command line
save 3HTB_clean.pdb

Do not remove metal ions blindly
If your target is a metalloenzyme (zinc proteases, cytochrome P450s, carbonic anhydrase, etc.), the catalytic metal is structurally and functionally essential. Removing it will collapse the binding site geometry and your docking results will be meaningless. Check the literature for your target before removing anything that isn’t obviously a crystallization artifact.
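The cleaning steps above can also be sketched without PyMOL, which is handy for batch work. The function below is an illustration, not a replacement for visual inspection: clean_pdb and its two parameters are names invented here, it only touches HETATM (and matching ANISOU) records, and the keep_resnames argument exists purely as a guard so that a metal such as ZN is not dropped by accident.

```python
def clean_pdb(pdb_text, drop_resnames=("HOH",), keep_resnames=()):
    """Drop HETATM/ANISOU records whose residue name is listed in drop_resnames.

    Residue names in keep_resnames are never dropped, even if they also
    appear in drop_resnames. Protein ATOM records are left untouched.
    """
    keep = {name.upper() for name in keep_resnames}
    drop = {name.upper() for name in drop_resnames} - keep
    kept_lines = []
    for line in pdb_text.splitlines():
        is_hetero = line.startswith("HETATM")
        # ANISOU records repeat the residue name of the atom they describe.
        is_aniso = line.startswith("ANISOU")
        if (is_hetero or is_aniso) and line[17:20].strip() in drop:
            continue
        # CONECT records that point at removed atoms are not cleaned up here.
        kept_lines.append(line)
    return "\n".join(kept_lines) + "\n"
```

For example, clean_pdb(text, drop_resnames=("HOH", "SO4", "GOL", "EDO")) mirrors the remove commands above; add the ligand's residue name to the tuple once you have confirmed it.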

Step 3 — Add hydrogens and assign charges

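Hydrogen atoms are usually absent from X-ray structures: a hydrogen has a single electron and scatters X-rays too weakly to be located at typical resolutions. But hydrogens are essential for docking because they determine where hydrogen bonds can form, so you need to add them computationally.

You also need to assign Gasteiger partial charges to every atom. AutoDock 4 uses these charges in its electrostatic term. AutoDock Vina's default scoring function ignores user-supplied charges, but the PDBQT format still carries a charge column, so assigning them is the standard way to produce a valid receptor file.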

Both steps happen in AutoDockTools. Open ADT (the MGLTools GUI):

1. Open AutoDockTools. Go to File → Read Molecule and load 3HTB_clean.pdb.
2. Go to Edit → Hydrogens → Add. In the dialog, select Polar Only (adds hydrogens only to polar atoms, which is standard for docking). Click OK.
3. Go to Edit → Charges → Compute Gasteiger. This runs in a few seconds. There is no visible output; the charges are assigned internally.
4. Verify the charges were assigned: go to Edit → Charges → Check Totals on Residues. Each residue's total charge should be close to an integer (e.g. 0, −1, +1). A non-integer total (like 0.73) indicates a problem with missing atoms or incomplete residues.
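Once a PDBQT file has been written (Step 4), the same per-residue check can be repeated outside the GUI. The sketch below assumes the usual fixed-width PDBQT layout, with the partial charge ending at column 76; the function names and the 0.05 tolerance are arbitrary choices for illustration, not values taken from AutoDockTools.

```python
from collections import defaultdict


def residue_charge_totals(pdbqt_text):
    """Sum partial charges per (chain, residue number, residue name)."""
    totals = defaultdict(float)
    for line in pdbqt_text.splitlines():
        if line.startswith(("ATOM", "HETATM")):
            key = (line[21], line[22:26].strip(), line[17:20].strip())
            # Columns 67-76 hold the partial charge in PDBQT files.
            totals[key] += float(line[66:76])
    return dict(totals)


def non_integer_residues(totals, tolerance=0.05):
    """Return residues whose total charge is further than tolerance from an integer."""
    return {
        key: round(charge, 3)
        for key, charge in totals.items()
        if abs(charge - round(charge)) > tolerance
    }
```

An empty dictionary from non_integer_residues(residue_charge_totals(text)) corresponds to a clean result in the GUI check; any entry names a residue worth inspecting for missing atoms.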

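Alternatively, the same steps can be done from the command line using the AutoDockTools Python scripts, which is useful if you’re preparing many proteins. The script ships with MGLTools (under AutoDockTools/Utilities24) and is normally run with the MGLTools interpreter, pythonsh, if it is not already executable on your PATH: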

Terminal
# Using the ADT prepare_receptor script
prepare_receptor4.py \
  -r 3HTB_clean.pdb \
  -o 3HTB_receptor.pdbqt \
  -A hydrogens \
  -U nphs_lps_waters_deleteAltB

The flags here mean: -A hydrogens adds hydrogens, -U nphs merges non-polar hydrogens (standard practice), lps merges lone pairs, waters removes any remaining waters, deleteAltB removes alternate conformations keeping only the primary one.
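For many proteins, the command above can be wrapped in a small loop. This is a sketch under stated assumptions: it expects prepare_receptor4.py to be callable by name (on some installs you would pass the full path or launch it through pythonsh instead), it assumes the cleaned files follow the *_clean.pdb naming used in this guide, and build_command and prepare_folder are hypothetical helper names.

```python
import subprocess
from pathlib import Path


def build_command(pdb_path, script="prepare_receptor4.py"):
    """Return the argument list for one cleaned PDB file, using the flags shown above."""
    pdb_path = Path(pdb_path)
    stem = pdb_path.stem
    if stem.endswith("_clean"):
        stem = stem[: -len("_clean")]
    out_path = pdb_path.with_name(stem + "_receptor.pdbqt")
    return [
        script,
        "-r", str(pdb_path),
        "-o", str(out_path),
        "-A", "hydrogens",
        "-U", "nphs_lps_waters_deleteAltB",
    ]


def prepare_folder(folder, script="prepare_receptor4.py"):
    """Run the script once per *_clean.pdb file; raises on the first failure."""
    for pdb_path in sorted(Path(folder).glob("*_clean.pdb")):
        subprocess.run(build_command(pdb_path, script), check=True)
```

Keeping the command construction separate from the subprocess call makes it easy to print and review the exact commands before anything is executed.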

Protonation state: the step most tutorials skip
The protonation states of histidine, aspartate, and glutamate residues depend on pH and local environment, and they affect docking results significantly. AutoDockTools adds hydrogens using simple rules, which are often wrong for active site residues. For important projects, use H++ (biophysics.cs.vt.edu/H++) or PROPKA to predict protonation states at pH 7.4 before adding hydrogens. This is one of the most impactful improvements you can make to a docking workflow.
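Before running a pKa predictor, it can help to list which titratable residues are actually present so you know what to look for near the binding site. The helper below is only a sketch: list_titratable_residues is an invented name, the default residue set is the three types named above, and it says nothing about which protonation state is correct.

```python
def list_titratable_residues(pdb_text, resnames=("HIS", "ASP", "GLU")):
    """Return (chain, residue number, residue name) for each matching residue, in file order."""
    found = []
    for line in pdb_text.splitlines():
        if not line.startswith("ATOM"):
            continue
        resn = line[17:20].strip()
        # Columns 23-27 cover the residue number plus any insertion code.
        key = (line[21], line[22:27].strip(), resn)
        if resn in resnames and key not in found:
            found.append(key)
    return found
```

Cross-reference the returned residue numbers with the binding site residues from the literature; those are the ones whose predicted states deserve a manual look.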

Step 4 — Generate the PDBQT file

PDBQT is an extended PDB format that adds two extra columns: partial charge (Q) and atom type (T). AutoDock Vina requires receptor files in this format. If you used the prepare_receptor4.py script in Step 3 with the -o 3HTB_receptor.pdbqt flag, you already have it. If you used the GUI:

1. In AutoDockTools, go to Grid → Macromolecule → Choose and select your protein.
2. A dialog will appear asking if you want to save the macromolecule. Click Save and name it 3HTB_receptor.pdbqt.

Open the output file in a text editor and check the last two columns of a few ATOM lines. They should look like this:

3HTB_receptor.pdbqt
ATOM 1 N PRO A 1 29.260 24.054 3.839 0.00 0.00 -0.0549 N
ATOM 2 CA PRO A 1 28.628 22.714 3.754 0.00 0.00 0.0622 C
ATOM 3 C PRO A 1 27.112 22.830 3.694 0.00 0.00 0.2991 C
ATOM 4 O PRO A 1 26.509 23.887 3.856 0.00 0.00 -0.2536 OA

The last two columns are the Gasteiger charge and the atom type: the Q and T in PDBQT. Their presence confirms the file was written in PDBQT format. Real files use fixed-width columns; the spacing in the lines above is compressed for display.

The atom types you should see for a correctly prepared protein include: C (aliphatic carbon), A (aromatic carbon), N (nitrogen), NA (hydrogen-bond acceptor nitrogen), OA (hydrogen-bond acceptor oxygen), HD (hydrogen on a donor), and S or SA (sulfur). If you see ? as an atom type anywhere in the file, that atom was not recognized and will cause problems in docking.
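A quick tally of atom types makes this check less error-prone than scanning the file by eye. The sketch below reads the type from columns 78-79; the EXPECTED_PROTEIN_TYPES set is an assumption that simply mirrors the list above, so a legitimate metal ion type would also show up as "unexpected" and needs a human decision.

```python
from collections import Counter

# Assumed set of types for a metal-free protein receptor, mirroring the list above.
EXPECTED_PROTEIN_TYPES = {"C", "A", "N", "NA", "OA", "HD", "S", "SA"}


def atom_type_counts(pdbqt_text):
    """Count AutoDock atom types found in columns 78-79 of ATOM/HETATM records."""
    counts = Counter()
    for line in pdbqt_text.splitlines():
        if line.startswith(("ATOM", "HETATM")):
            atom_type = line[77:79].strip() or "(blank)"
            counts[atom_type] += 1
    return counts


def unexpected_types(counts, expected=EXPECTED_PROTEIN_TYPES):
    """Return the sorted atom types that are not in the expected set."""
    return sorted(set(counts) - set(expected))
```

An empty list from unexpected_types(atom_type_counts(text)) means every atom carries one of the listed types.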

Step 5 — Verify the output

Before using your prepared receptor in a real docking run, do three quick checks:

Check 1: count the ATOM records

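Waters and ligands are stored as HETATM records, so removing them does not change the ATOM count. Compared with the original PDB, the PDBQT should contain the same protein heavy atoms plus the added polar hydrogens, minus any alternate conformations that were deleted. On Linux/macOS: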

Terminal
grep -c "^ATOM" 3HTB_receptor.pdbqt

Compare to the original: grep -c "^ATOM" 3HTB.pdb. The PDBQT count should normally be somewhat higher, because polar hydrogens were added. A clearly lower count points to deleted alternate conformations or to residues lost during preparation.
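Because added hydrogens and removed atoms pull the raw count in opposite directions, it is more informative to count heavy atoms and hydrogens separately. The following is a minimal sketch; count_atoms is a hypothetical helper, and it assumes hydrogens can be recognized by an atom name that starts with H, optionally preceded by a digit (as in 1HB), which holds for standard protein ATOM records.

```python
def count_atoms(text):
    """Return (heavy_atoms, hydrogens) for the ATOM records of PDB or PDBQT text."""
    heavy = 0
    hydrogens = 0
    for line in text.splitlines():
        if not line.startswith("ATOM"):
            continue
        name = line[12:16].strip()
        # Hydrogen names look like "H", "HN", "HD1" or "1HB".
        if name.lstrip("0123456789").startswith("H"):
            hydrogens += 1
        else:
            heavy += 1
    return heavy, hydrogens
```

If the heavy-atom count of 3HTB_receptor.pdbqt matches that of 3HTB_clean.pdb and the hydrogen count is above zero, nothing was lost and hydrogens were added; a drop in heavy atoms should be explainable by removed alternate conformations.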

Check 2: look for unknown atom types

Terminal
grep -E ' \?[[:space:]]*$' 3HTB_receptor.pdbqt

No output means no unknown atom types. Any lines printed here represent atoms that AutoDockTools couldn’t type — fix before docking.

Check 3: visually inspect in PyMOL

Load the PDBQT in PyMOL and confirm the binding site looks correct — the pocket is empty, the protein backbone is intact, and there are no obvious structural problems:

PyMOL command line
load 3HTB_receptor.pdbqt
show sticks, resi 25+49+50+76+80+81+82+84

Those residue numbers are the key active site residues for HIV-1 protease. For your own target, substitute the relevant binding site residues from the literature.

Common mistakes that ruin docking results

These are the preparation errors that appear most often in troubleshooting forums — and that produce bad docking results without obvious error messages.

Leaving water molecules in the binding site
The most common beginner mistake. remove resn HOH removes all waters, but some workflows only remove bulk solvent and leave “structural” waters. Unless you have strong experimental evidence that a specific water is critical for binding, remove all of them. The scoring function handles solvation implicitly.
Keeping alternate conformations
PDB files sometimes contain alternate conformations for residues (indicated by A and B in column 17). AutoDockTools needs a single conformation. If you don’t remove alternates, you may get duplicated atoms or strange atom types. Use the deleteAltB flag in prepare_receptor4.py, or in PyMOL: remove not (alt ''+A) followed by alter all, alt=''.
Not checking protonation states of active site residues
AutoDockTools adds hydrogens using default rules. For most surface residues this is fine. For active site residues — especially histidines, which can be neutral or positively charged depending on local environment — the default is often wrong. A misprotonated catalytic histidine will give systematically bad docking scores for the correct binding mode.
Using a low-resolution structure without checking it
PDB structures above 3.0 Å resolution have significant positional uncertainty in side-chain conformations. If your structure is low-resolution or has high B-factors in the binding site, the binding pocket geometry may not be reliable. Check the resolution in the REMARK section and prefer structures below 2.5 Å when available.
Skipping the validation step entirely
Many people prepare a receptor, run docking, get a score, and trust it. The correct workflow is to self-dock the co-crystallized ligand back into the prepared receptor before running anything new. If the top-ranked pose reproduces the crystal pose (RMSD < 2.0 Å), your preparation is very likely sound. If it doesn’t, revisit the preparation and the search box before blaming the docking algorithm.
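The RMSD test in the last item can be sketched in a few lines. This example makes several simplifying assumptions that should be checked for your own ligand: heavy atoms are matched by atom name (so names must be unique and preserved by ligand preparation), no symmetry correction is applied, and no superposition is done, since both poses already share the receptor's coordinate frame. The function names are illustrative.

```python
import math


def heavy_atoms_by_name(text):
    """Map atom name to (x, y, z) for non-hydrogen atoms in the first model."""
    atoms = {}
    for line in text.splitlines():
        if line.startswith("ENDMDL"):
            break  # multi-model docking output: the first model is the top-ranked pose
        if line.startswith(("ATOM", "HETATM")):
            name = line[12:16].strip()
            if name.lstrip("0123456789").startswith("H"):
                continue
            atoms[name] = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
    return atoms


def pose_rmsd(reference, pose):
    """In-place heavy-atom RMSD in Å between two name-to-coordinate mappings."""
    if not reference or set(reference) != set(pose):
        raise ValueError("poses must contain the same, non-empty set of atom names")
    total = 0.0
    for name, ref_xyz in reference.items():
        total += sum((a - b) ** 2 for a, b in zip(ref_xyz, pose[name]))
    return math.sqrt(total / len(reference))
```

Feed heavy_atoms_by_name the crystal ligand records from the original PDB and the docked output file, then compare the result of pose_rmsd with the 2.0 Å threshold. For ligands with symmetric groups, a symmetry-aware tool will give a fairer number than this sketch.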

What good preparation looks like

A properly prepared receptor file is clean (no waters, no co-crystallized ligands, no alternate conformations), complete (all residues accounted for, polar hydrogens added), correctly charged (Gasteiger charges assigned, per-residue totals close to integers), and typed (no ? atom types in the PDBQT). Get those four things right and you remove the most common preparation-related causes of bad docking results.

The next tutorial covers ligand preparation — the other half of the equation — and then runs the full docking calculation using both prepared files.

  • PDB structure downloaded and REMARK section reviewed
  • Waters, co-crystallized ligand, and crystallization artifacts removed
  • Alternate conformations removed
  • Polar hydrogens added
  • Gasteiger charges assigned — total is close to an integer
  • PDBQT file generated with no ? atom types
  • Structure visually inspected in PyMOL — binding site is empty and intact
