How to Prepare a Protein for Molecular Docking: Complete Step-by-Step Guide
Protein preparation is where most docking projects succeed or fail — and where most beginners spend the most time confused. A raw PDB file cannot go straight into AutoDock Vina. This guide walks through every step from downloading the structure to a verified, docking-ready PDBQT file.
What protein preparation actually involves
When you download a protein structure from the RCSB Protein Data Bank, what you get is a raw crystallographic snapshot. It is missing information that docking software needs, and it contains things that will break your docking run if left in. Preparation fixes both problems.
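Before a protein is ready for AutoDock Vina, the following needs to happen: waters, the co-crystallized ligand, and crystallization artifacts are removed; alternate conformations are resolved; polar hydrogens are added; Gasteiger charges are assigned; and the result is written as a PDBQT file.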
Each step matters. Skipping or rushing any one of them is the most common cause of bad docking results — not the docking algorithm itself. The old saying in computational chemistry applies here: garbage in, garbage out.
Tools you need
You need three programs. All are free for academic use:
- PyMOL — for visual inspection and cleaning the structure. The open-source version is available via conda; the educational version is free at pymol.org for students and academics.
- AutoDockTools (ADT) — for assigning Gasteiger charges and generating the PDBQT file. Part of the MGLTools package, available free at ccsb.scripps.edu/mgltools.
- AutoDock Vina — installed in the previous tutorial. You won’t run a docking calculation here, but you’ll need it installed to verify the output file format.
Install PyMOL via conda if you don’t have it:
conda activate docking
conda install -c conda-forge pymol-open-source
Step 1 — Download the structure from the PDB
Go to rcsb.org and search for your target. For this tutorial, search for 3HTB. On the structure page, click Download Files → PDB Format. Save the file as 3HTB.pdb in a dedicated working folder — something like ~/docking/3HTB/.
You can also download directly from the command line:
mkdir -p ~/docking/3HTB && cd ~/docking/3HTB
wget https://files.rcsb.org/download/3HTB.pdb
Before touching anything else, open the PDB file in a text editor and read the HEADER and REMARK sections at the top. These tell you the resolution of the structure (REMARK 2), what organism it comes from, and, crucially, which ligands and cofactors are present (the HET and HETNAM records). You need to know this before you start removing things.
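If you prefer not to read the header by eye, the same information can be pulled out with a few lines of Python. This is a minimal sketch, assuming the file follows the standard PDB layout (a REMARK 2 resolution line, and HET records with the residue name in columns 8-10). The function name summarize_pdb_header is made up for this example.

```python
import re


def summarize_pdb_header(pdb_text):
    """Return (resolution in angstroms or None, sorted list of HET residue names)."""
    resolution = None
    het_names = set()
    for line in pdb_text.splitlines():
        if line.startswith("REMARK   2 RESOLUTION."):
            # Structures without a resolution (e.g. NMR) have no number here
            match = re.search(r"(\d+\.\d+)\s+ANGSTROMS", line)
            if match:
                resolution = float(match.group(1))
        elif line.startswith("HET   "):
            # HET records carry the residue name in columns 8-10
            het_names.add(line[7:10].strip())
    return resolution, sorted(het_names)


if __name__ == "__main__":
    with open("3HTB.pdb") as handle:
        print(summarize_pdb_header(handle.read()))
```

The list of residue names it returns is the starting point for deciding what to remove in Step 2.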
Step 2 — Inspect and clean in PyMOL
Open PyMOL and load your structure:
load 3HTB.pdb
Spend two minutes looking at the structure before doing anything. Use the mouse to rotate it. Identify the binding site — in 3HTB, it’s clearly visible as a cavity where the co-crystallized inhibitor sits. Note the following before you start removing things:
- Is there a co-crystallized ligand? (Yes — 3HTB has the inhibitor ARQ bound. You’ll remove it.)
- Are there multiple chains? (3HTB is a homodimer with chains A and B — you need both for this target.)
- Are there structural water molecules near the binding site that might be important? (Advanced — ignore for now.)
- Are there metal ions or cofactors that are biologically relevant? (Check the literature before removing these.)
Remove water molecules
PDB files include crystallographic water molecules as HETATM records with residue name HOH. For standard docking, these are removed:
remove resn HOH
Remove the co-crystallized ligand
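The bound inhibitor must be removed, because you’re docking your own ligand into the empty pocket. This tutorial uses the residue name ARQ for the inhibitor in 3HTB; confirm it against the HETATM records of the file you downloaded before running the command: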
remove resn ARQ
If you don’t know the residue name of the ligand in your structure, list all heteroatoms first:
select hetatms, hetatm
iterate hetatms, print(resn)
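The PyMOL commands above print one line per heteroatom, which gets long for a structure with many waters. As an alternative, here is a small plain-Python sketch that tallies HETATM records by residue name. It assumes standard fixed-column PDB records (residue name in columns 18-20); hetero_residue_counts is a hypothetical helper name.

```python
from collections import Counter


def hetero_residue_counts(pdb_text):
    """Count HETATM atoms per residue name (columns 18-20 of each record)."""
    counts = Counter()
    for line in pdb_text.splitlines():
        if line.startswith("HETATM"):
            counts[line[17:20].strip()] += 1
    return counts


if __name__ == "__main__":
    with open("3HTB.pdb") as handle:
        for name, n_atoms in hetero_residue_counts(handle.read()).most_common():
            print(name, n_atoms)
```

Anything in the output other than HOH deserves a quick look in the literature before you delete it.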
Remove any other unwanted heteroatoms
Check if any other HETATM records remain — buffer molecules, cryoprotectants, or crystallization additives that are artifacts of the experimental conditions rather than biology. Common ones include SO4 (sulfate), GOL (glycerol), PEG (polyethylene glycol), and EDO (ethanediol). Remove any that are not biologically relevant:
remove resn SO4+GOL+PEG+EDO
Save the cleaned structure
save 3HTB_clean.pdb
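When many structures need the same cleanup, the removal steps can also be scripted without PyMOL. The following is a rough sketch under the assumption that the input uses standard fixed-column ATOM/HETATM records; strip_residues is a name invented here, and the list of residue names to drop is yours to choose per target. It leaves all other record types untouched, so any CONECT or ANISOU lines that refer to removed atoms stay in the file.

```python
def strip_residues(pdb_text, unwanted=("HOH",)):
    """Drop ATOM/HETATM records whose residue name (columns 18-20) is unwanted."""
    kept = []
    for line in pdb_text.splitlines():
        is_atom_record = line.startswith(("ATOM", "HETATM"))
        if is_atom_record and line[17:20].strip() in unwanted:
            continue  # skip waters, ligands, or additives named in `unwanted`
        kept.append(line)
    return "\n".join(kept) + "\n"


if __name__ == "__main__":
    with open("3HTB.pdb") as handle:
        cleaned = strip_residues(handle.read(), unwanted=("HOH", "SO4", "GOL", "EDO"))
    with open("3HTB_clean.pdb", "w") as handle:
        handle.write(cleaned)
```

Whichever route you take, open the cleaned file in PyMOL afterwards and check that the pocket is empty.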
Step 3 — Add hydrogens and assign charges
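X-ray crystallography usually cannot resolve hydrogen atoms: with a single electron each, they scatter X-rays too weakly to show up at typical resolutions. But hydrogen atoms are essential for docking because they determine where hydrogen bonds can form, so you need to add them computationally.
You also need to assign Gasteiger partial charges to every atom. AutoDock4’s scoring function uses these charges to estimate electrostatic interactions between ligand and receptor. Vina’s default scoring function does not use them, but the charge column is part of the PDBQT format and the standard preparation tools fill it in.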
Both steps happen in AutoDockTools. Open ADT (the MGLTools GUI), load 3HTB_clean.pdb, then add hydrogens and compute Gasteiger charges from its menus.
Alternatively, the same steps can be done from the command line using the AutoDockTools Python scripts, which is useful if you’re preparing many proteins:
# Using the ADT prepare_receptor script
prepare_receptor4.py \
-r 3HTB_clean.pdb \
-o 3HTB_receptor.pdbqt \
-A hydrogens \
-U nphs_lps_waters_deleteAltB
The flags here mean: -A hydrogens adds hydrogens, and -U takes a list of cleanup steps joined by underscores. In that list, nphs merges non-polar hydrogens into their heavy atoms (standard practice), lps merges lone pairs, waters removes any remaining waters, and deleteAltB removes alternate conformations, keeping only the primary one.
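For a folder of cleaned structures, the same command can be generated in a loop. This is a sketch rather than a tested pipeline: it assumes prepare_receptor4.py is executable and on your PATH (depending on the install, it may instead need to be launched through the Python interpreter bundled with MGLTools), and it assumes the *_clean.pdb naming used in this guide. The function names are made up for this example.

```python
import subprocess
from pathlib import Path


def receptor_command(pdb_path, script="prepare_receptor4.py"):
    """Build the prepare_receptor4.py argument list for one cleaned PDB file."""
    pdb_path = Path(pdb_path)
    # 3HTB_clean.pdb -> 3HTB_receptor.pdbqt, written next to the input
    out_name = pdb_path.stem.replace("_clean", "") + "_receptor.pdbqt"
    return [
        script,
        "-r", str(pdb_path),
        "-o", str(pdb_path.with_name(out_name)),
        "-A", "hydrogens",
        "-U", "nphs_lps_waters_deleteAltB",
    ]


def prepare_all(folder, dry_run=True):
    """Print (dry_run) or run the command for every *_clean.pdb in folder."""
    for pdb_path in sorted(Path(folder).glob("*_clean.pdb")):
        cmd = receptor_command(pdb_path)
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)  # stop at the first failure
```

Calling prepare_all with dry_run=True first lets you read the commands before anything is executed.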
Step 4 — Generate the PDBQT file
PDBQT is an extended PDB format that adds two extra columns: partial charge (Q) and atom type (T). AutoDock Vina requires receptor files in this format. If you used the prepare_receptor4.py script in Step 3 with the -o 3HTB_receptor.pdbqt flag, you already have it. If you used the GUI, save the receptor from ADT as 3HTB_receptor.pdbqt.
Open the output file in a text editor and check the last two columns of a few ATOM lines. They should look like this:
ATOM      2  CA  PRO A   1      28.628  22.714   3.754  0.00  0.00     0.0622 C
ATOM      3  C   PRO A   1      27.112  22.830   3.694  0.00  0.00     0.2991 C
ATOM      4  O   PRO A   1      26.509  23.887   3.856  0.00  0.00    -0.2536 OA
The last two columns are the Gasteiger charge and atom type: the Q and T in PDBQT. Their presence confirms that the file was written in PDBQT format.
The atom types you should see for a correctly prepared protein include: C (aliphatic carbon), A (aromatic carbon), N (nitrogen), OA (hydrogen-bond acceptor oxygen), HD (polar hydrogen on a donor), and SA or S (sulfur). If you see ? as an atom type anywhere in the file, that atom was not recognized and will cause problems in docking.
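A quick tally of the atom-type column makes unexpected entries easy to spot. The sketch below assumes the atom type is the last whitespace-separated field on each ATOM or HETATM line, as in the sample lines above; atom_type_counts is a name chosen for this example.

```python
from collections import Counter


def atom_type_counts(pdbqt_text):
    """Tally the last field (the atom type) of every ATOM/HETATM line."""
    counts = Counter()
    for line in pdbqt_text.splitlines():
        if line.startswith(("ATOM", "HETATM")):
            counts[line.split()[-1]] += 1
    return counts


if __name__ == "__main__":
    with open("3HTB_receptor.pdbqt") as handle:
        counts = atom_type_counts(handle.read())
    for atom_type, n_atoms in sorted(counts.items()):
        print(atom_type, n_atoms)
    if "?" in counts:
        print("Unknown atom types present:", counts["?"])
```

A type you do not recognize, or a ? entry, is the cue to go back and look at the residue it belongs to.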
Step 5 — Verify the output
Before using your prepared receptor in a real docking run, do three quick checks:
Check 1: count the ATOM records
The comparison only makes sense if you count the same thing in both files. Waters and ligands are stored as HETATM records, so counting ATOM lines counts protein atoms only. On Linux/macOS:
grep -c "^ATOM" 3HTB_receptor.pdbqt
Compare to the original: grep -c "^ATOM" 3HTB.pdb. The PDBQT count should be close to the original protein atom count, plus the added polar hydrogens and minus any alternate conformations that were removed. A much lower count suggests that residues were lost during preparation.
Check 2: look for unknown atom types
grep -E '^(ATOM|HETATM).* \? *$' 3HTB_receptor.pdbqt
No output means no unknown atom types. Any lines printed here represent atoms that AutoDockTools couldn’t type — fix before docking.
Check 3: visually inspect in PyMOL
Load the PDBQT in PyMOL and confirm the binding site looks correct — the pocket is empty, the protein backbone is intact, and there are no obvious structural problems:
load 3HTB_receptor.pdbqt
show sticks, resi 25+49+50+76+80+81+82+84
Those residue numbers are the key active site residues for HIV-1 protease. For your own target, substitute the relevant binding site residues from the literature.
Common mistakes that ruin docking results
These are the preparation errors that appear most often in troubleshooting forums — and that produce bad docking results without obvious error messages.
remove resn HOH removes all waters, but some workflows only remove bulk solvent and leave “structural” waters. Unless you have strong experimental evidence that a specific water is critical for binding, remove all of them. The scoring function handles solvation implicitly.
Alternate conformations also need to be removed, either with the deleteAltB flag in prepare_receptor4.py, or in PyMOL: remove not (alt ''+A) followed by alter all, alt=''.
What good preparation looks like
A properly prepared receptor file is clean (no waters, no co-crystallized ligands, no alternate conformations), complete (all residues accounted for, polar hydrogens added), correctly charged (Gasteiger charges assigned, integer total charge), and typed (no ? atom types in the PDBQT). Get those four things right and the receptor is no longer the likely source of bad docking results.
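The integer-charge criterion is easy to check numerically. Here is a minimal sketch, assuming the partial charge is the second-to-last whitespace-separated field on each ATOM or HETATM line of the PDBQT. The 0.05 tolerance is an arbitrary choice for illustration, not a published threshold, and both function names are hypothetical.

```python
def total_charge(pdbqt_text):
    """Sum the partial-charge column over all ATOM/HETATM lines."""
    total = 0.0
    for line in pdbqt_text.splitlines():
        if line.startswith(("ATOM", "HETATM")):
            total += float(line.split()[-2])
    return total


def is_near_integer(value, tolerance=0.05):
    """True if value is within `tolerance` of a whole number (tolerance is an assumption)."""
    return abs(value - round(value)) <= tolerance


if __name__ == "__main__":
    with open("3HTB_receptor.pdbqt") as handle:
        charge = total_charge(handle.read())
    print("total charge: %.3f" % charge, "| near integer:", is_near_integer(charge))
```

A total far from a whole number usually points to missing atoms or an incomplete residue.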
The next tutorial covers ligand preparation — the other half of the equation — and then runs the full docking calculation using both prepared files.
- PDB structure downloaded and REMARK section reviewed
- Waters, co-crystallized ligand, and crystallization artifacts removed
- Alternate conformations removed
- Polar hydrogens added
- Gasteiger charges assigned — total is close to an integer
- PDBQT file generated with no ? atom types
- Structure visually inspected in PyMOL, with the binding site empty and intact