How to Find and Analyze Protein-Protein Interfaces with BioPython

How to Find and Analyze Protein-Protein Interfaces with BioPython

Protein-protein interfaces determine binding specificity, drug targetability, and the functional logic of multi-subunit complexes. BioPython’s NeighborSearch class finds interface residues in milliseconds — and from there you can extract contact pairs, compute buried surface area, and produce a complete interface report programmatically.

The contact-distance approach to interface finding

The standard way to define a protein-protein interface computationally is by distance: any residue in chain A with at least one atom within a defined cutoff distance of any atom in chain B is an interface residue. The choice of cutoff affects what you capture.

4 Å
Tight contacts
Direct atomic contacts. H-bonds, salt bridges. Only the closest interacting atoms.
5 Å
Standard interface
The most common cutoff in the literature. Captures all direct contacts and close van der Waals.
8 Å
Extended interface
Includes second-shell residues that influence binding indirectly. Used for hotspot prediction.

Five ångströms is the most widely used cutoff in published interface analyses and is the default we’ll use here. Use 4 Å when you specifically want direct atomic contacts (useful for identifying H-bond partners before calculating distances), and 8 Å when studying allosteric networks or predicting mutation effects that propagate through the interface.

NeighborSearch — the right tool for the job

NeighborSearch builds a k-d tree from a list of atoms and then finds all atoms within a specified radius of a query point in O(log n) time. For interface analysis with thousands of atoms, this is orders of magnitude faster than a nested loop.

Python
from Bio.PDB import PDBParser, NeighborSearch

parser = PDBParser(QUIET=True)
structure = parser.get_structure("complex", "complex.pdb")
model = structure[0]

# Get all atoms from chain B to search against
chain_b_atoms = list(model["B"].get_atoms())

# Build the k-d tree from chain B atoms
ns = NeighborSearch(chain_b_atoms)

# Query: which chain B atoms are within 5 Å of this specific atom?
query_atom = model["A"][100]["CA"]
nearby = ns.search(query_atom.get_coord(), 5.0, level="A")
print(f"Chain B atoms within 5Å of A-100-CA: {len(nearby)}")

The level argument controls what search() returns. Use "A" for individual atoms, "R" for complete residues, or "C" for chains. For interface analysis, "R" is usually the most useful — it returns complete residue objects rather than individual atoms, so you don’t need to deduplicate.

level=ReturnsBest for
“A”Individual Atom objectsFinding specific contact atoms, H-bond donors/acceptors
“R”Complete Residue objectsInterface residue lists, contact maps
“C”Chain objectsChecking which chains are in contact
“M”Model objectsRarely useful for interface analysis
“S”Structure objectsOnly returns the whole structure if any atom is in range

Finding interface residues

Python
def get_interface_residues(structure, chain_a_id, chain_b_id, cutoff=5.0):
    """
    Find residues at the interface between two chains.
    Returns (interface_A, interface_B) — sets of Residue objects.
    """
    model = structure[0]
    chain_a = model[chain_a_id]
    chain_b = model[chain_b_id]

    # Build NeighborSearch from chain B atoms
    ns_b = NeighborSearch(list(chain_b.get_atoms()))
    # Build NeighborSearch from chain A atoms (for finding chain B contacts)
    ns_a = NeighborSearch(list(chain_a.get_atoms()))

    interface_a = set()
    interface_b = set()

    # For each residue in chain A, check if any of its atoms are within cutoff of chain B
    for res_a in chain_a.get_residues():
        if res_a.id[0] != " ": continue   # skip HETATM
        for atom in res_a.get_atoms():
            nearby_b = ns_b.search(atom.get_coord(), cutoff, level="R")
            if nearby_b:
                interface_a.add(res_a)
                interface_b.update(nearby_b)
                break   # one atom contact is enough to flag the residue

    return interface_a, interface_b

# Run the analysis
iface_a, iface_b = get_interface_residues(structure, "A", "B", cutoff=5.0)

print(f"Interface residues — Chain A: {len(iface_a)}")
print(f"Interface residues — Chain B: {len(iface_b)}")

# Print residue identities
print("\nChain A interface:")
for res in sorted(iface_a, key=lambda r: r.id[1]):
    print(f"  {res.get_resname()} {res.id[1]}")
# SER 18
# TYR 36
# GLU 72 ...

Extracting specific contact pairs

For a detailed interface report — especially when you need to identify which specific residues in chain A contact which specific residues in chain B — extract the contacts as pairs rather than as flat sets:

Python
def get_contact_pairs(structure, chain_a_id, chain_b_id, cutoff=5.0):
    """Return a list of (resA, resB, min_distance) contact pairs."""
    model = structure[0]
    chain_a = model[chain_a_id]
    chain_b = model[chain_b_id]
    ns_b = NeighborSearch(list(chain_b.get_atoms()))

    pairs = {}   # key: (resi_A, resi_B) → min distance seen

    for res_a in chain_a.get_residues():
        if res_a.id[0] != " ": continue
        for atom_a in res_a.get_atoms():
            nearby_atoms = ns_b.search(atom_a.get_coord(), cutoff, level="A")
            for atom_b in nearby_atoms:
                res_b = atom_b.get_parent()
                if res_b.id[0] != " ": continue
                dist = atom_a - atom_b
                key = (res_a.id[1], res_b.id[1])
                if key not in pairs or dist < pairs[key]["min_dist"]:
                    pairs[key] = {
                        "resn_a":   res_a.get_resname(),
                        "resi_a":   res_a.id[1],
                        "resn_b":   res_b.get_resname(),
                        "resi_b":   res_b.id[1],
                        "min_dist": round(dist, 2),
                    }

    return sorted(pairs.values(), key=lambda x: x["min_dist"])

contact_pairs = get_contact_pairs(structure, "A", "B", cutoff=5.0)
print(f"Total contact pairs: {len(contact_pairs)}")
print("Closest 5 contacts:")
for p in contact_pairs[:5]:
    print(f"  {p['resn_a']}{p['resi_a']} — {p['resn_b']}{p['resi_b']}: {p['min_dist']} Å")
# GLU72 — LYS156: 2.81 Å
# TYR36 — ASN201: 3.14 Å

Estimating interface area

A rough estimate of buried surface area — the gold-standard metric for interface size — can be approximated by counting interface atoms, since each atom contributes roughly 10–25 ų of buried area. For a rigorous buried surface area calculation, use DSSP or FreeSASA (a Python binding available via pip install freesasa). The atom-count approach below is suitable for comparative analysis when precise values aren’t required:

Python
def estimate_interface_area(interface_residues):
    """
    Rough BSA estimate: ~25 Ų per heavy atom at interface.
    Suitable for relative comparisons; use FreeSASA for publication values.
    """
    total_atoms = sum(
        sum(1 for atom in res.get_atoms() if atom.element != "H")
        for res in interface_residues
    )
    return total_atoms * 25   # ų

area_a = estimate_interface_area(iface_a)
area_b = estimate_interface_area(iface_b)
print(f"Estimated interface area (chain A): ~{area_a} Ų")
print(f"Estimated interface area (chain B): ~{area_b} Ų")
print(f"Total interface area estimate:      ~{area_a + area_b} Ų")

# Typical values for reference:
# Small interface (transient):  ~500 – 1500 Ų
# Medium interface:            ~1500 – 3000 Ų
# Large, stable interface:     > 3000 Ų
For publication-quality buried surface area values
Install FreeSASA (pip install freesasa) and calculate the actual solvent-accessible surface area of each chain alone vs in the complex. BSA = (SASA_A + SASA_B − SASA_complex) / 2. This gives the standard two-chain BSA value that is expected in protein interface papers. The atom-count estimate above is useful for quick comparisons but should not be reported as BSA in a publication.

Classifying contacts by type

Python
# Classify contacts by interaction type based on residue chemistry
CHARGED_POS = {"ARG", "LYS", "HIS"}
CHARGED_NEG = {"ASP", "GLU"}
POLAR       = {"SER", "THR", "ASN", "GLN", "TYR", "CYS"}
HYDROPHOBIC = {"ALA", "VAL", "ILE", "LEU", "MET", "PHE", "TRP", "PRO"}

def classify_contact(resn_a, resn_b):
    if (resn_a in CHARGED_POS and resn_b in CHARGED_NEG) or \
       (resn_a in CHARGED_NEG and resn_b in CHARGED_POS):
        return "salt_bridge"
    if resn_a in HYDROPHOBIC and resn_b in HYDROPHOBIC:
        return "hydrophobic"
    if resn_a in POLAR or resn_b in POLAR:
        return "polar"
    return "other"

for p in contact_pairs:
    p["contact_type"] = classify_contact(p["resn_a"], p["resn_b"])

Tabulating and plotting results

Python
import pandas as pd
import matplotlib.pyplot as plt

# Build a DataFrame from contact pairs
df = pd.DataFrame(contact_pairs)
print(df.head(10).to_string(index=False))
df.to_csv("interface_contacts.csv", index=False)

# Contact type breakdown
type_counts = df["contact_type"].value_counts()
print("\nContact type breakdown:")
print(type_counts)

# Plot: distance distribution
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(11, 4))

ax1.hist(df["min_dist"], bins=20, color="#4a7fc1", edgecolor="white")
ax1.set_xlabel("Minimum distance (Å)")
ax1.set_ylabel("Number of contact pairs")
ax1.set_title("Interface contact distance distribution")
ax1.axvline(3.5, color="#0d6e54", linestyle="--", label="H-bond threshold")
ax1.legend(fontsize=9)

colors = {"salt_bridge": "#E24B4A", "hydrophobic": "#ba7517",
          "polar": "#4a7fc1", "other": "#928e85"}
ax2.bar(type_counts.index, type_counts.values,
        color=[colors.get(t, "#928e85") for t in type_counts.index])
ax2.set_xlabel("Contact type")
ax2.set_ylabel("Count")
ax2.set_title("Contact types at interface")

plt.tight_layout()
plt.savefig("interface_analysis.png", dpi=300, bbox_inches="tight")
plt.close()
print("Saved interface_analysis.png")

Complete interface analysis function

analyze_interface.py — end-to-end pipeline
Load PDB → find interface → extract contacts → classify → export CSV + plots
from Bio.PDB import PDBParser, NeighborSearch
import pandas as pd

def analyze_interface(pdb_path, chain_a="A", chain_b="B", cutoff=5.0):
    """Full interface analysis: returns summary dict and contact DataFrame."""
    structure = PDBParser(QUIET=True).get_structure("s", pdb_path)
    model = structure[0]
    ns_b = NeighborSearch(list(model[chain_b].get_atoms()))

    contacts, iface_a, iface_b = [], set(), set()

    for res_a in model[chain_a].get_residues():
        if res_a.id[0] != " ": continue
        min_dist_this_res = cutoff + 1
        contact_res_b = None
        for atom_a in res_a.get_atoms():
            for atom_b in ns_b.search(atom_a.get_coord(), cutoff, "A"):
                d = atom_a - atom_b
                res_b = atom_b.get_parent()
                if res_b.id[0] != " ": continue
                iface_a.add(res_a); iface_b.add(res_b)
                if d < min_dist_this_res:
                    min_dist_this_res = d; contact_res_b = res_b
        if contact_res_b:
            contacts.append({
                "resn_a": res_a.get_resname(), "resi_a": res_a.id[1],
                "resn_b": contact_res_b.get_resname(),
                "resi_b": contact_res_b.id[1],
                "min_dist": round(min_dist_this_res, 2),
            })

    df = pd.DataFrame(contacts).sort_values("min_dist").reset_index(drop=True)
    summary = {
        "interface_residues_A": len(iface_a),
        "interface_residues_B": len(iface_b),
        "total_contacts": len(df),
        "mean_contact_dist": round(df["min_dist"].mean(), 2),
        "close_contacts_lt4": (df["min_dist"] < 4.0).sum(),
    }
    return summary, df

summary, contacts_df = analyze_interface("complex.pdb")
for k, v in summary.items():
    print(f"{k}: {v}")
contacts_df.to_csv("interface_contacts.csv", index=False)
Asymmetric contacts — always analyze both directions
The function above finds all chain A residues that contact chain B. Symmetric analysis (also finding all chain B residues that contact chain A) is handled by the iface_b.add(res_b) line inside the loop. If you only search in one direction — atoms of A against all of B — you may miss chain B residues at the edge of the interface that have close contacts with chain A but whose atoms aren’t the closest ones. The complete function above handles this correctly because it searches A against B and records both sides.

Interface analysis in one paragraph

Build a NeighborSearch from chain B atoms, then iterate over chain A residues — for each atom in each chain A residue, search for chain B atoms within your cutoff (5 Å is standard). Any chain A residue with at least one atom hit is an interface residue; collect the chain B residue on the other side of each contact to build the full picture. Extract contact pairs as (resA, resB, min_distance) tuples, classify them by residue chemistry, load into a pandas DataFrame, and export to CSV for the paper. For publication-quality buried surface area values use FreeSASA — the atom-count approximation is suitable only for comparative screening.

Last updated on

Similar Posts

Leave a Reply

Your email address will not be published. Required fields are marked *