How to Calculate Protein Properties with BioPython: Molecular Weight, GRAVY and Secondary Structure

How to Calculate Protein Properties with BioPython: Molecular Weight, GRAVY and Secondary Structure

BioPython’s ProteinAnalysis class turns any protein sequence into a dashboard of physicochemical properties with a handful of method calls. This tutorial covers every property the class computes, what each value means biologically, and how to run batch analysis across a set of proteins.

Getting the sequence from a PDB structure

ProteinAnalysis works from a plain amino acid sequence string — one-letter codes, uppercase, no gaps or special characters. If you’re starting from a PDB file, BioPython’s PPBuilder extracts the sequence from the structure coordinates:

Python
from Bio.PDB import PDBParser
from Bio.PDB.Polypeptide import PPBuilder

# Load structure
parser = PDBParser(QUIET=True)
structure = parser.get_structure("protein", "protein.pdb")

# Extract sequence from 3D coordinates
ppb = PPBuilder()
sequence = ""
for pp in ppb.build_peptides(structure):
    sequence += str(pp.get_sequence())

print(sequence[:30])           # first 30 residues
print(f"Length: {len(sequence)} residues")

# Alternatively, use a raw sequence string directly
sequence = "MGSSHHHHHHSSGLVPRGSHMKKLVVVGAGGVGKSALTIQLIQNHFVDEYDPT"
PPBuilder vs direct sequence input
PPBuilder reconstructs the sequence from atom coordinates — it only includes residues that are actually present in the structure file. This means missing loops, disordered regions, or residues with no electron density won’t appear. If you need the complete canonical sequence (including disordered regions), fetch it from UniProt instead: from Bio import SeqIO; record = SeqIO.read("uniprot.fasta", "fasta").

The ProteinAnalysis class

All property calculations flow through a single class: ProteinAnalysis from Bio.SeqUtils.ProtParam. Create one instance per protein and call as many methods as you need from it — each method does a lightweight calculation from the sequence, so there’s no cost to calling multiple methods on the same object.

Python
from Bio.SeqUtils.ProtParam import ProteinAnalysis

# Create the analysis object — pass a plain uppercase sequence string
analysis = ProteinAnalysis(sequence)

# All properties are now available as method calls
print(analysis.molecular_weight())
print(analysis.isoelectric_point())
print(analysis.gravy())
Molecular weight
molecular_weight()
Mass of the protein in Daltons, accounting for amino acid composition and water loss at peptide bonds.
Typical range: 10,000–150,000 Da
Isoelectric point
isoelectric_point()
pH at which the protein carries no net charge. Key for gel electrophoresis, crystallization conditions, and solubility.
Most proteins: pH 4.5–8.5
GRAVY score
gravy()
Grand average of hydropathicity. Positive = hydrophobic (membrane proteins). Negative = hydrophilic (soluble proteins).
Soluble: −2 to 0 · Membrane: 0 to +2
Instability index
instability_index()
Predicts in vitro stability. Index below 40 = stable; above 40 = unstable. Based on dipeptide content.
Stable: < 40 · Unstable: ≥ 40
Amino acid composition
get_amino_acids_percent()
Fraction of each of the 20 standard amino acids as a dictionary. Values sum to 1.0.
Useful for comparing protein families
Secondary structure
secondary_structure_fraction()
Predicted fractions of helix, turn, and sheet based on the Chou-Fasman statistical model.
Returns (helix, turn, sheet) tuple

Molecular weight

Python
mw = analysis.molecular_weight()
print(f"Molecular weight: {mw:.1f} Da")
print(f"Molecular weight: {mw/1000:.1f} kDa")
# Molecular weight: 28642.3 Da
# Molecular weight: 28.6 kDa

BioPython calculates MW by summing the monoisotopic residue masses of each amino acid and subtracting water (18.02 Da) for each peptide bond formed. The result matches what you’d report as the “theoretical molecular weight” in a methods section — the same value that protein expression databases like UniProt report and that you’d compare against an SDS-PAGE band or mass spectrometry result.

Isoelectric point

Python
pi = analysis.isoelectric_point()
print(f"Isoelectric point: {pi:.2f}")
# Isoelectric point: 6.84

# Charge at a specific pH
charge_at_7 = analysis.charge_at_pH(7.0)
print(f"Charge at pH 7: {charge_at_7:.2f}")

# Practical use: will this protein bind to an anion exchange column at pH 8?
if pi < 8.0:
    print("Protein is negatively charged at pH 8 — will bind anion exchanger")
else:
    print("Protein is positively charged at pH 8 — use cation exchanger")

Amino acid composition

Python
import matplotlib.pyplot as plt

composition = analysis.get_amino_acids_percent()

# Print ranked by abundance
print("Amino acid composition (top 5):")
for aa, pct in sorted(composition.items(), key=lambda x: -x[1])[:5]:
    print(f"  {aa}: {pct*100:.1f}%")
# LEU: 9.8%
# GLY: 8.1%
# ALA: 7.3%

# Count specific residue types
total_charged = (composition.get("ASP", 0) + composition.get("GLU", 0) +
                 composition.get("LYS", 0) + composition.get("ARG", 0))
print(f"Charged residue fraction: {total_charged*100:.1f}%")

# Bar chart of full composition
aas = sorted(composition.keys())
fracs = [composition[aa] * 100 for aa in aas]
plt.figure(figsize=(12, 4))
plt.bar(aas, fracs, color="#4a7fc1")
plt.ylabel("Percentage (%)")
plt.title("Amino acid composition")
plt.tight_layout()
plt.savefig("aa_composition.png", dpi=150)
plt.close()

GRAVY score — hydrophobicity

The GRAVY (Grand Average of Hydropathicity) score is calculated by averaging the Kyte-Doolittle hydropathy values of every residue in the sequence. It tells you at a glance whether a protein is likely to be membrane-associated or soluble.

−2.0 (very hydrophilic)0.0+2.0 (very hydrophobic)
Negative GRAVY
Soluble, cytoplasmic proteins. E.g. GFP (−0.5), most kinases (−0.4 to −0.7).
Near zero
Peripheral membrane proteins, some secreted proteins. Borderline solubility.
Positive GRAVY
Integral membrane proteins. E.g. GPCRs (+0.3 to +0.7), ion channels (+0.2 to +0.8).
Python
gravy = analysis.gravy()
print(f"GRAVY score: {gravy:.3f}")
# GRAVY score: -0.412

# Interpret automatically
if gravy < -0.2:
    classification = "soluble / hydrophilic"
elif gravy > 0.2:
    classification = "membrane / hydrophobic"
else:
    classification = "borderline"
print(f"Classification: {classification}")

Instability index

Python
ii = analysis.instability_index()
print(f"Instability index: {ii:.1f}")
stable = "stable" if ii < 40 else "unstable"
print(f"Predicted stability: {stable}")
# Instability index: 34.2
# Predicted stability: stable
Instability index is a rough predictor
The instability index was calibrated on a set of proteins measured in specific in vitro conditions. It performs well for obvious cases but has limited accuracy for borderline values near the threshold of 40. For proteins where stability is a critical experimental decision, complement this calculation with expression testing and thermal shift assays rather than relying solely on the index.

Secondary structure fraction

secondary_structure_fraction() uses the Chou-Fasman statistical propensity values — each amino acid has a known tendency to adopt helix, sheet, or turn conformation — and returns the predicted fraction of residues in each class as a three-tuple: (helix, turn, sheet). These are sequence-based predictions, not structure-derived values.

Python
helix, turn, sheet = analysis.secondary_structure_fraction()
coil = 1.0 - helix - turn - sheet

print(f"Helix fraction: {helix*100:.1f}%")
print(f"Turn fraction:  {turn*100:.1f}%")
print(f"Sheet fraction: {sheet*100:.1f}%")
# Helix fraction: 38.4%
# Turn fraction:  26.1%
# Sheet fraction: 14.7%
Example output — alpha-helical protein
Helix 38.4%
Sheet 14.7%
Turn + coil 46.9%
Sequence-based vs structure-based fractions
secondary_structure_fraction() predicts secondary structure from sequence propensities — it’s useful as a quick sanity check but should not be reported as the actual secondary structure content of a solved structure. For structure-derived secondary structure fractions (from X-ray or cryo-EM data), use DSSP: from Bio.PDB.DSSP import DSSP, which assigns secondary structure from the actual 3D coordinates.

Batch analysis across multiple proteins

The most practical use of ProteinAnalysis is running the same set of calculations across every protein in a dataset and collecting the results into a pandas DataFrame for comparison or export:

Python
import pandas as pd
from Bio.PDB import PDBParser
from Bio.PDB.Polypeptide import PPBuilder
from Bio.SeqUtils.ProtParam import ProteinAnalysis

def get_sequence(pdb_path):
    parser = PDBParser(QUIET=True)
    structure = parser.get_structure("p", pdb_path)
    seq = ""
    for pp in PPBuilder().build_peptides(structure):
        seq += str(pp.get_sequence())
    return seq

def analyze_protein(name, pdb_path):
    seq = get_sequence(pdb_path)
    if not seq:
        return None
    a = ProteinAnalysis(seq)
    helix, turn, sheet = a.secondary_structure_fraction()
    return {
        "name":      name,
        "length":    len(seq),
        "mw_kda":   round(a.molecular_weight() / 1000, 1),
        "pI":        round(a.isoelectric_point(), 2),
        "gravy":     round(a.gravy(), 3),
        "instability": round(a.instability_index(), 1),
        "helix_pct": round(helix * 100, 1),
        "sheet_pct": round(sheet * 100, 1),
    }

# Run across a set of structures
proteins = {
    "p53":  "p53.pdb",
    "EGFR": "egfr.pdb",
    "GFP":  "gfp.pdb",
}

results = [analyze_protein(name, path) for name, path in proteins.items()]
df = pd.DataFrame([r for r in results if r])
print(df.to_string(index=False))
df.to_csv("protein_properties.csv", index=False)
MethodReturnsUnits / notes
molecular_weight()floatDaltons (Da)
isoelectric_point()floatpH units, 0–14
charge_at_pH(ph)floatNet charge at given pH
gravy()floatKyte-Doolittle scale, typically −2 to +2
instability_index()floatDimensionless; <40 = stable
aromaticity()floatFraction of Phe, Trp, Tyr residues
get_amino_acids_percent()dictKeys = 3-letter codes; values = fractions summing to 1.0
secondary_structure_fraction()tuple(helix, turn, sheet) as fractions; sequence-based prediction
molar_extinction_coefficient()tuple(reduced, oxidized) ε in M⁻¹cm⁻¹ based on Trp, Tyr, Cys

Protein properties in one paragraph

Build a sequence string from your PDB file with PPBuilder, pass it to ProteinAnalysis(sequence), and call any property method you need — molecular_weight() for mass in Daltons, isoelectric_point() for the pH of zero net charge, gravy() for hydrophobicity (negative = soluble, positive = membrane), instability_index() for a predicted stability flag, and secondary_structure_fraction() for the predicted helix/turn/sheet breakdown from sequence propensities. For batch work, wrap the analysis in a function, iterate over a list of PDB files, and collect results into a pandas DataFrame. All these calculations take milliseconds per protein and produce the exact values you’d report in a methods section alongside experimental characterization.

Last updated on

Similar Posts

Leave a Reply

Your email address will not be published. Required fields are marked *