How to Calculate Protein Properties with BioPython: Molecular Weight, GRAVY and Secondary Structure
BioPython’s ProteinAnalysis class turns any protein sequence into a dashboard of physicochemical properties with a handful of method calls. This tutorial covers every property the class computes, what each value means biologically, and how to run batch analysis across a set of proteins.
Getting the sequence from a PDB structure
ProteinAnalysis works from a plain amino acid sequence string — one-letter codes, uppercase, no gaps or special characters. If you’re starting from a PDB file, BioPython’s PPBuilder extracts the sequence from the structure coordinates:
from Bio.PDB import PDBParser
from Bio.PDB.Polypeptide import PPBuilder
# Load structure
parser = PDBParser(QUIET=True)
structure = parser.get_structure("protein", "protein.pdb")
# Extract sequence from 3D coordinates
ppb = PPBuilder()
sequence = ""
for pp in ppb.build_peptides(structure):
sequence += str(pp.get_sequence())
print(sequence[:30]) # first 30 residues
print(f"Length: {len(sequence)} residues")
# Alternatively, use a raw sequence string directly
sequence = "MGSSHHHHHHSSGLVPRGSHMKKLVVVGAGGVGKSALTIQLIQNHFVDEYDPT"
PPBuilder reconstructs the sequence from atom coordinates — it only includes residues that are actually present in the structure file. This means missing loops, disordered regions, or residues with no electron density won’t appear. If you need the complete canonical sequence (including disordered regions), fetch it from UniProt instead: from Bio import SeqIO; record = SeqIO.read("uniprot.fasta", "fasta").
The ProteinAnalysis class
All property calculations flow through a single class: ProteinAnalysis from Bio.SeqUtils.ProtParam. Create one instance per protein and call as many methods as you need from it — each method does a lightweight calculation from the sequence, so there’s no cost to calling multiple methods on the same object.
from Bio.SeqUtils.ProtParam import ProteinAnalysis
# Create the analysis object — pass a plain uppercase sequence string
analysis = ProteinAnalysis(sequence)
# All properties are now available as method calls
print(analysis.molecular_weight())
print(analysis.isoelectric_point())
print(analysis.gravy())
Molecular weight
mw = analysis.molecular_weight()
print(f"Molecular weight: {mw:.1f} Da")
print(f"Molecular weight: {mw/1000:.1f} kDa")
# Molecular weight: 28642.3 Da
# Molecular weight: 28.6 kDa
BioPython calculates MW by summing the monoisotopic residue masses of each amino acid and subtracting water (18.02 Da) for each peptide bond formed. The result matches what you’d report as the “theoretical molecular weight” in a methods section — the same value that protein expression databases like UniProt report and that you’d compare against an SDS-PAGE band or mass spectrometry result.
Isoelectric point
pi = analysis.isoelectric_point()
print(f"Isoelectric point: {pi:.2f}")
# Isoelectric point: 6.84
# Charge at a specific pH
charge_at_7 = analysis.charge_at_pH(7.0)
print(f"Charge at pH 7: {charge_at_7:.2f}")
# Practical use: will this protein bind to an anion exchange column at pH 8?
if pi < 8.0:
print("Protein is negatively charged at pH 8 — will bind anion exchanger")
else:
print("Protein is positively charged at pH 8 — use cation exchanger")
Amino acid composition
import matplotlib.pyplot as plt
composition = analysis.get_amino_acids_percent()
# Print ranked by abundance
print("Amino acid composition (top 5):")
for aa, pct in sorted(composition.items(), key=lambda x: -x[1])[:5]:
print(f" {aa}: {pct*100:.1f}%")
# LEU: 9.8%
# GLY: 8.1%
# ALA: 7.3%
# Count specific residue types
total_charged = (composition.get("ASP", 0) + composition.get("GLU", 0) +
composition.get("LYS", 0) + composition.get("ARG", 0))
print(f"Charged residue fraction: {total_charged*100:.1f}%")
# Bar chart of full composition
aas = sorted(composition.keys())
fracs = [composition[aa] * 100 for aa in aas]
plt.figure(figsize=(12, 4))
plt.bar(aas, fracs, color="#4a7fc1")
plt.ylabel("Percentage (%)")
plt.title("Amino acid composition")
plt.tight_layout()
plt.savefig("aa_composition.png", dpi=150)
plt.close()
GRAVY score — hydrophobicity
The GRAVY (Grand Average of Hydropathicity) score is calculated by averaging the Kyte-Doolittle hydropathy values of every residue in the sequence. It tells you at a glance whether a protein is likely to be membrane-associated or soluble.
gravy = analysis.gravy()
print(f"GRAVY score: {gravy:.3f}")
# GRAVY score: -0.412
# Interpret automatically
if gravy < -0.2:
classification = "soluble / hydrophilic"
elif gravy > 0.2:
classification = "membrane / hydrophobic"
else:
classification = "borderline"
print(f"Classification: {classification}")
Instability index
ii = analysis.instability_index()
print(f"Instability index: {ii:.1f}")
stable = "stable" if ii < 40 else "unstable"
print(f"Predicted stability: {stable}")
# Instability index: 34.2
# Predicted stability: stable
Secondary structure fraction
secondary_structure_fraction() uses the Chou-Fasman statistical propensity values — each amino acid has a known tendency to adopt helix, sheet, or turn conformation — and returns the predicted fraction of residues in each class as a three-tuple: (helix, turn, sheet). These are sequence-based predictions, not structure-derived values.
helix, turn, sheet = analysis.secondary_structure_fraction()
coil = 1.0 - helix - turn - sheet
print(f"Helix fraction: {helix*100:.1f}%")
print(f"Turn fraction: {turn*100:.1f}%")
print(f"Sheet fraction: {sheet*100:.1f}%")
# Helix fraction: 38.4%
# Turn fraction: 26.1%
# Sheet fraction: 14.7%
secondary_structure_fraction() predicts secondary structure from sequence propensities — it’s useful as a quick sanity check but should not be reported as the actual secondary structure content of a solved structure. For structure-derived secondary structure fractions (from X-ray or cryo-EM data), use DSSP: from Bio.PDB.DSSP import DSSP, which assigns secondary structure from the actual 3D coordinates.
Batch analysis across multiple proteins
The most practical use of ProteinAnalysis is running the same set of calculations across every protein in a dataset and collecting the results into a pandas DataFrame for comparison or export:
import pandas as pd
from Bio.PDB import PDBParser
from Bio.PDB.Polypeptide import PPBuilder
from Bio.SeqUtils.ProtParam import ProteinAnalysis
def get_sequence(pdb_path):
parser = PDBParser(QUIET=True)
structure = parser.get_structure("p", pdb_path)
seq = ""
for pp in PPBuilder().build_peptides(structure):
seq += str(pp.get_sequence())
return seq
def analyze_protein(name, pdb_path):
seq = get_sequence(pdb_path)
if not seq:
return None
a = ProteinAnalysis(seq)
helix, turn, sheet = a.secondary_structure_fraction()
return {
"name": name,
"length": len(seq),
"mw_kda": round(a.molecular_weight() / 1000, 1),
"pI": round(a.isoelectric_point(), 2),
"gravy": round(a.gravy(), 3),
"instability": round(a.instability_index(), 1),
"helix_pct": round(helix * 100, 1),
"sheet_pct": round(sheet * 100, 1),
}
# Run across a set of structures
proteins = {
"p53": "p53.pdb",
"EGFR": "egfr.pdb",
"GFP": "gfp.pdb",
}
results = [analyze_protein(name, path) for name, path in proteins.items()]
df = pd.DataFrame([r for r in results if r])
print(df.to_string(index=False))
df.to_csv("protein_properties.csv", index=False)
| Method | Returns | Units / notes |
|---|---|---|
| molecular_weight() | float | Daltons (Da) |
| isoelectric_point() | float | pH units, 0–14 |
| charge_at_pH(ph) | float | Net charge at given pH |
| gravy() | float | Kyte-Doolittle scale, typically −2 to +2 |
| instability_index() | float | Dimensionless; <40 = stable |
| aromaticity() | float | Fraction of Phe, Trp, Tyr residues |
| get_amino_acids_percent() | dict | Keys = 3-letter codes; values = fractions summing to 1.0 |
| secondary_structure_fraction() | tuple | (helix, turn, sheet) as fractions; sequence-based prediction |
| molar_extinction_coefficient() | tuple | (reduced, oxidized) ε in M⁻¹cm⁻¹ based on Trp, Tyr, Cys |
Protein properties in one paragraph
Build a sequence string from your PDB file with PPBuilder, pass it to ProteinAnalysis(sequence), and call any property method you need — molecular_weight() for mass in Daltons, isoelectric_point() for the pH of zero net charge, gravy() for hydrophobicity (negative = soluble, positive = membrane), instability_index() for a predicted stability flag, and secondary_structure_fraction() for the predicted helix/turn/sheet breakdown from sequence propensities. For batch work, wrap the analysis in a function, iterate over a list of PDB files, and collect results into a pandas DataFrame. All these calculations take milliseconds per protein and produce the exact values you’d report in a methods section alongside experimental characterization.