How to Calculate Protein Properties with BioPython: Molecular Weight, GRAVY and Secondary Structure

11 min read Beginner — basic BioPython assumed

BioPython’s ProteinAnalysis class turns any protein sequence into a dashboard of physicochemical properties with a handful of method calls. This tutorial covers every property the class computes, what each value means biologically, and how to run batch analysis across a set of proteins.

Getting the sequence from a PDB structure

ProteinAnalysis works from a plain amino acid sequence string — one-letter codes, uppercase, no gaps or special characters. If you’re starting from a PDB file, BioPython’s PPBuilder extracts the sequence from the structure coordinates:

Python

from Bio.PDB import PDBParser
from Bio.PDB.Polypeptide import PPBuilder

# Load structure
parser = PDBParser(QUIET=True)
structure = parser.get_structure("protein", "protein.pdb")

# Extract sequence from 3D coordinates
ppb = PPBuilder()
sequence = ""
for pp in ppb.build_peptides(structure):
    sequence += str(pp.get_sequence())

print(sequence[:30])           # first 30 residues
print(f"Length: {len(sequence)} residues")

# Alternatively, use a raw sequence string directly
sequence = "MGSSHHHHHHSSGLVPRGSHMKKLVVVGAGGVGKSALTIQLIQNHFVDEYDPT"

PPBuilder vs direct sequence input

PPBuilder reconstructs the sequence from atom coordinates — it only includes residues that are actually present in the structure file. This means missing loops, disordered regions, or residues with no electron density won’t appear. If you need the complete canonical sequence (including disordered regions), fetch it from UniProt instead: from Bio import SeqIO; record = SeqIO.read("uniprot.fasta", "fasta").

The ProteinAnalysis class

All property calculations flow through a single class: ProteinAnalysis from Bio.SeqUtils.ProtParam. Create one instance per protein and call as many methods as you need from it — each method does a lightweight calculation from the sequence, so there’s no cost to calling multiple methods on the same object.

Python

from Bio.SeqUtils.ProtParam import ProteinAnalysis

# Create the analysis object — pass a plain uppercase sequence string
analysis = ProteinAnalysis(sequence)

# All properties are now available as method calls
print(analysis.molecular_weight())
print(analysis.isoelectric_point())
print(analysis.gravy())

Molecular weight

molecular_weight()

Mass of the protein in Daltons, accounting for amino acid composition and water loss at peptide bonds.

Typical range: 10,000–150,000 Da

Isoelectric point

isoelectric_point()

pH at which the protein carries no net charge. Key for gel electrophoresis, crystallization conditions, and solubility.

Most proteins: pH 4.5–8.5

GRAVY score

gravy()

Grand average of hydropathicity. Positive = hydrophobic (membrane proteins). Negative = hydrophilic (soluble proteins).

Soluble: −2 to 0 · Membrane: 0 to +2

Instability index

instability_index()

Predicts in vitro stability. Index below 40 = stable; above 40 = unstable. Based on dipeptide content.

Stable: < 40 · Unstable: ≥ 40

Amino acid composition

get_amino_acids_percent()

Fraction of each of the 20 standard amino acids as a dictionary. Values sum to 1.0.

Useful for comparing protein families

Secondary structure

secondary_structure_fraction()

Predicted fractions of helix, turn, and sheet based on the Chou-Fasman statistical model.

Returns (helix, turn, sheet) tuple

Molecular weight

Python

mw = analysis.molecular_weight()
print(f"Molecular weight: {mw:.1f} Da")
print(f"Molecular weight: {mw/1000:.1f} kDa")
# Molecular weight: 28642.3 Da
# Molecular weight: 28.6 kDa

BioPython calculates MW by summing the monoisotopic residue masses of each amino acid and subtracting water (18.02 Da) for each peptide bond formed. The result matches what you’d report as the “theoretical molecular weight” in a methods section — the same value that protein expression databases like UniProt report and that you’d compare against an SDS-PAGE band or mass spectrometry result.

Isoelectric point

Python

pi = analysis.isoelectric_point()
print(f"Isoelectric point: {pi:.2f}")
# Isoelectric point: 6.84

# Charge at a specific pH
charge_at_7 = analysis.charge_at_pH(7.0)
print(f"Charge at pH 7: {charge_at_7:.2f}")

# Practical use: will this protein bind to an anion exchange column at pH 8?
if pi < 8.0:
    print("Protein is negatively charged at pH 8 — will bind anion exchanger")
else:
    print("Protein is positively charged at pH 8 — use cation exchanger")

Amino acid composition

Python

import matplotlib.pyplot as plt

composition = analysis.get_amino_acids_percent()

# Print ranked by abundance
print("Amino acid composition (top 5):")
for aa, pct in sorted(composition.items(), key=lambda x: -x[1])[:5]:
    print(f"  {aa}: {pct*100:.1f}%")
# LEU: 9.8%
# GLY: 8.1%
# ALA: 7.3%

# Count specific residue types
total_charged = (composition.get("ASP", 0) + composition.get("GLU", 0) +
                 composition.get("LYS", 0) + composition.get("ARG", 0))
print(f"Charged residue fraction: {total_charged*100:.1f}%")

# Bar chart of full composition
aas = sorted(composition.keys())
fracs = [composition[aa] * 100 for aa in aas]
plt.figure(figsize=(12, 4))
plt.bar(aas, fracs, color="#4a7fc1")
plt.ylabel("Percentage (%)")
plt.title("Amino acid composition")
plt.tight_layout()
plt.savefig("aa_composition.png", dpi=150)
plt.close()

GRAVY score — hydrophobicity

The GRAVY (Grand Average of Hydropathicity) score is calculated by averaging the Kyte-Doolittle hydropathy values of every residue in the sequence. It tells you at a glance whether a protein is likely to be membrane-associated or soluble.

−2.0 (very hydrophilic)0.0+2.0 (very hydrophobic)

Negative GRAVY

Soluble, cytoplasmic proteins. E.g. GFP (−0.5), most kinases (−0.4 to −0.7).

Near zero

Peripheral membrane proteins, some secreted proteins. Borderline solubility.

Positive GRAVY

Integral membrane proteins. E.g. GPCRs (+0.3 to +0.7), ion channels (+0.2 to +0.8).

Python

gravy = analysis.gravy()
print(f"GRAVY score: {gravy:.3f}")
# GRAVY score: -0.412

# Interpret automatically
if gravy < -0.2:
    classification = "soluble / hydrophilic"
elif gravy > 0.2:
    classification = "membrane / hydrophobic"
else:
    classification = "borderline"
print(f"Classification: {classification}")

Instability index

Python

ii = analysis.instability_index()
print(f"Instability index: {ii:.1f}")
stable = "stable" if ii < 40 else "unstable"
print(f"Predicted stability: {stable}")
# Instability index: 34.2
# Predicted stability: stable

Instability index is a rough predictor

The instability index was calibrated on a set of proteins measured in specific in vitro conditions. It performs well for obvious cases but has limited accuracy for borderline values near the threshold of 40. For proteins where stability is a critical experimental decision, complement this calculation with expression testing and thermal shift assays rather than relying solely on the index.

Secondary structure fraction

secondary_structure_fraction() uses the Chou-Fasman statistical propensity values — each amino acid has a known tendency to adopt helix, sheet, or turn conformation — and returns the predicted fraction of residues in each class as a three-tuple: (helix, turn, sheet). These are sequence-based predictions, not structure-derived values.

Python

helix, turn, sheet = analysis.secondary_structure_fraction()
coil = 1.0 - helix - turn - sheet

print(f"Helix fraction: {helix*100:.1f}%")
print(f"Turn fraction:  {turn*100:.1f}%")
print(f"Sheet fraction: {sheet*100:.1f}%")
# Helix fraction: 38.4%
# Turn fraction:  26.1%
# Sheet fraction: 14.7%

Example output — alpha-helical protein

Helix 38.4%

Sheet 14.7%

Turn + coil 46.9%

Sequence-based vs structure-based fractions

secondary_structure_fraction() predicts secondary structure from sequence propensities — it’s useful as a quick sanity check but should not be reported as the actual secondary structure content of a solved structure. For structure-derived secondary structure fractions (from X-ray or cryo-EM data), use DSSP: from Bio.PDB.DSSP import DSSP, which assigns secondary structure from the actual 3D coordinates.

Batch analysis across multiple proteins

The most practical use of ProteinAnalysis is running the same set of calculations across every protein in a dataset and collecting the results into a pandas DataFrame for comparison or export:

Python

import pandas as pd
from Bio.PDB import PDBParser
from Bio.PDB.Polypeptide import PPBuilder
from Bio.SeqUtils.ProtParam import ProteinAnalysis

def get_sequence(pdb_path):
    parser = PDBParser(QUIET=True)
    structure = parser.get_structure("p", pdb_path)
    seq = ""
    for pp in PPBuilder().build_peptides(structure):
        seq += str(pp.get_sequence())
    return seq

def analyze_protein(name, pdb_path):
    seq = get_sequence(pdb_path)
    if not seq:
        return None
    a = ProteinAnalysis(seq)
    helix, turn, sheet = a.secondary_structure_fraction()
    return {
        "name":      name,
        "length":    len(seq),
        "mw_kda":   round(a.molecular_weight() / 1000, 1),
        "pI":        round(a.isoelectric_point(), 2),
        "gravy":     round(a.gravy(), 3),
        "instability": round(a.instability_index(), 1),
        "helix_pct": round(helix * 100, 1),
        "sheet_pct": round(sheet * 100, 1),
    }

# Run across a set of structures
proteins = {
    "p53":  "p53.pdb",
    "EGFR": "egfr.pdb",
    "GFP":  "gfp.pdb",
}

results = [analyze_protein(name, path) for name, path in proteins.items()]
df = pd.DataFrame([r for r in results if r])
print(df.to_string(index=False))
df.to_csv("protein_properties.csv", index=False)

Method	Returns	Units / notes
molecular_weight()	float	Daltons (Da)
isoelectric_point()	float	pH units, 0–14
charge_at_pH(ph)	float	Net charge at given pH
gravy()	float	Kyte-Doolittle scale, typically −2 to +2
instability_index()	float	Dimensionless; <40 = stable
aromaticity()	float	Fraction of Phe, Trp, Tyr residues
get_amino_acids_percent()	dict	Keys = 3-letter codes; values = fractions summing to 1.0
secondary_structure_fraction()	tuple	(helix, turn, sheet) as fractions; sequence-based prediction
molar_extinction_coefficient()	tuple	(reduced, oxidized) ε in M⁻¹cm⁻¹ based on Trp, Tyr, Cys

Protein properties in one paragraph

Build a sequence string from your PDB file with PPBuilder, pass it to ProteinAnalysis(sequence), and call any property method you need — molecular_weight() for mass in Daltons, isoelectric_point() for the pH of zero net charge, gravy() for hydrophobicity (negative = soluble, positive = membrane), instability_index() for a predicted stability flag, and secondary_structure_fraction() for the predicted helix/turn/sheet breakdown from sequence propensities. For batch work, wrap the analysis in a function, iterate over a list of PDB files, and collect results into a pandas DataFrame. All these calculations take milliseconds per protein and produce the exact values you’d report in a methods section alongside experimental characterization.

How to Calculate Protein Properties with BioPython: Molecular Weight, GRAVY and Secondary Structure

Getting the sequence from a PDB structure

The ProteinAnalysis class

Molecular weight

Isoelectric point

Amino acid composition

GRAVY score — hydrophobicity

Instability index

Secondary structure fraction

Batch analysis across multiple proteins

Protein properties in one paragraph

Common BioPython and MDAnalysis Mistakes and How to Fix Them

How to Parse AutoDock Vina Docking Results with Python: From PDBQT to Ranked Hit List

How to Automate GROMACS Analysis with Python: RMSD, RMSF and Energy Plots

How to Analyze Protein-Ligand Interactions over an MD Trajectory with MDAnalysis

How to Set Up a Python Environment for Structural Biology (2026)

MDAnalysis Tutorial for Beginners: Loading and Working with MD Trajectories in Python

Leave a Reply Cancel reply

Getting the sequence from a PDB structure

The ProteinAnalysis class

Molecular weight

Isoelectric point

Amino acid composition

GRAVY score — hydrophobicity

Instability index

Secondary structure fraction

Batch analysis across multiple proteins

Protein properties in one paragraph

Similar Posts

Leave a Reply Cancel reply