Working with Protein Structures (PDB Files) in Biopython

Protein structures contain spatial information that sequence-only analysis cannot capture. If you need to study active sites, residue contacts, or structural differences, PDB parsing is a critical step.

In this tutorial, you will use `Bio.PDB` to load structures, extract key components, and compute atomic distances.

## Extracting Chains and Residues from Structures

import requests
from Bio.PDB import PDBParser

# Download one example PDB file for all sections below
url = "https://raw.githubusercontent.com/biopython/biopython/master/Tests/PDB/1A8O.pdb"
response = requests.get(url, timeout=30)
response.raise_for_status()

with open("1A8O.pdb", "w", encoding="utf-8") as f:
    f.write(response.text)

# Parse structure
parser = PDBParser(QUIET=True)
structure = parser.get_structure("1A8O", "1A8O.pdb")
model = structure[0]

# Inspect chains and a few residues from each chain
for chain in model:
    residues = [res for res in chain.get_residues() if res.id[0] == " "]
    print(f"Chain {chain.id}: {len(residues)} standard residues")
    print("First 5 residue names:", [res.resname for res in residues[:5]])
Chain A: 66 standard residues
First 5 residue names: ['ASP', 'ILE', 'ARG', 'GLN', 'GLY']
This block downloads and parses a real PDB file, then iterates through chains and standard residues. Chain/residue extraction is the typical first step before more focused structural measurements.

## Saving a Single Chain to a New PDB File

from Bio.PDB import PDBParser, PDBIO, Select

class ChainSelect(Select):
    def __init__(self, chain_id):
        self.chain_id = chain_id

    def accept_chain(self, chain):
        return chain.id == self.chain_id

# Parse structure and pick first chain
parser = PDBParser(QUIET=True)
structure = parser.get_structure("1A8O", "1A8O.pdb")
model = structure[0]
first_chain_id = next(iter(model.child_dict.keys()))

# Write only one chain to a new PDB file
io = PDBIO()
io.set_structure(structure)
io.save("1A8O_chain_only.pdb", ChainSelect(first_chain_id))

print("Saved chain", first_chain_id, "to 1A8O_chain_only.pdb")
Saved chain A to 1A8O_chain_only.pdb
Extracting a single chain is common when running docking, residue scanning, or visualization pipelines that should ignore other chains.

## Calculating Distances Between Atoms

from Bio.PDB import PDBParser

# Parse the same local PDB file
parser = PDBParser(QUIET=True)
structure = parser.get_structure("1A8O", "1A8O.pdb")
model = structure[0]

# Use the first chain that has at least two residues with CA atoms
target_chain = None
for chain in model:
    residues = [r for r in chain.get_residues() if r.id[0] == " " and "CA" in r]
    if len(residues) >= 2:
        target_chain = chain
        break

if target_chain is None:
    raise ValueError("No chain with at least two CA atoms found.")

residues = [r for r in target_chain.get_residues() if r.id[0] == " " and "CA" in r]
res1, res2 = residues[0], residues[1]

# Biopython atom objects support distance with subtraction
distance = res1["CA"] - res2["CA"]

print("Chain:", target_chain.id)
print("Residue 1:", res1.resname, res1.id)
print("Residue 2:", res2.resname, res2.id)
print("CA-CA distance (Angstrom):", round(distance, 3))
Chain: A
Residue 1: ASP (' ', 152, ' ')
Residue 2: ILE (' ', 153, ' ')
CA-CA distance (Angstrom): 3.829
You locate two alpha-carbon atoms and compute their Euclidean distance directly from atom objects. Distance calculations like this are the basis for contact maps, structural quality checks, and geometric feature engineering.

## Finding Nearby Atoms Around a Residue

from Bio.PDB import PDBParser, NeighborSearch

# Parse structure
parser = PDBParser(QUIET=True)
structure = parser.get_structure("1A8O", "1A8O.pdb")
model = structure[0]

# Build atom list and neighbor search index
atoms = list(structure.get_atoms())
search = NeighborSearch(atoms)

# Pick the first CA atom as the center
center_atom = next(a for a in atoms if a.id == "CA")
nearby_atoms = search.search(center_atom.coord, 5.0)  # 5 Angstrom sphere

print("Center atom:", center_atom.get_full_id())
print("Number of nearby atoms within 5A:", len(nearby_atoms))
Center atom: ('1A8O', 0, 'A', ('H_MSE', 151, ' '), ('CA', ' '))
Number of nearby atoms within 5A: 17
Neighborhood queries are practical for active-site analysis, contact checks, and identifying residues likely involved in binding.

## Building a Simple CA Distance Matrix

import matplotlib.pyplot as plt
from Bio.PDB import PDBParser

# Parse structure and collect first 12 CA atoms from one chain
parser = PDBParser(QUIET=True)
structure = parser.get_structure("1A8O", "1A8O.pdb")
model = structure[0]
chain = next(iter(model))
ca_atoms = [res["CA"] for res in chain.get_residues() if res.id[0] == " " and "CA" in res][:12]

if len(ca_atoms) < 2:
    raise ValueError("Not enough CA atoms to build a distance matrix.")

# Compute pairwise CA distances
matrix = []
for atom_i in ca_atoms:
    row = []
    for atom_j in ca_atoms:
        row.append(atom_i - atom_j)
    matrix.append(row)

# Visualize matrix as a heatmap
plt.figure(figsize=(6, 5))
plt.imshow(matrix, cmap="viridis")
plt.colorbar(label="Distance (Angstrom)")
plt.title("CA Distance Matrix (first 12 residues)")
plt.xlabel("Residue index")
plt.ylabel("Residue index")
plt.tight_layout()
plt.show()
A distance matrix gives a compact structural view and is useful for identifying close residue clusters and generating contact-style features.

## Summarizing B-factors by Chain

from Bio.PDB import PDBParser

# Parse structure
parser = PDBParser(QUIET=True)
structure = parser.get_structure("1A8O", "1A8O.pdb")
model = structure[0]

for chain in model:
    b_factors = [atom.get_bfactor() for atom in chain.get_atoms()]
    if not b_factors:
        continue
    mean_b = sum(b_factors) / len(b_factors)
    print(f"Chain {chain.id}: atoms={len(b_factors)}, mean B-factor={mean_b:.2f}")
Chain A: atoms=644, mean B-factor=22.58
B-factor summaries provide a quick estimate of local flexibility/disorder, which is useful when prioritizing stable regions for downstream modeling.