Protein structures contain spatial information that sequence-only analysis cannot capture. If you need to study active sites, residue contacts, or structural differences, PDB parsing is a critical step. In this tutorial, you will use `Bio.PDB` to load structures, extract key components, and compute atomic distances. ## Extracting Chains and Residues from Structures
import requests
from Bio.PDB import PDBParser
# Download one example PDB file for all sections below
url = "https://raw.githubusercontent.com/biopython/biopython/master/Tests/PDB/1A8O.pdb"
response = requests.get(url, timeout=30)
response.raise_for_status()
with open("1A8O.pdb", "w", encoding="utf-8") as f:
f.write(response.text)
# Parse structure
parser = PDBParser(QUIET=True)
structure = parser.get_structure("1A8O", "1A8O.pdb")
model = structure[0]
# Inspect chains and a few residues from each chain
for chain in model:
residues = [res for res in chain.get_residues() if res.id[0] == " "]
print(f"Chain {chain.id}: {len(residues)} standard residues")
print("First 5 residue names:", [res.resname for res in residues[:5]])Chain A: 66 standard residues First 5 residue names: ['ASP', 'ILE', 'ARG', 'GLN', 'GLY']
This block downloads and parses a real PDB file, then iterates through chains and standard residues. Chain/residue extraction is the typical first step before more focused structural measurements. ## Saving a Single Chain to a New PDB File
from Bio.PDB import PDBParser, PDBIO, Select
class ChainSelect(Select):
def __init__(self, chain_id):
self.chain_id = chain_id
def accept_chain(self, chain):
return chain.id == self.chain_id
# Parse structure and pick first chain
parser = PDBParser(QUIET=True)
structure = parser.get_structure("1A8O", "1A8O.pdb")
model = structure[0]
first_chain_id = next(iter(model.child_dict.keys()))
# Write only one chain to a new PDB file
io = PDBIO()
io.set_structure(structure)
io.save("1A8O_chain_only.pdb", ChainSelect(first_chain_id))
print("Saved chain", first_chain_id, "to 1A8O_chain_only.pdb")Saved chain A to 1A8O_chain_only.pdb
Extracting a single chain is common when running docking, residue scanning, or visualization pipelines that should ignore other chains. ## Calculating Distances Between Atoms
from Bio.PDB import PDBParser
# Parse the same local PDB file
parser = PDBParser(QUIET=True)
structure = parser.get_structure("1A8O", "1A8O.pdb")
model = structure[0]
# Use the first chain that has at least two residues with CA atoms
target_chain = None
for chain in model:
residues = [r for r in chain.get_residues() if r.id[0] == " " and "CA" in r]
if len(residues) >= 2:
target_chain = chain
break
if target_chain is None:
raise ValueError("No chain with at least two CA atoms found.")
residues = [r for r in target_chain.get_residues() if r.id[0] == " " and "CA" in r]
res1, res2 = residues[0], residues[1]
# Biopython atom objects support distance with subtraction
distance = res1["CA"] - res2["CA"]
print("Chain:", target_chain.id)
print("Residue 1:", res1.resname, res1.id)
print("Residue 2:", res2.resname, res2.id)
print("CA-CA distance (Angstrom):", round(distance, 3))Chain: A
Residue 1: ASP (' ', 152, ' ')
Residue 2: ILE (' ', 153, ' ')
CA-CA distance (Angstrom): 3.829
You locate two alpha-carbon atoms and compute their Euclidean distance directly from atom objects. Distance calculations like this are the basis for contact maps, structural quality checks, and geometric feature engineering. ## Finding Nearby Atoms Around a Residue
from Bio.PDB import PDBParser, NeighborSearch
# Parse structure
parser = PDBParser(QUIET=True)
structure = parser.get_structure("1A8O", "1A8O.pdb")
model = structure[0]
# Build atom list and neighbor search index
atoms = list(structure.get_atoms())
search = NeighborSearch(atoms)
# Pick the first CA atom as the center
center_atom = next(a for a in atoms if a.id == "CA")
nearby_atoms = search.search(center_atom.coord, 5.0) # 5 Angstrom sphere
print("Center atom:", center_atom.get_full_id())
print("Number of nearby atoms within 5A:", len(nearby_atoms))Center atom: ('1A8O', 0, 'A', ('H_MSE', 151, ' '), ('CA', ' '))
Number of nearby atoms within 5A: 17
Neighborhood queries are practical for active-site analysis, contact checks, and identifying residues likely involved in binding. ## Building a Simple CA Distance Matrix
import matplotlib.pyplot as plt
from Bio.PDB import PDBParser
# Parse structure and collect first 12 CA atoms from one chain
parser = PDBParser(QUIET=True)
structure = parser.get_structure("1A8O", "1A8O.pdb")
model = structure[0]
chain = next(iter(model))
ca_atoms = [res["CA"] for res in chain.get_residues() if res.id[0] == " " and "CA" in res][:12]
if len(ca_atoms) < 2:
raise ValueError("Not enough CA atoms to build a distance matrix.")
# Compute pairwise CA distances
matrix = []
for atom_i in ca_atoms:
row = []
for atom_j in ca_atoms:
row.append(atom_i - atom_j)
matrix.append(row)
# Visualize matrix as a heatmap
plt.figure(figsize=(6, 5))
plt.imshow(matrix, cmap="viridis")
plt.colorbar(label="Distance (Angstrom)")
plt.title("CA Distance Matrix (first 12 residues)")
plt.xlabel("Residue index")
plt.ylabel("Residue index")
plt.tight_layout()
plt.show()A distance matrix gives a compact structural view and is useful for identifying close residue clusters and generating contact-style features. ## Summarizing B-factors by Chain
from Bio.PDB import PDBParser
# Parse structure
parser = PDBParser(QUIET=True)
structure = parser.get_structure("1A8O", "1A8O.pdb")
model = structure[0]
for chain in model:
b_factors = [atom.get_bfactor() for atom in chain.get_atoms()]
if not b_factors:
continue
mean_b = sum(b_factors) / len(b_factors)
print(f"Chain {chain.id}: atoms={len(b_factors)}, mean B-factor={mean_b:.2f}")Chain A: atoms=644, mean B-factor=22.58
B-factor summaries provide a quick estimate of local flexibility/disorder, which is useful when prioritizing stable regions for downstream modeling.