Proteins are central to biology, and amino acid sequences are one of the main ways we represent them in code. If you want to study enzymes, compare proteins, search for motifs, or load sequence data from FASTA files, Biopython gives you a clean set of tools to do that in Python.

In this tutorial, you will learn how to work with amino acid sequences in Biopython step by step. We will start by creating simple protein sequences, then move on to common tasks such as measuring sequence length, counting residues, finding motifs, calculating molecular weight, reading FASTA files, and comparing sequences.

## Creating an Amino Acid Sequence

Biopython uses the `Seq` class to represent biological sequences. A protein sequence can be stored as a string of one-letter amino acid codes.
```python
from Bio.Seq import Seq

protein = Seq("MKWVTFISLLFLFSSAYSRGVFRRDTHKSEIAHRFKDLGE")

print("Sequence:", protein)
print("Length:", len(protein))
print("First amino acid:", protein[0])
print("Last amino acid:", protein[-1])
```

```
Sequence: MKWVTFISLLFLFSSAYSRGVFRRDTHKSEIAHRFKDLGE
Length: 40
First amino acid: M
Last amino acid: E
```
This code creates a protein sequence using `Seq`. You can print the sequence, measure its length with `len()`, and access individual amino acids by index, just as with a normal Python string.

## Slicing a Protein Sequence

You often need to look at only part of a protein, such as a signal peptide, domain, or motif region.
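Region coordinates in databases and papers are usually given as 1-based, inclusive residue numbers, while Python slices are 0-based and exclude the end index. The following is a minimal sketch of a conversion helper; `region()` is a hypothetical name introduced here for illustration and is not part of Biopython.

```python
from Bio.Seq import Seq


def region(sequence, start, end):
    """Return residues start..end using 1-based, inclusive numbering.

    Hypothetical helper for illustration; not a Biopython function.
    """
    if start < 1 or end > len(sequence) or start > end:
        raise ValueError(f"Invalid region {start}-{end} for length {len(sequence)}")
    # 1-based inclusive [start, end] maps to the 0-based slice [start - 1:end]
    return sequence[start - 1:end]


protein = Seq("MKWVTFISLLFLFSSAYSRGVFRRDTHKSEIAHRFKDLGE")
first_fifteen = region(protein, 1, 15)
print("Residues 1-15:", first_fifteen)
```

Residues 1 to 15 in this numbering correspond to the plain slice `protein[:15]`.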
```python
from Bio.Seq import Seq

protein = Seq("MKWVTFISLLFLFSSAYSRGVFRRDTHKSEIAHRFKDLGE")

signal_peptide = protein[:15]
middle_region = protein[15:30]
last_five = protein[-5:]

print("Signal peptide:", signal_peptide)
print("Middle region:", middle_region)
print("Last five amino acids:", last_five)
```

```
Signal peptide: MKWVTFISLLFLFSS
Middle region: AYSRGVFRRDTHKSE
Last five amino acids: KDLGE
```
This example uses slicing to extract parts of the sequence. The syntax works the same way as Python string slicing, which makes it easy to inspect specific regions of a protein.

## Counting Amino Acids

A common first analysis is to count how many times each amino acid appears.
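Biopython also provides a `ProteinAnalysis` class in `Bio.SeqUtils.ProtParam`, whose `count_amino_acids()` method returns a dictionary covering all 20 standard residues, including those that do not occur in the sequence. A brief sketch of that route, offered as an alternative to counting by hand:

```python
from Bio.SeqUtils.ProtParam import ProteinAnalysis

# ProteinAnalysis takes the sequence as a plain string
analysis = ProteinAnalysis("MKWVTFISLLFLFSSAYSRGVFRRDTHKSEIAHRFKDLGE")
counts = analysis.count_amino_acids()

print("Phenylalanine (F):", counts["F"])
print("Cysteine (C):", counts["C"])  # absent residues are reported as 0
```

Having a zero entry for absent residues can be convenient when building tables across many proteins, since every row then has the same columns.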
```python
from Bio.Seq import Seq
from collections import Counter

protein = Seq("MKWVTFISLLFLFSSAYSRGVFRRDTHKSEIAHRFKDLGE")

counts = Counter(str(protein))

print("Amino acid counts:")
for amino_acid, count in sorted(counts.items()):
    print(amino_acid, count)
```

```
Amino acid counts:
A 2
D 2
E 2
F 5
G 2
H 2
I 2
K 3
L 4
M 1
R 4
S 5
T 2
V 2
W 1
Y 1
```
Here, the `Seq` object is converted to a regular string so it can be passed to `Counter`. The result is a frequency table showing how many times each amino acid occurs in the sequence.

## Calculating Amino Acid Percentages

Raw counts are helpful, but percentages make it easier to compare proteins of different lengths.
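To see why length matters, consider two sequences that contain the same number of lysines. The sketch below uses a hypothetical `composition()` helper and a made-up six-residue peptide purely for illustration.

```python
from collections import Counter


def composition(sequence):
    """Return {residue: percentage} for a protein sequence.

    Hypothetical helper for illustration; not a Biopython function.
    """
    text = str(sequence)
    counts = Counter(text)
    return {aa: 100 * n / len(text) for aa, n in counts.items()}


short_protein = "MKKLLK"  # made-up peptide, 3 lysines out of 6 residues
long_protein = "MKWVTFISLLFLFSSAYSRGVFRRDTHKSEIAHRFKDLGE"  # 3 lysines out of 40

short_comp = composition(short_protein)
long_comp = composition(long_protein)

for name, comp in [("short", short_comp), ("long", long_comp)]:
    print(f"{name}: K = {comp.get('K', 0.0):.1f}%")
```

Both sequences have three lysines, yet lysine makes up half of the short peptide and well under a tenth of the longer protein.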
```python
from Bio.Seq import Seq
from collections import Counter

protein = Seq("MKWVTFISLLFLFSSAYSRGVFRRDTHKSEIAHRFKDLGE")

counts = Counter(str(protein))
length = len(protein)

print("Amino acid percentages:")
for amino_acid, count in sorted(counts.items()):
    percentage = (count / length) * 100
    print(f"{amino_acid}: {percentage:.2f}%")
```

```
Amino acid percentages:
A: 5.00%
D: 5.00%
E: 5.00%
F: 12.50%
G: 5.00%
H: 5.00%
I: 5.00%
K: 7.50%
L: 10.00%
M: 2.50%
R: 10.00%
S: 12.50%
T: 5.00%
V: 5.00%
W: 2.50%
Y: 2.50%
```
This code calculates the percentage of each amino acid by dividing its count by the total sequence length. This is useful for spotting proteins that are enriched in particular amino acids or for comparing sequence composition.

## Finding a Motif in a Protein Sequence

A motif is a short pattern of amino acids that may have structural or functional importance.
```python
from Bio.Seq import Seq

protein = Seq("MKWVTFISLLFLFSSAYSRGVFRRDTHKSEIAHRFKDLGE")
motif = "HRF"

position = str(protein).find(motif)

if position != -1:
    print(f"Motif '{motif}' found at position {position}")
else:
    print(f"Motif '{motif}' not found")
```

```
Motif 'HRF' found at position 32
```
This example searches for a short amino acid pattern inside the protein. The `find()` method returns the 0-based starting index of the first match, or `-1` if the motif is not present. Here, index 32 corresponds to residue 33 in 1-based numbering.

## Finding All Occurrences of a Motif

Sometimes a motif appears more than once, so it is useful to find every match.
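One compact way to collect every start position is a lookahead pattern, sketched here with Python's standard `re` module rather than a Biopython API. Because the lookahead consumes no characters, overlapping occurrences are reported as well.

```python
import re

sequence_text = "AKTAAAKTAAGGAKTA"
motif = "AKTA"

# A zero-width lookahead matches at every start position, so overlapping
# occurrences are found too; re.escape guards against special characters.
pattern = re.compile(f"(?={re.escape(motif)})")
positions = [match.start() for match in pattern.finditer(sequence_text)]

print("Positions:", positions)
```

The loop-based version that follows does the same job using only `str.find()`.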
```python
from Bio.Seq import Seq

protein = Seq("AKTAAAKTAAGGAKTA")
motif = "AKTA"

positions = []
sequence_text = str(protein)
start = 0

while True:
    position = sequence_text.find(motif, start)
    if position == -1:
        break
    positions.append(position)
    start = position + 1

print("Sequence:", protein)
print("Motif:", motif)
print("Positions:", positions)
```

```
Sequence: AKTAAAKTAAGGAKTA
Motif: AKTA
Positions: [0, 5, 12]
```
This code repeatedly searches for the motif and records every starting position. Because the search restarts one position after each match, overlapping occurrences are also found. It is a simple way to scan a protein for repeated sequence patterns.

## Calculating Molecular Weight

Biopython includes tools for basic protein properties, including molecular weight.
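Several related properties are available through the `ProteinAnalysis` class in `Bio.SeqUtils.ProtParam`. The sketch below prints three of them; expected values are deliberately not listed, since the estimates depend on the parameter tables in the installed Biopython version.

```python
from Bio.SeqUtils.ProtParam import ProteinAnalysis

analysis = ProteinAnalysis("MKWVTFISLLFLFSSAYSRGVFRRDTHKSEIAHRFKDLGE")

mw = analysis.molecular_weight()   # average mass in daltons
pi = analysis.isoelectric_point()  # estimated pH at which net charge is zero
gravy = analysis.gravy()           # mean hydropathy across the sequence

print(f"Molecular weight: {mw:.2f} Da")
print(f"Isoelectric point: {pi:.2f}")
print(f"GRAVY: {gravy:.3f}")
```

These are estimates computed from the sequence alone, so they are best treated as a first approximation rather than measured values.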
```python
from Bio.Seq import Seq
from Bio.SeqUtils import molecular_weight

protein = Seq("MKWVTFISLLFLFSSAYSRGVFRRDTHKSEIAHRFKDLGE")

mw = molecular_weight(protein, seq_type="protein")

print("Sequence:", protein)
print("Molecular weight:", round(mw, 2), "Da")
```

```
Sequence: MKWVTFISLLFLFSSAYSRGVFRRDTHKSEIAHRFKDLGE
Molecular weight: 4777.46 Da
```
The `molecular_weight()` function computes the approximate mass of the protein in daltons. Setting `seq_type="protein"` tells Biopython that the sequence contains amino acids rather than DNA or RNA.

## Checking for Valid Amino Acid Symbols

Real datasets sometimes contain unknown or unexpected characters. It is a good idea to validate a sequence before analyzing it.
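Sequence files often contain symbols such as `X` (unknown residue), `B` (aspartate or asparagine), or `*` (stop). When a check fails, it helps to know where the offending symbols are. The following sketch uses a hypothetical `find_invalid_residues()` helper and a made-up test sequence for illustration.

```python
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def find_invalid_residues(sequence, allowed=STANDARD_AMINO_ACIDS):
    """Return (index, symbol) pairs for symbols outside the allowed set.

    Hypothetical helper for illustration; not a Biopython function.
    """
    text = str(sequence).upper()
    return [(index, symbol) for index, symbol in enumerate(text) if symbol not in allowed]


# Made-up sequence containing three non-standard symbols
problems = find_invalid_residues("MKWXTFIS*LB")
print(problems)
```

Passing a larger `allowed` set is one way to accept ambiguity codes when an analysis can tolerate them.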
```python
from Bio.Seq import Seq

protein = Seq("MKWVTFISLLFLFSSAYSRGVFRRDTHKSEIAHRFKDLGE")
valid_amino_acids = set("ACDEFGHIKLMNPQRSTVWY")

invalid = sorted(set(str(protein)) - valid_amino_acids)

if invalid:
    print("Invalid symbols found:", invalid)
else:
    print("Sequence contains only standard amino acids")
```

```
Sequence contains only standard amino acids
```
This code checks whether the sequence uses only the 20 standard amino acid symbols. It helps catch problems early, especially when working with data from files or external sources.

## Translating DNA into a Protein Sequence

Amino acid sequences are often produced by translating coding DNA.
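How stop codons are handled depends on the arguments passed to `translate()`. The sketch below compares three calls on the same DNA; the second sequence, which lacks a start codon, is a made-up example used to trigger the `cds=True` check.

```python
from Bio.Seq import Seq
from Bio.Data.CodonTable import TranslationError

dna = Seq("ATGGCCAAGTAA")

full = dna.translate()                 # stop codon appears as "*"
trimmed = dna.translate(to_stop=True)  # stops before the first stop codon
checked = dna.translate(cds=True)      # also validates start, stop, and length

print(full, trimmed, checked)

cds_error = None
try:
    Seq("GCCAAGTAA").translate(cds=True)  # made-up sequence with no start codon
except TranslationError as error:
    cds_error = error
    print("Not a complete CDS:", error)
```

Using `cds=True` is stricter than `to_stop=True`, which makes it a reasonable choice when the input is supposed to be a complete coding sequence.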
```python
from Bio.Seq import Seq

dna = Seq("ATGGCCAAGTAA")

protein = dna.translate(to_stop=True)

print("DNA sequence:", dna)
print("Protein sequence:", protein)
```

```
DNA sequence: ATGGCCAAGTAA
Protein sequence: MAK
```
This example translates a DNA sequence into a protein. The argument `to_stop=True` stops translation at the first stop codon, which is often useful when extracting a coding region.

## Reading Protein Sequences from a FASTA File

Biopython makes it easy to read protein sequences from FASTA files using `SeqIO`.
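When the FASTA text is already in memory, `SeqIO.parse()` can read from a text handle instead of a file on disk. The sketch below also uses `SeqIO.to_dict()` to index the records by ID, which is practical for small files that fit in memory.

```python
from io import StringIO

from Bio import SeqIO

fasta_text = """>protein_1
MKWVTFISLLFLFSSAYSRG
>protein_2
GAVLILKKKGHHEAELKPLA
"""

# SeqIO.parse accepts an open text handle as well as a filename
records = SeqIO.to_dict(SeqIO.parse(StringIO(fasta_text), "fasta"))

print("IDs:", sorted(records))
print("protein_2:", records["protein_2"].seq)
```

The example that follows takes the file-based route instead, writing the same text to disk before parsing it.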
```python
from Bio import SeqIO

fasta_text = """>protein_1
MKWVTFISLLFLFSSAYSRG
>protein_2
GAVLILKKKGHHEAELKPLA
"""

with open("proteins.fasta", "w") as handle:
    handle.write(fasta_text)

for record in SeqIO.parse("proteins.fasta", "fasta"):
    print("ID:", record.id)
    print("Sequence:", record.seq)
    print("Length:", len(record.seq))
    print()
```

```
ID: protein_1
Sequence: MKWVTFISLLFLFSSAYSRG
Length: 20

ID: protein_2
Sequence: GAVLILKKKGHHEAELKPLA
Length: 20
```
This code creates a small FASTA file and then reads it with `SeqIO.parse()`. Each FASTA entry becomes a record with attributes such as `id` and `seq`.

## Writing Protein Sequences to a FASTA File

After processing protein sequences, you may want to save them back to a file.
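A typical processing step is to filter records before saving them. The sketch below keeps only sequences above a length cutoff and writes them to an in-memory buffer; the `min_length` value and the short `fragment_1` record are made up for illustration.

```python
from io import StringIO

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

records = [
    SeqRecord(Seq("MKWVTFISLLFLFSSAYSRG"), id="protein_1", description="Example protein 1"),
    SeqRecord(Seq("MKW"), id="fragment_1", description="Short fragment"),
]

min_length = 10  # hypothetical cutoff, chosen only for this example
kept = [record for record in records if len(record.seq) >= min_length]

# SeqIO.write accepts a text handle and returns the number of records written
buffer = StringIO()
written = SeqIO.write(kept, buffer, "fasta")

print("Records written:", written)
print(buffer.getvalue())
```

Swapping the buffer for a filename writes the same content to disk.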
```python
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio import SeqIO

records = [
    SeqRecord(Seq("MKWVTFISLLFLFSSAYSRG"), id="protein_1", description="Example protein 1"),
    SeqRecord(Seq("GAVLILKKKGHHEAELKPLA"), id="protein_2", description="Example protein 2"),
]

SeqIO.write(records, "output_proteins.fasta", "fasta")

print("Wrote", len(records), "records to output_proteins.fasta")
```

```
Wrote 2 records to output_proteins.fasta
```
This example uses `SeqRecord` objects to store protein sequences along with IDs and descriptions. `SeqIO.write()` then saves them in FASTA format.

## Comparing Two Protein Sequences

One simple way to compare proteins is to count how many positions match.
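Knowing which positions differ is often as useful as the match count itself. The sketch below uses a hypothetical `mismatches()` helper, assuming two ungapped sequences of equal length.

```python
def mismatches(seq_a, seq_b):
    """Return (index, residue_a, residue_b) for each differing position.

    Hypothetical helper for illustration; assumes ungapped, equal-length input.
    """
    a, b = str(seq_a), str(seq_b)
    if len(a) != len(b):
        raise ValueError("Sequences must have the same length")
    return [(i, x, y) for i, (x, y) in enumerate(zip(a, b)) if x != y]


diffs = mismatches("MKWVTFISLLFLFSSAYSRG", "MKWVTFISLMFLFSSAYARG")
for index, residue_a, residue_b in diffs:
    print(f"Position {index}: {residue_a} -> {residue_b}")
```

Indices here are 0-based, matching the conventions used elsewhere in this tutorial.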
```python
from Bio.Seq import Seq

protein1 = Seq("MKWVTFISLLFLFSSAYSRG")
protein2 = Seq("MKWVTFISLMFLFSSAYARG")

min_length = min(len(protein1), len(protein2))
matches = sum(
    a == b
    for a, b in zip(str(protein1[:min_length]), str(protein2[:min_length]))
)
identity = (matches / min_length) * 100

print("Protein 1:", protein1)
print("Protein 2:", protein2)
print("Compared length:", min_length)
print("Matching positions:", matches)
print(f"Percent identity: {identity:.2f}%")
```

```
Protein 1: MKWVTFISLLFLFSSAYSRG
Protein 2: MKWVTFISLMFLFSSAYARG
Compared length: 20
Matching positions: 18
Percent identity: 90.00%
```
This code compares two protein sequences position by position and calculates percent identity over the shared length. It is a simple introduction to sequence comparison before using full alignment tools.

## Pairwise Alignment of Protein Sequences

For a more realistic comparison, Biopython provides pairwise alignment tools.
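In current Biopython releases, the recommended interface for this is `Bio.Align.PairwiseAligner`. The following is a minimal sketch with match-only scoring set explicitly, so that it does not rely on the library's defaults. The scoring values are an assumption chosen to keep the example simple, not a recommendation for real protein comparisons, which usually use a substitution matrix.

```python
from Bio.Align import PairwiseAligner

aligner = PairwiseAligner()
aligner.mode = "global"

# Match-only scoring: 1 per identical pair, nothing for mismatches or gaps
aligner.match_score = 1
aligner.mismatch_score = 0
aligner.open_gap_score = 0
aligner.extend_gap_score = 0

alignments = aligner.align("MKWVTFISLLFLFSSAYSRG", "MKWVTFISLMFLFSSAYARG")
best = alignments[0]

print("Alignment score:", best.score)
print(best)
```

The example that follows uses the older `pairwise2` interface with an equivalent match-only scheme.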
```python
from Bio import pairwise2

protein1 = "MKWVTFISLLFLFSSAYSRG"
protein2 = "MKWVTFISLMFLFSSAYARG"

alignments = pairwise2.align.globalxx(protein1, protein2)
best_alignment = alignments[0]

print("Alignment score:", best_alignment.score)
print(best_alignment.seqA)
print(best_alignment.seqB)
```

```
Alignment score: 18.0
MKWVTFISLL-FLFSSAYS-RG
MKWVTFIS-LMFLFSSAY-ARG
```
Note: recent Biopython releases emit a `BiopythonDeprecationWarning` when this code runs, because `Bio.pairwise2` has been deprecated and is slated for removal in a future release. The Biopython developers recommend `Bio.Align.PairwiseAligner` as the replacement.
This example performs a global alignment of two amino acid sequences. The `globalxx()` function scores one point for each matching pair and applies no penalty for mismatches or gaps, so the score of 18 is simply the number of aligned identical residues. The first alignment in the result list is one of possibly several alignments that share the best score.

## Extracting Hydrophobic Residues

Protein analysis often focuses on groups of amino acids with similar chemical properties. Hydrophobic residues are especially important in membrane proteins and protein folding.
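Hydrophobic residues matter most when they cluster, so one simple approach is to slide a fixed-size window along the sequence and report windows that are mostly hydrophobic. In the sketch below, `hydrophobic_windows()` is a hypothetical helper, and the window size and threshold are illustrative assumptions rather than established cutoffs.

```python
HYDROPHOBIC = set("AILMFWVY")


def hydrophobic_windows(sequence, window=9, threshold=0.7):
    """Return (start, fraction) for windows with hydrophobic fraction >= threshold.

    Hypothetical helper; window and threshold are illustrative values only.
    """
    text = str(sequence)
    hits = []
    for start in range(len(text) - window + 1):
        segment = text[start:start + window]
        fraction = sum(aa in HYDROPHOBIC for aa in segment) / window
        if fraction >= threshold:
            hits.append((start, fraction))
    return hits


protein = "MKWVTFISLLFLFSSAYSRGVFRRDTHKSEIAHRFKDLGE"
hits = hydrophobic_windows(protein)
for start, fraction in hits:
    print(start, protein[start:start + 9], f"{fraction:.2f}")
```

Dedicated hydropathy scales give a finer-grained picture than a simple in-or-out residue set, so this kind of scan is best seen as a quick first pass.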
```python
from Bio.Seq import Seq

protein = Seq("MKWVTFISLLFLFSSAYSRGVFRRDTHKSEIAHRFKDLGE")
hydrophobic = set("AILMFWVY")

hydrophobic_positions = [
    (index, amino_acid)
    for index, amino_acid in enumerate(str(protein))
    if amino_acid in hydrophobic
]

print("Hydrophobic residues:")
for position, amino_acid in hydrophobic_positions:
    print(position, amino_acid)
```

```
Hydrophobic residues:
0 M
2 W
3 V
5 F
6 I
8 L
9 L
10 F
11 L
12 F
15 A
16 Y
20 V
21 F
30 I
31 A
34 F
37 L
```
This code scans the sequence and records the positions of hydrophobic amino acids. This kind of filtering is useful when looking for hydrophobic stretches or possible transmembrane regions.

## Summary

Biopython gives you practical tools for protein sequence work without forcing you to build everything from scratch. You can represent amino acid sequences with `Seq`, slice them, count residues, search for motifs, calculate molecular weight, read and write FASTA files, translate DNA into protein, and compare sequences.

These building blocks are enough to start exploring real protein datasets in Python. Once you are comfortable with them, the next step is to learn sequence alignment in more depth, work with annotations, and connect sequence analysis to protein structure and function.