Multiple sequence alignments (MSAs) are a backbone of evolutionary and comparative analysis. They let you compare homologous positions across sequences, which is essential for tree building, motif discovery, and conservation analysis. In this tutorial, you will focus on reading and writing alignment files with Biopython. ## Reading and Writing Alignment Files
import requests
from Bio import AlignIO
# Download one example alignment file to reuse
url = "https://raw.githubusercontent.com/biopython/biopython/master/Doc/examples/opuntia.aln"
response = requests.get(url, timeout=30)
response.raise_for_status()
with open("opuntia.aln", "w", encoding="utf-8") as f:
f.write(response.text)
# Read alignment in Clustal format
alignment = AlignIO.read("opuntia.aln", "clustal")
print("Number of sequences:", len(alignment))
print("Alignment length:", alignment.get_alignment_length())
print("First three sequence IDs:", [record.id for record in alignment[:3]])
# Write the same alignment to FASTA and PHYLIP formats
AlignIO.write(alignment, "opuntia.fasta", "fasta")
AlignIO.write(alignment, "opuntia.phy", "phylip")
print("Wrote opuntia.fasta and opuntia.phy")Number of sequences: 7 Alignment length: 906 First three sequence IDs: ['gi|6273285|gb|AF191659.1|AF191', 'gi|6273284|gb|AF191658.1|AF191', 'gi|6273287|gb|AF191661.1|AF191'] Wrote opuntia.fasta and opuntia.phy
This block shows the full I/O cycle for alignments: read once, inspect core metadata, and export to additional formats. Format conversion is common when chaining tools that each expect a different alignment format. ## Reading Back Converted Alignments
from Bio import AlignIO
# Read converted files back to verify they are usable
fasta_alignment = AlignIO.read("opuntia.fasta", "fasta")
phylip_alignment = AlignIO.read("opuntia.phy", "phylip")
print("FASTA alignment length:", fasta_alignment.get_alignment_length())
print("PHYLIP alignment length:", phylip_alignment.get_alignment_length())
print("Same number of sequences:", len(fasta_alignment) == len(phylip_alignment))FASTA alignment length: 906 PHYLIP alignment length: 906 Same number of sequences: True
Reading exported files back is a quick integrity check that your conversion worked as expected. This prevents subtle downstream errors when a tool fails due to malformed or incompatible alignment output. ## Computing a Consensus Sequence
from Bio import AlignIO
from Bio import motifs
# Read alignment from local file
alignment = AlignIO.read("opuntia.aln", "clustal")
# Build a motif from aligned sequences and use built-in consensus methods
motif = motifs.create([record.seq for record in alignment])
consensus = motif.consensus
degenerate = motif.degenerate_consensus
thresholded = motif.counts.calculate_consensus(identity=0.7)
print("Consensus sequence:")
print(consensus)
print("Degenerate consensus:")
print(degenerate)
print("Thresholded consensus (identity=0.7):")
print(thresholded)Consensus sequence: TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAGAAAGAAAAAAATGAATCTAAATGATATACGATTCCACTATGTAAGGTCTTTGAATCATATCATAAAAGACAATGTAATAAAGCATGAATACAGATTCACACATAATTATCTGATATGAATCTATTCATAGAAAAAAGAAAAAAGTAAGAGCCTCCGGCCAATAAAGACTAAGAGGGTTGGCTCAAGAACAAAGTTCATTAAGAGCTCCATTGTAGAATTCAGACCTAATCATTAATCAAGAAGCGATGGGAACGATGTAATCCATGAATACAGAAGATTCAATTGAAAAAGATCCTAATTGATCATTGGGAAGGATGGCGGAACGAACCAGAGACCAATTCATCTATTCTGAAAAGTGATAAACTAATCCTATAAAACTAAAATAGATATTGAAAGAGTAAATATTCGCCCGCGAAAATTCCTTTTTTATTTAAATTGCTCATATTTTATTTTAGCAATGCAATCTAATAAAATATATCTATACAAAAAAATATAGACAAACTATATATATATATATATATAATATATTTCAAATTTCCTTATATATCCAAATATAAAAATATCTAATAAATTAGATGAATATCAAAGAATCTATTGATTTAGTGTATTATTAAATGTATATCTTAATTCAATATTATTATTCTATTCATTTTTATTCATTTTCAAATTTATAATATATTAATCTATATATTAATTTATAATTCTATTCTAATTCGAATTCAATTTTTAAATATTCATATTCAATTAAAATTGAAATTTTTTCATTCGCGAGGAGCCGGATGAGAAGAAACTCTCATGTCCGGTTCTGTAGTAGAGATGGAATTAAGAAAAAACCATCAACTATAACCCCAAGAGAACCAGA Degenerate consensus: TATACATTAAAGRAGGGGGATGCGGATAAATGGAAAGGCGAAAGAAAGAAAAAAATGAATCTAAATGATATASGATTCCACTATGTAAGGTCTTTGAATCATATCATAAAAGACAATGTAATAAAGCATGAATACAGATTCACACATAATTATCTGATATGAATCTATTCATAGAAAAAAGAAAAAAGTAAGAGCCTCCGGCCAATAAAGACTAAGAGGGTTGGCTCAAGAACAAAGTTCATTAAGAGCTCCATTGTAGAATTCAGACCTAATCATTAATCAAGAAGCGATGGGAACGATGTAATCCATGAATACAGAAGATTCAATTGAAAAAGATCCTAATKRWTCATTGGGAAGGATGGCGGAACGAACCAGAGACCAATTCATCTATTCTGAAAAGTGATAAACTAATCCTATAAAACTAAAATAGATATTGAAAGAGTAAATATTCGCCCGCGAAAATTCCTTTTTTATTTAAATTGCTCATATTTTMTTTTAGCAATGCAATCTAATAAAATATATCTATACAAAAAAAYATAGACAAACTATATATATATATATATATAATATATTTCAAATTYCCTTATATATCCAAATATAAAAATATCTAATAAATTAGATGAATATCAAAGAATCTATTGATTTAGTGTATTATTAAATGTATATMTTAATTCAATATTATTATTCTATTCATTTTTATTCATTTTCAAATTTATAATATATTAATCTATATATTAATTTAKAATTCTATTCTAATTCGAATTCAATTTTTAAATATTCATATTCAATTAAAATTGAAATTTTTTCATTCGCGAGGAGCCGGATGAGAAGAAACTCTCATGTCCGGTTCTGTAGTAGAGATGGAATTAAGAAAAAACCATCAACTATAACCCCAARAGAACCAGA Thresholded consensus (identity=0.7): TATACATTAAAGNAGGGGGATGCGGATAAATGGAAAGGCGAAAGAAAGAAAAAAATGAATCTAAATGATATANGATTCCACTATGTAAGGTCTTTGAATCATATCATAAAAGACAATGTAATAAAGCATGAATACAGATTCACACATAATTATCTGATATGAATCTATTCATAGAAAAAAGAAAAAAGTAAGAGCCTCCGGCCAATAAAGACTAAGAGGGTTGGCTCAAGAACAAAGTTCATTAAGAGCTCCATTGTAGAATTCAGACCTAATCATTAATCAAGAAGCGATGGGAACGATGTAATCCATGAATACAGAAGATTCAATTGAAAAAGATCCTAATNNNTCATTGGGAAGGATGGCGGAACGAACCAGAGACCAATTCATCTATTCTGAAAAGTGATAAACTAATCCTATAAAACTAAAATAGATATTGAAAGAGTAAATATTCGCCCGCGAAAATTCCTTTTTTATTTAAATTGCTCATATTTTNTTTTAGCAATGCAATCTAATAAAATATATCTATACAAAAAAANATAGACAAACTATATATATATATATATATAATATATTTCAAATTNCCTTATATATCCAAATATAAAAATATCTAATAAATTAGATGAATATCAAAGAATCTATTGATTTAGTGTATTATTAAATGTATATNTTAATTCAATATTATTATTCTATTCATTTTTATTCATTTTCAAATTTATAATATATTAATCTATATATTAATTTANAATTCTATTCTAATTCGAATTCAATTTTTAAATATTCATATTCAATTAAAATTGAAATTTTTTCATTCGCGAGGAGCCGGATGAGAAGAAACTCTCATGTCCGGTTCTGTAGTAGAGATGGAATTAAGAAAAAACCATCAACTATAACCCCAANAGAACCAGA
Consensus extraction is useful when you need one representative sequence for primer design, motif checks, or high-level reporting. Using `Bio.motifs` also keeps the tutorial aligned with current Biopython APIs instead of deprecated `SummaryInfo` helpers. ## Measuring Per-Column Conservation
from Bio import AlignIO
# Read alignment
alignment = AlignIO.read("opuntia.aln", "clustal")
length = alignment.get_alignment_length()
conservation_scores = []
for i in range(length):
column = alignment[:, i]
non_gap = [c for c in column if c != "-"]
if not non_gap:
conservation_scores.append(0.0)
continue
most_common = max(set(non_gap), key=non_gap.count)
score = non_gap.count(most_common) / len(non_gap)
conservation_scores.append(score)
print("First 20 conservation scores:")
print([round(s, 3) for s in conservation_scores[:20]])First 20 conservation scores: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.857, 1.0, 1.0, 1.0, 1.0, 0.571, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
Per-column conservation helps you find stable versus variable regions, which is practical for marker selection and downstream phylogenetic feature design. ## Filtering High-Gap Columns
from Bio import AlignIO
from Bio.Align import MultipleSeqAlignment
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
# Read alignment
alignment = AlignIO.read("opuntia.aln", "clustal")
length = alignment.get_alignment_length()
# Keep columns where gap fraction is <= 0.3
keep_positions = []
for i in range(length):
column = alignment[:, i]
gap_fraction = column.count("-") / len(column)
if gap_fraction <= 0.3:
keep_positions.append(i)
# Rebuild filtered alignment from kept columns
filtered_records = []
for record in alignment:
filtered_seq = "".join(record.seq[i] for i in keep_positions)
filtered_records.append(SeqRecord(Seq(filtered_seq), id=record.id, description=""))
filtered_alignment = MultipleSeqAlignment(filtered_records)
AlignIO.write(filtered_alignment, "opuntia_filtered.aln", "clustal")
print("Original length:", length)
print("Filtered length:", filtered_alignment.get_alignment_length())
print("Wrote opuntia_filtered.aln")Original length: 906 Filtered length: 894 Wrote opuntia_filtered.aln
Gap filtering is a common preprocessing step before tree construction to reduce noise from poorly aligned positions. ## Visualizing Alignment Patterns
import matplotlib.pyplot as plt
from Bio import AlignIO
# Read alignment and compare each residue against first sequence as reference
alignment = AlignIO.read("opuntia.aln", "clustal")
reference = str(alignment[0].seq)
# Encode alignment: 0 = match to reference, 1 = mismatch, 2 = gap
matrix = []
for record in alignment:
row = []
seq = str(record.seq)
for ref_char, char in zip(reference, seq):
if char == "-":
row.append(2)
elif char == ref_char:
row.append(0)
else:
row.append(1)
matrix.append(row)
plt.figure(figsize=(12, 5))
plt.imshow(matrix, aspect="auto", interpolation="nearest", cmap="viridis")
plt.colorbar(label="0=match, 1=mismatch, 2=gap")
plt.title("Alignment Pattern Heatmap (relative to first sequence)")
plt.xlabel("Alignment position")
plt.ylabel("Sequence index")
plt.tight_layout()
plt.show()This heatmap gives a quick visual summary of conserved blocks, mismatch-rich regions, and gap-heavy segments. It is a practical diagnostic before selecting regions for phylogenetic or marker analyses. ## Exporting a Subset of Sequences
from Bio import AlignIO
from Bio.Align import MultipleSeqAlignment
# Read alignment and keep only first five sequences as an example subset
alignment = AlignIO.read("opuntia.aln", "clustal")
subset = MultipleSeqAlignment(alignment[:5])
AlignIO.write(subset, "opuntia_subset.fasta", "fasta")
print("Subset size:", len(subset))
print("Wrote opuntia_subset.fasta")Subset size: 5 Wrote opuntia_subset.fasta
Subsetting is useful when preparing focused comparisons for one clade, one sample group, or a small demo dataset for collaborators.