Working with Multiple Sequence Alignments in Biopython

Multiple sequence alignments (MSAs) are a backbone of evolutionary and comparative analysis. They let you compare homologous positions across sequences, which is essential for [tree building](/tutorials/biopython-phylogenetic-trees), motif discovery, and conservation analysis.

In this tutorial, you will focus on reading and writing alignment files with Biopython.

## Reading and Writing Alignment Files

import requests
from Bio import AlignIO

# Download one example alignment file to reuse
url = "https://raw.githubusercontent.com/biopython/biopython/master/Doc/examples/opuntia.aln"
response = requests.get(url, timeout=30)
response.raise_for_status()

with open("opuntia.aln", "w", encoding="utf-8") as f:
    f.write(response.text)

# Read alignment in Clustal format
alignment = AlignIO.read("opuntia.aln", "clustal")
print("Number of sequences:", len(alignment))
print("Alignment length:", alignment.get_alignment_length())
print("First three sequence IDs:", [record.id for record in alignment[:3]])

# Write the same alignment to FASTA and PHYLIP formats
AlignIO.write(alignment, "opuntia.fasta", "fasta")
AlignIO.write(alignment, "opuntia.phy", "phylip")
print("Wrote opuntia.fasta and opuntia.phy")

Number of sequences: 7
Alignment length: 906
First three sequence IDs: ['gi|6273285|gb|AF191659.1|AF191', 'gi|6273284|gb|AF191658.1|AF191', 'gi|6273287|gb|AF191661.1|AF191']
Wrote opuntia.fasta and opuntia.phy

This block shows the full I/O cycle for alignments: read once, inspect core metadata, and export to additional formats. Format conversion is common when chaining tools that each expect a different alignment format.

## Reading Back Converted Alignments

from Bio import AlignIO

# Read converted files back to verify they are usable
fasta_alignment = AlignIO.read("opuntia.fasta", "fasta")
phylip_alignment = AlignIO.read("opuntia.phy", "phylip")

print("FASTA alignment length:", fasta_alignment.get_alignment_length())
print("PHYLIP alignment length:", phylip_alignment.get_alignment_length())
print("Same number of sequences:", len(fasta_alignment) == len(phylip_alignment))

FASTA alignment length: 906
PHYLIP alignment length: 906
Same number of sequences: True

Reading exported files back is a quick integrity check that your conversion worked as expected. This prevents subtle downstream errors when a tool fails due to malformed or incompatible alignment output.

## Computing a Consensus Sequence

from Bio import AlignIO
from Bio import motifs

# Read alignment from local file
alignment = AlignIO.read("opuntia.aln", "clustal")

# Build a motif from aligned sequences and use built-in consensus methods
motif = motifs.create([record.seq for record in alignment])
consensus = motif.consensus
degenerate = motif.degenerate_consensus
thresholded = motif.counts.calculate_consensus(identity=0.7)

print("Consensus sequence:")
print(consensus)
print("Degenerate consensus:")
print(degenerate)
print("Thresholded consensus (identity=0.7):")
print(thresholded)

Consensus sequence:
TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAGAAAGAAAAAAATGAATCTAAATGATATACGATTCCACTATGTAAGGTCTTTGAATCATATCATAAAAGACAATGTAATAAAGCATGAATACAGATTCACACATAATTATCTGATATGAATCTATTCATAGAAAAAAGAAAAAAGTAAGAGCCTCCGGCCAATAAAGACTAAGAGGGTTGGCTCAAGAACAAAGTTCATTAAGAGCTCCATTGTAGAATTCAGACCTAATCATTAATCAAGAAGCGATGGGAACGATGTAATCCATGAATACAGAAGATTCAATTGAAAAAGATCCTAATTGATCATTGGGAAGGATGGCGGAACGAACCAGAGACCAATTCATCTATTCTGAAAAGTGATAAACTAATCCTATAAAACTAAAATAGATATTGAAAGAGTAAATATTCGCCCGCGAAAATTCCTTTTTTATTTAAATTGCTCATATTTTATTTTAGCAATGCAATCTAATAAAATATATCTATACAAAAAAATATAGACAAACTATATATATATATATATATAATATATTTCAAATTTCCTTATATATCCAAATATAAAAATATCTAATAAATTAGATGAATATCAAAGAATCTATTGATTTAGTGTATTATTAAATGTATATCTTAATTCAATATTATTATTCTATTCATTTTTATTCATTTTCAAATTTATAATATATTAATCTATATATTAATTTATAATTCTATTCTAATTCGAATTCAATTTTTAAATATTCATATTCAATTAAAATTGAAATTTTTTCATTCGCGAGGAGCCGGATGAGAAGAAACTCTCATGTCCGGTTCTGTAGTAGAGATGGAATTAAGAAAAAACCATCAACTATAACCCCAAGAGAACCAGA
Degenerate consensus:
TATACATTAAAGRAGGGGGATGCGGATAAATGGAAAGGCGAAAGAAAGAAAAAAATGAATCTAAATGATATASGATTCCACTATGTAAGGTCTTTGAATCATATCATAAAAGACAATGTAATAAAGCATGAATACAGATTCACACATAATTATCTGATATGAATCTATTCATAGAAAAAAGAAAAAAGTAAGAGCCTCCGGCCAATAAAGACTAAGAGGGTTGGCTCAAGAACAAAGTTCATTAAGAGCTCCATTGTAGAATTCAGACCTAATCATTAATCAAGAAGCGATGGGAACGATGTAATCCATGAATACAGAAGATTCAATTGAAAAAGATCCTAATKRWTCATTGGGAAGGATGGCGGAACGAACCAGAGACCAATTCATCTATTCTGAAAAGTGATAAACTAATCCTATAAAACTAAAATAGATATTGAAAGAGTAAATATTCGCCCGCGAAAATTCCTTTTTTATTTAAATTGCTCATATTTTMTTTTAGCAATGCAATCTAATAAAATATATCTATACAAAAAAAYATAGACAAACTATATATATATATATATATAATATATTTCAAATTYCCTTATATATCCAAATATAAAAATATCTAATAAATTAGATGAATATCAAAGAATCTATTGATTTAGTGTATTATTAAATGTATATMTTAATTCAATATTATTATTCTATTCATTTTTATTCATTTTCAAATTTATAATATATTAATCTATATATTAATTTAKAATTCTATTCTAATTCGAATTCAATTTTTAAATATTCATATTCAATTAAAATTGAAATTTTTTCATTCGCGAGGAGCCGGATGAGAAGAAACTCTCATGTCCGGTTCTGTAGTAGAGATGGAATTAAGAAAAAACCATCAACTATAACCCCAARAGAACCAGA
Thresholded consensus (identity=0.7):
TATACATTAAAGNAGGGGGATGCGGATAAATGGAAAGGCGAAAGAAAGAAAAAAATGAATCTAAATGATATANGATTCCACTATGTAAGGTCTTTGAATCATATCATAAAAGACAATGTAATAAAGCATGAATACAGATTCACACATAATTATCTGATATGAATCTATTCATAGAAAAAAGAAAAAAGTAAGAGCCTCCGGCCAATAAAGACTAAGAGGGTTGGCTCAAGAACAAAGTTCATTAAGAGCTCCATTGTAGAATTCAGACCTAATCATTAATCAAGAAGCGATGGGAACGATGTAATCCATGAATACAGAAGATTCAATTGAAAAAGATCCTAATNNNTCATTGGGAAGGATGGCGGAACGAACCAGAGACCAATTCATCTATTCTGAAAAGTGATAAACTAATCCTATAAAACTAAAATAGATATTGAAAGAGTAAATATTCGCCCGCGAAAATTCCTTTTTTATTTAAATTGCTCATATTTTNTTTTAGCAATGCAATCTAATAAAATATATCTATACAAAAAAANATAGACAAACTATATATATATATATATATAATATATTTCAAATTNCCTTATATATCCAAATATAAAAATATCTAATAAATTAGATGAATATCAAAGAATCTATTGATTTAGTGTATTATTAAATGTATATNTTAATTCAATATTATTATTCTATTCATTTTTATTCATTTTCAAATTTATAATATATTAATCTATATATTAATTTANAATTCTATTCTAATTCGAATTCAATTTTTAAATATTCATATTCAATTAAAATTGAAATTTTTTCATTCGCGAGGAGCCGGATGAGAAGAAACTCTCATGTCCGGTTCTGTAGTAGAGATGGAATTAAGAAAAAACCATCAACTATAACCCCAANAGAACCAGA

Consensus extraction is useful when you need one representative sequence for primer design, motif checks, or high-level reporting. Using `Bio.motifs` also keeps the tutorial aligned with current Biopython APIs instead of deprecated `SummaryInfo` helpers.

## Measuring Per-Column Conservation

from Bio import AlignIO

# Read alignment
alignment = AlignIO.read("opuntia.aln", "clustal")
length = alignment.get_alignment_length()

conservation_scores = []
for i in range(length):
    column = alignment[:, i]
    non_gap = [c for c in column if c != "-"]
    if not non_gap:
        conservation_scores.append(0.0)
        continue
    most_common = max(set(non_gap), key=non_gap.count)
    score = non_gap.count(most_common) / len(non_gap)
    conservation_scores.append(score)

print("First 20 conservation scores:")
print([round(s, 3) for s in conservation_scores[:20]])

First 20 conservation scores:
[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.857, 1.0, 1.0, 1.0, 1.0, 0.571, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]

Per-column conservation helps you find stable versus variable regions, which is practical for marker selection and downstream phylogenetic feature design.

## Filtering High-Gap Columns

from Bio import AlignIO
from Bio.Align import MultipleSeqAlignment
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Read alignment
alignment = AlignIO.read("opuntia.aln", "clustal")
length = alignment.get_alignment_length()

# Keep columns where gap fraction is <= 0.3
keep_positions = []
for i in range(length):
    column = alignment[:, i]
    gap_fraction = column.count("-") / len(column)
    if gap_fraction <= 0.3:
        keep_positions.append(i)

# Rebuild filtered alignment from kept columns
filtered_records = []
for record in alignment:
    filtered_seq = "".join(record.seq[i] for i in keep_positions)
    filtered_records.append(SeqRecord(Seq(filtered_seq), id=record.id, description=""))

filtered_alignment = MultipleSeqAlignment(filtered_records)
AlignIO.write(filtered_alignment, "opuntia_filtered.aln", "clustal")

print("Original length:", length)
print("Filtered length:", filtered_alignment.get_alignment_length())
print("Wrote opuntia_filtered.aln")

Original length: 906
Filtered length: 894
Wrote opuntia_filtered.aln

Gap filtering is a common preprocessing step before tree construction to reduce noise from poorly aligned positions.

## Visualizing Alignment Patterns

import matplotlib.pyplot as plt
from Bio import AlignIO

# Read alignment and compare each residue against first sequence as reference
alignment = AlignIO.read("opuntia.aln", "clustal")
reference = str(alignment[0].seq)

# Encode alignment: 0 = match to reference, 1 = mismatch, 2 = gap
matrix = []
for record in alignment:
    row = []
    seq = str(record.seq)
    for ref_char, char in zip(reference, seq):
        if char == "-":
            row.append(2)
        elif char == ref_char:
            row.append(0)
        else:
            row.append(1)
    matrix.append(row)

plt.figure(figsize=(12, 5))
plt.imshow(matrix, aspect="auto", interpolation="nearest", cmap="viridis")
plt.colorbar(label="0=match, 1=mismatch, 2=gap")
plt.title("Alignment Pattern Heatmap (relative to first sequence)")
plt.xlabel("Alignment position")
plt.ylabel("Sequence index")
plt.tight_layout()
plt.show()

This heatmap gives a quick visual summary of conserved blocks, mismatch-rich regions, and gap-heavy segments. It is a practical diagnostic before selecting regions for phylogenetic or marker analyses.

## Exporting a Subset of Sequences

from Bio import AlignIO
from Bio.Align import MultipleSeqAlignment

# Read alignment and keep only first five sequences as an example subset
alignment = AlignIO.read("opuntia.aln", "clustal")
subset = MultipleSeqAlignment(alignment[:5])

AlignIO.write(subset, "opuntia_subset.fasta", "fasta")
print("Subset size:", len(subset))
print("Wrote opuntia_subset.fasta")

Subset size: 5
Wrote opuntia_subset.fasta

Subsetting is useful when preparing focused comparisons for one clade, one sample group, or a small demo dataset for collaborators.