# Custom Alignment Scoring Matrices in Biopython

The default [alignment scoring](/tutorials/biopython-pairwise-sequence-alignment) does not suit every dataset. In real projects, you may want to penalize transitions less than transversions, penalize gaps more strongly, or adapt scoring to expected mutation patterns.

In this tutorial, you will customize alignment scoring with Biopython.

## Basic Custom Match, Mismatch, and Gap Scores

```python
from Bio.Align import PairwiseAligner

seq1 = "ACGTACGT"
seq2 = "ACGTTGGT"  # differs at two positions: A->T and C->G (both transversions)

# Configure a pairwise aligner with explicit scores
aligner = PairwiseAligner()
aligner.match_score = 2
aligner.mismatch_score = -1
aligner.open_gap_score = -2
aligner.extend_gap_score = -0.5

score = aligner.score(seq1, seq2)
alignment = aligner.align(seq1, seq2)[0]

print("Score:", score)
print(alignment)
```

```text
Score: 10.0
target            0 ACGTACGT 8
                  0 ||||..|| 8
query             0 ACGTTGGT 8
```

This gives you full control over the basic scoring scheme and gap behavior. It is useful when uniform match and mismatch scores are enough and you want to avoid external matrix files.
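As a sanity check, a score like the one above can be reproduced by hand whenever the best alignment contains no gaps. The sketch below is plain Python and does not use Biopython; `ungapped_score` is a hypothetical helper name, and it assumes both sequences have the same length and are compared column by column.

```python
def ungapped_score(target, query, match=2.0, mismatch=-1.0):
    """Score two equal-length sequences column by column, with no gaps."""
    if len(target) != len(query):
        raise ValueError("sequences must have the same length")
    return sum(match if t == q else mismatch for t, q in zip(target, query))


# Same sequences and match/mismatch scores as in the example above
print(ungapped_score("ACGTACGT", "ACGTTGGT"))
```

Six matches and two mismatches give 6 × 2 + 2 × (-1) = 10, which agrees with the aligner's reported score of 10.0.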

## Building a Custom Substitution Matrix

```python
from Bio.Align import PairwiseAligner
from Bio.Align.substitution_matrices import Array

seq1 = "ACGTACGT"
seq2 = "ACGTTGGT"

# Build a DNA substitution matrix with per-pair substitution scores
alphabet = "ACGT"
matrix = Array(alphabet=alphabet, dims=2)

for a in alphabet:
    for b in alphabet:
        matrix[a, b] = 2 if a == b else -1

# Reduce the penalty for the two substitutions that occur in this pair.
# Note: A<->T and C<->G are transversions; transitions are A<->G and C<->T.
# Set both directions so the matrix stays symmetric.
for a, b in [("A", "T"), ("C", "G")]:
    matrix[a, b] = -0.2
    matrix[b, a] = -0.2

aligner = PairwiseAligner()
aligner.substitution_matrix = matrix
aligner.open_gap_score = -2
aligner.extend_gap_score = -0.5

print("Custom-matrix score:", aligner.score(seq1, seq2))
print(aligner.align(seq1, seq2)[0])
```

```text
Custom-matrix score: 11.6
target            0 ACGTACGT 8
                  0 ||||..|| 8
query             0 ACGTTGGT 8
```

A custom matrix lets you encode domain assumptions directly in scoring. For example, scoring transitions (A<->G, C<->T) more leniently than transversions is common in evolutionary DNA comparisons.
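To make the transition/transversion distinction concrete, here is a minimal sketch in plain Python that classifies each base pair and builds a symmetric score table. The names `is_transition` and `transition_aware_scores` are hypothetical helpers, and the default values (2, -0.2, -1) simply mirror the numbers used in this tutorial.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(a, b):
    """True for purine<->purine or pyrimidine<->pyrimidine substitutions."""
    return a != b and ({a, b} <= PURINES or {a, b} <= PYRIMIDINES)


def transition_aware_scores(match=2.0, transition=-0.2, transversion=-1.0):
    """Return a symmetric {(a, b): score} table over the DNA alphabet."""
    scores = {}
    for a in "ACGT":
        for b in "ACGT":
            if a == b:
                scores[a, b] = match
            elif is_transition(a, b):
                scores[a, b] = transition
            else:
                scores[a, b] = transversion
    return scores


scores = transition_aware_scores()
print(scores["A", "G"], scores["C", "T"])  # transitions
print(scores["A", "T"], scores["C", "G"])  # transversions
```

Each entry can then be copied into an `Array` with `matrix[a, b] = scores[a, b]`, in the same way the nested loop above fills the matrix.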

## Comparing Scoring Schemes Side by Side

```python
from Bio.Align import PairwiseAligner

seq1 = "ATGCTGATACGA"
seq2 = "ATGGTCGGTTA"

# Strict scheme: mismatches are expensive, so aligner is more willing to open gaps
strict = PairwiseAligner()
strict.match_score = 2
strict.mismatch_score = -3
strict.open_gap_score = -3
strict.extend_gap_score = -1

# Lenient mismatch + expensive gaps: aligner prefers mismatches over many gaps
lenient = PairwiseAligner()
lenient.match_score = 2
lenient.mismatch_score = -0.5
lenient.open_gap_score = -5
lenient.extend_gap_score = -2

# Compute scores
strict_score = strict.score(seq1, seq2)
lenient_score = lenient.score(seq1, seq2)

# Get top alignment from each scheme for visual comparison
strict_alignment = strict.align(seq1, seq2)[0]
lenient_alignment = lenient.align(seq1, seq2)[0]

print("Strict score:", strict_score)
print(strict_alignment)
print("-" * 80)
print("Lenient score:", lenient_score)
print(lenient_alignment)
```

```text
Strict score: 1.0
target            0 ATGCTGATACG---A 12
                  0 |||--|-|-||---| 15
query             0 ATG--G-T-CGGTTA 11

--------------------------------------------------------------------------------
Lenient score: 2.0
target            0 ATGCTGATACGA 12
                  0 |||.|-.....| 12
query             0 ATGGT-CGGTTA 11
```

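Comparing schemes helps you choose scoring settings that match your biological question. Here the strict scheme opens four gaps and contains no mismatches, while the lenient scheme accepts six mismatches and opens a single gap. Seeing the rendered alignments side by side makes it much easier to detect whether score differences actually change mismatch/gap placement in biologically meaningful ways.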
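The two scores above can be re-derived from the printed rows. The following sketch is plain Python, not Biopython's implementation; `score_alignment_rows` is a hypothetical helper. It assumes an affine gap model in which the first column of a gap run costs the open score and each further column in the same run costs the extend score, with end gaps charged like internal gaps.

```python
def score_alignment_rows(target_row, query_row, match, mismatch, open_gap, extend_gap):
    """Re-score a gapped alignment given as two equal-length rows with '-' for gaps."""
    score = 0.0
    prev_gap = None  # row that held a gap in the previous column, if any
    for t, q in zip(target_row, query_row):
        if t == "-" or q == "-":
            gap_in = "target" if t == "-" else "query"
            # Continuing a run in the same row extends it; otherwise a new gap opens
            score += extend_gap if gap_in == prev_gap else open_gap
            prev_gap = gap_in
        else:
            score += match if t == q else mismatch
            prev_gap = None
    return score


# Rows copied from the alignments printed above
strict = score_alignment_rows("ATGCTGATACG---A", "ATG--G-T-CGGTTA", 2, -3, -3, -1)
lenient = score_alignment_rows("ATGCTGATACGA", "ATGGT-CGGTTA", 2, -0.5, -5, -2)
print(strict, lenient)
```

For the strict scheme this is 8 matches (16) minus gap costs of 4 + 3 + 3 + 5 = 15, giving 1.0. For the lenient scheme it is 5 matches (10), 6 mismatches (-3), and one gap (-5), giving 2.0.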

## Visual Comparison: Default Matrix vs Transition-Friendly Matrix

```python
from Bio.Align import PairwiseAligner
from Bio.Align.substitution_matrices import Array

seq1 = "TAAATGACCCTCT"
seq2 = "CGTCATAAAACCT"

# Baseline: uniform match/mismatch scores (every mismatch costs the same)
default_aligner = PairwiseAligner()
default_aligner.match_score = 2
default_aligner.mismatch_score = -1
default_aligner.open_gap_score = -3
default_aligner.extend_gap_score = -1

# Transition-friendly matrix (A<->G and C<->T less penalized)
alphabet = "ACGT"
matrix = Array(alphabet=alphabet, dims=2)
for a in alphabet:
    for b in alphabet:
        matrix[a, b] = 2 if a == b else -1
matrix["A", "G"] = -0.2
matrix["G", "A"] = -0.2
matrix["C", "T"] = -0.2
matrix["T", "C"] = -0.2

matrix_aligner = PairwiseAligner()
matrix_aligner.substitution_matrix = matrix
matrix_aligner.open_gap_score = -3
matrix_aligner.extend_gap_score = -1

default_alignment = default_aligner.align(seq1, seq2)[0]
matrix_alignment = matrix_aligner.align(seq1, seq2)[0]

print("Default scoring alignment:")
print(default_alignment)
print("Score:", default_aligner.score(seq1, seq2))
print("-" * 80)
print("Transition-friendly matrix alignment:")
print(matrix_alignment)
print("Score:", matrix_aligner.score(seq1, seq2))
```

```text
Default scoring alignment:
target            0 -TAAATGACCCTCT 13
                  0 -...||.|..|-|| 14
query             0 CGTCATAAAAC-CT 13

Score: 0.0
--------------------------------------------------------------------------------
Transition-friendly matrix alignment:
target            0 TA-AATGACCCTCT 13
                  0 ..-.||.|..|-|| 14
query             0 CGTCATAAAAC-CT 13

Score: 2.3999999999999995
```

This visual comparison highlights how a biologically tuned matrix can shift the preferred alignment path, not just the final score. In practice, that affects which substitutions you interpret as plausible evolutionary events.
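One way to see which substitutions an alignment implies is to tally its columns by type. The sketch below is plain Python and works on the two printed rows; `classify_column` and `summarize_rows` are hypothetical helper names, and the input rows are copied from the transition-friendly alignment above.

```python
from collections import Counter

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_column(t, q):
    """Label one alignment column as gap, match, transition, or transversion."""
    if t == "-" or q == "-":
        return "gap"
    if t == q:
        return "match"
    if {t, q} <= PURINES or {t, q} <= PYRIMIDINES:
        return "transition"
    return "transversion"


def summarize_rows(target_row, query_row):
    """Count column types across two equal-length alignment rows."""
    return Counter(classify_column(t, q) for t, q in zip(target_row, query_row))


summary = summarize_rows("TA-AATGACCCTCT", "CGTCATAAAAC-CT")
print(dict(summary))
```

For this alignment the tally is 6 matches, 3 transitions, 3 transversions, and 2 gap columns. That is consistent with the reported score up to floating-point rounding: 12 - 0.6 - 3 - 6 = 2.4.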

## Practical Use Case: Prioritizing Candidate Homolog Pairs

```python
from Bio.Align import PairwiseAligner

query = "ATGCGTACGTTAGCTAGCTAG"
candidates = {
    "candidate_A": "ATGCGTACGTTAGCTTGCTAG",
    "candidate_B": "ATGCGCACGATAGATAGCTAG",
    "candidate_C": "GCGTACGTTAGCTAGCTAAAT",
}

aligner = PairwiseAligner()
aligner.match_score = 2
aligner.mismatch_score = -1
aligner.open_gap_score = -2
aligner.extend_gap_score = -0.5

scores = []
for name, seq in candidates.items():
    score = aligner.score(query, seq)
    top_alignment = aligner.align(query, seq)[0]
    scores.append((name, score, top_alignment))

scores.sort(key=lambda x: x[1], reverse=True)
print("Ranked candidates (name, score):", [(name, score) for name, score, _ in scores])

# Show alignments for the top two candidates so score differences are interpretable
for name, score, alignment in scores[:2]:
    print("-" * 80)
    print(f"{name} | score={score}")
    print(alignment)
```

```text
Ranked candidates (name, score): [('candidate_A', 39.0), ('candidate_B', 33.0), ('candidate_C', 30.0)]
--------------------------------------------------------------------------------
candidate_A | score=39.0
target            0 ATGCGTACGTTAGCTAGCTAG 21
                  0 |||||||||||||||.||||| 21
query             0 ATGCGTACGTTAGCTTGCTAG 21

--------------------------------------------------------------------------------
candidate_B | score=33.0
target            0 ATGCGTACGTTAGCTAGCTAG 21
                  0 |||||.|||.|||.||||||| 21
query             0 ATGCGCACGATAGATAGCTAG 21
```

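This is a practical ranking step for shortlist generation before expensive downstream analyses like tree construction or structure prediction. Printing the top alignments (not just scores) helps validate whether high-scoring candidates also have biologically sensible mismatch and gap patterns. Note that in the printed alignments the row labeled `target` is the first argument passed to `align()` (here the `query` variable), and the row labeled `query` is the candidate sequence.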
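When the candidate list grows, one possible refinement is to score every candidate first and build full alignments only for the shortlist, since only those are printed. The sketch below assumes Biopython is installed; `rank_candidates` and its `top_k` parameter are hypothetical names introduced for illustration, and the scoring values repeat the ones used above.

```python
from Bio.Align import PairwiseAligner


def rank_candidates(aligner, query, candidates, top_k=2):
    """Rank candidates by score, then align only the top_k best ones."""
    ranked = sorted(
        ((name, aligner.score(query, seq)) for name, seq in candidates.items()),
        key=lambda item: item[1],
        reverse=True,
    )
    # Full alignments are built only for the shortlist
    top = [
        (name, score, aligner.align(query, candidates[name])[0])
        for name, score in ranked[:top_k]
    ]
    return ranked, top


aligner = PairwiseAligner()
aligner.match_score = 2
aligner.mismatch_score = -1
aligner.open_gap_score = -2
aligner.extend_gap_score = -0.5

candidates = {
    "candidate_A": "ATGCGTACGTTAGCTTGCTAG",
    "candidate_B": "ATGCGCACGATAGATAGCTAG",
    "candidate_C": "GCGTACGTTAGCTAGCTAAAT",
}

ranked, top = rank_candidates(aligner, "ATGCGTACGTTAGCTAGCTAG", candidates)
for name, score in ranked:
    print(name, score)
for name, score, alignment in top:
    print(f"{name} | score={score}")
    print(alignment)
```

The ranking is the same as in the loop above; the only difference is that `align()` is called `top_k` times instead of once per candidate.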