BLAST with Biopython

BLAST is one of the fastest ways to identify related sequences and infer biological context for unknown DNA or protein fragments. Automating BLAST with Biopython makes repetitive searches and downstream filtering much easier.

In this tutorial, you will run a BLAST query, parse BLAST XML, and extract similar-sequence hits.

## Running BLAST Searches with Biopython

from Bio.Blast import NCBIWWW

# Use a known accession query from Biopython BLAST examples
# (this tends to return many hits more reliably than a very short random sequence)
query = "8332116"

# Submit an online BLASTN search to NCBI and ask for XML output
result_handle = NCBIWWW.qblast(
    program="blastn",
    database="nt",
    sequence=query,
    hitlist_size=5,
    format_type="XML",
)

# Save results for later parsing steps
with open("blast_result.xml", "w", encoding="utf-8") as out_handle:
    out_handle.write(result_handle.read())

result_handle.close()
print("Saved BLAST results to blast_result.xml for query:", query)

Saved BLAST results to blast_result.xml for query: 8332116

This code sends a live BLAST request to NCBI and stores XML output locally. Using a known accession query gives a more practical demo because it typically returns clear, non-empty hit lists for parsing and filtering.

## Parsing BLAST XML Results

from Bio.Blast import NCBIXML

# Parse the saved BLAST XML file
with open("blast_result.xml", "r", encoding="utf-8") as result_file:
    blast_record = NCBIXML.read(result_file)

print("Query:", blast_record.query)
print("Alignments returned:", len(blast_record.alignments))

# Show summary for the first three hits
for alignment in blast_record.alignments[:3]:
    hsp = alignment.hsps[0]
    print("Hit title:", alignment.title)
    print("E-value:", hsp.expect)
    print("Alignment length:", hsp.align_length)
    print("-" * 60)

Query: MP14H09 MP Mesembryanthemum crystallinum cDNA 5' similar to cold acclimation protein, mRNA sequence
Alignments returned: 5
Hit title: gi|1219041180|ref|XM_021875076.1| PREDICTED: Chenopodium quinoa cold-regulated 413 plasma membrane protein 2-like (LOC110697660), mRNA
E-value: 6.30608e-117
Alignment length: 624
------------------------------------------------------------
Hit title: gi|2514617377|ref|XM_021992092.2| PREDICTED: Spinacia oleracea cold-regulated 413 plasma membrane protein 2-like (LOC110787470), mRNA
E-value: 1.69216e-111
Alignment length: 590
------------------------------------------------------------
Hit title: gi|2518612504|ref|XM_010682658.3| PREDICTED: Beta vulgaris subsp. vulgaris cold-regulated 413 plasma membrane protein 2 (LOC104895996), mRNA
E-value: 4.54069e-106
Alignment length: 597
------------------------------------------------------------

Here you parse the XML into Python objects and inspect key fields like title, e-value, and alignment length. Structured parsing is what turns BLAST output from raw text into filterable programmatic data.

## Finding Similar Sequences with BLAST

from Bio.Blast import NCBIXML

# Load parsed BLAST results
with open("blast_result.xml", "r", encoding="utf-8") as result_file:
    blast_record = NCBIXML.read(result_file)

# Keep only strong matches by e-value threshold
threshold = 1e-10
similar_hits = []

for alignment in blast_record.alignments:
    for hsp in alignment.hsps:
        if hsp.expect <= threshold:
            similar_hits.append(
                {
                    "title": alignment.title,
                    "evalue": hsp.expect,
                    "identity_fraction": hsp.identities / hsp.align_length,
                }
            )
            break

print("Strong similar hits found:", len(similar_hits))
for hit in similar_hits[:5]:
    print(hit)

Strong similar hits found: 5
{'title': 'gi|1219041180|ref|XM_021875076.1| PREDICTED: Chenopodium quinoa cold-regulated 413 plasma membrane protein 2-like (LOC110697660), mRNA', 'evalue': 6.30608e-117, 'identity_fraction': 0.7580128205128205}
{'title': 'gi|2514617377|ref|XM_021992092.2| PREDICTED: Spinacia oleracea cold-regulated 413 plasma membrane protein 2-like (LOC110787470), mRNA', 'evalue': 1.69216e-111, 'identity_fraction': 0.7576271186440678}
{'title': 'gi|2518612504|ref|XM_010682658.3| PREDICTED: Beta vulgaris subsp. vulgaris cold-regulated 413 plasma membrane protein 2 (LOC104895996), mRNA', 'evalue': 4.54069e-106, 'identity_fraction': 0.7504187604690117}
{'title': 'gi|2031543140|ref|XM_041168865.1| PREDICTED: Juglans microcarpa x Juglans regia cold-regulated 413 plasma membrane protein 2-like (LOC121265293), mRNA', 'evalue': 1.58486e-105, 'identity_fraction': 0.7554806070826307}
{'title': 'gi|2082357255|ref|XM_043119049.1| PREDICTED: Carya illinoinensis cold-regulated 413 plasma membrane protein 2-like (LOC122306609), transcript variant X2, mRNA', 'evalue': 6.73898e-104, 'identity_fraction': 0.7533783783783784}

This block applies an e-value filter to identify higher-confidence similar sequences. Explicit filtering criteria make BLAST results more reproducible and easier to integrate into downstream analysis steps.

## Ranking Hits by Identity and Query Coverage

from Bio.Blast import NCBIXML

# Parse BLAST XML
with open("blast_result.xml", "r", encoding="utf-8") as result_file:
    blast_record = NCBIXML.read(result_file)

# Biopython BLAST records can expose query length as query_letters
query_len = getattr(blast_record, "query_length", None) or getattr(blast_record, "query_letters", None)
if not query_len:
    raise ValueError("Could not determine query length from BLAST record.")

ranked = []
for alignment in blast_record.alignments:
    best_hsp = max(alignment.hsps, key=lambda h: h.identities / h.align_length)
    query_coverage = best_hsp.align_length / query_len
    identity_fraction = best_hsp.identities / best_hsp.align_length
    ranked.append(
        {
            "title": alignment.title,
            "evalue": best_hsp.expect,
            "identity_fraction": identity_fraction,
            "query_coverage": query_coverage,
        }
    )

# Sort by identity first, then coverage
ranked.sort(key=lambda x: (x["identity_fraction"], x["query_coverage"]), reverse=True)

if not ranked:
    print("No BLAST hits were returned for ranking.")
else:
    for hit in ranked[:5]:
        print(hit)

{'title': 'gi|1219041180|ref|XM_021875076.1| PREDICTED: Chenopodium quinoa cold-regulated 413 plasma membrane protein 2-like (LOC110697660), mRNA', 'evalue': 6.30608e-117, 'identity_fraction': 0.7580128205128205, 'query_coverage': 0.5616561656165616}
{'title': 'gi|2514617377|ref|XM_021992092.2| PREDICTED: Spinacia oleracea cold-regulated 413 plasma membrane protein 2-like (LOC110787470), mRNA', 'evalue': 1.69216e-111, 'identity_fraction': 0.7576271186440678, 'query_coverage': 0.5310531053105311}
{'title': 'gi|2031543140|ref|XM_041168865.1| PREDICTED: Juglans microcarpa x Juglans regia cold-regulated 413 plasma membrane protein 2-like (LOC121265293), mRNA', 'evalue': 1.58486e-105, 'identity_fraction': 0.7554806070826307, 'query_coverage': 0.5337533753375338}
{'title': 'gi|2082357255|ref|XM_043119049.1| PREDICTED: Carya illinoinensis cold-regulated 413 plasma membrane protein 2-like (LOC122306609), transcript variant X2, mRNA', 'evalue': 6.73898e-104, 'identity_fraction': 0.7533783783783784, 'query_coverage': 0.5328532853285328}
{'title': 'gi|2518612504|ref|XM_010682658.3| PREDICTED: Beta vulgaris subsp. vulgaris cold-regulated 413 plasma membrane protein 2 (LOC104895996), mRNA', 'evalue': 4.54069e-106, 'identity_fraction': 0.7504187604690117, 'query_coverage': 0.5373537353735374}

Combining identity with query coverage gives a more practical quality signal than e-value alone, especially when deciding which references are suitable for follow-up analysis.

## Exporting BLAST Summaries to CSV

import csv
from Bio.Blast import NCBIXML

# Parse BLAST XML
with open("blast_result.xml", "r", encoding="utf-8") as result_file:
    blast_record = NCBIXML.read(result_file)

query_len = getattr(blast_record, "query_length", None) or getattr(blast_record, "query_letters", None)
if not query_len:
    raise ValueError("Could not determine query length from BLAST record.")

rows = []
for alignment in blast_record.alignments:
    hsp = alignment.hsps[0]
    rows.append(
        {
            "title": alignment.title,
            "evalue": hsp.expect,
            "identities": hsp.identities,
            "align_length": hsp.align_length,
            "identity_fraction": hsp.identities / hsp.align_length,
            "query_coverage": hsp.align_length / query_len,
        }
    )

fieldnames = [
    "title",
    "evalue",
    "identities",
    "align_length",
    "identity_fraction",
    "query_coverage",
]

with open("blast_hits_summary.csv", "w", newline="", encoding="utf-8") as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    if rows:
        writer.writerows(rows)

print(f"Wrote blast_hits_summary.csv with {len(rows)} hit rows")

Wrote blast_hits_summary.csv with 5 hit rows

CSV export makes BLAST results easier to share with collaborators and easy to inspect in spreadsheets or downstream ETL workflows.

## Batch BLAST Queries with Delay and Retry

import time
from Bio.Blast import NCBIWWW

queries = {
    "query_1": "8332116",
    "query_2": "AY851612",
}

for query_id, sequence in queries.items():
    for attempt in range(1, 4):
        try:
            handle = NCBIWWW.qblast(
                program="blastn",
                database="nt",
                sequence=sequence,
                hitlist_size=3,
                format_type="XML",
            )
            with open(f"{query_id}.xml", "w", encoding="utf-8") as f:
                f.write(handle.read())
            handle.close()
            print(f"Saved {query_id}.xml")
            break
        except Exception as exc:
            print(f"{query_id} attempt {attempt} failed: {exc}")
            if attempt == 3:
                raise
            time.sleep(5)
    # Respectful delay between NCBI requests
    time.sleep(2)

Saved query_1.xml

Saved query_2.xml

Batch querying with retry/backoff is a practical pattern for robust automation, reducing failures from transient network or service issues.