ABI files (`.ab1`) are produced by **Sanger DNA sequencing instruments** from companies such as Applied Biosystems. Unlike FASTA or FASTQ files, ABI files contain much more information:

- the DNA sequence
- base quality scores
- raw chromatogram trace data
- sequencing metadata (instrument, run parameters, etc.)

These files are commonly used when analyzing **Sanger sequencing results**, verifying cloned DNA sequences, or checking PCR products.

Biopython provides built-in support for reading ABI files through the `Bio.SeqIO` module.

In this tutorial, you'll learn how to:

- download an example ABI file
- read ABI sequencing data
- access sequence and quality scores
- examine metadata stored in the file
- extract chromatogram trace data
- convert ABI data to FASTA or FASTQ

These skills are useful for building automated Sanger sequencing analysis pipelines.

---

## Downloading an Example ABI File

First, let's download a small example ABI file that we can analyze.
```python
import requests

url = "https://raw.githubusercontent.com/biopython/biopython/master/Tests/Abi/3100.ab1"
response = requests.get(url)
response.raise_for_status()  # stop here if the download failed

# ABI files are binary, so write the raw bytes
with open("example.ab1", "wb") as f:
    f.write(response.content)

print("Downloaded example.ab1")
```

```
Downloaded example.ab1
```
This code downloads a real ABI sequencing file from the Biopython repository and saves it locally as `example.ab1`. The file contains Sanger sequencing data including the base calls and chromatogram traces.

---

## Reading an ABI File

Biopython reads ABI files using `SeqIO.read()` with the `"abi"` format.
```python
from Bio import SeqIO

record = SeqIO.read("example.ab1", "abi")

print("Sequence ID:", record.id)
print("Sequence length:", len(record.seq))
print("First 50 bases:", record.seq[:50])
```

```
Sequence ID: 16S_S2_1387R
Sequence length: 795
First 50 bases: CAAGATTGCATTCATGATCTACGATTACTAGCGATTCCAGCTTCATATAG
```
ABI files contain a **single sequencing read**, so `SeqIO.read()` is used instead of `SeqIO.parse()`. The returned object is a `SeqRecord` containing the base-called sequence from the chromatogram.

---

## Accessing Base Quality Scores

Sanger sequencing also produces quality scores that estimate the confidence of each base call.
```python
from Bio import SeqIO

record = SeqIO.read("example.ab1", "abi")

# One PHRED score per base, in the same order as record.seq
qualities = record.letter_annotations["phred_quality"]

print("Number of quality scores:", len(qualities))
print("First 20 quality scores:", qualities[:20])
```

```
Number of quality scores: 795
First 20 quality scores: [5, 3, 4, 4, 4, 5, 9, 4, 4, 4, 5, 4, 4, 4, 4, 4, 6, 13, 23, 20]
```
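A PHRED score Q corresponds to an estimated error probability of 10^(-Q/10), so Q20 means roughly a 1-in-100 chance that the base call is wrong. A minimal sketch of that conversion follows; the helper name `phred_to_error_prob` is made up for illustration, and the sample scores are the first 20 values printed above.

```python
def phred_to_error_prob(q):
    """Convert a PHRED quality score to an estimated error probability."""
    return 10 ** (-q / 10)

# First 20 quality scores from the example read (copied from the output above)
first_scores = [5, 3, 4, 4, 4, 5, 9, 4, 4, 4, 5, 4, 4, 4, 4, 4, 6, 13, 23, 20]

# Print each distinct score once, lowest first
for q in sorted(set(first_scores)):
    print(f"Q{q}: error probability ~ {phred_to_error_prob(q):.4f}")
```

Read this way, the single-digit scores at the start of the read are closer to guesses than to confident calls.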
The quality scores are stored in `record.letter_annotations["phred_quality"]`. Each number corresponds to the confidence of the base call at the same position in the sequence. Higher scores indicate more reliable base calls.

---

## Calculating Average Read Quality

You can quickly estimate overall sequencing quality by computing the average PHRED score.
```python
from Bio import SeqIO

record = SeqIO.read("example.ab1", "abi")
qualities = record.letter_annotations["phred_quality"]

# Arithmetic mean of the per-base PHRED scores
average_quality = sum(qualities) / len(qualities)

print("Average read quality:", round(average_quality, 2))
```

```
Average read quality: 46.82
```
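An average can hide low-quality stretches at the ends of a read. The sketch below finds the span between the first and last base that meet a minimum score; `trim_bounds` and the cutoff of 20 are illustrative assumptions, not a Biopython function or a recommended setting. Biopython also accepts an `"abi-trim"` format name that applies its own quality trimming while parsing.

```python
def trim_bounds(qualities, min_quality=20):
    """Return (start, end) so that qualities[start:end] begins and ends
    with a score >= min_quality. Returns (0, 0) if no score qualifies."""
    passing = [i for i, q in enumerate(qualities) if q >= min_quality]
    if not passing:
        return 0, 0
    return passing[0], passing[-1] + 1

# First 20 quality scores of the example read
scores = [5, 3, 4, 4, 4, 5, 9, 4, 4, 4, 5, 4, 4, 4, 4, 4, 6, 13, 23, 20]
start, end = trim_bounds(scores)
print(start, end)  # 18 20
```

With a parsed record, `record[start:end]` would slice the sequence and its quality scores together, because slicing a `SeqRecord` keeps the per-letter annotations aligned.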
The average quality can help determine whether the sequencing run produced reliable results or whether trimming might be necessary.

---

## Exploring ABI Metadata

ABI files contain many additional metadata fields describing the sequencing run. Biopython stores these in the `record.annotations["abif_raw"]` dictionary.
```python
from Bio import SeqIO

record = SeqIO.read("example.ab1", "abi")
metadata = record.annotations["abif_raw"]

print("Number of metadata entries:", len(metadata))

# Show the first 10 tag names
for key in list(metadata.keys())[:10]:
    print(key)
```

```
Number of metadata entries: 130
AEPt1
AEPt2
APFN2
APXV1
APrN1
APrV1
APrX1
ARTN1
ASPF1
ASPt1
```
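Many `abif_raw` values are text fields, and depending on the Biopython version they may come back as `bytes` rather than `str`. The sketch below normalises such values for display. `as_text` is a hypothetical helper, the sample dictionary is invented stand-in data (not values from `example.ab1`), and the tag meanings in the comments are assumptions based on the ABIF format's usual conventions.

```python
def as_text(value):
    """Return bytes values as str (undecodable bytes are replaced);
    leave every other type unchanged."""
    if isinstance(value, bytes):
        return value.decode("utf-8", errors="replace")
    return value

# Invented stand-in for record.annotations["abif_raw"]
sample_raw = {
    "MCHN1": b"demo-machine",  # assumed: instrument (machine) name
    "MODL1": b"3100",          # assumed: instrument model
    "SMPL1": "demo-sample",    # assumed: sample name
    "LANE1": 4,                # assumed: lane or capillary number
}

for tag in ("MCHN1", "MODL1", "SMPL1", "LANE1"):
    print(tag, "=", as_text(sample_raw.get(tag)))
```

With a real record, pass `record.annotations["abif_raw"]` in place of `sample_raw`; using `.get()` avoids a `KeyError` for tags that a given instrument does not write.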
The `abif_raw` dictionary stores low-level data extracted from the ABI file structure. These entries include information such as:

- instrument name
- run parameters
- base call data
- trace intensities

Exploring these values can help you understand how the sequencing run was performed.

---

## Accessing Chromatogram Trace Data

ABI files store the fluorescence signal for each of the four dye channels, one per nucleotide (A, C, G, T). In `abif_raw`, the tags `DATA9` to `DATA12` hold the processed traces, which form the familiar chromatogram peaks used to determine base calls.
```python
from Bio import SeqIO

record = SeqIO.read("example.ab1", "abi")
raw_data = record.annotations["abif_raw"]

# DATA9 to DATA12 hold the four trace channels. Which base each channel
# represents is given by the FWO_1 tag, so the names here stay neutral.
channel_1 = raw_data["DATA9"]
channel_2 = raw_data["DATA10"]
channel_3 = raw_data["DATA11"]
channel_4 = raw_data["DATA12"]

print("Trace length:", len(channel_1))
print("First 10 DATA9 values:", channel_1[:10])
```

```
Trace length: 10303
First 10 DATA9 values: (2892, 2897, 2907, 2925, 2951, 2984, 3012, 3030, 3037, 3039)
```
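Which base each of `DATA9` to `DATA12` carries is not fixed by the tag number; ABIF files record the base order in the `FWO_1` tag. A minimal sketch of pairing traces with bases under that assumption follows. `traces_by_base` is a hypothetical helper, and the dictionary is invented stand-in data rather than values from `example.ab1`.

```python
def traces_by_base(raw):
    """Map each base letter to its trace, using the order stored in FWO_1.

    Assumes FWO_1 is a four-letter string (or bytes) such as "GATC" and
    that DATA9..DATA12 hold the four traces in that order.
    """
    order = raw["FWO_1"]
    if isinstance(order, bytes):
        order = order.decode("ascii")
    channels = ("DATA9", "DATA10", "DATA11", "DATA12")
    return {base: raw[tag] for base, tag in zip(order, channels)}

# Invented stand-in for record.annotations["abif_raw"]
sample_raw = {
    "FWO_1": b"GATC",
    "DATA9": (1, 2),
    "DATA10": (3, 4),
    "DATA11": (5, 6),
    "DATA12": (7, 8),
}

traces = traces_by_base(sample_raw)
print(traces["A"])  # (3, 4)
```

Looking traces up by base letter keeps plotting or peak-inspection code correct even if a file uses a different channel order.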
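Each trace holds the fluorescence intensity detected in one dye channel during sequencing. The four channels are stored under these tags:

- `DATA9` → channel 1
- `DATA10` → channel 2
- `DATA11` → channel 3
- `DATA12` → channel 4

Which nucleotide each channel represents is recorded in the file's `FWO_1` tag (for example, a value of `GATC` would mean `DATA9` is the G channel), so check that tag rather than assuming a fixed A, C, G, T order. These signals are used by base-calling software to determine the DNA sequence.

---

## Converting ABI Files to FASTA

Sometimes you want to extract just the sequence and store it in FASTA format.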
```python
from Bio import SeqIO

record = SeqIO.read("example.ab1", "abi")

# SeqIO.write() returns the number of records written
SeqIO.write(record, "sequence.fasta", "fasta")
```

```
1
```
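When a sequencing run returns many reads, the same conversion can be applied to a whole folder. The sketch below is one way to do that under stated assumptions: `fasta_path_for` and `convert_folder` are hypothetical helpers, each output file is written next to its input, and Biopython is imported inside the function so the path logic can be tried without it.

```python
from pathlib import Path

def fasta_path_for(ab1_path):
    """Return the output path: same folder and stem, with a .fasta suffix."""
    return Path(ab1_path).with_suffix(".fasta")

def convert_folder(folder):
    """Convert every .ab1 file in folder to FASTA; return the paths written."""
    from Bio import SeqIO  # imported here so fasta_path_for works without Biopython

    written = []
    for ab1_path in sorted(Path(folder).glob("*.ab1")):
        record = SeqIO.read(str(ab1_path), "abi")
        out_path = fasta_path_for(ab1_path)
        SeqIO.write(record, str(out_path), "fasta")
        written.append(out_path)
    return written

# Path logic only; no files are read or written here
print(fasta_path_for("reads/sample_01.ab1"))
```

Calling `convert_folder(".")` in the tutorial's working directory would convert `example.ab1` along with any other `.ab1` files found there.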
`SeqIO.write()` writes the base-called DNA sequence into a FASTA file and returns the number of records written (here, `1`). This is useful when preparing sequences for alignment or BLAST searches.

---

## Converting ABI Files to FASTQ

You can also convert ABI files to FASTQ format, which includes both the sequence and the quality scores.
```python
from Bio import SeqIO

record = SeqIO.read("example.ab1", "abi")

# The "fastq" format stores the PHRED scores alongside the sequence
SeqIO.write(record, "sequence.fastq", "fastq")
```

```
1
```
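In the FASTQ variant that Biopython calls `"fastq"` (Sanger style), each quality score is stored as a single character whose ASCII code is the score plus 33. The sketch below shows that encoding by hand; `encode_phred33` and `decode_phred33` are hypothetical helpers for illustration, since `SeqIO.write()` already handles this internally.

```python
def encode_phred33(qualities):
    """Encode PHRED scores as a Sanger-style FASTQ quality string (offset 33)."""
    return "".join(chr(q + 33) for q in qualities)

def decode_phred33(quality_string):
    """Decode a Sanger-style FASTQ quality string back to PHRED scores."""
    return [ord(ch) - 33 for ch in quality_string]

encoded = encode_phred33([5, 3, 4, 40])
print(encoded)                  # &$%I
print(decode_phred33(encoded))  # [5, 3, 4, 40]
```

This is why the fourth line of each FASTQ entry looks like punctuation: low scores map to symbols near the start of the printable ASCII range.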
The resulting FASTQ file preserves both the sequence and PHRED quality values, which can be useful when integrating Sanger reads with next-generation sequencing workflows.

---

## Inspecting All Available ABI Tags

If you want to see all the available ABI data fields, you can list them.
```python
from Bio import SeqIO

record = SeqIO.read("example.ab1", "abi")

# Iterating over the dictionary yields its keys (the tag names)
for tag in record.annotations["abif_raw"]:
    print(tag)
```

```
AEPt1
AEPt2
APFN2
APXV1
APrN1
APrV1
APrX1
ARTN1
ASPF1
ASPt1
ASPt2
AUDT1
B1Pt1
B1Pt2
BCTS1
CTID1
CTNM1
CTOw1
CTTL1
CpEP1
DATA1
DATA2
DATA3
DATA4
DATA5
DATA6
DATA7
DATA8
DATA9
DATA10
DATA11
DATA12
DCHT1
DSam1
DySN1
Dye#1
DyeN1
DyeN2
DyeN3
DyeN4
DyeW1
DyeW2
DyeW3
DyeW4
EPVt1
EVNT1
EVNT2
EVNT3
EVNT4
FTab1
FVoc1
FWO_1
Feat1
GTyp1
HCFG1
HCFG2
HCFG3
HCFG4
InSc1
InVt1
LANE1
LAST1
LIMS1
LNTD1
LsrP1
MCHN1
MODF1
MODL1
NAVG1
NLNE1
NOIS1
OfSc1
P1AM1
P1RL1
P1WD1
P2AM1
P2BA1
P2RL1
PBAS1
PBAS2
PCON1
PCON2
PDMF1
PDMF2
PLOC1
PLOC2
PSZE1
PTYP1
PXLB1
RGNm1
RGOw1
RMXV1
RMdN1
RMdV1
RMdX1
RPrN1
RPrV1
RUND1
RUND2
RUND3
RUND4
RUNT1
RUNT2
RUNT3
RUNT4
Rate1
RunN1
S/N%1
SCAN1
SMED1
SMLt1
SMPL1
SPAC1
SPAC2
SPAC3
SVER1
SVER2
SVER3
Satd1
Scal1
Scan1
TUBE1
Tmpr1
User1
phAR1
phCH1
phDY1
phQL1
phTR1
phTR2
```
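Each key in `abif_raw` is a four-character tag name followed by a tag number, which is why names such as `DATA` and `RUND` appear several times. The sketch below groups keys by name; `split_tag` and `group_tags` are hypothetical helpers, and the sample list is a small subset of the tags printed above.

```python
from collections import defaultdict

def split_tag(key):
    """Split a key such as "DATA12" into its name ("DATA") and number (12)."""
    return key[:4], int(key[4:])

def group_tags(keys):
    """Group tag numbers by their four-character tag name."""
    groups = defaultdict(list)
    for key in keys:
        name, number = split_tag(key)
        groups[name].append(number)
    return dict(groups)

# A few keys taken from the listing above
sample_keys = ["DATA9", "DATA10", "DATA11", "DATA12", "FWO_1", "S/N%1", "RUND1", "RUND2"]
print(group_tags(sample_keys))
```

Passing `record.annotations["abif_raw"]` instead of `sample_keys` would give a compact overview of the file's contents.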
Iterating over `abif_raw` displays every ABI tag stored in the file. Different sequencing instruments may include different tags.

---

## Conclusion

ABI files contain rich sequencing information including the base-called DNA sequence, quality scores, chromatogram traces, and instrument metadata. Biopython makes it easy to access and analyze all of this information directly from Python.

In this tutorial you learned how to:

- read `.ab1` ABI sequencing files
- access sequences and PHRED quality scores
- inspect metadata from the sequencing run
- extract chromatogram trace data
- convert ABI files to FASTA or FASTQ

These techniques are useful when analyzing **Sanger sequencing data**, verifying DNA constructs, or building automated analysis tools for sequencing workflows.

For more advanced use cases, you can combine these techniques with plotting libraries to visualize chromatograms or integrate the sequences into alignment and variant detection pipelines.