# Working with ABI Files in Biopython

ABI files (`.ab1`) are produced by **Sanger DNA sequencing instruments** from companies such as Applied Biosystems. Unlike FASTA files (sequence only) or FASTQ files (sequence plus quality scores), an ABI file holds:

- the DNA sequence
- base quality scores
- raw chromatogram trace data
- sequencing metadata (instrument, run parameters, etc.)

These files are commonly used when analyzing **Sanger sequencing results**, verifying cloned DNA sequences, or checking PCR products.

Biopython provides built-in support for reading ABI files through the `Bio.SeqIO` module. In this tutorial, you'll learn how to:

- download an example ABI file
- read ABI sequencing data
- access sequence and quality scores
- examine metadata stored in the file
- extract chromatogram trace data
- convert ABI data to FASTA or FASTQ

These skills are useful for building automated Sanger sequencing analysis pipelines.

---

## Downloading an Example ABI File

First, let's download a small example ABI file that we can analyze.

```python
import requests

url = "https://raw.githubusercontent.com/biopython/biopython/master/Tests/Abi/3100.ab1"
response = requests.get(url)
response.raise_for_status()

with open("example.ab1", "wb") as f:
    f.write(response.content)

print("Downloaded example.ab1")
```

```
Downloaded example.ab1
```

This code downloads a real ABI sequencing file from the Biopython repository and saves it locally as `example.ab1`. The file contains Sanger sequencing data including the base calls and chromatogram traces.
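Before parsing, it can be worth confirming that the download is really an ABIF binary and not, say, an HTML error page. ABIF files begin with the four ASCII bytes `ABIF`, and the sketch below checks for that signature. The helper name `looks_like_abif` and the two throwaway files are illustrative only; they are not part of Biopython.

```python
import tempfile
from pathlib import Path

ABIF_MAGIC = b"ABIF"  # signature expected at the start of an ABIF (.ab1) file


def looks_like_abif(path):
    """Return True if the file at `path` begins with the ABIF signature."""
    with Path(path).open("rb") as handle:
        return handle.read(len(ABIF_MAGIC)) == ABIF_MAGIC


# Demonstration on two throwaway files (contents are made up for illustration).
with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "good.ab1"
    bad = Path(tmp) / "bad.ab1"
    good.write_bytes(b"ABIF" + b"\x00" * 16)
    bad.write_bytes(b"<html>404 Not Found</html>")
    good_result = looks_like_abif(good)
    bad_result = looks_like_abif(bad)

print(good_result, bad_result)
```

In the tutorial's workflow you would call `looks_like_abif("example.ab1")` right after the download and stop early if it returns `False`.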

---

## Reading an ABI File

Biopython reads ABI files using `SeqIO.read()` with the `"abi"` format.

```python
from Bio import SeqIO

record = SeqIO.read("example.ab1", "abi")

print("Sequence ID:", record.id)
print("Sequence length:", len(record.seq))
print("First 50 bases:", record.seq[:50])
```

```
Sequence ID: 16S_S2_1387R
Sequence length: 795
First 50 bases: CAAGATTGCATTCATGATCTACGATTACTAGCGATTCCAGCTTCATATAG
```

ABI files contain a **single sequencing read**, so `SeqIO.read()` is used instead of `SeqIO.parse()`. The returned object is a `SeqRecord` containing the base-called sequence from the chromatogram.

---

## Accessing Base Quality Scores

The base-calling software also assigns each base a quality score that estimates the confidence of the call.

```python
from Bio import SeqIO

record = SeqIO.read("example.ab1", "abi")

qualities = record.letter_annotations["phred_quality"]

print("Number of quality scores:", len(qualities))
print("First 20 quality scores:", qualities[:20])
```

```
Number of quality scores: 795
First 20 quality scores: [5, 3, 4, 4, 4, 5, 9, 4, 4, 4, 5, 4, 4, 4, 4, 4, 6, 13, 23, 20]
```

The quality scores are stored in `record.letter_annotations["phred_quality"]`. Each number corresponds to the confidence of the base call at the same position in the sequence.

Higher scores indicate more reliable base calls.
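To make "more reliable" concrete: a Phred score Q corresponds to an estimated error probability of 10^(-Q/10), so Q20 means roughly a 1 in 100 chance that the call is wrong. The following sketch applies that formula to the first 20 scores printed above. The function name `phred_to_error_prob` is just a label chosen for this example.

```python
def phred_to_error_prob(q):
    """Convert a Phred quality score Q to an error probability, P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


# First 20 quality scores from the example read shown above.
qualities = [5, 3, 4, 4, 4, 5, 9, 4, 4, 4, 5, 4, 4, 4, 4, 4, 6, 13, 23, 20]

error_probs = [phred_to_error_prob(q) for q in qualities]
expected_errors = sum(error_probs)  # expected number of miscalled bases
high_quality = sum(1 for q in qualities if q >= 20)

print(f"Q20 -> error probability {phred_to_error_prob(20):.2f}")
print("Bases at Q20 or better:", high_quality)
print(f"Expected miscalls in these 20 bases: {expected_errors:.1f}")
```

The low scores at the start of this read are typical of Sanger data, where the first few dozen bases are often poorly resolved.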

---

## Calculating Average Read Quality

You can get a rough estimate of overall sequencing quality by computing the mean Phred score across the read.

```python
from Bio import SeqIO

record = SeqIO.read("example.ab1", "abi")

qualities = record.letter_annotations["phred_quality"]
average_quality = sum(qualities) / len(qualities)

print("Average read quality:", round(average_quality, 2))
```

```
Average read quality: 46.82
```

This calculation can help determine whether the sequencing run produced reliable results or whether trimming might be necessary.
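If trimming does turn out to be necessary, one simple approach is to cut the read back to the first and last base that meet a quality threshold. The sketch below is a deliberately naive version of that idea, not the Mott algorithm that Biopython applies when a file is read with the `"abi-trim"` format. The function name, the threshold of 20, and the sample scores are all assumptions made for illustration.

```python
def trim_low_quality_ends(qualities, threshold=20):
    """Return (start, end) so that qualities[start:end] begins and ends
    with a score >= threshold. Returns (0, 0) if no score qualifies."""
    good = [i for i, q in enumerate(qualities) if q >= threshold]
    if not good:
        return 0, 0
    return good[0], good[-1] + 1


# Made-up scores: noisy start, clean middle, decaying tail.
qualities = [5, 3, 4, 13, 23, 40, 52, 58, 61, 45, 30, 12, 8, 4]
sequence = "CAAGATTGCATTCA"

start, end = trim_low_quality_ends(qualities, threshold=20)
trimmed_seq = sequence[start:end]
trimmed_quals = qualities[start:end]

print("Keep bases", start, "to", end, "->", trimmed_seq)
print("Mean quality before:", round(sum(qualities) / len(qualities), 2))
print("Mean quality after: ", round(sum(trimmed_quals) / len(trimmed_quals), 2))
```

With a real `SeqRecord`, slicing as `record[start:end]` keeps the sequence and its `phred_quality` annotation aligned, so the same indices can be applied directly to the record.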

---

## Exploring ABI Metadata

ABI files contain many additional metadata fields describing the sequencing run.

Biopython stores these in the `record.annotations["abif_raw"]` dictionary.

```python
from Bio import SeqIO

record = SeqIO.read("example.ab1", "abi")

metadata = record.annotations["abif_raw"]

print("Number of metadata entries:", len(metadata))

for key in list(metadata.keys())[:10]:
    print(key)
```

```
Number of metadata entries: 130
AEPt1
AEPt2
APFN2
APXV1
APrN1
APrV1
APrX1
ARTN1
ASPF1
ASPt1
```

The `abif_raw` dictionary stores low-level data extracted from the ABI file structure. These entries include information such as:

- instrument name
- run parameters
- base call data
- trace intensities

Exploring these values can help you understand how the sequencing run was performed.
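A small helper can turn a handful of these entries into a readable run summary. The sketch below assumes the tag meanings given in the ABIF file format specification (`MCHN1` for the instrument name, `MODL1` for the model, `SMPL1` for the sample name) and assumes that text values may arrive as `bytes`, which depends on the Biopython version. The dictionary `fake_raw` is an invented stand-in for `record.annotations["abif_raw"]`, and its values are not taken from the example file.

```python
# Tag meanings assumed from the ABIF specification; adjust for your instrument.
TAGS_OF_INTEREST = {
    "MCHN1": "Instrument name",
    "MODL1": "Instrument model",
    "SMPL1": "Sample name",
}


def describe_run(abif_raw, tags=TAGS_OF_INTEREST):
    """Return {label: text} for the requested tags, skipping absent ones.

    bytes values are decoded to str; anything else goes through str().
    """
    summary = {}
    for tag, label in tags.items():
        if tag not in abif_raw:
            continue
        value = abif_raw[tag]
        if isinstance(value, bytes):
            value = value.decode("ascii", errors="replace")
        summary[label] = str(value)
    return summary


# Invented stand-in for record.annotations["abif_raw"].
fake_raw = {"MCHN1": b"my-sequencer", "MODL1": b"3100", "LANE1": 7}

summary = describe_run(fake_raw)
for label, value in summary.items():
    print(f"{label}: {value}")
```

Tags missing from a file are skipped rather than raising `KeyError`, which matters because different instruments write different sets of tags.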

---

## Accessing Chromatogram Trace Data

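ABI files store the fluorescence signal recorded for each of the four dye channels, one per nucleotide. These traces form the familiar chromatogram peaks from which bases are called. In the ABIF format, `DATA1` to `DATA4` hold the raw signal, while `DATA9` to `DATA12` hold the processed (analyzed) traces that chromatogram viewers normally display.

```python
from Bio import SeqIO

record = SeqIO.read("example.ab1", "abi")

raw_data = record.annotations["abif_raw"]

# DATA9 to DATA12 are the four processed trace channels.
# Which base each channel represents is given by the FWO_1 tag.
channel_1 = raw_data["DATA9"]
channel_2 = raw_data["DATA10"]
channel_3 = raw_data["DATA11"]
channel_4 = raw_data["DATA12"]

print("Trace length:", len(channel_1))
print("First 10 DATA9 values:", channel_1[:10])
```

```
Trace length: 10303
First 10 DATA9 values: (2892, 2897, 2907, 2925, 2951, 2984, 3012, 3030, 3037, 3039)
```

Each trace holds the fluorescence intensity of one dye channel over the course of the run. The trace is much longer than the read (10303 data points for 795 bases) because every base call spans several data points.

The four processed channels are:

- `DATA9` → first base listed in `FWO_1`
- `DATA10` → second base listed in `FWO_1`
- `DATA11` → third base listed in `FWO_1`
- `DATA12` → fourth base listed in `FWO_1`

The channel-to-base assignment is not fixed. The `FWO_1` tag records the order, and a common value is `GATC`, in which case `DATA9` is G, `DATA10` is A, `DATA11` is T, and `DATA12` is C. Check `raw_data["FWO_1"]` for your own file rather than assuming an order.

These signals are used by base-calling software to determine the DNA sequence.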
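Because the base assigned to each of `DATA9` to `DATA12` is recorded in the file's `FWO_1` tag rather than being fixed, it is safer to build the base-to-trace mapping from that tag. The sketch below does this for a dictionary shaped like `abif_raw`. The helper name `traces_by_base` is hypothetical, the trace values are invented, and the `bytes` handling is an assumption about how your Biopython version returns text tags.

```python
TRACE_TAGS = ["DATA9", "DATA10", "DATA11", "DATA12"]  # processed channels 1-4


def traces_by_base(abif_raw):
    """Map each base letter to its processed trace using the FWO_1 order."""
    order = abif_raw["FWO_1"]
    if isinstance(order, bytes):
        # Decode first: iterating over bytes would yield integers, not letters.
        order = order.decode("ascii")
    return {base: abif_raw[tag] for base, tag in zip(order, TRACE_TAGS)}


# Invented stand-in for record.annotations["abif_raw"] with tiny traces.
fake_raw = {
    "FWO_1": b"GATC",
    "DATA9": (10, 50, 10),
    "DATA10": (0, 5, 90),
    "DATA11": (70, 3, 1),
    "DATA12": (2, 2, 2),
}

traces = traces_by_base(fake_raw)
for base in "ACGT":
    print(base, traces[base])
```

Passing `record.annotations["abif_raw"]` instead of `fake_raw` gives a dictionary keyed by `"A"`, `"C"`, `"G"`, and `"T"` that can be handed to a plotting library without hard-coding the channel order.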

---

## Converting ABI Files to FASTA

Sometimes you want to extract just the sequence and store it in FASTA format.

```python
from Bio import SeqIO

record = SeqIO.read("example.ab1", "abi")

SeqIO.write(record, "sequence.fasta", "fasta")
```

```
1
```

This writes the base-called DNA sequence into a FASTA file, which is useful when preparing sequences for alignment or BLAST searches. `SeqIO.write()` returns the number of records written, which is the `1` shown above when the code runs in an interactive session.
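For readers unfamiliar with the format, a FASTA record is simply a header line starting with `>` followed by the sequence wrapped over one or more lines. The sketch below builds such a record by hand from the ID and first 50 bases printed earlier, purely to show the layout. It is not how Biopython implements its writer, and the `width=20` used in the demonstration is chosen only to keep the output short (Biopython wraps at 60 characters by default).

```python
def format_fasta(identifier, sequence, width=60):
    """Return a FASTA-formatted string with the sequence wrapped at `width`."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return f">{identifier}\n" + "\n".join(lines) + "\n"


# ID and first 50 bases of the example read shown earlier.
read_id = "16S_S2_1387R"
first_bases = "CAAGATTGCATTCATGATCTACGATTACTAGCGATTCCAGCTTCATATAG"

fasta_text = format_fasta(read_id, first_bases, width=20)
print(fasta_text, end="")
```

The quality scores and trace data have no place in this format, which is why converting to FASTA discards them.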

---

## Converting ABI Files to FASTQ

You can also convert ABI files to FASTQ format, which includes both the sequence and the quality scores.

```python
from Bio import SeqIO

record = SeqIO.read("example.ab1", "abi")

SeqIO.write(record, "sequence.fastq", "fastq")
```

```
1
```

The resulting FASTQ file preserves both the sequence and the Phred quality values, which can be useful when integrating Sanger reads with next-generation sequencing workflows.
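In the standard (Sanger) FASTQ variant, which is what Biopython's `"fastq"` format name refers to, each quality score is stored as a single ASCII character whose code is the score plus 33. The sketch below shows that encoding and its inverse. The function names are illustrative, and capping scores at 93 so they stay within printable ASCII is a choice made for this example.

```python
PHRED_OFFSET = 33  # Sanger FASTQ stores each score as chr(score + 33)
MAX_SCORE = 93     # chr(93 + 33) == "~", the last printable ASCII character


def encode_phred33(qualities):
    """Encode a list of Phred scores as a FASTQ quality string."""
    return "".join(chr(min(q, MAX_SCORE) + PHRED_OFFSET) for q in qualities)


def decode_phred33(quality_string):
    """Decode a FASTQ quality string back into a list of Phred scores."""
    return [ord(ch) - PHRED_OFFSET for ch in quality_string]


scores = [5, 3, 4, 40]
encoded = encode_phred33(scores)
decoded = decode_phred33(encoded)

print(encoded)
print(decoded)
```

This is why the fourth line of each FASTQ record looks like a run of punctuation and letters: it is the same list found in `record.letter_annotations["phred_quality"]`, written one character per base.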

---

## Inspecting All Available ABI Tags

If you want to see all the available ABI data fields, you can list them.

```python
from Bio import SeqIO

record = SeqIO.read("example.ab1", "abi")

for tag in record.annotations["abif_raw"]:
    print(tag)
```

```
AEPt1
AEPt2
APFN2
APXV1
APrN1
APrV1
APrX1
ARTN1
ASPF1
ASPt1
ASPt2
AUDT1
B1Pt1
B1Pt2
BCTS1
CTID1
CTNM1
CTOw1
CTTL1
CpEP1
DATA1
DATA2
DATA3
DATA4
DATA5
DATA6
DATA7
DATA8
DATA9
DATA10
DATA11
DATA12
DCHT1
DSam1
DySN1
Dye#1
DyeN1
DyeN2
DyeN3
DyeN4
DyeW1
DyeW2
DyeW3
DyeW4
EPVt1
EVNT1
EVNT2
EVNT3
EVNT4
FTab1
FVoc1
FWO_1
Feat1
GTyp1
HCFG1
HCFG2
HCFG3
HCFG4
InSc1
InVt1
LANE1
LAST1
LIMS1
LNTD1
LsrP1
MCHN1
MODF1
MODL1
NAVG1
NLNE1
NOIS1
OfSc1
P1AM1
P1RL1
P1WD1
P2AM1
P2BA1
P2RL1
PBAS1
PBAS2
PCON1
PCON2
PDMF1
PDMF2
PLOC1
PLOC2
PSZE1
PTYP1
PXLB1
RGNm1
RGOw1
RMXV1
RMdN1
RMdV1
RMdX1
RPrN1
RPrV1
RUND1
RUND2
RUND3
RUND4
RUNT1
RUNT2
RUNT3
RUNT4
Rate1
RunN1
S/N%1
SCAN1
SMED1
SMLt1
SMPL1
SPAC1
SPAC2
SPAC3
SVER1
SVER2
SVER3
Satd1
Scal1
Scan1
TUBE1
Tmpr1
User1
phAR1
phCH1
phDY1
phQL1
phTR1
phTR2
```

This will display all ABI tags stored in the file. Different sequencing instruments may include different tags.
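Each key in this listing is a four-character ABIF tag name followed by a tag number, so `DATA12` is entry 12 of the `DATA` tag and `S/N%1` is entry 1 of `S/N%`. Grouping keys by name makes a long listing easier to scan. The sketch below does this for a short hand-picked subset of the keys above. The helper names are hypothetical.

```python
from collections import defaultdict


def split_tag(key):
    """Split a key such as 'DATA12' into its name and number: ('DATA', 12)."""
    return key[:4], int(key[4:])


def group_tags(keys):
    """Group keys by four-character tag name, with sorted tag numbers."""
    groups = defaultdict(list)
    for key in keys:
        name, number = split_tag(key)
        groups[name].append(number)
    return {name: sorted(numbers) for name, numbers in groups.items()}


# A few keys taken from the listing above.
keys = ["DATA1", "DATA10", "DATA9", "PBAS1", "PBAS2", "S/N%1", "FWO_1"]

grouped = group_tags(keys)
for name, numbers in grouped.items():
    print(name, numbers)
```

Calling `group_tags(record.annotations["abif_raw"])` on a real record condenses the full listing into one line per tag name.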

---

## Conclusion

ABI files contain rich sequencing information including the base-called DNA sequence, quality scores, chromatogram traces, and instrument metadata. Biopython makes it easy to access and analyze all of this information directly from Python.

In this tutorial you learned how to:

- read `.ab1` ABI sequencing files
- access sequences and Phred quality scores
- inspect metadata from the sequencing run
- extract chromatogram trace data
- convert ABI files to FASTA or FASTQ

These techniques are useful when analyzing **Sanger sequencing data**, verifying DNA constructs, or building automated analysis tools for sequencing workflows.

For more advanced use cases, you can combine these techniques with plotting libraries to visualize chromatograms or integrate the sequences into alignment and variant detection pipelines.