# Working with FASTQ Files in Biopython

If you work with sequencing data, FASTQ files are everywhere. Unlike [FASTA files](/tutorials/biopython-fasta-files), FASTQ files store both the nucleotide sequence and a quality score for every base. That makes them essential in real bioinformatics workflows, because quality scores help you decide whether reads are reliable enough for downstream analysis.

In this tutorial, you will learn how to use **Biopython** to work with FASTQ files in Python. You will see how to:

* download a sample FASTQ file
* read sequencing reads with `SeqIO`
* inspect sequences and quality scores
* compute simple statistics
* filter reads by length or quality
* write filtered reads back to a new FASTQ file

Biopython makes this much easier than parsing FASTQ text by hand, and the same ideas you learn here apply to larger sequencing projects.

## Downloading an Example FASTQ File

First, let's download a small example FASTQ file that we can use in the rest of the tutorial.

```python
import requests

url = "https://raw.githubusercontent.com/biopython/biopython/master/Tests/Quality/example.fastq"
response = requests.get(url)
response.raise_for_status()

with open("example.fastq", "wb") as f:
    f.write(response.content)

print("Downloaded example.fastq")
```

```
Downloaded example.fastq
```

This code uses `requests` to fetch a FASTQ file from the Biopython GitHub repository and saves it as `example.fastq` in the current folder. The call to `raise_for_status()` raises an exception if the server returned an HTTP error status (4xx or 5xx), so the file is written only when the download succeeded.

## Reading a FASTQ File

The most common way to read FASTQ files in Biopython is with `Bio.SeqIO.parse()`.

```python
import requests
from Bio import SeqIO

url = "https://raw.githubusercontent.com/biopython/biopython/master/Tests/Quality/example.fastq"
response = requests.get(url)
response.raise_for_status()

with open("example.fastq", "wb") as f:
    f.write(response.content)

for record in SeqIO.parse("example.fastq", "fastq"):
    print(record.id)
```

```
EAS54_6_R1_2_1_413_324
EAS54_6_R1_2_1_540_792
EAS54_6_R1_2_1_443_348
```

`SeqIO.parse()` reads the file one record at a time. Each `record` is a `SeqRecord` object containing the read ID, the sequence, and the per-base quality scores. Using an iterator like this is memory-efficient, which is important when FASTQ files are large.

## Accessing the Sequence and Quality Scores

A FASTQ record contains more than just the sequence. You can also access the quality values stored for each base.

```python
import requests
from Bio import SeqIO

url = "https://raw.githubusercontent.com/biopython/biopython/master/Tests/Quality/example.fastq"
response = requests.get(url)
response.raise_for_status()

with open("example.fastq", "wb") as f:
    f.write(response.content)

for record in SeqIO.parse("example.fastq", "fastq"):
    print("ID:", record.id)
    print("Sequence:", record.seq)
    print("Length:", len(record.seq))
    print("Quality scores:", record.letter_annotations["phred_quality"])
    break
```

```
ID: EAS54_6_R1_2_1_413_324
Sequence: CCCTTCTTGTCTTCAGCGTTTCTCC
Length: 25
Quality scores: [26, 26, 18, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 22, 26, 26, 26, 26, 26, 26, 26, 23, 23]
```

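The sequence is available through `record.seq`. The quality scores are stored in `record.letter_annotations["phred_quality"]` as a list of integers with the same length as the sequence: each integer is the PHRED quality score for the base at the same position. The format name `"fastq"` tells Biopython to assume the standard Sanger encoding, in which a quality character's ASCII code minus 33 gives the score.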
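To make the encoding concrete, here is a minimal sketch of how a quality string maps to those integers, assuming the Sanger convention (ASCII offset 33). The helper names `decode_phred` and `error_probability` are illustrative and not part of Biopython, and the quality string is reconstructed from the scores printed above.

```python
def decode_phred(quality_string, offset=33):
    """Convert a FASTQ quality string to a list of integer Phred scores."""
    return [ord(char) - offset for char in quality_string]


def error_probability(phred_score):
    """Estimated probability that a base call is wrong: P = 10^(-Q/10)."""
    return 10 ** (-phred_score / 10)


# Quality string equivalent to the scores printed above (Sanger, offset 33).
quality_string = ";;3" + ";" * 12 + "7" + ";" * 7 + "88"

scores = decode_phred(quality_string)
print(scores[:3])  # [26, 26, 18]
print(round(error_probability(20), 4))  # 0.01
print(round(error_probability(30), 4))  # 0.001
```

A score of 20 therefore corresponds to roughly one expected error in 100 base calls, and a score of 30 to one in 1,000.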

## Looking at the Average Quality of a Read

A useful first step in read quality control is computing the average quality score for each read.

```python
import requests
from Bio import SeqIO

url = "https://raw.githubusercontent.com/biopython/biopython/master/Tests/Quality/example.fastq"
response = requests.get(url)
response.raise_for_status()

with open("example.fastq", "wb") as f:
    f.write(response.content)

for record in SeqIO.parse("example.fastq", "fastq"):
    qualities = record.letter_annotations["phred_quality"]
    average_quality = sum(qualities) / len(qualities)
    print(record.id, "average quality =", round(average_quality, 2))
```

```
EAS54_6_R1_2_1_413_324 average quality = 25.28
EAS54_6_R1_2_1_540_792 average quality = 24.52
EAS54_6_R1_2_1_443_348 average quality = 23.4
```

This code loops through every read, extracts its PHRED scores, and calculates the mean. Reads with low average quality may need to be removed or trimmed before downstream analysis.
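As a rough illustration of trimming, the sketch below finds how many bases to keep when low-quality bases are removed from the 3' end of a read. The function name `trim_point`, the threshold of 20, and the example scores are assumptions chosen for illustration; dedicated trimming tools use more robust algorithms.

```python
def trim_point(qualities, min_quality=20):
    """Return the number of leading bases to keep after dropping
    trailing bases whose Phred score is below min_quality."""
    end = len(qualities)
    while end > 0 and qualities[end - 1] < min_quality:
        end -= 1
    return end


# Hypothetical read whose quality drops off toward the 3' end.
qualities = [30, 32, 31, 28, 25, 19, 12, 8]
keep = trim_point(qualities)
print(keep)              # 5
print(qualities[:keep])  # [30, 32, 31, 28, 25]
```

With a `SeqRecord`, slicing as `record[:keep]` trims the sequence and its per-base quality scores together.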

## Counting Reads in a FASTQ File

Another common task is finding out how many reads are in a file.

```python
import requests
from Bio import SeqIO

url = "https://raw.githubusercontent.com/biopython/biopython/master/Tests/Quality/example.fastq"
response = requests.get(url)
response.raise_for_status()

with open("example.fastq", "wb") as f:
    f.write(response.content)

count = 0
for record in SeqIO.parse("example.fastq", "fastq"):
    count += 1

print("Number of reads:", count)
```

```
Number of reads: 3
```

This code counts how many `SeqRecord` objects are produced by `SeqIO.parse()`. Because records are consumed one at a time and then discarded, memory use stays small regardless of file size, so the same pattern works for larger files too.
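If you only need a count, building a full `SeqRecord` for every read is more work than necessary. One alternative is Biopython's lower-level `FastqGeneralIterator`, which yields plain `(title, sequence, quality)` string tuples. The sketch below uses two made-up reads in an in-memory handle so that it runs without a download; the helper name `count_reads` is illustrative.

```python
from io import StringIO

from Bio.SeqIO.QualityIO import FastqGeneralIterator

# Two hypothetical reads, used here instead of a file on disk.
fastq_text = "@read1\nACGT\n+\nIIII\n@read2\nGGCCA\n+\nIIIII\n"


def count_reads(handle):
    """Count FASTQ records without creating SeqRecord objects."""
    return sum(1 for _ in FastqGeneralIterator(handle))


print(count_reads(StringIO(fastq_text)))  # 2
```

For a real file, you would pass an open text handle, for example `with open("example.fastq") as handle: count_reads(handle)`.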

## Calculating Basic FASTQ Statistics

You often want summary information such as the number of reads, total bases, and average read length.

```python
import requests
from Bio import SeqIO

url = "https://raw.githubusercontent.com/biopython/biopython/master/Tests/Quality/example.fastq"
response = requests.get(url)
response.raise_for_status()

with open("example.fastq", "wb") as f:
    f.write(response.content)

read_count = 0
total_bases = 0

for record in SeqIO.parse("example.fastq", "fastq"):
    read_count += 1
    total_bases += len(record.seq)

average_length = total_bases / read_count if read_count > 0 else 0

print("Reads:", read_count)
print("Total bases:", total_bases)
print("Average read length:", round(average_length, 2))
```

```
Reads: 3
Total bases: 75
Average read length: 25.0
```

This example keeps running totals as it reads the file. That lets you calculate summary statistics without storing all reads in memory at once.

## Filtering Reads by Length

Sometimes you want to keep only reads above a certain length threshold.

```python
import requests
from Bio import SeqIO

url = "https://raw.githubusercontent.com/biopython/biopython/master/Tests/Quality/example.fastq"
response = requests.get(url)
response.raise_for_status()

with open("example.fastq", "wb") as f:
    f.write(response.content)

min_length = 25
filtered_reads = []

for record in SeqIO.parse("example.fastq", "fastq"):
    if len(record.seq) >= min_length:
        filtered_reads.append(record)

print("Reads kept:", len(filtered_reads))
```

```
Reads kept: 3
```

This code checks the length of each sequence and stores only the reads that meet the minimum length requirement. All three example reads are exactly 25 bases long, so all of them pass. In real datasets, this is often part of an initial cleanup step.

## Filtering Reads by Average Quality

You can also filter reads based on their average quality score.

```python
import requests
from Bio import SeqIO

url = "https://raw.githubusercontent.com/biopython/biopython/master/Tests/Quality/example.fastq"
response = requests.get(url)
response.raise_for_status()

with open("example.fastq", "wb") as f:
    f.write(response.content)

min_avg_quality = 30
high_quality_reads = []

for record in SeqIO.parse("example.fastq", "fastq"):
    qualities = record.letter_annotations["phred_quality"]
    average_quality = sum(qualities) / len(qualities)
    if average_quality >= min_avg_quality:
        high_quality_reads.append(record)

print("High-quality reads kept:", len(high_quality_reads))
```

```
High-quality reads kept: 0
```

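Here, the code calculates the average PHRED score for each read and keeps only reads at or above the chosen threshold. None of the example reads reaches an average of 30 (the highest is 25.28), so nothing is kept; a lower threshold such as 20 would retain all three. This is a simple quality-control strategy for identifying more reliable reads.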
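A mean can hide a few very poor bases, so some pipelines use a different criterion. As one possible alternative (an assumption, not something the code above does), the sketch below measures the fraction of bases at or above a threshold. The function name `fraction_at_least` is hypothetical, and the scores are those of the first read in `example.fastq` as printed earlier.

```python
def fraction_at_least(qualities, min_quality):
    """Return the fraction of bases with Phred score >= min_quality."""
    if not qualities:
        return 0.0
    passing = sum(1 for q in qualities if q >= min_quality)
    return passing / len(qualities)


# Phred scores of the first read in example.fastq.
qualities = [26, 26, 18] + [26] * 12 + [22] + [26] * 7 + [23, 23]

print(fraction_at_least(qualities, 20))  # 0.96
print(fraction_at_least(qualities, 30))  # 0.0
```

A filter could then keep a read only if, say, at least 90% of its bases reach the threshold; the right cutoff depends on the dataset.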

## Writing Filtered Reads to a New FASTQ File

After filtering, you can save the remaining reads as a new FASTQ file.

```python
import requests
from Bio import SeqIO

url = "https://raw.githubusercontent.com/biopython/biopython/master/Tests/Quality/example.fastq"
response = requests.get(url)
response.raise_for_status()

with open("example.fastq", "wb") as f:
    f.write(response.content)

filtered_reads = []

for record in SeqIO.parse("example.fastq", "fastq"):
    qualities = record.letter_annotations["phred_quality"]
    average_quality = sum(qualities) / len(qualities)
    if average_quality >= 30:
        filtered_reads.append(record)

written = SeqIO.write(filtered_reads, "filtered.fastq", "fastq")
print("Reads written:", written)
```

```
Reads written: 0
```

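`SeqIO.write()` accepts any iterable of `SeqRecord` objects (a list or an iterator), writes them in FASTQ format, and returns the number of records written. Here that number is 0 because no example read reaches the average-quality threshold of 30, so `filtered.fastq` is created but empty. With a threshold that some reads pass, this is how you create a cleaned dataset for later analysis.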
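The filter can also be written as a generator, so that reads stream from input to output instead of being collected in a list first. The following is a minimal sketch of that idea; the function name `filter_by_mean_quality`, the two toy reads, and the temporary output path are all illustrative.

```python
import os
import tempfile
from io import StringIO

from Bio import SeqIO

# Two hypothetical reads: "good" has Phred 40 everywhere ("I"),
# "poor" has Phred 2 everywhere ("#").
fastq_text = "@good\nACGTACGT\n+\nIIIIIIII\n@poor\nACGTACGT\n+\n########\n"


def filter_by_mean_quality(records, min_avg_quality):
    """Yield only records whose mean Phred score meets the threshold."""
    for record in records:
        qualities = record.letter_annotations["phred_quality"]
        if qualities and sum(qualities) / len(qualities) >= min_avg_quality:
            yield record


records = SeqIO.parse(StringIO(fastq_text), "fastq")
output_path = os.path.join(tempfile.mkdtemp(), "filtered.fastq")

# SeqIO.write consumes the generator one record at a time.
written = SeqIO.write(filter_by_mean_quality(records, 30), output_path, "fastq")
print("Reads written:", written)  # Reads written: 1
```

Only one record is held in memory at any moment, which matters once input files grow to millions of reads.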

## Converting FASTQ Records to FASTA

Sometimes you want the sequences but not the quality scores. In that case, you can convert a FASTQ file to [FASTA](/tutorials/biopython-fasta-files).

```python
import requests
from Bio import SeqIO

url = "https://raw.githubusercontent.com/biopython/biopython/master/Tests/Quality/example.fastq"
response = requests.get(url)
response.raise_for_status()

with open("example.fastq", "wb") as f:
    f.write(response.content)

records = SeqIO.parse("example.fastq", "fastq")
written = SeqIO.write(records, "example.fasta", "fasta")

print("FASTA records written:", written)
```

```
FASTA records written: 3
```

This code reads FASTQ records and writes them out in FASTA format. The sequence IDs and sequences are preserved, but the quality scores are not included because FASTA does not support them.
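For a straight format conversion, Biopython also provides `SeqIO.convert()`, which combines the parse and write steps and returns the number of records converted. A minimal sketch, using a hypothetical two-read file written to a temporary directory so that the example runs offline:

```python
import os
import tempfile

from Bio import SeqIO

# Write a small hypothetical FASTQ file to a temporary directory.
workdir = tempfile.mkdtemp()
fastq_path = os.path.join(workdir, "toy.fastq")
fasta_path = os.path.join(workdir, "toy.fasta")

with open(fastq_path, "w") as handle:
    handle.write("@read1\nACGT\n+\nIIII\n@read2\nGGCCA\n+\nIIIII\n")

# Convert in one call: input file, input format, output file, output format.
converted = SeqIO.convert(fastq_path, "fastq", fasta_path, "fasta")
print("FASTA records written:", converted)  # FASTA records written: 2
```

With the tutorial's file, the equivalent call would be `SeqIO.convert("example.fastq", "fastq", "example.fasta", "fasta")`.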

## Working with Compressed FASTQ Files

Real sequencing data is often stored in `.fastq.gz` files to save space. Python's `gzip` module works well with Biopython for this.

```python
import gzip
import requests
from Bio import SeqIO

url = "https://raw.githubusercontent.com/biopython/biopython/master/Tests/Quality/example.fastq"
response = requests.get(url)
response.raise_for_status()

with gzip.open("example.fastq.gz", "wb") as f:
    f.write(response.content)

with gzip.open("example.fastq.gz", "rt") as handle:
    for record in SeqIO.parse(handle, "fastq"):
        print(record.id, len(record.seq))
```

```
EAS54_6_R1_2_1_413_324 25
EAS54_6_R1_2_1_540_792 25
EAS54_6_R1_2_1_443_348 25
```

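This example downloads a FASTQ file, saves it in gzipped form, and then reads it back using `gzip.open()` with mode `"rt"`. Text mode matters here: the FASTQ parser expects a text handle, and `gzip.open()` returns bytes unless you ask for text. This is a common pattern for handling compressed sequencing data.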
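The same approach works in the other direction: opening the gzip file in text mode for writing (`"wt"`) lets `SeqIO.write()` produce compressed output directly. A minimal sketch, with made-up reads and a temporary output path used purely for illustration:

```python
import gzip
import os
import tempfile
from io import StringIO

from Bio import SeqIO

# Two hypothetical reads held in memory.
fastq_text = "@read1\nACGT\n+\nIIII\n@read2\nGGCCA\n+\nIIIII\n"
records = list(SeqIO.parse(StringIO(fastq_text), "fastq"))

gz_path = os.path.join(tempfile.mkdtemp(), "toy.fastq.gz")

# "wt" opens the gzip file as text, which is what SeqIO.write expects.
with gzip.open(gz_path, "wt") as handle:
    written = SeqIO.write(records, handle, "fastq")

# Read the compressed file back to confirm the round trip.
with gzip.open(gz_path, "rt") as handle:
    ids = [record.id for record in SeqIO.parse(handle, "fastq")]

print(written, ids)  # 2 ['read1', 'read2']
```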

## Conclusion

FASTQ files are central to sequencing analysis because they combine sequence data with per-base quality information. With Biopython, you can read them, inspect their contents, calculate useful statistics, filter reads, and write the results back to disk without needing to parse the format manually.

You now know how to:

* read FASTQ files with `SeqIO.parse()`
* access sequences and PHRED quality scores
* calculate average read quality
* count reads and summarize a dataset
* filter reads by length or quality
* write new FASTQ and FASTA files
* handle compressed FASTQ files