If you work with sequencing data, FASTQ files are everywhere. Unlike FASTA files, FASTQ files store both the nucleotide sequence and a quality score for every base. That makes them essential in real bioinformatics workflows, because quality scores help you decide whether reads are reliable enough for downstream analysis.

In this tutorial, you will learn how to use **Biopython** to work with FASTQ files in Python. You will see how to:

* download a sample FASTQ file
* read sequencing reads with `SeqIO`
* inspect sequences and quality scores
* compute simple statistics
* filter reads by length or quality
* write filtered reads back to a new FASTQ file

Biopython makes this much easier than parsing FASTQ text by hand, and the same ideas you learn here apply to larger sequencing projects.

## Downloading an Example FASTQ File

First, let's download a small example FASTQ file that we can use in the rest of the tutorial.

```python
import requests

url = "https://raw.githubusercontent.com/biopython/biopython/master/Tests/Quality/example.fastq"

# Fetch the file; raise an exception on an HTTP error status.
response = requests.get(url)
response.raise_for_status()

# Save the raw bytes to the current folder.
with open("example.fastq", "wb") as f:
    f.write(response.content)

print("Downloaded example.fastq")
```

```
Downloaded example.fastq
```

This code uses `requests` to fetch a FASTQ file from the Biopython GitHub repository and saves it as `example.fastq` in the current folder. The call to `raise_for_status()` raises an exception if the server returned an HTTP error status, so a failed download is never written to disk.

## Reading a FASTQ File

The most common way to read FASTQ files in Biopython is with `Bio.SeqIO.parse()`.

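
`SeqIO.parse()` accepts an open text handle as well as a filename, which makes it easy to try on a tiny in-memory example first. The sketch below is only an illustration: the record in `fastq_text` is hypothetical and not taken from the example file.

```python
from io import StringIO

from Bio import SeqIO

# Hypothetical single-read FASTQ text, made up for illustration.
fastq_text = "@read1 demo\nACGT\n+\nIIII\n"

# Wrap the text in a file-like handle and parse it as FASTQ.
records = list(SeqIO.parse(StringIO(fastq_text), "fastq"))

first = records[0]
print(first.id)           # title line up to the first whitespace
print(first.description)  # the full title line
print(first.seq)
```

Converting the iterator to a list is fine for a one-record demo. For real files, looping over the iterator directly keeps memory use low.
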
```python
import requests
from Bio import SeqIO

url = "https://raw.githubusercontent.com/biopython/biopython/master/Tests/Quality/example.fastq"
response = requests.get(url)
response.raise_for_status()
with open("example.fastq", "wb") as f:
    f.write(response.content)

# Iterate over the reads and print each identifier.
for record in SeqIO.parse("example.fastq", "fastq"):
    print(record.id)
```

```
EAS54_6_R1_2_1_413_324
EAS54_6_R1_2_1_540_792
EAS54_6_R1_2_1_443_348
```

`SeqIO.parse()` reads the file one record at a time. Each `record` is a `SeqRecord` object containing the read ID, the sequence, and the per-base quality scores. Because `SeqIO.parse()` returns an iterator, only one record needs to be in memory at a time, which matters when FASTQ files are large.

## Accessing the Sequence and Quality Scores

A FASTQ record contains more than just the sequence. You can also access the quality values stored for each base.

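
In the file itself, each quality value is stored as a single ASCII character on the fourth line of a record. As a rough sketch of what that encoding looks like, assuming the Sanger-style offset of 33 that Biopython's `"fastq"` format name expects, the characters can be decoded by hand. The `quality_line` string here is hypothetical.

```python
# Quality line of a hypothetical read, made up for illustration.
quality_line = "II5+!"

# Assumption: Sanger-style encoding, where score = ASCII code - 33.
PHRED_OFFSET = 33

scores = [ord(char) - PHRED_OFFSET for char in quality_line]
print(scores)  # [40, 40, 20, 10, 0]
```

Biopython performs this decoding for you, so in practice you work with the integer scores directly.
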
```python
import requests
from Bio import SeqIO

url = "https://raw.githubusercontent.com/biopython/biopython/master/Tests/Quality/example.fastq"
response = requests.get(url)
response.raise_for_status()
with open("example.fastq", "wb") as f:
    f.write(response.content)

for record in SeqIO.parse("example.fastq", "fastq"):
    print("ID:", record.id)
    print("Sequence:", record.seq)
    print("Length:", len(record.seq))
    print("Quality scores:", record.letter_annotations["phred_quality"])
    break  # Only show the first read.
```

```
ID: EAS54_6_R1_2_1_413_324
Sequence: CCCTTCTTGTCTTCAGCGTTTCTCC
Length: 25
Quality scores: [26, 26, 18, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 22, 26, 26, 26, 26, 26, 26, 26, 23, 23]
```

The sequence is available through `record.seq`. The quality scores are stored in `record.letter_annotations["phred_quality"]` as a list of integers. Each integer is the PHRED quality score for the base at the same position in the sequence. The `break` stops the loop after the first record, so only one read is printed.

## Looking at the Average Quality of a Read

A useful first step in read quality control is computing the average quality score for each read.

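
When interpreting an average, it helps to remember that PHRED scores are logarithmic: a score Q corresponds to an error probability of 10^(-Q/10). The sketch below converts the scores of the first read shown earlier into error probabilities and compares the plain arithmetic mean with the quality implied by the mean error probability. The helper name `error_probability` is made up for this illustration.

```python
import math

# PHRED scores of the first read in example.fastq, as printed earlier.
qualities = [26, 26, 18, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26,
             26, 26, 22, 26, 26, 26, 26, 26, 26, 26, 23, 23]


def error_probability(q):
    """Convert a PHRED score Q into an error probability: 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


# Plain arithmetic mean of the scores.
arithmetic_mean = sum(qualities) / len(qualities)

# Mean error probability, converted back to the PHRED scale.
mean_error = sum(error_probability(q) for q in qualities) / len(qualities)
effective_quality = -10 * math.log10(mean_error)

print(round(arithmetic_mean, 2))    # 25.28
print(round(effective_quality, 2))  # lower than the arithmetic mean
```

The arithmetic mean is simple and widely used, but low-scoring bases pull the error-based figure down more strongly, so the two numbers can differ noticeably for reads with uneven quality.
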
```python
import requests
from Bio import SeqIO

url = "https://raw.githubusercontent.com/biopython/biopython/master/Tests/Quality/example.fastq"
response = requests.get(url)
response.raise_for_status()
with open("example.fastq", "wb") as f:
    f.write(response.content)

for record in SeqIO.parse("example.fastq", "fastq"):
    qualities = record.letter_annotations["phred_quality"]
    # Arithmetic mean of the per-base PHRED scores.
    average_quality = sum(qualities) / len(qualities)
    print(record.id, "average quality =", round(average_quality, 2))
```

```
EAS54_6_R1_2_1_413_324 average quality = 25.28
EAS54_6_R1_2_1_540_792 average quality = 24.52
EAS54_6_R1_2_1_443_348 average quality = 23.4
```

This code loops through every read, extracts its PHRED scores, and calculates the mean. Reads with low average quality may need to be removed or trimmed before downstream analysis.

## Counting Reads in a FASTQ File

Another common task is finding out how many reads are in a file.

```python
import requests
from Bio import SeqIO

url = "https://raw.githubusercontent.com/biopython/biopython/master/Tests/Quality/example.fastq"
response = requests.get(url)
response.raise_for_status()
with open("example.fastq", "wb") as f:
    f.write(response.content)

# Count records without keeping them in memory.
count = 0
for record in SeqIO.parse("example.fastq", "fastq"):
    count += 1

print("Number of reads:", count)
```

```
Number of reads: 3
```

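
The same count can be written more compactly with a generator expression. This is a minimal sketch on a hypothetical two-read FASTQ string, so it runs without the downloaded file.

```python
from io import StringIO

from Bio import SeqIO

# Hypothetical two-read FASTQ text, made up for illustration.
fastq_text = (
    "@read1\nACGT\n+\nIIII\n"
    "@read2\nGGCC\n+\nIIII\n"
)

# sum() consumes the iterator one record at a time, so no list is built.
count = sum(1 for _ in SeqIO.parse(StringIO(fastq_text), "fastq"))
print("Number of reads:", count)  # Number of reads: 2
```
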
This code counts how many `SeqRecord` objects are produced by `SeqIO.parse()`. For small files this is simple and clear, and the same pattern works for larger files too.

## Calculating Basic FASTQ Statistics

You often want summary information such as the number of reads, total bases, and average read length.

```python
import requests
from Bio import SeqIO

url = "https://raw.githubusercontent.com/biopython/biopython/master/Tests/Quality/example.fastq"
response = requests.get(url)
response.raise_for_status()
with open("example.fastq", "wb") as f:
    f.write(response.content)

read_count = 0
total_bases = 0
for record in SeqIO.parse("example.fastq", "fastq"):
    read_count += 1
    total_bases += len(record.seq)

# Guard against division by zero for an empty file.
average_length = total_bases / read_count if read_count > 0 else 0

print("Reads:", read_count)
print("Total bases:", total_bases)
print("Average read length:", round(average_length, 2))
```

```
Reads: 3
Total bases: 75
Average read length: 25.0
```

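
The running-totals idea extends naturally to other summary values. The sketch below, using a hypothetical helper named `summarize` and made-up reads, also tracks the shortest and longest read and the mean quality per base in a single pass.

```python
from io import StringIO

from Bio import SeqIO

# Hypothetical two-read FASTQ text: six bases at Q40, four bases at Q20.
fastq_text = (
    "@read1\nACGTAC\n+\nIIIIII\n"
    "@read2\nGGCC\n+\n5555\n"
)


def summarize(handle):
    """Collect simple statistics in one pass over a FASTQ handle."""
    read_count = 0
    total_bases = 0
    total_quality = 0
    shortest = None
    longest = None
    for record in SeqIO.parse(handle, "fastq"):
        length = len(record.seq)
        read_count += 1
        total_bases += length
        total_quality += sum(record.letter_annotations["phred_quality"])
        shortest = length if shortest is None else min(shortest, length)
        longest = length if longest is None else max(longest, length)
    return {
        "reads": read_count,
        "total_bases": total_bases,
        "shortest": shortest,
        "longest": longest,
        # Mean over all bases; 0.0 when the input has no bases.
        "mean_base_quality": total_quality / total_bases if total_bases else 0.0,
    }


stats = summarize(StringIO(fastq_text))
print(stats)
```
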
This example keeps running totals as it reads the file. That lets you calculate summary statistics without storing all reads in memory at once.

## Filtering Reads by Length

Sometimes you want to keep only reads that meet a minimum length.

```python
import requests
from Bio import SeqIO

url = "https://raw.githubusercontent.com/biopython/biopython/master/Tests/Quality/example.fastq"
response = requests.get(url)
response.raise_for_status()
with open("example.fastq", "wb") as f:
    f.write(response.content)

min_length = 25
filtered_reads = []
for record in SeqIO.parse("example.fastq", "fastq"):
    # Keep reads that are at least min_length bases long.
    if len(record.seq) >= min_length:
        filtered_reads.append(record)

print("Reads kept:", len(filtered_reads))
```

```
Reads kept: 3
```

This code checks the length of each sequence and stores only the reads that meet the minimum length requirement. All three example reads are 25 bases long, so all of them are kept. In real datasets, this is often part of an initial cleanup step.

## Filtering Reads by Average Quality

You can also filter reads based on their average quality score.

```python
import requests
from Bio import SeqIO

url = "https://raw.githubusercontent.com/biopython/biopython/master/Tests/Quality/example.fastq"
response = requests.get(url)
response.raise_for_status()
with open("example.fastq", "wb") as f:
    f.write(response.content)

min_avg_quality = 30
high_quality_reads = []
for record in SeqIO.parse("example.fastq", "fastq"):
    qualities = record.letter_annotations["phred_quality"]
    average_quality = sum(qualities) / len(qualities)
    # Keep reads whose mean PHRED score meets the threshold.
    if average_quality >= min_avg_quality:
        high_quality_reads.append(record)

print("High-quality reads kept:", len(high_quality_reads))
```

```
High-quality reads kept: 0
```

Here, the code calculates the average PHRED score for each read and keeps only reads at or above a chosen threshold. None of the three example reads reaches an average of 30, so nothing is kept. This is a simple quality-control strategy for identifying more reliable reads.

## Writing Filtered Reads to a New FASTQ File

After filtering, you can save the remaining reads as a new FASTQ file.

```python
import requests
from Bio import SeqIO

url = "https://raw.githubusercontent.com/biopython/biopython/master/Tests/Quality/example.fastq"
response = requests.get(url)
response.raise_for_status()
with open("example.fastq", "wb") as f:
    f.write(response.content)

filtered_reads = []
for record in SeqIO.parse("example.fastq", "fastq"):
    qualities = record.letter_annotations["phred_quality"]
    average_quality = sum(qualities) / len(qualities)
    if average_quality >= 30:
        filtered_reads.append(record)

# SeqIO.write() returns the number of records written.
written = SeqIO.write(filtered_reads, "filtered.fastq", "fastq")
print("Reads written:", written)
```

```
Reads written: 0
```

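
For large files, collecting the kept reads in a list can use a lot of memory. One alternative, sketched here with made-up reads and a hypothetical helper `mean_quality`, is to pass a generator to `SeqIO.write()` so that records are filtered and written one at a time.

```python
from io import StringIO

from Bio import SeqIO

# Hypothetical reads, made up for illustration.
fastq_text = (
    "@good\nACGT\n+\nIIII\n"  # every base at Q40
    "@poor\nACGT\n+\n++++\n"  # every base at Q10
)


def mean_quality(record):
    """Arithmetic mean of a record's PHRED scores (assumes a non-empty read)."""
    qualities = record.letter_annotations["phred_quality"]
    return sum(qualities) / len(qualities)


records = SeqIO.parse(StringIO(fastq_text), "fastq")

# Generator expression: nothing is parsed until SeqIO.write() asks for it.
kept = (record for record in records if mean_quality(record) >= 30)

# Write to an in-memory handle here; a filename works the same way.
output = StringIO()
written = SeqIO.write(kept, output, "fastq")
print("Reads written:", written)  # Reads written: 1
print(output.getvalue())
```
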
`SeqIO.write()` takes a list (or any other iterable) of `SeqRecord` objects, writes them in FASTQ format, and returns the number of records written. With the example file no read passes the filter, so `filtered.fastq` is empty. On real data, this is how you create a cleaned dataset for later analysis.

## Converting FASTQ Records to FASTA

Sometimes you want the sequences but not the quality scores. In that case, you can convert a FASTQ file to FASTA.

```python
import requests
from Bio import SeqIO

url = "https://raw.githubusercontent.com/biopython/biopython/master/Tests/Quality/example.fastq"
response = requests.get(url)
response.raise_for_status()
with open("example.fastq", "wb") as f:
    f.write(response.content)

# Stream records straight from the FASTQ parser into the FASTA writer.
records = SeqIO.parse("example.fastq", "fastq")
written = SeqIO.write(records, "example.fasta", "fasta")
print("FASTA records written:", written)
```

```
FASTA records written: 3
```

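
Biopython also provides `SeqIO.convert()`, which combines parsing and writing in one call. The sketch below runs it on hypothetical in-memory reads rather than the example file.

```python
from io import StringIO

from Bio import SeqIO

# Hypothetical two-read FASTQ text, made up for illustration.
fastq_text = "@read1\nACGT\n+\nIIII\n@read2\nGGCC\n+\nIIII\n"

# convert() takes input and output as handles or filenames
# and returns the number of records converted.
fasta_output = StringIO()
converted = SeqIO.convert(StringIO(fastq_text), "fastq", fasta_output, "fasta")

print("FASTA records written:", converted)  # FASTA records written: 2
print(fasta_output.getvalue())
```
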
This code reads FASTQ records and writes them out in FASTA format. The sequence IDs and sequences are preserved, but the quality scores are not included because FASTA does not support them.

## Working with Compressed FASTQ Files

Real sequencing data is often stored in `.fastq.gz` files to save space. Python's `gzip` module works well with Biopython for this.

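
The mode passed to `gzip.open()` matters, because FASTQ parsing works on text rather than bytes. As a small standard-library sketch, using a temporary directory and a hypothetical one-read file, the difference between text mode and the default binary mode looks like this:

```python
import gzip
import os
import tempfile

# Hypothetical single-read FASTQ text, made up for illustration.
fastq_text = "@read1\nACGT\n+\nIIII\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo.fastq.gz")

    # "wt" encodes and compresses text on the way out.
    with gzip.open(path, "wt") as handle:
        handle.write(fastq_text)

    # "rt" decompresses and decodes, yielding str lines.
    with gzip.open(path, "rt") as handle:
        text_lines = handle.readlines()

    # The default mode is binary ("rb"), which yields bytes instead.
    with gzip.open(path) as handle:
        first_raw_line = handle.readline()

print(repr(text_lines[0]))   # '@read1\n'
print(repr(first_raw_line))  # b'@read1\n'
```

A handle opened with `"rt"` can be passed to `SeqIO.parse()` just like a handle to an uncompressed file.
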
```python
import gzip

import requests
from Bio import SeqIO

url = "https://raw.githubusercontent.com/biopython/biopython/master/Tests/Quality/example.fastq"
response = requests.get(url)
response.raise_for_status()

# Write the downloaded bytes into a gzip-compressed file.
with gzip.open("example.fastq.gz", "wb") as f:
    f.write(response.content)

# Open in text mode ("rt") so the FASTQ parser receives str, not bytes.
with gzip.open("example.fastq.gz", "rt") as handle:
    for record in SeqIO.parse(handle, "fastq"):
        print(record.id, len(record.seq))
```

```
EAS54_6_R1_2_1_413_324 25
EAS54_6_R1_2_1_540_792 25
EAS54_6_R1_2_1_443_348 25
```

This example downloads a FASTQ file, saves it in gzipped form, and then reads it back using `gzip.open()` in text mode. This is a common pattern for handling compressed sequencing data.

## Conclusion

FASTQ files are central to sequencing analysis because they combine sequence data with per-base quality information. With Biopython, you can read them, inspect their contents, calculate useful statistics, filter reads, and write the results back to disk without needing to parse the format manually.

You now know how to:

* read FASTQ files with `SeqIO.parse()`
* access sequences and PHRED quality scores
* calculate average read quality
* count reads and summarize a dataset
* filter reads by length or quality
* write new FASTQ and FASTA files
* handle compressed FASTQ files