**GenBank files** are one of the most information-rich formats used in bioinformatics. Unlike FASTA files, which typically store only sequence data, GenBank files include extensive biological annotations such as:

* gene locations
* coding sequences (CDS)
* regulatory regions
* protein translations
* organism metadata
* references and publication information

Because of this, GenBank files are widely used in genome annotation, plasmid analysis, and biological databases.

In this tutorial, you will learn how to use **Biopython** to work with GenBank files. Specifically, you will learn how to:

* download a GenBank file
* read sequence data
* explore metadata and annotations
* inspect genomic features
* extract genes and coding sequences
* convert GenBank files to other formats

The key tool we will use is the `Bio.SeqIO` module.

---

## Downloading an Example GenBank File

First, let's download a GenBank file to work with.
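```python
import requests

# Example GenBank file from the Biopython test suite
url = "https://raw.githubusercontent.com/biopython/biopython/master/Tests/GenBank/NC_005816.gb"

response = requests.get(url, timeout=30)
response.raise_for_status()  # stop here if the download failed

# Save the raw bytes to a local file
with open("plasmid.gbk", "wb") as f:
    f.write(response.content)

print("Downloaded plasmid.gbk")
```

```
Downloaded plasmid.gbk
```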
This code downloads an example GenBank file from the Biopython repository and saves it locally as `plasmid.gbk`. The file contains the annotated DNA sequence of a *Yersinia pestis* plasmid, which we will analyze throughout this tutorial.

---

## Reading a GenBank File

GenBank files are read using `SeqIO.parse()` with the `"genbank"` format.
```python
from Bio import SeqIO

for record in SeqIO.parse("plasmid.gbk", "genbank"):
    print(record.id)
    print("Sequence length:", len(record.seq))
```

```
NC_005816.1
Sequence length: 9609
```
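When a file is expected to hold exactly one record, `SeqIO.read()` can stand in for the loop. The sketch below contrasts the two functions on toy GenBank text built in memory, so it does not depend on `plasmid.gbk`. The `toy_record` helper, the `TOY1`/`TOY2` identifiers and the sequences are made up for illustration.

```python
from io import StringIO

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord


def toy_record(identifier, bases):
    """Build a made-up DNA record that the GenBank writer will accept."""
    return SeqRecord(
        Seq(bases),
        id=identifier,
        name=identifier,
        description="toy record",
        annotations={"molecule_type": "DNA"},
    )


# Serialise one record, and then two records, as GenBank text in memory.
one_record = StringIO()
SeqIO.write([toy_record("TOY1", "ATGCATGC")], one_record, "genbank")

two_records = StringIO()
SeqIO.write(
    [toy_record("TOY1", "ATGCATGC"), toy_record("TOY2", "GGGCCC")],
    two_records,
    "genbank",
)

# read() returns the single record directly, with no loop needed.
single = SeqIO.read(StringIO(one_record.getvalue()), "genbank")
single_length = len(single.seq)

# parse() yields every record, however many the input holds.
lengths = [
    len(rec.seq)
    for rec in SeqIO.parse(StringIO(two_records.getvalue()), "genbank")
]

# read() raises ValueError when the input holds more than one record.
try:
    SeqIO.read(StringIO(two_records.getvalue()), "genbank")
    read_failed = False
except ValueError:
    read_failed = True

print(single_length, lengths, read_failed)
```

The same calls accept a filename in place of the `StringIO` handle, as in the example above.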
`SeqIO.parse()` reads each sequence record in the GenBank file. Each record is stored as a **SeqRecord** object containing the sequence and associated annotations.

Many GenBank files contain multiple sequence records, which is why we use `parse()` rather than `read()`, which accepts only input holding exactly one record.

---

## Accessing Sequence Information

Each GenBank record contains several useful attributes.
```python
from Bio import SeqIO

for record in SeqIO.parse("plasmid.gbk", "genbank"):
    print("ID:", record.id)
    print("Name:", record.name)
    print("Description:", record.description)
    print("Sequence length:", len(record.seq))
    print()
```

```
ID: NC_005816.1
Name: NC_005816
Description: Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete sequence
Sequence length: 9609
```
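`record.seq` is a Biopython `Seq` object rather than a plain string, so it supports slicing along with sequence-specific methods. A minimal sketch on a short made-up coding sequence (not taken from the plasmid):

```python
from Bio.Seq import Seq

# A short made-up coding sequence used only for illustration.
dna = Seq("ATGGCCATTGTAATGGGCCGCTGA")

first_codon = dna[:3]                      # slicing returns another Seq
reverse_strand = dna.reverse_complement()  # opposite strand, read 5' to 3'
protein = dna.translate(to_stop=True)      # the stop codon is not included
gc_count = dna.count("G") + dna.count("C")

print(first_codon)     # ATG
print(reverse_strand)  # TCAGCGGCCCATTACAATGGCCAT
print(protein)         # MAIVMGR
print(gc_count)        # 13
```

Calling `str()` on a `Seq` gives an ordinary Python string when one is needed.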
Important fields include:

- **`record.id`**: accession identifier, including the version suffix
- **`record.name`**: short sequence name
- **`record.description`**: full description from the file
- **`record.seq`**: the DNA sequence

The sequence itself is stored as a **Seq object**.

---

## Exploring GenBank Annotations

GenBank files store rich metadata in the `annotations` dictionary.
```python
from Bio import SeqIO

record = next(SeqIO.parse("plasmid.gbk", "genbank"))

for key in record.annotations:
    print(key)
```

```
molecule_type
topology
data_file_division
date
accessions
sequence_version
gi
keywords
source
organism
taxonomy
references
comment
```
Common annotation fields include:

- organism
- taxonomy
- references
- source
- date
- sequence version

These provide biological context about the sequence.

---

## Accessing the Organism Name

You can easily extract organism information from the annotations.
```python
from Bio import SeqIO

record = next(SeqIO.parse("plasmid.gbk", "genbank"))

print("Organism:", record.annotations["organism"])
print("Taxonomy:", record.annotations["taxonomy"])
```

```
Organism: Yersinia pestis biovar Microtus str. 91001
Taxonomy: ['Bacteria', 'Proteobacteria', 'Gammaproteobacteria', 'Enterobacteriales', 'Enterobacteriaceae', 'Yersinia']
```
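Not every record carries every annotation key, and indexing a missing key raises `KeyError`. Since `record.annotations` is an ordinary dictionary, `.get()` with a default is one way to read optional fields. The record below is a made-up stand-in with an organism but no taxonomy or references.

```python
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Made-up record: it has an organism but no taxonomy or references.
record = SeqRecord(
    Seq("ATGC"),
    id="TOY1",
    annotations={"molecule_type": "DNA", "organism": "Example organism"},
)

organism = record.annotations.get("organism", "unknown organism")
taxonomy = record.annotations.get("taxonomy", [])  # a list of ranks when present
lineage = "; ".join(taxonomy) if taxonomy else "not recorded"
n_references = len(record.annotations.get("references", []))

print("Organism:", organism)
print("Lineage:", lineage)
print("References:", n_references)
```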
This information comes directly from the GenBank metadata and is useful when analyzing genomic datasets.

---

## Working with Sequence Features

One of the most powerful aspects of GenBank files is their **feature annotations**.

Features describe biological regions such as:

- genes
- coding sequences (CDS)
- promoters
- exons
- regulatory elements

You can access these through `record.features`.
```python
from Bio import SeqIO

record = next(SeqIO.parse("plasmid.gbk", "genbank"))

print("Number of features:", len(record.features))

for feature in record.features[:5]:
    print(feature.type)
```

```
Number of features: 41
source
repeat_region
gene
CDS
misc_feature
```
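To see which feature types a record contains, and how many of each, the types can be tallied with `collections.Counter`. This sketch uses a made-up record with four hand-built features. `FeatureLocation` is the long-standing class name, and newer Biopython releases also expose it as `SimpleLocation`.

```python
from collections import Counter

from Bio.Seq import Seq
from Bio.SeqFeature import FeatureLocation, SeqFeature
from Bio.SeqRecord import SeqRecord

# Made-up record with four hand-built features.
record = SeqRecord(
    Seq("ATGGCCATTGTAATGGGCCGCTGA"),
    id="TOY1",
    features=[
        SeqFeature(FeatureLocation(0, 24), type="source"),
        SeqFeature(FeatureLocation(0, 24, strand=1), type="gene"),
        SeqFeature(FeatureLocation(0, 24, strand=1), type="CDS"),
        SeqFeature(FeatureLocation(12, 24, strand=1), type="gene"),
    ],
)

# Tally how often each feature type occurs.
type_counts = Counter(feature.type for feature in record.features)

for feature_type, count in type_counts.most_common():
    print(feature_type, count)
```

The same `Counter` expression works unchanged on a record parsed from a GenBank file.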
Each feature is a **SeqFeature object** containing the type of feature and its location on the sequence.

---

## Extracting Gene Features

We can filter features to find specific types, such as genes.
```python
from Bio import SeqIO

record = next(SeqIO.parse("plasmid.gbk", "genbank"))

for feature in record.features:
    if feature.type == "gene":
        print(feature.location)
        print(feature.qualifiers.get("gene"))
```

```
[86:1109](+)
None
[1105:1888](+)
None
[2924:3119](+)
['rop']
[3485:3857](+)
None
[4342:4780](+)
['pim']
[4814:5888](-)
['pst']
[6004:6421](+)
None
[6663:7602](+)
['pla']
[7788:8088](-)
None
[8087:8360](-)
None
```
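Printed locations such as `[86:1109](+)` use Python-style coordinates: the start is 0-based and the end is exclusive, whereas the GenBank file itself uses 1-based, inclusive positions. A small sketch that rebuilds the first location by hand to show the conversion (the variable names are illustrative):

```python
from Bio.SeqFeature import FeatureLocation

# Same coordinates as the first gene in the output above: [86:1109](+)
location = FeatureLocation(86, 1109, strand=1)

start = int(location.start)  # 0-based, inclusive
end = int(location.end)      # exclusive, like a Python slice
length = len(location)       # end - start
strand = location.strand     # 1 for the forward strand, -1 for the reverse

# Convert to the 1-based, inclusive notation used in the GenBank file.
genbank_style = f"{start + 1}..{end}"

print(location)       # [86:1109](+)
print(length)         # 1023
print(genbank_style)  # 87..1109
```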
Feature objects contain:

- **`feature.type`**: feature category (gene, CDS, etc.)
- **`feature.location`**: coordinates on the sequence
- **`feature.qualifiers`**: additional metadata such as gene name or product

Qualifier values are stored as lists, and `feature.qualifiers.get("gene")` returns `None` for features that have no `gene` qualifier.

---

## Extracting Coding Sequences (CDS)

Coding sequences represent protein-coding regions of DNA.
```python
from Bio import SeqIO

record = next(SeqIO.parse("plasmid.gbk", "genbank"))

for feature in record.features:
    if feature.type == "CDS":
        gene = feature.qualifiers.get("gene", ["unknown"])[0]
        protein = feature.qualifiers.get("product", ["unknown protein"])[0]
        print("Gene:", gene)
        print("Protein:", protein)
        print()
```

```
Gene: unknown
Protein: putative transposase

Gene: unknown
Protein: transposase/IS protein

Gene: rop
Protein: putative replication regulatory protein

Gene: unknown
Protein: hypothetical protein

Gene: pim
Protein: pesticin immunity protein

Gene: pst
Protein: pesticin

Gene: unknown
Protein: hypothetical protein

Gene: pla
Protein: outer membrane protease

Gene: unknown
Protein: putative transcriptional regulator

Gene: unknown
Protein: hypothetical protein
```
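CDS features frequently carry a `translation` qualifier holding the protein sequence. One way to sanity-check an annotation is to translate the CDS yourself and compare the result with the stored value. The sketch below does this on a made-up record. The sequence, the product name and the use of translation table 11 (the bacterial table) are assumptions chosen for illustration.

```python
from Bio.Seq import Seq
from Bio.SeqFeature import FeatureLocation, SeqFeature
from Bio.SeqRecord import SeqRecord

# Made-up record: two flanking bases on each side of a 24-base CDS.
record = SeqRecord(Seq("CCATGGCCATTGTAATGGGCCGCTGATT"), id="TOY1")
cds = SeqFeature(
    FeatureLocation(2, 26, strand=1),
    type="CDS",
    qualifiers={
        "product": ["toy protein"],
        "transl_table": ["11"],
        "translation": ["MAIVMGR"],
    },
)
record.features.append(cds)

# Qualifier values are lists of strings, so take the first item.
table = int(cds.qualifiers.get("transl_table", ["1"])[0])
stored = cds.qualifiers["translation"][0]

# cds=True checks for a start codon, a length divisible by three and a
# single terminal stop codon, and leaves the stop out of the result.
computed = cds.extract(record.seq).translate(table=table, cds=True)
matches = str(computed) == stored

print(computed, matches)
```

With `cds=True`, `translate()` raises a translation error for sequences that are not complete coding sequences, so real-world use may need a `try`/`except` around it.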
CDS features often contain important qualifiers such as:

- gene name
- protein product
- translation (amino acid sequence)

These are useful for functional genomics analysis.

---

## Extracting the DNA Sequence of a Feature

You can also extract the exact DNA sequence corresponding to a feature.
```python
from Bio import SeqIO

record = next(SeqIO.parse("plasmid.gbk", "genbank"))

for feature in record.features:
    if feature.type == "CDS":
        sequence = feature.extract(record.seq)
        print("CDS:", sequence)
        break  # show only the first CDS
```

```
CDS: ATGGTCACTTTTGAGACAGTTATGGAAATTAAAATCCTGCACAAGCAGGGAATGAGTAGCCGGGCGATTGCCAGAGAACTGGGGATCTCCCGCAATACCGTTAAACGTTATTTGCAGGCAAAATCTGAGCCGCCAAAATATACGCCGCGACCTGCTGTTGCTTCACTCCTGGATGAATACCGGGATTATATTCGTCAACGCATCGCCGATGCTCATCCTTACAAAATCCCGGCAACGGTAATCGCTCGCGAGATCAGAGACCAGGGATATCGTGGCGGAATGACCATTCTCAGGGCATTCATTCGTTCTCTCTCGGTTCCTCAGGAGCAGGAGCCTGCCGTTCGGTTCGAAACTGAACCCGGACGACAGATGCAGGTTGACTGGGGCACTATGCGTAATGGTCGCTCACCGCTTCACGTGTTCGTTGCTGTTCTCGGATACAGCCGAATGCTGTACATCGAATTCACTGACAATATGCGTTATGACACGCTGGAGACCTGCCATCGTAATGCGTTCCGCTTCTTTGGTGGTGTGCCGCGCGAAGTGTTGTATGACAATATGAAAACTGTGGTTCTGCAACGTGACGCATATCAGACCGGTCAGCACCGGTTCCATCCTTCGCTGTGGCAGTTCGGCAAGGAGATGGGCTTCTCTCCCCGACTGTGTCGCCCCTTCAGGGCACAGACTAAAGGTAAGGTGGAACGGATGGTGCAGTACACCCGTAACAGTTTTTACATCCCACTAATGACTCGCCTGCGCCCGATGGGGATCACTGTCGATGTTGAAACAGCCAACCGCCACGGTCTGCGCTGGCTGCACGATGTCGCTAACCAACGAAAGCATGAAACAATCCAGGCCCGTCCCTGCGATCGCTGGCTCGAAGAGCAGCAGTCCATGCTGGCACTGCCTCCGGAGAAAAAAGAGTATGACGTGCATCTTGATGAAAATCTGGTGAACTTCGACAAACACCCCCTGCATCATCCACTCTCCATCTACGACTCATTCTGCAGAGGAGTGGCGTGA
```
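For features on the reverse strand, such as the `(-)` locations in the gene listing earlier, `extract()` does more than slice: it also reverse-complements the result so the sequence reads 5' to 3' along the coding strand. A minimal sketch on a made-up 13-base sequence:

```python
from Bio.Seq import Seq
from Bio.SeqFeature import FeatureLocation, SeqFeature

# Made-up top strand; a tiny gene sits on the opposite strand at [2:11].
top_strand = Seq("GGTCAGGCCATCC")
minus_feature = SeqFeature(FeatureLocation(2, 11, strand=-1), type="CDS")

raw_slice = top_strand[2:11]                   # plain slice of the top strand
extracted = minus_feature.extract(top_strand)  # strand-aware extraction

print(raw_slice)  # TCAGGCCAT
print(extracted)  # ATGGCCTGA
```

A plain slice of a minus-strand feature would therefore give the wrong strand for translation, which is why `extract()` is the safer choice.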
The `extract()` method retrieves the subsequence defined by the feature location, reverse-complementing it when the feature lies on the reverse strand.

---

## Converting GenBank to FASTA

Sometimes you want only the raw DNA sequence.
```python
from Bio import SeqIO

record = next(SeqIO.parse("plasmid.gbk", "genbank"))

# Write the record in FASTA format; the return value is the number written
SeqIO.write([record], "plasmid.fasta", "fasta")
```

```
1
```
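To convert every record in a file rather than just the first, `SeqIO.convert()` reads and writes in one call and returns the number of records converted. The sketch below runs it on toy GenBank text held in memory. The two records and their identifiers are made up for illustration.

```python
from io import StringIO

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Two made-up records; molecule_type is set so they can be written as GenBank.
toy_records = [
    SeqRecord(Seq("ATGCATGC"), id="TOY1", name="TOY1",
              description="first toy record",
              annotations={"molecule_type": "DNA"}),
    SeqRecord(Seq("GGGCCC"), id="TOY2", name="TOY2",
              description="second toy record",
              annotations={"molecule_type": "DNA"}),
]

# Build a two-record GenBank "file" in memory.
genbank_text = StringIO()
SeqIO.write(toy_records, genbank_text, "genbank")
genbank_text.seek(0)

# Convert every record from GenBank to FASTA in a single call.
fasta_text = StringIO()
n_converted = SeqIO.convert(genbank_text, "genbank", fasta_text, "fasta")

# Each FASTA record starts with a ">" header line.
header_count = fasta_text.getvalue().count(">")

print(n_converted, header_count)
```

With files on disk the handles can be replaced by filenames, for example `SeqIO.convert("plasmid.gbk", "genbank", "all_records.fasta", "fasta")`, where `all_records.fasta` is a hypothetical output name.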
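This writes the record to a FASTA file, keeping the sequence and header but dropping the feature annotations. `SeqIO.write()` returns the number of records written, which is the `1` shown above. Because only the first record is read with `next()`, a multi-record GenBank file would need all of its records passed to `SeqIO.write()` instead.

---

## Writing a GenBank File

You can also write modified sequence records back to a GenBank file.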
```python
from Bio import SeqIO

record = next(SeqIO.parse("plasmid.gbk", "genbank"))

records = [record]

SeqIO.write(records, "copy.gbk", "genbank")
```

```
1
```
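The example above writes the record back unchanged. As a sketch of the "modified" case, the code below adds a `note` qualifier to each CDS of a made-up record, writes it as GenBank text in memory, and reads it back to confirm the change survived. The record contents and the note text are illustrative. The `molecule_type` annotation is set explicitly because recent Biopython releases expect it when writing GenBank output for records built from scratch.

```python
from io import StringIO

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqFeature import FeatureLocation, SeqFeature
from Bio.SeqRecord import SeqRecord

# Made-up record with a single CDS feature.
record = SeqRecord(
    Seq("CCATGGCCATTGTAATGGGCCGCTGATT"),
    id="TOY1",
    name="TOY1",
    description="toy record with one CDS",
    annotations={"molecule_type": "DNA"},  # expected by the GenBank writer
)
record.features.append(
    SeqFeature(
        FeatureLocation(2, 26, strand=1),
        type="CDS",
        qualifiers={"product": ["toy protein"]},
    )
)

# Modify the record: attach a free-text note to every CDS.
for feature in record.features:
    if feature.type == "CDS":
        feature.qualifiers.setdefault("note", []).append("reviewed by hand")

# Write to an in-memory handle instead of a file on disk.
handle = StringIO()
n_written = SeqIO.write([record], handle, "genbank")

# Read the text back and locate the CDS again.
reloaded = SeqIO.read(StringIO(handle.getvalue()), "genbank")
reloaded_cds = [f for f in reloaded.features if f.type == "CDS"][0]

print(n_written, reloaded_cds.qualifiers["note"])
```

Records parsed from an existing GenBank file already carry `molecule_type`, so only the modification step is needed for them.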
This example reads the first record from a GenBank file and writes it into a new file.

---

## Conclusion

GenBank files store much more than sequence data: they include rich biological annotations that describe genes, coding regions, regulatory elements, and organism metadata.

Using **Biopython**, you can easily access and analyze this information in Python.

In this tutorial, you learned how to:

- read GenBank files with `SeqIO`
- explore sequence metadata and annotations
- access genomic features
- extract genes and coding sequences
- retrieve subsequences from features
- convert GenBank files to other formats

These capabilities are essential when working with **genome annotations, plasmid maps, and public biological databases**.