# Working with GenBank Files in Biopython

**GenBank files** are one of the most information-rich formats used in bioinformatics. Unlike FASTA files, which typically store only sequence data, GenBank files include extensive biological annotations such as:

* gene locations
* coding sequences (CDS)
* regulatory regions
* protein translations
* organism metadata
* references and publication information

Because of this, GenBank files are widely used in genome annotation, plasmid analysis, and biological databases.

In this tutorial, you will learn how to use **Biopython** to work with GenBank files. Specifically, you will learn how to:

* download a GenBank file
* read sequence data
* explore metadata and annotations
* inspect genomic features
* extract genes and coding sequences
* convert GenBank files to other formats

The key tool we will use is the `Bio.SeqIO` module.

---

## Downloading an Example GenBank File

First, let's download a GenBank file to work with.

```python
import requests

# Example GenBank file from the Biopython test suite
url = "https://raw.githubusercontent.com/biopython/biopython/master/Tests/GenBank/NC_005816.gb"
response = requests.get(url)
response.raise_for_status()  # stop here if the download failed

# Save the raw bytes to a local file
with open("plasmid.gbk", "wb") as f:
    f.write(response.content)

print("Downloaded plasmid.gbk")
```

```text
Downloaded plasmid.gbk
```

This code downloads an example GenBank file from the Biopython repository and saves it locally as `plasmid.gbk`. The file contains a single annotated plasmid sequence (pPCP1 from *Yersinia pestis*) that we will analyze throughout this tutorial.

---

## Reading a GenBank File

GenBank files are read using `SeqIO.parse()` with the `"genbank"` format.

```python
from Bio import SeqIO

for record in SeqIO.parse("plasmid.gbk", "genbank"):
    print(record.id)
    print("Sequence length:", len(record.seq))
```

```text
NC_005816.1
Sequence length: 9609
```

`SeqIO.parse()` reads each sequence record in the GenBank file. Each record is stored as a **SeqRecord** object containing the sequence and associated annotations.

Many GenBank files contain multiple sequence records, which is why we use `parse()` rather than `read()`.
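
The difference between `parse()` and `read()` can be sketched with a small in-memory example. The two toy records and the `make_record` helper are invented for illustration and are not part of the plasmid file; the sketch assumes a recent Biopython release, where writing GenBank output requires a `molecule_type` annotation.

```python
from io import StringIO

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord


def make_record(identifier, bases):
    """Build a tiny toy record (hypothetical helper, for illustration only)."""
    return SeqRecord(
        Seq(bases),
        id=identifier,
        name=identifier,
        description="toy record",
        annotations={"molecule_type": "DNA"},
    )


# Write two toy records into one in-memory GenBank "file".
handle = StringIO()
SeqIO.write([make_record("toy1", "ATGC"), make_record("toy2", "GGCCTT")], handle, "genbank")
two_records = handle.getvalue()

# parse() iterates over however many records the file contains.
ids = [rec.id for rec in SeqIO.parse(StringIO(two_records), "genbank")]
print("Records found by parse():", len(ids))

# read() expects exactly one record and raises ValueError otherwise.
try:
    SeqIO.read(StringIO(two_records), "genbank")
    read_failed = False
except ValueError as err:
    read_failed = True
    print("read() failed:", err)
```

For a file that is known to hold exactly one record, `SeqIO.read()` returns the `SeqRecord` directly, which saves the `next(...)` call.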

---

## Accessing Sequence Information

Each GenBank record contains several useful attributes.

```python
from Bio import SeqIO

for record in SeqIO.parse("plasmid.gbk", "genbank"):
    print("ID:", record.id)
    print("Name:", record.name)
    print("Description:", record.description)
    print("Sequence length:", len(record.seq))
    print()
```

```text
ID: NC_005816.1
Name: NC_005816
Description: Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete sequence
Sequence length: 9609
```

Important fields include:

- **`record.id`** — accession identifier
- **`record.name`** — short sequence name
- **`record.description`** — full description from the file
- **`record.seq`** — the DNA sequence

The sequence itself is stored as a **Seq object**.
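
A `Seq` object behaves much like a Python string but adds sequence-aware methods. The short DNA string below is invented for illustration, so the sketch runs without the plasmid file.

```python
from Bio.Seq import Seq

# Toy sequence (not from the plasmid): a start codon, four codons, and a stop codon.
dna = Seq("ATGGTCACTTTTGAGTAA")

first_codon = dna[:3]                      # slicing returns another Seq
reverse_comp = dna.reverse_complement()    # opposite strand, read 5' to 3'
protein = dna.translate(to_stop=True)      # translate up to the first stop codon

print("Length:", len(dna))
print("First codon:", first_codon)
print("Reverse complement:", reverse_comp)
print("Protein:", protein)
```

The same methods work on `record.seq`, since it is also a `Seq` object.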

---

## Exploring GenBank Annotations

GenBank files store rich metadata in the `annotations` dictionary.

```python
from Bio import SeqIO

record = next(SeqIO.parse("plasmid.gbk", "genbank"))

for key in record.annotations:
    print(key)
```

```text
molecule_type
topology
data_file_division
date
accessions
sequence_version
gi
keywords
source
organism
taxonomy
references
comment
```

Common annotation fields include:

- organism
- taxonomy
- references
- source
- date
- sequence version

These provide biological context about the sequence.
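
Not every GenBank record carries every annotation key, so looking keys up with `dict.get()` and a default avoids a `KeyError`. The sketch below uses an invented toy record with made-up annotation values; only the key names match those listed above.

```python
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Toy record with a few annotations (values are placeholders, not real data).
toy = SeqRecord(
    Seq("ATGC"),
    id="toy1",
    annotations={
        "molecule_type": "DNA",
        "organism": "Example organism",
        "taxonomy": ["Bacteria", "Proteobacteria"],
    },
)

# annotations is a plain dictionary, so .get() with a default is safe.
organism = toy.annotations.get("organism", "unknown")
topology = toy.annotations.get("topology", "not recorded")
lineage = " > ".join(toy.annotations.get("taxonomy", []))

print("Organism:", organism)
print("Topology:", topology)
print("Lineage:", lineage)
```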

---

## Accessing the Organism Name

You can easily extract organism information from the annotations.

```python
from Bio import SeqIO

record = next(SeqIO.parse("plasmid.gbk", "genbank"))

print("Organism:", record.annotations["organism"])
print("Taxonomy:", record.annotations["taxonomy"])
```

```text
Organism: Yersinia pestis biovar Microtus str. 91001
Taxonomy: ['Bacteria', 'Proteobacteria', 'Gammaproteobacteria', 'Enterobacteriales', 'Enterobacteriaceae', 'Yersinia']
```

This information comes directly from the GenBank metadata and is useful when analyzing genomic datasets.

---

## Working with Sequence Features

One of the most powerful aspects of GenBank files is their **feature annotations**.

Features describe biological regions such as:

- genes
- coding sequences (CDS)
- promoters
- exons
- regulatory elements

You can access these through `record.features`.

```python
from Bio import SeqIO

record = next(SeqIO.parse("plasmid.gbk", "genbank"))

print("Number of features:", len(record.features))

for feature in record.features[:5]:
    print(feature.type)
```

```text
Number of features: 41
source
repeat_region
gene
CDS
misc_feature
```

Each feature is a **SeqFeature object** containing the type of feature and its location on the sequence.
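
A quick way to summarize a record is to count how many features of each type it holds. The sketch below builds an invented toy record so that it runs on its own; with a real file, the same `Counter` line can be applied to `record.features`. It uses `FeatureLocation`, which recent Biopython releases also expose under the name `SimpleLocation`.

```python
from collections import Counter

from Bio.Seq import Seq
from Bio.SeqFeature import SeqFeature, FeatureLocation
from Bio.SeqRecord import SeqRecord

# Toy record with a handful of made-up features.
toy = SeqRecord(Seq("ATGGTCACTTTTGAGTAA" * 3), id="toy1")
toy.features = [
    SeqFeature(FeatureLocation(0, 54), type="source"),
    SeqFeature(FeatureLocation(0, 18, strand=+1), type="gene"),
    SeqFeature(FeatureLocation(0, 18, strand=+1), type="CDS"),
    SeqFeature(FeatureLocation(18, 36, strand=-1), type="gene"),
    SeqFeature(FeatureLocation(18, 36, strand=-1), type="CDS"),
]

# Tally feature types across the record.
type_counts = Counter(feature.type for feature in toy.features)

for feature_type, count in type_counts.most_common():
    print(feature_type, count)
```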

---

## Extracting Gene Features

We can filter features to find specific types, such as genes.

```python
from Bio import SeqIO

record = next(SeqIO.parse("plasmid.gbk", "genbank"))

for feature in record.features:
    if feature.type == "gene":
        print(feature.location)
        print(feature.qualifiers.get("gene"))
```

```text
[86:1109](+)
None
[1105:1888](+)
None
[2924:3119](+)
['rop']
[3485:3857](+)
None
[4342:4780](+)
['pim']
[4814:5888](-)
['pst']
[6004:6421](+)
None
[6663:7602](+)
['pla']
[7788:8088](-)
None
[8087:8360](-)
None
```

Feature objects contain:

- **`feature.type`** — feature category (gene, CDS, etc.)
- **`feature.location`** — coordinates on the sequence
- **`feature.qualifiers`** — additional metadata such as gene name or product
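
Locations use Python-style coordinates: zero-based, with the end position excluded. A feature written as `87..1109` in GenBank text therefore prints as `[86:1109]` in Biopython. The sketch below builds one feature by hand to show how the parts of a location can be read; the `locus_tag` value is a made-up placeholder.

```python
from Bio.SeqFeature import SeqFeature, FeatureLocation

# Hand-built feature using the coordinates of the first gene shown above.
gene = SeqFeature(
    FeatureLocation(86, 1109, strand=+1),
    type="gene",
    qualifiers={"locus_tag": ["toy_0001"]},  # placeholder qualifier
)

start = int(gene.location.start)   # zero-based start
end = int(gene.location.end)       # end is exclusive
strand = gene.location.strand      # +1 forward, -1 reverse
length = len(gene)                 # number of bases covered

print(gene.location)
print("start:", start, "end:", end, "strand:", strand, "length:", length)
```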

---

## Extracting Coding Sequences (CDS)

Coding sequences represent protein-coding regions of DNA.

```python
from Bio import SeqIO

record = next(SeqIO.parse("plasmid.gbk", "genbank"))

for feature in record.features:
    if feature.type == "CDS":
        gene = feature.qualifiers.get("gene", ["unknown"])[0]
        protein = feature.qualifiers.get("product", ["unknown protein"])[0]

        print("Gene:", gene)
        print("Protein:", protein)
        print()
```

```text
Gene: unknown
Protein: putative transposase

Gene: unknown
Protein: transposase/IS protein

Gene: rop
Protein: putative replication regulatory protein

Gene: unknown
Protein: hypothetical protein

Gene: pim
Protein: pesticin immunity protein

Gene: pst
Protein: pesticin

Gene: unknown
Protein: hypothetical protein

Gene: pla
Protein: outer membrane protease

Gene: unknown
Protein: putative transcriptional regulator

Gene: unknown
Protein: hypothetical protein
```

CDS features often contain important qualifiers such as:

- gene name
- protein product
- translation (amino acid sequence)

These are useful for functional genomics analysis.
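
The `translation` qualifier can be checked against a fresh translation of the CDS itself. The sketch below uses an invented toy record and CDS; it assumes the feature carries `transl_table` and `translation` qualifiers, which real CDS features often, but not always, include.

```python
from Bio.Seq import Seq
from Bio.SeqFeature import SeqFeature, FeatureLocation
from Bio.SeqRecord import SeqRecord

# Toy record: two flanking bases on each side of an 18-base coding sequence.
toy = SeqRecord(Seq("CCATGGTCACTTTTGAGTAACC"), id="toy1")
cds = SeqFeature(
    FeatureLocation(2, 20, strand=+1),
    type="CDS",
    qualifiers={
        "product": ["toy protein"],
        "transl_table": ["11"],
        "translation": ["MVTFE"],
    },
)

# Qualifier values are lists of strings, so take the first element.
stored = cds.qualifiers.get("translation", [""])[0]
table = int(cds.qualifiers.get("transl_table", ["1"])[0])

# cds=True checks for a valid start codon and a single terminal stop codon.
computed = cds.extract(toy.seq).translate(table=table, cds=True)
matches = str(computed) == stored

print("Stored:  ", stored)
print("Computed:", computed)
print("Match:", matches)
```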

---

## Extracting the DNA Sequence of a Feature

You can also extract the exact DNA sequence corresponding to a feature.

```python
from Bio import SeqIO

record = next(SeqIO.parse("plasmid.gbk", "genbank"))

for feature in record.features:
    if feature.type == "CDS":
        sequence = feature.extract(record.seq)
        print("CDS: ", sequence)
        break
```

```text
CDS:  ATGGTCACTTTTGAGACAGTTATGGAAATTAAAATCCTGCACAAGCAGGGAATGAGTAGCCGGGCGATTGCCAGAGAACTGGGGATCTCCCGCAATACCGTTAAACGTTATTTGCAGGCAAAATCTGAGCCGCCAAAATATACGCCGCGACCTGCTGTTGCTTCACTCCTGGATGAATACCGGGATTATATTCGTCAACGCATCGCCGATGCTCATCCTTACAAAATCCCGGCAACGGTAATCGCTCGCGAGATCAGAGACCAGGGATATCGTGGCGGAATGACCATTCTCAGGGCATTCATTCGTTCTCTCTCGGTTCCTCAGGAGCAGGAGCCTGCCGTTCGGTTCGAAACTGAACCCGGACGACAGATGCAGGTTGACTGGGGCACTATGCGTAATGGTCGCTCACCGCTTCACGTGTTCGTTGCTGTTCTCGGATACAGCCGAATGCTGTACATCGAATTCACTGACAATATGCGTTATGACACGCTGGAGACCTGCCATCGTAATGCGTTCCGCTTCTTTGGTGGTGTGCCGCGCGAAGTGTTGTATGACAATATGAAAACTGTGGTTCTGCAACGTGACGCATATCAGACCGGTCAGCACCGGTTCCATCCTTCGCTGTGGCAGTTCGGCAAGGAGATGGGCTTCTCTCCCCGACTGTGTCGCCCCTTCAGGGCACAGACTAAAGGTAAGGTGGAACGGATGGTGCAGTACACCCGTAACAGTTTTTACATCCCACTAATGACTCGCCTGCGCCCGATGGGGATCACTGTCGATGTTGAAACAGCCAACCGCCACGGTCTGCGCTGGCTGCACGATGTCGCTAACCAACGAAAGCATGAAACAATCCAGGCCCGTCCCTGCGATCGCTGGCTCGAAGAGCAGCAGTCCATGCTGGCACTGCCTCCGGAGAAAAAAGAGTATGACGTGCATCTTGATGAAAATCTGGTGAACTTCGACAAACACCCCCTGCATCATCCACTCTCCATCTACGACTCATTCTGCAGAGGAGTGGCGTGA
```

The `extract()` method retrieves the subsequence defined by the feature location.
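
`extract()` also takes the strand into account: for a feature on the reverse strand it returns the reverse complement of the slice, so the result always reads in the feature's own 5' to 3' direction. A minimal sketch with an invented parent sequence:

```python
from Bio.Seq import Seq
from Bio.SeqFeature import SeqFeature, FeatureLocation

# Toy parent sequence (not from the plasmid).
parent = Seq("AAAACCCCGGTT")

# Same coordinates, opposite strands.
forward = SeqFeature(FeatureLocation(4, 10, strand=+1), type="misc_feature")
reverse = SeqFeature(FeatureLocation(4, 10, strand=-1), type="misc_feature")

forward_seq = forward.extract(parent)   # plain slice parent[4:10]
reverse_seq = reverse.extract(parent)   # reverse complement of that slice

print("Forward:", forward_seq)
print("Reverse:", reverse_seq)
```

This is why slicing `record.seq` by hand with the start and end coordinates is not enough for genes on the minus strand.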

---

## Converting GenBank to FASTA

Sometimes you want only the raw DNA sequence.

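```python
from Bio import SeqIO

# Take the first record in the file
record = next(SeqIO.parse("plasmid.gbk", "genbank"))

# Write it out in FASTA format; the annotations are not carried over
SeqIO.write([record], "plasmid.fasta", "fasta")
```

```text
1
```

This writes the first record of the GenBank file to `plasmid.fasta` in FASTA format, keeping only the identifier, description, and sequence. `SeqIO.write()` returns the number of records written, here `1`.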
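
When a file holds several records, `SeqIO.convert()` handles all of them in one call. The sketch below uses in-memory handles and two invented toy records in place of real files, and assumes a recent Biopython release, where GenBank output requires a `molecule_type` annotation.

```python
from io import StringIO

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Two invented records stand in for a multi-record GenBank file.
toy_records = [
    SeqRecord(Seq("ATGGTCACTTTTGAGTAA"), id="toy1", name="toy1",
              description="first toy record", annotations={"molecule_type": "DNA"}),
    SeqRecord(Seq("ATGCCCGGGAAATTTTAG"), id="toy2", name="toy2",
              description="second toy record", annotations={"molecule_type": "DNA"}),
]
genbank_handle = StringIO()
SeqIO.write(toy_records, genbank_handle, "genbank")
genbank_handle.seek(0)

# Convert every record from GenBank to FASTA in one step.
fasta_handle = StringIO()
count = SeqIO.convert(genbank_handle, "genbank", fasta_handle, "fasta")
fasta_text = fasta_handle.getvalue()

print("Records converted:", count)
print(fasta_text)
```

With files on disk, the handles can be replaced by file names, for example `SeqIO.convert("plasmid.gbk", "genbank", "plasmid.fasta", "fasta")`.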

---

## Writing a GenBank File

You can also write modified sequence records back to a GenBank file.

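```python
from Bio import SeqIO

# Take the first record in the file
record = next(SeqIO.parse("plasmid.gbk", "genbank"))

records = [record]

# Write the record, annotations and features included, to a new GenBank file
SeqIO.write(records, "copy.gbk", "genbank")
```

```text
1
```

This example reads the first record from the GenBank file and writes it, unchanged, to a new file. Any edits made to the record before the call to `SeqIO.write()` are saved in the same way.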
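
A minimal sketch of writing a record that has actually been modified: it changes the description, adds a feature, writes the result to an in-memory handle, and reads it back to confirm the changes survived. The toy record and the `note` text are invented for illustration; a recent Biopython release is assumed, where GenBank output requires a `molecule_type` annotation.

```python
from io import StringIO

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqFeature import SeqFeature, FeatureLocation
from Bio.SeqRecord import SeqRecord

# Toy record standing in for one loaded from a file.
record = SeqRecord(
    Seq("ATGGTCACTTTTGAGTAA"),
    id="toy1",
    name="toy1",
    description="toy record",
    annotations={"molecule_type": "DNA"},
)

# Modify the record: new description plus an extra annotated region.
record.description = "toy record with an extra note"
record.features.append(
    SeqFeature(
        FeatureLocation(0, 18, strand=+1),
        type="misc_feature",
        qualifiers={"note": ["added in Python"]},
    )
)

# Write to GenBank format, then read it back to check the round trip.
handle = StringIO()
SeqIO.write([record], handle, "genbank")
handle.seek(0)
reloaded = SeqIO.read(handle, "genbank")

print(reloaded.features[0].type, reloaded.features[0].qualifiers["note"])
```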

---

## Conclusion

GenBank files store much more than sequence data—they include rich biological annotations that describe genes, coding regions, regulatory elements, and organism metadata.

Using **Biopython**, you can easily access and analyze this information in Python.

In this tutorial, you learned how to:

- read GenBank files with `SeqIO`
- explore sequence metadata and annotations
- access genomic features
- extract genes and coding sequences
- retrieve subsequences from features
- convert GenBank files to other formats

These capabilities are essential when working with **genome annotations, plasmid maps, and public biological databases**.