# Working with Nucleotide Sequences in Biopython

Nucleotide sequences (DNA or RNA) are central to many bioinformatics workflows. The **Biopython** library provides powerful tools for creating, manipulating, and analyzing biological sequences in Python.

In this tutorial, you’ll learn how to:

- Create nucleotide sequences
- Compute complements and reverse complements
- Transcribe DNA to RNA
- Translate DNA into protein sequences
- Perform simple sequence analysis

Biopython represents sequences using the `Seq` object from the `Bio.Seq` module.

---

## Creating a DNA Sequence

The `Seq` class represents a biological sequence and behaves similarly to a Python string, but with extra biological functionality.

```python
from Bio.Seq import Seq

# Create a DNA sequence
dna_seq = Seq("ATGCGTACGTTAGC")

# Print the sequence
print("DNA sequence:", dna_seq)

# Get sequence length
print("Length:", len(dna_seq))

# Access specific nucleotide
print("First nucleotide:", dna_seq[0])

# Slice the sequence
print("First five nucleotides:", dna_seq[:5])
```

```text
DNA sequence: ATGCGTACGTTAGC
Length: 14
First nucleotide: A
First five nucleotides: ATGCG
```

**Explanation**

* **`Seq("ATGCGTACGTTAGC")`** creates a Biopython sequence object containing DNA nucleotides.
* **`len(dna_seq)`** returns the sequence length.
* **Indexing (`dna_seq[0]`)** accesses a single nucleotide.
* **Slicing (`dna_seq[:5]`)** extracts a portion of the sequence, just like with Python strings.

---

## Complement and Reverse Complement

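The two strands of a DNA double helix are complementary: each base on one strand pairs with a fixed partner on the other. Biopython provides built-in methods to compute both the complement and the reverse complement.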

```python
from Bio.Seq import Seq

# Define DNA sequence
dna_seq = Seq("ATGCGTACGTTAGC")

# Complement
complement = dna_seq.complement()

# Reverse complement
reverse_complement = dna_seq.reverse_complement()

print("Original:", dna_seq)
print("Complement:", complement)
print("Reverse complement:", reverse_complement)
```

```text
Original: ATGCGTACGTTAGC
Complement: TACGCATGCAATCG
Reverse complement: GCTAACGTACGCAT
```

**Explanation**

* **`complement()`** replaces each nucleotide with its pair:

  * A ↔ T
  * C ↔ G
* **`reverse_complement()`** first reverses the sequence and then computes the complement.
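* Because the two strands run in opposite directions, the reverse complement is the sequence of the opposite strand read 5′ to 3′, which is why it is the usual way to analyze that strand.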
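
To make the pairing rule concrete, here is a minimal pure-Python sketch of what the two operations do conceptually. It is not Biopython's implementation: the helper names `complement` and `reverse_complement` are hypothetical, and the sketch assumes an uppercase sequence containing only `A`, `C`, `G`, and `T`.

```python
# Illustrative sketch only. Biopython's own methods also handle lowercase
# letters and ambiguity codes, which this simplified version does not.
_PAIRS = str.maketrans("ACGT", "TGCA")


def complement(dna: str) -> str:
    """Swap each base for its pairing partner (A<->T, C<->G)."""
    return dna.translate(_PAIRS)


def reverse_complement(dna: str) -> str:
    """Reverse the strand, then complement it."""
    return complement(dna[::-1])


print(complement("ATGCGTACGTTAGC"))
print(reverse_complement("ATGCGTACGTTAGC"))
```

Reversing and complementing commute, so applying them in either order gives the same result.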

---

## Transcribing DNA to RNA

Transcription produces an RNA copy of a DNA sequence. Biopython's `transcribe()` treats the input as the coding strand and replaces each thymine (`T`) with uracil (`U`).

```python
from Bio.Seq import Seq

# DNA sequence
dna_seq = Seq("ATGCGTACGTTAGC")

# Transcribe DNA to RNA
rna_seq = dna_seq.transcribe()

print("DNA:", dna_seq)
print("RNA:", rna_seq)
```

```text
DNA: ATGCGTACGTTAGC
RNA: AUGCGUACGUUAGC
```

**Explanation**

* **`transcribe()`** converts DNA to RNA.
* Thymine (`T`) becomes uracil (`U`).
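* The result matches biological transcription, in which RNA polymerase reads the template strand and produces an RNA with the same sequence as the coding strand (with `U` in place of `T`).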
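
The relationship between the coding strand, the template strand, and the RNA can be sketched in plain Python. This is an illustration under simplifying assumptions (uppercase `A`/`C`/`G`/`T` only), not Biopython's code, and the function names `transcribe_coding` and `transcribe_template` are hypothetical.

```python
# Illustrative sketch: two routes to the same RNA.
_PAIRS = str.maketrans("ACGT", "TGCA")


def transcribe_coding(coding_dna: str) -> str:
    """Starting from the coding strand, simply swap T for U."""
    return coding_dna.replace("T", "U")


def transcribe_template(template_dna: str) -> str:
    """Starting from the template strand, take its reverse complement
    to recover the coding strand, then swap T for U."""
    coding_dna = template_dna.translate(_PAIRS)[::-1]
    return coding_dna.replace("T", "U")


coding = "ATGCGTACGTTAGC"
template = "GCTAACGTACGCAT"  # reverse complement of the coding strand

print(transcribe_coding(coding))
print(transcribe_template(template))
```

Both calls print the same RNA, which is why starting from the coding strand is a convenient shortcut.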

---

## Translating DNA into Protein

Biopython can translate nucleotide sequences into amino acid sequences.

```python
from Bio.Seq import Seq

# DNA sequence with a start codon
dna_seq = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")

# Translate DNA into protein
protein = dna_seq.translate()

print("DNA:", dna_seq)
print("Protein:", protein)
```

```text
DNA: ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG
Protein: MAIVMGR*KGAR*
```

**Explanation**

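* **`translate()`** converts DNA into amino acids, using the standard genetic code by default.
* Translation reads nucleotides in groups of **three (codons)**, starting at the first nucleotide.
* Each codon corresponds to one amino acid in the resulting protein sequence. Stop codons appear as `*`, and by default translation continues past them, which is why the output above contains two `*` characters.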
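
The codon-by-codon lookup can be sketched in plain Python as follows. This is a simplified illustration, not Biopython's implementation: the `translate` function and its table are built here by hand for the standard genetic code, and the sketch assumes uppercase `A`/`C`/`G`/`T` input. The `to_stop` flag is modeled on the `to_stop=True` argument that Biopython's `translate()` accepts to end the protein at the first stop codon.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G order
# (first base varies slowest, third base fastest).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str, to_stop: bool = False) -> str:
    """Translate DNA codon by codon; '*' marks a stop codon.

    Any trailing one or two nucleotides that do not form a full
    codon are ignored in this sketch.
    """
    protein = []
    usable_length = len(dna) - len(dna) % 3
    for i in range(0, usable_length, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*" and to_stop:
            break
        protein.append(amino_acid)
    return "".join(protein)


dna = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"
print(translate(dna))                # reads through stop codons
print(translate(dna, to_stop=True))  # ends before the first stop codon
```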

---

## Counting Nucleotides

You can easily analyze nucleotide composition using standard Python methods.

```python
from Bio.Seq import Seq

# DNA sequence
dna_seq = Seq("ATGCGTACGTTAGC")

# Convert to string for counting
seq_str = str(dna_seq)

# Count nucleotides
a_count = seq_str.count("A")
t_count = seq_str.count("T")
g_count = seq_str.count("G")
c_count = seq_str.count("C")

print("A:", a_count)
print("T:", t_count)
print("G:", g_count)
print("C:", c_count)
```

```text
A: 3
T: 4
G: 4
C: 3
```

**Explanation**

* **`str(dna_seq)`** converts the Biopython sequence object to a normal Python string.
* The **`count()`** method counts occurrences of each nucleotide.
* This is useful for computing sequence composition or GC content.
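
As an alternative to four separate `count()` calls, the standard library's `collections.Counter` tallies every symbol in a single pass. A side benefit is that unexpected characters (for example `N`) show up in the result instead of being silently ignored. The sketch below works on a plain string; with a `Seq` object you would pass `str(dna_seq)`.

```python
from collections import Counter

# Tally every character in the sequence in one pass
counts = Counter("ATGCGTACGTTAGC")

for base in "ATGC":
    print(f"{base}: {counts[base]}")

# Anything outside the expected alphabet would be listed here
unexpected = set(counts) - set("ACGT")
print("Unexpected symbols:", sorted(unexpected))
```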

---

## Calculating GC Content

GC content measures the proportion of guanine (`G`) and cytosine (`C`) bases in a sequence.

```python
from Bio.Seq import Seq

# DNA sequence
dna_seq = Seq("ATGCGTACGTTAGC")

# Convert to string
seq_str = str(dna_seq)

# Calculate GC content
gc_count = seq_str.count("G") + seq_str.count("C")
gc_content = gc_count / len(seq_str) * 100

print("GC content:", gc_content)
```

```text
GC content: 50.0
```

**Explanation**

* **`seq_str.count("G") + seq_str.count("C")`** counts GC bases.
* The value is divided by the total sequence length.
* Multiplying by **100** converts it into a percentage.
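
The calculation above fails with a `ZeroDivisionError` on an empty sequence and misses lowercase bases. A slightly more defensive version might look like the sketch below; the function name `gc_percent` is hypothetical, and returning `0.0` for an empty sequence is an assumption made for this example.

```python
def gc_percent(seq: str) -> float:
    """Return GC content as a percentage (0.0 for an empty sequence)."""
    seq = seq.upper()  # treat lowercase bases the same as uppercase
    if not seq:
        return 0.0     # assumption: define GC content of "" as 0
    gc_count = seq.count("G") + seq.count("C")
    return gc_count / len(seq) * 100


print(gc_percent("ATGCGTACGTTAGC"))
print(gc_percent("atgc"))
print(gc_percent(""))
```

Recent Biopython releases also provide `gc_fraction` in `Bio.SeqUtils`, which returns a fraction between 0 and 1 rather than a percentage.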

---

## Reading Sequences from a FASTA File

Biopython’s `SeqIO` module allows you to read sequences from common bioinformatics file formats.

```python
from Bio import SeqIO
import requests

# Download an example FASTA file to work with
url = "https://raw.githubusercontent.com/omgenomics/bio-data-zoo/refs/heads/main/data/fasta/good/basic_dna.fa"
response = requests.get(url, timeout=30)
response.raise_for_status()  # stop here if the download failed
with open("example.fasta", "w") as f:
    f.write(response.text)

# Parse sequences from the FASTA file
for record in SeqIO.parse("example.fasta", "fasta"):
    print("ID:", record.id)
    print("Sequence:", record.seq)
    print("Length:", len(record.seq))
```

```text
ID: sequence1
Sequence: AATTCTCATTACTGTATCACAGCAAGTTGTATTTACAACAAAAATCCAAA
Length: 50
ID: sequence2
Sequence: GCCTACCAGAAAACGTTGTATTTTGGCAAAGTTCAAAAAGTCAGTCCAGA
Length: 50
ID: sequence3
Sequence: GTATAATTCACAGAGTTTCATGTGGTTGTTGTTGACTCTACATATTGTCT
Length: 50
```

**Explanation**

* **`SeqIO.parse()`** reads sequences from a file.
* `"example.fasta"` is the input file.
* `"fasta"` specifies the file format.
* Each **`record`** contains metadata (`id`) and the biological sequence (`seq`).
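
To show what a FASTA file looks like and roughly what a parser has to do, here is a minimal pure-Python reader. It is a teaching sketch, not a substitute for `SeqIO.parse()`: the `read_fasta` function is hypothetical, the two records are made-up example data, and it skips the validation and metadata handling that Biopython performs.

```python
import io


def read_fasta(handle):
    """Yield (id, sequence) pairs from a FASTA-formatted text handle."""
    seq_id, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header line closes the previous record, if any
            if seq_id is not None:
                yield seq_id, "".join(chunks)
            # Use the first word after ">" as the ID
            words = line[1:].split()
            seq_id, chunks = (words[0] if words else ""), []
        else:
            # Sequence data may be wrapped over several lines
            chunks.append(line)
    if seq_id is not None:
        yield seq_id, "".join(chunks)


# Made-up example data standing in for a file on disk
example = io.StringIO(">seq1 demo record\nATGC\nGTAC\n>seq2\nTTAGC\n")
records = list(read_fasta(example))

for seq_id, sequence in records:
    print(seq_id, sequence, len(sequence))
```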

---

## Conclusion

Biopython provides a powerful toolkit for working with nucleotide sequences in Python. With just a few lines of code, you can:

* Create DNA and RNA sequences
* Compute complements and reverse complements
* Transcribe and translate sequences
* Perform basic sequence analysis
* Read biological data from FASTA files

These capabilities form the foundation of many **bioinformatics workflows**, from genome analysis to protein prediction.