# Reverse Complement and Transcription in Biopython

Reverse complementation and transcription are two of the most common sequence transformations in bioinformatics pipelines. You need them when working with the opposite strand, preparing coding regions, or moving from DNA-level to RNA-level analysis.

In this tutorial, you will use Biopython sequence objects for these transformations.

## Reverse Complement of DNA

```python
from Bio.Seq import Seq

# Define a DNA sequence
seq = Seq("ATGCCGTTAACCGT")

# Compute complement and reverse complement
complement = seq.complement()
reverse_complement = seq.reverse_complement()

print("Original:", seq)
print("Complement:", complement)
print("Reverse complement:", reverse_complement)
```

Output:

```text
Original: ATGCCGTTAACCGT
Complement: TACGGCAATTGGCA
Reverse complement: ACGGTTAACGGCAT
```

Both transformations are methods on the `Seq` object. `complement()` swaps each base for its pairing partner and keeps the order, while `reverse_complement()` also reverses the result so that it reads 5' to 3' on the opposite strand. The reverse complement is the form you need when a gene or motif is located on the opposite DNA strand.
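To make the operation concrete, here is a minimal pure-Python sketch of what a reverse complement computes. It is an illustration only, not Biopython's implementation: the helper name `reverse_complement_str` is hypothetical, and it handles only the unambiguous bases A, C, G, and T.

```python
# Illustrative sketch only: not how Biopython implements reverse_complement().
# Translation table for unambiguous DNA in upper and lower case.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement_str(dna: str) -> str:
    """Complement each base, then reverse the string (hypothetical helper)."""
    return dna.translate(_COMPLEMENT)[::-1]


seq = "ATGCCGTTAACCGT"
print(reverse_complement_str(seq))  # ACGGTTAACGGCAT

# Applying the operation twice returns the original sequence.
print(reverse_complement_str(reverse_complement_str(seq)) == seq)  # True
```

The round-trip property in the last line makes a cheap sanity check when you are unsure whether a sequence has already been flipped.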

## Transcription (DNA -> RNA)

```python
from Bio.Seq import Seq

# DNA coding-strand example
dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")

# Transcribe: replace thymine (T) with uracil (U)
rna = dna.transcribe()

print("DNA:", dna)
print("RNA:", rna)
```

Output:

```text
DNA: ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG
RNA: AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG
```

`transcribe()` converts a DNA sequence into its RNA representation by replacing `T` with `U`. It treats the input as the coding strand and does not complement anything, so pass the coding strand rather than the template strand. This is a common preprocessing step before translation or RNA-focused analyses.
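If your input is the template strand rather than the coding strand, one way to get the same RNA is to take the reverse complement first and then swap `T` for `U`. The sketch below shows that relationship in pure Python, assuming unambiguous upper-case DNA; the function names are hypothetical. With Biopython objects, the equivalent chain is `template.reverse_complement().transcribe()`.

```python
# Illustrative sketch: coding-strand vs template-strand transcription.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def transcribe_coding(coding_dna: str) -> str:
    """Coding strand -> RNA: a plain T -> U substitution."""
    return coding_dna.replace("T", "U")


def transcribe_template(template_dna: str) -> str:
    """Template strand -> RNA: reverse-complement first, then T -> U."""
    coding_dna = template_dna.translate(_COMPLEMENT)[::-1]
    return transcribe_coding(coding_dna)


coding = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"
# Derive the matching template strand, written 5' to 3'.
template = coding.translate(_COMPLEMENT)[::-1]

# Both routes give the same RNA.
print(transcribe_template(template) == transcribe_coding(coding))  # True
```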

## Back-Transcription (RNA -> DNA)

```python
from Bio.Seq import Seq

# Start from RNA
rna = Seq("AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG")

# Convert RNA back to DNA representation
dna_back = rna.back_transcribe()

print("RNA:", rna)
print("Back-transcribed DNA:", dna_back)
```

Output:

```text
RNA: AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG
Back-transcribed DNA: ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG
```

Back-transcription is useful when integrating RNA-derived sequences with DNA-based tools and file formats. It keeps workflows consistent when different software expects different nucleotide alphabets.
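When inputs arrive from several sources, it can help to normalize everything to the DNA alphabet up front. The following is a minimal pure-Python sketch of such a guard. The function name `to_dna` and the decision to reject sequences that mix `T` and `U` are assumptions made for illustration, not Biopython behavior.

```python
# Illustrative sketch: normalize a nucleotide string to the DNA alphabet.
def to_dna(seq: str) -> str:
    """Return seq with U/u replaced by T/t (hypothetical helper).

    Raises ValueError if the input mixes T and U, since that usually
    signals a data problem rather than a valid sequence.
    """
    upper = seq.upper()
    if "T" in upper and "U" in upper:
        raise ValueError("sequence mixes T and U")
    return seq.replace("U", "T").replace("u", "t")


print(to_dna("AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG"))
# ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG
print(to_dna("ATGGCC"))  # DNA input passes through unchanged: ATGGCC
```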

## Batch Reverse-Complement Processing for FASTA Files

```python
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Create a small example FASTA file
records = [
    SeqRecord(seq=Seq("ATGCGTAC"), id="seq1", description=""),
    SeqRecord(seq=Seq("TTAACCGG"), id="seq2", description=""),
]

SeqIO.write(records, "input_dna.fasta", "fasta")

# Read, reverse-complement, and write output FASTA
revcomp_records = []
for record in SeqIO.parse("input_dna.fasta", "fasta"):
    rc = record.seq.reverse_complement()
    revcomp_records.append(
        SeqRecord(rc, id=record.id + "_rc", description="reverse_complement")
    )

SeqIO.write(revcomp_records, "output_revcomp.fasta", "fasta")
print("Wrote output_revcomp.fasta")
```

Output:

```text
Wrote output_revcomp.fasta
```

Batch processing is the practical pattern for real datasets with many sequences. It helps standardize strand orientation before downstream alignment or mapping.
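For large files, you may prefer to process one record at a time instead of collecting every result in a list. The sketch below shows that streaming pattern in pure Python, so the data flow is explicit and it runs without any files on disk. It is not a replacement for `SeqIO`: the names `read_fasta` and `write_revcomp` are hypothetical, the parser is deliberately minimal, and it assumes headers are bare IDs and sequences contain only A, C, G, and T.

```python
import io
from typing import Iterator, TextIO, Tuple

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs one record at a time."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def write_revcomp(src: TextIO, dst: TextIO) -> int:
    """Stream records from src to dst, reverse-complementing each one."""
    count = 0
    for header, seq in read_fasta(src):
        rc = seq.translate(_COMPLEMENT)[::-1]
        dst.write(f">{header}_rc\n{rc}\n")
        count += 1
    return count


# In-memory stand-ins for the input and output FASTA files.
src = io.StringIO(">seq1\nATGCGTAC\n>seq2\nTTAACCGG\n")
dst = io.StringIO()
print(write_revcomp(src, dst), "records written")
print(dst.getvalue(), end="")
```

Only one record is held in memory at a time, which is the property that matters once inputs grow beyond a few thousand sequences.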

## Primer-Oriented Reverse Complement Example

```python
from Bio.Seq import Seq

# Example primer pair for a target region
forward_primer = Seq("AGTCTGACCTGAACTG")
reverse_primer_template = Seq("TCAGGTTGCTAACGTA")

# The reverse primer used in PCR is the reverse complement of the
# template-side sequence at its binding site
reverse_primer = reverse_primer_template.reverse_complement()

print("Forward primer (5'->3'):", forward_primer)
print("Reverse primer template-side:", reverse_primer_template)
print("Reverse primer (5'->3'):", reverse_primer)
```

Output:

```text
Forward primer (5'->3'): AGTCTGACCTGAACTG
Reverse primer template-side: TCAGGTTGCTAACGTA
Reverse primer (5'->3'): TACGTTAGCAACCTGA
```

This is a common lab-facing use case: converting the template-side sequence of the reverse-primer binding site into the 5'->3' sequence that should be synthesized.
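A quick in-silico check can confirm that a primer pair is oriented correctly before ordering. The sketch below looks for the forward primer and for the reverse complement of the reverse primer on one strand of a target, then reports the product length. The target sequence is hypothetical (the filler between the two primer sites is made up for illustration), as are the function names. It requires exact matches and ignores melting temperature, mismatches, and multiple binding sites.

```python
from typing import Optional

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(dna: str) -> str:
    """Reverse complement of unambiguous upper-case DNA."""
    return dna.translate(_COMPLEMENT)[::-1]


def amplicon_length(target: str, fwd: str, rev: str) -> Optional[int]:
    """Return the product length, or None if either primer site is missing.

    The forward primer must match the given strand directly. The reverse
    primer anneals to that strand, so its reverse complement must appear
    downstream of the forward site.
    """
    start = target.find(fwd)
    if start == -1:
        return None
    rev_site = target.find(revcomp(rev), start + len(fwd))
    if rev_site == -1:
        return None
    return rev_site + len(rev) - start


forward_primer = "AGTCTGACCTGAACTG"
reverse_primer = "TACGTTAGCAACCTGA"

# Hypothetical target: primer sites from the example above around made-up filler.
target = forward_primer + "GGGCCCAAATTT" + "TCAGGTTGCTAACGTA"

print(amplicon_length(target, forward_primer, reverse_primer))  # 44
```

If the reverse primer is passed in its template-side orientation by mistake, the function returns `None` for this target, which is exactly the orientation error the check is meant to catch.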

## Strand-Aware Gene Extraction Before Transcription

```python
from Bio.Seq import Seq

# Toy genomic region and gene coordinates (0-based, end-exclusive, as in Python slicing)
genome = Seq("TTTATGAAACCCGGGTTTAAACCCATGCGTAAAGGG")
start, end = 5, 25
strand = -1  # gene annotated on reverse strand

gene_seq = genome[start:end]
if strand == -1:
    gene_seq = gene_seq.reverse_complement()

rna = gene_seq.transcribe()

print("Extracted coding DNA:", gene_seq)
print("Transcribed RNA:", rna)
```

Output:

```text
Extracted coding DNA: TGGGTTTAAACCCGGGTTTC
Transcribed RNA: UGGGUUUAAACCCGGGUUUC
```

In annotation pipelines, strand-aware extraction is critical. If you skip the reverse complement for a reverse-strand gene, you transcribe the wrong orientation, and any protein translated from that RNA will be biologically incorrect.
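The extraction logic is easy to wrap in a small function with input validation. The following pure-Python sketch assumes 0-based, end-exclusive coordinates and a strand encoded as `+1` or `-1`; the function name `extract_coding` is hypothetical, and it handles only unambiguous upper-case DNA.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def extract_coding(genome: str, start: int, end: int, strand: int) -> str:
    """Return the coding-strand sequence for a feature (hypothetical helper).

    Coordinates are 0-based and end-exclusive, as in Python slicing.
    """
    if strand not in (1, -1):
        raise ValueError("strand must be +1 or -1")
    if not 0 <= start <= end <= len(genome):
        raise ValueError("coordinates fall outside the sequence")
    region = genome[start:end]
    if strand == -1:
        # Reverse-strand feature: flip to read 5' to 3' on the opposite strand.
        region = region.translate(_COMPLEMENT)[::-1]
    return region


genome = "TTTATGAAACCCGGGTTTAAACCCATGCGTAAAGGG"
print(extract_coding(genome, 5, 25, -1))  # TGGGTTTAAACCCGGGTTTC
print(extract_coding(genome, 5, 25, +1))  # GAAACCCGGGTTTAAACCCA
```

Annotation formats such as GFF3 use 1-based, inclusive coordinates, so a feature listed there as `start..end` corresponds to `extract_coding(genome, start - 1, end, strand)` in this sketch.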

## Handling Ambiguous Bases During Transformations

```python
from Bio.Seq import Seq

# Sequence with ambiguity codes
ambiguous_dna = Seq("ATGNCGTRYAAT")

print("Original DNA:", ambiguous_dna)
print("Reverse complement:", ambiguous_dna.reverse_complement())
print("RNA transcript:", ambiguous_dna.transcribe())
```

Output:

```text
Original DNA: ATGNCGTRYAAT
Reverse complement: ATTRYACGNCAT
RNA transcript: AUGNCGURYAAU
```

Ambiguous nucleotide codes appear often in consensus sequences and low-confidence regions. In the output above, `N` (any base) is its own complement, `R` (A or G) and `Y` (C or T) complement each other, and transcription leaves all three codes unchanged. Verifying transformation behavior on these inputs helps avoid subtle bugs in production workflows.
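The complement rules for ambiguity codes follow directly from the bases each code stands for. The sketch below spells out the IUPAC DNA complement table in pure Python as a way to cross-check results. It is an illustration rather than Biopython's implementation, and the function name `revcomp_iupac` is hypothetical.

```python
# IUPAC DNA codes and their complements.
# R = A/G, Y = C/T, K = G/T, M = A/C, S = C/G, W = A/T,
# B = not A, V = not T, D = not C, H = not G, N = any base.
IUPAC_COMPLEMENT = {
    "A": "T", "C": "G", "G": "C", "T": "A",
    "R": "Y", "Y": "R", "K": "M", "M": "K",
    "S": "S", "W": "W", "B": "V", "V": "B",
    "D": "H", "H": "D", "N": "N",
}


def revcomp_iupac(dna: str) -> str:
    """Reverse complement that understands IUPAC ambiguity codes.

    The input is upper-cased first; unknown characters raise ValueError.
    """
    try:
        return "".join(IUPAC_COMPLEMENT[base] for base in reversed(dna.upper()))
    except KeyError as exc:
        raise ValueError(f"unsupported base: {exc.args[0]!r}") from None


print(revcomp_iupac("ATGNCGTRYAAT"))  # ATTRYACGNCAT
```

Raising on unknown characters, rather than passing them through, surfaces stray gap symbols or protein letters early instead of letting them reach downstream tools.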