# Translating DNA to Protein in Biopython

Translation converts a [nucleotide sequence](/tutorials/biopython-nucleotide-sequences) into a biologically interpretable [protein sequence](/tutorials/biopython-amino-acid-sequences). In practice, you often need to inspect multiple reading frames and identify candidate open reading frames (ORFs), especially when annotation is incomplete.

In this tutorial, you will translate DNA in different frames and detect ORFs with Biopython.

## Working with Reading Frames

from Bio.Seq import Seq

# Example DNA sequence
sequence = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")

# Translate frame 1 (starting at position 0)
protein_frame_1 = sequence.translate(to_stop=False)

# Translate frame 2 (starting at position 1)
protein_frame_2 = sequence[1:].translate(to_stop=False)

# Translate frame 3 (starting at position 2)
protein_frame_3 = sequence[2:].translate(to_stop=False)

print("Frame +1:", protein_frame_1)
print("Frame +2:", protein_frame_2)
print("Frame +3:", protein_frame_3)

Frame +1: MAIVMGR*KGAR*
Frame +2: WPL*WAAERVPD
Frame +3: GHCNGPLKGCPI

The frame +2 and +3 slices are 38 and 37 nucleotides long, so neither is a multiple of three. Biopython ignores the leftover bases and emits this warning for each:

BiopythonWarning: Partial codon, len(sequence) not a multiple of three. Explicitly trim the sequence or add trailing N before translation. This may become an error in future.

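Shifting the start position by one or two bases regroups every downstream codon, so each frame yields a completely different protein string. Frame-aware translation matters because a real coding region is read in only one frame; the other frames usually yield unrelated peptides, often interrupted by stop codons (`*`).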
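The partial-codon warning can be avoided by trimming each frame to a whole number of codons before translating. The sketch below is one way to do that; `translate_frame` is a hypothetical helper name, not part of Biopython's API.

```python
from Bio.Seq import Seq


def translate_frame(seq, offset):
    """Translate seq from offset, dropping any trailing partial codon.

    Hypothetical helper for illustration; not a Biopython function.
    """
    framed = seq[offset:]
    usable = len(framed) - len(framed) % 3  # largest multiple of three
    return framed[:usable].translate(to_stop=False)


sequence = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")
for offset in range(3):
    print(f"Frame +{offset + 1}:", translate_frame(sequence, offset))
```

The translated output is unchanged, because Biopython already ignores the leftover one or two bases; trimming only makes that behavior explicit.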

## Finding Open Reading Frames (ORFs) in DNA

from Bio.Seq import Seq

# DNA sequence containing multiple potential starts/stops
dna = Seq("AAATGAAATAGATGCCCTAAATGGGGTTTGA")

# Scan every forward-strand position (all three frames) for a start codon (ATG),
# then walk in-frame to the first stop codon
stops = {"TAA", "TAG", "TGA"}
orf_results = []

for i in range(0, len(dna) - 2):
    codon = str(dna[i:i+3])
    if codon == "ATG":
        for j in range(i + 3, len(dna) - 2, 3):
            stop_codon = str(dna[j:j+3])
            if stop_codon in stops:
                orf_seq = dna[i:j+3]
                protein = orf_seq.translate(to_stop=True)
                orf_results.append((i, j + 3, str(orf_seq), str(protein)))
                break

print("ORFs found:", len(orf_results))
for start, end, orf_seq, protein in orf_results:
    print(f"Start={start}, End={end}, ORF={orf_seq}, Protein={protein}")

ORFs found: 2
Start=2, End=11, ORF=ATGAAATAG, Protein=MK
Start=11, End=20, ORF=ATGCCCTAA, Protein=MP

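This example searches for each start codon and its first in-frame stop codon, then translates each ORF candidate. `Start` is a 0-based index and `End` is exclusive, so `dna[Start:End]` includes the stop codon. The third ATG (position 20) has no in-frame stop before the sequence ends, so it is not reported. ORF detection helps you move from raw DNA segments to candidate coding regions for annotation.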
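The scan above covers only the forward strand. A minimal sketch of extending it to both strands follows. It assumes the standard genetic code, a hypothetical helper named `find_orfs`, and a short made-up sequence chosen so that each strand carries exactly one ORF. Reverse-strand hits are mapped back to forward-strand coordinates.

```python
from Bio.Seq import Seq

STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code only


def find_orfs(seq):
    """Return (start, end, protein) for each ATG that has an in-frame stop.

    Hypothetical helper; end is exclusive and includes the stop codon.
    """
    text = str(seq)
    orfs = []
    for start in range(len(text) - 2):
        if text[start:start + 3] != "ATG":
            continue
        for pos in range(start + 3, len(text) - 2, 3):
            if text[pos:pos + 3] in STOP_CODONS:
                end = pos + 3
                protein = str(Seq(text[start:end]).translate(to_stop=True))
                orfs.append((start, end, protein))
                break
    return orfs


# Illustrative sequence (not from the tutorial): one ORF per strand
dna = Seq("ATGAAATAGCCCTAGGGCAT")
n = len(dna)

forward = [("+", s, e, p) for s, e, p in find_orfs(dna)]
# An ORF at [s, e) on the reverse complement covers [n - e, n - s) on the forward strand
reverse = [("-", n - e, n - s, p) for s, e, p in find_orfs(dna.reverse_complement())]

for strand, start, end, protein in forward + reverse:
    print(strand, start, end, protein)
```

Reporting every hit in forward-strand coordinates keeps results from both strands comparable when you later overlay them on an annotation track.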

## Translating All Reading Frames

from Bio.Seq import Seq

# Input DNA sequence
dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")

# Positive-strand frames (+1, +2, +3)
forward_frames = [
    dna[i:].translate(to_stop=False)
    for i in range(3)
]

# Reverse-complement strand frames (-1, -2, -3)
reverse = dna.reverse_complement()
reverse_frames = [
    reverse[i:].translate(to_stop=False)
    for i in range(3)
]

for idx, protein in enumerate(forward_frames, start=1):
    print(f"Frame +{idx}: {protein}")

for idx, protein in enumerate(reverse_frames, start=1):
    print(f"Frame -{idx}: {protein}")

Frame +1: MAIVMGR*KGAR*
Frame +2: WPL*WAAERVPD
Frame +3: GHCNGPLKGCPI
Frame -1: LSGTLSAAHYNGH
Frame -2: YRAPFQRPITMA
Frame -3: IGHPFSGPLQWP

Translating all six frames gives a complete view of possible coding interpretations from both strands. This is especially useful in exploratory analysis or when gene orientation is unknown.

## Longest ORF Across All Six Frames

from Bio.Seq import Seq

dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")
stops = {"*"}

def longest_orf_protein(seq):
    longest = ""
    for frame in range(3):
        protein = str(seq[frame:].translate(to_stop=False))
        current = ""
        for aa in protein:
            if aa in stops:
                if len(current) > len(longest):
                    longest = current
                current = ""
            else:
                current += aa
        if len(current) > len(longest):
            longest = current
    return longest

forward_longest = longest_orf_protein(dna)
reverse_longest = longest_orf_protein(dna.reverse_complement())
best = max(forward_longest, reverse_longest, key=len)

print("Longest ORF protein candidate:", best)
print("Length:", len(best))

Longest ORF protein candidate: LSGTLSAAHYNGH
Length: 13

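Taking the longest stop-free stretch across six frames is a practical heuristic for finding candidate coding sequence in unannotated fragments. Note that this function does not require a start codon: the winning segment here (`LSGTLSAAHYNGH`, from frame -1) does not begin with methionine, so treat it as a candidate coding fragment rather than a complete ORF.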
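If you want candidates that begin with methionine, one option is to search each stop-free segment for its first `M`. The sketch below assumes a hypothetical helper `longest_met_orf`. It is deliberately lenient: it also accepts a peptide that runs to the end of the sequence without reaching a stop codon.

```python
from Bio.Seq import Seq


def longest_met_orf(seq):
    """Longest peptide that starts at M and runs to a stop or the sequence end.

    Hypothetical helper for illustration; not a Biopython function.
    """
    best = ""
    for frame in range(3):
        framed = seq[frame:]
        framed = framed[: len(framed) - len(framed) % 3]  # drop partial codon
        for segment in str(framed.translate(to_stop=False)).split("*"):
            start = segment.find("M")  # first methionine in this stop-free segment
            if start != -1 and len(segment) - start > len(best):
                best = segment[start:]
    return best


dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")
candidates = [longest_met_orf(dna), longest_met_orf(dna.reverse_complement())]
best = max(candidates, key=len)

print("Longest Met-initiated candidate:", best)
print("Length:", len(best))
```

With the start-codon requirement, the best candidate for this sequence moves from frame -1 to frame +1 (`MAIVMGR`).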

## Applying a Minimum ORF Length Filter

from Bio.Seq import Seq

dna = Seq("AAATGAAATAGATGCCCTAAATGGGGTTTGA")
min_aa_length = 4
stops = {"TAA", "TAG", "TGA"}

filtered_orfs = []
for i in range(0, len(dna) - 2):
    if str(dna[i:i+3]) != "ATG":
        continue
    for j in range(i + 3, len(dna) - 2, 3):
        codon = str(dna[j:j+3])
        if codon in stops:
            orf_seq = dna[i:j+3]
            protein = str(orf_seq.translate(to_stop=True))
            if len(protein) >= min_aa_length:
                filtered_orfs.append((i, j + 3, protein))
            break

print("Filtered ORFs (>= min length):", filtered_orfs)

Filtered ORFs (>= min length): []

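Both ORFs in this sequence encode two-residue peptides (`MK` and `MP`), so neither passes the four-residue threshold and the filtered list is empty. Length filtering removes short, likely spurious ORFs and helps you prioritize biologically plausible candidates for annotation.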
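Because each reported ORF includes its stop codon, the protein length follows from the coordinates alone, which lets you filter before translating. A minimal sketch, assuming a hypothetical helper `orf_aa_length` and reusing the `(start, end)` pairs found earlier for this sequence:

```python
def orf_aa_length(start, end):
    """Residues encoded by an ORF whose exclusive end includes the stop codon."""
    return (end - start) // 3 - 1  # total codons minus the stop codon


# (start, end) pairs found for AAATGAAATAGATGCCCTAAATGGGGTTTGA
orf_coords = [(2, 11), (11, 20)]

for min_aa_length in (2, 4):
    kept = [c for c in orf_coords if orf_aa_length(*c) >= min_aa_length]
    print(f"min_aa_length={min_aa_length}: {kept}")
```

Lowering the threshold to 2 keeps both candidates, while the threshold of 4 used above keeps none.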

## Translating with an Alternative Genetic Code

from Bio.Seq import Seq

# Example DNA where table choice can change interpretation
dna = Seq("ATGATAAAGAATAG")

# Standard code (table 1)
protein_standard = dna.translate(table=1, to_stop=False)

# Vertebrate mitochondrial code (table 2)
protein_mito = dna.translate(table=2, to_stop=False)

print("Standard table translation:", protein_standard)
print("Mitochondrial table translation:", protein_mito)

Standard table translation: MIKN
Mitochondrial table translation: MMKN

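The two results differ at the second codon: `ATA` encodes isoleucine under the standard code but methionine under the vertebrate mitochondrial code. Alternative code tables are essential for mitochondrial genomes and for organisms that use non-standard codes, where codon meanings differ from the canonical code.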
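To see where two tables disagree, you can inspect them directly through `Bio.Data.CodonTable`. The sketch below compares tables 1 and 2; the variable names are illustrative.

```python
from Bio.Data import CodonTable

standard = CodonTable.unambiguous_dna_by_id[1]
mito = CodonTable.unambiguous_dna_by_id[2]

# Sense codons present in both tables whose amino acid differs
changed = {
    codon: (standard.forward_table[codon], mito.forward_table[codon])
    for codon in standard.forward_table
    if codon in mito.forward_table
    and standard.forward_table[codon] != mito.forward_table[codon]
}

print("Reassigned sense codons:", changed)
print("Standard stops:", sorted(standard.stop_codons))
print("Mitochondrial stops:", sorted(mito.stop_codons))
```

`forward_table` holds only sense codons, so codons that switch between stop and sense (such as `TGA`, `AGA`, and `AGG`) show up as differences in `stop_codons`, not in `changed`.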

## Frame-Level Translation Report

from Bio.Seq import Seq

dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")
reverse = dna.reverse_complement()

report = []
for strand_label, seq in [("+", dna), ("-", reverse)]:
    for frame in range(3):
        protein = str(seq[frame:].translate(to_stop=False))
        fragments = [frag for frag in protein.split("*") if frag]
        longest_fragment = max((len(f) for f in fragments), default=0)
        report.append(
            {
                "strand": strand_label,
                "frame": frame + 1,
                "protein_length": len(protein),
                "longest_orf_length": longest_fragment,
                "stop_count": protein.count("*"),
            }
        )

for row in report:
    print(row)

{'strand': '+', 'frame': 1, 'protein_length': 13, 'longest_orf_length': 7, 'stop_count': 2}
{'strand': '+', 'frame': 2, 'protein_length': 12, 'longest_orf_length': 8, 'stop_count': 1}
{'strand': '+', 'frame': 3, 'protein_length': 12, 'longest_orf_length': 12, 'stop_count': 0}
{'strand': '-', 'frame': 1, 'protein_length': 13, 'longest_orf_length': 13, 'stop_count': 0}
{'strand': '-', 'frame': 2, 'protein_length': 12, 'longest_orf_length': 12, 'stop_count': 0}
{'strand': '-', 'frame': 3, 'protein_length': 12, 'longest_orf_length': 12, 'stop_count': 0}

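A frame summary report lets you compare coding potential systematically across all six frames instead of inspecting each translation manually. Here `longest_orf_length` is the length in residues of the longest stop-free fragment; no start codon is required.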
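Once the report is a list of dictionaries, ranking frames is a short sort. The sketch below copies three rows from the output above instead of recomputing them. The ranking rule (longest stop-free fragment first, fewest stops as tie-breaker) is an assumption for illustration.

```python
# Three rows copied from the report output, trimmed to the fields used here
report = [
    {"strand": "+", "frame": 1, "longest_orf_length": 7, "stop_count": 2},
    {"strand": "+", "frame": 3, "longest_orf_length": 12, "stop_count": 0},
    {"strand": "-", "frame": 1, "longest_orf_length": 13, "stop_count": 0},
]

# Longest fragment first; fewer stops breaks ties
ranked = sorted(report, key=lambda row: (-row["longest_orf_length"], row["stop_count"]))

for row in ranked:
    label = f'{row["strand"]}{row["frame"]}'
    print(f'{label}: longest={row["longest_orf_length"]}, stops={row["stop_count"]}')

best = ranked[0]
```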