FASTA files are one of the most common formats used in bioinformatics. They store DNA, RNA, or protein sequences in a simple text format and are widely used in genomic databases, sequence analysis pipelines, and research workflows. If you're studying biology, bioinformatics, or computational biology, you'll almost certainly encounter FASTA files. Fortunately, the **Biopython** library provides convenient tools for reading, parsing, and writing FASTA data in Python.

In this tutorial, you'll learn how to:

* Read sequences from a FASTA file
* Access sequence IDs and descriptions
* Iterate through multiple sequences
* Calculate sequence statistics
* Write new FASTA files

We'll use the `Bio.SeqIO` module, which is designed for reading and writing biological sequence file formats.

---

## Downloading an Example FASTA File

Before working with FASTA files, let's download a small example file from the web.
```python
import requests

# Download an example FASTA file
url = "https://raw.githubusercontent.com/biopython/biopython/master/Doc/examples/ls_orchid.fasta"
response = requests.get(url)
response.raise_for_status()  # stop here if the download failed

# Save the downloaded text locally
with open("orchids.fasta", "w") as f:
    f.write(response.text)

print("FASTA file downloaded.")
```

```
FASTA file downloaded.
```
This code downloads a publicly available FASTA file from the Biopython repository and saves it locally as `orchids.fasta`. This file contains multiple DNA sequences that we will use in the examples throughout this tutorial.

---

## Reading a FASTA File

The `SeqIO.parse()` function is the most common way to read FASTA files in Biopython.
```python
from Bio import SeqIO

# Parse the FASTA file
for record in SeqIO.parse("orchids.fasta", "fasta"):
    print(record.id)
```

```
gi|2765658|emb|Z78533.1|CIZ78533
gi|2765657|emb|Z78532.1|CCZ78532
gi|2765656|emb|Z78531.1|CFZ78531
gi|2765655|emb|Z78530.1|CMZ78530
gi|2765654|emb|Z78529.1|CLZ78529
gi|2765652|emb|Z78527.1|CYZ78527
gi|2765651|emb|Z78526.1|CGZ78526
gi|2765650|emb|Z78525.1|CAZ78525
gi|2765649|emb|Z78524.1|CFZ78524
gi|2765648|emb|Z78523.1|CHZ78523
gi|2765647|emb|Z78522.1|CMZ78522
gi|2765646|emb|Z78521.1|CCZ78521
gi|2765645|emb|Z78520.1|CSZ78520
gi|2765644|emb|Z78519.1|CPZ78519
gi|2765643|emb|Z78518.1|CRZ78518
gi|2765642|emb|Z78517.1|CFZ78517
gi|2765641|emb|Z78516.1|CPZ78516
gi|2765640|emb|Z78515.1|MXZ78515
gi|2765639|emb|Z78514.1|PSZ78514
gi|2765638|emb|Z78513.1|PBZ78513
gi|2765637|emb|Z78512.1|PWZ78512
gi|2765636|emb|Z78511.1|PEZ78511
gi|2765635|emb|Z78510.1|PCZ78510
gi|2765634|emb|Z78509.1|PPZ78509
gi|2765633|emb|Z78508.1|PLZ78508
gi|2765632|emb|Z78507.1|PLZ78507
gi|2765631|emb|Z78506.1|PLZ78506
gi|2765630|emb|Z78505.1|PSZ78505
gi|2765629|emb|Z78504.1|PKZ78504
gi|2765628|emb|Z78503.1|PCZ78503
gi|2765627|emb|Z78502.1|PBZ78502
gi|2765626|emb|Z78501.1|PCZ78501
gi|2765625|emb|Z78500.1|PWZ78500
gi|2765624|emb|Z78499.1|PMZ78499
gi|2765623|emb|Z78498.1|PMZ78498
gi|2765622|emb|Z78497.1|PDZ78497
gi|2765621|emb|Z78496.1|PAZ78496
gi|2765620|emb|Z78495.1|PEZ78495
gi|2765619|emb|Z78494.1|PNZ78494
gi|2765618|emb|Z78493.1|PGZ78493
gi|2765617|emb|Z78492.1|PBZ78492
gi|2765616|emb|Z78491.1|PCZ78491
gi|2765615|emb|Z78490.1|PFZ78490
gi|2765614|emb|Z78489.1|PDZ78489
gi|2765613|emb|Z78488.1|PTZ78488
gi|2765612|emb|Z78487.1|PHZ78487
gi|2765611|emb|Z78486.1|PBZ78486
gi|2765610|emb|Z78485.1|PHZ78485
gi|2765609|emb|Z78484.1|PCZ78484
gi|2765608|emb|Z78483.1|PVZ78483
gi|2765607|emb|Z78482.1|PEZ78482
gi|2765606|emb|Z78481.1|PIZ78481
gi|2765605|emb|Z78480.1|PGZ78480
gi|2765604|emb|Z78479.1|PPZ78479
gi|2765603|emb|Z78478.1|PVZ78478
gi|2765602|emb|Z78477.1|PVZ78477
gi|2765601|emb|Z78476.1|PGZ78476
gi|2765600|emb|Z78475.1|PSZ78475
gi|2765599|emb|Z78474.1|PKZ78474
gi|2765598|emb|Z78473.1|PSZ78473
gi|2765597|emb|Z78472.1|PLZ78472
gi|2765596|emb|Z78471.1|PDZ78471
gi|2765595|emb|Z78470.1|PPZ78470
gi|2765594|emb|Z78469.1|PHZ78469
gi|2765593|emb|Z78468.1|PAZ78468
gi|2765592|emb|Z78467.1|PSZ78467
gi|2765591|emb|Z78466.1|PPZ78466
gi|2765590|emb|Z78465.1|PRZ78465
gi|2765589|emb|Z78464.1|PGZ78464
gi|2765588|emb|Z78463.1|PGZ78463
gi|2765587|emb|Z78462.1|PSZ78462
gi|2765586|emb|Z78461.1|PWZ78461
gi|2765585|emb|Z78460.1|PCZ78460
gi|2765584|emb|Z78459.1|PDZ78459
gi|2765583|emb|Z78458.1|PHZ78458
gi|2765582|emb|Z78457.1|PCZ78457
gi|2765581|emb|Z78456.1|PTZ78456
gi|2765580|emb|Z78455.1|PJZ78455
gi|2765579|emb|Z78454.1|PFZ78454
gi|2765578|emb|Z78453.1|PSZ78453
gi|2765577|emb|Z78452.1|PBZ78452
gi|2765576|emb|Z78451.1|PHZ78451
gi|2765575|emb|Z78450.1|PPZ78450
gi|2765574|emb|Z78449.1|PMZ78449
gi|2765573|emb|Z78448.1|PAZ78448
gi|2765572|emb|Z78447.1|PVZ78447
gi|2765571|emb|Z78446.1|PAZ78446
gi|2765570|emb|Z78445.1|PUZ78445
gi|2765569|emb|Z78444.1|PAZ78444
gi|2765568|emb|Z78443.1|PLZ78443
gi|2765567|emb|Z78442.1|PBZ78442
gi|2765566|emb|Z78441.1|PSZ78441
gi|2765565|emb|Z78440.1|PPZ78440
gi|2765564|emb|Z78439.1|PBZ78439
```
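To make the record-by-record iteration less of a black box, here is a minimal pure-Python sketch of what a FASTA reader does conceptually. It is not Biopython's implementation: `iter_fasta` is a hypothetical helper, the two sample records are made up for illustration, and the sketch skips the validation a real parser performs.

```python
import io


def iter_fasta(handle):
    """Yield (header, sequence) pairs from FASTA-formatted text, one record at a time."""
    header = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            # Sequence data may be wrapped over several lines.
            chunks.append(line)
    # Emit the final record once the input is exhausted.
    if header is not None:
        yield header, "".join(chunks)


# Made-up sample data, not taken from orchids.fasta.
sample = io.StringIO(">seq1 first toy record\nACGT\nACGT\n>seq2 second toy record\nGGCC\n")
parsed = list(iter_fasta(sample))
for header, sequence in parsed:
    print(header, len(sequence))
```

Because the function is a generator, only one record is held in memory at a time, which is the same property that makes looping over `SeqIO.parse()` suitable for large files.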
* **`SeqIO.parse()`** reads sequences from a file.
* The first argument is the filename (an open file handle also works).
* The second argument (`"fasta"`) tells Biopython the file format.
* It returns an iterator that yields one **SeqRecord** object per sequence; the loop stores each one in `record`.

A `SeqRecord` contains useful information such as the sequence ID, description, and the sequence itself.

---

## Accessing Sequence Information

Each FASTA entry contains several pieces of information. Let's explore them.
```python
from Bio import SeqIO

for record in SeqIO.parse("orchids.fasta", "fasta"):
    print("ID:", record.id)
    print("Description:", record.description)
    print("Sequence length:", len(record.seq))
    print("First 20 bases:", record.seq[:20])
    print()
```

```
ID: gi|2765658|emb|Z78533.1|CIZ78533
Description: gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 740
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765657|emb|Z78532.1|CCZ78532
Description: gi|2765657|emb|Z78532.1|CCZ78532 C.californicum 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 753
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765656|emb|Z78531.1|CFZ78531
Description: gi|2765656|emb|Z78531.1|CFZ78531 C.fasciculatum 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 748
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765655|emb|Z78530.1|CMZ78530
Description: gi|2765655|emb|Z78530.1|CMZ78530 C.margaritaceum 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 744
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765654|emb|Z78529.1|CLZ78529
Description: gi|2765654|emb|Z78529.1|CLZ78529 C.lichiangense 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 733
First 20 bases: ACGGCGAGCTGCCGAAGGAC

ID: gi|2765652|emb|Z78527.1|CYZ78527
Description: gi|2765652|emb|Z78527.1|CYZ78527 C.yatabeanum 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 718
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765651|emb|Z78526.1|CGZ78526
Description: gi|2765651|emb|Z78526.1|CGZ78526 C.guttatum 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 730
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765650|emb|Z78525.1|CAZ78525
Description: gi|2765650|emb|Z78525.1|CAZ78525 C.acaule 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 704
First 20 bases: TGTTGAGATAGCAGAATATA

ID: gi|2765649|emb|Z78524.1|CFZ78524
Description: gi|2765649|emb|Z78524.1|CFZ78524 C.formosanum 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 740
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765648|emb|Z78523.1|CHZ78523
Description: gi|2765648|emb|Z78523.1|CHZ78523 C.himalaicum 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 709
First 20 bases: CGTAACCAGGTTTCCGTAGG

ID: gi|2765647|emb|Z78522.1|CMZ78522
Description: gi|2765647|emb|Z78522.1|CMZ78522 C.macranthum 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 700
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765646|emb|Z78521.1|CCZ78521
Description: gi|2765646|emb|Z78521.1|CCZ78521 C.calceolus 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 726
First 20 bases: GTAGGTGAACCTGCGGAAGG

ID: gi|2765645|emb|Z78520.1|CSZ78520
Description: gi|2765645|emb|Z78520.1|CSZ78520 C.segawai 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 753
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765644|emb|Z78519.1|CPZ78519
Description: gi|2765644|emb|Z78519.1|CPZ78519 C.pubescens 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 699
First 20 bases: ATATGATCGAGTGAATCTGG

ID: gi|2765643|emb|Z78518.1|CRZ78518
Description: gi|2765643|emb|Z78518.1|CRZ78518 C.reginae 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 658
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765642|emb|Z78517.1|CFZ78517
Description: gi|2765642|emb|Z78517.1|CFZ78517 C.flavum 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 752
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765641|emb|Z78516.1|CPZ78516
Description: gi|2765641|emb|Z78516.1|CPZ78516 C.passerinum 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 726
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765640|emb|Z78515.1|MXZ78515
Description: gi|2765640|emb|Z78515.1|MXZ78515 M.xerophyticum 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 765
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765639|emb|Z78514.1|PSZ78514
Description: gi|2765639|emb|Z78514.1|PSZ78514 P.schlimii 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 755
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765638|emb|Z78513.1|PBZ78513
Description: gi|2765638|emb|Z78513.1|PBZ78513 P.besseae 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 742
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765637|emb|Z78512.1|PWZ78512
Description: gi|2765637|emb|Z78512.1|PWZ78512 P.wallisii 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 762
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765636|emb|Z78511.1|PEZ78511
Description: gi|2765636|emb|Z78511.1|PEZ78511 P.exstaminodium 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 745
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765635|emb|Z78510.1|PCZ78510
Description: gi|2765635|emb|Z78510.1|PCZ78510 P.caricinum 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 750
First 20 bases: CTAACCAGGGTTCCGAGGTG

ID: gi|2765634|emb|Z78509.1|PPZ78509
Description: gi|2765634|emb|Z78509.1|PPZ78509 P.pearcei 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 731
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765633|emb|Z78508.1|PLZ78508
Description: gi|2765633|emb|Z78508.1|PLZ78508 P.longifolium 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 741
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765632|emb|Z78507.1|PLZ78507
Description: gi|2765632|emb|Z78507.1|PLZ78507 P.lindenii 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 740
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765631|emb|Z78506.1|PLZ78506
Description: gi|2765631|emb|Z78506.1|PLZ78506 P.lindleyanum 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 727
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765630|emb|Z78505.1|PSZ78505
Description: gi|2765630|emb|Z78505.1|PSZ78505 P.sargentianum 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 711
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765629|emb|Z78504.1|PKZ78504
Description: gi|2765629|emb|Z78504.1|PKZ78504 P.kaiteurum 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 743
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765628|emb|Z78503.1|PCZ78503
Description: gi|2765628|emb|Z78503.1|PCZ78503 P.czerwiakowianum 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 727
First 20 bases: CGTAACCAGGTTTCCGTAGG

ID: gi|2765627|emb|Z78502.1|PBZ78502
Description: gi|2765627|emb|Z78502.1|PBZ78502 P.boissierianum 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 757
First 20 bases: CGTAACCAGGTTTCCGTAGG

ID: gi|2765626|emb|Z78501.1|PCZ78501
Description: gi|2765626|emb|Z78501.1|PCZ78501 P.caudatum 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 770
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765625|emb|Z78500.1|PWZ78500
Description: gi|2765625|emb|Z78500.1|PWZ78500 P.warszewiczianum 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 767
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765624|emb|Z78499.1|PMZ78499
Description: gi|2765624|emb|Z78499.1|PMZ78499 P.micranthum 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 759
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765623|emb|Z78498.1|PMZ78498
Description: gi|2765623|emb|Z78498.1|PMZ78498 P.malipoense 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 750
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765622|emb|Z78497.1|PDZ78497
Description: gi|2765622|emb|Z78497.1|PDZ78497 P.delenatii 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 788
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765621|emb|Z78496.1|PAZ78496
Description: gi|2765621|emb|Z78496.1|PAZ78496 P.armeniacum 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 774
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765620|emb|Z78495.1|PEZ78495
Description: gi|2765620|emb|Z78495.1|PEZ78495 P.emersonii 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 789
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765619|emb|Z78494.1|PNZ78494
Description: gi|2765619|emb|Z78494.1|PNZ78494 P.niveum 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 688
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765618|emb|Z78493.1|PGZ78493
Description: gi|2765618|emb|Z78493.1|PGZ78493 P.godefroyae 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 719
First 20 bases: CGTAACAAGGATTCCGTAGG

ID: gi|2765617|emb|Z78492.1|PBZ78492
Description: gi|2765617|emb|Z78492.1|PBZ78492 P.bellatulum 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 743
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765616|emb|Z78491.1|PCZ78491
Description: gi|2765616|emb|Z78491.1|PCZ78491 P.concolor 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 737
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765615|emb|Z78490.1|PFZ78490
Description: gi|2765615|emb|Z78490.1|PFZ78490 P.fairrieanum 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 728
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765614|emb|Z78489.1|PDZ78489
Description: gi|2765614|emb|Z78489.1|PDZ78489 P.druryi 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 740
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765613|emb|Z78488.1|PTZ78488
Description: gi|2765613|emb|Z78488.1|PTZ78488 P.tigrinum 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 696
First 20 bases: CTGTAGGTGAACCTGCGGAA

ID: gi|2765612|emb|Z78487.1|PHZ78487
Description: gi|2765612|emb|Z78487.1|PHZ78487 P.hirsutissimum 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 732
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765611|emb|Z78486.1|PBZ78486
Description: gi|2765611|emb|Z78486.1|PBZ78486 P.barbigerum 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 731
First 20 bases: CGTCACGAGGTTTCCGTAGG

ID: gi|2765610|emb|Z78485.1|PHZ78485
Description: gi|2765610|emb|Z78485.1|PHZ78485 P.henryanum 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 735
First 20 bases: CTGAACCTGGTGTCCGAAGG

ID: gi|2765609|emb|Z78484.1|PCZ78484
Description: gi|2765609|emb|Z78484.1|PCZ78484 P.charlesworthii 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 720
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765608|emb|Z78483.1|PVZ78483
Description: gi|2765608|emb|Z78483.1|PVZ78483 P.villosum 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 740
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765607|emb|Z78482.1|PEZ78482
Description: gi|2765607|emb|Z78482.1|PEZ78482 P.exul 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 629
First 20 bases: TCTACTGCAGTGACCGAGAT

ID: gi|2765606|emb|Z78481.1|PIZ78481
Description: gi|2765606|emb|Z78481.1|PIZ78481 P.insigne 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 572
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765605|emb|Z78480.1|PGZ78480
Description: gi|2765605|emb|Z78480.1|PGZ78480 P.gratrixianum 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 587
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765604|emb|Z78479.1|PPZ78479
Description: gi|2765604|emb|Z78479.1|PPZ78479 P.primulinum 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 700
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765603|emb|Z78478.1|PVZ78478
Description: gi|2765603|emb|Z78478.1|PVZ78478 P.victoria 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 636
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765602|emb|Z78477.1|PVZ78477
Description: gi|2765602|emb|Z78477.1|PVZ78477 P.victoria 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 716
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765601|emb|Z78476.1|PGZ78476
Description: gi|2765601|emb|Z78476.1|PGZ78476 P.glaucophyllum 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 592
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765600|emb|Z78475.1|PSZ78475
Description: gi|2765600|emb|Z78475.1|PSZ78475 P.supardii 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 716
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765599|emb|Z78474.1|PKZ78474
Description: gi|2765599|emb|Z78474.1|PKZ78474 P.kolopakingii 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 733
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765598|emb|Z78473.1|PSZ78473
Description: gi|2765598|emb|Z78473.1|PSZ78473 P.sanderianum 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 626
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765597|emb|Z78472.1|PLZ78472
Description: gi|2765597|emb|Z78472.1|PLZ78472 P.lowii 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 737
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765596|emb|Z78471.1|PDZ78471
Description: gi|2765596|emb|Z78471.1|PDZ78471 P.dianthum 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 740
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765595|emb|Z78470.1|PPZ78470
Description: gi|2765595|emb|Z78470.1|PPZ78470 P.parishii 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 574
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765594|emb|Z78469.1|PHZ78469
Description: gi|2765594|emb|Z78469.1|PHZ78469 P.haynaldianum 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 594
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765593|emb|Z78468.1|PAZ78468
Description: gi|2765593|emb|Z78468.1|PAZ78468 P.adductum 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 610
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765592|emb|Z78467.1|PSZ78467
Description: gi|2765592|emb|Z78467.1|PSZ78467 P.stonei 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 730
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765591|emb|Z78466.1|PPZ78466
Description: gi|2765591|emb|Z78466.1|PPZ78466 P.philippinense 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 641
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765590|emb|Z78465.1|PRZ78465
Description: gi|2765590|emb|Z78465.1|PRZ78465 P.rothschildianum 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 702
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765589|emb|Z78464.1|PGZ78464
Description: gi|2765589|emb|Z78464.1|PGZ78464 P.glanduliferum 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 733
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765588|emb|Z78463.1|PGZ78463
Description: gi|2765588|emb|Z78463.1|PGZ78463 P.glanduliferum 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 738
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765587|emb|Z78462.1|PSZ78462
Description: gi|2765587|emb|Z78462.1|PSZ78462 P.sukhakulii 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 736
First 20 bases: CGTCACGAGGTCTCCGGATG

ID: gi|2765586|emb|Z78461.1|PWZ78461
Description: gi|2765586|emb|Z78461.1|PWZ78461 P.wardii 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 732
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765585|emb|Z78460.1|PCZ78460
Description: gi|2765585|emb|Z78460.1|PCZ78460 P.ciliolare 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 745
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765584|emb|Z78459.1|PDZ78459
Description: gi|2765584|emb|Z78459.1|PDZ78459 P.dayanum 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 744
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765583|emb|Z78458.1|PHZ78458
Description: gi|2765583|emb|Z78458.1|PHZ78458 P.hennisianum 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 738
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765582|emb|Z78457.1|PCZ78457
Description: gi|2765582|emb|Z78457.1|PCZ78457 P.callosum 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 739
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765581|emb|Z78456.1|PTZ78456
Description: gi|2765581|emb|Z78456.1|PTZ78456 P.tonsum 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 740
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765580|emb|Z78455.1|PJZ78455
Description: gi|2765580|emb|Z78455.1|PJZ78455 P.javanicum 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 745
First 20 bases: CGTAACCAGGTTTCCGTAGG

ID: gi|2765579|emb|Z78454.1|PFZ78454
Description: gi|2765579|emb|Z78454.1|PFZ78454 P.fowliei 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 695
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765578|emb|Z78453.1|PSZ78453
Description: gi|2765578|emb|Z78453.1|PSZ78453 P.schoseri 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 745
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765577|emb|Z78452.1|PBZ78452
Description: gi|2765577|emb|Z78452.1|PBZ78452 P.bougainvilleanum 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 743
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765576|emb|Z78451.1|PHZ78451
Description: gi|2765576|emb|Z78451.1|PHZ78451 P.hookerae 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 730
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765575|emb|Z78450.1|PPZ78450
Description: gi|2765575|emb|Z78450.1|PPZ78450 P.papuanum 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 706
First 20 bases: GGAAGGATCATTGCTGATAT

ID: gi|2765574|emb|Z78449.1|PMZ78449
Description: gi|2765574|emb|Z78449.1|PMZ78449 P.mastersianum 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 744
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765573|emb|Z78448.1|PAZ78448
Description: gi|2765573|emb|Z78448.1|PAZ78448 P.argus 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 742
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765572|emb|Z78447.1|PVZ78447
Description: gi|2765572|emb|Z78447.1|PVZ78447 P.venustum 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 694
First 20 bases: CGTAACAAGGATTCCGTAGG

ID: gi|2765571|emb|Z78446.1|PAZ78446
Description: gi|2765571|emb|Z78446.1|PAZ78446 P.acmodontum 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 712
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765570|emb|Z78445.1|PUZ78445
Description: gi|2765570|emb|Z78445.1|PUZ78445 P.urbanianum 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 715
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765569|emb|Z78444.1|PAZ78444
Description: gi|2765569|emb|Z78444.1|PAZ78444 P.appletonianum 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 688
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765568|emb|Z78443.1|PLZ78443
Description: gi|2765568|emb|Z78443.1|PLZ78443 P.lawrenceanum 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 784
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765567|emb|Z78442.1|PBZ78442
Description: gi|2765567|emb|Z78442.1|PBZ78442 P.bullenianum 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 721
First 20 bases: GTAGGTGAACCTGCGGAAGG

ID: gi|2765566|emb|Z78441.1|PSZ78441
Description: gi|2765566|emb|Z78441.1|PSZ78441 P.superbiens 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 703
First 20 bases: GGAAGGTCATTGCCGATATC

ID: gi|2765565|emb|Z78440.1|PPZ78440
Description: gi|2765565|emb|Z78440.1|PPZ78440 P.purpuratum 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 744
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765564|emb|Z78439.1|PBZ78439
Description: gi|2765564|emb|Z78439.1|PBZ78439 P.barbatum 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 592
First 20 bases: CATTGTTGAGATCACATAAT
```
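The relationship between a header line, `record.id`, and `record.description` in the output above can be sketched in plain Python: the ID is the first whitespace-delimited word of the header, and the description is the whole header without the leading `>`. `split_header` is a hypothetical helper written for illustration, not a Biopython function, and the header string is the one implied by the first record in the output.

```python
def split_header(header_line):
    """Return (id, description) for a FASTA header line such as '>id some text'."""
    description = header_line.lstrip(">").strip()
    # The ID is everything up to the first run of whitespace.
    parts = description.split(None, 1)
    seq_id = parts[0] if parts else ""
    return seq_id, description


# Header implied by the first record printed above.
header = ">gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA"
seq_id, description = split_header(header)
print("ID:", seq_id)
print("Description:", description)
```

This is why every `Description:` line in the output starts by repeating the ID.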
* **`record.id`**: The sequence identifier (the first word of the header line).
* **`record.description`**: The full FASTA header line, without the leading `>`.
* **`record.seq`**: The biological sequence.
* **`len(record.seq)`**: The length of the sequence.

The sequence itself is stored as a **Seq object**, which behaves much like a Python string but includes additional biological functionality.

---

## Counting the Number of Sequences

Sometimes you just want to know how many sequences are in a FASTA file.
```python
from Bio import SeqIO

count = 0
for record in SeqIO.parse("orchids.fasta", "fasta"):
    count += 1

print("Number of sequences:", count)
```

```
Number of sequences: 94
```
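If only the count matters, a generator expression with `sum()` can replace the explicit counter; with Biopython that would be `sum(1 for _ in SeqIO.parse("orchids.fasta", "fasta"))`. The sketch below shows the same idiom without Biopython by counting header lines in made-up in-memory text. `count_fasta_records` is a hypothetical helper, and it assumes every record starts with a `>` line.

```python
import io


def count_fasta_records(handle):
    """Count records by counting header lines, reading one line at a time."""
    return sum(1 for line in handle if line.startswith(">"))


# Made-up three-record sample standing in for a real file handle.
sample = io.StringIO(">a\nACGT\n>b\nGG\nCC\n>c\nTTTT\n")
n_records = count_fasta_records(sample)
print("Number of sequences:", n_records)
```

Counting header lines skips parsing entirely, so it is a quick sanity check rather than a replacement for `SeqIO.parse()` when you also need the sequences.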
The loop visits each record in turn and increments a counter. FASTA files can contain thousands or even millions of sequences, and because `SeqIO.parse()` yields one record at a time, counting this way avoids loading everything into memory at once.

---

## Converting FASTA Records to a List

If the FASTA file is small, you may prefer to load all sequences into a list.
```python
from Bio import SeqIO

records = list(SeqIO.parse("orchids.fasta", "fasta"))

print("Total sequences:", len(records))
print("First sequence ID:", records[0].id)
print("Sequence length:", len(records[0].seq))
```

```
Total sequences: 94
First sequence ID: gi|2765658|emb|Z78533.1|CIZ78533
Sequence length: 740
```
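One practical reason to build a list is that the iterator returned by `SeqIO.parse()` can be consumed only once, like any Python iterator. The sketch below demonstrates that behavior with a plain generator standing in for the parser; `fake_parse` and its record names are made up for illustration.

```python
def fake_parse():
    """Stand-in for a record iterator; yields made-up record IDs."""
    for name in ("rec1", "rec2", "rec3"):
        yield name


iterator = fake_parse()
first_pass = list(iterator)   # consumes the iterator
second_pass = list(iterator)  # nothing is left to read

records = first_pass          # a list can be indexed and re-read freely
print(len(first_pass), len(second_pass), records[0], records[-1])
```

If you need to loop over the records a second time without keeping them in memory, call `SeqIO.parse()` again to get a fresh iterator.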
* `list()` converts the iterator returned by `SeqIO.parse()` into a list.
* This allows random access, such as `records[0]` or `records[5]`.

Be careful with very large FASTA files, as loading everything into memory can consume a lot of RAM.

---

## Calculating GC Content

A common task in sequence analysis is calculating the **GC content**, which is the percentage of nucleotides that are G or C.
```python
from Bio import SeqIO

# Calculate GC content for each sequence
for record in SeqIO.parse("orchids.fasta", "fasta"):
    seq = record.seq.upper()
    g = seq.count("G")
    c = seq.count("C")
    gc_content = (g + c) / len(seq) * 100
    print(record.id, "GC%:", round(gc_content, 2))
```

```
gi|2765658|emb|Z78533.1|CIZ78533 GC%: 59.59
gi|2765657|emb|Z78532.1|CCZ78532 GC%: 48.47
gi|2765656|emb|Z78531.1|CFZ78531 GC%: 57.09
gi|2765655|emb|Z78530.1|CMZ78530 GC%: 47.58
gi|2765654|emb|Z78529.1|CLZ78529 GC%: 47.89
gi|2765652|emb|Z78527.1|CYZ78527 GC%: 50.7
gi|2765651|emb|Z78526.1|CGZ78526 GC%: 50.41
gi|2765650|emb|Z78525.1|CAZ78525 GC%: 50.43
gi|2765649|emb|Z78524.1|CFZ78524 GC%: 47.7
gi|2765648|emb|Z78523.1|CHZ78523 GC%: 50.35
gi|2765647|emb|Z78522.1|CMZ78522 GC%: 49.86
gi|2765646|emb|Z78521.1|CCZ78521 GC%: 49.04
gi|2765645|emb|Z78520.1|CSZ78520 GC%: 49.54
gi|2765644|emb|Z78519.1|CPZ78519 GC%: 49.07
gi|2765643|emb|Z78518.1|CRZ78518 GC%: 51.52
gi|2765642|emb|Z78517.1|CFZ78517 GC%: 49.73
gi|2765641|emb|Z78516.1|CPZ78516 GC%: 49.17
gi|2765640|emb|Z78515.1|MXZ78515 GC%: 53.73
gi|2765639|emb|Z78514.1|PSZ78514 GC%: 56.03
gi|2765638|emb|Z78513.1|PBZ78513 GC%: 55.93
gi|2765637|emb|Z78512.1|PWZ78512 GC%: 56.17
gi|2765636|emb|Z78511.1|PEZ78511 GC%: 56.24
gi|2765635|emb|Z78510.1|PCZ78510 GC%: 57.07
gi|2765634|emb|Z78509.1|PPZ78509 GC%: 55.54
gi|2765633|emb|Z78508.1|PLZ78508 GC%: 56.82
gi|2765632|emb|Z78507.1|PLZ78507 GC%: 56.35
gi|2765631|emb|Z78506.1|PLZ78506 GC%: 55.98
gi|2765630|emb|Z78505.1|PSZ78505 GC%: 55.7
gi|2765629|emb|Z78504.1|PKZ78504 GC%: 55.18
gi|2765628|emb|Z78503.1|PCZ78503 GC%: 56.26
gi|2765627|emb|Z78502.1|PBZ78502 GC%: 56.41
gi|2765626|emb|Z78501.1|PCZ78501 GC%: 56.49
gi|2765625|emb|Z78500.1|PWZ78500 GC%: 57.24
gi|2765624|emb|Z78499.1|PMZ78499 GC%: 51.52
gi|2765623|emb|Z78498.1|PMZ78498 GC%: 51.33
gi|2765622|emb|Z78497.1|PDZ78497 GC%: 52.54
gi|2765621|emb|Z78496.1|PAZ78496 GC%: 51.81
gi|2765620|emb|Z78495.1|PEZ78495 GC%: 53.11
gi|2765619|emb|Z78494.1|PNZ78494 GC%: 50.0
gi|2765618|emb|Z78493.1|PGZ78493 GC%: 51.18
gi|2765617|emb|Z78492.1|PBZ78492 GC%: 50.61
gi|2765616|emb|Z78491.1|PCZ78491 GC%: 50.34
gi|2765615|emb|Z78490.1|PFZ78490 GC%: 51.37
gi|2765614|emb|Z78489.1|PDZ78489 GC%: 51.22
gi|2765613|emb|Z78488.1|PTZ78488 GC%: 51.44
gi|2765612|emb|Z78487.1|PHZ78487 GC%: 51.09
gi|2765611|emb|Z78486.1|PBZ78486 GC%: 51.03
gi|2765610|emb|Z78485.1|PHZ78485 GC%: 50.75
gi|2765609|emb|Z78484.1|PCZ78484 GC%: 50.83
gi|2765608|emb|Z78483.1|PVZ78483 GC%: 49.86
gi|2765607|emb|Z78482.1|PEZ78482 GC%: 52.15
gi|2765606|emb|Z78481.1|PIZ78481 GC%: 50.17
gi|2765605|emb|Z78480.1|PGZ78480 GC%: 50.09
gi|2765604|emb|Z78479.1|PPZ78479 GC%: 50.86
gi|2765603|emb|Z78478.1|PVZ78478 GC%: 51.1
gi|2765602|emb|Z78477.1|PVZ78477 GC%: 51.54
gi|2765601|emb|Z78476.1|PGZ78476 GC%: 50.51
gi|2765600|emb|Z78475.1|PSZ78475 GC%: 43.3
gi|2765599|emb|Z78474.1|PKZ78474 GC%: 50.75
gi|2765598|emb|Z78473.1|PSZ78473 GC%: 50.96
gi|2765597|emb|Z78472.1|PLZ78472 GC%: 49.39
gi|2765596|emb|Z78471.1|PDZ78471 GC%: 50.0
gi|2765595|emb|Z78470.1|PPZ78470 GC%: 48.78
gi|2765594|emb|Z78469.1|PHZ78469 GC%: 49.16
gi|2765593|emb|Z78468.1|PAZ78468 GC%: 50.0
gi|2765592|emb|Z78467.1|PSZ78467 GC%: 50.82
gi|2765591|emb|Z78466.1|PPZ78466 GC%: 50.86
gi|2765590|emb|Z78465.1|PRZ78465 GC%: 49.43
gi|2765589|emb|Z78464.1|PGZ78464 GC%: 50.34
gi|2765588|emb|Z78463.1|PGZ78463 GC%: 51.22
gi|2765587|emb|Z78462.1|PSZ78462 GC%: 32.34
gi|2765586|emb|Z78461.1|PWZ78461 GC%: 50.41
gi|2765585|emb|Z78460.1|PCZ78460 GC%: 50.74
gi|2765584|emb|Z78459.1|PDZ78459 GC%: 50.4
gi|2765583|emb|Z78458.1|PHZ78458 GC%: 51.08
gi|2765582|emb|Z78457.1|PCZ78457 GC%: 50.47
gi|2765581|emb|Z78456.1|PTZ78456 GC%: 50.54
gi|2765580|emb|Z78455.1|PJZ78455 GC%: 49.26
gi|2765579|emb|Z78454.1|PFZ78454 GC%: 50.65
gi|2765578|emb|Z78453.1|PSZ78453 GC%: 50.47
gi|2765577|emb|Z78452.1|PBZ78452 GC%: 49.39
gi|2765576|emb|Z78451.1|PHZ78451 GC%: 50.82
gi|2765575|emb|Z78450.1|PPZ78450 GC%: 50.57
gi|2765574|emb|Z78449.1|PMZ78449 GC%: 50.0
gi|2765573|emb|Z78448.1|PAZ78448 GC%: 50.4
gi|2765572|emb|Z78447.1|PVZ78447 GC%: 51.44
gi|2765571|emb|Z78446.1|PAZ78446 GC%: 50.14
gi|2765570|emb|Z78445.1|PUZ78445 GC%: 50.77
gi|2765569|emb|Z78444.1|PAZ78444 GC%: 48.69
gi|2765568|emb|Z78443.1|PLZ78443 GC%: 39.67
gi|2765567|emb|Z78442.1|PBZ78442 GC%: 50.76
gi|2765566|emb|Z78441.1|PSZ78441 GC%: 50.92
gi|2765565|emb|Z78440.1|PPZ78440 GC%: 49.87
gi|2765564|emb|Z78439.1|PBZ78439 GC%: 50.0
```
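The same calculation can be wrapped in a small function that also guards against empty sequences, which would otherwise raise `ZeroDivisionError`. This is a sketch on plain strings with made-up inputs: `gc_percent` is a hypothetical helper, and it counts only unambiguous `G` and `C`, so ambiguity codes such as `S` or `N` are treated as non-GC (a simplifying assumption).

```python
def gc_percent(sequence):
    """Return GC content as a percentage of sequence length (0.0 for an empty sequence)."""
    seq = str(sequence).upper()  # accept lowercase input and string-like objects
    if not seq:
        return 0.0
    gc = seq.count("G") + seq.count("C")
    return gc / len(seq) * 100


# Made-up test sequences: all GC, mixed lowercase, no GC, and empty.
gc_values = [round(gc_percent(s), 2) for s in ("GGCC", "atgc", "ATAT", "")]
print(gc_values)
```

Inside the Biopython loop the call would be `gc_percent(record.seq)`, since a `Seq` object converts cleanly with `str()`.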
The loop over `SeqIO.parse()` does three things:

1. Reads each sequence
2. Counts the number of **G** and **C** bases
3. Calculates the percentage of GC content

Calling `.upper()` first ensures that lowercase bases are counted too.

GC content is important in many areas of genomics because it can influence gene expression, sequencing behavior, and genome stability.

---

## Writing a New FASTA File

Biopython can also write FASTA files using `SeqIO.write()`.
```python
from Bio import SeqIO

# Filter sequences longer than 600 bases
records = []
for record in SeqIO.parse("orchids.fasta", "fasta"):
    if len(record.seq) > 600:
        records.append(record)

# Write filtered sequences to a new FASTA file
SeqIO.write(records, "long_sequences.fasta", "fasta")
```

```
88
```
* `SeqIO.write()` writes sequence records to a file and returns the number of records written (88 here).
* The first argument is a list (or any other iterable) of records.
* The second argument is the output filename.
* The third argument is the file format.

This example filters the sequences to keep only those longer than 600 bases and writes them into a new FASTA file.

---

## Creating FASTA Records Manually

You can also create new FASTA sequences programmatically.
```python
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio import SeqIO

record1 = SeqRecord(
    Seq("ATGCGTACGTAGCTAGCTAG"),
    id="Example1",
    description="Example DNA sequence"
)

record2 = SeqRecord(
    Seq("ATGGGCTAGCTAGGCTA"),
    id="Example2",
    description="Another DNA sequence"
)

records = [record1, record2]

SeqIO.write(records, "example_sequences.fasta", "fasta")
```

```
2
```
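To see roughly what ends up in `example_sequences.fasta`, the sketch below formats the same two records by hand: a `>` header made of the ID and description, followed by the sequence wrapped to a fixed width. `format_fasta` is a hypothetical helper, not Biopython's writer, and the 60-character default width is an assumption chosen to match the line length commonly seen in FASTA files.

```python
import textwrap


def format_fasta(seq_id, description, sequence, width=60):
    """Return one FASTA record as text: a header line, then wrapped sequence lines."""
    header = f">{seq_id} {description}".rstrip()
    # textwrap.wrap splits a long unbroken string into chunks of at most `width` characters.
    body = "\n".join(textwrap.wrap(sequence, width))
    return f"{header}\n{body}\n"


text = format_fasta("Example1", "Example DNA sequence", "ATGCGTACGTAGCTAGCTAG")
text += format_fasta("Example2", "Another DNA sequence", "ATGGGCTAGCTAGGCTA")
print(text, end="")
```

Both sequences are shorter than the wrap width, so each record occupies exactly two lines: the header and a single sequence line.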
* **`Seq`** represents the biological sequence.
* **`SeqRecord`** stores sequence metadata like ID and description.
* The records are written to a FASTA file using `SeqIO.write()`.

This is useful when generating sequences from simulations, analyses, or custom pipelines.

---

## Conclusion

FASTA files are fundamental to bioinformatics, and **Biopython** makes them easy to work with in Python. Using the `SeqIO` module, you can efficiently read, analyze, and write sequence data.

In this tutorial, you learned how to:

* Parse FASTA files with `SeqIO.parse()`
* Access sequence IDs, descriptions, and sequences
* Count and analyze sequences
* Calculate GC content
* Write new FASTA files
* Create FASTA records programmatically

These skills form the foundation for many real-world bioinformatics workflows, including genome analysis, sequence filtering, and building data processing pipelines.