# Working with FASTA Files in Biopython

FASTA is one of the most common file formats in bioinformatics. A FASTA file stores DNA, RNA, or protein sequences as plain text: each record is a header line beginning with `>`, followed by one or more lines of sequence. The format is widely used in genomic databases, sequence analysis pipelines, and research workflows.

If you're studying biology, bioinformatics, or computational biology, you'll almost certainly encounter FASTA files. Fortunately, the **Biopython** library provides convenient tools for reading, parsing, and writing FASTA data in Python.

In this tutorial, you'll learn how to:

* Read sequences from a FASTA file
* Access sequence IDs and descriptions
* Iterate through multiple sequences
* Calculate sequence statistics
* Write new FASTA files

We'll use the `Bio.SeqIO` module, which is designed for reading and writing biological sequence file formats.

---

## Downloading an Example FASTA File

Before working with FASTA files, let's download a small example file from the web.

```python
import requests

# Download an example FASTA file
url = "https://raw.githubusercontent.com/biopython/biopython/master/Doc/examples/ls_orchid.fasta"
response = requests.get(url, timeout=30)
response.raise_for_status()  # stop here if the download failed

# Save the downloaded text to a local file
with open("orchids.fasta", "w") as f:
    f.write(response.text)

print("FASTA file downloaded.")
```

```
FASTA file downloaded.
```

This code downloads a publicly available FASTA file from the Biopython repository and saves it locally as `orchids.fasta`. This file contains multiple DNA sequences that we will use in the examples throughout this tutorial.

---

## Reading a FASTA File

The `SeqIO.parse()` function is the most common way to read FASTA files in Biopython.

```python
from Bio import SeqIO

# Parse the FASTA file
for record in SeqIO.parse("orchids.fasta", "fasta"):
    print(record.id)
```

gi|2765658|emb|Z78533.1|CIZ78533
gi|2765657|emb|Z78532.1|CCZ78532
gi|2765656|emb|Z78531.1|CFZ78531
gi|2765655|emb|Z78530.1|CMZ78530
gi|2765654|emb|Z78529.1|CLZ78529
gi|2765652|emb|Z78527.1|CYZ78527
gi|2765651|emb|Z78526.1|CGZ78526
gi|2765650|emb|Z78525.1|CAZ78525
gi|2765649|emb|Z78524.1|CFZ78524
gi|2765648|emb|Z78523.1|CHZ78523
gi|2765647|emb|Z78522.1|CMZ78522
gi|2765646|emb|Z78521.1|CCZ78521
gi|2765645|emb|Z78520.1|CSZ78520
gi|2765644|emb|Z78519.1|CPZ78519
gi|2765643|emb|Z78518.1|CRZ78518
gi|2765642|emb|Z78517.1|CFZ78517
gi|2765641|emb|Z78516.1|CPZ78516
gi|2765640|emb|Z78515.1|MXZ78515
gi|2765639|emb|Z78514.1|PSZ78514
gi|2765638|emb|Z78513.1|PBZ78513
gi|2765637|emb|Z78512.1|PWZ78512
gi|2765636|emb|Z78511.1|PEZ78511
gi|2765635|emb|Z78510.1|PCZ78510
gi|2765634|emb|Z78509.1|PPZ78509
gi|2765633|emb|Z78508.1|PLZ78508
gi|2765632|emb|Z78507.1|PLZ78507
gi|2765631|emb|Z78506.1|PLZ78506
gi|2765630|emb|Z78505.1|PSZ78505
gi|2765629|emb|Z78504.1|PKZ78504
gi|2765628|emb|Z78503.1|PCZ78503
gi|2765627|emb|Z78502.1|PBZ78502
gi|2765626|emb|Z78501.1|PCZ78501
gi|2765625|emb|Z78500.1|PWZ78500
gi|2765624|emb|Z78499.1|PMZ78499
gi|2765623|emb|Z78498.1|PMZ78498
gi|2765622|emb|Z78497.1|PDZ78497
gi|2765621|emb|Z78496.1|PAZ78496
gi|2765620|emb|Z78495.1|PEZ78495
gi|2765619|emb|Z78494.1|PNZ78494
gi|2765618|emb|Z78493.1|PGZ78493
gi|2765617|emb|Z78492.1|PBZ78492
gi|2765616|emb|Z78491.1|PCZ78491
gi|2765615|emb|Z78490.1|PFZ78490
gi|2765614|emb|Z78489.1|PDZ78489
gi|2765613|emb|Z78488.1|PTZ78488
gi|2765612|emb|Z78487.1|PHZ78487
gi|2765611|emb|Z78486.1|PBZ78486
gi|2765610|emb|Z78485.1|PHZ78485
gi|2765609|emb|Z78484.1|PCZ78484
gi|2765608|emb|Z78483.1|PVZ78483
gi|2765607|emb|Z78482.1|PEZ78482
gi|2765606|emb|Z78481.1|PIZ78481
gi|2765605|emb|Z78480.1|PGZ78480
gi|2765604|emb|Z78479.1|PPZ78479
gi|2765603|emb|Z78478.1|PVZ78478
gi|2765602|emb|Z78477.1|PVZ78477
gi|2765601|emb|Z78476.1|PGZ78476
gi|2765600|emb|Z78475.1|PSZ78475
gi|2765599|emb|Z78474.1|PKZ78474
gi|2765598|emb|Z78473.1|PSZ78473
gi|2765597|emb|Z78472.1|PLZ78472
gi|2765596|emb|Z78471.1|PDZ78471
gi|2765595|emb|Z78470.1|PPZ78470
gi|2765594|emb|Z78469.1|PHZ78469
gi|2765593|emb|Z78468.1|PAZ78468
gi|2765592|emb|Z78467.1|PSZ78467
gi|2765591|emb|Z78466.1|PPZ78466
gi|2765590|emb|Z78465.1|PRZ78465
gi|2765589|emb|Z78464.1|PGZ78464
gi|2765588|emb|Z78463.1|PGZ78463
gi|2765587|emb|Z78462.1|PSZ78462
gi|2765586|emb|Z78461.1|PWZ78461
gi|2765585|emb|Z78460.1|PCZ78460
gi|2765584|emb|Z78459.1|PDZ78459
gi|2765583|emb|Z78458.1|PHZ78458
gi|2765582|emb|Z78457.1|PCZ78457
gi|2765581|emb|Z78456.1|PTZ78456
gi|2765580|emb|Z78455.1|PJZ78455
gi|2765579|emb|Z78454.1|PFZ78454
gi|2765578|emb|Z78453.1|PSZ78453
gi|2765577|emb|Z78452.1|PBZ78452
gi|2765576|emb|Z78451.1|PHZ78451
gi|2765575|emb|Z78450.1|PPZ78450
gi|2765574|emb|Z78449.1|PMZ78449
gi|2765573|emb|Z78448.1|PAZ78448
gi|2765572|emb|Z78447.1|PVZ78447
gi|2765571|emb|Z78446.1|PAZ78446
gi|2765570|emb|Z78445.1|PUZ78445
gi|2765569|emb|Z78444.1|PAZ78444
gi|2765568|emb|Z78443.1|PLZ78443
gi|2765567|emb|Z78442.1|PBZ78442
gi|2765566|emb|Z78441.1|PSZ78441
gi|2765565|emb|Z78440.1|PPZ78440
gi|2765564|emb|Z78439.1|PBZ78439
* **`SeqIO.parse()`** returns an iterator that yields the sequences in a file one at a time.
* The first argument is the filename (an open file handle also works).
* The second argument (`"fasta"`) tells Biopython the file format.
* Each sequence is yielded as a **SeqRecord** object, which the loop assigns to the variable `record`.

A `SeqRecord` contains useful information such as the sequence ID, description, and the sequence itself.
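
To make record-by-record parsing concrete, here is a minimal standard-library sketch of what a FASTA parser does conceptually. It is not Biopython's implementation: `iter_fasta` and the sample data are hypothetical names made up for illustration, and the sketch ignores edge cases that `SeqIO.parse()` handles.

```python
import io


def iter_fasta(handle):
    """Yield (header, sequence) tuples from a FASTA-formatted text handle.

    Hypothetical helper for illustration only; not part of Biopython.
    """
    header = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header starts, so emit the record collected so far.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            # Sequence data may span several lines; collect the pieces.
            chunks.append(line)
    # Emit the final record once the input is exhausted.
    if header is not None:
        yield header, "".join(chunks)


# Made-up two-record FASTA text standing in for a file on disk.
sample = io.StringIO(">seq1 first example\nATGC\nGGTA\n>seq2 second example\nTTAA\n")
parsed = list(iter_fasta(sample))
for header, sequence in parsed:
    print(header, len(sequence))
```

Because the function is a generator, it holds only one record in memory at a time, which is the same property that makes iterating over `SeqIO.parse()` suitable for large files.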

---

## Accessing Sequence Information

Each FASTA entry contains several pieces of information. Let's explore them.

```python
from Bio import SeqIO

for record in SeqIO.parse("orchids.fasta", "fasta"):
    print("ID:", record.id)
    print("Description:", record.description)
    print("Sequence length:", len(record.seq))
    print("First 20 bases:", record.seq[:20])
    print()
```

ID: gi|2765658|emb|Z78533.1|CIZ78533
Description: gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 740
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765657|emb|Z78532.1|CCZ78532
Description: gi|2765657|emb|Z78532.1|CCZ78532 C.californicum 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 753
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765656|emb|Z78531.1|CFZ78531
Description: gi|2765656|emb|Z78531.1|CFZ78531 C.fasciculatum 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 748
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765655|emb|Z78530.1|CMZ78530
Description: gi|2765655|emb|Z78530.1|CMZ78530 C.margaritaceum 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 744
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765654|emb|Z78529.1|CLZ78529
Description: gi|2765654|emb|Z78529.1|CLZ78529 C.lichiangense 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 733
First 20 bases: ACGGCGAGCTGCCGAAGGAC

ID: gi|2765652|emb|Z78527.1|CYZ78527
Description: gi|2765652|emb|Z78527.1|CYZ78527 C.yatabeanum 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 718
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765651|emb|Z78526.1|CGZ78526
Description: gi|2765651|emb|Z78526.1|CGZ78526 C.guttatum 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 730
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765650|emb|Z78525.1|CAZ78525
Description: gi|2765650|emb|Z78525.1|CAZ78525 C.acaule 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 704
First 20 bases: TGTTGAGATAGCAGAATATA

ID: gi|2765649|emb|Z78524.1|CFZ78524
Description: gi|2765649|emb|Z78524.1|CFZ78524 C.formosanum 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 740
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765648|emb|Z78523.1|CHZ78523
Description: gi|2765648|emb|Z78523.1|CHZ78523 C.himalaicum 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 709
First 20 bases: CGTAACCAGGTTTCCGTAGG

ID: gi|2765647|emb|Z78522.1|CMZ78522
Description: gi|2765647|emb|Z78522.1|CMZ78522 C.macranthum 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 700
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765646|emb|Z78521.1|CCZ78521
Description: gi|2765646|emb|Z78521.1|CCZ78521 C.calceolus 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 726
First 20 bases: GTAGGTGAACCTGCGGAAGG

ID: gi|2765645|emb|Z78520.1|CSZ78520
Description: gi|2765645|emb|Z78520.1|CSZ78520 C.segawai 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 753
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765644|emb|Z78519.1|CPZ78519
Description: gi|2765644|emb|Z78519.1|CPZ78519 C.pubescens 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 699
First 20 bases: ATATGATCGAGTGAATCTGG

ID: gi|2765643|emb|Z78518.1|CRZ78518
Description: gi|2765643|emb|Z78518.1|CRZ78518 C.reginae 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 658
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765642|emb|Z78517.1|CFZ78517
Description: gi|2765642|emb|Z78517.1|CFZ78517 C.flavum 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 752
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765641|emb|Z78516.1|CPZ78516
Description: gi|2765641|emb|Z78516.1|CPZ78516 C.passerinum 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 726
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765640|emb|Z78515.1|MXZ78515
Description: gi|2765640|emb|Z78515.1|MXZ78515 M.xerophyticum 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 765
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765639|emb|Z78514.1|PSZ78514
Description: gi|2765639|emb|Z78514.1|PSZ78514 P.schlimii 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 755
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765638|emb|Z78513.1|PBZ78513
Description: gi|2765638|emb|Z78513.1|PBZ78513 P.besseae 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 742
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765637|emb|Z78512.1|PWZ78512
Description: gi|2765637|emb|Z78512.1|PWZ78512 P.wallisii 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 762
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765636|emb|Z78511.1|PEZ78511
Description: gi|2765636|emb|Z78511.1|PEZ78511 P.exstaminodium 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 745
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765635|emb|Z78510.1|PCZ78510
Description: gi|2765635|emb|Z78510.1|PCZ78510 P.caricinum 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 750
First 20 bases: CTAACCAGGGTTCCGAGGTG

ID: gi|2765634|emb|Z78509.1|PPZ78509
Description: gi|2765634|emb|Z78509.1|PPZ78509 P.pearcei 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 731
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765633|emb|Z78508.1|PLZ78508
Description: gi|2765633|emb|Z78508.1|PLZ78508 P.longifolium 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 741
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765632|emb|Z78507.1|PLZ78507
Description: gi|2765632|emb|Z78507.1|PLZ78507 P.lindenii 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 740
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765631|emb|Z78506.1|PLZ78506
Description: gi|2765631|emb|Z78506.1|PLZ78506 P.lindleyanum 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 727
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765630|emb|Z78505.1|PSZ78505
Description: gi|2765630|emb|Z78505.1|PSZ78505 P.sargentianum 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 711
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765629|emb|Z78504.1|PKZ78504
Description: gi|2765629|emb|Z78504.1|PKZ78504 P.kaiteurum 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 743
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765628|emb|Z78503.1|PCZ78503
Description: gi|2765628|emb|Z78503.1|PCZ78503 P.czerwiakowianum 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 727
First 20 bases: CGTAACCAGGTTTCCGTAGG

ID: gi|2765627|emb|Z78502.1|PBZ78502
Description: gi|2765627|emb|Z78502.1|PBZ78502 P.boissierianum 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 757
First 20 bases: CGTAACCAGGTTTCCGTAGG

ID: gi|2765626|emb|Z78501.1|PCZ78501
Description: gi|2765626|emb|Z78501.1|PCZ78501 P.caudatum 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 770
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765625|emb|Z78500.1|PWZ78500
Description: gi|2765625|emb|Z78500.1|PWZ78500 P.warszewiczianum 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 767
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765624|emb|Z78499.1|PMZ78499
Description: gi|2765624|emb|Z78499.1|PMZ78499 P.micranthum 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 759
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765623|emb|Z78498.1|PMZ78498
Description: gi|2765623|emb|Z78498.1|PMZ78498 P.malipoense 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 750
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765622|emb|Z78497.1|PDZ78497
Description: gi|2765622|emb|Z78497.1|PDZ78497 P.delenatii 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 788
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765621|emb|Z78496.1|PAZ78496
Description: gi|2765621|emb|Z78496.1|PAZ78496 P.armeniacum 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 774
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765620|emb|Z78495.1|PEZ78495
Description: gi|2765620|emb|Z78495.1|PEZ78495 P.emersonii 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 789
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765619|emb|Z78494.1|PNZ78494
Description: gi|2765619|emb|Z78494.1|PNZ78494 P.niveum 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 688
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765618|emb|Z78493.1|PGZ78493
Description: gi|2765618|emb|Z78493.1|PGZ78493 P.godefroyae 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 719
First 20 bases: CGTAACAAGGATTCCGTAGG

ID: gi|2765617|emb|Z78492.1|PBZ78492
Description: gi|2765617|emb|Z78492.1|PBZ78492 P.bellatulum 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 743
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765616|emb|Z78491.1|PCZ78491
Description: gi|2765616|emb|Z78491.1|PCZ78491 P.concolor 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 737
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765615|emb|Z78490.1|PFZ78490
Description: gi|2765615|emb|Z78490.1|PFZ78490 P.fairrieanum 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 728
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765614|emb|Z78489.1|PDZ78489
Description: gi|2765614|emb|Z78489.1|PDZ78489 P.druryi 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 740
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765613|emb|Z78488.1|PTZ78488
Description: gi|2765613|emb|Z78488.1|PTZ78488 P.tigrinum 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 696
First 20 bases: CTGTAGGTGAACCTGCGGAA

ID: gi|2765612|emb|Z78487.1|PHZ78487
Description: gi|2765612|emb|Z78487.1|PHZ78487 P.hirsutissimum 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 732
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765611|emb|Z78486.1|PBZ78486
Description: gi|2765611|emb|Z78486.1|PBZ78486 P.barbigerum 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 731
First 20 bases: CGTCACGAGGTTTCCGTAGG

ID: gi|2765610|emb|Z78485.1|PHZ78485
Description: gi|2765610|emb|Z78485.1|PHZ78485 P.henryanum 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 735
First 20 bases: CTGAACCTGGTGTCCGAAGG

ID: gi|2765609|emb|Z78484.1|PCZ78484
Description: gi|2765609|emb|Z78484.1|PCZ78484 P.charlesworthii 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 720
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765608|emb|Z78483.1|PVZ78483
Description: gi|2765608|emb|Z78483.1|PVZ78483 P.villosum 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 740
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765607|emb|Z78482.1|PEZ78482
Description: gi|2765607|emb|Z78482.1|PEZ78482 P.exul 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 629
First 20 bases: TCTACTGCAGTGACCGAGAT

ID: gi|2765606|emb|Z78481.1|PIZ78481
Description: gi|2765606|emb|Z78481.1|PIZ78481 P.insigne 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 572
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765605|emb|Z78480.1|PGZ78480
Description: gi|2765605|emb|Z78480.1|PGZ78480 P.gratrixianum 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 587
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765604|emb|Z78479.1|PPZ78479
Description: gi|2765604|emb|Z78479.1|PPZ78479 P.primulinum 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 700
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765603|emb|Z78478.1|PVZ78478
Description: gi|2765603|emb|Z78478.1|PVZ78478 P.victoria 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 636
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765602|emb|Z78477.1|PVZ78477
Description: gi|2765602|emb|Z78477.1|PVZ78477 P.victoria 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 716
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765601|emb|Z78476.1|PGZ78476
Description: gi|2765601|emb|Z78476.1|PGZ78476 P.glaucophyllum 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 592
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765600|emb|Z78475.1|PSZ78475
Description: gi|2765600|emb|Z78475.1|PSZ78475 P.supardii 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 716
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765599|emb|Z78474.1|PKZ78474
Description: gi|2765599|emb|Z78474.1|PKZ78474 P.kolopakingii 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 733
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765598|emb|Z78473.1|PSZ78473
Description: gi|2765598|emb|Z78473.1|PSZ78473 P.sanderianum 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 626
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765597|emb|Z78472.1|PLZ78472
Description: gi|2765597|emb|Z78472.1|PLZ78472 P.lowii 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 737
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765596|emb|Z78471.1|PDZ78471
Description: gi|2765596|emb|Z78471.1|PDZ78471 P.dianthum 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 740
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765595|emb|Z78470.1|PPZ78470
Description: gi|2765595|emb|Z78470.1|PPZ78470 P.parishii 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 574
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765594|emb|Z78469.1|PHZ78469
Description: gi|2765594|emb|Z78469.1|PHZ78469 P.haynaldianum 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 594
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765593|emb|Z78468.1|PAZ78468
Description: gi|2765593|emb|Z78468.1|PAZ78468 P.adductum 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 610
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765592|emb|Z78467.1|PSZ78467
Description: gi|2765592|emb|Z78467.1|PSZ78467 P.stonei 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 730
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765591|emb|Z78466.1|PPZ78466
Description: gi|2765591|emb|Z78466.1|PPZ78466 P.philippinense 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 641
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765590|emb|Z78465.1|PRZ78465
Description: gi|2765590|emb|Z78465.1|PRZ78465 P.rothschildianum 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 702
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765589|emb|Z78464.1|PGZ78464
Description: gi|2765589|emb|Z78464.1|PGZ78464 P.glanduliferum 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 733
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765588|emb|Z78463.1|PGZ78463
Description: gi|2765588|emb|Z78463.1|PGZ78463 P.glanduliferum 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 738
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765587|emb|Z78462.1|PSZ78462
Description: gi|2765587|emb|Z78462.1|PSZ78462 P.sukhakulii 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 736
First 20 bases: CGTCACGAGGTCTCCGGATG

ID: gi|2765586|emb|Z78461.1|PWZ78461
Description: gi|2765586|emb|Z78461.1|PWZ78461 P.wardii 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 732
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765585|emb|Z78460.1|PCZ78460
Description: gi|2765585|emb|Z78460.1|PCZ78460 P.ciliolare 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 745
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765584|emb|Z78459.1|PDZ78459
Description: gi|2765584|emb|Z78459.1|PDZ78459 P.dayanum 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 744
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765583|emb|Z78458.1|PHZ78458
Description: gi|2765583|emb|Z78458.1|PHZ78458 P.hennisianum 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 738
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765582|emb|Z78457.1|PCZ78457
Description: gi|2765582|emb|Z78457.1|PCZ78457 P.callosum 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 739
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765581|emb|Z78456.1|PTZ78456
Description: gi|2765581|emb|Z78456.1|PTZ78456 P.tonsum 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 740
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765580|emb|Z78455.1|PJZ78455
Description: gi|2765580|emb|Z78455.1|PJZ78455 P.javanicum 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 745
First 20 bases: CGTAACCAGGTTTCCGTAGG

ID: gi|2765579|emb|Z78454.1|PFZ78454
Description: gi|2765579|emb|Z78454.1|PFZ78454 P.fowliei 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 695
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765578|emb|Z78453.1|PSZ78453
Description: gi|2765578|emb|Z78453.1|PSZ78453 P.schoseri 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 745
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765577|emb|Z78452.1|PBZ78452
Description: gi|2765577|emb|Z78452.1|PBZ78452 P.bougainvilleanum 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 743
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765576|emb|Z78451.1|PHZ78451
Description: gi|2765576|emb|Z78451.1|PHZ78451 P.hookerae 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 730
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765575|emb|Z78450.1|PPZ78450
Description: gi|2765575|emb|Z78450.1|PPZ78450 P.papuanum 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 706
First 20 bases: GGAAGGATCATTGCTGATAT

ID: gi|2765574|emb|Z78449.1|PMZ78449
Description: gi|2765574|emb|Z78449.1|PMZ78449 P.mastersianum 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 744
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765573|emb|Z78448.1|PAZ78448
Description: gi|2765573|emb|Z78448.1|PAZ78448 P.argus 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 742
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765572|emb|Z78447.1|PVZ78447
Description: gi|2765572|emb|Z78447.1|PVZ78447 P.venustum 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 694
First 20 bases: CGTAACAAGGATTCCGTAGG

ID: gi|2765571|emb|Z78446.1|PAZ78446
Description: gi|2765571|emb|Z78446.1|PAZ78446 P.acmodontum 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 712
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765570|emb|Z78445.1|PUZ78445
Description: gi|2765570|emb|Z78445.1|PUZ78445 P.urbanianum 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 715
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765569|emb|Z78444.1|PAZ78444
Description: gi|2765569|emb|Z78444.1|PAZ78444 P.appletonianum 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 688
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765568|emb|Z78443.1|PLZ78443
Description: gi|2765568|emb|Z78443.1|PLZ78443 P.lawrenceanum 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 784
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765567|emb|Z78442.1|PBZ78442
Description: gi|2765567|emb|Z78442.1|PBZ78442 P.bullenianum 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 721
First 20 bases: GTAGGTGAACCTGCGGAAGG

ID: gi|2765566|emb|Z78441.1|PSZ78441
Description: gi|2765566|emb|Z78441.1|PSZ78441 P.superbiens 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 703
First 20 bases: GGAAGGTCATTGCCGATATC

ID: gi|2765565|emb|Z78440.1|PPZ78440
Description: gi|2765565|emb|Z78440.1|PPZ78440 P.purpuratum 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 744
First 20 bases: CGTAACAAGGTTTCCGTAGG

ID: gi|2765564|emb|Z78439.1|PBZ78439
Description: gi|2765564|emb|Z78439.1|PBZ78439 P.barbatum 5.8S rRNA gene and ITS1 and ITS2 DNA
Sequence length: 592
First 20 bases: CATTGTTGAGATCACATAAT

* **`record.id`**: The sequence identifier, taken from the first whitespace-delimited word of the header line.
* **`record.description`**: The full FASTA header line, without the leading `>`.
* **`record.seq`**: The biological sequence.
* **`len(record.seq)`**: The length of the sequence.

The sequence itself is stored as a **Seq object**, which behaves much like a Python string but adds biological methods such as `complement()`, `reverse_complement()`, and `translate()`.
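
The relationship between the ID and the description in the output above can be sketched in plain Python. This is an illustration of the convention, not Biopython's code; `split_header` is a hypothetical helper name, and the header string is the first record from the orchid file.

```python
def split_header(header_line):
    """Split a FASTA header line into (id, description).

    Hypothetical helper: the ID is the first whitespace-delimited word,
    and the description is the whole header without the leading '>'.
    """
    title = header_line[1:] if header_line.startswith(">") else header_line
    title = title.strip()
    parts = title.split(None, 1)  # split once, on any run of whitespace
    record_id = parts[0] if parts else ""
    return record_id, title


header = ">gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA"
record_id, description = split_header(header)
print("ID:", record_id)
print("Description:", description)
```

This is why every description in the output starts with the same text as the ID: the ID is simply the first word of the header.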

---

## Counting the Number of Sequences

Sometimes you just want to know how many sequences are in a FASTA file.

```python
from Bio import SeqIO

count = 0

for record in SeqIO.parse("orchids.fasta", "fasta"):
    count += 1

print("Number of sequences:", count)
```

```
Number of sequences: 94
```

This code iterates through each record and increments a counter. FASTA files can contain thousands or even millions of sequences, so iterating like this avoids loading everything into memory at once.
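
The same streaming idea can be shown without Biopython. The sketch below counts records by counting header lines, reading one line at a time. `count_fasta_records` is a hypothetical helper, and it assumes a well-formed file in which `>` appears only at the start of header lines.

```python
import io


def count_fasta_records(handle):
    """Count records by counting header lines, one line at a time.

    Hypothetical helper; assumes '>' only ever starts a header line.
    """
    return sum(1 for line in handle if line.startswith(">"))


# Made-up three-record FASTA text standing in for a large file.
sample = io.StringIO(">a\nATGC\n>b\nGG\nCC\n>c\nTTAA\n")
n_records = count_fasta_records(sample)
print("Number of sequences:", n_records)
```

The generator expression inside `sum()` never builds a list, so memory use stays constant no matter how many records the file holds. The same pattern should work with the Biopython iterator, for example `sum(1 for _ in SeqIO.parse("orchids.fasta", "fasta"))`.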

---

## Converting FASTA Records to a List

If the FASTA file is small, you may prefer to load all sequences into a list.

```python
from Bio import SeqIO

records = list(SeqIO.parse("orchids.fasta", "fasta"))

print("Total sequences:", len(records))
print("First sequence ID:", records[0].id)
print("Sequence length:", len(records[0].seq))
```

```
Total sequences: 94
First sequence ID: gi|2765658|emb|Z78533.1|CIZ78533
Sequence length: 740
```

* `list()` converts the iterator returned by `SeqIO.parse()` into a list.
* This allows random access, such as `records[0]` or `records[5]`.

Be careful with very large FASTA files, as loading everything into memory can consume a lot of RAM.
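
The trade-off between an iterator and a list can be seen with a plain Python generator standing in for the record iterator. `fake_records` and its made-up IDs are assumptions for illustration. The sketch also shows `itertools.islice`, one way to take only the first few items without loading the rest.

```python
from itertools import islice


def fake_records():
    """Stand-in for a record iterator; yields made-up IDs lazily."""
    for i in range(1, 6):
        yield f"record{i}"


it = fake_records()
first_two = list(islice(it, 2))  # take two items without consuming the rest
remaining = list(it)             # the iterator continues where it left off
leftover = list(it)              # now exhausted, so nothing is left

as_list = list(fake_records())   # a list can be indexed and reused freely
print(first_two, remaining, leftover, as_list[4])
```

An iterator can be walked only once, so code that needs several passes or random access must either build a list, at the cost of memory, or parse the file again.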

---

## Calculating GC Content

A common task in sequence analysis is calculating the **GC content**, which is the percentage of nucleotides that are G or C.

```python
from Bio import SeqIO

# Calculate GC content for each sequence
for record in SeqIO.parse("orchids.fasta", "fasta"):
    seq = record.seq.upper()

    g = seq.count("G")
    c = seq.count("C")

    gc_content = (g + c) / len(seq) * 100

    print(record.id, "GC%:", round(gc_content, 2))
```

gi|2765658|emb|Z78533.1|CIZ78533 GC%: 59.59
gi|2765657|emb|Z78532.1|CCZ78532 GC%: 48.47
gi|2765656|emb|Z78531.1|CFZ78531 GC%: 57.09
gi|2765655|emb|Z78530.1|CMZ78530 GC%: 47.58
gi|2765654|emb|Z78529.1|CLZ78529 GC%: 47.89
gi|2765652|emb|Z78527.1|CYZ78527 GC%: 50.7
gi|2765651|emb|Z78526.1|CGZ78526 GC%: 50.41
gi|2765650|emb|Z78525.1|CAZ78525 GC%: 50.43
gi|2765649|emb|Z78524.1|CFZ78524 GC%: 47.7
gi|2765648|emb|Z78523.1|CHZ78523 GC%: 50.35
gi|2765647|emb|Z78522.1|CMZ78522 GC%: 49.86
gi|2765646|emb|Z78521.1|CCZ78521 GC%: 49.04
gi|2765645|emb|Z78520.1|CSZ78520 GC%: 49.54
gi|2765644|emb|Z78519.1|CPZ78519 GC%: 49.07
gi|2765643|emb|Z78518.1|CRZ78518 GC%: 51.52
gi|2765642|emb|Z78517.1|CFZ78517 GC%: 49.73
gi|2765641|emb|Z78516.1|CPZ78516 GC%: 49.17
gi|2765640|emb|Z78515.1|MXZ78515 GC%: 53.73
gi|2765639|emb|Z78514.1|PSZ78514 GC%: 56.03
gi|2765638|emb|Z78513.1|PBZ78513 GC%: 55.93
gi|2765637|emb|Z78512.1|PWZ78512 GC%: 56.17
gi|2765636|emb|Z78511.1|PEZ78511 GC%: 56.24
gi|2765635|emb|Z78510.1|PCZ78510 GC%: 57.07
gi|2765634|emb|Z78509.1|PPZ78509 GC%: 55.54
gi|2765633|emb|Z78508.1|PLZ78508 GC%: 56.82
gi|2765632|emb|Z78507.1|PLZ78507 GC%: 56.35
gi|2765631|emb|Z78506.1|PLZ78506 GC%: 55.98
gi|2765630|emb|Z78505.1|PSZ78505 GC%: 55.7
gi|2765629|emb|Z78504.1|PKZ78504 GC%: 55.18
gi|2765628|emb|Z78503.1|PCZ78503 GC%: 56.26
gi|2765627|emb|Z78502.1|PBZ78502 GC%: 56.41
gi|2765626|emb|Z78501.1|PCZ78501 GC%: 56.49
gi|2765625|emb|Z78500.1|PWZ78500 GC%: 57.24
gi|2765624|emb|Z78499.1|PMZ78499 GC%: 51.52
gi|2765623|emb|Z78498.1|PMZ78498 GC%: 51.33
gi|2765622|emb|Z78497.1|PDZ78497 GC%: 52.54
gi|2765621|emb|Z78496.1|PAZ78496 GC%: 51.81
gi|2765620|emb|Z78495.1|PEZ78495 GC%: 53.11
gi|2765619|emb|Z78494.1|PNZ78494 GC%: 50.0
gi|2765618|emb|Z78493.1|PGZ78493 GC%: 51.18
gi|2765617|emb|Z78492.1|PBZ78492 GC%: 50.61
gi|2765616|emb|Z78491.1|PCZ78491 GC%: 50.34
gi|2765615|emb|Z78490.1|PFZ78490 GC%: 51.37
gi|2765614|emb|Z78489.1|PDZ78489 GC%: 51.22
gi|2765613|emb|Z78488.1|PTZ78488 GC%: 51.44
gi|2765612|emb|Z78487.1|PHZ78487 GC%: 51.09
gi|2765611|emb|Z78486.1|PBZ78486 GC%: 51.03
gi|2765610|emb|Z78485.1|PHZ78485 GC%: 50.75
gi|2765609|emb|Z78484.1|PCZ78484 GC%: 50.83
gi|2765608|emb|Z78483.1|PVZ78483 GC%: 49.86
gi|2765607|emb|Z78482.1|PEZ78482 GC%: 52.15
gi|2765606|emb|Z78481.1|PIZ78481 GC%: 50.17
gi|2765605|emb|Z78480.1|PGZ78480 GC%: 50.09
gi|2765604|emb|Z78479.1|PPZ78479 GC%: 50.86
gi|2765603|emb|Z78478.1|PVZ78478 GC%: 51.1
gi|2765602|emb|Z78477.1|PVZ78477 GC%: 51.54
gi|2765601|emb|Z78476.1|PGZ78476 GC%: 50.51
gi|2765600|emb|Z78475.1|PSZ78475 GC%: 43.3
gi|2765599|emb|Z78474.1|PKZ78474 GC%: 50.75
gi|2765598|emb|Z78473.1|PSZ78473 GC%: 50.96
gi|2765597|emb|Z78472.1|PLZ78472 GC%: 49.39
gi|2765596|emb|Z78471.1|PDZ78471 GC%: 50.0
gi|2765595|emb|Z78470.1|PPZ78470 GC%: 48.78
gi|2765594|emb|Z78469.1|PHZ78469 GC%: 49.16
gi|2765593|emb|Z78468.1|PAZ78468 GC%: 50.0
gi|2765592|emb|Z78467.1|PSZ78467 GC%: 50.82
gi|2765591|emb|Z78466.1|PPZ78466 GC%: 50.86
gi|2765590|emb|Z78465.1|PRZ78465 GC%: 49.43
gi|2765589|emb|Z78464.1|PGZ78464 GC%: 50.34
gi|2765588|emb|Z78463.1|PGZ78463 GC%: 51.22
gi|2765587|emb|Z78462.1|PSZ78462 GC%: 32.34
gi|2765586|emb|Z78461.1|PWZ78461 GC%: 50.41
gi|2765585|emb|Z78460.1|PCZ78460 GC%: 50.74
gi|2765584|emb|Z78459.1|PDZ78459 GC%: 50.4
gi|2765583|emb|Z78458.1|PHZ78458 GC%: 51.08
gi|2765582|emb|Z78457.1|PCZ78457 GC%: 50.47
gi|2765581|emb|Z78456.1|PTZ78456 GC%: 50.54
gi|2765580|emb|Z78455.1|PJZ78455 GC%: 49.26
gi|2765579|emb|Z78454.1|PFZ78454 GC%: 50.65
gi|2765578|emb|Z78453.1|PSZ78453 GC%: 50.47
gi|2765577|emb|Z78452.1|PBZ78452 GC%: 49.39
gi|2765576|emb|Z78451.1|PHZ78451 GC%: 50.82
gi|2765575|emb|Z78450.1|PPZ78450 GC%: 50.57
gi|2765574|emb|Z78449.1|PMZ78449 GC%: 50.0
gi|2765573|emb|Z78448.1|PAZ78448 GC%: 50.4
gi|2765572|emb|Z78447.1|PVZ78447 GC%: 51.44
gi|2765571|emb|Z78446.1|PAZ78446 GC%: 50.14
gi|2765570|emb|Z78445.1|PUZ78445 GC%: 50.77
gi|2765569|emb|Z78444.1|PAZ78444 GC%: 48.69
gi|2765568|emb|Z78443.1|PLZ78443 GC%: 39.67
gi|2765567|emb|Z78442.1|PBZ78442 GC%: 50.76
gi|2765566|emb|Z78441.1|PSZ78441 GC%: 50.92
gi|2765565|emb|Z78440.1|PPZ78440 GC%: 49.87
gi|2765564|emb|Z78439.1|PBZ78439 GC%: 50.0
This code:

1. Reads each sequence
2. Counts the number of **G** and **C** bases
3. Divides that count by the sequence length and multiplies by 100 to get the GC percentage

GC content is important in many areas of genomics because it can influence gene expression, sequencing behavior, and genome stability.
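
The calculation can be wrapped in a small reusable function. The version below is a sketch with two choices that are assumptions, not part of the original code: it returns `0.0` for an empty sequence instead of raising a division error, and it accepts anything that converts to a string. `gc_percent` is a hypothetical name.

```python
def gc_percent(sequence):
    """Return GC content as a percentage of the sequence length.

    Hypothetical helper. Ambiguity codes such as N or S are not counted
    as G or C, but they do count toward the total length.
    """
    seq = str(sequence).upper()
    if not seq:
        return 0.0  # assumption: treat an empty sequence as 0% GC
    return (seq.count("G") + seq.count("C")) / len(seq) * 100


for example in ["ATGC", "gggg", "AATT", ""]:
    print(repr(example), round(gc_percent(example), 2))
```

Biopython's `Bio.SeqUtils` module also provides GC helpers (`gc_fraction` in recent releases), so check the documentation for your installed version before writing your own.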

---

## Writing a New FASTA File

Biopython can also write FASTA files using `SeqIO.write()`.

```python
from Bio import SeqIO

# Filter sequences longer than 600 bases
records = []

for record in SeqIO.parse("orchids.fasta", "fasta"):
    if len(record.seq) > 600:
        records.append(record)

# Write filtered sequences to a new FASTA file
SeqIO.write(records, "long_sequences.fasta", "fasta")
```

```
88
```

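* `SeqIO.write()` writes sequence records to a file and returns the number of records written (88 here, as shown when the call is the last line of a notebook cell).
* The first argument is the records to write: a list or any other iterable of `SeqRecord` objects.
* The second argument is the output filename.
* The third argument is the file format.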

This example filters the sequences to keep only those longer than 600 bases and writes them into a new FASTA file.
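
To show what the filter-then-write pattern produces on disk, here is a standard-library sketch that filters toy records by length and formats the survivors as FASTA text. Everything in it is an assumption made for illustration: the helper `format_fasta_record`, the tuple layout, the made-up sequences, the threshold of 50, and the line width of 60 (a common convention for wrapping FASTA sequence lines). It is not how Biopython writes files internally.

```python
def format_fasta_record(record_id, description, sequence, width=60):
    """Return one record as FASTA text (hypothetical helper, not Biopython)."""
    header = f">{record_id} {description}".strip()
    # Break the sequence into fixed-width lines.
    body = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([header, *body]) + "\n"


# Toy (id, description, sequence) tuples standing in for parsed records.
records = [
    ("short1", "made-up short sequence", "ATGC" * 5),   # 20 bases
    ("long1", "made-up long sequence", "ATGC" * 20),    # 80 bases
]

min_length = 50  # arbitrary threshold for this toy data
kept = [r for r in records if len(r[2]) > min_length]

fasta_text = "".join(format_fasta_record(*r) for r in kept)
print(fasta_text, end="")
```

Only `long1` passes the filter, and its 80-base sequence is split across a 60-character line and a 20-character line under a single header.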

---

## Creating FASTA Records Manually

You can also create new FASTA sequences programmatically.

```python
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio import SeqIO

record1 = SeqRecord(
    Seq("ATGCGTACGTAGCTAGCTAG"),
    id="Example1",
    description="Example DNA sequence"
)

record2 = SeqRecord(
    Seq("ATGGGCTAGCTAGGCTA"),
    id="Example2",
    description="Another DNA sequence"
)

records = [record1, record2]

SeqIO.write(records, "example_sequences.fasta", "fasta")
```

```
2
```

* **`Seq`** holds the biological sequence itself.
* **`SeqRecord`** wraps a `Seq` together with metadata such as the ID and description.
* `SeqIO.write()` writes the records to a FASTA file and returns the number of records written (2 here).

This is useful when generating sequences from simulations, analyses, or custom pipelines.

---

## Conclusion

FASTA files are fundamental to bioinformatics, and **Biopython** makes them easy to work with in Python. Using the `SeqIO` module, you can efficiently read, analyze, and write sequence data.

In this tutorial, you learned how to:

* Parse FASTA files with `SeqIO.parse()`
* Access sequence IDs, descriptions, and sequences
* Count and analyze sequences
* Calculate GC content
* Write new FASTA files
* Create FASTA records programmatically

These skills form the foundation for many real-world bioinformatics workflows, including genome analysis, sequence filtering, and building data processing pipelines.