Large FASTA datasets?

Hello! Apologies if this is the wrong place to post this question.

I’m playing around with TranscodingStreams and I want to compare the compression of FASTA files with different codecs. I can make my own randomized FASTA files using BioSequences and FASTX easily enough, but I’m also looking for some real world examples.

I’m looking for a large public FASTA dataset (at least 100 files) that’s well reviewed and varied. If it includes both DNA and RNA FASTA files (or you know of a separate dataset for each), that would work quite well. If anyone can point me to where such datasets are hosted, that would be awesome!

FormatSpecimens.jl includes a varied set of FASTA files, but they are small.
Most large datasets are produced in a homogeneous manner, so there is not much variation within any one dataset.

You’d probably want to assemble one yourself. I would include the following kinds of data:

  • Plant genome, especially maize, strawberry or pine, which are (in)famous for their repeats
  • Other eukaryote genome, maybe human or yeast
  • Assembly of a metagenomic dataset
  • Set of variants of e.g. a virus. These are about 99% identical, so they should compress very differently from the other kinds. Look at a COVID or flu database
  • Maybe something from mass spec? I’m not too into it, but if they represent the protein fragments as AA, that would be interesting as well

Also consider adding amplicon data if it fits your use case (its characteristics are quite different from whole-genome shotgun data). Microbiome research has produced more than a million such samples by now, typically uploaded to NCBI SRA or ENA.


@jakobnissen @jtackm Thank you both for your help! While I was doing some research I actually found some references to an NCBI FTP server containing almost 900 MB of *.fna files! Far more than I need, but I will definitely be creating some artifacts.

These look to be assemblies of microbes from the human gut. Note that these will have completely different compression characteristics from e.g. genomic repeats or a selection of variants of the same sequence.


Hm, that is something I should mention in the Discussion section of my paper. As this is for a small research project at uni, I think FormatSpecimens.jl might actually work well enough to try out Zstd dictionary compression. Either should work fine for the purposes of my paper!
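
For anyone reading along, here is a rough sketch of why a shared dictionary helps with small files. It is not Zstd: it uses zlib's preset dictionary (the `zdict` argument in Python's standard library) as a stand-in, and the dictionary is a hand-picked reference record rather than one trained from samples, which is how Zstd dictionaries are usually built. The sequences are synthetic and the helper names (`fasta_record`, `point_mutations`, `deflate`, `inflate`) are hypothetical.

```python
import random
import zlib

WIDTH = 60  # FASTA line width, kept identical in the dictionary and the data


def fasta_record(name, seq):
    """Serialize a single sequence as a FASTA record in bytes."""
    body = "\n".join(seq[i:i + WIDTH] for i in range(0, len(seq), WIDTH))
    return f">{name}\n{body}\n".encode("ascii")


def point_mutations(seq, count, rng):
    """Return a copy of seq with `count` distinct positions substituted."""
    bases = list(seq)
    for pos in rng.sample(range(len(bases)), count):
        bases[pos] = rng.choice([b for b in "ACGT" if b != bases[pos]])
    return "".join(bases)


def deflate(data, zdict=None):
    """Compress with zlib, optionally primed with a preset dictionary."""
    if zdict is None:
        comp = zlib.compressobj(level=9)
    else:
        comp = zlib.compressobj(level=9, zdict=zdict)
    return comp.compress(data) + comp.flush()


def inflate(blob, zdict=None):
    """Decompress; the same dictionary must be supplied if one was used."""
    decomp = zlib.decompressobj() if zdict is None else zlib.decompressobj(zdict=zdict)
    return decomp.decompress(blob) + decomp.flush()


rng = random.Random(1)  # fixed seed so the run is repeatable
reference = "".join(rng.choice("ACGT") for _ in range(2_000))

# Assumption: the reference record stands in for a trained dictionary.
dictionary = fasta_record("reference", reference)
small_file = fasta_record("sample_1", point_mutations(reference, 20, rng))

plain = deflate(small_file)
with_dict = deflate(small_file, dictionary)
restored = inflate(with_dict, dictionary)

print(f"raw {len(small_file)} B, zlib {len(plain)} B, zlib + dictionary {len(with_dict)} B")
```

A small file on its own gives the compressor very little history to refer back to, so most of the gain here comes from the dictionary supplying that history up front. The trade-off is that the same dictionary has to be available at decompression time.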


I just saw your request, but if you are still looking for real sequences in FASTA format, you can download a file of 56K+ DNA sequences from the Sanger Lab’s Catalog of Somatic Mutations in Cancer (COSMIC) database (Download Files). You’ll have to register for a free account. Note that RNA sequences in FASTA files are typically stored as reverse-transcribed DNA. Also, sequences from the minus strand are usually reoriented to the 5′ to 3′ direction of the positive strand (Negative Strand Coordinates in Fasta Files?). Hope this helps.


Thank you! Will keep it bookmarked for later use.