Large FASTA datasets?

Hello! Apologies if this is the wrong place to post this question.

I’m playing around with TranscodingStreams and I want to compare the compression of FASTA files with different codecs. I can make my own randomized FASTA files using BioSequences and FASTX easily enough, but I’m also looking for some real world examples.
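For anyone following along, here is a rough sketch of the kind of comparison described above. It is written in Python with only standard-library codecs (zlib, bz2, lzma) rather than TranscodingStreams, and `random_fasta` and `compression_ratios` are hypothetical helper names made up for the illustration, so treat it as an outline of the experiment and not as the actual setup.

```python
import bz2
import lzma
import random
import zlib


def random_fasta(n_records=50, seq_len=2000, line_width=60, seed=0):
    """Build a synthetic DNA FASTA file as bytes (hypothetical helper)."""
    rng = random.Random(seed)
    lines = []
    for i in range(n_records):
        lines.append(f">seq{i}")
        seq = "".join(rng.choice("ACGT") for _ in range(seq_len))
        # Wrap the sequence at a fixed width, as many FASTA writers do.
        lines.extend(seq[j:j + line_width] for j in range(0, seq_len, line_width))
    return ("\n".join(lines) + "\n").encode("ascii")


# Codec name -> function that compresses a bytes object.
CODECS = {
    "zlib": lambda data: zlib.compress(data, 9),
    "bz2": lambda data: bz2.compress(data, 9),
    "lzma": lambda data: lzma.compress(data),
}


def compression_ratios(data):
    """Return {codec name: compressed size / original size}."""
    return {name: len(fn(data)) / len(data) for name, fn in CODECS.items()}


if __name__ == "__main__":
    fasta = random_fasta()
    for name, ratio in compression_ratios(fasta).items():
        print(f"{name}: {ratio:.3f}")
```

Uniformly random sequence is close to a worst case for DNA, which is part of why real-world files are worth adding to the comparison.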

I’m looking for a large public FASTA file dataset (at least 100+ files available) that’s well reviewed and varied. If it includes both DNA and RNA FASTA files (or there’s a separate dataset you know of) that would work quite well. If anyone can point me to where such datasets are located that would be awesome!

FormatSpecimens.jl includes a varied set of FASTA files, but they are small.
Most large datasets are produced in a homogeneous manner, so there is not much variation in the data within any one dataset.

You’d probably want to make it yourself. I would include the following kinds of data:

  • Plant genome, especially maize, strawberry or pine, which are (in)famous for their repeats
  • Other eukaryote genome, maybe human or yeast
  • Assembly of a metagenomic dataset
  • Set of variants of e.g. a virus - these are 99% identical, so they should have distinctive compression characteristics. Look at a COVID or flu DB
  • Maybe something from mass spec? I’m not too into it, but if they represent the protein fragments as AA, that would be interesting as well
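To illustrate why the mix of categories matters, here is a small Python sketch, under made-up assumptions, that compresses a set of near-identical "variants" and a set of unrelated random sequences of the same size. The sequences are synthetic, the 1% substitution rate is only a stand-in for the "99% identical" case, and `mutate`, `to_fasta` and `ratio` are hypothetical helpers.

```python
import random
import zlib


def random_dna(length, rng):
    """Uniformly random DNA string of the given length."""
    return "".join(rng.choice("ACGT") for _ in range(length))


def mutate(seq, rate, rng):
    """Redraw each base with probability `rate`.

    The redraw may pick the same base again, so the effective
    substitution rate is somewhat below `rate`.
    """
    return "".join(rng.choice("ACGT") if rng.random() < rate else b for b in seq)


def to_fasta(seqs):
    """Serialize sequences as single-line FASTA records (bytes)."""
    return "".join(f">seq{i}\n{s}\n" for i, s in enumerate(seqs)).encode("ascii")


def ratio(data):
    """Compressed size divided by original size, using zlib at level 9."""
    return len(zlib.compress(data, 9)) / len(data)


rng = random.Random(42)
reference = random_dna(5000, rng)

# 20 variants of one reference versus 20 unrelated sequences.
variants = to_fasta([mutate(reference, 0.01, rng) for _ in range(20)])
unrelated = to_fasta([random_dna(5000, rng) for _ in range(20)])

print(f"variants:  {ratio(variants):.3f}")
print(f"unrelated: {ratio(unrelated):.3f}")
```

One caveat: DEFLATE only looks back 32 KiB, so zlib can exploit the similarity here only because each 5 kb record sits close to the previous one. For genomes longer than that window, a codec with a larger window would be needed to see the same effect.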

Also consider adding amplicon data if it fits your use case (quite different characteristics from Whole Genome Shotgun). Microbiome research produces more than a million such samples these days, typically uploaded to NCBI SRA or ENA.

@jakobnissen @jtackm Thank you both for your help! While I was doing some research I actually found some references to an NCBI FTP server containing almost 900 MB of *.fna file data! Far more than I need, but will definitely be creating some artifacts :slight_smile:

These look to be assemblies of microbes from the human gut. Note that these will have completely different compression characteristics from e.g. genomic repeats or a selection of variants of the same sequence.

Hm, that is something I should mention in the Discussion section for my paper. As this is for a small research project at uni, I think FormatSpecimens.jl might actually work well enough to try out Zstd dictionary compression. Either should work fine for the purposes of my paper!
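As a rough illustration of why dictionary compression is attractive for small files, here is a Python sketch using zlib's preset-dictionary support (`zdict`). This is a stand-in technique, not Zstd: Zstd dictionaries are normally trained from a set of sample files, whereas here the dictionary is just raw bytes that happen to share content with the input. The sequences are synthetic and the `compress` and `decompress` wrappers are hypothetical names.

```python
import random
import zlib


def compress(data, zdict=None):
    """Compress with zlib, optionally priming it with a preset dictionary."""
    if zdict is None:
        comp = zlib.compressobj(level=9)
    else:
        comp = zlib.compressobj(level=9, zdict=zdict)
    return comp.compress(data) + comp.flush()


def decompress(blob, zdict=None):
    """Decompress; the same dictionary must be supplied if one was used."""
    if zdict is None:
        decomp = zlib.decompressobj()
    else:
        decomp = zlib.decompressobj(zdict=zdict)
    return decomp.decompress(blob) + decomp.flush()


rng = random.Random(1)
reference = "".join(rng.choice("ACGT") for _ in range(4000))

# Assumption: the dictionary is simply a reference sequence as raw bytes.
dictionary = reference.encode("ascii")

# A small FASTA record whose sequence is a slice of the reference.
record = (">sample1\n" + reference[500:1500] + "\n").encode("ascii")

plain = compress(record)
with_dict = compress(record, dictionary)

print(len(record), len(plain), len(with_dict))
```

The gain depends entirely on how much the small files share with the dictionary, so files as varied as the ones in FormatSpecimens.jl may benefit much less than this contrived case.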
