Hello! Apologies if this is the wrong place to post this question.
I’m playing around with TranscodingStreams and I want to compare the compression of FASTA files with different codecs. I can make my own randomized FASTA files using BioSequences and FASTX easily enough, but I’m also looking for some real world examples.
I’m looking for a large public FASTA dataset (100+ files) that’s well reviewed and varied. If it includes both DNA and RNA FASTA files (or there’s a separate dataset you know of), that would work quite well. If anyone can point me to where such datasets are located, that would be awesome!
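For reference, here is a minimal sketch of the kind of comparison described above. It is written in Python with standard-library codecs (`zlib`, `bz2`, `lzma`) standing in for the TranscodingStreams.jl codec packages, so it illustrates the measurement rather than the Julia API. The `random_fasta` and `compression_ratios` helpers are hypothetical names made up for this example, and the record count and sequence length are arbitrary.

```python
import bz2
import lzma
import random
import zlib


def random_fasta(n_records, seq_len, alphabet="ACGT", seed=0, width=60):
    """Build a FASTA-formatted byte string of uniformly random sequences."""
    rng = random.Random(seed)
    lines = []
    for i in range(n_records):
        lines.append(f">seq{i}")
        seq = "".join(rng.choice(alphabet) for _ in range(seq_len))
        # Wrap sequence lines at `width` characters, as many FASTA writers do.
        lines.extend(seq[j:j + width] for j in range(0, seq_len, width))
    return ("\n".join(lines) + "\n").encode("ascii")


# Codec name -> one-shot compression function (all from the standard library).
CODECS = {
    "zlib": lambda data: zlib.compress(data, 9),
    "bz2": lambda data: bz2.compress(data, 9),
    "lzma": lambda data: lzma.compress(data),
}


def compression_ratios(data):
    """Return {codec: compressed_size / original_size} for each codec."""
    return {name: len(fn(data)) / len(data) for name, fn in CODECS.items()}


if __name__ == "__main__":
    dna = random_fasta(100, 1000)
    rna = random_fasta(100, 1000, alphabet="ACGU", seed=1)
    for label, data in (("DNA", dna), ("RNA", rna)):
        for codec, ratio in compression_ratios(data).items():
            print(f"{label} {codec}: {ratio:.3f}")
```

Uniformly random sequences are close to a worst case for every codec, which is one reason real-world files are worth adding to the comparison.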
FormatSpecimens.jl includes a varied set of FASTA files, but they are small.
Most large datasets are produced in a homogeneous manner, so there is not much variation in the data within any one dataset.
You’d probably want to make it yourself. I would include the following kinds of dataset:
Plant genome, especially maize, strawberry or pine, which are (in)famous for their repeats
Other eukaryote genome, maybe human or yeast
Assembly of a metagenomic dataset
Set of variants of e.g. a virus - these are 99% identical, so they should have distinctive compression characteristics. Look at a COVID or flu database
Maybe something from mass spec? I’m not too into it, but if they represent the protein fragments as AA, that would be interesting as well
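To make the point about differing characteristics concrete, here is a toy Python sketch comparing three synthetic datasets of equal size: unrelated random sequences, near-identical variants of one reference, and tandem repeats. Everything in it is an assumption for illustration: the 1% substitution rate, the 50-base repeat unit, and the helper names are made up, and real genomes are far messier. `lzma` is used because its large window can find matches across records.

```python
import lzma
import random


def random_seq(rng, length):
    """Uniformly random DNA string."""
    return "".join(rng.choice("ACGT") for _ in range(length))


def mutate(rng, seq, rate):
    """Redraw each base with probability `rate` (toy variant model)."""
    return "".join(rng.choice("ACGT") if rng.random() < rate else base for base in seq)


def to_fasta(seqs):
    """Unwrapped FASTA: one header line and one sequence line per record."""
    return "".join(f">seq{i}\n{s}\n" for i, s in enumerate(seqs)).encode("ascii")


def ratio(data):
    """Compressed size divided by original size, using LZMA."""
    return len(lzma.compress(data)) / len(data)


rng = random.Random(42)
n, length, unit_len = 50, 2000, 50

unrelated = [random_seq(rng, length) for _ in range(n)]
reference = random_seq(rng, length)
variants = [mutate(rng, reference, 0.01) for _ in range(n)]
# Each record is its own short unit repeated in tandem.
repeats = [random_seq(rng, unit_len) * (length // unit_len) for _ in range(n)]

results = {
    "unrelated": ratio(to_fasta(unrelated)),
    "variants": ratio(to_fasta(variants)),
    "repeats": ratio(to_fasta(repeats)),
}

for name, value in results.items():
    print(f"{name}: {value:.3f}")
```

The variant and repeat sets should compress far better than the unrelated set, so a benchmark built from only one kind of data will favor whichever codec handles that kind of redundancy best.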
Also consider adding amplicon data if it fits your use case (quite different characteristics from Whole Genome Shotgun). Microbiome research has produced more than 1 million such samples by now, typically uploaded to NCBI SRA or ENA.
@jakobnissen @jtackm Thank you both for your help! While I was doing some research I actually found some references to an NCBI FTP server containing almost 900 MB of *.fna file data! That is far more than I need, but I will definitely be creating some artifacts from it.
These look to be assemblies of microbes from the human gut. Note that these will have completely different compression characteristics from e.g. genomic repeats or a selection of variants of the same sequence.
Hm, that is something I should mention in the Discussion section of my paper. As this is for a small research project at uni, I think FormatSpecimens.jl might actually work well enough to try out Zstd dictionary compression. Either one should work fine for the purposes of my paper!
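As a rough illustration of why dictionary compression suits small files, here is a Python sketch that uses zlib's preset dictionary (`zdict`) as a stand-in. This is not Zstd: Zstd trains a dictionary from many samples, whereas here the raw bytes of one hypothetical training file are used directly. The synthetic files, the 1% substitution rate, and the helper names are all assumptions made for the example.

```python
import random
import zlib


def compress(data, zdict=None):
    """Deflate `data`, optionally primed with a preset dictionary."""
    if zdict is not None:
        comp = zlib.compressobj(level=9, zdict=zdict)
    else:
        comp = zlib.compressobj(level=9)
    return comp.compress(data) + comp.flush()


def decompress(blob, zdict=None):
    """Inverse of `compress`; the same dictionary must be supplied."""
    if zdict is not None:
        decomp = zlib.decompressobj(zdict=zdict)
    else:
        decomp = zlib.decompressobj()
    return decomp.decompress(blob) + decomp.flush()


rng = random.Random(7)
reference = "".join(rng.choice("ACGT") for _ in range(1000))


def make_file(i):
    """Small FASTA file holding one near-copy of the shared reference."""
    seq = "".join(rng.choice("ACGT") if rng.random() < 0.01 else b for b in reference)
    return f">sample{i}\n{seq}\n".encode("ascii")


training = make_file(0)                      # used only as the dictionary
files = [make_file(i) for i in range(1, 21)]  # the files being compressed

plain_total = sum(len(compress(f)) for f in files)
dict_total = sum(len(compress(f, training)) for f in files)

print(f"without dictionary: {plain_total} bytes")
print(f"with dictionary:    {dict_total} bytes")
```

Each file is compressed independently, so without a dictionary the codec cannot exploit similarity between files. The dictionary supplies that shared context up front, and the benefit shrinks as individual files get larger.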
I just saw your request, but if you are still looking for real sequences in FASTA format you can download a file of 56K+ DNA sequences from the Sanger Lab’s Catalogue of Somatic Mutations in Cancer (COSMIC) database (Download Files). You’ll have to register for a free account. FASTA data for RNA sequences are typically reverse-transcribed to DNA. Also, sequences from the minus strand are usually reoriented to the 5′ to 3′ direction of the positive strand (Negative Strand Coordinates in Fasta Files?). Hope this helps.