Large FASTA datasets?

M-PERSIC · November 25, 2022, 2:35am

Hello! Apologies if this is the wrong place to post this question.

I’m playing around with TranscodingStreams and I want to compare the compression of FASTA files with different codecs. I can make my own randomized FASTA files using BioSequences and FASTX easily enough, but I’m also looking for some real world examples.

I’m looking for a large public FASTA file dataset (at least 100+ files available) that’s well reviewed and varied. If it includes both DNA and RNA FASTA files (or there’s a separate dataset you know of) that would work quite well. If anyone can point me to where such datasets are located that would be awesome!
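For reference, the kind of comparison described above can be sketched in a few lines. This is only a minimal illustration in Python using standard-library codecs (`zlib`, `bz2`, `lzma`) as stand-ins, not the TranscodingStreams, BioSequences, or FASTX code itself; the helper names `random_fasta` and `compression_ratios`, the record count, and the sequence length are all made up for the example.

```python
import bz2
import lzma
import random
import zlib


def random_fasta(n_records: int, seq_len: int, seed: int = 0, width: int = 60) -> bytes:
    """Build an in-memory FASTA file of uniformly random DNA records."""
    rng = random.Random(seed)
    lines = []
    for i in range(n_records):
        seq = "".join(rng.choices("ACGT", k=seq_len))
        lines.append(f">random_{i}")  # hypothetical header naming scheme
        # Wrap the sequence at a fixed line width, as most FASTA writers do.
        lines.extend(seq[j:j + width] for j in range(0, seq_len, width))
    return ("\n".join(lines) + "\n").encode("ascii")


# One-shot compressors from the standard library, all at default settings.
CODECS = {
    "zlib": zlib.compress,
    "bz2": bz2.compress,
    "lzma": lzma.compress,
}


def compression_ratios(data: bytes) -> dict:
    """Return compressed size divided by original size for each codec."""
    return {name: len(compress(data)) / len(data) for name, compress in CODECS.items()}


if __name__ == "__main__":
    fasta = random_fasta(n_records=100, seq_len=1000)
    for name, ratio in compression_ratios(fasta).items():
        print(f"{name}: {ratio:.3f}")
```

Uniformly random DNA carries about 2 bits per base, so no codec can get far below a ratio of roughly 0.25 on it, which is one reason real-world files are the more interesting test.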

jakobnissen · November 25, 2022, 8:02am

FormatSpecimens.jl includes a varied set of FASTA files, but they are small.
Normally, most large datasets are produced in a homogeneous manner, so the data within a single dataset does not vary much.

You’d probably want to assemble it yourself. I would include the following kinds of datasets:

Plant genome, especially maize, strawberry or pine, which are (in)famous for their repeats
Other eukaryote genome, maybe human or yeast
Assembly of a metagenomic dataset
Set of variants of e.g. a virus. These are 99% identical, so they should have distinctive compression characteristics. Look at a covid or flu DB
Maybe something from mass spec? I’m not too into it, but if they represent the protein fragments as amino acids (AA), that would be interesting as well
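To illustrate why a set of near-identical variants behaves so differently from unrelated sequences, here is a rough sketch in Python (standard library only, not Julia). Everything in it is synthetic and assumed for the example: the 5,000-base reference, the 1% substitution rate, the 50 records, and the helper names `mutate`, `to_fasta`, and `ratio`. It uses `lzma` because its default dictionary is large enough to span many records, whereas DEFLATE (`zlib`) only looks back 32 KiB.

```python
import lzma
import random


def mutate(seq: str, rate: float, rng: random.Random) -> str:
    """Return a copy of seq with roughly `rate` of its bases substituted."""
    out = []
    for base in seq:
        if rng.random() < rate:
            # Substitute with one of the three other bases.
            out.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            out.append(base)
    return "".join(out)


def to_fasta(seqs: list, prefix: str) -> bytes:
    """Serialize sequences as single-line FASTA records."""
    return "".join(f">{prefix}_{i}\n{s}\n" for i, s in enumerate(seqs)).encode("ascii")


def ratio(data: bytes) -> float:
    """Compressed size divided by original size, using LZMA defaults."""
    return len(lzma.compress(data)) / len(data)


rng = random.Random(42)
reference = "".join(rng.choices("ACGT", k=5000))

# 50 variants that each differ from the reference at about 1% of positions.
variants = [mutate(reference, 0.01, rng) for _ in range(50)]
# 50 unrelated sequences of the same length, for comparison.
unrelated = ["".join(rng.choices("ACGT", k=5000)) for _ in range(50)]

variant_ratio = ratio(to_fasta(variants, "variant"))
unrelated_ratio = ratio(to_fasta(unrelated, "unrelated"))

if __name__ == "__main__":
    print(f"variants:  {variant_ratio:.3f}")
    print(f"unrelated: {unrelated_ratio:.3f}")
```

The variant set should compress to a small fraction of the size of the unrelated set, since after the first record the compressor mostly encodes long back-references interrupted by the occasional substitution.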

jtackm · November 25, 2022, 8:31am

Also consider adding amplicon data if it fits your use case (its characteristics are quite different from whole-genome shotgun data). Microbiome research has produced more than 1 million such samples by now, typically uploaded to NCBI SRA or ENA.

M-PERSIC · November 25, 2022, 8:56am

@jakobnissen @jtackm Thank you both for your help! While I was doing some research I actually found some references to an NCBI FTP server containing almost 900 MB of *.fna file data! That is far more than I need, but I will definitely be creating some artifacts from it.

jakobnissen · November 25, 2022, 9:02am

These look to be assemblies of microbes from the human gut. Note that these will have completely different compression characteristics from e.g. genomic repeats or a selection of variants of the same sequence.

M-PERSIC · November 25, 2022, 9:12am

Hm, that is something I should mention in the Discussion section of my paper. As this is for a small research project at uni, I think FormatSpecimens.jl might actually work well enough to try out Zstd dictionary compression. Either option should work fine for the purposes of my paper!

fcriscuo · December 11, 2022, 8:44pm

I just saw your request, but if you are still looking for real sequences in FASTA format, you can download a file of 56K+ DNA sequences from the Sanger Lab’s Catalog of Somatic Mutations in Cancer (COSMIC) database (Download Files). You’ll have to register for a free account. RNA sequences in FASTA files are typically stored as their reverse-transcribed DNA equivalents. Also, sequences from the minus strand are usually reoriented to the 5′ to 3′ direction of the positive strand (Negative Strand Coordinates in Fasta Files?). Hope this helps.
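As a small illustration of those two conventions, here is a sketch in plain Python. It assumes that "reoriented" means taking the reverse complement, and that an RNA sequence is written as DNA simply by replacing U with T; the function names `rna_to_dna` and `reverse_complement` are made up for this example and are not taken from COSMIC or any BioJulia package.

```python
# Complement table for unambiguous bases plus N, in both cases.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def rna_to_dna(seq: str) -> str:
    """Write an RNA sequence in the DNA alphabet (U becomes T)."""
    return seq.replace("U", "T").replace("u", "t")


def reverse_complement(seq: str) -> str:
    """Return the opposite strand of a DNA sequence, read 5' to 3'."""
    return seq.translate(_COMPLEMENT)[::-1]


if __name__ == "__main__":
    print(rna_to_dna("AUGGCU"))
    print(reverse_complement("ATGC"))
```

Either convention matters for a compression benchmark mainly because it keeps the alphabet down to A, C, G, T (and N), so RNA-derived files may not look very different from DNA files at the byte level.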

M-PERSIC · December 15, 2022, 6:21am

Thank you! Will keep it bookmarked for later use.
