I am trying to rapidly load data that is stored within .tar.xz files. Specifically, I am working with the genomic sequences and associated metadata for SARS-CoV-2 from GISAID. These come as separate .tar.xz files containing the following:
File #1: genetic sequences in FASTA format (.fa extension). Essentially a giant text file. File also contains a .txt README and a .html terms-of-use file.
File #2: metadata in .tsv format (a tab-delimited .csv, basically). Archive also contains a .txt README.
I've been accessing the data using Tar.jl. Because the two archives contain files other than the ones I'm specifically interested in, I'm also using TarIterators.jl to select just the files I want. Here's a basic snippet of the FASTA-reading code I've got so far:
```julia
using TranscodingStreams, Tar, CodecXz, TarIterators

msa = raw"[file path as string literal]"

open(msa) do stream
    # Decompress the .xz layer on the fly
    io = TranscodingStream(XzDecompressor(), stream)
    # Select only the .fa entry from the tarball
    io = open(TarIterator(io, x -> occursin(".fa", x.path)))
    for line in eachline(io)
        # [at this point I'd pass the line to a data structure]
    end
end
```
The code is working; the trouble is that it’s not nearly as fast as I’d want. I’m coming over from Python, and on my laptop, my Cython implementation of the FASTA parser is about 3x faster than the Julia code I’ve shown above. (Around 5.4 seconds to read 10k sequences in Julia vs. around 1.8 seconds to read 10k sequences in Cython.)
Note: the uncompressed size of the FASTA file is approaching 500 GB now, so loading everything into memory definitely isn't an option here.
Is there anything I should be doing differently to access this data that will speed things up? As-is, just reading through the file (contains >16 million sequences as of today) will take about 2.5 hours, so I’d definitely like to improve that!
If needed, I can generate small representative files containing mock data and share them as well.
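For reference, here's roughly how I'd build such a mock archive from the command line (a sketch using standard `tar` with xz compression; the filenames and FASTA headers below are made up for illustration, not real GISAID identifiers):

```shell
# Build a tiny mock archive mirroring the layout of File #1:
# one FASTA file plus a README, compressed as .tar.xz
mkdir -p mockdata

# Two mock sequences with made-up headers
printf '>mock/seq1\nACGTACGTACGT\n>mock/seq2\nACGTTTTTACGT\n' > mockdata/sequences.fa
printf 'mock readme\n' > mockdata/README.txt

# -J selects xz compression; -C adds the files without the mockdata/ prefix
tar -C mockdata -cJf mock_sequences.tar.xz sequences.fa README.txt
```

The resulting `mock_sequences.tar.xz` can then be fed straight into the Julia snippet above for testing.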
Thanks for any help y’all can provide.
Versions of the packages I’m using: