Hi there,
I am trying to rapidly load data that is stored within .tar.xz files. Specifically, I am working with the genomic sequences and associated metadata for SARS-CoV-2 from GISAID. These come as separate .tar.xz files containing the following:
File #1: genetic sequences in FASTA format (.fa extension). Essentially a giant text file. The archive also contains a .txt README and a .html terms-of-use file.
File #2: metadata in .tsv format (a tab-delimited .csv, basically). Archive also contains a .txt README.
I've been accessing the data using TranscodingStreams.jl, CodecXz.jl, and Tar.jl. Also, because the two archives contain files other than the ones I am specifically interested in, I'm using TarIterators.jl to select the specific files I want. Here's a basic snippet of the FASTA-reading code I've gotten to thus far:
using TranscodingStreams, Tar, CodecXz, TarIterators

msa = raw"[file path as string literal]"

open(msa) do stream
    # Decompress the .xz layer on the fly
    io = TranscodingStream(XzDecompressor(), stream)
    # Open the first archive entry whose path contains ".fa"
    io = open(TarIterator(io, x -> occursin(".fa", x.path)))
    for line in eachline(io)
        # at this point I'd pass the line to a data structure
    end
end
The code is working; the trouble is that it's not nearly as fast as I'd want. I'm coming over from Python, and on my laptop, my Cython implementation of the FASTA parser is about 3x faster than the Julia code I've shown above. (Around 5.4 seconds to read 10k sequences in Julia vs. around 1.8 seconds to read 10k sequences in Cython.)
Note, the uncompressed size of the FASTA file is getting close to 500 GB now, so putting everything in memory definitely isn't an option here.
Is there anything I should be doing differently to access this data that will speed things up? As-is, just reading through the file (which contains >16 million sequences as of today) will take about 2.5 hours, so I'd definitely like to improve that!
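For reference, the multi-hour figure is a linear extrapolation from the 10k-sequence timing. Here is a quick sketch of that arithmetic, assuming throughput stays constant across the whole file and using 16 million as a lower bound on the sequence count (the variable names are just for illustration):

```julia
# Back-of-the-envelope estimate; assumes read time scales linearly with sequence count.
seconds_per_batch = 5.4          # measured: time to read 10k sequences in Julia
batch_size        = 10_000       # sequences per timed batch
total_sequences   = 16_000_000   # lower bound; the file holds >16 million sequences

estimated_seconds = total_sequences / batch_size * seconds_per_batch
estimated_hours   = estimated_seconds / 3600

println(round(estimated_hours; digits = 1), " hours")
```

That comes out to roughly 2.4 hours at exactly 16 million sequences, so a bit more for the actual file. The same scaling with the Cython timing (1.8 s per 10k) would be about a third of that.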
If needed, I can generate small representative files containing mock data and share them as well.
Thanks for any help y'all can provide.
Versions of the packages I'm using:
Julia: v1.9.3
CodecXz: v0.7.0
TarIterators: v0.2.2
TranscodingStreams: v0.9.13
Tar: v1.10.0