Performant reading of .tar.xz files

Hi there,

I am trying to rapidly load data that is stored within .tar.xz files. Specifically, I am working with the genomic sequences and associated metadata for SARS-CoV-2 from GISAID. These come as separate .tar.xz files containing the following:

File #1: genetic sequences in FASTA format (.fa extension). Essentially a giant text file. File also contains a .txt README and a .html terms-of-use file.
File #2: metadata in .tsv format (a tab-delimited .csv, basically). Archive also contains a .txt README.

I’ve been accessing the data using TranscodingStreams.jl, CodecXz.jl, and Tar.jl. Also, because the two archives contain files other than the ones I am specifically interested in, I’m using TarIterators.jl to select the specific files I want. Here’s a basic snippet of the FASTA-reading code I’ve gotten to thus far:

using TranscodingStreams, Tar, CodecXz, TarIterators
msa = raw[file path as string literal]
open(msa) do stream
    io = TranscodingStream(XzDecompressor(), stream)
    io = open(TarIterator(io, x -> occursin(".fa", x.path)))
    for line in eachline(io)
        [at this point I'd pass the line to a data structure]
    end
end

The code is working; the trouble is that it’s not nearly as fast as I’d want. I’m coming over from Python, and on my laptop, my Cython implementation of the FASTA parser is about 3x faster than the Julia code I’ve shown above. (Around 5.4 seconds to read 10k sequences in Julia vs. around 1.8 seconds to read 10k sequences in Cython.)

Note, the uncompressed size of the FASTA file is getting close to 500 GB now, so putting everything in memory definitely isn’t an option here.

Is there anything I should be doing differently to access this data that will speed things up? As-is, just reading through the file (contains >16 million sequences as of today) will take about 2.5 hours, so I’d definitely like to improve that!

If needed, I can generate small representative files containing mock data and share them as well.

Thanks for any help y’all can provide.

Versions of the packages I’m using:
Julia: v1.9.3
CodecXz: v0.7.0
TarIterators: v0.2.2
TranscodingStreams: v0.9.13
Tar: v1.10.0

2 Likes

Welcome! Fun problem.

Just curious, have you profiled to see where your time is going?

Thank you! Glad to be here :slight_smile:

I tried using StatProfilerHTML.jl. Looks like 23% of CPU time is spent in line 1061 of array.jl; it’s the _growend!(a, 1) call in push!.

Almost all the rest is spent in BoundedStreams. 28% of CPU time is in Base.eof, calling eof. Another 45% is in Base.read, calling either bytesavailable or read.

good news incoming then: Add `Memory` type by oscardssmith · Pull Request #51319 · JuliaLang/julia · GitHub will hopefully get into 1.11 and makes _growend! about 2x faster. Not sure yet about the rest of the overhead.

It would be good to get a sample file so I (or someone else) can take a closer look.

Good news indeed on the _growend1 overhead if it makes it in!

I generated an example file and posted it in this Google Drive folder. Its size is about 5 MB, while the FASTA-formatted text file inside is about 290 MB uncompressed.

It’s mock data since the source database for my real data has strict sharing rules, but the format is the same: lines beginning with “>” are sequence headers, and all following lines (up to the next “>” line) are that header’s sequence. So, “>Sequence1”, then lines of DNA/RNA/amino acid characters, then “>Sequence2”, and so on. This particular FASTA file has 10,000 sequences and 20k lines total, and it profiled the same as the real file I was working with. I used the same kinds of characters to keep encoding consistent as well.

1 Like

Not sure if parsing has anything to do with your bottleneck, but if you’re reading things in as strings and then doing stuff, you might have better luck with FASTX.jl and BioSequences.jl

2 Likes

FASTX.jl does indeed work great for uncompressed FASTA files. If I manually extract that .tar.xz file I shared and then read it with FASTA.Reader, that takes about 0.27 sec on my laptop vs about 0.55 sec for my Cythonized FASTA reader (also on the uncompressed file). So far, so good!

The issue seems to be decompressing that .xz format (specifically, it’s a LZMA2:23 CRC64 codec). I’m able to (slowly) read it line-by-line using the code in my original post. However, I’m not having much luck with FASTA.Reader, as it doesn’t like anything I try to pass to it after opening the file. Forgive me if I’m missing something obvious as I’m new to Julia, but the method signature for FASTA.Reader says it takes a TranscodingStream, so that’s what I have been trying to give to it.

The main two errors I’ve been getting are:

sequences = begin
	stream = open(xz_file)
	io = TranscodingStream(XzDecompressor(), stream)
	io = open(TarIterator(io, x -> occursin(".fa", x.path)))
	seqs = collect(FASTA.Reader(io))
end

leading to

MethodError: no method matching isopen(::BoundedStreams.BoundedInputStream{TranscodingStreams.TranscodingStream{CodecXz.XzDecompressor, IOStream}})

Closest candidates are:

isopen(!Matched::Union{Base.DevNull, Core.CoreSTDERR, Core.CoreSTDOUT})
@ Base coreio.jl:24
isopen(!Matched::Union{Base.AsyncCondition, Timer})
@ Base asyncevent.jl:160
isopen(!Matched::TranscodingStreams.TranscodingStream)

while

sequences = begin
	stream = open(xz_file)
	io = XzDecompressorStream(stream)
	seqs = collect(FASTA.Reader(io))
end

leads to

Error when parsing FASTX file. Saw unexpected byte '.' on line 1

    1. error(::String)@error.jl:35
    2. throw_parser_error(::Vector{UInt8}, ::Int64, ::Int64)@FASTX.jl:122
    3. macro expansion@readrecord.jl:88[inlined]
    4. readrecord!(::TranscodingStreams.TranscodingStream{CodecXz.XzDecompressor, IOStream}, ::FASTX.FASTA.Record, ::Tuple{Int64, Int64})@stream.jl:83
    5. _read!@reader.jl:104[inlined]
    6. iterate(::FASTX.FASTA.Reader{TranscodingStreams.TranscodingStream{CodecXz.XzDecompressor, IOStream}}, ::Nothing)@reader.jl:79
    7. iterate@reader.jl:79[inlined]
    8. _collect(::UnitRange{Int64}, ::FASTX.FASTA.Reader{TranscodingStreams.TranscodingStream{CodecXz.XzDecompressor, IOStream}}, ::Base.HasEltype, ::Base.SizeUnknown)@array.jl:718
    9. collect(::FASTX.FASTA.Reader{TranscodingStreams.TranscodingStream{CodecXz.XzDecompressor, IOStream}})@array.jl:707

I also ran into a stacktrace that kept saying things like “unsafe read,” but I haven’t been able to reproduce it.

(Also, I certainly wouldn’t use collect() with the real, giant data file. It’s just for convenience here!)

2 Likes

Seems to me that the FASTX code is calling isopen on the stream you pass to it but this was not implemented for the BoundedInputStream type that the Tar iterator produces. You could try to work around this by defining

Base.isopen(b::BoundedStreams.BoundedInputStream) = isopen(somehow_get_the_wrapped_stream_from_b)

If it works after this I’d open an issue at BoundedStreams that some part of the stream interface is missing, which is what I assume isopen is part of, but I’m not sure

I don’t think Tar.jl at this time supports reading the files in the tarball. At least I couldn’t find any functionality to do this when reading the documentation (also see: https://github.com/JuliaIO/Tar.jl/pull/95)

You’ll need to either:

  • Extract the tarball to files, then use FASTX.jl on the extracted files, or
  • Implement something like https://github.com/JuliaIO/Tar.jl/pull/95 yourself (I’m sure the PR will be very appreciated, although it’s a lot of work!)

I thought that’s what TarIterators is for.

This worked as I assumed:

julia> using TarIterators.BoundedStreams

julia> Base.isopen(b::BoundedStreams.BoundedInputStream) = isopen(b.source)

julia> sequences = begin
               stream = open("example_file.tar.xz")
               io = TranscodingStream(XzDecompressor(), stream)
               io2 = open(TarIterator(io, x -> occursin(".fa", x.path)))
               seqs = collect(FASTA.Reader(io2))
       end
10000-element Vector{FASTX.FASTA.Record}:
 FASTX.FASTA.Record("Sequence0", "CCTCAGGCCGT-ATTGGATGTCCGTCAG-ATAGATAACC…")
 FASTX.FASTA.Record("Sequence1", "GCAAGTCCCCTGCAGCGTAGTAAGCATTTACGCATGATG…")
 FASTX.FASTA.Record("Sequence2", "GCAAGTCCCCTGCAGCGTAGTAAGCATTTACGCATGATG…")
 FASTX.FASTA.Record("Sequence3", "CGGTGACGGAGTACCTGTGGCTGGACGGAGTGATGATCG…")
 FASTX.FASTA.Record("Sequence4", "GACGCATCGCTGCCGTCGTATCCACCGCCAATCATTTTC…")
 FASTX.FASTA.Record("Sequence5", "TGTCTTGACGTATGGTTATCACCAATATGTATCACATCA…")
 FASTX.FASTA.Record("Sequence6", "AGGGAAAGCTTACTTTATCATACTCACCCGAATGGCTGT…")
 FASTX.FASTA.Record("Sequence7", "GCCAGAAATGGTCTCCCAATGA-GATG-TCTCTTATCTT…")
 FASTX.FASTA.Record("Sequence8", "TGGAGAAGGCCTTCTTGACCACTATTATCCAG-CTACTC…")
 FASTX.FASTA.Record("Sequence9", "AGATTATCCCCATGAATTCTCTCTCGAGACTCTCGTGAG…")
 FASTX.FASTA.Record("Sequence10", "GCGAAACTTAGCGCACCACCTATTAACCAATAGAGCTAG…")
 FASTX.FASTA.Record("Sequence11", "CAACACATCTCCGCCGTGGGACAAGTCCGTGGCAGTTCG…")
3 Likes

And how’s the speeeeed?

I think it was pretty bad and allocated a lot but I had no time to check

So is the TLDR that we need a better (faster) tar.xz reader?

Thanks all! jules’ code worked for reading the files using FASTA.Reader. It’s about 30% slower than the Cython implementation I have, but at least it’s functional.

Going into this, I didn’t realize how niche these files are. Almost everyone uses GZip rather than the LZMA-based encoding in .tar.xz, and indeed, tons of packages in many programming languages support GZip right out of the box. The real FASTA files from GISAID are >400 GB uncompressed though, so crunching that down to ≈1 GB with LZMA is definitely useful for a lot of reasons.

1 Like

yeah LXMA is pretty rare because the compression is typically only ~50% better than gzip (<20% better than ZSTD on the higher settings) and the decompression speed is pretty awful. I realize that you might not be able to switch people’s pipelines, but if you do have control over the data source, I would recommend seeing how ZSTD compares. It will likely be a little bit larger than LZMA, but you should be able to read it at over a gigabyte per second.

2 Likes