Performant reading of .tar.xz files

trace_forms · September 15, 2023, 6:31pm

Hi there,

I am trying to rapidly load data that is stored within .tar.xz files. Specifically, I am working with the genomic sequences and associated metadata for SARS-CoV-2 from GISAID. These come as separate .tar.xz files containing the following:

File #1: genetic sequences in FASTA format (.fa extension). Essentially a giant text file. File also contains a .txt README and a .html terms-of-use file.
File #2: metadata in .tsv format (a tab-delimited .csv, basically). Archive also contains a .txt README.

I’ve been accessing the data using TranscodingStreams.jl, CodecXz.jl, and Tar.jl. Also, because the two archives contain files other than the ones I am specifically interested in, I’m using TarIterators.jl to select the specific files I want. Here’s a basic snippet of the FASTA-reading code I’ve gotten to thus far:

using TranscodingStreams, Tar, CodecXz, TarIterators
msa = raw[file path as string literal]
open(msa) do stream
    io = TranscodingStream(XzDecompressor(), stream)
    io = open(TarIterator(io, x -> occursin(".fa", x.path)))
    for line in eachline(io)
        [at this point I'd pass the line to a data structure]
    end
end

The code is working; the trouble is that it’s not nearly as fast as I’d want. I’m coming over from Python, and on my laptop, my Cython implementation of the FASTA parser is about 3x faster than the Julia code I’ve shown above. (Around 5.4 seconds to read 10k sequences in Julia vs. around 1.8 seconds to read 10k sequences in Cython.)

Note, the uncompressed size of the FASTA file is getting close to 500 GB now, so putting everything in memory definitely isn’t an option here.

Is there anything I should be doing differently to access this data that will speed things up? As-is, just reading through the file (contains >16 million sequences as of today) will take about 2.5 hours, so I’d definitely like to improve that!

If needed, I can generate small representative files containing mock data and share them as well.

Thanks for any help y’all can provide.

Versions of the packages I’m using:
Julia: v1.9.3
CodecXz: v0.7.0
TarIterators: v0.2.2
TranscodingStreams: v0.9.13
Tar: v1.10.0

mbauman · September 15, 2023, 8:11pm

Welcome! Fun problem.

Just curious, have you profiled to see where your time is going?

trace_forms · September 16, 2023, 1:30am

Thank you! Glad to be here

I tried using StatProfilerHTML.jl. Looks like 23% of CPU time is spent in line 1061 of array.jl; it’s the _growend!(a, 1) call in push!.

Almost all the rest is spent in BoundedStreams. 28% of CPU time is in Base.eof, calling eof. Another 45% is in Base.read, calling either bytesavailable or read.

Oscar_Smith · September 16, 2023, 1:47am

good news incoming then: Add `Memory` type by oscardssmith · Pull Request #51319 · JuliaLang/julia · GitHub will hopefully get into 1.11 and makes _growend! about 2x faster. Not sure yet about the rest of the overhead.

Oscar_Smith · September 16, 2023, 1:54am

It would be good to get a sample file so I (or someone else) can take a closer look.

trace_forms · September 16, 2023, 5:53pm

Good news indeed on the _growend1 overhead if it makes it in!

I generated an example file and posted it in this Google Drive folder. Its size is about 5 MB, while the FASTA-formatted text file inside is about 290 MB uncompressed.

It’s mock data since the source database for my real data has strict sharing rules, but the format is the same: lines beginning with “>” are sequence headers, and all following lines (up to the next “>” line) are that header’s sequence. So, “>Sequence1”, then lines of DNA/RNA/amino acid characters, then “>Sequence2”, and so on. This particular FASTA file has 10,000 sequences and 20k lines total, and it profiled the same as the real file I was working with. I used the same kinds of characters to keep encoding consistent as well.

kevbonham · September 17, 2023, 11:16am

Not sure if parsing has anything to do with your bottleneck, but if you’re reading things in as strings and then doing stuff, you might have better luck with FASTX.jl and BioSequences.jl

trace_forms · September 18, 2023, 4:55am

FASTX.jl does indeed work great for uncompressed FASTA files. If I manually extract that .tar.xz file I shared and then read it with FASTA.Reader, that takes about 0.27 sec on my laptop vs about 0.55 sec for my Cythonized FASTA reader (also on the uncompressed file). So far, so good!

The issue seems to be decompressing that .xz format (specifically, it’s a LZMA2:23 CRC64 codec). I’m able to (slowly) read it line-by-line using the code in my original post. However, I’m not having much luck with FASTA.Reader, as it doesn’t like anything I try to pass to it after opening the file. Forgive me if I’m missing something obvious as I’m new to Julia, but the method signature for FASTA.Reader says it takes a TranscodingStream, so that’s what I have been trying to give to it.

The main two errors I’ve been getting are:

sequences = begin
	stream = open(xz_file)
	io = TranscodingStream(XzDecompressor(), stream)
	io = open(TarIterator(io, x -> occursin(".fa", x.path)))
	seqs = collect(FASTA.Reader(io))
end

leading to

MethodError: no method matching isopen(::BoundedStreams.BoundedInputStream{TranscodingStreams.TranscodingStream{CodecXz.XzDecompressor, IOStream}})

Closest candidates are:

isopen(!Matched::Union{Base.DevNull, Core.CoreSTDERR, Core.CoreSTDOUT})
@ Base coreio.jl:24
isopen(!Matched::Union{Base.AsyncCondition, Timer})
@ Base asyncevent.jl:160
isopen(!Matched::TranscodingStreams.TranscodingStream)

while

sequences = begin
	stream = open(xz_file)
	io = XzDecompressorStream(stream)
	seqs = collect(FASTA.Reader(io))
end

leads to

Error when parsing FASTX file. Saw unexpected byte '.' on line 1

    1. error(::String)@error.jl:35
    2. throw_parser_error(::Vector{UInt8}, ::Int64, ::Int64)@FASTX.jl:122
    3. macro expansion@readrecord.jl:88[inlined]
    4. readrecord!(::TranscodingStreams.TranscodingStream{CodecXz.XzDecompressor, IOStream}, ::FASTX.FASTA.Record, ::Tuple{Int64, Int64})@stream.jl:83
    5. _read!@reader.jl:104[inlined]
    6. iterate(::FASTX.FASTA.Reader{TranscodingStreams.TranscodingStream{CodecXz.XzDecompressor, IOStream}}, ::Nothing)@reader.jl:79
    7. iterate@reader.jl:79[inlined]
    8. _collect(::UnitRange{Int64}, ::FASTX.FASTA.Reader{TranscodingStreams.TranscodingStream{CodecXz.XzDecompressor, IOStream}}, ::Base.HasEltype, ::Base.SizeUnknown)@array.jl:718
    9. collect(::FASTX.FASTA.Reader{TranscodingStreams.TranscodingStream{CodecXz.XzDecompressor, IOStream}})@array.jl:707

I also ran into a stacktrace that kept saying things like “unsafe read,” but I haven’t been able to reproduce it.

(Also, I certainly wouldn’t use collect() with the real, giant data file. It’s just for convenience here!)

jules · September 18, 2023, 5:05am

Seems to me that the FASTX code is calling isopen on the stream you pass to it but this was not implemented for the BoundedInputStream type that the Tar iterator produces. You could try to work around this by defining

Base.isopen(b::BoundedStreams.BoundedInputStream) = isopen(somehow_get_the_wrapped_stream_from_b)

If it works after this I’d open an issue at BoundedStreams that some part of the stream interface is missing, which is what I assume isopen is part of, but I’m not sure

jakobnissen · September 18, 2023, 6:27am

I don’t think Tar.jl at this time supports reading the files in the tarball. At least I couldn’t find any functionality to do this when reading the documentation (also see: https://github.com/JuliaIO/Tar.jl/pull/95)

You’ll need to either:

Extract the tarball to files, then use FASTX.jl on the extracted files, or
Implement something like https://github.com/JuliaIO/Tar.jl/pull/95 yourself (I’m sure the PR will be very appreciated, although it’s a lot of work!)

jules · September 18, 2023, 7:37am

I thought that’s what TarIterators is for.

This worked as I assumed:

julia> using TarIterators.BoundedStreams

julia> Base.isopen(b::BoundedStreams.BoundedInputStream) = isopen(b.source)

julia> sequences = begin
               stream = open("example_file.tar.xz")
               io = TranscodingStream(XzDecompressor(), stream)
               io2 = open(TarIterator(io, x -> occursin(".fa", x.path)))
               seqs = collect(FASTA.Reader(io2))
       end
10000-element Vector{FASTX.FASTA.Record}:
 FASTX.FASTA.Record("Sequence0", "CCTCAGGCCGT-ATTGGATGTCCGTCAG-ATAGATAACC…")
 FASTX.FASTA.Record("Sequence1", "GCAAGTCCCCTGCAGCGTAGTAAGCATTTACGCATGATG…")
 FASTX.FASTA.Record("Sequence2", "GCAAGTCCCCTGCAGCGTAGTAAGCATTTACGCATGATG…")
 FASTX.FASTA.Record("Sequence3", "CGGTGACGGAGTACCTGTGGCTGGACGGAGTGATGATCG…")
 FASTX.FASTA.Record("Sequence4", "GACGCATCGCTGCCGTCGTATCCACCGCCAATCATTTTC…")
 FASTX.FASTA.Record("Sequence5", "TGTCTTGACGTATGGTTATCACCAATATGTATCACATCA…")
 FASTX.FASTA.Record("Sequence6", "AGGGAAAGCTTACTTTATCATACTCACCCGAATGGCTGT…")
 FASTX.FASTA.Record("Sequence7", "GCCAGAAATGGTCTCCCAATGA-GATG-TCTCTTATCTT…")
 FASTX.FASTA.Record("Sequence8", "TGGAGAAGGCCTTCTTGACCACTATTATCCAG-CTACTC…")
 FASTX.FASTA.Record("Sequence9", "AGATTATCCCCATGAATTCTCTCTCGAGACTCTCGTGAG…")
 FASTX.FASTA.Record("Sequence10", "GCGAAACTTAGCGCACCACCTATTAACCAATAGAGCTAG…")
 FASTX.FASTA.Record("Sequence11", "CAACACATCTCCGCCGTGGGACAAGTCCGTGGCAGTTCG…")

kevbonham · September 18, 2023, 11:58am

And how’s the speeeeed?

jules · September 18, 2023, 12:32pm

I think it was pretty bad and allocated a lot but I had no time to check

Oscar_Smith · September 18, 2023, 12:50pm

So is the TLDR that we need a better (faster) tar.xz reader?

trace_forms · September 18, 2023, 1:50pm

Thanks all! jules’ code worked for reading the files using FASTA.Reader. It’s about 30% slower than the Cython implementation I have, but at least it’s functional.

Going into this, I didn’t realize how niche these files are. Almost everyone uses GZip rather than the LZMA-based encoding in .tar.xz, and indeed, tons of packages in many programming languages support GZip right out of the box. The real FASTA files from GISAID are >400 GB uncompressed though, so crunching that down to ≈1 GB with LZMA is definitely useful for a lot of reasons.

Oscar_Smith · September 18, 2023, 1:55pm

yeah LXMA is pretty rare because the compression is typically only ~50% better than gzip (<20% better than ZSTD on the higher settings) and the decompression speed is pretty awful. I realize that you might not be able to switch people’s pipelines, but if you do have control over the data source, I would recommend seeing how ZSTD compares. It will likely be a little bit larger than LZMA, but you should be able to read it at over a gigabyte per second.

Topic		Replies	Views
How to read files from a compressed file (zip/gz) lazily? New to Julia question , lazy-evaluation , zip	3	2483	January 14, 2021
Streaming gziped file to FASTQ.Reader - where to add method? General Usage question , biology , input-output	2	1055	March 20, 2020
CSV Reading (rewrite in C?) Internals & Design	50	5131	October 1, 2018
ANN: Tar.jl Package Announcements	20	3417	April 24, 2021
[ANN] TranscodingStreams.jl - new APIs to zlib, bzip2, xz, zstd and more! Community package , announcement	2	1472	August 18, 2017

Performant reading of .tar.xz files

Related topics