Speed comparison on reading a gzip file

Hi there,

I timed counting the lines of a large gzip file in three languages:

Perl using shell piping took ~32 seconds.
Python using shell piping took ~36 seconds.
Julia using shell piping took ~70 seconds.
Julia using GzipDecompressorStream from CodecZlib, which more than one Julia user has recommended to me, took ~110 seconds.

The gz file is from https://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2accession.gz.

Here is the code.
Perl:

perl -e 'open( DATA, "pigz -cd ~/in2/swxx/sj/db/ncbi/gene/gene2accession.gz |");
while( <DATA>) { $n = $n + 1}
print( $n, "\n");'

Python:

import subprocess
with subprocess.Popen( "pigz -cd ~/in2/swxx/sj/db/ncbi/gene/gene2accession.gz", shell = True, stdout = subprocess.PIPE) as gz:
    n1 = 0
    for line1 in gz.stdout:
        n1 = n1 + 1
    print( n1)

Julia:

function fun1( file1)
    open( `pigz -cd $( expanduser( file1))`) do io
        n1 = 0
        for line1 in eachline( io)
            n1 = n1 + 1
        end
        print( n1, '\n')
    end
end
fun1( "~/in2/swxx/sj/db/ncbi/gene/gene2accession.gz")

Julia using GzipDecompressorStream from CodecZlib:

using CodecZlib, TranscodingStreams
function fun1( file1)
    io1 = GzipDecompressorStream( open( expanduser( file1)))
    n1 = 0
    for line1 in eachline( io1)
        n1 = n1 + 1
    end
    print( n1, '\n')
end
fun1( "~/in2/swxx/sj/db/ncbi/gene/gene2accession.gz")

Did you run the Julia code twice to make sure you aren’t including compilation time in your Julia timings? That said, apparently you’re not the first one to observe this (e.g. see this discussion on reading fastq.gz files).

If there really is that big a difference, that’s a big opportunity for improvement, since many file types in bioinformatics come gzipped.


I took a quick profile, and most of the time is spent in a single ccall (CodecZlib.jl/libz.jl at commit a777d8f53aebd223fe7c7399436a5050784d210f in JuliaIO/CodecZlib.jl on GitHub). Interestingly, it’s reached from an eof call, but a brief inspection suggested it’s behaving sensibly. So short of rewriting the C code in a more optimal form in Julia, it’s nontrivial to know what to do here.

That said, this isn’t really my area so others who need this more should take a look.


I just reran the function call, and the times were close to the previous ones.


Is there any update on this? I am planning to process large gzip-compressed FASTQ files.

You could try LibDeflate.jl, which is maintained by a fellow bioinformatician, so probably with needs similar to yours.


But libdeflate does not support streaming.
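For anyone unsure what "streaming" means here: the decompressor is fed the compressed file piece by piece, so neither the compressed nor the decompressed data has to fit in memory at once. Below is a minimal sketch in Python using the standard zlib module. The function name iter_decompressed and the chunk size are my own choices for illustration, the demo uses a tiny in-memory sample rather than the real file, and it assumes a single-member gzip stream.

```python
import gzip
import io
import zlib

def iter_decompressed(fileobj, chunk_size=1 << 16):
    """Yield decompressed pieces of a gzip stream read from fileobj.

    Hypothetical helper for illustration; assumes a single-member gzip file.
    """
    # MAX_WBITS | 16 tells zlib to expect a gzip header and trailer.
    decomp = zlib.decompressobj(zlib.MAX_WBITS | 16)
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        data = decomp.decompress(chunk)
        if data:
            yield data
    # Emit whatever is still buffered inside the decompressor.
    tail = decomp.flush()
    if tail:
        yield tail

# Small in-memory sample standing in for a real .gz file.
sample = gzip.compress(b"a\tb\n" * 1000)
total_bytes = sum(len(part) for part in iter_decompressed(io.BytesIO(sample), chunk_size=64))
print(total_bytes)
```

With a real file you would pass open(path, "rb") instead of the BytesIO object. A whole-buffer API, by contrast, needs the complete compressed input before it can return anything.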


As of 2023-04-17, with Julia 1.8.5, there is no improvement on the Julia side.

Tried this with Julia 1.9.4:

  • Python 35.26s
  • Julia-1 1m3s
  • Julia-2 1m9s
  • time pigz -dc gene2accession.gz | wc → 1m10s
  • time pigz -dc gene2accession.gz | wc -l → 34.6s
  • time gzip -dc gene2accession.gz | wc → 1m16s
  • time gzip -dc gene2accession.gz | wc -l → 1m15s

It seems the bottleneck is the counting step (wc vs. wc -l) rather than decompression, at least with pigz. The Julia versions take about as long as wc, and the Python version about as long as wc -l.
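To make the wc vs. wc -l distinction concrete, here is a rough Python sketch of two ways to count lines in a gzip file: iterating line by line, and reading large binary chunks and counting newline bytes (closer in spirit to wc -l). The function names, the chunk size, and the temporary sample file are assumptions for illustration. I have not timed this on gene2accession.gz, so which one is faster there is something to measure, not something this sketch shows.

```python
import gzip
import os
import tempfile

def count_by_line_iteration(path):
    """Count lines by materializing each line, like the loops in the first post."""
    n = 0
    with gzip.open(path, "rb") as f:
        for _ in f:
            n += 1
    return n

def count_by_newline_bytes(path, chunk_size=1 << 20):
    """Count newline bytes in large chunks, without splitting into lines."""
    n = 0
    with gzip.open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            n += chunk.count(b"\n")
    return n

# Small temporary sample standing in for the real file.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.gz")
    with gzip.open(path, "wb") as f:
        f.write(b"x\ty\tz\n" * 5000)
    by_lines = count_by_line_iteration(path)
    by_bytes = count_by_newline_bytes(path)

print(by_lines, by_bytes)
```

The two counts agree whenever the file ends with a newline. If the last line has no trailing newline, the byte-counting version reports one fewer, which is also how wc -l behaves.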