Speed comparison on reading a gzip file

Hi there,

I timed how long it takes to count the lines of a large gzip file in a few ways:

Perl using shell piping took ~32 seconds.
Python using shell piping took ~36 seconds.
Julia using shell piping took ~70 seconds.
Julia using GzipDecompressorStream from CodecZlib, which more than one Julia user has recommended to me, took ~110 seconds.

The gz file is from https://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2accession.gz.

Here is the code.
perl:

perl -e 'open( DATA, "pigz -cd ~/in2/swxx/sj/db/ncbi/gene/gene2accession.gz |");
while( <DATA>) { $n = $n + 1}
print( $n, "\n");'

python:

import subprocess
with subprocess.Popen( "pigz -cd ~/in2/swxx/sj/db/ncbi/gene/gene2accession.gz", shell = True, stdout = subprocess.PIPE) as gz:
    n1 = 0
    for line1 in gz.stdout:
        n1 = n1 + 1
    print( n1)
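
For comparison with the CodecZlib variant further down, the same count can also be done in-process with Python's standard-library gzip module instead of piping through pigz. This is only a sketch: the helper name count_lines_gzip and the small generated sample file are assumptions for illustration, and it is not one of the variants timed above.

```python
import gzip
import os
import tempfile


def count_lines_gzip(path):
    """Count lines in a gzip file, decompressing in-process (no pigz pipe)."""
    n = 0
    # Binary mode skips decoding each line to str, which would add overhead.
    with gzip.open(os.path.expanduser(path), "rb") as gz:
        for _ in gz:
            n += 1
    return n


if __name__ == "__main__":
    # Hypothetical stand-in for gene2accession.gz: a small generated file.
    with tempfile.TemporaryDirectory() as tmp:
        sample = os.path.join(tmp, "sample.gz")
        with gzip.open(sample, "wb") as out:
            out.write(b"a\tb\n" * 1000)
        print(count_lines_gzip(sample))
```

To try it on the real file, call count_lines_gzip( "~/in2/swxx/sj/db/ncbi/gene/gene2accession.gz") and time it the same way as the other variants. Note that gzip.open decompresses in a single thread, while pigz does its reading and checksumming in separate threads.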

julia:

function fun1( file1)
    open( `pigz -cd $( expanduser( file1))`) do io
        n1 = 0
        for line1 in eachline( io)
            n1 = n1 + 1
        end
        print( n1, '\n')
    end
end
fun1( "~/in2/swxx/sj/db/ncbi/gene/gene2accession.gz")

julia using GzipDecompressorStream from CodecZlib:

using CodecZlib, TranscodingStreams
function fun1( file1)
    io1 = GzipDecompressorStream( open( expanduser( file1)))
    n1 = 0
    for line1 in eachline( io1)
        n1 = n1 + 1
    end
    print( n1, '\n')
end
fun1( "~/in2/swxx/sj/db/ncbi/gene/gene2accession.gz")

Did you run the Julia code twice to make sure you aren’t including compilation time in your Julia timings? That said, apparently you’re not the first one to observe this (e.g. see this discussion on reading fastq.gz files).

If there really is that big a difference, that’s a big opportunity for improvement, since many file types in bioinformatics come gzipped.


I took a quick profile, and most of the time is spent in a single ccall: CodecZlib.jl/libz.jl at a777d8f53aebd223fe7c7399436a5050784d210f · JuliaIO/CodecZlib.jl · GitHub. Interestingly, it’s coming from an eof call, but a brief inspection suggested it’s behaving sensibly. So short of rewriting the C code in a more optimal form in Julia, it’s nontrivial to know what to do here.

That said, this isn’t really my area so others who need this more should take a look.


I just reran the function call, and the times were close to the previous ones.
