Speed comparison on reading a gzip file

calvin · March 4, 2021, 11:12am

Hi there,

Perl using shell piping took ~32 seconds.
Python using shell piping took ~36 seconds.
Julia using shell piping took ~70 seconds.
Julia using GzipDecompressorStream from CodecZlib, which has been recommended to me by more than one Julia user, took ~110 seconds.

The gz file is from https://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2accession.gz.

Here is the code for each.
perl:

perl -e 'open( DATA, "pigz -cd ~/in2/swxx/sj/db/ncbi/gene/gene2accession.gz |");
while( <DATA>) { $n = $n + 1}
print( $n, "\n");'

python:

import subprocess
with subprocess.Popen( "pigz -cd ~/in2/swxx/sj/db/ncbi/gene/gene2accession.gz", shell = True, stdout = subprocess.PIPE) as gz:
    n1 = 0
    for line1 in gz.stdout:
        n1 = n1 + 1
    print( n1)

julia:

function fun1( file1)
    open( `pigz -cd $( expanduser( file1))`) do io
        n1 = 0
        for line1 in eachline( io)
            n1 = n1 + 1
        end
        print( n1, '\n')
    end
end
fun1( "~/in2/swxx/sj/db/ncbi/gene/gene2accession.gz")

julia using GzipDecompressorStream from CodecZlib:

using CodecZlib, TranscodingStreams
function fun1( file1)
    io1 = GzipDecompressorStream( open( expanduser( file1)))
    n1 = 0
    for line1 in eachline( io1)
        n1 = n1 + 1
    end
    print( n1, '\n')
end
fun1( "~/in2/swxx/sj/db/ncbi/gene/gene2accession.gz")

jonathanBieler · March 4, 2021, 3:32pm

Did you run the Julia code twice to make sure you aren’t including compilation time in your Julia timings? That said, apparently you’re not the first one to observe this (e.g. see this discussion on reading fastq.gz files).

If there’s really that big a difference, that’s a big opportunity for improvement, since many file types in bioinformatics come gzipped.

tim.holy · March 4, 2021, 3:58pm

I took a quick profile, and most of the time is spent in a single ccall: CodecZlib.jl/libz.jl at a777d8f53aebd223fe7c7399436a5050784d210f · JuliaIO/CodecZlib.jl · GitHub. Interestingly, it’s coming from an eof call, but a brief inspection suggested it’s behaving sensibly. So short of rewriting the C code in a more optimal form in Julia, it’s nontrivial to know what to do here.

That said, this isn’t really my area so others who need this more should take a look.

calvin · March 4, 2021, 4:05pm

I just reran the function call, and the times were close to the previous ones.

hannesbecher · December 16, 2022, 5:56pm

Is there an update on this? I am planning to process large gz-compressed FASTQ files.

giordano · December 16, 2022, 6:08pm

You could try LibDeflate.jl, which is maintained by a fellow bioinformatician, so probably with needs similar to yours.

calvin · December 21, 2022, 6:49am

But libdeflate does not support streaming.
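
For context, "streaming" here means decompressing chunk by chunk, so the whole file never has to sit in memory. A minimal sketch of the idea in Python with the standard zlib module (this is not LibDeflate or CodecZlib; the helper name stream_count_lines and the chunk size are made up for illustration, and it assumes a single-member gzip file):

```python
import gzip
import io
import zlib

def stream_count_lines(fileobj, chunk_size=1 << 16):
    """Count newlines in a gzip stream while holding only one chunk at a time."""
    # wbits = MAX_WBITS | 16 tells zlib to expect a gzip header and trailer.
    decomp = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16)
    n = 0
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        # Each compressed chunk is decompressed and counted, then discarded.
        n += decomp.decompress(chunk).count(b"\n")
    # Pick up anything still buffered inside the decompressor.
    n += decomp.flush().count(b"\n")
    return n

# Tiny in-memory stand-in for a real .gz file (hypothetical data).
data = gzip.compress(b"a\nb\nc\n")
print(stream_count_lines(io.BytesIO(data)))
```

A non-streaming API would instead need the complete compressed input (and room for the complete output) before returning anything, which is the limitation raised here for files of this size.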

calvin · April 17, 2023, 9:20am

As of 2023-04-17, with Julia 1.8.5, there is no improvement for Julia.

Xijiang_Yu · December 10, 2023, 8:45pm

Tried this with Julia 1.9.4:

Python: 35.26s
Julia-1: 1m3s
Julia-2: 1m9s
time pigz -dc gene2accession.gz | wc → 1m10s
time pigz -dc gene2accession.gz | wc -l → 34.6s
time gzip -dc gene2accession.gz | wc → 1m16s
time gzip -dc gene2accession.gz | wc -l → 1m15s

It seems the bottleneck is the wc-style line counting rather than the decompression. Julia behaves roughly like wc, while Python behaves more like wc -l.
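
The wc versus wc -l distinction can be sketched in Python. This is only an illustration of the two kinds of work, not a benchmark from this thread: count_newlines counts newline bytes in fixed-size chunks (roughly what wc -l does), while count_lines_words_bytes walks every line and also counts words and bytes (closer to plain wc). Both function names and the sample data are hypothetical.

```python
import gzip
import io

def count_newlines(stream, chunk_size=1 << 16):
    """Count newline bytes only, reading fixed-size chunks (similar to wc -l)."""
    n = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        n += chunk.count(b"\n")
    return n

def count_lines_words_bytes(stream):
    """Split the stream into lines and tally lines, words and bytes (similar to wc)."""
    lines = words = nbytes = 0
    for raw in stream:
        lines += 1
        words += len(raw.split())  # whitespace-separated fields on this line
        nbytes += len(raw)
    return lines, words, nbytes

# Hypothetical sample standing in for gene2accession.gz.
sample = gzip.compress(b"9606\t1\tA1BG\n9606\t2\tA2M\n")
print(count_newlines(gzip.open(io.BytesIO(sample))))
print(count_lines_words_bytes(gzip.open(io.BytesIO(sample))))
```

Timing both functions on the real file would show how much of the gap comes from per-line work rather than from gzip itself, assuming the decompressor is the same in both runs.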
