Slow gzip streaming in Julia but not in Python

Hi there,

I am new to Julia and came to it for performance with less code. I have been using R and ran into a simple task: read a gzipped file and do something with each line. It was slow in R, as expected, so I tried Perl first. Here is the code:

perl -e '
open( DATA, "zcat ~/in2/swxx/sj/db/ncbi/gene/gene2accession.gz|");
while( <DATA>) {
  @F = split( /\t/);
}
close( DATA);'

I found that the gzip process was using about 48% of one CPU thread. Then I tried Julia; here is the code:

function fun1( file1)
    for line1 in eachline( `zcat $( expanduser( file1))`)
        F = split( line1, '\t')
    end
end
fun1( "~/in2/swxx/sj/db/ncbi/gene/gene2accession.gz")

I found that the gzip process was again using about 48% CPU. Then I tried Python; here is the code:

import subprocess
with subprocess.Popen( "zcat ~/in2/swxx/sj/db/ncbi/gene/gene2accession.gz", shell = True, stdout = subprocess.PIPE) as gz:
    for line1 in gz.stdout:
        F = line1.split( b'\t')

Now gzip is using nearly 100% CPU, and so is Python.

I have not recorded the time taken by these three approaches, but the Python code feels noticeably faster than the other two.

How come?

How does the timed performance compare?

I’m not sure it will make much difference in this case, but it is generally bad for performance to refer to non-const global variables inside a function.

Also, I would recommend using CodecZlib.jl over shell commands.

I will record the timings later, but my impression is that the Python code is clearly faster than the other two.

For the Julia code, there is only one global variable, and it is not used repeatedly, so I guess it would not affect performance much.

The issue with non-const global variables is how they affect type inference. The type of such a global, and of every value derived from it, cannot be known at compile time, which results in slower code.

If there are still speed differences after that, it could be due to whether println buffers its I/O operations.

This could also be related to the fact that println is slower than Python’s print in some terminals: https://github.com/JuliaLang/julia/issues/36639

Do your results change if you skip the print statements?

I see. I have modified the code, but noticed no improvement in performance.

I later skipped printing but saw no improvement in performance: gzip was using ~45% CPU and Julia 100%.

Yeah, why are you calling an external command here rather than using a Julia package? I read large *.csv.gz files via streaming using CSV.jl and CodecZlib.jl, and it is very fast.

Is this an appropriate way to use CodecZlib? It is even slower…

using CSV, CodecZlib, Mmap, TranscodingStreams
function fun1( file1)
    io1 = TranscodingStream( GzipDecompressor(), open( expanduser( file1)))
    N = 10
    n1 = 0
    for line1 in eachline( io1)
        F = split( line1, '\t')
        if n1 < N # print the first 10 lines
            print( line1, '\n')
            n1 = n1 + 1
        end
    end
end
fun1( "~/in2/swxx/sj/db/ncbi/gene/gene2accession.gz")

Just a side note: on a multi-core CPU,
"unpigz -c sample.gz | ..." is ~2x faster than a simple "zcat sample.gz | ...":
https://unix.stackexchange.com/questions/363644/fastest-and-most-efficient-way-to-get-number-of-records-lines-in-a-gzip-compre/363739#363739
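
As a rough illustration of that side note, here is a minimal Python sketch that prefers unpigz when it is installed and otherwise falls back to gzip -dc. The helper names decompress_command and stream_lines are made up for this example, and whether pigz actually helps will depend on the machine and the file.

```python
import shutil
import subprocess

def decompress_command(path):
    """Return an argv list: 'unpigz -c' if pigz is installed, else 'gzip -dc'."""
    if shutil.which("unpigz"):
        return ["unpigz", "-c", path]
    return ["gzip", "-dc", path]

def stream_lines(path):
    """Yield raw (bytes) lines from the decompressor's stdout."""
    with subprocess.Popen(decompress_command(path), stdout=subprocess.PIPE) as proc:
        for line in proc.stdout:
            yield line
```

Passing an argv list instead of a shell string also avoids the shell = True used earlier, at the cost of having to expand "~" yourself (for example with os.path.expanduser).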

I know. However, gzip is not the bottleneck here.

There was a related thread recently where it was mentioned that reading stdin via a pipe can be quite slow in Julia. CodecZlib definitely seems like the more idiomatic option (just like you’d use gzip.open in Python), so hopefully someone can comment on why the performance of your code snippet isn’t amazing (lack of buffering, maybe)?
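
For comparison, the gzip.open route mentioned above might look roughly like this in Python. It is only a sketch: count_fields and make_sample are hypothetical helpers, and the sample file is synthetic data written to a temporary path, not the real gene2accession.gz.

```python
import gzip
import os
import tempfile

def count_fields(path):
    """Stream a gzipped, tab-separated file line by line; return (lines, fields)."""
    n_lines = 0
    n_fields = 0
    # "rb" yields bytes lines, matching the subprocess example earlier in the thread
    with gzip.open(path, "rb") as fh:
        for line in fh:
            fields = line.rstrip(b"\n").split(b"\t")
            n_lines += 1
            n_fields += len(fields)
    return n_lines, n_fields

def make_sample(n_rows=1000):
    """Write a small synthetic .gz file with three tab-separated columns."""
    fd, path = tempfile.mkstemp(suffix=".gz")
    os.close(fd)
    with gzip.open(path, "wb") as fh:
        for i in range(n_rows):
            fh.write(b"%d\tNC_%06d\tgene%d\n" % (i, i, i))
    return path

if __name__ == "__main__":
    sample = make_sample()
    try:
        print(count_fields(sample))
    finally:
        os.remove(sample)
```

No external process or pipe is involved here, so any slowdown that comes from reading a pipe would not show up in this variant.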

I think that’s right (you should be able to just do GzipDecompressorStream(open( expanduser( file1))), but that is equivalent).

Is the file (or something similar) publicly available? If you can post a link, I suspect you will find some people who will try to optimise it.

OK. Thanks for the answer. The file is 2+ GB and is publicly available: https://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2accession.gz.

Thanks, doing some basic timings, I think there are actually two distinct performance problems:

  1. calling eachline: simply iterating over the gzcat pipe takes ~1 minute in julia, vs ~30 seconds in Python. CodecZlib.jl seems to have similar performance to gzcat.
  2. split in Julia is itself slower than in Python

Not sure exactly what the best approach here would be.
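
One way to tell the two problems apart is to time line iteration on its own and then iteration plus split. Below is a minimal Python sketch of that measurement; the function names are hypothetical and the data is a small synthetic sample held in memory, so the absolute numbers say nothing about the real 2+ GB file.

```python
import gzip
import io
import time

def make_gzip_bytes(n_rows=20000):
    """Build a small gzipped, tab-separated sample in memory (synthetic data)."""
    rows = b"".join(b"%d\tNC_%06d\tgene%d\n" % (i, i, i) for i in range(n_rows))
    return gzip.compress(rows)

def time_iteration_only(data):
    """Decompress and iterate over lines without looking at their contents."""
    start = time.perf_counter()
    n_lines = 0
    with gzip.GzipFile(fileobj=io.BytesIO(data), mode="rb") as fh:
        for _ in fh:
            n_lines += 1
    return n_lines, time.perf_counter() - start

def time_iteration_and_split(data):
    """Same loop, but also split every line on tabs."""
    start = time.perf_counter()
    n_fields = 0
    with gzip.GzipFile(fileobj=io.BytesIO(data), mode="rb") as fh:
        for line in fh:
            n_fields += len(line.split(b"\t"))
    return n_fields, time.perf_counter() - start

if __name__ == "__main__":
    data = make_gzip_bytes()
    n_lines, t_iter = time_iteration_only(data)
    n_fields, t_both = time_iteration_and_split(data)
    print(f"iterate only:    {n_lines} lines in {t_iter:.3f}s")
    print(f"iterate + split: {n_fields} fields in {t_both:.3f}s")
```

The difference between the two timings approximates the cost of split alone; the same two-step measurement can be repeated in Julia to see which stage accounts for the gap.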

Out of curiosity, you could try LibDeflate.jl: [ANN] LibDeflate.jl and CodecBGZF.jl - really fast blocked de/compression - #8 by jakobnissen

However, streaming is not supported…

Unlike libz or gzip, libdeflate does not support streaming, and so is intended for use with files that fit in memory or for block-compressed files like bgzip.

Ah, sorry, I missed that.
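
To make the distinction concrete, here is a small sketch using Python's standard zlib and gzip modules (not libdeflate, whose API is not shown here). decompress_whole needs the entire compressed input in memory at once, while decompress_streaming consumes it chunk by chunk; both function names are invented for this example.

```python
import gzip
import zlib

def decompress_whole(data):
    """One-shot decompression: the full compressed buffer must be in memory."""
    return gzip.decompress(data)

def decompress_streaming(chunks):
    """Incremental decompression: feed compressed chunks as they arrive."""
    # MAX_WBITS | 16 tells zlib to expect a gzip header and trailer
    decomp = zlib.decompressobj(zlib.MAX_WBITS | 16)
    for chunk in chunks:
        out = decomp.decompress(chunk)
        if out:
            yield out
    tail = decomp.flush()
    if tail:
        yield tail

if __name__ == "__main__":
    payload = b"a\tb\tc\n" * 1000
    data = gzip.compress(payload)
    chunks = [data[i:i + 64] for i in range(0, len(data), 64)]
    streamed = b"".join(decompress_streaming(chunks))
    print(streamed == decompress_whole(data) == payload)
```

For a 2+ GB file read line by line, only the streaming form keeps memory use bounded, which is why a non-streaming library is a poor fit for this task.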

Maybe we could use the new zlib-ng library.

x86-64 benchmarks:
Zlib-ng is about 4x faster than zlib, and 2.1x faster than gzip for compression.
Zlib-ng is about 2.4x faster than zlib and 1.8x faster than gzip when decompressing.