Slow gzip streaming in Julia but not in Python

calvin · March 2, 2021, 4:08pm

Hi there,

I am new to Julia and came to it for performance with less code. I have been using R and ran into a simple task: read a gzipped file and do something with each line. It was slow in R, of course, so I tried Perl first. Here is the code:

perl -e '
open( DATA, "zcat ~/in2/swxx/sj/db/ncbi/gene/gene2accession.gz|");
while( <DATA>) {
  @F = split( /\t/);
}
close( DATA);'

I found that the gzip process was using about 48% of one CPU thread. Then I tried Julia; here is the code:

function fun1( file1)
    for line1 in eachline( `zcat $( expanduser( file1))`)
        F = split( line1, '\t')
    end
end
fun1( "~/in2/swxx/sj/db/ncbi/gene/gene2accession.gz")

I found that the gzip process was again using about 48% CPU. Then I tried Python; here is the code:

import subprocess
with subprocess.Popen( "zcat ~/in2/swxx/sj/db/ncbi/gene/gene2accession.gz", shell = True, stdout = subprocess.PIPE) as gz:
    for line1 in gz.stdout:
        F = line1.split( b'\t')

Now gzip is using nearly 100% CPU, and so is Python.

I have not recorded the time taken by these three approaches, but the Python code feels clearly faster than the other two.

How come?

simonbyrne · March 2, 2021, 4:21pm

How does the timed performance compare?

I’m not sure it will make much difference in this case, but it is generally bad for performance to refer to non-const global variables inside a function.

Also, I would recommend using CodecZlib.jl over shell commands.

calvin · March 2, 2021, 4:28pm

I will record the timings later, but my impression is that the Python code is clearly faster than the other two.

For the Julia code, there is only one global variable, and it is not used repeatedly, so I guess it would not affect performance much.

WschW · March 2, 2021, 4:35pm

The issue with non-const global variables is how they affect type inference. Basically, the type of the global, and of every value derived from it, cannot be known at compile time, which results in slower code.

If there are still speed differences after that, they could be due to whether println buffers IO operations.

rdeits · March 2, 2021, 4:40pm

This could also be related to the fact that println is slower than Python’s print in some terminals: https://github.com/JuliaLang/julia/issues/36639

Do your results change if you skip the print statements?

calvin · March 2, 2021, 4:46pm

I see. I have modified the code, but noticed no improvement in performance.

calvin · March 2, 2021, 4:48pm

I later skipped printing, but noticed no improvement in performance - gzip was using ~45% CPU, and Julia 100%.

tbeason · March 2, 2021, 4:48pm

Yeah, why are you calling an external command here instead of using a Julia package? I read large *.csv.gz files via streaming with CSV.jl and CodecZlib.jl, and it is very fast.

calvin · March 2, 2021, 5:36pm

Is this an appropriate way to use CodecZlib? It is even slower…

using CSV, CodecZlib, Mmap, TranscodingStreams
function fun1( file1)
    io1 = TranscodingStream( GzipDecompressor(), open( expanduser( file1)))
    N = 10
    n1 = 0
    for line1 in eachline( io1)
        F = split( line1, '\t')
        if n1 < N # print the first 10 lines
            print( line1, '\n')
            n1 = n1 + 1
        end
    end
end
fun1( "~/in2/swxx/sj/db/ncbi/gene/gene2accession.gz")

ImreSamu · March 2, 2021, 6:32pm

Just a side note: on a multi-core CPU, "unpigz -c sample.gz | ..." is ~2x faster than a plain "zcat sample.gz | ...".
https://unix.stackexchange.com/questions/363644/fastest-and-most-efficient-way-to-get-number-of-records-lines-in-a-gzip-compre/363739#363739
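
A rough Python sketch of that idea, in the style of the subprocess snippet from the first post: prefer unpigz when it is on PATH and fall back to gzip -dc otherwise. The helper names pick_decompressor and stream_lines are made up for illustration, and whether unpigz actually helps is something to measure on your own machine rather than assume.

```python
import os
import shutil
import subprocess

def pick_decompressor(path):
    """Return an argv list that writes the decompressed file to stdout."""
    path = os.path.expanduser(path)  # no shell is involved, so expand "~" here
    # Assumption: unpigz is optional; gzip -dc is the fallback when it is absent.
    if shutil.which("unpigz"):
        return ["unpigz", "-c", path]
    return ["gzip", "-dc", path]

def stream_lines(path):
    """Yield raw byte lines from the decompressor's stdout."""
    with subprocess.Popen(pick_decompressor(path), stdout=subprocess.PIPE) as proc:
        for line in proc.stdout:
            yield line
```

Usage would look like "for line in stream_lines(path): fields = line.split(b'\t')", which keeps the per-line work identical so that only the decompressor changes between runs.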

calvin · March 2, 2021, 10:48pm

I know. However, gzip is not the bottleneck here.

ToucheSir · March 3, 2021, 12:15am

There was a related thread recently where it was mentioned that reading stdin via a pipe can be quite slow in Julia. CodecZlib definitely seems like the more idiomatic option (just like you’d use gzip.open in Python), so hopefully someone can comment on why the performance of your code snippet isn’t amazing (lack of buffering, maybe)?
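
For reference, the gzip.open route on the Python side could look roughly like this. It is only a sketch: the function name count_lines_and_fields is invented here, and it counts lines and fields just so the loop has an observable result.

```python
import gzip
import os

def count_lines_and_fields(path):
    """Stream a gzipped tab-separated file; return (lines, total fields)."""
    lines = 0
    fields = 0
    # Binary mode yields bytes lines, matching the subprocess version above.
    with gzip.open(os.path.expanduser(path), "rb") as fh:
        for line in fh:
            fields += len(line.rstrip(b"\n").split(b"\t"))
            lines += 1
    return lines, fields
```

Calling count_lines_and_fields("~/in2/swxx/sj/db/ncbi/gene/gene2accession.gz") would then do the same per-line split as the original snippets without spawning zcat.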

simonbyrne · March 3, 2021, 9:08pm

I think that’s right (you should be able to just do GzipDecompressorStream(open( expanduser( file1))), but that is equivalent).

Is the file (or something similar) publicly available? If you can post a link, I suspect you will find some people who will try to optimise it.

calvin · March 4, 2021, 1:28am

OK, thanks for the answer. The file is 2+ GB and is publicly available: https://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2accession.gz

simonbyrne · March 4, 2021, 8:31pm

Thanks. Doing some basic timings, I think there are actually two distinct performance problems:

1. Calling eachline: simply iterating over the gzcat pipe takes ~1 minute in Julia, vs ~30 seconds in Python. CodecZlib.jl seems to have similar performance to gzcat.
2. split in Julia is itself slower than in Python.

Not sure exactly what the best approach here would be.
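
To separate the two costs on the Python side, one option is to time a pass that only iterates over lines against a pass that also splits them. The sketch below uses a small made-up in-memory sample rather than gene2accession.gz, and the helper names make_sample and timed_pass are invented for illustration, so the absolute numbers mean little; only the gap between the two passes is of interest.

```python
import gzip
import io
import time

def make_sample(rows=20000):
    """Build a small gzipped TSV in memory (made-up data, not gene2accession)."""
    raw = b"".join(b"%d\tacc%d\tgene%d\n" % (i, i, i) for i in range(rows))
    return gzip.compress(raw)

def timed_pass(blob, do_split):
    """Return (seconds, lines) for one streaming pass over the gzipped bytes."""
    start = time.perf_counter()
    lines = 0
    with gzip.open(io.BytesIO(blob), "rb") as fh:
        for line in fh:
            if do_split:
                line.split(b"\t")  # result discarded; only the cost matters
            lines += 1
    return time.perf_counter() - start, lines

blob = make_sample()
t_iter, n_iter = timed_pass(blob, do_split=False)
t_split, n_split = timed_pass(blob, do_split=True)
print(f"iterate only:    {t_iter:.4f} s for {n_iter} lines")
print(f"iterate + split: {t_split:.4f} s for {n_split} lines")
```

The same two-pass structure can be mirrored in Julia (one loop over eachline with and without split) to see which of the two problems dominates there.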

Skoffer · March 17, 2021, 1:07pm

Out of curiosity, you could try LibDeflate.jl: [ANN] LibDeflate.jl and CodecBGZF.jl - really fast blocked de/compression - #8 by jakobnissen

calvin · March 17, 2021, 2:08pm

However, streaming is not supported…

Unlike libz or gzip, libdeflate does not support streaming, and so is intended for use with files that fit in memory or for block-compressed files like bgzip.

Skoffer · March 17, 2021, 2:08pm

Ah, sorry, I missed that.

ImreSamu · March 17, 2021, 3:20pm

Maybe we can use the new zlib-ng library.

x86-64 benchmarks:
"Zlib-ng is about 4x faster than zlib, and 2.1x faster than gzip for compression.
Zlib-ng is about 2.4x faster than zlib and 1.8x faster than gzip when decompressing."
