How does the timed performance compare?
I’m not sure it will make much difference in this case, but it is generally bad for performance to refer to non-const global variables inside a function.
Also, I would recommend using CodecZlib.jl over shell commands.
I’ll record the timings later, but my impression is that the Python code is clearly faster than the other two.
In the Julia code there is only one global variable, and it is not used repeatedly, so I would guess it does not affect performance much.
The issue with non-const global variables is how they affect type inference: the variable, and every value derived from it, has a type that cannot be known at compile time, which results in slower code.
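A minimal sketch of the effect (the variable and function names here are made up for illustration):

```julia
x = 1.0            # non-const global: its type could change at any time

function sum_nonconst(n)
    s = 0.0
    for i in 1:n
        s += x     # type of x must be checked at run time on every access
    end
    return s
end

const y = 1.0      # const global: the compiler knows it is always a Float64

function sum_const(n)
    s = 0.0
    for i in 1:n
        s += y     # type-stable, compiles to a tight loop
    end
    return s
end

# @time sum_nonconst(10^7)   # noticeably slower than sum_const(10^7)
```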
If there are still speed differences after that, it could be due to whether println buffers IO operations.
This could also be related to the fact that println is slower than Python’s print in some terminals: slow printing in terminals · Issue #36639 · JuliaLang/julia · GitHub
Do your results change if you skip the printing?
I see. I have modified the code, but noticed no improvement in performance.
I later skipped printing, but noticed no improvement in performance - gzip was using ~45% CPU, and Julia 100%.
Yeah, why are you calling an external command here instead of using a Julia package? I read in large *.csv.gz files via streaming using CSV.jl and CodecZlib.jl and it is very fast.
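Something along these lines is what I mean (just a sketch, and I haven’t run it on your file — the function name, path, and tab delimiter are placeholders):

```julia
using CSV, CodecZlib

# Read a gzipped, tab-delimited file by streaming it through a
# decompressor into CSV.jl, without shelling out to gzip/zcat.
function read_gz(path)
    io = GzipDecompressorStream(open(expanduser(path)))
    try
        return CSV.File(io; delim = '\t')
    finally
        close(io)
    end
end
```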
Is this code using CodecZlib appropriately? It is even slower…
using CodecZlib, TranscodingStreams

function fun1(file1)
    io1 = TranscodingStream(GzipDecompressor(), open(expanduser(file1)))
    N = 10
    n1 = 0
    for line1 in eachline(io1)
        F = split(line1, '\t')
        if n1 < N  # print the first 10 lines
            print(line1, '\n')
            n1 += 1
        end
    end
    close(io1)
end
Just a side note: with a multi-core CPU, “unpigz -c sample.gz | ...” is ~2x faster than a simple “zcat sample.gz | ...”.
I know. However, gzip is not the bottleneck here.
There was a related thread recently where it was mentioned that reading stdin via a pipe can be quite slow in Julia.
CodecZlib definitely seems like the more idiomatic option (just like you’d use gzip.open in Python), so hopefully someone can comment on why the performance of your code snippet isn’t amazing (lack of buffering, maybe)?
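One thing that might be worth trying (purely an assumption on my part, not something I’ve benchmarked on this file) is wrapping the decompressor in a buffered reader, e.g. via BufferedStreams.jl:

```julia
using CodecZlib, BufferedStreams

# Same line-counting loop as before, but eachline reads from a
# BufferedInputStream instead of hitting the decompressor directly.
function count_lines_buffered(file1)
    io = BufferedInputStream(GzipDecompressorStream(open(expanduser(file1))))
    n = 0
    for line in eachline(io)
        n += 1
    end
    close(io)
    return n
end
```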
I think that’s right (you should be able to just do GzipDecompressorStream(open(expanduser(file1))), but that is equivalent).
Is the file (or something similar) publicly available? If you can post a link, I suspect you will find some people who will try to optimise it.
OK. Thanks for the answer. The file is 2+ GB and is publicly available: https://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2accession.gz.
Thanks. Doing some basic timings, I think there are actually two distinct performance problems:

1. eachline: simply iterating over the gzcat pipe takes ~1 minute in Julia vs ~30 seconds in Python, and CodecZlib.jl seems to have similar performance to the pipe.
2. split in Julia is itself slower than in Python.

Not sure exactly what the best approach here would be.
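For the split part, one idea (just a sketch, not tested against this file) is to scan the tab positions manually instead of allocating a vector of substrings for every line:

```julia
# Count tab-separated fields in a line without materialising them,
# walking the string with findnext instead of calling split.
function count_fields(line::AbstractString)
    n = 1
    i = firstindex(line)
    while true
        t = findnext('\t', line, i)
        t === nothing && return n
        n += 1
        i = nextind(line, t)
    end
end
```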
However, streaming is not supported…
“Unlike zlib or gzip, libdeflate does not support streaming, and so is intended for use on files that fit in memory or for block-compressed files like bgzip.”
Ah, sorry, I’ve missed it.
Maybe we can use the new zlib-ng library:
“Zlib-ng is about 4x faster than zlib, and 2.1x faster than gzip for compression.
Zlib-ng is about 2.4x faster than zlib and 1.8x faster than gzip when decompressing.”
I have not tried this. However, I suspect there is still room for optimisation on the Julia side itself.