I’m not sure it will make much difference in this case, but it is generally bad for performance to refer to non-const global variables inside a function.
Also, I would recommend using CodecZlib.jl over shell commands.
The issue with non-const global variables is how they affect type inference. The type of such a global, and of every value derived from it, cannot be known at compile time, which results in slower code.
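To make that concrete, here is a minimal sketch. All names (`scale`, `SCALE`, `sum_scaled_*`) are made up for illustration and are not from the code in this thread. No timings are shown, since they depend on the machine and Julia version.

```julia
# Hypothetical example: names are illustrative, not from the original post.
scale = 2.0            # non-const global: its type could change at any time
const SCALE = 2.0      # const global: type is fixed and known to the compiler

# Reads the non-const global, so `scale * x` cannot be inferred concretely.
function sum_scaled_global(xs)
    total = 0.0
    for x in xs
        total += scale * x
    end
    return total
end

# Reads the const global; inference can see Float64 throughout.
function sum_scaled_const(xs)
    total = 0.0
    for x in xs
        total += SCALE * x
    end
    return total
end

# Takes the value as an argument, which is usually the most flexible fix.
function sum_scaled_arg(xs, s)
    total = 0.0
    for x in xs
        total += s * x
    end
    return total
end

xs = rand(1_000)
```

All three return the same result; only the inferred types differ. Running `@code_warntype sum_scaled_global(xs)` in the REPL should flag the non-concrete types, while the other two versions should come out clean.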
If there are still speed differences after that it could be due to whether println buffers IO operations.
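One way to check whether output buffering matters is to collect the lines in memory and write them out once. This is only a sketch: `print_lines_buffered` is a hypothetical helper, and whether it helps at all depends on where the output goes.

```julia
# Sketch: accumulate lines in an in-memory buffer and emit them with one write.
# `print_lines_buffered` is a made-up name for illustration.
function print_lines_buffered(out::IO, lines)
    buf = IOBuffer()
    for line in lines
        println(buf, line)       # cheap: appends to memory only
    end
    write(out, take!(buf))       # single write to the destination stream
    return nothing
end

print_lines_buffered(stdout, ["a\tb", "c\td"])
```

If this version is not measurably faster than calling `println` per line, buffering is probably not the bottleneck.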
Yeah, why are you calling an external program here instead of using a Julia package? I read large *.csv.gz files via streaming using CSV.jl and CodecZlib.jl, and it is very fast.
Is this an appropriate way to use CodecZlib? It is even slower…
using CSV, CodecZlib, Mmap, TranscodingStreams

function fun1(file1)
    io1 = TranscodingStream(GzipDecompressor(), open(expanduser(file1)))
    N = 10
    n1 = 0
    for line1 in eachline(io1)
        F = split(line1, '\t')
        if n1 < N  # print the first 10 lines
            print(line1, '\n')
            n1 = n1 + 1
        end
    end
end

fun1("~/in2/swxx/sj/db/ncbi/gene/gene2accession.gz")
There was a related thread recently where it was mentioned that reading stdin via a pipe can be quite slow in Julia. CodecZlib definitely seems like the more idiomatic option (just like you’d use gzip.open in Python), so hopefully someone can comment on why the performance of your code snippet isn’t amazing (lack of buffering, maybe)?
Thanks, doing some basic timings, I think there are actually two distinct performance problems:
1. Calling eachline: simply iterating over the gzcat pipe takes ~1 minute in Julia, vs ~30 seconds in Python. CodecZlib.jl seems to have similar performance to gzcat.
2. split in Julia is itself slower than in Python.
Not sure exactly what the best approach here would be.
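For the `split` part, one thing that might be worth trying is to avoid building the full vector of fields when only a few are needed. The sketch below assumes Julia 1.8 or later, where `eachsplit` is available. `nth_field` is a hypothetical helper and the sample line is made up. I have not benchmarked it against the file in this thread.

```julia
# Hypothetical helper: return the n-th delimiter-separated field without
# allocating the vector that `split` would build. Needs Julia 1.8+.
function nth_field(line::AbstractString, n::Integer; delim = '\t')
    for (i, field) in enumerate(eachsplit(line, delim))
        i == n && return field     # a SubString view into `line`
    end
    return nothing                 # the line has fewer than n fields
end

line = "id\tname\tvalue"           # made-up sample line
nth_field(line, 2)
```

The iteration stops as soon as the requested field is found, so the fields after it are never scanned.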
Unlike libz or gzip, libdeflate does not support streaming, and so is intended for use on files that fit in memory or on block-compressed files like bgzip.
x86-64 benchmarks:
"Zlib-ng is about 4x faster than zlib, and 2.1x faster than gzip for compression."
"Zlib-ng is about 2.4x faster than zlib and 1.8x faster than gzip when decompressing."