[ANN] LibDeflate.jl and CodecBGZF.jl - really fast blocked de/compression

Here I introduce the new packages LibDeflate.jl and CodecBGZF.jl. Both packages are newly registered, and I would love to get comments.


LibDeflate.jl is a thin wrapper around libdeflate, the fastest implementation of the DEFLATE compression algorithm that I’m aware of. DEFLATE is used in the zip, gzip and bgzip formats. As a wrapper library with a minimal interface, LibDeflate.jl is intended as a low-level building block for writing higher-level Julia packages. The package also offers a very fast implementation of the crc32 checksum.

LibDeflate.jl differs from the more commonly used DEFLATE package CodecZlib.jl in the following ways:

  • LibDeflate.jl is multiple times faster than CodecZlib.jl.
  • LibDeflate.jl only supports in-memory inflation/deflation, and as such does not support streaming IO unless the stream is composed of smaller compressed blocks, each of which can be de/compressed in memory.
  • LibDeflate.jl’s interface is lower-level and does not provide convenience methods.


CodecBGZF.jl is a higher-level package built on LibDeflate.jl with the purpose of reading and writing blocked gzip format (BGZF) files. It is a codec for TranscodingStreams.jl, and therefore integrates well with the Julia file IO ecosystem.

BGZF is a format backwards compatible with gzip. It offers slightly worse compression ratios than ordinary gzip files, but supports random access and is significantly faster to de/compress.

Compared to the existing BGZFStreams.jl package, CodecBGZF offers:

  • 4-5x faster single-threaded read/write operations, by leveraging LibDeflate.jl
  • Automatic asynchronous and multithreaded de/compression, which can make it tens of times faster when using 8 threads
  • All the useful high-level methods of TranscodingStreams.jl, automatically “inherited” by virtue of being a codec
  • More input validation, such as crc32 checksumming and verification of the stated decompressed size
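In practice, reading a BGZF file looks like reading any other Julia stream; `BGZFDecompressorStream` wraps an IO object and yields decompressed bytes. A short sketch (the filename is a placeholder):

```julia
using CodecBGZF

# Wrap a raw file handle in a BGZF decompressor stream.
# The result behaves like an ordinary Julia IO object,
# so eachline, read, etc. all work on it.
open("data.txt.bgz") do io
    stream = BGZFDecompressorStream(io)
    for line in eachline(stream)
        # process each decompressed line
        println(line)
    end
end
```

Because the stream type comes from TranscodingStreams.jl, the same pattern works for any codec in that ecosystem.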

Hi there, thanks for your hard work on this - I think it will be really useful! I’m having a hard time getting good performance, for instance when compared to the standard Python gzip library. I have a bgzipped fastq file with about 30M lines (7.5M reads). Decompressing and emitting each line to output takes about 60 seconds in Python 3.8 using standard gzip. Using CodecBGZF in Julia 1.5.2 and the implementation below, the same operation takes about 425 seconds. I’m new to Julia so perhaps I’m missing something obvious?

import CodecBGZF

function justwriteit(file)
    stream = CodecBGZF.BGZFDecompressorStream(open(file))
    for fq in eachline(stream)
        println(fq)
    end
end

Happy to supply more details if it would help. I think I’m just looking to see if this performance is typical or if (more likely) I’m doing something horrible to bottleneck it…

Hey there!

Maybe you’re seeing https://github.com/JuliaLang/julia/issues/36639? I just did a test on my laptop where I, instead of printing each line of a 6 MiB BGZF-compressed FASTA file, simply counted the lines. CodecBGZF.jl (single threaded) finished in 25 ms, whereas gunzip -dc myfile.fna.bgz > /dev/null took 62 ms according to hyperfine. So it should be multiple times faster than Python’s gzip.

Edit: For a 250 MiB file, it’s only somewhat faster at 1.5s vs 2.2s.

Try printing each line to a file instead of stdout, to bypass the terminal.


Thanks for the tip, and yes, that issue looks like it might have something to do with it. If I change the function to count lines instead of writing them, the time drops from 425 seconds to about 15 seconds! Python is about 35 seconds, and ‘gunzip -c | wc -l’ takes about 20 seconds, so pretty nifty that it’s faster than unix shell commands. Interestingly, if I replace Julia’s println(line) with write(stdout, line), the time drops from 425s to 218s, which is nice but still much slower than Python… I’ll dig more into other output options. So basically, it looks like the issue is not with this library but with Julia’s println or something :confused:


I’m not sure I can help, but when I run the exact code you posted, including println, it prints a 152 MB file, 1.9 M lines, in 780 milliseconds.
Can you try to make a Julia script which simply contains

open(x -> foreach(println, eachline(x)), ARGS[1])

And then call it from the shell on a plaintext file, redirecting output:
$ julia my_script.jl my_fasta.fna > /dev/null
And compare it to a similar Python script:

import sys
with open(sys.argv[1]) as file:
    for line in file:
        print(line, end="")

On my computer, when I run both on my 1.9 M line (80 chars per line) FASTA file, Python uses 0.85 seconds, and Julia uses 1.2 seconds (including precompilation). If your Julia script is significantly slower, then we have narrowed down the problem:

  • There are no dependencies in the script, so it’s not LibDeflate.jl or any other package
  • It doesn’t print to the terminal, so it’s not the terminal being slow
  • It can’t be the OS or the filesystem, because then Python wouldn’t be much faster.

And then it would be a good idea to make an issue on the Julia GitHub to get to the bottom of this.

Edit: @brendanofallon, the fact that your time roughly halves when you replace println with write might suggest that you are facing trouble with IO locking at the operating system level. In Julia, println(x) simply calls print(x, '\n'), which does two write operations instead of one. For each write operation, the file is locked and unlocked. Python doesn’t spend time doing that, because it has the Global Interpreter Lock, which prevents multithreading, so there is no worry about thread safety.
To test this, please also test the following program:

open(x -> foreach(println, eachline(x)), ARGS[1], lock=false)

OK, I made a 2M line plain text file uncompressed, and ran the code you posted above (directing output to /dev/null). Time for julia to just write the text file was 1.2 seconds (average of a few runs), and the python code ran in about 1.4 seconds. Using lock=false seemed to be a bit faster, with runtimes closer to 1.0-1.1 seconds.
Together, these results are a little confusing… Since counting the lines in the file (and not emitting anything) was so much faster than writing the output, that makes me think writing the output was the slow part… but these experiments demonstrate that Julia is a bit faster, if anything, than Python… so maybe it’s something about the particular way I was using println in the original code? I’ll dig more into it, maybe using buffered streams for output to avoid grabbing and then releasing millions of file locks…
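One way to test the lock-per-line hypothesis with nothing but the standard library is to accumulate lines in an `IOBuffer` and flush it in large chunks, so the destination IO is written (and locked) once per chunk instead of once per line. A rough sketch (the function name and buffer size are arbitrary choices, not anything from the packages above):

```julia
# Batch lines in memory and flush in ~64 KiB chunks, so the
# destination IO is written far fewer times than once per line.
function buffered_println(io::IO, lines; bufsize::Int = 1 << 16)
    buf = IOBuffer()
    for line in lines
        print(buf, line, '\n')
        if position(buf) >= bufsize
            write(io, take!(buf))  # one big write instead of many small ones
        end
    end
    write(io, take!(buf))  # flush whatever remains
end

# e.g. buffered_println(stdout, eachline(stream))
```

If this closes most of the gap, that would point squarely at per-write overhead rather than decompression speed.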

Anyway, thanks for your help with this. I don’t think there’s any issue with the CodecBGZF or LibDeflate code here. It’s something funky about writing lots of output that is causing a slowdown, not reading the data.

Yeah. You can also try the same but redirecting it to a file. That’s probably where the IO locking takes its toll.

Version 0.2 has just been released.

This version adds gzip functionality: Both compression and decompression, but also extraction of the internal fields of gzip data (like comments, filenames, timestamp and any “extra fields” as described in the specification).

It remains around 4-5x faster than CodecZlib.jl.


Awesome, thanks. Any blockers for your PR in XAM.jl? I’ve got a lot of old code that reads BAMs, and I’d be curious to see the difference with all these improvements.

Yeah, sorry, I’ve been meaning to get around to it, but there is so much to do. CodecBGZF has a multithreading issue at the moment that I need to fix. After that there is quite a bit of work to do on XAM. I’m not sure when I’m going to get the time. It probably won’t be finished until summer.
