[ANN] BGZFLib.jl: BGZF successor library

Hey all,

In the next couple of months I’ll register my latest package: BGZFLib.jl. This package reads and writes BGZF-compressed files, which are common in bioinformatics.

We already have two BGZF packages in Julia: BGZFStreams and CodecBGZF. Why add another?
CodecBGZF (my own) is broken and should not be used. I should not have registered it in the first place, and will be archiving and deprecating it as soon as BGZFLib is registered.

Compared to BGZFStreams, BGZFLib is different in the following ways:

  1. BGZFLib uses the BufferIO interface, whereas BGZFStreams uses the Base.IO interface. Most users probably won’t care, since BufferIO is mostly backwards compatible with Base.IO, but it does offer some greater control for performance-sensitive applications.

  2. BGZFLib is faster, because it uses LibDeflate.jl as a backend, whereas BGZFStreams uses the slower CodecZlib.jl backend.

  3. BGZFLib (de)compresses in parallel, and also concurrently with writing (reading). BGZFStreams is parallel but not concurrent, which is another reason it is slower.

  4. BGZFLib has more features and offers more control. For example, you can create, write and load GZI files (index files for BGZF), write empty blocks, choose how to handle files without EOF markers, choose not to write EOF markers, and recover errored streams.
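For readers who have not met GZI files before, here is a rough sketch of what such an index contains. To my understanding of the htslib bgzip documentation, a .gzi file is a little-endian UInt64 entry count followed by that many pairs of UInt64 values (compressed offset, uncompressed offset). The function name `read_gzi` is made up for illustration; this uses only Base and is not BGZFLib's API.

```julia
# Hypothetical helper, not part of BGZFLib: parse a GZI index from any Base.IO.
# Assumed layout: UInt64 count, then `count` pairs of
# (compressed_offset, uncompressed_offset), all little-endian.
function read_gzi(io::IO)
    n = Int(ltoh(read(io, UInt64)))
    entries = Vector{Tuple{UInt64,UInt64}}(undef, n)
    for i in 1:n
        compressed_offset = ltoh(read(io, UInt64))
        uncompressed_offset = ltoh(read(io, UInt64))
        entries[i] = (compressed_offset, uncompressed_offset)
    end
    return entries
end

# Build a tiny two-entry index in memory and parse it back.
buf = IOBuffer()
write(buf, htol(UInt64(2)))
for v in (100, 65280, 250, 130560)
    write(buf, htol(UInt64(v)))
end
seekstart(buf)
index = read_gzi(buf)
```

Each entry maps the start of a BGZF block in the compressed file to the matching position in the decompressed stream, which is what makes seeking by uncompressed position possible.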

Here is a benchmark decompressing and XOR’ing the bytes of a 3.9 GB BAM file (9.1 GB decompressed) on my laptop. The four programs compared are:

  • BGZFStreams - the competing package
  • BGZFLib - this package
  • CodecZlib - general purpose, single-threaded gzip decompression
  • bgzip - the reference CLI tool written in C from the author of the BGZF format. Timed by bgzip -dc -@ [threads] [file] > /dev/null

threads BGZFStreams BGZFLib CodecZlib bgzip
1             27.0    12.3      27.3  11.7
2             15.3     6.50     27.4   6.27
4              9.16    3.45     27.4   3.66
8              6.73    2.33     27.3   2.41

As you can see, BGZFLib is on par with bgzip: slightly slower at 1 and 2 threads, and slightly faster at 4 and 8. Since the author of bgzip is very performance oriented, I consider bgzip a reasonable performance ceiling, and I find it unlikely that someone will make a significantly faster library. The single-threaded time also corresponds to ~300 MB/s of compressed input decompressed per thread, with good multithreaded scaling.
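The ~300 MB/s figure and the scaling can be checked from the table. This is a minimal sketch under three assumptions that the post does not state explicitly: the times are wall-clock seconds, 1 GB = 1000 MB, and throughput is measured against the 3.9 GB compressed input. The helper names are mine.

```julia
# Times copied from the benchmark table, in thread-count order.
threads = [1, 2, 4, 8]
bgzflib = [12.3, 6.50, 3.45, 2.33]
bgzip   = [11.7, 6.27, 3.66, 2.41]

compressed_mb = 3900.0  # 3.9 GB compressed BAM file, assuming 1 GB = 1000 MB

# Compressed-input throughput per thread, in MB/s (assumes times are seconds).
per_thread(times) = compressed_mb ./ (times .* threads)

# Speedup of each run relative to the single-threaded run.
speedup(times) = times[1] ./ times

bgzflib_rate = per_thread(bgzflib)  # first element is about 317 MB/s
bgzflib_gain = speedup(bgzflib)     # last element is about 5.3x at 8 threads
bgzip_gain   = speedup(bgzip)       # last element is about 4.9x at 8 threads
```

Under these assumptions, per-thread throughput drops from about 317 MB/s at 1 thread to about 209 MB/s at 8 threads.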

Happy hacking


Why not reuse the name CodecBGZF, as a “complete rewrite” with a breaking release?

Because all the Codec* packages, including CodecBGZF, are plug-ins for TranscodingStreams.jl (that’s what the “codec” part of the names means), and BGZFLib is not, as I opted for a different interface.

The reason for that is twofold. First, after already committing to the TranscodingStreams (TS) interface, I found that one of the core abstractions of TS doesn’t really work for BGZF, namely that codecs can be chained together arbitrarily. That intuitively seems like it should work, but it makes seeking for a concurrent BGZF reader nearly impossible to implement, and seeking is one of the main motivating factors of BGZF.

Second, and less importantly, I really like the BufferIO interface and much prefer working with it over the TS interface, both as an author of IO objects and as a user of them.
