Compressed array-on-disk

I am working on an algorithm that fits an HMM to some data. As part of the fitting I need to run the so-called forward–backward algorithm (see "Forward–backward algorithm" on Wikipedia). This involves constructing matrices that easily exceed the available memory, so I am trying to find an efficient way to temporarily write these matrices to disk and read them back. Because of the way the transition matrices are structured, a lot of the values in these matrices will be identical, so they would probably benefit greatly from some form of compression (e.g. Blosc.jl). Ideally, I would like to be able to access these matrices as though they were normal matrices, and have the underlying chunking/compression/decompression handled automatically. I could put this together myself, I think, but I wanted to check whether anyone is aware of a package that already does something like this?

I think memory mapping might do some of what you're asking for.

Other packages I have seen, which may not have everything in place but might be worth looking at, are Zarr.jl and DiskArrays.jl.

If you Blosc-compress something, you can't access the elements without decompressing, so you might need to compress in small blocks, etc.

Anyway, I would also like to know the answer to this; I am just sharing some fragments of things I remember from the past.

Thanks! Those look like valuable resources. The main functionality I need is the ability to seamlessly read and write the chunked data. Basically, the chunks would be as large as can fit in memory. I had a very manual Python version of this at some point, where I would write individual chunks to separate files and then have a bunch of bookkeeping code to keep track of the chunks and load the correct one. This worked, but made the code a bit messy, so I was hoping to refactor that part out into a separate package in my Julia implementation. I will look into Zarr and DiskArrays in more detail and see if they fulfil my needs.

I would use an uncompressed HDF5 file with mmap. This should be quite efficient in terms of LOC on your side.
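
Roughly something like this untested sketch with HDF5.jl (the file name, dataset name, and sizes are made up; a freshly created dataset may need early allocation before it can be mmapped):

```julia
using HDF5

# Create an uncompressed, contiguous dataset on disk.
# Chunked/compressed datasets cannot be memory-mapped, and the storage may
# need to be allocated up front (alloc_time = :early) so a dataset that has
# not been written yet can be mmapped at all.
h5open("forward_backward.h5", "w") do fid
    create_dataset(fid, "alpha", datatype(Float64), dataspace(1_000_000, 50);
                   alloc_time = :early)
end

# Re-open and memory-map the dataset; `A` then behaves like a normal
# Matrix{Float64}, with the OS paging data in and out of RAM as needed.
h5open("forward_backward.h5", "r+") do fid
    A = HDF5.readmmap(fid["alpha"])
    A[:, 1] .= rand(1_000_000)   # writes go through the mmap to the file
end
```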

Zarr.jl might also be an option for you:

https://github.com/meggart/Zarr.jl
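
For example, something along these lines (untested; the keyword names follow the Zarr.jl README, and the array and chunk sizes are made up):

```julia
using Zarr

# Create a chunked, Blosc-compressed array backed by a directory on disk.
z = zcreate(Float64, 1_000_000, 50;
            path = "alpha.zarr",
            chunks = (100_000, 50),
            compressor = Zarr.BloscCompressor(cname = "zstd"))

# Reading/writing whole chunks (blocks of rows here) is efficient;
# every access decompresses at least one chunk.
z[1:100_000, :] = rand(100_000, 50)
block = z[1:100_000, :]
```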

Am I correct in my understanding that Zarr arrays are fast if you read them in chunks? In other words, if I try to access single elements of a Zarr array in my code, for example in a loop, that would be slow?

Yes, Zarr is slow for random access of single data points, but this will be true for every compressed data format.

Thanks for the clarification. What I had in mind is something that automatically figures out which chunk the requested index “belongs to”, then reads that chunk of data and decompresses it in memory. This would work well for my use case, where I'm accessing the data sequentially, i.e. column by column. All I would need to do is check when the index “crosses the chunk border” and then load the next chunk.
It sounds like I could probably add this functionality on top of Zarr, again for my very specific use case.
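
Something like this untested sketch is what I have in mind; `load_chunk` is just a placeholder for whatever reads and decompresses a block of columns from disk (e.g. slicing a Zarr array), and here it is stubbed out with an in-memory matrix so the example runs:

```julia
# Column-chunk-caching wrapper: keeps one decompressed chunk in memory and
# reloads it whenever the requested column falls outside the cached range.
struct ChunkCachedMatrix{T,S} <: AbstractMatrix{T}
    source::S                       # on-disk, chunked/compressed storage
    nrows::Int
    ncols::Int
    chunkcols::Int                  # number of columns per chunk
    cache::Matrix{T}                # currently loaded chunk (nrows × chunkcols)
    cachestart::Base.RefValue{Int}  # first column held in `cache` (0 = empty)
end

Base.size(A::ChunkCachedMatrix) = (A.nrows, A.ncols)

function Base.getindex(A::ChunkCachedMatrix, i::Int, j::Int)
    start = A.cachestart[]
    # Reload the cache when `j` crosses the border of the loaded chunk.
    if start == 0 || j < start || j >= start + A.chunkcols
        start = fld(j - 1, A.chunkcols) * A.chunkcols + 1
        cols = start:min(start + A.chunkcols - 1, A.ncols)
        A.cache[:, 1:length(cols)] .= load_chunk(A.source, cols)
        A.cachestart[] = start
    end
    return A.cache[i, j - start + 1]
end

# Stand-in for testing: the "source" is just an in-memory matrix here.
load_chunk(source::AbstractMatrix, cols) = source[:, cols]

M = rand(4, 10)
A = ChunkCachedMatrix(M, 4, 10, 3, Matrix{Float64}(undef, 4, 3), Ref(0))
A[2, 7] == M[2, 7]   # true; loads the chunk covering columns 7:9
```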

I am not sure if your use case is so special; to me it actually sounds quite common. In YAXArrays.jl we have implemented a chunk-aware mapslices, which should work out of the box if your input data comes from NetCDF or xarray-compatible Zarr. You just map your function over your data and in the background the data is read chunk by chunk, processed and written to disk again. You also get multi-threading and distributed processing for free. Unfortunately the documentation has not yet been migrated from ESDL.jl, so if you want to read about it, you could start here: https://esa-esdl.github.io/ESDL.jl/latest/analysis/#Applying-custom-functions-1
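
To give a flavour of the pattern (loosely adapted from the ESDL docs; the path, axis name, and function are all illustrative):

```julia
using YAXArrays

# `cube` is assumed to be a YAXArray opened from NetCDF/Zarr data
# (the path here is made up).
cube = Cube("mydata.zarr")

# The inner function only ever sees one in-memory slice (`xin`); the chunk-wise
# reading, processing, and writing back to disk happen in the background.
function slice_maximum(xout, xin)
    xout .= maximum(skipmissing(xin))
end

result = mapCube(slice_maximum, cube;
                 indims  = InDims("Time"),
                 outdims = OutDims())
```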

That sounds like exactly what I need, I’ll check it out. Thanks!