I am working on an algorithm that fits an HMM to some data. As part of the fitting I need to run the so-called forward-backward algorithm (see "Forward–backward algorithm" on Wikipedia). This involves constructing matrices that easily exceed the available memory, so I am trying to find an efficient way to temporarily write these matrices to disk and read them back. Because of the way the transition matrices are structured, many of the values in these matrices will be identical, so they would probably benefit greatly from some form of compression (e.g. Blosc.jl). Ideally, I would like to access these matrices as though they were ordinary in-memory matrices, with the underlying chunking/compression/decompression handled automatically. I could probably put this together myself, but I wanted to check whether anyone is aware of a package that already does something like this?
I think memory mapping might do what you are asking for.
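To illustrate, a minimal memory-mapping sketch using Julia's standard library `Mmap` (the file name and matrix size here are just placeholders):

```julia
using Mmap

# Back a working matrix with a file instead of RAM; pages are loaded
# and written through the OS page cache on demand.
io = open(joinpath(mktempdir(), "alpha.bin"), "w+")
A = Mmap.mmap(io, Matrix{Float64}, (1_000, 200))  # grows the file to fit
A[:, 1] .= 0.5            # writes touch only the mapped pages
Mmap.sync!(A)             # flush dirty pages to disk
close(io)                 # the mapping stays valid after closing the stream
```

Note that this gives you out-of-core access but no compression; identical values are stored verbatim on disk.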
Others I have seen that don't necessarily have everything in place, but might be worth looking at, are Zarr.jl and DiskArrays.jl.
If you Blosc-compress something, then you can't access the elements without decompressing, so you might need to compress in small blocks, etc.
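A small sketch of that block-wise idea, assuming Blosc.jl is installed: compress one column block at a time so that later you only have to decompress the block you actually need.

```julia
using Blosc

# A highly repetitive block (like a structured transition matrix slice)
# compresses very well.
block = fill(0.25, 1_000, 100)
buf = Blosc.compress(vec(block))                          # Vector{UInt8}
back = reshape(Blosc.decompress(Float64, buf), size(block))
```

You would then store one such `buf` per chunk and decompress on access.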
Anyway, I also want to know the answer to this; I am just sharing some fragments of things I remember from the past.
Thanks! Those look like valuable resources. The main functionality I need is the ability to seamlessly read and write the chunked data; basically, the chunks would be as large as can fit in memory. I had a very manual Python version of this at some point, where I would write individual chunks to separate files and then have a bunch of bookkeeping code to keep track of the chunks and load the correct one. This worked, but it made the code a bit messy, so I was hoping to refactor that part out into a separate package in my Julia implementation. I will look into Zarr and DiskArrays in more detail and see if they fulfil my needs.
I would use an uncompressed HDF5 file with mmap. This should be quite efficient in terms of LOC on your side.
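For concreteness, a sketch of that approach assuming HDF5.jl is available (file and dataset names are illustrative); datasets written without a chunked/compressed layout are contiguous by default, which is what `readmmap` requires:

```julia
using HDF5

path = joinpath(mktempdir(), "fb.h5")
data = rand(100, 50)
h5open(path, "w") do f
    f["alpha"] = data               # contiguous, uncompressed by default
end
A = h5open(path, "r") do f
    # readmmap memory-maps the dataset, so pages load on demand;
    # the copy here is only so the demo can use it after the file closes.
    copy(HDF5.readmmap(f["alpha"]))
end
```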
Zarr.jl might also be an option for you:
Am I correct in my understanding that Zarr arrays are fast if you read them in chunks? In other words, if I try to access single elements of a Zarr array in my code, for example in a loop, would that be slow?
Yes, Zarr is slow for random access of single data points, but this will be true for every compressed data format.
Thanks for the clarification. What I had in mind is something that automatically figures out which chunk the requested index “belongs to”, then reads that chunk of data and decompresses it in memory. This would work well for my use case, where I’m accessing the data sequentially, i.e. column by column. All I would need to do is check when the index “crosses the chunk border” and then load the next chunk.
It sounds like I could probably add this functionality on top of Zarr, again for my very special usage case.
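That chunk-border bookkeeping could be sketched in plain Julia along these lines (all names here are illustrative, not from any package): columns are stored in one raw binary file per chunk, and a one-chunk cache is refreshed whenever the requested column crosses a chunk border.

```julia
struct ChunkedColumns
    dir::String                                       # directory of chunk files
    nrows::Int
    chunkcols::Int                                    # columns per chunk
    cache::Base.RefValue{Tuple{Int,Matrix{Float64}}}  # (chunk id, chunk data)
end

function getcolumn(c::ChunkedColumns, j::Int)
    id = (j - 1) ÷ c.chunkcols + 1
    if c.cache[][1] != id                             # crossed a chunk border?
        raw = read(joinpath(c.dir, "chunk$id.bin"))   # decompression would go here
        c.cache[] = (id, Matrix(reshape(reinterpret(Float64, raw), c.nrows, :)))
    end
    return view(c.cache[][2], :, j - (id - 1) * c.chunkcols)
end

# Demo: split a small matrix into column chunks on disk and read it back.
dir = mktempdir()
A = rand(4, 6)
for id in 1:3
    write(joinpath(dir, "chunk$id.bin"), A[:, 2id-1:2id])
end
c = ChunkedColumns(dir, 4, 2, Ref((0, zeros(4, 0))))
```

A real implementation would add compression at the marked line and could implement `Base.getindex` so the wrapper behaves like a normal `AbstractMatrix`.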
I am not sure your use case is so special; to me it actually sounds quite common. In YAXArrays.jl we have implemented a chunk-aware mapslices, which should work out of the box if your input data comes from NetCDF or xarray-compatible Zarr. You just map your function over your data, and in the background the data is read chunk by chunk, processed, and written to disk again. You also get multi-threading and distributed processing for free. Unfortunately the documentation has not yet been migrated from ESDL.jl, so if you want to read about it, you could start here: Analysis · ESDL.jl
That sounds like exactly what I need, I’ll check it out. Thanks!