There is quite a zoo of data formats that allow you to store multidimensional arrays tiled into chunks, with or without compression. Examples are HDF5.jl, NetCDF.jl, Zarr.jl, and BigArrays.jl, and maybe there will be a TileDB wrapper soon.
It would be really nice if there were a generic way to access and process these arrays with respect to their chunked nature, so that one could iterate over chunks, copy from one format to another, distribute computations efficiently over workers while respecting the chunks, etc., all through a common interface. This would also greatly ease coupling to tiled in-memory data formats and processing tools like TiledIteration.jl or DistributedArrays.jl.
When searching for existing interfaces, the closest thing I found was BlockArrays.jl. However, its model does not completely fit the formats mentioned above, since it seems to focus on variable-sized “Blocks”, while in the formats above all chunks have equal size. Then there is Blocks.jl from JuliaParallel, which seems to provide exactly the interface I have in mind, but it is abandoned.
My main question is whether there are people, ideally contributors/authors of these packages, who would support an effort to harmonize a chunk interface for some of the above-mentioned data formats, and whether there are existing interfaces that are ready to use and that I have missed in my search.
It is also unclear to me whether the Chunk type defined in Dagger.jl is generic enough to deal with our use case and would give us a lot of chunked processing for free. So far I have only seen examples of Dagger being applied to JuliaDB instances, although it should be generic enough to handle arbitrary chunked datasets, so maybe this is the way to go. Any explanation/input here would be welcome as well.
TLDR
What is the best interface/framework to deal with chunked, dense, multidimensional array data formats?
I think that there are at least two parallel goals the libraries you mention address:
1. some structure, typically sparsity, that can be taken advantage of (BlockArrays),
2. a purely mechanical partition of elements to facilitate processing the data in parts (HDF & friends).
They have some intersection, but neither is a subset of the other.
As for a common API for chunks, one could imagine an eachchunk(A) iterator that returns some I::CartesianIndices that conform to the storage layout so that A[I] can be retrieved efficiently.
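To make the idea concrete, here is a minimal sketch assuming a regular chunk grid; `eachchunk` and `chunksize` are illustrative names here, not an existing API:

```julia
# Hypothetical sketch: a chunk iterator for an array with a regular chunk grid.
chunksize(A) = (256, 256)   # stand-in; a real backend would report its on-disk chunk layout

function eachchunk(A)
    cs = chunksize(A)
    grid = CartesianIndices(cld.(size(A), cs))        # one grid cell per storage chunk
    return (CartesianIndices(ntuple(i -> (I[i]-1)*cs[i]+1 : min(I[i]*cs[i], size(A, i)),
                                    ndims(A)))
            for I in grid)
end

# A chunk-aware reduction then only ever touches one storage chunk at a time:
A = rand(1000, 700)
s = sum(sum(A[I]) for I in eachchunk(A))   # equals sum(A), read chunk by chunk
```

With something like this, downstream code never has to know whether the chunks come from HDF5, NetCDF, or Zarr.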
I think that there are at least two parallel goals the libraries you mention address:
Thanks for the clarification. I am currently more interested in your point (2), and this is the set of packages I am currently using. I was just wondering if concepts from (1) might help here.
Do you need anything more?
I think this might already bring me a long way. In particular, if the eachchunk iterator implemented HasShape, one would know beforehand how to organize the individual chunks. Maybe in the future an optional affinity function would help as well, to facilitate interaction with DistributedArrays.
I will experiment a bit and keep you updated. Do you think it would be feasible to create a package that only defines a single function name?
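To get a feeling for it, here is a hedged sketch of how small such a package could be: essentially one generic function, plus optionally a shared iterator type so that `HasShape` is defined in a single place. All names here are illustrative.

```julia
# Hypothetical sketch of a minimal interface package: one generic function plus
# an optional shared iterator type for the common case of a regular chunk grid.
module ChunkIteration

"""
    eachchunk(A)

Return an iterator of `CartesianIndices`, one element per storage chunk of `A`.
Backends (HDF5, NetCDF, Zarr, ...) add methods for their own array types.
"""
function eachchunk end

struct GridChunks{N}
    arraysize::NTuple{N,Int}
    chunksize::NTuple{N,Int}
end

Base.size(g::GridChunks) = cld.(g.arraysize, g.chunksize)
Base.length(g::GridChunks) = prod(size(g))
# HasShape tells callers up front how the chunks are laid out, e.g. for collect:
Base.IteratorSize(::Type{GridChunks{N}}) where {N} = Base.HasShape{N}()

function Base.iterate(g::GridChunks{N}, state = 1) where {N}
    state > length(g) && return nothing
    I = CartesianIndices(size(g))[state]
    inds = ntuple(i -> (I[i]-1)*g.chunksize[i]+1 : min(I[i]*g.chunksize[i], g.arraysize[i]), Val(N))
    return CartesianIndices(inds), state + 1
end

end # module
```

With this, `collect(ChunkIteration.GridChunks((1000, 700), (256, 256)))` comes back as a 4×3 matrix of chunk indices, which covers the “know beforehand how to organize the individual chunks” part.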
Thanks, I did not mention columnar formats here, because I am more interested in dense, n-dimensional arrays. However, there might be some overlap in concepts and maybe one could define a common interface for these, but it is currently not my focus.
I drafted a small interface package here: GitHub - meggart/ChunkedArrayBase.jl, and added a few usage examples in the README that test the functionality. Basically the interface works fine, and comments on details would be very welcome as issues in the repository. I opened an issue in HDF5 as well; Zarr and NetCDF will follow. For now I have omitted dealing with the heterogeneous case, but I think this would be quite easy to extend, either by adding a second iterator (eachchunktype) or indeed by returning a tuple instead.
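As an illustration of the kind of code this enables (a hedged sketch in the spirit of the interface, not necessarily the exact examples from the README): a format-to-format copy that never holds more than one chunk in memory. The `copychunked!` helper and the hand-written chunk grid are just for demonstration.

```julia
# Copy from one chunked backend to another, one storage chunk at a time.
function copychunked!(dest, src, chunks)
    for I in chunks              # `chunks` would come from an eachchunk-style iterator
        dest[I] = src[I]         # each assignment moves exactly one storage chunk
    end
    return dest
end

# Toy demonstration with plain in-memory arrays and a hand-written 5×5 chunk grid:
src  = rand(10, 10)
dest = similar(src)
chunks = [CartesianIndices((i:min(i+4, 10), j:min(j+4, 10))) for i in 1:5:10, j in 1:5:10]
copychunked!(dest, src, chunks)
dest == src                      # true
```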
Which of those formats allow adding new columns or reshaping the data directly on disk, without needing to copy and create a new one?
It would be very useful when dealing with larger-than-memory datasets.
I think all of them. In NetCDF/HDF5 you can define a dimension with unlimited length, which means you can keep appending data to it. For Zarr you can always resize/append along any dimension.
For all of these packages this will only be efficient if you define your chunking accordingly, so that ideally you always append complete chunks to your dataset.
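As a minimal sketch of the arithmetic, assuming a chunk length `c` along the growing dimension:

```julia
# Round each append down to whole chunks and buffer the remainder, so no
# existing chunk has to be read back, decompressed, and rewritten next time.
c      = 1000                 # chunk length along the unlimited/append dimension
nnew   = 2500                 # records currently waiting to be written
nwrite = fld(nnew, c) * c     # append 2000 records now = 2 complete chunks
nhold  = nnew - nwrite        # keep the remaining 500 buffered until a chunk fills up
```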