There is quite a zoo of data formats for storing multidimensional arrays tiled into chunks, with or without compression. Examples are HDF5.jl, NetCDF.jl, Zarr.jl, and BigArrays.jl, and maybe there will be a TileDB wrapper soon.
It would be really nice if there were a generic way to access and process these arrays with respect to their chunked nature, so that one could iterate over chunks, copy from one format to another, efficiently distribute computations over workers while respecting the chunks, etc., all through a common interface. This would greatly ease coupling to tiled in-memory data structures and processing tools like TiledIteration.jl or DistributedArrays.jl.
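To make the idea concrete, here is a minimal sketch of what such a common interface could look like. All names here (`eachchunk`, `ChunkedArray`, `chunkcopy!`) are made up for illustration and do not come from any existing package; a backend would only need to expose its chunk grid as an iterator of index ranges, and generic operations like chunk-wise copying fall out for free:

```julia
# Hypothetical minimal chunk interface (all names invented for illustration).
# A backend provides:
#   eachchunk(A) -> iterator of tuples of index ranges, one tuple per chunk
#   A[ranges...] -> ordinary AbstractArray indexing to read a chunk

struct ChunkedArray{T,N} <: AbstractArray{T,N}
    data::Array{T,N}          # stands in for an on-disk dataset
    chunksize::NTuple{N,Int}  # equal-sized chunks, possibly truncated at edges
end

Base.size(A::ChunkedArray) = size(A.data)
Base.getindex(A::ChunkedArray, I...) = A.data[I...]

# Iterate the Cartesian product of per-dimension range partitions.
function eachchunk(A::ChunkedArray)
    ranges = map(size(A), A.chunksize) do s, c
        [i:min(i + c - 1, s) for i in 1:c:s]
    end
    Iterators.product(ranges...)
end

# A generic chunk-respecting copy between two chunked arrays:
function chunkcopy!(dst::ChunkedArray, src::ChunkedArray)
    for rs in eachchunk(src)
        dst.data[rs...] = src[rs...]
    end
    dst
end

A = ChunkedArray(rand(10, 10), (4, 4))
B = ChunkedArray(zeros(10, 10), (4, 4))
chunkcopy!(B, A)
```

The same `eachchunk` iterator could then drive format-to-format conversion or feed chunk ranges to workers for distributed processing, which is exactly the kind of reuse a shared interface would enable.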
When searching for existing interfaces, the closest thing I found was BlockArrays.jl. However, its model does not completely fit the formats mentioned above, since it seems to focus on variable-sized “Blocks”, while in those formats all chunks have equal size. Then there is Blocks.jl from JuliaParallel, which seems to provide exactly the interface I have in mind, but it is abandoned.
My main question is whether there are people who would support an effort to harmonize a chunk interface for some of the above-mentioned data formats, ideally contributors/authors of these packages, and whether there are other existing interfaces that might be ready to use and that I have missed in my search.
It is also unclear to me whether the `Chunk` type defined in Dagger.jl is generic enough to handle our use case; if so, it would give us a lot of chunked processing for free. So far I have only seen examples of Dagger being applied to JuliaDB instances, although it should be generic enough to deal with arbitrary chunked datasets, so maybe this is the way to go. Any explanation/input here would be welcome as well.
What is the best interface/framework to deal with chunked, dense multidimensional array data formats?