Common interface for chunked arrays?

Hi all,

there is quite a zoo of data formats that allow you to store multidimensional arrays tiled into chunks, with or without compression. Examples are HDF5.jl, NetCDF.jl, Zarr.jl, and BigArrays.jl, and maybe there will be a TileDB wrapper soon.

It would be really nice if there were a generic way to access and process these arrays with respect to their chunked nature, so that one could iterate over chunks, copy from one format to another, efficiently distribute computations over workers respecting the chunks, etc., through a common interface. This would greatly ease coupling to tiled in-memory data formats and processing tools like TiledIteration.jl or DistributedArrays.jl.

When searching for existing interfaces, the closest thing I found was BlockArrays.jl. However, that model does not completely fit the formats mentioned above, since it seems to focus on variable-sized “Blocks”, while in the formats above the chunks all have equal size. Then there is Blocks.jl from JuliaParallel, which seems to provide exactly the interface I have in mind but is abandoned.

My main question is whether there are people who would support an effort to harmonize a chunk interface for some of the above-mentioned data formats (ideally contributors/authors of these packages), and whether there are existing interfaces that I have missed in my search and that might be ready to use.

It is also unclear to me whether the Chunk type defined in Dagger.jl is generic enough to deal with our use case; if so, it would give us a lot of chunked processing for free. So far I have only seen examples of Dagger being applied to JuliaDB instances, although it should be generic enough to deal with arbitrary chunked datasets, so maybe this is the way to go. Any explanation/input here would be welcome as well.

TLDR

What is the best interface/framework for dealing with chunked, dense, multidimensional array data formats?

5 Likes

I think that there are at least two parallel goals the libraries you mention address:

  1. some structure, typically sparsity, that can be taken advantage of (BlockArrays),

  2. a purely mechanical partition of elements to facilitate processing the data in parts (HDF & friends).

They have some intersection, but neither is a subset of the other.

As for a common API for chunks, one could imagine an eachchunk(A) iterator that returns some I::CartesianIndices that conform to the storage layout so that A[I] can be retrieved efficiently.
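
A minimal sketch of what such an iterator could look like (the names and signature are illustrative, not an existing API):

```julia
# Sketch: partition an array's index space into chunk-aligned blocks and
# return one CartesianIndices per chunk. A real interface would read the
# chunk size from the storage layout instead of taking it as an argument.
function eachchunk(A::AbstractArray, chunksize::Dims)
    ranges = map(size(A), chunksize) do n, c
        [i:min(i + c - 1, n) for i in 1:c:n]
    end
    (CartesianIndices(I) for I in Iterators.product(ranges...))
end

# Usage: process the array one storage-aligned block at a time.
A = rand(10, 10)
for I in eachchunk(A, (4, 4))
    @show I
end
```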

Do you need anything more?

1 Like

There are also Apache Arrow, Feather, and Parquet, with somewhat different sets of features/requirements: python - What are the differences between feather and parquet? - Stack Overflow

1 Like

I think that there are at least two parallel goals the libraries you mention address:

Thanks for the clarification. I am currently more interested in your number 2, and that is the set of packages I am currently using. I was just wondering if concepts from (1) might help here.

Do you need anything more?

I think this might already bring me a long way, in particular if the eachchunk iterator implemented HasShape, so that one would know beforehand how the individual chunks are organized. Maybe in the future an optional affinity function would help as well, to facilitate interaction with DistributedArrays.
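
For example, a dedicated iterator type could declare the trait explicitly (a sketch with made-up names, not an existing package API):

```julia
# An iterator over chunk index blocks that advertises the chunk-grid shape
# via Base.HasShape, so consumers know the layout before iterating.
struct GridChunks{N}
    arraysize::NTuple{N,Int}
    chunksize::NTuple{N,Int}
end

# The iterator has the shape of the chunk grid, not of the array itself.
Base.IteratorSize(::Type{GridChunks{N}}) where {N} = Base.HasShape{N}()
Base.size(g::GridChunks) = cld.(g.arraysize, g.chunksize)
Base.length(g::GridChunks) = prod(size(g))
Base.eltype(::Type{GridChunks{N}}) where {N} =
    CartesianIndices{N,NTuple{N,UnitRange{Int}}}

function Base.iterate(g::GridChunks, state = 1)
    state > length(g) && return nothing
    gridpos = Tuple(CartesianIndices(size(g))[state])
    ranges = map(gridpos, g.chunksize, g.arraysize) do p, c, n
        (p - 1) * c + 1:min(p * c, n)  # clip the last chunk to the array
    end
    CartesianIndices(ranges), state + 1
end
```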

I will experiment a bit and keep you updated. Do you think it would be feasible to create a package that only defines a single function name?

Thanks, I did not mention columnar formats here because I am more interested in dense, n-dimensional arrays. However, there might be some overlap in concepts, and maybe one could define a common interface for these, but it is currently not my focus.

Yes, I think a minimal “interface” package is the way to go. You may want to coordinate with the above packages though.
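
For reference, a single-function interface package really can be this small (a sketch; the actual draft appears a few posts below):

```julia
module ChunkedArrayBase

# The entire interface: backends like HDF5.jl, Zarr.jl, or NetCDF.jl would
# add methods returning an iterator of CartesianIndices that follows the
# on-disk chunk layout of their array types.
function eachchunk end

export eachchunk

end # module
```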

1 Like

For chunked array containers similar to BlockArray, see also Vcat and Hcat in https://github.com/JuliaArrays/LazyArrays.jl and ArrayPartition in https://github.com/SciML/RecursiveArrayTools.jl. Those are different from BlockArray in that the number and type of the chunks are encoded in the type. I think they still fit in @Tamas_Papp’s category 1. If you are going to design some kind of common API, it might be useful to consider an interface that can deal with both homogeneous and heterogeneous chunk types. But that probably just means returning a tuple from eachchunk in the heterogeneous case.
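
A sketch of the heterogeneous case (the eachchunk method is hypothetical, as in the rest of this thread; ArrayPartition does store its parts as a tuple in the field `x`):

```julia
using RecursiveArrayTools

# For a heterogeneous container, returning the tuple of parts keeps the
# individual chunk types inferable, unlike returning a Vector of chunks.
eachchunk(A::ArrayPartition) = A.x

A = ArrayPartition(zeros(3), ones(Int, 2))
for chunk in eachchunk(A)
    @show typeof(chunk) length(chunk)
end
```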

(fixed wording: chunk size and types → number and type of chunks)

2 Likes

Thanks for the additional pointers. I had not thought about heterogeneous arrays before and will try to consider this as well.

I drafted a small interface package here: https://github.com/meggart/ChunkedArrayBase.jl and added a few usage examples to the README that test the functionality. Basically the interface works fine, and comments on details would be very welcome as issues in the repository. I opened an issue in HDF5 as well; Zarr and NetCDF will follow. For now I have omitted the heterogeneous case, but I think it would be quite easy to extend, either by adding a second iterator (eachchunktype) or indeed by returning a tuple instead.
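
To give an idea of the kind of usage example meant here, a chunk-by-chunk copy between two formats could look roughly like this (a sketch built on the eachchunk idea above, not the package's literal code):

```julia
# Copy from one chunked format to another, one chunk at a time, so the
# full array is never materialized in memory. Assumes eachchunk(src)
# yields CartesianIndices aligned with the source's on-disk chunks.
function copydata!(dest::AbstractArray, src::AbstractArray)
    for I in eachchunk(src)
        dest[I] = src[I]
    end
    return dest
end
```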

Thanks again for the input.

3 Likes

Which of those formats allow you to add new columns or reshape the data directly on disk, without needing to copy it into a new array?
This would be very useful when dealing with larger-than-memory datasets.

I think all of them. In NetCDF/HDF5 you can define a dimension with unlimited length, which means you can keep appending data to it. For Zarr you can always resize/append along any dimension.

For all of these packages this will only be efficient if you define your chunking accordingly, so that ideally you always append complete chunks to your dataset.
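
To make that concrete, here is a small library-agnostic sketch of what "appending complete chunks" means (the function name is made up for illustration, and it assumes the current dataset length is already chunk-aligned):

```julia
# How many of the new elements can be appended as complete chunks right now;
# the remainder would be buffered until it fills the next chunk.
aligned_length(nnew, chunklen) = fld(nnew, chunklen) * chunklen

aligned_length(1000, 256)  # 768, i.e. three full chunks; 232 elements remain
```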

2 Likes