Yes, it’s early stages so far. I have chunked iteration implemented, but no efficient algorithms for tiled indexing yet.
I’m thinking of having the following structure:
TiledArrays.jl
This is a package that implements efficient algorithms and indexing for arrays that are best indexed into in some form of tiling. The most common form would just be a grid, but you could imagine something like a ChainedArray that consists of multiple smaller arrays chained together along some axis, like a virtual concatenation.
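To make the grid case concrete, here is a minimal sketch of the basic lookup a tiled array needs: given a global index, find which tile it lands in and where inside that tile. The name `tile_and_local` is made up for illustration and is not from any existing package; the sketch assumes 1-based indices and a regular grid where every tile has the same size.

```julia
# Hypothetical helper (not part of TiledArrays.jl or any other package):
# map a global index to (tile index, index within that tile) on a regular grid.
function tile_and_local(idx::NTuple{N,Int}, tilesize::NTuple{N,Int}) where {N}
    tile  = map((i, t) -> fld1(i, t), idx, tilesize)  # which tile along each axis
    inner = map((i, t) -> mod1(i, t), idx, tilesize)  # position inside that tile
    return tile, inner
end

# Element (5, 7) of an array tiled in 4×4 blocks sits in tile (2, 2),
# at local position (1, 3) within that tile.
tile, inner = tile_and_local((5, 7), (4, 4))
```

Efficient tiled indexing would then group requested indices by `tile`, so that each tile is read only once.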
Essentially, this package would become the proposed ChunkedArrayBase.jl in the long term.
The main innovation here is to use the IndexStyle trait rather than dispatching on the array type. That way, implementing new index types only requires dispatching on the index style, rather than on every possible array type. For now, I’ll play around with this just in TiledArrays.jl, but long term it may be worth thinking about merging this idea into Base Julia. See Issue #38284.
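Roughly, the trait idea could look like the sketch below. Everything except `IndexStyle`, `IndexLinear` and `IndexCartesian` is an assumption for illustration: `IndexTiled`, `GridArray` and `traversal` are hypothetical names, and Base's own generic functions may not know what to do with a style other than the two built-in ones, so a real implementation would need to cover those too.

```julia
# Hypothetical third index style; Base itself only ships IndexLinear and IndexCartesian.
struct IndexTiled <: Base.IndexStyle end

# Toy array type (illustration only) that opts in to the tiled style.
struct GridArray{T,N} <: AbstractArray{T,N}
    data::Array{T,N}
    tilesize::NTuple{N,Int}
end
Base.size(A::GridArray) = size(A.data)
Base.getindex(A::GridArray{T,N}, I::Vararg{Int,N}) where {T,N} = A.data[I...]
Base.IndexStyle(::Type{<:GridArray}) = IndexTiled()

# An algorithm picks its strategy from the trait, never from the concrete array type.
traversal(A::AbstractArray) = traversal(IndexStyle(A), A)
traversal(::IndexLinear, A)    = "linear"
traversal(::IndexCartesian, A) = "cartesian"
traversal(::IndexTiled, A)     = "tile by tile"

traversal(zeros(4, 4));                     # plain Array, takes the IndexLinear method
traversal(GridArray(zeros(4, 4), (2, 2)));  # takes the IndexTiled method
```

Any other array type that declares `IndexTiled()` would pick up the tiled method without `traversal` ever mentioning that type.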
This may not be its own package but may instead get merged into TiledIteration.jl. See Issue #24.
DiskArrays.jl
@fabiangans has already implemented quite a cool feature set. Long term, I’d propose moving the chunking/tiling related algorithmic aspects to TiledArrays.jl / TiledIteration.jl, so that any kind of tiled array can benefit from this work.
If TiledArrays.jl can make it very simple to implement tiled / chunked array types, maybe DiskArrays wouldn’t even be needed anymore at that point. But I’m not so sure here. @fabiangans might have some more insight.
HDF5Arrays.jl
This is the implementation of an array type with data stored on disk in HDF5 format. This should be able to do everything an Array can do in the long term. So, I’d even like to implement functions like cat, for example (either create a virtual concatenation or actually write a new dataset to disk).
The cool thing about having a good implementation of something like an HDF5Array is that once this becomes a fully fledged array type, it can work with every package that expects arrays. Say you want a DataFrame to hold ungodly amounts of data: no worries, just use HDF5Vectors rather than Vectors as columns. Say you want to plot a dataset stored on disk in HDF5: sure, you could write some code yourself to push! data points to a plot one chunk at a time, but with something like an HDF5Array you can just do plot(h5arr) and it will just work! (Hopefully.) Or, say you want to use JuliaDB’s / IndexedTables’ cool tools, but also want your data in HDF5 format rather than Dagger binary files or CSV files: well, just make an IndexedTable of HDF5Vectors and it’ll work. (Again, hopefully.)
As for merging this into HDF5.jl, I’m a bit hesitant. To me, HDF5.jl is an IO package. HDF5Arrays.jl will become an array implementation that uses some kind of HDF5 IO library, but it’s not fundamentally part of an IO module. Imagine someone starts working on a pure Julia implementation of an HDF5 IO library (like JLD2) and I wanted to use that new library in some parts. Would that make sense if HDF5Arrays.jl was part of HDF5.jl?
ChainedArrays.jl
This allows for treating a chain of arrays as a single array. I want this because I want to be able to treat a dataset spread across multiple files as if it were one contiguous dataset. This will either be its own thing or be implemented in TiledArrays.jl. Not sure yet.
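A one-dimensional sketch of what I mean by a chain, using in-memory vectors as stand-ins for per-file datasets. `ChainedVector` and its fields are hypothetical names for illustration, not an existing API.

```julia
# Hypothetical type (illustration only): several vectors presented as one, without copying.
struct ChainedVector{T} <: AbstractVector{T}
    parts::Vector{Vector{T}}
    offsets::Vector{Int}  # offsets[k] = number of elements before parts[k]; last entry = total length
end

function ChainedVector(parts::Vector{Vector{T}}) where {T}
    offsets = cumsum([0; length.(parts)])
    return ChainedVector{T}(parts, offsets)
end

Base.size(v::ChainedVector) = (v.offsets[end],)
Base.IndexStyle(::Type{<:ChainedVector}) = IndexLinear()

function Base.getindex(v::ChainedVector, i::Int)
    @boundscheck checkbounds(v, i)
    k = searchsortedlast(v.offsets, i - 1)  # part that contains global index i
    return v.parts[k][i - v.offsets[k]]     # translate to that part's local index
end

v = ChainedVector([[1, 2, 3], [4, 5]]);
v[4]    # first element of the second part
sum(v)  # generic AbstractVector code works unchanged
```

Because it subtypes AbstractVector, generic functions like `sum` and `collect` work right away. The binary search over `offsets` makes each scalar lookup logarithmic in the number of parts; a version for real files would want to batch lookups per part instead.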
Next Steps
The biggest obstacle right now is the lack of a good concept for an interface, as well as of efficient algorithms for tiled arrays. I have some ideas, but if anybody wants to help with this, I’d be very receptive.