Proposal: working with larger-than-memory data in HDF5 format using HDF5Arrays (an implementation of DiskArrays.jl for HDF5)

Proposal

I often have data that is larger than memory, and I’d like to be able to work with the HDF5 file format in Julia. I propose we/I write a package (HDF5Arrays.jl) to address some of the following use cases (in addition to extending DiskArrays.jl so that Zarr and other formats can implement the same interface):

  1. Creation of array-like objects from paths/HDF5.jl objects.
# Use FileIO style dispatch to dispatch to correct methods
darr = DiskArray("test.hdf5") # if only one dataset present at file level, else: error
A = DiskArray("test.hdf5", "mygroup/A") # Make DiskArray out of dataset 'A' in group 'mygroup'

using HDF5
dset = h5open("test.hdf5")["mygroup/A"]
A = DiskArray(dset)

# This may not be feasible with just 'DiskArray' and dispatch on file format.
# So, instead, maybe each file format will implement its own version:
A = HDF5Array("test.hdf5", "mygroup/A")

# Even cooler:
A = HDF5Array("test.hdf5/mygroup/A")
# This wouldn't work with FileIO-style dispatch if we're using file endings.
# But the specific call to HDF5Array should make this work. It'll just look for '.hdf5', '.h5', etc. to split the path.

To what extent a general DiskArray constructor should exist vs. each file format having its own constructors is something I’m not sure about. I’ve shown both above.
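One way the generic constructor could work is a simple extension registry that maps file endings to backend constructors. This is only a sketch; `DISKARRAY_BACKENDS` and `register_backend!` are hypothetical names, not an existing API.

```julia
# Hypothetical sketch: backends register a constructor per file extension,
# and the generic DiskArray entry point dispatches on the extension.
const DISKARRAY_BACKENDS = Dict{String,Function}()

register_backend!(ext::AbstractString, constructor::Function) =
    (DISKARRAY_BACKENDS[lowercase(ext)] = constructor)

function DiskArray(path::AbstractString, args...; kwargs...)
    ext = lowercase(splitext(path)[2])
    haskey(DISKARRAY_BACKENDS, ext) ||
        error("no DiskArray backend registered for extension '$ext'")
    return DISKARRAY_BACKENDS[ext](path, args...; kwargs...)
end
```

A backend like HDF5Arrays would then call `register_backend!(".hdf5", HDF5Array)` (and likewise for `.h5`) when it loads.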

  2. Creation of new DiskArrays, akin to the initialisation of core Julia Arrays
# Create a 1000x100 Float64 array in file test.hdf5 in group mygroup with name B
B = HDF5Array{Float64}("test.hdf5/mygroup/B", undef, 1000, 100)

# Extra arguments like chunking could work like so
C = HDF5Array{Int64}("test.hdf5/C", undef, 1000, 100, chunks=(100, 100))
C = HDF5Array{Int64}("test.hdf5/C", undef, 1000, 100, :chunks => (100, 100), :shuffle, :blosc => 3)
# (or we just take the same formatting as in HDF5.jl)
  3. Work with multiple files/datasets as if they were one
# treat the datasets in the three files in 'mygroup/D' as if they were one long dataset
D = HDF5Array(["file1.hdf5", "file2.hdf5", "file3.hdf5"], "mygroup/D")

# This should also work (concatenate datasets 'A', 'B' and 'C'):
multi = HDF5Array("file1.hdf5", ["mygroup/A", "mygroup/B", "mygroup/C"])
  4. Parallelised operations on DiskArrays/HDF5Arrays
Edata = HDF5Array(["file1.hdf5", "file2.hdf5", "file3.hdf5"], "E")

# create empty datasets with chunking in the three files
result = HDF5Array{Int64}(["file1.hdf5", "file2.hdf5", "file3.hdf5"], "result", undef, 0, chunks=(1000,))

# Three workers apply some_func to each element of the three files in parallel
# and write the result to the three files in 'result'
pmap!(some_func, result, Edata)
  5. Possibly also functions to split one file into many, and to materialise a virtual concatenation into an actual single file on disk

  6. Tables!

Tables.jl is cool, and we should at least have a NamedTuple of HDF5Arrays; ideally, we’d also support row tables using HDF5’s compound types. Also, add support for HDF5’s own table spec.
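To illustrate why the column-table case is cheap to get: Tables.jl already treats a NamedTuple of equal-length AbstractVectors as a column table, so a NamedTuple of HDF5Vectors would qualify as soon as they satisfy the AbstractArray interface. A toy row iterator over such a column store, with plain Vectors standing in for HDF5Vectors (the `rows` helper is purely illustrative):

```julia
# Plain Vectors stand in for hypothetical HDF5Vectors here.
cols = (a = [1, 2, 3], b = [4.0, 5.0, 6.0])

# Lazily view a NamedTuple of columns as an iterator of NamedTuple rows.
rows(cols::NamedTuple) =
    (NamedTuple{keys(cols)}(getindex.(values(cols), i))
     for i in 1:length(first(cols)))

first(rows(cols))  # (a = 1, b = 4.0)
```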

Moving Forward

@AStupidBear has already started work on some of this in HDF5Utils.jl. I suggest we don’t try to merge the DiskArray implementation in HDF5Utils into HDF5.jl though, but instead work separately in HDF5Arrays.jl (or somewhere else if you prefer different naming). After all, this is not part of an interface to HDF5, but rather a specific implementation of a DiskArray. To me, that means it shouldn’t live in a wrapper/interface package.

@fabiangans What do you think about adding some of these concepts to the general DiskArray interface?

@oschulz would probably be interested in something like this as well.

Any thoughts/feedback/requests?


tagging: @musm

Ok, I have a minimum working example at this point without HDF5Utils. I’ll start adding features from here.

It would be great to have a reshape function able to work with larger-than-memory data (stack/melt), and also to join two larger-than-memory datasets directly on disk.

It would also be a good idea to use TileDB as a database and file format; it’s supposed to be much faster.

I think DiskArrays already supports some kind of reshape, so it should already just work or be relatively easy to implement. And yes, merging, splitting and reshaping multiple datasets on disk are features I’d like to add.

TileDB sounds cool, but for my purposes I need hdf5.

Yeah, this sounds like a good idea. It’s a bit difficult to determine what needs to be done at this point, since it’s in very early stages (unless I’m mistaken).

Could you clarify how all these packages would tie together: HDF5Arrays, DiskArrays, TiledArrays?

It seems like HDF5Arrays should eventually be integrated into HDF5.jl?

I’ll also tag @jmert on this


I don’t see the need for this. When things get too large, is it not enough to just do something like this?

file = h5open("file.h5", "w")
set1 = d_create(file, "set1", datatype(Float64), dataspace(dim1, dim2, dim3))
set2 = d_create(file, "set2", datatype(Float64), dataspace(dim1, dim2, dim3))

# fill arrays
# ...
# close file
flush(file)
close(file)

and similarly for reading back. [At least I think this addresses the title.]

Yes, it’s early stages so far. I have chunked iteration implemented, but no efficient algorithms for tiled indexing yet.

I’m thinking of having the following structure:

TiledArrays.jl

This is a package that implements efficient algorithms and indexing for arrays that are best indexed into in some form of tiling. The most common form would just be a grid, but you could imagine something like a ChainedArray that consists of multiple smaller arrays chained together along some axis – like a virtual concatenation.

Essentially, this package would become the proposed ChunkedArrayBase.jl in the long term.

The main innovation here is to use the IndexStyle trait rather than dispatching on the array type. That way, implementing new index types only requires dispatching on the index style, rather than every possible array type. For now, I’ll play around with this just in TiledArrays.jl, but long term it may be worth thinking about merging this idea into base julia. See Issue #38284.
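To make the trait idea concrete, here is a rough sketch with hypothetical names (`TileStyle`, `tiled_getindex`); the real design would hook into `Base.getindex` itself:

```julia
# Hypothetical trait: how an array prefers to be indexed.
abstract type TileStyle end
struct Untiled <: TileStyle end    # ordinary in-memory arrays
struct GridTiled <: TileStyle end  # chunked/tiled arrays (e.g. on disk)

TileStyle(::Type{<:AbstractArray}) = Untiled()  # default style

# The generic entry point dispatches on the trait, not the array type,
# so a new backend only has to declare its TileStyle.
tiled_getindex(A::AbstractArray, I...) =
    tiled_getindex(TileStyle(typeof(A)), A, I...)
tiled_getindex(::Untiled, A, I...) = getindex(A, I...)
tiled_getindex(::GridTiled, A, I...) =
    error("a chunk-aware read would be implemented here")
```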

This may not be its own package but may instead get merged into TiledIteration.jl. See Issue #24.

DiskArrays.jl

@fabiangans has already implemented quite a cool feature set. Long term, I’d propose moving the chunking/tiling related algorithmic aspects to TiledArrays.jl / TiledIteration.jl, so that any kind of tiled array can benefit from this work.

If TiledArrays.jl can make it very simple to implement tiled / chunked array types, maybe DiskArrays wouldn’t even be needed anymore at that point. But I’m not so sure here. @fabiangans might have some more insight.

HDF5Arrays.jl

This is the implementation of an array type, with data stored on disk in HDF5 format. In the long term, this should be able to do everything an Array can do. So I’d even like to implement functions like cat, for example (either creating a virtual concatenation or actually writing a new dataset to disk).

The cool thing about having a good implementation of something like an HDF5Array is that once this becomes a fully fledged array type, it can work with every package that expects arrays. Say you want a DataFrame to hold ungodly amounts of data: no worries, just use HDF5Vectors rather than Vectors as columns. Say you want to plot a dataset stored on disk in HDF5 – sure, you could write some code yourself to push! data points to a plot one chunk at a time, but with something like an HDF5Array you can just do plot(h5arr) and it will just work! (Hopefully.) Or, say you want to use JuliaDB’s / IndexedTables’ cool tools, but also want your data in HDF5 format rather than Dagger binary files or CSV files: just make an IndexedTable of HDF5Vectors and it’ll work. (Again, hopefully.)

As for merging this into HDF5.jl, I’m a bit hesitant. To me, HDF5.jl is an IO package. HDF5Arrays.jl will become an array implementation that uses some kind of HDF5 IO library, but it’s not fundamentally part of an IO module. Imagine someone started working on a pure-Julia implementation of an HDF5 IO library (like JLD2) and I wanted to use that new library in some parts. Would that make sense if HDF5Arrays.jl were part of HDF5.jl?

ChainedArrays.jl

This allows for treating a chain of arrays as a single array. I want this because I want to be able to treat a dataset spread across multiple files as if it were one contiguous dataset. This will either be its own thing or be implemented in TiledArrays.jl. Not sure yet.
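A minimal sketch of the idea for the one-dimensional case (the type and its layout are illustrative only):

```julia
# Treat several vectors as one long vector: a virtual concatenation.
struct ChainedVector{T,V<:AbstractVector{T}} <: AbstractVector{T}
    parts::Vector{V}
    offsets::Vector{Int}  # cumulative lengths, for fast index lookup
end

ChainedVector(parts::Vector{V}) where {T,V<:AbstractVector{T}} =
    ChainedVector{T,V}(parts, cumsum(length.(parts)))

Base.size(c::ChainedVector) = (isempty(c.offsets) ? 0 : c.offsets[end],)

function Base.getindex(c::ChainedVector, i::Int)
    part = searchsortedfirst(c.offsets, i)      # which part holds index i
    prev = part == 1 ? 0 : c.offsets[part - 1]  # elements before that part
    return c.parts[part][i - prev]
end
```

With data spread across files, each `part` would be an HDF5Array instead of an in-memory Vector.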

Next Steps

The biggest obstacle right now is a good concept for an interface as well as efficient algorithms for tiled arrays. I have some ideas, but if anybody wants to help on this, I’d be very receptive.


This addresses the problem only for the most simple operations.

Say you wanted an IndexedTable of HDF5 Datasets. How would you do that currently?

Say you wanted to sort your gigantic dataset to make groupby operations faster, how would you do that at the moment? You can’t read in the whole dataset, so you’d have to start writing your own custom sorting algorithms. Once you start down that road, you’re basically implementing your own version of HDF5Arrays.jl.
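For a sense of what such an algorithm looks like, here is a toy external merge sort over chunks that each fit in memory; in-memory vectors stand in for on-disk chunks, and the function name is purely illustrative:

```julia
# Sort each chunk independently, then k-way merge across the chunk heads.
function external_sort(chunks::Vector{Vector{Int}})
    sorted = [sort(c) for c in chunks]  # in reality: sort and write back per chunk
    out = Int[]
    idx = ones(Int, length(sorted))     # read cursor into each chunk
    total = sum(length, sorted)
    while length(out) < total
        best, bestval = 0, typemax(Int)
        for (k, c) in enumerate(sorted)  # pick the smallest current head
            if idx[k] <= length(c) && c[idx[k]] < bestval
                best, bestval = k, c[idx[k]]
            end
        end
        push!(out, bestval)
        idx[best] += 1
    end
    return out
end
```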

Many thanks for the details! I’ll have to follow and explore all the threads more fully. For now, it does make sense to see how HDF5Arrays.jl pans out independently. If all goes to plan – in other words, integration of ChainedArrays.jl and DiskArrays.jl with TiledArrays.jl (which I think could be a good way to consolidate everything) – it might turn out that HDF5Arrays.jl ends up being quite a small package over TiledArrays, and that HDF5.jl would then just reexport the package. It does sound like we might end up needing to return an HDF5Array within HDF5.jl for Array return types to make handling them more ergonomic, since right now HDF5.jl doesn’t have a great AbstractArray interface and has many methods missing. Glad to see this being worked on!


Thanks for the clarification here. As I interpret it, the largest step forward would be to implement the functionality currently available through DiskArrays by implementing a new IndexStyle instead of creating a new data type. I really like this idea and tinkered a bit with it, in particular when I implemented broadcasting, but I came to the conclusion that some changes to Base would be needed to implement all the behavior, and I chose the quick-and-dirty path with DiskArrays.jl.

So yes, I would be happy if DiskArrays became obsolete through defining a custom IndexStyle.

Similar to @musm, I don’t understand why we need an extra HDF5Arrays package. The aim of DiskArrays was basically exactly what you describe in that section: make arrays on disk behave exactly like Julia arrays, and not just for a single backend but across backends. So all the functionality described in that section can, I think, be implemented generically for Zarr, NetCDF and HDF5 (maybe TileDB, once we get a Julia package), and should not be specific to HDF5.

Regarding concatenation of arrays, I have a version in DiskArrayTools.jl which is optimized for DiskArrays. When working on your ChainedArrays you might want to have a look at the implementation in LazyArrays.jl as well.


First off, thanks for the link to DiskArrayTools.jl. That’s exactly what I was looking for. :smiley:

You (@fabiangans) have a lot more experience with these kinds of chunked or tiled arrays. I have some ideas for how to do things but, if possible, I’d like to get your input. Would you be amenable to a private discussion some time, so I can ask you some questions?

On merging

I agree, that all of these backends (HDF5, Zarr, etc.) can and should be able to use the same code. Ideally, HDF5Arrays (like equivalents for other backends) should be fairly compact packages that use DiskArrays.jl or may be something like a more general TiledArrays.jl in future.

While wanting to avoid five versions of {FileFormat}Arrays.jl in the registry / in the Julia arrays org is a reasonable argument for merging, I would note that multiple packages implement HDF5 IO, namely HDF5.jl, JLD.jl and JLD2.jl. To me, it should be possible to make an HDF5Array, or a subtype thereof, out of a JLDDataset. This is why I see HDF5Arrays as separate from any specific implementation of HDF5-style IO. You could also imagine people choosing different backends, similar to how Plots.jl has the same interface but different backends.

Note also that several aspects of an HDF5Array implementation might work differently from HDF5.jl itself. For example, say we decided to extend the get_chunk function. It should be possible to have an HDF5Array that iterates in chunks larger than the chunks stored on disk; get_chunk can thus return different results when applied to an HDF5Array and to its underlying HDF5Dataset. An even clearer example: I think it would be good to make

HDF5Array{MyStruct, 2}(path, internal_path, undef, 1000, 20)

possible by using JLD / JLD2 to serialise more complex element types. This would mean not only that HDF5.jl would suddenly have JLD2 and JLD as dependencies, but also that users of HDF5Arrays wouldn’t even be guaranteed to get objects they can inspect with tools from HDF5.jl.

I will say: if we did merge, I think we should have a clear (minimalistic) interface to HDF5Arrays. I’d propose, making the following part of the public interface. Everything else is implementation detail.

  • Data must be stored on disk in a HDF5 compatible format.

This may include JLD and JLD2, for example, even if they only cover a subset of the HDF5 spec.

  • Any HDF5Array satisfies the AbstractArray interface

eachindex must return an iterator over every index; getindex, setindex!, size, etc. should all work as expected, regardless of any underlying tiling / chunking.

  • eachchunk must return an iterator over chunks (not indices, but actual values) of an array

  • Customisation options

This includes setting chunk sizes on construction, defining where a dataset should be stored, and whether a dataset should be memory-mapped or compressed. Note that the names and exact behaviours may differ from HDF5.jl.

HDF5Array{Float64, 2}(path, internal_path, undef, 1000, 20, "chunk", (100, 5))

will not be supported by HDF5Arrays for example, even if HDF5.jl supports it. On the other hand, HDF5Arrays may also support features that HDF5.jl doesn’t, like serialisation of more complex types that “just works”.
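Sketched as code, the proposed public interface could be as small as this (names follow the bullets above; `AbstractHDF5Array` itself is hypothetical):

```julia
# Any backend (HDF5.jl-based, JLD2-based, ...) subtypes this and
# inherits the AbstractArray contract.
abstract type AbstractHDF5Array{T,N} <: AbstractArray{T,N} end

# From the AbstractArray interface, concrete backends must implement at least:
#   Base.size(A), Base.getindex(A, i...), Base.setindex!(A, v, i...)

# HDF5Arrays-specific: iterate over chunks of values (not indices).
function eachchunk end
```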