Proposal
Often I have data that is larger than memory, and I'd like to be able to work with the HDF5 file format in Julia. I propose we/I write a package (HDF5Arrays.jl) to address some of the following use cases (in addition to adding to DiskArrays.jl so that Zarr and other formats can implement the same interface):
- Creation of array-like objects from paths/HDF5.jl objects.
# Use FileIO style dispatch to dispatch to correct methods
darr = DiskArray("test.hdf5") # if only one dataset is present at the file level; otherwise error
A = DiskArray("test.hdf5", "mygroup/A") # Make DiskArray out of dataset 'A' in group 'mygroup'
using HDF5
dset = h5open("test.hdf5")["mygroup/A"]
A = DiskArray(dset)
# this may not be feasible with just 'DiskArray' and dispatch on the file format,
# so instead maybe each file format will implement its own version:
A = HDF5Array("test.hdf5", "mygroup/A")
# Even cooler:
A = HDF5Array("test.hdf5/mygroup/A")
# This wouldn't work with FileIO-style dispatch if we're using file endings,
# but a specific call to HDF5Array should make it work: it would just look for '.hdf5', '.h5', etc. to split the path
To what extent a general DiskArray constructor should exist versus each file format having its own constructors is something I'm not sure about; I've shown both above. A sketch of the dispatch idea follows.
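To illustrate the first option, here is a minimal sketch of what an extension-based dispatch layer could look like. The registry and register_backend! are made-up names, and a backend like the proposed HDF5Array would register its own constructor:
# sketch: FileIO-style dispatch on the file extension
const DISK_BACKENDS = Dict{String,Function}()
register_backend!(ext, ctor) = (DISK_BACKENDS[lowercase(ext)] = ctor)

function DiskArray(path::AbstractString, dataset::AbstractString)
    ext = lowercase(splitext(path)[2])
    haskey(DISK_BACKENDS, ext) || error("no DiskArray backend registered for '$ext'")
    return DISK_BACKENDS[ext](path, dataset)
end

# a backend package would then register itself, e.g.:
# register_backend!(".hdf5", HDF5Array)
# register_backend!(".h5", HDF5Array)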
- Creation of new DiskArrays, akin to the initialisation of Julia's core Arrays (a possible implementation is sketched after this example)
# Create a 1000x100 Float64 array in file test.hdf5 in group mygroup with name B
B = HDF5Array{Float64}("test.hdf5/mygroup/B", undef, 1000, 100)
# Extra arguments like chunking could work like so
C = HDF5Array{Int64}("test.hdf5/C", undef, 1000, 100, chunks=(100, 100))
C = HDF5Array{Int64}("test.hdf5/C", undef, 1000, 100, :chunks => (100, 100), :shuffle, :blosc => 3)
# (or we just take the same formatting as in HDF5.jl)
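For reference, a rough sketch of how the undef constructor could sit on top of HDF5.jl's existing create_dataset; hdf5array_create is a made-up helper, and a real implementation would split the combined path and return a wrapper type:
using HDF5

# sketch: create an uninitialised, optionally chunked dataset
function hdf5array_create(::Type{T}, filepath, dsetpath, dims...; chunks=nothing) where T
    h5open(filepath, "cw") do file   # "cw": read-write, create the file if missing
        kwargs = chunks === nothing ? NamedTuple() : (chunk = chunks,)
        create_dataset(file, dsetpath, datatype(T), dataspace(dims); kwargs...)
    end
    nothing   # a real implementation would return an HDF5Array wrapper instead
end

# roughly what HDF5Array{Int64}("test.hdf5/C", undef, 1000, 100, chunks=(100, 100)) would do:
hdf5array_create(Int64, "test.hdf5", "C", 1000, 100; chunks=(100, 100))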
- Work with multiple files/datasets as if they were one
# treat the 'mygroup/D' datasets in the three files as if they were one long dataset
D = HDF5Array(["file1.hdf5", "file2.hdf5", "file3.hdf5"], "mygroup/D")
# This should also work (concatenate datasets 'A', 'B' and 'C'):
multi = HDF5Array("file1.hdf5", ["mygroup/A", "mygroup/B", "mygroup/C"])
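Under the hood, the virtual concatenation could be a thin AbstractArray wrapper that maps a global index to the right dataset and a local index. A minimal 1-d sketch (type and field names are made up):
# sketch: lazily concatenate several vector-like datasets along dimension 1
struct ConcatVector{T,A<:AbstractVector{T}} <: AbstractVector{T}
    parts::Vector{A}
    offsets::Vector{Int}   # cumulative lengths; offsets[end] == total length
end

function ConcatVector(parts::Vector{A}) where {T,A<:AbstractVector{T}}
    ConcatVector{T,A}(parts, cumsum(length.(parts)))
end

Base.size(c::ConcatVector) = (c.offsets[end],)

function Base.getindex(c::ConcatVector, i::Int)
    p = searchsortedfirst(c.offsets, i)          # which part does i fall into?
    local_i = p == 1 ? i : i - c.offsets[p-1]    # index within that part
    return c.parts[p][local_i]
end

# e.g. ConcatVector([read(h5open(f)["mygroup/D"]) for f in files]);
# a real DiskArray version would index the datasets lazily instead of read-ing them up front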
- Parallelised operations on DiskArrays/HDF5Arrays
Edata = HDF5Array(["file1.hdf5", "file2.hdf5", "file3.hdf5"], "E")
# create empty datasets with chunking in the three files
result = HDF5Array{Int64}(["file1.hdf5", "file2.hdf5", "file3.hdf5"], "result", undef, 0, chunks=(1000,))
# Three workers apply some_func to each element of the three files in parallel
# and write the results to the 'result' datasets in the three files
pmap!(some_func, result, Edata)
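A pmap! like this doesn't exist yet, but the per-file variant is already expressible with Distributed, along these lines (some_func is a placeholder, and the dataset names 'E' and 'result' are taken from the example above):
using Distributed
addprocs(3)
@everywhere using HDF5
@everywhere some_func(x) = x + 1   # placeholder

files = ["file1.hdf5", "file2.hdf5", "file3.hdf5"]

# each worker handles one file: read dataset 'E', write dataset 'result';
# a chunk-wise loop instead of read() would keep memory usage bounded
pmap(files) do f
    h5open(f, "r+") do h
        h["result"] = some_func.(read(h["E"]))
    end
end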
- Possibly also functions to split one file into many and to materialize a virtual concatenation into an actual single file on disk; a sketch of the latter follows.
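Materializing could be a block-wise copy from any AbstractVector (for example the ConcatVector sketch above) into a fresh dataset. A rough sketch, assuming a 1-d source:
using HDF5

# sketch: write a (possibly lazy) 1-d source into a single dataset,
# copying in blocks so the source never has to fit in memory at once
function materialize(src::AbstractVector, filepath, dsetpath; blocksize=100_000)
    isempty(src) && return
    bs = min(blocksize, length(src))
    h5open(filepath, "cw") do file
        dset = create_dataset(file, dsetpath, datatype(eltype(src)),
                              dataspace((length(src),)); chunk=(bs,))
        for start in 1:bs:length(src)
            stop = min(start + bs - 1, length(src))
            dset[start:stop] = src[start:stop]
        end
    end
end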
- Tables! Tables.jl is cool, and we should at least have a NamedTuple of HDF5Arrays, ideally also support row tables using HDF5's compound types, and add support for HDF5's own table spec. A minimal column-table sketch follows.
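On the column-table side, a NamedTuple of vectors already satisfies the Tables.jl column-table interface, so a first (eager) version could be as simple as this. The dataset names are made up, and an HDF5Arrays version would put lazy array wrappers in place of the read calls:
using HDF5, Tables

h5open("test.hdf5", "r") do file
    # assumes 'mygroup/A' and 'mygroup/B' are 1-d datasets of equal length
    tbl = (a = read(file["mygroup/A"]), b = read(file["mygroup/B"]))
    Tables.istable(tbl)        # true: a NamedTuple of vectors is a column table
    Tables.columnnames(tbl)    # (:a, :b)
end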
Moving Forward
@AStupidBear has already started work on some of this in HDF5Utils.jl. I suggest we don't try to merge the DiskArray implementation in HDF5Utils into HDF5.jl, though, but instead work separately in HDF5Arrays.jl (or somewhere else, if you prefer different naming). After all, this is not part of an interface to HDF5 but rather a specific implementation of a DiskArray; to me, that means it shouldn't live in a wrapper/interface package.
@fabiangans What do you think about adding some of these concepts to the general DiskArray interface?
@oschulz would probably be interested in something like this as well.
Any thoughts/feedback/requests?