Interpolate over large HDF5 arrays

Just for reference: depending on the use case, this might not be the most performant solution. An alternative AbstractArray implementation could use the DiskArrays.jl interface, which always tries to access the array in large blocks instead of relying on random access to single points, as the pure AbstractArray interface does. The following example creates an interpolated HDF5-based array.

First, create a small wrapper around an HDF5 Dataset that implements the DiskArrays interface:

using HDF5, DiskArrays
import DiskArrays: eachchunk, haschunks, readblock!, writeblock!, GridChunks, Chunked, Unchunked

# Implement the DiskArrays interface for HDF5 as proposed in
# https://github.com/JuliaIO/HDF5.jl/issues/615
struct HDF5DiskArray{T,N,CS} <: AbstractDiskArray{T,N}
    ds::HDF5.Dataset
    cs::CS
end
Base.size(x::HDF5DiskArray) = size(x.ds)

# Report the chunk structure: datasets with contiguous layout (cs === nothing)
# are Unchunked, all others expose their chunk grid
haschunks(x::HDF5DiskArray{<:Any,<:Any,Nothing}) = Unchunked()
haschunks(x::HDF5DiskArray) = Chunked()
eachchunk(x::HDF5DiskArray{<:Any,<:Any,<:GridChunks}) = x.cs

# Read and write whole blocks at once through HDF5's range indexing
readblock!(x::HDF5DiskArray, aout, r::AbstractUnitRange...) = aout .= x.ds[r...]
writeblock!(x::HDF5DiskArray, v, r::AbstractUnitRange...) = x.ds[r...] = v

function HDF5DiskArray(ds::HDF5.Dataset)
    # Querying the chunk layout throws for contiguous datasets, so fall back to nothing
    cs = try
        GridChunks(ds, HDF5.get_chunk(ds))
    catch
        nothing
    end
    HDF5DiskArray{eltype(ds),ndims(ds),typeof(cs)}(ds, cs)
end

Next, create a test dataset and wrap it:

# Create an example file with a chunked 10x10x10 Float32 dataset
h5open("example.h5", "w") do fid
    g = create_group(fid, "mygroup")
    dset = create_dataset(g, "myvector", Float32, (10, 10, 10), chunk=(5, 5, 5))
    values = [Float32(i) for i in 1:10, j in 1:10, k in 1:10]
    write(dset, values)
end

fid = h5open("example.h5", "r")
da = HDF5DiskArray(fid["mygroup/myvector"])
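
With the wrapper in place, da behaves like an ordinary Julia array, except that indexing is routed through readblock! and fetches whole blocks. A quick sanity check, as a minimal sketch assuming the example file created above:

da isa AbstractArray    # true, so generic array code works on it
size(da)                # (10, 10, 10)
eachchunk(da)           # the (5, 5, 5) chunk grid read from the file
da[1:5, 1:5, 1]         # a 5x5 slice, fetched with a single block read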

Now use InterpolatedDiskArray from DiskArrayTools.jl to create an interpolated view into the array:

using DiskArrayTools: InterpolatedDiskArray
using Interpolations

# Interpolate the first two dimensions onto a finer 91-point grid, keep the third as-is
target_indices = (range(1, 10, length=91), range(1, 10, length=91), 1:10)
newsize = length.(target_indices)
newchunks = GridChunks(newsize, (50, 50, 5))

di = InterpolatedDiskArray(da, newchunks, target_indices..., order=Quadratic())

# Compute the sum over the last dimension. Every chunk is accessed only once,
# so this is quite efficient and works on arbitrary array sizes without
# running out of memory.
sum(di, dims=3)
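
Since fid was opened without a do block here, remember to close it when you are done:

close(fid)   # release the HDF5 file handle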

If your workflow depends on random access to single locations in your dataset, this solution will bring no benefit. However, if you rather do larger batch processing, working through chunks of data piece by piece, it might be worth looking into the DiskArrays machinery.
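
As an illustration of such a chunk-by-chunk workflow, here is a minimal sketch against the da wrapper from above; it assumes that eachchunk yields tuples of index ranges, as DiskArrays' GridChunks does:

# Process the dataset one block at a time; each iteration reads one (5,5,5) chunk
for c in eachchunk(da)
    block = da[c...]   # c is a tuple of ranges covering a single chunk
    # ... process `block`, e.g. accumulate a running statistic
end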
