Interpolate over large HDF5 arrays

Just for reference: depending on the use case, this might not be the most performant solution. An alternative AbstractArray implementation could use the DiskArrays.jl interface, which always tries to access the array in large blocks instead of relying on random access to single points, as the pure AbstractArray interface does. The following example creates an interpolated HDF5-based array.

First, create a small wrapper around an HDF5 Dataset that implements the DiskArrays interface:

using HDF5, DiskArrays
import DiskArrays: eachchunk, haschunks, readblock!, writeblock!, GridChunks, Chunked, Unchunked

# Implement the DiskArrays interface for HDF5 as proposed in
# https://github.com/JuliaIO/HDF5.jl/issues/615
struct HDF5DiskArray{T,N,CS} <: AbstractDiskArray{T,N}
    ds::HDF5.Dataset
    cs::CS
end
Base.size(x::HDF5DiskArray) = size(x.ds)

# Report the chunk structure: datasets with contiguous layout (cs === nothing)
# are Unchunked, all others expose their chunk grid
haschunks(x::HDF5DiskArray{<:Any,<:Any,Nothing}) = Unchunked()
haschunks(x::HDF5DiskArray) = Chunked()
eachchunk(x::HDF5DiskArray{<:Any,<:Any,<:GridChunks}) = x.cs

# Read and write whole blocks at once through HDF5's range indexing
readblock!(x::HDF5DiskArray, aout, r::AbstractUnitRange...) = aout .= x.ds[r...]
writeblock!(x::HDF5DiskArray, v, r::AbstractUnitRange...) = x.ds[r...] = v

function HDF5DiskArray(ds::HDF5.Dataset)
    # Querying the chunk layout throws for contiguous datasets, so fall back to nothing
    cs = try
        GridChunks(ds, HDF5.get_chunk(ds))
    catch
        nothing
    end
    HDF5DiskArray{eltype(ds),ndims(ds),typeof(cs)}(ds, cs)
end

Next, create a test dataset and wrap it:

# Create an example file with a chunked 10x10x10 Float32 dataset
h5open("example.h5", "w") do fid
    g = create_group(fid, "mygroup")
    dset = create_dataset(g, "myvector", Float32, (10, 10, 10), chunk=(5, 5, 5))
    values = [Float32(i) for i in 1:10, j in 1:10, k in 1:10]
    write(dset, values)
end

fid = h5open("example.h5", "r")
da = HDF5DiskArray(fid["mygroup/myvector"])
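
With the wrapper in place, da behaves like an ordinary Julia array, except that indexing is routed through readblock! and fetches whole blocks. A quick sanity check, as a minimal sketch assuming the example file created above:

da isa AbstractArray    # true, so generic array code works on it
size(da)                # (10, 10, 10)
eachchunk(da)           # the (5, 5, 5) chunk grid read from the file
da[1:5, 1:5, 1]         # a 5x5 slice, fetched with a single block read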

Now use InterpolatedDiskArray from DiskArrayTools.jl to create an interpolated view into the array:

using DiskArrayTools: InterpolatedDiskArray
using Interpolations

# Interpolate the first two dimensions onto a finer 91-point grid, keep the third as-is
target_indices = (range(1, 10, length=91), range(1, 10, length=91), 1:10)
newsize = length.(target_indices)
newchunks = GridChunks(newsize, (50, 50, 5))

di = InterpolatedDiskArray(da, newchunks, target_indices..., order=Quadratic())

# Compute the sum over the last dimension. Every chunk is accessed only once,
# so this is quite efficient and works on arbitrary array sizes without
# running out of memory.
sum(di, dims=3)
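
Since fid was opened without a do block here, remember to close it when you are done:

close(fid)   # release the HDF5 file handle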

If your workflow depends on random access to single locations in your dataset, this solution will bring no benefit. However, if you rather do larger batch processing, working through chunks of data piece by piece, it might be worth looking into the DiskArrays machinery.
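
As an illustration of such a chunk-by-chunk workflow, here is a minimal sketch against the da wrapper from above; it assumes that eachchunk yields tuples of index ranges, as DiskArrays' GridChunks does:

# Process the dataset one block at a time; each iteration reads one (5,5,5) chunk
for c in eachchunk(da)
    block = da[c...]   # c is a tuple of ranges covering a single chunk
    # ... process `block`, e.g. accumulate a running statistic
end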
