Interpolate over large HDF5 arrays

henry2004y · March 18, 2025, 4:17am

Hi,

I have a extremely large standard HDF5 file (2 TB) which saves a 3D array, let’s say B. With HDF5.jl, I can access individual elements of the array in the same way as this thread shows: load-hdf5-file-larger-than-memory.

using HDF5

fid = h5open(fname, "r")
B = fid["B"] # size(B) = (10000, 10000, 4992), eltype(B) = Float32
B[10,10, 1] # this works

However, when I tried to perform interpolations over B using Interpolations.jl, I got

using Interpolations

itp = extrapolate(interpolate(B, BSpline(Linear())), NaN)

ERROR: MethodError: no method matching interpolate(::HDF5.Dataset, ::BSpline{Linear{Throw{OnGrid}}})
The function `interpolate` exists, but no method is defined for this combination of argument types.

Closest candidates are:
  interpolate(::AbstractArray, ::IT, ::GT) where {IT<:Union{NoInterp, Tuple{Vararg{Union{NoInterp, BSpline}}}, BSpline}, GT<:Union{NoInterp, Tuple{Vararg{Union{NoInterp, Interpolations.GridType}}}, Interpolations.GridType}}
   @ Interpolations ~/.julia/packages/Interpolations/91PhN/src/deprecations.jl:19
  interpolate(::AbstractArray, ::IT) where IT<:Union{NoInterp, Tuple{Vararg{Union{NoInterp, BSpline}}}, BSpline}
   @ Interpolations ~/.julia/packages/Interpolations/91PhN/src/b-splines/b-splines.jl:190
  interpolate(::AbstractArray, ::IT, ::Real, ::Int64) where IT<:Union{NoInterp, Tuple{Vararg{Union{NoInterp, BSpline}}}, BSpline}
   @ Interpolations ~/.julia/packages/Interpolations/91PhN/src/b-splines/b-splines.jl:217

This means that Interpolations.jl expects an AbstractArray as input, but HDF5.Dataset is not an AbstractArray. This array is so large that I cannot possibly read it into memory as a whole. Is it possible to interpolate over such large array?

Salmon · March 18, 2025, 5:54am

Sounds like HDF5 memory mapping would be good here.

henry2004y · March 18, 2025, 12:41pm

Yes, but unfortunately when I checked with HDF5.ismmappable, it returned false. I don’t know why, but probably it’s compressed.

Salmon · March 18, 2025, 1:04pm

I see, it would make sense to compress such a big data set of course.

What should probably still work is to simply create a wrapper that subtypes AbstractArray{T,N}.
The following seems to work on first glance, though I am not sure if Interpolations does not allocate the full array under the hood somehow. The reason I am suspecting something like that is because I can still access the interpolation after deleting the file.

using HDF5, Interpolations

struct HDF5MemoryArray{T,N} <: AbstractArray{T,N}
    data::HDF5.Dataset
    function HDF5MemoryArray(data::HDF5.Dataset)
        T = eltype(data)
        N = ndims(data)
        new{T,N}(data)
    end
end
Base.size(a::HDF5MemoryArray) = size(a.data)
Base.getindex(a::HDF5MemoryArray, i::Vararg{Int,N}) where N = a.data[i...]
Base.getindex(a::HDF5MemoryArray, i::Int) = a.data[i]

itp = h5open("a.h5","r") do file
        
    B = HDF5MemoryArray(file["data"])
        
    itp = extrapolate(interpolate(B, BSpline(Linear())), NaN)
end

But maybe this is a starting point

Edit:looks like the information is stored in itp.itp.coefs. I think this is more likely an issue with interpolations not being able to use a lazy interpolation here…

henry2004y · March 18, 2025, 1:09pm

I’ll try! Meanwhile, I found this issue in the HDF5.jl repository: Dataset as an AbstractArray. What is stopping this from being implemented?

Salmon · March 18, 2025, 1:14pm

At first glance at the github issue it looks like its more about technicalities, i.e. avoiding breaking changes. As far as i can tell there should be no issue in creating your own wrapper until HDF5.Dataset is an AbstractArray

henry2004y · March 18, 2025, 2:04pm

I quickly tried this on the large HDF5 array. As you mentioned, this might be a problem now for Interpolations.jl:

julia> itp = extrapolate(interpolate(B, BSpline(Linear())), NaN)
ERROR: OutOfMemoryError()
Stacktrace:
  [1] GenericMemory
    @ ./boot.jl:516 [inlined]
  [2] new_as_memoryref
    @ ./boot.jl:535 [inlined]
  [3] Array
    @ ./boot.jl:585 [inlined]
  [4] Array
    @ ./boot.jl:593 [inlined]
  [5] Array
    @ ./boot.jl:599 [inlined]
  [6] padded_similar(::Type{Float32}, inds::Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}, Base.OneTo{Int64}})
    @ Interpolations ~/.julia/packages/Interpolations/91PhN/src/b-splines/prefiltering.jl:5
  [7] copy_with_padding(::Type{Float32}, A::HDF5MemoryArray{Float32, 3}, it::BSpline{Linear{Throw{OnGrid}}})
    @ Interpolations ~/.julia/packages/Interpolations/91PhN/src/b-splines/prefiltering.jl:23
  [8] prefilter(::Type{Float32}, ::Type{Float32}, A::HDF5MemoryArray{Float32, 3}, it::BSpline{Linear{Throw{OnGrid}}})
    @ Interpolations ~/.julia/packages/Interpolations/91PhN/src/b-splines/prefiltering.jl:39
  [9] interpolate(::Type{Float32}, ::Type{Float32}, A::HDF5MemoryArray{Float32, 3}, it::BSpline{Linear{Throw{OnGrid}}})
    @ Interpolations ~/.julia/packages/Interpolations/91PhN/src/b-splines/b-splines.jl:167
 [10] interpolate(A::HDF5MemoryArray{Float32, 3}, it::BSpline{Linear{Throw{OnGrid}}})
    @ Interpolations ~/.julia/packages/Interpolations/91PhN/src/b-splines/b-splines.jl:191
 [11] top-level scope
    @ REPL[13]:1

Maybe I should try some other interpolation libraries?

stevengj · March 18, 2025, 4:50pm

Or you could implement it yourself. Linear interpolation on a regular grid is just a few lines of code.

henry2004y · March 18, 2025, 6:49pm

Following the suggestions, I quickly came up the following code for handling linear interpolations:

using HDF5

struct HDF5MemoryArray{T,N} <: AbstractArray{T,N}
   data::HDF5.Dataset
   function HDF5MemoryArray(data::HDF5.Dataset)
      T = eltype(data)
      N = ndims(data)
      new{T,N}(data)
   end
end
Base.size(a::HDF5MemoryArray) = size(a.data)
Base.getindex(a::HDF5MemoryArray{T,N}, i::Vararg{Int,N}) where {T, N} = a.data[i...]::T
Base.getindex(a::HDF5MemoryArray{T,N}, i::Int) where {T, N} = a.data[i]::T


"""
    trilin_reg(x, y, z, Q)

Trilinear interpolation for x1,y1,z1=(0,0,0) and x2,y2,z2=(1,1,1)
Q's are surrounding points such that Q000 = F[0,0,0], Q100 = F[1,0,0], etc.
"""
function trilin_reg(x::T, y::T, z::T, Q000, Q100, Q010, Q110, Q001, Q101, Q011, Q111) where {T<:Real}
   oneT = one(T)
   mx = oneT - x
   my = oneT - y
   mz = oneT - z
   fout =
      Q000 * mx * my * mz +
      Q100 * x  * my * mz +
      Q010 * y  * mx * mz +
      Q110 * x  * y  * mz +
      Q001 * mx * my * z +
      Q101 * x  * my * z +
      Q011 * y  * mx * z +
      Q111 * x  * y  * z
end

"""
    grid_interp(x, y, z, ix, iy, iz, field)

Interpolate a value at (x,y,z) in a field. `ix`,`iy` and `iz` are indexes for x, y and z
locations (0-based).
"""
grid_interp(x::T, y::T, z::T, ix::Int, iy::Int, iz::Int, field::AbstractArray{U,3}) where
   {T<:Real, U<:Number} =
   trilin_reg(x-ix, y-iy, z-iz,
      field[ix+1, iy+1, iz+1],
      field[ix+2, iy+1, iz+1],
      field[ix+1, iy+2, iz+1],
      field[ix+2, iy+2, iz+1],
      field[ix+1, iy+1, iz+2],
      field[ix+2, iy+1, iz+2],
      field[ix+1, iy+2, iz+2],
      field[ix+2, iy+2, iz+2]
   )

fname = "data.h5"

fid = h5open(fname, "r")

B = HDF5MemoryArray(fid["B"])

xGrid = range(-0.5, 0.5, length=size(B,1))
yGrid = range(-0.5, 0.5, length=size(B,2))
zGrid = range(-1, 1, length=size(B,3))

dx = xGrid[2] - xGrid[1]
dy = yGrid[2] - yGrid[1]
dz = zGrid[2] - zGrid[1]

locx = 0.0
locy = 0.0
locz = 0.0

x = (locx - xGrid[1]) / dx
y = (locy - yGrid[1]) / dy
z = (locz - zGrid[1]) / dz

# Find surrounding points (0-based indices)
ix = floor(Int, x)
iy = floor(Int, y)
iz = floor(Int, z)

grid_interp(x, y, z, ix, iy, iz, B)

Additional type declarations for Base.getindex are required for type stability.

juliohm · March 19, 2025, 6:12am

@henry2004y do you have an example HD5F file that you could share?

henry2004y · March 19, 2025, 2:10pm

Maybe a newly generated small HDF5 file can be used as an example?

using HDF5

h5open("example.h5", "w") do fid
   g = create_group(fid, "mygroup")
   dset = create_dataset(g, "myvector", Float32, (10,10,10))
   values = [Float32(i) for i in 1:10, j in 1:10, k in 1:10]
   write(dset, values)
end

fid = h5open("example.h5", "r")
g = fid["mygroup"]
# Assume the new struct and linear interpolation codes are loaded
B = g["myvector"] |> HDF5MemoryArray

xGrid = range(-0.5, 0.5, length=10)
yGrid = range(-0.5, 0.5, length=10)
zGrid = range(-1, 1, length=10)

dx = xGrid[2] - xGrid[1]
dy = yGrid[2] - yGrid[1]
dz = zGrid[2] - zGrid[1]

startx = 0.0
starty = 0.0
startz = 0.0

x = (startx - xGrid[1]) / dx
y = (starty - yGrid[1]) / dy
z = (startz - zGrid[1]) / dz

# Find surrounding points (0-based indices)
ix = floor(Int, x)
iy = floor(Int, y)
iz = floor(Int, z)

grid_interp(x, y, z, ix, iy, iz, bx) # returns 5.5

fabiangans · March 19, 2025, 4:08pm

Just for reference, depending on the use case, this might not be the most performant solution. An alternative AbstractArray implementation could use the DiskArrays.jl interface which will always try to access the array in large blocks instead of relying on random access to single points as the pure AbstractArray interface. The following example would create an interpolated HDF5-based array.

Create a small wrapper around HDF5 Dataset that implements the DiskArrays interface

using HDF5, DiskArrays
import DiskArrays: eachchunk, haschunks, readblock!, writeblock!, GridChunks, Chunked, Unchunked

#Implement the DiskArray interface for HDF5 as proposed in https://github.com/JuliaIO/HDF5.jl/issues/615
struct HDF5DiskArray{T,N,CS} <: AbstractDiskArray{T,N}
  ds::HDF5.Dataset
  cs::CS
end
Base.size(x::HDF5DiskArray) = size(x.ds)
haschunks(x::HDF5DiskArray{<:Any,<:Any,Nothing}) = Unchunked()
haschunks(x::HDF5DiskArray) = Chunked()
eachchunk(x::HDF5DiskArray{<:Any,<:Any,<:GridChunks}) = x.cs
readblock!(x::HDF5DiskArray, aout, r::AbstractUnitRange...) = aout .= x.ds[r...]
writeblock!(x::HDF5DiskArray, v, r::AbstractUnitRange...) = x.ds[r...] = v
function HDF5DiskArray(ds::HDF5.Dataset)
    cs = try
        GridChunks(ds, HDF5.get_chunk(ds))
    catch
        nothing
    end
    HDF5DiskArray{eltype(ds),ndims(ds),typeof(cs)}(ds,cs)
end

Create a Test dataset and wrap it

#Create example file
h5open("example.h5", "w") do fid
   g = create_group(fid, "mygroup")
   dset = create_dataset(g, "myvector", Float32, (10,10,10),chunk=(5,5,5))
   values = [Float32(i) for i in 1:10, j in 1:10, k in 1:10]
   write(dset, values)
end

fid = h5open("example.h5", "r")
da = HDF5DiskArray(fid["mygroup/myvector"])

Use InterpolatedDiskArray from DiskArrayTools to create an interpolated view into the array

using DiskArrayTools: InterpolatedDiskArray
using Interpolations
target_indices = (range(1,10,91),range(1,10,91),1:10)
newsize = length.(target_indices)
newchunks = GridChunks(newsize,(50,50,5))

di = InterpolatedDiskArray(da, newchunks, target_indices...,order=Quadratic())
#Compute sum over the last dimension, every chunk will only be accessed once, so this is quite efficient and works on arbitrary array sizes without running OOM
sum(di,dims=3)

If your workflow depends on random access to single locations into your dataset, this solution will bring no benefit. However, if you rather do some larger batch-processing processing chunks of data piece-by-piece it might be worth to look into the DiskArrays machinery.

juliohm · March 19, 2025, 4:17pm

Given any AbstractArray from the above comments, you can lazily georef it for interpolation with GeoStats.jl:

using GeoStats

data = georef((; B)) # B from the example above

grid = CartesianGrid(10, 10, 10) # domain of interpolation

data |> InterpolateNeighbors(grid, model = NN())

That will give you access to various interpolation models and various types of interpolation domain (point sets, grids, meshes). The result of InterpolateNeighbors is currently composed of non-lazy columns, but it should be easy to work on a hotfix to reproduce the exact input array type.

henry2004y · March 23, 2025, 1:24pm

Thanks for the detailed example! This indeed requires some understanding of the internals of DiskArrays.jl. I will compare the performance with the first naive example next.

When I searched for related packages, I noticed a higher level YAXArrays.jl that’s built upon DimensionalData.jl and DiskArrays.jl. Maybe this task can be directly accomplished by taking advantage of YAXArrays.jl?

Topic		Replies	Views
Proposal: working with larger than memory data in hdf5 format using HDF5Arrays (implementation of DiskArrays.jl for HDF5) Data hdf5	11	1716	November 4, 2020
Faster/performant interpolation of very big arrays Performance	0	237	June 7, 2023
Lookup table interpolation of an array New to Julia question	7	475	May 5, 2024
Matrix interpolations New to Julia	1	1392	December 5, 2019
Interpolation with matrices General Usage package , interpolations	13	1393	October 28, 2020

Interpolate over large HDF5 arrays

Create a small wrapper around HDF5 Dataset that implements the DiskArrays interface

Create a Test dataset and wrap it

Use InterpolatedDiskArray from DiskArrayTools to create an interpolated view into the array

Related topics