Interpolate over large HDF5 arrays

Hi,

I have a extremely large standard HDF5 file (2 TB) which saves a 3D array, let’s say B. With HDF5.jl, I can access individual elements of the array in the same way as this thread shows: load-hdf5-file-larger-than-memory.

using HDF5

fid = h5open(fname, "r")
B = fid["B"] # size(B) = (10000, 10000, 4992), eltype(B) = Float32
B[10,10, 1] # this works

However, when I tried to perform interpolations over B using Interpolations.jl, I got

using Interpolations

itp = extrapolate(interpolate(B, BSpline(Linear())), NaN)
ERROR: MethodError: no method matching interpolate(::HDF5.Dataset, ::BSpline{Linear{Throw{OnGrid}}})
The function `interpolate` exists, but no method is defined for this combination of argument types.

Closest candidates are:
  interpolate(::AbstractArray, ::IT, ::GT) where {IT<:Union{NoInterp, Tuple{Vararg{Union{NoInterp, BSpline}}}, BSpline}, GT<:Union{NoInterp, Tuple{Vararg{Union{NoInterp, Interpolations.GridType}}}, Interpolations.GridType}}
   @ Interpolations ~/.julia/packages/Interpolations/91PhN/src/deprecations.jl:19
  interpolate(::AbstractArray, ::IT) where IT<:Union{NoInterp, Tuple{Vararg{Union{NoInterp, BSpline}}}, BSpline}
   @ Interpolations ~/.julia/packages/Interpolations/91PhN/src/b-splines/b-splines.jl:190
  interpolate(::AbstractArray, ::IT, ::Real, ::Int64) where IT<:Union{NoInterp, Tuple{Vararg{Union{NoInterp, BSpline}}}, BSpline}
   @ Interpolations ~/.julia/packages/Interpolations/91PhN/src/b-splines/b-splines.jl:217

This means that Interpolations.jl expects an AbstractArray as input, but HDF5.Dataset is not an AbstractArray. This array is so large that I cannot possibly read it into memory as a whole. Is it possible to interpolate over such large array?

Sounds like HDF5 memory mapping would be good here.

Yes, but unfortunately when I checked with HDF5.ismmappable, it returned false. I don’t know why, but probably it’s compressed.

I see, it would make sense to compress such a big data set of course.

What should probably still work is to simply create a wrapper that subtypes AbstractArray{T,N}.
The following seems to work on first glance, though I am not sure if Interpolations does not allocate the full array under the hood somehow. The reason I am suspecting something like that is because I can still access the interpolation after deleting the file.

using HDF5, Interpolations

struct HDF5MemoryArray{T,N} <: AbstractArray{T,N}
    data::HDF5.Dataset
    function HDF5MemoryArray(data::HDF5.Dataset)
        T = eltype(data)
        N = ndims(data)
        new{T,N}(data)
    end
end
Base.size(a::HDF5MemoryArray) = size(a.data)
Base.getindex(a::HDF5MemoryArray, i::Vararg{Int,N}) where N = a.data[i...]
Base.getindex(a::HDF5MemoryArray, i::Int) = a.data[i]

itp = h5open("a.h5","r") do file
        
    B = HDF5MemoryArray(file["data"])
        
    itp = extrapolate(interpolate(B, BSpline(Linear())), NaN)
end

But maybe this is a starting point

Edit:looks like the information is stored in itp.itp.coefs. I think this is more likely an issue with interpolations not being able to use a lazy interpolation here…

I’ll try! Meanwhile, I found this issue in the HDF5.jl repository: Dataset as an AbstractArray. What is stopping this from being implemented?

At first glance at the github issue it looks like its more about technicalities, i.e. avoiding breaking changes. As far as i can tell there should be no issue in creating your own wrapper until HDF5.Dataset is an AbstractArray

I quickly tried this on the large HDF5 array. As you mentioned, this might be a problem now for Interpolations.jl:

julia> itp = extrapolate(interpolate(B, BSpline(Linear())), NaN)
ERROR: OutOfMemoryError()
Stacktrace:
  [1] GenericMemory
    @ ./boot.jl:516 [inlined]
  [2] new_as_memoryref
    @ ./boot.jl:535 [inlined]
  [3] Array
    @ ./boot.jl:585 [inlined]
  [4] Array
    @ ./boot.jl:593 [inlined]
  [5] Array
    @ ./boot.jl:599 [inlined]
  [6] padded_similar(::Type{Float32}, inds::Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}, Base.OneTo{Int64}})
    @ Interpolations ~/.julia/packages/Interpolations/91PhN/src/b-splines/prefiltering.jl:5
  [7] copy_with_padding(::Type{Float32}, A::HDF5MemoryArray{Float32, 3}, it::BSpline{Linear{Throw{OnGrid}}})
    @ Interpolations ~/.julia/packages/Interpolations/91PhN/src/b-splines/prefiltering.jl:23
  [8] prefilter(::Type{Float32}, ::Type{Float32}, A::HDF5MemoryArray{Float32, 3}, it::BSpline{Linear{Throw{OnGrid}}})
    @ Interpolations ~/.julia/packages/Interpolations/91PhN/src/b-splines/prefiltering.jl:39
  [9] interpolate(::Type{Float32}, ::Type{Float32}, A::HDF5MemoryArray{Float32, 3}, it::BSpline{Linear{Throw{OnGrid}}})
    @ Interpolations ~/.julia/packages/Interpolations/91PhN/src/b-splines/b-splines.jl:167
 [10] interpolate(A::HDF5MemoryArray{Float32, 3}, it::BSpline{Linear{Throw{OnGrid}}})
    @ Interpolations ~/.julia/packages/Interpolations/91PhN/src/b-splines/b-splines.jl:191
 [11] top-level scope
    @ REPL[13]:1

Maybe I should try some other interpolation libraries?

Or you could implement it yourself. Linear interpolation on a regular grid is just a few lines of code.

2 Likes

Following the suggestions, I quickly came up the following code for handling linear interpolations:

using HDF5

struct HDF5MemoryArray{T,N} <: AbstractArray{T,N}
   data::HDF5.Dataset
   function HDF5MemoryArray(data::HDF5.Dataset)
      T = eltype(data)
      N = ndims(data)
      new{T,N}(data)
   end
end
Base.size(a::HDF5MemoryArray) = size(a.data)
Base.getindex(a::HDF5MemoryArray{T,N}, i::Vararg{Int,N}) where {T, N} = a.data[i...]::T
Base.getindex(a::HDF5MemoryArray{T,N}, i::Int) where {T, N} = a.data[i]::T


"""
    trilin_reg(x, y, z, Q)

Trilinear interpolation for x1,y1,z1=(0,0,0) and x2,y2,z2=(1,1,1)
Q's are surrounding points such that Q000 = F[0,0,0], Q100 = F[1,0,0], etc.
"""
function trilin_reg(x::T, y::T, z::T, Q000, Q100, Q010, Q110, Q001, Q101, Q011, Q111) where {T<:Real}
   oneT = one(T)
   mx = oneT - x
   my = oneT - y
   mz = oneT - z
   fout =
      Q000 * mx * my * mz +
      Q100 * x  * my * mz +
      Q010 * y  * mx * mz +
      Q110 * x  * y  * mz +
      Q001 * mx * my * z +
      Q101 * x  * my * z +
      Q011 * y  * mx * z +
      Q111 * x  * y  * z
end

"""
    grid_interp(x, y, z, ix, iy, iz, field)

Interpolate a value at (x,y,z) in a field. `ix`,`iy` and `iz` are indexes for x, y and z
locations (0-based).
"""
grid_interp(x::T, y::T, z::T, ix::Int, iy::Int, iz::Int, field::AbstractArray{U,3}) where
   {T<:Real, U<:Number} =
   trilin_reg(x-ix, y-iy, z-iz,
      field[ix+1, iy+1, iz+1],
      field[ix+2, iy+1, iz+1],
      field[ix+1, iy+2, iz+1],
      field[ix+2, iy+2, iz+1],
      field[ix+1, iy+1, iz+2],
      field[ix+2, iy+1, iz+2],
      field[ix+1, iy+2, iz+2],
      field[ix+2, iy+2, iz+2]
   )

fname = "data.h5"

fid = h5open(fname, "r")

B = HDF5MemoryArray(fid["B"])

xGrid = range(-0.5, 0.5, length=size(B,1))
yGrid = range(-0.5, 0.5, length=size(B,2))
zGrid = range(-1, 1, length=size(B,3))

dx = xGrid[2] - xGrid[1]
dy = yGrid[2] - yGrid[1]
dz = zGrid[2] - zGrid[1]

locx = 0.0
locy = 0.0
locz = 0.0

x = (locx - xGrid[1]) / dx
y = (locy - yGrid[1]) / dy
z = (locz - zGrid[1]) / dz

# Find surrounding points (0-based indices)
ix = floor(Int, x)
iy = floor(Int, y)
iz = floor(Int, z)

grid_interp(x, y, z, ix, iy, iz, B)

Additional type declarations for Base.getindex are required for type stability.

@henry2004y do you have an example HD5F file that you could share?

Maybe a newly generated small HDF5 file can be used as an example?

using HDF5

h5open("example.h5", "w") do fid
   g = create_group(fid, "mygroup")
   dset = create_dataset(g, "myvector", Float32, (10,10,10))
   values = [Float32(i) for i in 1:10, j in 1:10, k in 1:10]
   write(dset, values)
end

fid = h5open("example.h5", "r")
g = fid["mygroup"]
# Assume the new struct and linear interpolation codes are loaded
B = g["myvector"] |> HDF5MemoryArray

xGrid = range(-0.5, 0.5, length=10)
yGrid = range(-0.5, 0.5, length=10)
zGrid = range(-1, 1, length=10)

dx = xGrid[2] - xGrid[1]
dy = yGrid[2] - yGrid[1]
dz = zGrid[2] - zGrid[1]

startx = 0.0
starty = 0.0
startz = 0.0

x = (startx - xGrid[1]) / dx
y = (starty - yGrid[1]) / dy
z = (startz - zGrid[1]) / dz

# Find surrounding points (0-based indices)
ix = floor(Int, x)
iy = floor(Int, y)
iz = floor(Int, z)

grid_interp(x, y, z, ix, iy, iz, bx) # returns 5.5
1 Like

Just for reference, depending on the use case, this might not be the most performant solution. An alternative AbstractArray implementation could use the DiskArrays.jl interface which will always try to access the array in large blocks instead of relying on random access to single points as the pure AbstractArray interface. The following example would create an interpolated HDF5-based array.

Create a small wrapper around HDF5 Dataset that implements the DiskArrays interface

using HDF5, DiskArrays
import DiskArrays: eachchunk, haschunks, readblock!, writeblock!, GridChunks, Chunked, Unchunked

#Implement the DiskArray interface for HDF5 as proposed in https://github.com/JuliaIO/HDF5.jl/issues/615
struct HDF5DiskArray{T,N,CS} <: AbstractDiskArray{T,N}
  ds::HDF5.Dataset
  cs::CS
end
Base.size(x::HDF5DiskArray) = size(x.ds)
haschunks(x::HDF5DiskArray{<:Any,<:Any,Nothing}) = Unchunked()
haschunks(x::HDF5DiskArray) = Chunked()
eachchunk(x::HDF5DiskArray{<:Any,<:Any,<:GridChunks}) = x.cs
readblock!(x::HDF5DiskArray, aout, r::AbstractUnitRange...) = aout .= x.ds[r...]
writeblock!(x::HDF5DiskArray, v, r::AbstractUnitRange...) = x.ds[r...] = v
function HDF5DiskArray(ds::HDF5.Dataset)
    cs = try
        GridChunks(ds, HDF5.get_chunk(ds))
    catch
        nothing
    end
    HDF5DiskArray{eltype(ds),ndims(ds),typeof(cs)}(ds,cs)
end

Create a Test dataset and wrap it

#Create example file
h5open("example.h5", "w") do fid
   g = create_group(fid, "mygroup")
   dset = create_dataset(g, "myvector", Float32, (10,10,10),chunk=(5,5,5))
   values = [Float32(i) for i in 1:10, j in 1:10, k in 1:10]
   write(dset, values)
end

fid = h5open("example.h5", "r")
da = HDF5DiskArray(fid["mygroup/myvector"])

Use InterpolatedDiskArray from DiskArrayTools to create an interpolated view into the array

using DiskArrayTools: InterpolatedDiskArray
using Interpolations
target_indices = (range(1,10,91),range(1,10,91),1:10)
newsize = length.(target_indices)
newchunks = GridChunks(newsize,(50,50,5))

di = InterpolatedDiskArray(da, newchunks, target_indices...,order=Quadratic())
#Compute sum over the last dimension, every chunk will only be accessed once, so this is quite efficient and works on arbitrary array sizes without running OOM
sum(di,dims=3)

If your workflow depends on random access to single locations into your dataset, this solution will bring no benefit. However, if you rather do some larger batch-processing processing chunks of data piece-by-piece it might be worth to look into the DiskArrays machinery.

4 Likes

Given any AbstractArray from the above comments, you can lazily georef it for interpolation with GeoStats.jl:

using GeoStats

data = georef((; B)) # B from the example above

grid = CartesianGrid(10, 10, 10) # domain of interpolation

data |> InterpolateNeighbors(grid, model = NN())

That will give you access to various interpolation models and various types of interpolation domain (point sets, grids, meshes). The result of InterpolateNeighbors is currently composed of non-lazy columns, but it should be easy to work on a hotfix to reproduce the exact input array type.

1 Like

Thanks for the detailed example! This indeed requires some understanding of the internals of DiskArrays.jl. I will compare the performance with the first naive example next.

When I searched for related packages, I noticed a higher level YAXArrays.jl that’s built upon DimensionalData.jl and DiskArrays.jl. Maybe this task can be directly accomplished by taking advantage of YAXArrays.jl?