HDF5.jl and YAXArrays.jl integration

Hi, I am working on a script to create a YAXArrays dataset from an HDF5 file. The idea is to iterate over the datasets in a selected group and combine them into a single dataset.

The first issue I encountered is the inability to lazily load the data. I couldn’t find any documentation on this, so I am currently loading all the data into memory.

The second issue is that I am unable to convert the dictionary of YAXArrays I created into a Dataset.

Please see the code below. I would greatly appreciate any help with these issues.

using HDF5
using YAXArrays

# Function to infer dimensions and create YAXArray
function create_yaxarray(group_path, var_name, dataset)
    # Infer dimension names from attributes
    all_attrs = Dict(attrs(dataset))
    # `get` returns `nothing` when the "DimensionNames" attribute is absent
    dim_names = get(all_attrs, "DimensionNames", nothing)
    if dim_names === nothing
        println("Warning: No dimensions attribute for $var_name, using default.")
        dim_names = ["dim_$i" for i in 1:ndims(dataset)]  # Default dimension names
    else
        dim_names = split(dim_names, ",")  # Split comma-separated dimensions
    end
    
    # Create YAXArray
    axlist = Tuple(
                Dim{Symbol(dim_name)}(collect(1:size(dataset)[dim_i]))
                for (dim_i,dim_name) in enumerate(dim_names))  
    data = read(dataset)
    all_attrs["name"] = var_name # Name of the variable
    all_attrs["source"] = group_path # Source group path
    return YAXArray(
            axlist,  # Tuple of Dim axes, one per dimension
            data,  # The dataset, fully loaded into memory by read
            Dict(all_attrs)  # Properties, including name and source group path
        )
end

function load_dataset(filename::String, group::String = "/ScienceData/Geo")
    # Open the HDF5 file
    fid = h5open(filename, "r")
    datasets = Dict{Symbol, YAXArray}()       
    try          
        # Iterate over Geo group
        for var_name in keys(fid[group])
            # Index through the group: HDF5 paths always use "/", unlike joinpath on Windows
            dataset = fid[group][var_name]
            datasets[Symbol(var_name)] = create_yaxarray(group, var_name, dataset)     
        end        
    finally
        close(fid)  # Ensure the file is closed
    end

    return datasets
end

When I execute the code, it creates a dictionary of YAXArrays as expected. However, when I try to convert this dictionary into a Dataset:

data_dict = load_dataset(filename, "/ScienceData/Geo")
ds = Dataset(;properties = Dict{String,Any}(), data_dict)

I get the following error message:

ERROR: MethodError: no method matching iterate(::Nothing)

Closest candidates are:
iterate(::LibGit2.GitConfigIter)
@ LibGit2 /cm/shared/apps/julia/1.10.6/share/julia/stdlib/v1.10/LibGit2/src/config.jl:225
iterate(::LibGit2.GitConfigIter, ::Any)
@ LibGit2 /cm/shared/apps/julia/1.10.6/share/julia/stdlib/v1.10/LibGit2/src/config.jl:225
iterate(::LaTeXStrings.LaTeXString, ::Int64)
@ LaTeXStrings ~/.julia/packages/LaTeXStrings/6NrIG/src/LaTeXStrings.jl:108

Stacktrace:
[1] foreach(f::YAXArrays.Datasets.var"#3#6", itr::Nothing)
@ Base ./abstractarray.jl:3098
[2] (::YAXArrays.Datasets.var"#2#5")(c::Dict{Symbol, YAXArray})
@ YAXArrays.Datasets ~/.julia/packages/YAXArrays/ppMtD/src/DatasetAPI/Datasets.jl:37
[3] foreach(f::YAXArrays.Datasets.var"#2#5", itr::@NamedTuple{data_dict::Dict{Symbol, YAXArray}})
@ Base ./abstractarray.jl:3098
[4] Dataset(; properties::Dict{String, Any}, cubes::@Kwargs{data_dict::Dict{Symbol, YAXArray}})
@ YAXArrays.Datasets ~/.julia/packages/YAXArrays/ppMtD/src/DatasetAPI/Datasets.jl:35
[5] top-level scope
@ REPL[33]:1

I would appreciate any help on these two issues.

The very first example on “Reading and writing data” in the HDF5.jl manual explains how to read a subset/slice of a dataset into memory. Is that what you’re looking for?
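As a rough sketch of what that looks like: indexing an HDF5.Dataset handle with ranges reads only the requested block from disk, while `read(dataset)` loads everything. The file, the group name "Geo", and the dataset name "demo" below are made up for illustration; substitute your own path.

```julia
using HDF5

# Write a small throwaway file so the example is self-contained.
# "Geo" and "demo" are hypothetical names, not taken from your file.
path = joinpath(mktempdir(), "example.h5")
h5open(path, "w") do fid
    g = create_group(fid, "Geo")
    g["demo"] = reshape(collect(1.0:12.0), 3, 4)
end

# Reopen read-only and pull out a slice instead of the full array.
slice = h5open(path, "r") do fid
    dset = fid["Geo/demo"]   # HDF5.Dataset handle; no data is read yet
    dset[1:2, :]             # reads only rows 1:2 into memory
end
```

Note that this is partial reading rather than true lazy loading: the slice is an ordinary in-memory Array, and the handle is only valid while the file is open.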

Regarding lazy loading, there are some snippets showing how an HDF5 DiskArray could be defined here: Support for DiskArrays · Issue #615 · JuliaIO/HDF5.jl · GitHub.
I am not sure how outdated this code is.
See also the current discussion about implementing the DiskArray interface: HDF5.Dataset as an AbstractArray · Issue #930 · JuliaIO/HDF5.jl · GitHub

To construct the Dataset, you have to splat the contents of your dictionary, so replace your last line with:

ds = Dataset(;properties = Dict{String,Any}(), data_dict...)

see: Frequently Asked Questions (FAQ) | YAXArrays.jl
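The stack trace shows why: `cubes::@Kwargs{data_dict::Dict{Symbol, YAXArray}}` means the whole dictionary arrived as one keyword named `data_dict`. Here is a small sketch of the difference in plain Julia. `collect_cubes` is a hypothetical stand-in for a keyword-collecting constructor, not the actual YAXArrays code.

```julia
# Hypothetical stand-in: gathers all extra keyword arguments, like Dataset does with cubes.
function collect_cubes(; properties = Dict{String,Any}(), cubes...)
    return collect(keys(cubes))
end

data_dict = Dict(:lat => [1.0, 2.0], :lon => [3.0, 4.0])

# Without splatting: a single keyword :data_dict whose value is the whole Dict.
no_splat = collect_cubes(; data_dict)

# With splatting: every Dict entry becomes its own keyword argument.
with_splat = sort(collect_cubes(; data_dict...))
```

Here `no_splat` contains only `:data_dict`, while `with_splat` contains `:lat` and `:lon`. Splatting into keywords requires the dictionary keys to be Symbols, which your `Dict{Symbol, YAXArray}` already satisfies.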
