How to modify an HDF5 dataset

I’m trying to write a program to read in an HDF5 file with a lot of complicated metadata, do some processing on a large dataset, and write the metadata and modified dataset back out to a new file. I want to avoid explicitly copying each piece of metadata or implementing some generic thing to copy each object (except the interesting dataset) one by one.

My first try was to copy the file, delete the dataset, and write a new one, like so:

using HDF5

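function update_dataset1(src, dest, dataset)
    cp(src, dest, follow_symlinks=true, force=true)
    # The do-block closes the file even if an error is thrown
    h5open(dest, "r+") do f
        d = f[dataset]
        newd = zeros(eltype(d), size(d)...)
        o_delete(d)              # removes the link only; the space is not reclaimed
        write(f, dataset, newd)  # writes a second copy under the same name
    end
end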

But this roughly doubles the file size, because o_delete calls H5Ldelete, which only removes the link to the dataset. The object itself stays in the file and becomes unreachable. I could write to a temporary file and shell out to h5repack, I guess. I also tried the following:

function update_dataset2(src, dest, dataset)
    cp(src, dest, follow_symlinks=true, force=true)
    h5open(dest, "r+") do f
        d = f[dataset]
        newd = zeros(eltype(d), size(d)...)
        d .= newd  # broadcast assignment; this is the line that throws
    end
end

But this fails with a MethodError: no method of copyto! matches the argument types. I can actually do this in Python, like so:

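import shutil
import h5py
import numpy as np

def update_dataset3(src, dest, dataset):
    shutil.copy(src, dest)
    with h5py.File(dest, "r+") as f:  # closes the file on exit
        d = f[dataset]
        newd = np.zeros_like(np.asarray(d))
        d[:,:] = newd  # overwrite the dataset in place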

so I’m wondering if there’s just some interface in HDF5.jl that I’m missing.


I noticed in HDF5.jl that setindex! is defined, and so I tried

function update_dataset4(src, dest, dataset)
    cp(src, dest, follow_symlinks=true, force=true)
    h5open(dest, "r+") do f
        d = f[dataset]
        newd = zeros(eltype(d), size(d)...)
        d[:,:] = newd  # indexed assignment calls setindex! and overwrites in place
    end
end

and it does what I want.

So now I’m just wondering what’s the difference between x[:] = y and x .= y. I think the first calls setindex! while the other calls copyto!, but I don’t really understand why these are distinct methods.
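To make the distinction concrete, here is a minimal sketch in plain Julia that does not use HDF5.jl. DiskLike is a hypothetical stand-in for a dataset handle: it is not an AbstractArray and defines only setindex!. Indexed assignment works on it, while broadcast assignment throws, because .= goes through the broadcasting machinery and expects array-like methods on the destination. Which method is reported as missing depends on what the type defines, so the error here need not match the one HDF5.jl gives.

```julia
# Hypothetical stand-in for an on-disk dataset: it is NOT an AbstractArray
# and implements nothing but indexed assignment.
struct DiskLike
    data::Matrix{Float64}
end

# `d[I...] = v` lowers to `setindex!(d, v, I...)`.
Base.setindex!(d::DiskLike, v, I...) = setindex!(d.data, v, I...)

d = DiskLike(ones(2, 3))
newd = zeros(2, 3)

# Indexed assignment works because setindex! is defined for DiskLike.
d[:, :] = newd

# Broadcast assignment: `d .= newd` lowers to
# `Base.materialize!(d, Base.broadcasted(identity, newd))`, which needs
# array-like methods on the destination that DiskLike does not provide.
broadcast_failed = try
    d .= newd
    false
catch
    true
end
```

So a type can support x[:] = y without supporting x .= y: the first needs only a setindex! method, while the second needs the type to take part in broadcasting.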

Edit: Also, is there a shorthand for “all indices of all dimensions” like the ... in numpy?