JLD2: How to update existing dataset array on disk?

I am trying to update parts of an array stored inside a .jld2 file. However, all the changes that I make are not actually stored to disk.
As an MWE, imagine doing

using JLD2

a = zeros(100,100)
jldsave("test.jld2"; a=a)

jldopen("test.jld2", "r+") do file
     file["a"][:,1] .= 1
end

After that, I expect that loading the file again will give me an updated array “a” with the first column replaced by ones. Instead, I get an array of all zeros.

How to update an existing array in a .jld2 file?

file["a"][:,1] .= 1 is lowered so that file["a"] first reads the dataset from the file into an array, and the in-place broadcast then runs over that array, not the file. Indexed assignment or in-place broadcasting into the file itself would have to go through a single pair of brackets, so the only possible syntax would be file["a", :, 1] .= 1, which appears to be unsupported because the docs don't mention it. I'm not even sure that syntax is feasible to implement. Maybe you could replace file["a"][:,1] .= 1 with:

tempa = file["a"] # read file and allocate temporary array
tempa[:,1] .= 1   # update the temporary array
file["a"] = tempa # write temporary array to file
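To illustrate why the chained indexing has no effect, here is a minimal sketch that does not use JLD2 at all. CopyStore is a hypothetical stand-in whose getindex returns a fresh copy, on the assumption that reading a dataset from disk behaves similarly by allocating a new array. It is not how JLD2 is implemented internally.

```julia
# Hypothetical stand-in for a file-backed store (not a JLD2 type):
# getindex hands back a fresh copy, as reading from disk would.
struct CopyStore
    data::Dict{String, Matrix{Float64}}
end
Base.getindex(s::CopyStore, key::String) = copy(s.data[key])
Base.setindex!(s::CopyStore, value, key::String) = (s.data[key] = copy(value); s)

store = CopyStore(Dict("a" => zeros(3, 3)))

# The broadcast mutates the temporary copy returned by store["a"];
# nothing is written back to the store.
store["a"][:, 1] .= 1
unchanged = all(store["a"] .== 0)

# Read, modify, then write back explicitly: now the store sees the change.
tempa = store["a"]
tempa[:, 1] .= 1
store["a"] = tempa
changed = all(store["a"][:, 1] .== 1)
```

Note that this stand-in's setindex! overwrites an existing key freely, which a real file format may not allow.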

This gives an error saying that the dataset named “a” already exists.

Looked into that error a bit and it doesn’t look great: overwriting existing data · Issue #450 · JuliaIO/JLD2.jl · GitHub. It’s a short enough read, but the takeaway:

It is already possible to delete(f, "dataset") .
This will delete your reference to the data and you can write a new Dataset with the same name. Note, that the data itself remains in the “gap”. So, if you do that many times, the file will grow.

I presume that is a typo and that, in your case, it should be delete!(file, "a") before file["a"] = tempa. That the file keeps the obsolete data is concerning, especially given how large arrays can get. If I'm reading it right, this could be resolved by Allow reading/modifying arrays with Mmap · Issue #235 · JuliaIO/JLD2.jl · GitHub, but #450 might be alluding to something else. It also mentions that editing files generally risks corruption from program crashes or file system failures outside of your control, which you don't have to worry about with arrays in RAM.
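Putting the pieces together, the delete-then-rewrite workaround might look like the sketch below. It assumes a JLD2 version in which delete!(file, "a") is available on a file opened with "r+", as issue #450 suggests. The scratch path and the name updated are only for illustration. Per the issue, the old data stays in the file as a gap, so repeating this many times will grow the file.

```julia
using JLD2

# Scratch file so the example does not clobber anything (illustrative path).
path = joinpath(mktempdir(), "test.jld2")
jldsave(path; a=zeros(100, 100))

jldopen(path, "r+") do file
    tempa = file["a"]    # read the dataset into a temporary array
    tempa[:, 1] .= 1     # update the temporary array in memory
    delete!(file, "a")   # drop the existing reference named "a"
    file["a"] = tempa    # write a new dataset under the same name
end

# Reopen read-only to check what is actually stored on disk.
updated = jldopen(path, "r") do file
    file["a"]
end
```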

To be honest, it all looks very strange to me.
If I use HDF5 instead, the same update works fine (if I replace broadcasting with assignment). I mean, if I do

using HDF5
h5open("test.jld2", "r+") do file
     file["a"][:,1] = ones(size(a,1))
end

on the same file created by JLD2, the array is updated without any problems.
Given that JLD2 essentially implements a subset of HDF5, it seems like a bug.

IIRC, JLD2.jl does include HDF5 compatibility as a goal but not support of HDF5 in its entirety. Also bear in mind that HDF5.jl is a wrapper of the reference implementation of HDF5 in C, and what file["a"] and thus file["a"][:, 1] = ... does is part of the wrapping, not the HDF5 model. JLD2.jl is not related to HDF5.jl, so how they do things can differ.