Is there a way to load only a part of a dataset in a JLD2 file?

Hi, the question is pretty much the title. I ran a largish computation on a cluster with lots of RAM so it could fit its results (the largest of which is an 8-dimensional array of vectors of vectors, which is probably not advised, but these genuinely have different lengths), which were then saved into a JLD2 file. I've now copied the file to my computer but can't load the data because it doesn't all fit into my RAM. I was hoping to load only slices/individual cells at a time, but I can't find a way to do this. Is there any way this can be done?

There isn’t any ergonomic way to do this.

My recommendation would be to change the way you store the data: split the nested vectors, at least partially, into nested JLD2 groups and separate datasets.
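As a minimal sketch of what I mean (names and shapes are made up for illustration): writing each inner vector to its own dataset path lets you later open the file and load any one piece without reading the rest.

```julia
using JLD2

# Hypothetical layout: one dataset per parameter combination, nested
# under groups. Intermediate groups are created implicitly by the path.
jldopen("results.jld2", "w") do f
    for i in 1:2, j in 1:2
        f["results/$i/$j"] = rand(3)   # stand-in for your result vector
    end
end

# Later, load just one small piece; nothing else is read into memory:
v = jldopen("results.jld2", "r") do f
    f["results/1/2"]
end
```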

If you really want to read the current file you have (and assuming it's just arrays of numbers), I'd recommend using HDF5.jl directly to open the file and manually dereference the nested arrays.

julia> using JLD2, HDF5

julia> data = 
        [ [ [1,2], [3,4] ], [ [5,6], [7,8] ] ]
2-element Vector{Vector{Vector{Int64}}}:
 [[1, 2], [3, 4]]
 [[5, 6], [7, 8]]

julia> jldsave("test.jld2"; data)

julia> f = h5open("test.jld2")
🗂️ HDF5.File: (read-only) test.jld2
├─ 📂 _types
│  └─ 📄 00000001
│     ├─ 🏷️ julia_type
└─ 🔢 data
   ├─ 🏷️ julia_type

julia> d = f["data"]
🔢 HDF5.Dataset: /data (file: test.jld2 xfer_mode: 0)
├─ 🏷️ julia_type

julia> d = read(f, "data")
2-element Vector{HDF5.Reference}:
 HDF5.Reference(HDF5.API.hobj_ref_t(0x0000000000001330))
 HDF5.Reference(HDF5.API.hobj_ref_t(0x00000000000014a8))

julia> ref = d[2]
HDF5.Reference(HDF5.API.hobj_ref_t(0x00000000000014a8))

julia> vec1 = read(f[ref])
2-element Vector{HDF5.Reference}:
 HDF5.Reference(HDF5.API.hobj_ref_t(0x0000000000001540))
 HDF5.Reference(HDF5.API.hobj_ref_t(0x00000000000015b0))

julia> ref2 = vec1[1]
HDF5.Reference(HDF5.API.hobj_ref_t(0x0000000000001540))

julia> read(f[ref2])
2-element Vector{Int64}:
 5
 6

I see, that does make sense. Thank you for the fast answer, I will do the HDF5 de-referencing for my current data.

For future runs or other systems, I wonder if you have any tips on how to restructure the data. The data here comes from a parameter scan over 8 parameters (hence the 8-dimensional arrays), and for each parameter set I get a variable number of results (essentially polynomial roots; their number genuinely varies, though I do have an upper bound). I can't think of a convenient layout beyond vectors. The only more computationally suited alternative I can think of is a fixed-length (upper-bound) tuple or SVector of results together with a "number of solutions" variable, but that seems quite unwieldy.

I'd suggest enumerating your parameters using e.g. DrWatson.jl with dict_list.

Then you can save your individual result vectors in JLD2 as
/results/000001, /results/000002, …
and store the parameters in a separate list / dataset.
That way you can load the (small) parameter list first and figure out which result dataset you need.
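A rough sketch of that scheme, assuming DrWatson's dict_list (parameter names, the zero-padding width, and the stand-in "result" are all placeholders):

```julia
using DrWatson, JLD2

# Enumerate all parameter combinations as a vector of Dicts.
params = dict_list(Dict(:a => [0.1, 0.2], :b => [1, 2]))  # 4 combinations

jldopen("scan.jld2", "w") do f
    f["parameters"] = params               # small: load this first
    for (i, p) in enumerate(params)
        result = rand(rand(1:5))           # stand-in for your root finder
        f["results/" * lpad(i, 6, '0')] = result
    end
end

# On the laptop: read only the parameter list, find the index you
# care about, then load just that one result dataset.
plist = jldopen(f -> f["parameters"], "scan.jld2", "r")
idx   = findfirst(p -> p[:a] == 0.2 && p[:b] == 1, plist)
res   = jldopen(f -> f["results/" * lpad(idx, 6, '0')], "scan.jld2", "r")
```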