A future for JLD2?

I’d say that bindings for good libraries in other languages are always welcome, but this is a bit tangential to the current discussion, where many users advocate for a good, stable pure-Julia solution for serialisation.

I have tried to use Serialization with small but relatively complex parametric structs between an Ubuntu 18.04 machine and a recent macOS one. It worked on Julia 1.3, but upgrading both to 1.4 broke it.

Edit: IMO stabilizing Serialization would be a great solution to this problem. I would also be happy to help with it.
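
For readers who have not used the Serialization standard library, here is a minimal sketch of the kind of round trip described above. The `Measurement` struct is a hypothetical stand-in, not the struct from the post. The sketch only shows a round trip within a single Julia session; the Serialization docs do not promise that a file written by one Julia version can be read by another.

```julia
using Serialization

# Hypothetical parametric struct, standing in for the "small but
# relatively complex" structs mentioned above.
struct Measurement{T<:Real,L}
    value::T
    labels::L
end

m = Measurement(1.5f0, ("run-1", :calibrated))

path = tempname()
serialize(path, m)             # written by the running Julia version
restored = deserialize(path)   # read back in the same session

# Compare type and fields explicitly rather than relying on a custom `==`.
roundtrip_ok = typeof(restored) == typeof(m) &&
               restored.value == m.value &&
               restored.labels == m.labels
```

Reading the same file from a different Julia minor version is exactly the step that may fail.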

NaNs can carry a “payload”. That is: there are about 16.8 million different 32-bit values that are all interpreted as NaN. (Even more for 64-bit numbers.)
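
A small sketch of what this means in Julia, using only Base functions. The two bit patterns are picked for illustration; any Float32 with all exponent bits set and a nonzero mantissa would do.

```julia
# Two Float32 bit patterns with all exponent bits set and a nonzero mantissa.
quiet_nan   = reinterpret(Float32, 0x7fc00000)  # a quiet NaN
payload_nan = reinterpret(Float32, 0x7fc00001)  # same, with one payload bit set

both_nan  = isnan(quiet_nan) && isnan(payload_nan)
same_bits = reinterpret(UInt32, quiet_nan) == reinterpret(UInt32, payload_nan)
egal      = quiet_nan === payload_nan         # === compares floats bit by bit
equal     = isequal(quiet_nan, payload_nan)   # isequal treats all NaNs as equal

# Distinct Float32 NaN bit patterns: 2 signs times (2^23 - 1) nonzero mantissas.
nan_count = 2 * (2^23 - 1)
```

So a check that a file format round-trips floats bit for bit would have to use `===` (or compare the reinterpreted integers), since `isequal` cannot tell payloads apart.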

Thank you, these are all interesting discussions, but by now the conversation has drifted significantly away from what I wanted to talk about in this thread.

To restate the question:
Are people already using JLD2, or would they use it if it were maintained again?
Are you able and willing to help?
I would also particularly ask for feedback from JuliaIO people.

Thanks for working on JLD2. If it’s actually still faster (or can be), then it seems it might be worthwhile. I’m just not sure it is faster than JLD when JLD uses its (non-default) Blosc compression.

I only have modest use cases (so far), so I don’t care either way, but I looked a bit into both. I noticed that JLD2 gives me the same file regardless of the compression option:

julia> using JLD2
julia> A = []
julia> file = jldopen("mydata_c.jld2", "w", compress=true)
julia> file["A"] = A  # with JLD I could use `JLD.@write file A` (the unqualified form shown in the docs did not work, so I filed an issue)
julia> close(file)

For JLD the files differ, though they have the same size (which I guess could not happen for a non-empty array):

-rw-r--r-- 1 pharaldsson_sym pharaldsson_sym 6288 júl 14 10:05 mydata.jld
-rw-r--r-- 1 pharaldsson_sym pharaldsson_sym 6288 júl 14 10:52 mydata_c.jld
-rw-r--r-- 1 pharaldsson_sym pharaldsson_sym 5180 júl 14 10:49 mydata.jld2
-rw-r--r-- 1 pharaldsson_sym pharaldsson_sym 5180 júl 14 10:33 mydata_c.jld2

Reading the code, it seems JLD2 only has, and I guess always uses, deflate/ZlibCompressor, while JLD:

uses Blosc compression, which imposes very little performance penalty, but leads to HDF5 files that are not readable by other applications unless a Blosc plugin is installed. If you also specify compatible=true, then a different (and often slower) compression method is used that should be readable by any HDF5-using software.

I would like the even better compression options that are now available, but with compression enabled I’m just not sure JLD2 is faster (is that outdated info, or does it only hold with memory-mapped files?).
JLD, or HDF5.jl that is, added Mmap in July 2019: Update to use BinaryProvider by staticfloat · Pull Request #555 · JuliaIO/HDF5.jl · GitHub

and added Blosc in August 2018: https://github.com/JuliaIO/HDF5.jl/blob/c7865be2431406f9438056580cbb891e375bfafb/Project.toml

From my perspective, time is better spent on that, and (fully) Julia-only solutions are neither important nor preferred. At least the good compression algorithms, both the new ones and Blosc, are not written in Julia and will not be maintained there in the near future.

I’m not sure whether the following applies to JLD2, but it’s a bit alarming for JLD:

Note: You should only read JLD files from trusted sources, as JLD files are capable of executing arbitrary code when read in.

JLD2 has faster startup (despite HDF5_jll now used by JLD):

$ julia --startup-file=no -O0 -q

julia> @time using JLD
  1.270251 seconds (1.66 M allocations: 94.025 MiB, 0.98% gc time)

julia> @time using JLD  # default Julia optimization:
  1.632788 seconds (1.66 M allocations: 93.993 MiB, 0.87% gc time)

FYI: e.g. “JLD.jl is the preferred way of saving ScikitLearn.jl models” because of “PyCallJLD, which uses pickle”. See: Links don't work and docs outdated · Issue #85 · cstjean/ScikitLearn.jl · GitHub, and GitHub - JuliaPy/PyCallJLD.jl: JLD support for PyCall objects

Another question that needs to be resolved is whether members of JuliaIO are willing to review my PRs.
If that is not the case then I could still start developing my own fork of JLD2 and possibly register it at a later point in time.

I can’t speak for them, but I have seen (semi-)abandoned packages just add new contributors when someone wants to continue the work.

Membership in the organisation is the least of the problems; you only need to find someone with a good understanding of the codebase to review the changes.

That is probably true.
The only alternative would be for me to start building a comprehensive summary of the library internals to make review easier and to keep development in a separate dev branch that needs to be tested rigorously before anything is merged into master.

Hey guys, I’ve been following this discussion. Do you know of any programming language where this “persistent” programming with arbitrary data structures works? That is what JLD2 is (was?) trying to do, isn’t it? Just run the program, at an arbitrary point dump the state of the program, and it can be resumed precisely as it was at any time.

Edit: I think it works with Lisp, but is there anything else?

Ah yes… https://root.cern.ch (which is C++, but complex enough that it’s basically its own thing); for example, Chapter: Trees

If anyone wants to actually use it, use the (official) Python binding: ROOT: PyRoot tutorials

I come from Matlab where this has worked for a long time.

I haven’t used it recently, so I don’t know: has saving code worked at any point? Meaning if I define a function and try to save it, will it work?

The function save in R works really well, saving data and code.

Root is impressive! But, it is still C++ :frowning:

Yes, you can save and load closures, user-defined classes, etc. Of course, this is considerably easier to do in Matlab than Julia, but it has been possible for quite some time.

I’ll just add that I would use something like JLD2 if it was a bit more polished and well-tested. In the past, I did use MATLAB’s save utility quite extensively, even for longer term storage. Right now in Julia, I stick to CSV files and am just pickier about what actually needs to be saved to a file.

I would like to help out (pending my time commitments elsewhere). One thing that I think could be useful is splitting out the HDF5 part from the serializer: I think there would be great value in having a pure-Julia HDF5 implementation, and potentially supporting other container file formats.

Thank you, that is great to hear!

One thing that I think could be useful is splitting out the HDF5 part from the serializer: I think there would be great value in having a pure-Julia HDF5 implementation,

In principle I would agree, but I fear that JLD2 is as fast as it is largely because its routines are fairly low-level. I’m not sure that this can be split off easily or without having to make a lot of compromises.
On the other hand, I think it could be possible to add an HDF5-compatible mode that only supports basic types and encodes type signatures in a way that can be read by other HDF5 libraries.

This is definitely something that we can discuss in the future.
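
To make the idea of separating the container from the serializer a bit more concrete, here is a rough sketch. It is not how JLD2 is structured. Every name in it (`AbstractContainer`, `DictContainer`, `write_dataset`, `read_dataset`, `save_object`, `Particle`) is hypothetical, and a plain `Dict` stands in for a real HDF5 backend. The serializer layer only ever hands basic values to the container and stores the type name as a string next to the fields.

```julia
# Hypothetical container interface: stores only basic values under string keys.
abstract type AbstractContainer end

# A Dict-backed stand-in for a real file format such as HDF5.
struct DictContainer <: AbstractContainer
    data::Dict{String,Any}
end
DictContainer() = DictContainer(Dict{String,Any}())

write_dataset(c::DictContainer, key::AbstractString, value) =
    (c.data[String(key)] = value; nothing)
read_dataset(c::DictContainer, key::AbstractString) = c.data[String(key)]

# Values the container is assumed to understand natively.
const BasicType = Union{Number,AbstractString,AbstractArray{<:Number}}

# Serializer layer: lower a struct to its fields plus a type-name string.
function save_object(c::AbstractContainer, key::AbstractString, x)
    if x isa BasicType
        write_dataset(c, key, x)
    else
        write_dataset(c, "$key/__type__", string(typeof(x)))
        for name in fieldnames(typeof(x))
            save_object(c, "$key/$name", getfield(x, name))
        end
    end
    return nothing
end

# Usage with a hypothetical user-defined struct.
struct Particle
    mass::Float64
    position::Vector{Float64}
end

store = DictContainer()
save_object(store, "particle", Particle(1.5, [0.0, 1.0]))
stored_keys = sort(collect(keys(store.data)))
```

Whether such a layering can keep the speed of the current low-level routines is the open question raised above; the sketch says nothing about performance.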

HDF5 will maintain a significant presence in HPC due to its inclusion in the software for exascale project (e4s-project.github.io). That is something to be considered in this context, perhaps?