A future for JLD2?

giordano · July 14, 2020, 12:24am

I’d say that bindings for good libraries in other languages are always welcome, but this is a bit tangential to the current discussion, where many users advocate for a good, stable pure-Julia solution for serialisation.

tisztamo · July 14, 2020, 5:11am

I have tried to use Serialization with small but relatively complex parametric structs between an Ubuntu 18.04 machine and a recent OS X machine. It worked on Julia 1.3, but upgrading both to 1.4 broke it.

Edit: IMO stabilizing Serialization would be a great solution to this problem. I would also be happy to help with that.
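
A minimal sketch of the kind of round trip described above, using only the Serialization standard library. The `Measurement` struct is hypothetical and merely stands in for the parametric structs mentioned in the post. It shows the same-session case that works, and does not reproduce the 1.3 to 1.4 breakage.

```julia
using Serialization

# Hypothetical parametric struct, standing in for the user-defined types in the post.
struct Measurement{T<:Real}
    label::String
    values::Vector{T}
end

original = Measurement{Float64}("run-1", [1.0, 2.5, NaN])

# Serialize into an in-memory buffer, then read it back in the same session.
buffer = IOBuffer()
serialize(buffer, original)
seekstart(buffer)
restored = deserialize(buffer)

# The type parameter and the field contents survive the round trip.
println(typeof(restored))
println(isequal(restored.values, original.values))
```

Writing to a file instead of an `IOBuffer` works the same way via `serialize("state.bin", original)` and `deserialize("state.bin")`; the file name here is only an example.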

Per · July 14, 2020, 6:39am

NaNs can carry a “payload”. That is: there are roughly 16.8 million different 32-bit values (2^24 − 2 bit patterns) that are all interpreted as NaN. (Even more for 64-bit numbers.)
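
As an illustration, here is a small sketch in plain Julia (no packages) that builds two Float32 NaNs differing only in their payload bits. The variable names are made up for this example.

```julia
# A Float32 is NaN when all 8 exponent bits are set and the 23 fraction bits are non-zero.
default_nan = reinterpret(Float32, 0x7fc00000)  # bit pattern of the built-in NaN32
payload_nan = reinterpret(Float32, 0x7fc00001)  # same pattern with one extra payload bit set

both_are_nan = isnan(default_nan) && isnan(payload_nan)

# `===` compares the raw bits of floats, so the payload difference is visible.
same_bits = default_nan === payload_nan

# `isequal` treats every NaN as equal to every other NaN, whatever the payload.
equal_as_values = isequal(default_nan, payload_nan)

# Number of NaN bit patterns: 2 signs * (number of non-zero fraction patterns).
float32_nan_count = 2 * (2^23 - 1)
float64_nan_count = 2 * (2^52 - 1)

println((both_are_nan, same_bits, equal_as_values))
println(float32_nan_count)
```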

JonasIsensee · July 14, 2020, 7:26am

Thank you, these are all interesting discussions, but by now the conversation has drifted significantly away from what I wanted to talk about in this thread.

To restate the question:
Are people already using JLD2, or would they use it if it were maintained again?
Are you able and willing to help?
I would also particularly ask for feedback from JuliaIO people.

Palli · July 14, 2020, 11:09am

Thanks for working on JLD2. If it’s actually still faster (or can be), then it might be worthwhile. I’m just not sure it is faster than JLD when JLD uses its non-default Blosc compression.

I only have modest use cases (so far), so I don’t care either way, but I looked a bit into both. I noticed that JLD2 gives me the same file regardless of the compression option:

julia> using JLD2
julia> A = []
julia> file = jldopen("mydata_c.jld2", "w", compress=true)
julia> file["A"] = A  # with JLD: JLD.@write file A (the unqualified @write shown in the docs did not work, so I filed an issue)
julia> close(file)

For JLD the files differ but have the same size (which I guess could not happen for a non-empty array):

-rw-r--r-- 1 pharaldsson_sym pharaldsson_sym 6288 júl 14 10:05 mydata.jld
-rw-r--r-- 1 pharaldsson_sym pharaldsson_sym 6288 júl 14 10:52 mydata_c.jld
-rw-r--r-- 1 pharaldsson_sym pharaldsson_sym 5180 júl 14 10:49 mydata.jld2
-rw-r--r-- 1 pharaldsson_sym pharaldsson_sym 5180 júl 14 10:33 mydata_c.jld2

Reading the code, it seems JLD2 only has, and I guess always uses, deflate/ZlibCompressor, while JLD, according to its documentation:

uses Blosc compression, which imposes very little performance penalty, but leads to HDF5 files that are not readable by other applications unless a Blosc plugin is installed. If you also specify compatible=true, then a different (and often slower) compression method is used that should be readable by any HDF5-using software.

I would like the even better compression options that are now available, but with the ones above in use, I’m just not sure JLD2 is faster (is that outdated information, or does it only hold with memory-mapped files?).
JLD, or HDF5.jl that is, added Mmap in July 2019: Update to use BinaryProvider by staticfloat · Pull Request #555 · JuliaIO/HDF5.jl · GitHub

and Blosc in August 2018: https://github.com/JuliaIO/HDF5.jl/blob/c7865be2431406f9438056580cbb891e375bfafb/Project.toml

From my perspective, time is better spent on that, and (fully) Julia-only solutions are neither important nor preferred. At least the good compression algorithms (the new ones, and also Blosc) are not written in Julia and will not be maintained there in the near future.

I’m not sure whether the following applies to JLD2, but it’s a bit alarming for JLD:

Note: You should only read JLD files from trusted sources, as JLD files are capable of executing arbitrary code when read in.

JLD2 has faster startup (despite HDF5_jll now used by JLD):

$ julia --startup-file=no -O0 -q

julia> @time using JLD
  1.270251 seconds (1.66 M allocations: 94.025 MiB, 0.98% gc time)

julia> @time using JLD  # in a fresh session with the default Julia optimization level
  1.632788 seconds (1.66 M allocations: 93.993 MiB, 0.87% gc time)

FYI: e.g. “JLD.jl is the preferred way of saving ScikitLearn.jl models.” because of “PyCallJLD, which uses pickle”. See: Links don't work and docs outdated · Issue #85 · cstjean/ScikitLearn.jl · GitHub; GitHub - JuliaPy/PyCallJLD.jl: JLD support for PyCall objects

JonasIsensee · July 14, 2020, 12:22pm

Another question that needs to be resolved is whether members of JuliaIO are willing to review my PRs.
If that is not the case then I could still start developing my own fork of JLD2 and possibly register it at a later point in time.

Tamas_Papp · July 14, 2020, 2:21pm

I can’t speak for them, but I have seen (semi-)abandoned packages just add new contributors when someone wants to continue the work.

giordano · July 14, 2020, 3:01pm

Membership in the organisation is the least of the problems; you only need to find someone with a good understanding of the codebase to review the changes.

JonasIsensee · July 14, 2020, 4:28pm

That is probably true.
The only alternative would be for me to start building a comprehensive summary of the library internals to make review easier and to keep development in a separate dev branch that needs to be tested rigorously before anything is merged into master.

PetrKryslUCSD · July 14, 2020, 4:45pm

Hey guys, I’ve been following this discussion. Do you know of any programming language where this “persistent” programming with arbitrary data structures works? That is what JLD2 is (was?) trying to do, isn’t it? Just run the program, at an arbitrary point dump the state of the program, and it can be resumed precisely as it was at any time.

Edit: I think it works with lisp, but is there anything else?

jling · July 14, 2020, 5:00pm

Ah yes: https://root.cern.ch (which is C++, but complex enough that it’s basically its own thing);
for example, Chapter: Trees

If anyone wants to actually use it, use the (official) Python binding: ROOT: PyRoot tutorials

hendri54 · July 14, 2020, 5:29pm

I come from Matlab where this has worked for a long time.

PetrKryslUCSD · July 14, 2020, 5:37pm

I haven’t used it recently, so I don’t know: has saving code worked at any point? Meaning if I define a function and try to save it, will it work?

Paulo_Jabardo · July 14, 2020, 5:40pm

The function save in R works really well, saving data and code.

PetrKryslUCSD · July 14, 2020, 6:30pm

ROOT is impressive! But it is still C++.

jondeuce · July 14, 2020, 6:48pm

Yes, you can save and load closures, user-defined classes, etc. Of course, this is considerably easier to do in Matlab than Julia, but it has been possible for quite some time.

tbeason · July 14, 2020, 6:50pm

I’ll just add that I would use something like JLD2 if it was a bit more polished and well-tested. In the past, I did use MATLAB’s save utility quite extensively, even for longer term storage. Right now in Julia, I stick to CSV files and am just pickier about what actually needs to be saved to a file.

simonbyrne · July 14, 2020, 8:51pm

I would like to help out (pending my time commitments elsewhere). One thing that I think could be useful is splitting out the HDF5 part from the serializer: I think there would be great value in having a pure-Julia HDF5 implementation, and potentially supporting other container file formats.

JonasIsensee · July 15, 2020, 9:52am

Thank you, that is great to hear!

One thing that I think could be useful is splitting out the HDF5 part from the serializer: I think there would be great value in having a pure-Julia HDF5 implementation,

In principle I would agree, but I fear that JLD2 is as fast as it is largely because its routines are fairly low-level. I’m not sure that this can be split off easily or without having to make a lot of compromises.
On the other hand, I think it could be possible to add an HDF5-compatible mode that only supports basic types and encodes type signatures in a way that can be read by other HDF5 libraries.

This is definitely something that we can discuss in the future.

PetrKryslUCSD · July 15, 2020, 2:56pm

HDF5 will maintain a significant presence in HPC due to its inclusion in the software for exascale project (e4s-project.github.io). That is something to be considered in this context, perhaps?
