JLD.jl vs JLD2.jl

What is the suggested way of saving a bunch of variables to a file at this point, JLD or JLD2? The last thread comparing the two packages was JLD2’s announcement a year ago, and people seemed split over which to use. It doesn’t seem that there have been any commits to JLD.jl in a year.

4 Likes

JLD.jl is not yet compatible with Julia v0.7/v1.0.
If you’re on v1.0 then JLD2.jl is the way to go AFAICT.
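
For the original question (saving a handful of variables), a minimal sketch with JLD2's `@save` macro and `jldopen` might look like the following. The variable names and file name are made up for illustration, and the JLD2 package is assumed to be installed.

```julia
using JLD2  # assumes JLD2.jl is installed (pkg> add JLD2)

# Hypothetical variables to persist
weights = rand(3, 3)
labels = ["a", "b", "c"]

# Hypothetical file location in a fresh temporary directory
path = joinpath(mktempdir(), "example.jld2")

# Write both variables to the file, stored under their own names
@save path weights labels

# Read them back by name without overwriting the variables above
restored = jldopen(path, "r") do file
    (weights = file["weights"], labels = file["labels"])
end
```

Reading through `jldopen` rather than `@load` is only a choice for this sketch: it keeps the restored values separate from the originals so the two can be compared.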

2 Likes

Ok thanks!

You can also check out BSON.jl

Cool, thanks! What are the pros and cons of BSON? Seems like two pros are that it’d be more language independent, and that it doesn’t use HDF5 (and therefore avoids the corruption issues etc. that come with HDF5)?

It can also save closures and functions which, if I’m not mistaken, JLD(2) cannot.

The only caveat is that JLD2.jl needs some work to consistently work on Julia 1.0 in all cases.

Can you describe what you expect not to work? I’ve been using JLD2.jl on 1.0 for a while now and haven’t encountered any problems.

More generally, it seems like JLD.jl is deprecated in favor of JLD2.jl (at least in practice if not in name). Are there any plans to make this official and change the name of JLD2.jl to JLD.jl?

1 Like

Here are the problems I think are most significant:

  1. Handling missing: “EXCEPTION_ACCESS_VIOLATION with large Vector{Union{Missing,Int32}}” (JuliaIO/JLD2.jl issue #108) and “Zeros converted to missing when loading DataFrame with Union types” (JuliaIO/JLD2.jl issue #111).
  2. Correctly handling the situation where the underlying layout of a type changes: “Char data type is not compatible between Julia 0.7 and Julia 0.6” (JuliaIO/JLD2.jl issue #110).
  3. Saving UnionAll: “Failure to save struct parameterized on Union containing UnionAlls.” (JuliaIO/JLD2.jl issue #109).
  4. Handling types across modules: “Reconstructing types defined in one module inside another” (JuliaIO/JLD2.jl issue #107).

They also suggest that other problems might lurk in corner cases.

3 Likes

there is an irony here. I believe one lauded aspect of the JLD format over the serialization format was that it would be stable for much longer than serialization, whose formats could change with every release, and the newer versions would forget how to load the older ones.

it reminds me a little of the BBC Domesday Project (see Wikipedia), which was supposed to last another 1000 years and did not even make it stably to age 30.

for a format to be long-term stable means also that the julia package will be stable. will JLD2 be long-term stable?

1 Like

Simon was wary of registering it because, as he mentioned at JuliaCon 2017, he probably wouldn’t have the time to maintain it properly. We are now seeing the effect of that; we were properly warned :slight_smile:. I think the bigger question is what we should do about it. Since Julia is finally stable with v1.0, it sounds like if the issues @bkamins mentions are addressed then it will be a quite good Julia v1.0 offering. In that case, it may end up stable by default after that, which could be a good thing for this kind of library.

For JLD proper, there is the pull request “Fix all julia 0.7 issues” by crbinz (JuliaIO/JLD.jl pull request #227). Personally, I think we as a community need to go to JLD2 or BSON because of the function support that they offer (this is required for any DiffEq usage of these tools), and having a common saving format is somewhat essential to making things jibe well. (But it’s always easy to mention work someone else should do haha.)

4 Likes

thx, chris. I did not mean to complain. I agree that it would be nice to have a permanent binary storage format. Unless the code to serialize/deserialize were backward-compatible, so that later Julia versions could still read earlier Julia data.

Oh no worries. I put the parenthetical because I am saying a lot about what JLD, JLD2, BSON “should do” and putting no work into it myself :smile:. Serialization is an interesting mention though since I wonder how much that could change post Julia v1.0. Serialization is heavily tied to things like message passing for multiprocessing so I don’t think it could change without being breaking, so “serialization won’t break in Julia v1.x” might be a safe bet. I’d bounce that off someone who works on the internals to double check though.

JLD2 works fine on 1.0. What are the effects you are mentioning?

I agree it works fine on v1.0. The effects that I am mentioning are the ones from @bkamins’s list.

JLD2 works fine and has a stability upside due to its limited activity, but that does mean that support for the latest and greatest features will lag. Some of these features, like missing, can be pretty crucial in some communities so it’s important to note that there’s really no one who finds it their duty to add features to it daily/weekly.

1 Like

apologies for piping in. can I summarize my understanding of Serialization vs JLD2?

Serialization could be less trustworthy as a long-term storage format, i.e., for data that one still wants to read in 10 years. Although serialize could change, as long as deserialize can still read old-version-serialized data, the previous sentence could be wrong. Serialization could serve as a viable long-term data storage format.

JLD2 is an alternative binary format, albeit not one that is part of the base language. Its main advantage is HDF5 writing (but not reading) compatibility. It may sometimes be faster than Serialization. However, it is not maintained by base, and has some edge-case problems that may be fixed in the future. Thus, if the maintainers lose interest, it may not be as good as a long-term storage format, either.

Putting the two together, I am wondering whether either or both are good long-term data storage formats.

1 Like

You should be aware that whatever binary storage format is used it has to take care of possibilities of:

  • different underlying infrastructure
  • different versions of Julia
  • different specification of user defined types between sessions

For serialize/deserialize to work all three things above must not change.
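
As a minimal illustration of that round trip, here is a sketch using only the `Serialization` standard library. The `Measurement` struct and the file name are hypothetical. Reading succeeds here because the same machine, the same Julia version, and the same type definition are used on both sides, which are exactly the three conditions listed above.

```julia
using Serialization  # standard library, no extra package needed

# Hypothetical user-defined type; deserialize relies on this definition
struct Measurement
    name::String
    samples::Vector{Float64}
end

original = Measurement("run-1", [1.0, 2.5, 4.0])

# Hypothetical file location in a fresh temporary directory
path = joinpath(mktempdir(), "measurement.bin")

# Write the object, then read it back in the same session
serialize(path, original)
restored = deserialize(path)
```

The fields are compared individually in any check because the default `==` for a struct holding a `Vector` compares the vectors by identity, not by contents.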

For JLD2.jl, AFAIK, different underlying infrastructure is currently handled correctly. Different versions of Julia can be handled, but this still requires some work (big thanks to the people who take care of this; anyone who could help here would be very welcome). Handling different specifications of user-defined types between sessions is even harder, and I do not know of any concrete plans to support it (I am not involved in any of the packages, though, so I might be missing something).

In general I think it would be best if, as a community, we decided whether BSON or JLD2 is the primary long-term binary storage format and all concentrated on supporting it. Of course both are valuable, but given that writing and maintaining such an infrastructure package is difficult and not very rewarding, it would be great if at least one of them had a decent community behind it.

7 Likes

FWIW, I would use

  1. HDF5 or a similar stable format for anything long-term, just making use of basic types, eg arrays of homogeneous items (which of course means that you cannot easily use complicated composite types and constructs),
  2. for anything else I can easily regenerate, mmap or gzipped serialize, depending on various trade-offs, with the understanding that I would have to regenerate this occasionally, and set up the infrastructure for it.
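
For the first option, a minimal sketch of storing a plain homogeneous array with HDF5.jl's `h5write` and `h5read` could look like this. The dataset path `results/A` and the file name are made up for illustration, and the HDF5 package is assumed to be installed.

```julia
using HDF5  # assumes HDF5.jl is installed (pkg> add HDF5)

# A plain homogeneous array: the kind of basic type suggested above
A = collect(reshape(1.0:6.0, 2, 3))

# Hypothetical file location in a fresh temporary directory
path = joinpath(mktempdir(), "data.h5")

# Store the array under a (hypothetical) dataset path, then read it back
h5write(path, "results/A", A)
B = h5read(path, "results/A")
```

Because only a plain `Float64` array is stored, the file carries no Julia-specific type information and can be opened by other HDF5 tools.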

3 Likes

  • Is the serialized data format compatible across different computers? I tried macos and linux on x86, and they were compatible. Are there any known infrastructures where they are not compatible?

  • are there any known instances where user-defined types can wreak havoc on serialized but not on JLD2 data?

  • I am thinking that the most endearing aspect of JLD2 is that it is HDF5 compatible, thus interchangeable, and thus more likely to last for longer. Fair?

  • If I had data that I would want to be readable in 50 years, I am thinking that even HDF5 is not half as safe a bet as almost any text-format. So, for long-term storage, even yuck-CSV and yuck-JSON may be better bets.

  1. I don’t know.

  2. I think it is the other way round: for a given version and OS architecture, JLD2 may be less resilient than serialization.

  3. AFAIK JLD2 is a subset of HDF5 with special metadata: that is to say, an HDF5 reader may be able to extract all the information in some format, but it would need to be reconstructed.

  4. Don’t know about that. Including its precursor, the HDF group has been around since the late 1980s. My problem with CSV is the lack of metadata: I can read it, but what does it mean?