What is the preferred way to save variables?

jld
hdf5
jld2
#1

I’m trying to save variables from one run of a JuMP problem to use as initial values for variables in a separate JuMP problem. I see that there seem to be several (competing?) options for doing this:

JLD2’s readme says it’s the successor to JLD, but over the past several months JLD has been continuously updated while JLD2 has not. However, looking through the issues on the JLD repository, I came across an issue where one of the maintainers said that JLD would only be getting compatibility fixes rather than updates in the future.

From another post on the forum I found BSON, which seems to be completely independent of the JLD variants.

There’s an HDF5 package as well, but in its readme it talks about the advantages of using JLD over pure HDF5.

I don’t know how well the last three packages handle Julia data types, but my guess is that they’ll work for basic things like arrays and strings but have trouble with more complex types.

Which of these, if any, is the preferred or “official” way to save workspace variables?

2 Likes

#2

There are four, somewhat but not completely orthogonal questions here:

  1. well-defined formats with a spec vs more ad-hoc solutions (eg HDF5 vs CSV),
  2. binary vs text-based formats (eg BSON vs JSON), where the text-based ones can result in loss of precision in some cases,
  3. saving a restricted set of values (eg strings, arrays, numbers) or generic Julia values (array of some structs, further complications),
  4. maintenance status.
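
On point 2, the precision loss with text formats usually comes from how numbers are formatted when written, not from text as such. A minimal sketch using only the Printf standard library (the value and the six-digit format are arbitrary choices for illustration):

```julia
using Printf

x = 0.1 + 0.2   # not exactly 0.3 in Float64

# Fixed six-digit formatting drops the trailing digits.
lossy = @sprintf("%.6f", x)

# Julia's default printing emits the shortest string that parses back
# to the same Float64.
exact = string(x)

println(parse(Float64, lossy) == x)   # false
println(parse(Float64, exact) == x)   # true
```

So a text format written with default printing should round-trip plain Float64 values, but anything that passes through a fixed-width formatter or another tool may not.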

No solution is perfect, and you need to be more precise about your requirements.

FWIW, I think that while saving and reloading native objects would be very nice, in practice it is rather brittle; and most existing packages have outstanding issues, often with data corruption, so that rules out JLD2 and BSON. JLD, as you said, is not intensively maintained either. HDF5 is promising on paper but it is a monolithic, single-implementation piece of software (everyone just wraps the C library, if you ignore JLD2, which implements a subset). CSV is not precisely defined but can work in practice for matrix-like data, while JSON and BSON can be more flexible, but still require that you reformat your data to what they support.

In practice, I usually go with CSV, or Serialization for short-term storage on the same machine. This is overly conservative, but in the past I was burned by HDF5, JLD2, and BSON (corrupted data or intermittent bugs that were not fixed for a while). With CSV, I can always rescue my data, even if I have to whip up a parser myself in the worst case.
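
As a rough illustration of the "whip up a parser myself" fallback, here is a minimal sketch in plain Julia for a numeric matrix. The function names `write_csv` and `read_csv` are made up for this example, and it assumes every cell is a Float64 with no quoting, headers, or missing values:

```julia
# Write a matrix as comma-separated text, one row per line.
function write_csv(path::AbstractString, A::AbstractMatrix)
    open(path, "w") do io
        for row in eachrow(A)
            println(io, join(row, ','))
        end
    end
end

# Read it back by splitting each line on commas and parsing every cell
# as a Float64.
function read_csv(path::AbstractString)
    rows = [parse.(Float64, split(line, ','))
            for line in eachline(path) if !isempty(line)]
    return permutedims(reduce(hcat, rows))
end

A = [1.0 2.5; 0.1 -4.0]
path = tempname()
write_csv(path, A)
B = read_csv(path)
println(B == A)
```

For anything beyond simple numeric tables, a real CSV package is the more sensible choice; the point is only that the data stays recoverable.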

There are various other fringe formats, but they are even less widely used.

9 Likes

#3

I’m still in the early stages, so perhaps I should be careful about advertising it in a public forum like this, but I’m currently very much in the process of completing my pure Julia implementation of the Apache Arrow standard. This is essentially a table-focused format, but the full standard is diverse enough to allow storing a wide variety of objects; in principle it should support storing a wide class of Julia structs as tuples. The advantage of using Arrow is that it is supported in a wide variety of languages, and it is useful for interacting with things like Parquet and Spark. (And yes, they don’t advertise it well on the website, but it can certainly be used for storing data on disk.)

I am constantly needing to stash what I am working on on disk or on remote storage such as S3, and the time may well come when I need a Julia Spark wrapper on par with pyspark, so I’ve had lots of motivation for working on arrow.

On the opposite end of the compatibility spectrum, note that Julia’s built-in serializer can be exceptionally simple to use, was very performant the last time I checked, and is very reliable. The big disadvantage is of course that compatibility between different Julia versions is not guaranteed, so it’s only suitable for very short-term use.
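
For reference, a minimal sketch of the built-in serializer from the Serialization standard library. The `solution` dictionary is a made-up stand-in for whatever values you want to carry from one run to the next:

```julia
using Serialization

# Hypothetical solver output: variable names mapped to their values.
solution = Dict("x" => [1.0, 2.0, 3.0], "y" => 0.5)

path = tempname()
serialize(path, solution)      # write the object to disk
restored = deserialize(path)   # read it back, ideally with the same Julia version

println(restored == solution)
```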

10 Likes

#4

One more https://github.com/eschnett/ASDF.jl

2 Likes

#5

I really want to use ASDF just to annoy my data science colleagues.

Would be cool if it had a pure Julia implementation though :wink:

1 Like

#6

Protobuf may be a good choice? Does anyone have experience with that?

0 Likes

#7

I haven’t used the Julia implementation, but my experience of Protobuf from other contexts is very good: compact storage and extremely fast deserialization speeds. However, if you don’t need fast deserialization, I would choose another format for more readability and compatibility, such as CSV or JSON.

1 Like

#8

Ideally, I’d like a format that is binary, can save generic Julia values, and is maintained, but it looks like there’s not an option that meets all of those requirements currently. Serialization is good, but as has been pointed out, is incompatible between Julia versions. I’ll probably stick to a combination of that and CSV for the time being. Good luck with your project, @ExpandingMan.

0 Likes

#9

How does ASDF compare to Arrow, feather, hdf5, parquet…?
A database that seems very performant is TileDB but we don’t have it on Julia yet.

1 Like

#10

Thanks! But, to be clear, Arrow will not allow you to “save generic Julia values”, at least not easily. As I indicated, if you really wanted to, you could, for example, decompose some structs as tuples and store those in an Arrow format. One of the top things on my own personal wishlist is being able to store arrays of arbitrary rank, which the (I think new) Arrow tensor format should accommodate. If at some point someone wanted to make a JLDArrow or something like that, that might be pretty cool, but I’m not likely to take that on myself.
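
To make "decompose some structs as tuples" concrete, here is a minimal sketch in plain Julia. It does not use any Arrow package; `Measurement`, `to_tuple`, and `from_tuple` are hypothetical names for illustration, and it assumes a flat struct whose default constructor takes the fields in order:

```julia
# Hypothetical struct used only for illustration.
struct Measurement
    id::Int
    value::Float64
    label::String
end

# Flatten a struct into a plain tuple of its fields, in declaration order.
to_tuple(x) = ntuple(i -> getfield(x, i), fieldcount(typeof(x)))

# Rebuild the struct by passing the tuple back to the default constructor.
from_tuple(::Type{T}, t::Tuple) where {T} = T(t...)

m = Measurement(1, 2.5, "a")
t = to_tuple(m)
m2 = from_tuple(Measurement, t)
println(m2 == m)
```

Nested or parametric structs would need more care than this, which is part of why it is not something you get for free.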

0 Likes

#11

I think this is the hardest part, since Julia values can be really, really complex. Consider

struct Foo{T,S}
    x::S
end

T = Union{Missing,Foo,Float64}
f = Foo{3,Vector{T}}(T[missing, missing, 1.0])
f.x[2] = f

The other complication is that types need to be defined before instantiating objects.

1 Like

#12

About the “saving arbitrary Julia variables” part of the request, I’m currently working on a library to handle this sort of thing without running into the maintenance burden of JLD or JLD2.

Specifically, my approach is a simple, lightweight package that allows IO packages like HDF5.jl or CSV.jl to expose “backends” to the user, that just do the very simple job of format-specific IO for the limited set of types that the format can natively support.

More complicated types will be handled by creating a mapping function from the user’s types to some collection of types that the IO package natively supports, using metadata (potentially stored within the IO format, or outside of it, depending on what the IO package supports) to identify what that mapping looks like when the data is reloaded later on.
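
A rough sketch of that idea in plain Julia, to show the shape of it rather than the package's actual API: `Interval`, `registry`, `encode`, and `decode` are all hypothetical names, and the "natively supported" types are assumed to be strings, numbers, and vectors:

```julia
# Hypothetical user type.
struct Interval
    lo::Float64
    hi::Float64
end

# Registry from a type tag (the metadata) to the type it names.
registry = Dict{String,DataType}("Interval" => Interval)

# Map a struct to "native" values: a type tag plus a vector of field values.
encode(x) = Dict(
    "type" => string(nameof(typeof(x))),
    "fields" => Any[getfield(x, i) for i in 1:fieldcount(typeof(x))],
)

# Use the stored tag to look up the type, then rebuild it from its fields.
decode(d::AbstractDict) = registry[d["type"]](d["fields"]...)

d = encode(Interval(0.0, 1.0))
iv = decode(d)
println(iv == Interval(0.0, 1.0))
```

The dictionary `d` contains only a string and a vector of numbers, so in principle any backend that can store those could store it.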

I’m hoping this generic approach will end up being less difficult to maintain than the JLD* packages since it doesn’t directly have to implement any HDF5 specific details, and can make use of multiple backends to support saving/loading (almost) anything the user can throw at it.

Once I’ve finished up the first pass at functionality I’ll make a release announcement on Discourse (it’ll probably be called SerDes.jl); it might be in a few weeks since I’ve got a lot of other stuff going on in my life.

2 Likes

#13

Will it handle closures?

1 Like

#14

I’m not sure whether I’m hoping for a “yes” or a “no”…

6 Likes

#15

We’ll give it a try, but it’s not a top priority. I’m not going to prevent any PRs to support closures or any other kind of code-mixed-with-data serdes, of course!

0 Likes