I’ve heard of problems with stability in JLD2, but I’ve never experienced it personally.
We and others who work in cluster environments had serious data loss issues that can only be avoided by fiddling with undocumented parameters.
https://github.com/JuliaIO/JLD2.jl/issues/55
A pull request to at least document this in the error messages hasn’t been merged:
https://github.com/JuliaIO/JLD2.jl/pull/62
Anecdotally, it has also been more difficult to get usable types back out (for example, if the package version changes), but maybe we have just gotten better at that dance. Either way, I am staying well clear of the package. For anything I really care about, I don’t trust serialization either; I turn it into an array of floats by hand first and then use BSON.jl.
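A minimal sketch of that workflow, assuming BSON.jl is installed (the file name and the xs/ys fields are made up for illustration):

```julia
using BSON

# Suppose we have some custom result object we care about; reduce it
# by hand to plain Float64 arrays before saving (hypothetical fields).
xs = rand(100)            # stand-in for results.x
ys = rand(100)            # stand-in for results.y

BSON.bson("results.bson", Dict(:xs => xs, :ys => ys))

# Later (possibly in a fresh session, with different package versions):
data = BSON.load("results.bson")
data[:xs]                 # plain Vector{Float64}, no custom types involved
```

Because only standard arrays are stored, the file does not depend on any custom type definitions being loadable later.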
Thank you, but is there then no standard reliable go-to option that the Julia community uses (or prefers), or is there, that will probably remain the standard reliable option?
I don’t think there is a “standard” option, but IMO BSON.jl is pretty useful.
For more complicated container types, consider mapping to something simpler/standard, e.g. a NamedTuple of plain vanilla Arrays. Even when reconstructing more complex types is possible, it may not make sense if these types change after the data is saved.
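For example (a sketch with made-up names): instead of serializing a complex container directly, flatten it to a NamedTuple of plain Arrays first:

```julia
using Serialization

# Hypothetical complex container: a Dict of matrices keyed by symbol.
complex = Dict(:train => rand(3, 3), :test => rand(2, 3))

# Map it to a NamedTuple of plain Arrays before saving.
flat = (train = Matrix(complex[:train]), test = Matrix(complex[:test]))
serialize("flat.jls", flat)

# Reading it back requires no custom type definitions:
flat2 = deserialize("flat.jls")
```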
Thanks.
How do BSON.jl and JSON.jl compare? Any pitfalls to think of? What about performance? Which is more stable?
The thing with having only arrays in BSON is that BSON is a well-supported file format:
http://bsonspec.org/implementations.html
So I am fairly confident I will get my data out again in a usable form. In principle that should work for HDF-backed systems as well, but see above for the state of the HDF-backed libraries…
Thanks.
Someone mentioned in this thread that he was “burned” by these formats including BSON. What would you say to that?
What about JSON?
If short-term storage is sufficient, how should someone go about using Serialization? I see its basic use, and was nudged into using an array or tuple to save more than one variable, but how then does one extract these variables from the array/tuple and put them in Main?
Definitely consider it, it is probably the best solution if neither your types nor the Julia version is expected to change.
I would prefer to do this manually (so that I do not clobber the namespace accidentally and maybe do some minimal validation), but you can always do something like
stuff = (a = 1, b = 2) # let's pretend this comes from deserialization
for (k, v) in pairs(stuff)
@eval $k = $v
end
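The manual version of the same thing might look like this sketch (the field names follow the toy example above):

```julia
stuff = (a = 1, b = 2)  # pretend this comes from deserialization

# Minimal validation before binding anything in Main, so that a
# malformed or unexpected file fails loudly instead of clobbering names:
haskey(stuff, :a) || error("deserialized data is missing field :a")
haskey(stuff, :b) || error("deserialized data is missing field :b")

a = stuff.a
b = stuff.b
```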
Thanks.
I don’t understand. And I haven’t seen those two things before, that is, the macro @eval and the function pairs().
Maybe, if you don’t mind, using my test example would help. If you put some stuff into a string, and save this string of stuff in a serialized file, like so:
# make some stuff
s1 = "string 1";
s2 = "string 2";
X = 7;
r = 0.33;
df = DataFrame(
Number = [1, 10, 20]
);
# serialize/save to file
serialize("test", [s1, s2, X, r, df]);
…how do you extract/deserialize all that stuff, all those objects, from the file, such that they are immediately available in Main as though you had just created them manually like before?
Is there maybe even an already existing function for this in the package for quick convenient use?
You should look them up then, that’s why we have a manual and docstrings.
I would put them in a NamedTuple, otherwise their names are not saved. MWE:
using Serialization, DataFrames
s1 = "string 1";
s2 = "string 2";
X = 7;
r = 0.33;
df = DataFrame(Number = [1, 10, 20]);
# serialize/save to file
serialize("test", (s1 = s1, s2 = s2, x = X, r = r, df = df));
Then in a fresh session you can do:
using Serialization, DataFrames # MAKE SURE YOU LOAD DataFrames
# deserialize
objects = deserialize("test")
for (k, v) in pairs(objects)
@eval $k = $v
end
Thanks. I will try that.
I did check the documentation in this case, by the way, which is why I asked here.
I have a related question on this.
I can see that HDF5 has its benefits.
Consider a scenario where I only want to store DataFrames (and possibly additional variables which are HDF5 compatible). If my limited understanding of HDF5 is correct, I can easily store each column vector with HDF5.
So I was wondering whether anyone already wrote a wrapper to store DataFrames with HDF5.
I guess there are possibly several steps that would be interesting:
- Simplest case: the dataframe columns are HDF5-compatible types (signed and unsigned integers of 8, 16, 32, and 64 bits, Float32, Float64, UTF8String).
- If my dataframe contains certain incompatible types, one could circumvent this in different fashions (e.g. UInt128 could be stored as a String when saving the dataframe and converted back when loading it). For CategoricalArrays or PooledArrays a mapping could probably be done in a simple fashion. Dates could probably be stored as Int64s.
Has anyone already implemented this? If not I will at least do some more thinking about this.
Or is this effectively the approach of JLD?
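The Dates-as-Int64 idea from the list above can be sketched as a pure round-trip (no HDF5 involved yet):

```julia
using Dates

# Store Date values as their underlying Int64 day counts and
# reconstruct them on load.
dates  = [Date(2021, 1, 1), Date(2021, 6, 15)]
as_int = Dates.value.(dates)           # Int64 days (Rata Die counts)
back   = Date.(Dates.UTD.(as_int))     # reconstruct the Dates
@assert back == dates
```

The Int64 vector is a plain HDF5-compatible column; only the reconstruction step needs to know the original type.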
I think you can easily extend a read/write method for the dataframe type, which reads/writes a separate group with a bunch of columns, and then extend some read/write methods for vectors of types that are not supported by HDF5.jl itself.
Another question is how you could automatically attach some metadata to a group (or to an unsupported column vector type) so that it is automatically read back as a dataframe, or whatever type you want, rather than just a bunch of vectors. For example, when reading a whole h5 file into a dictionary tree.
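A rough sketch of both ideas, assuming HDF5.jl and DataFrames are installed. The function names write_dataframe/read_dataframe are hypothetical, and note that HDF5 does not guarantee column order on read, so a real implementation should also store the column order as metadata:

```julia
using HDF5, DataFrames

# Write each column of a DataFrame to its own dataset inside an HDF5
# group, tagging the group so a reader can recognise it as a DataFrame.
function write_dataframe(file::AbstractString, name::AbstractString, df::DataFrame)
    h5open(file, "w") do f
        g = create_group(f, name)
        for col in names(df)
            g[col] = df[!, col]               # one dataset per column
        end
        attributes(g)["julia_type"] = "DataFrame"  # metadata tag
    end
end

# Read the group back into a DataFrame (column order not preserved here).
function read_dataframe(file::AbstractString, name::AbstractString)
    h5open(file, "r") do f
        g = f[name]
        DataFrame([col => read(g[col]) for col in keys(g)])
    end
end
```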
Hello, I hope I can revive an old thread rather than starting a new one on the subject.
Most of the above are able to store variables, arrays and dataframes nicely.
I am looking for a way to store the results from
- ODE solvers: sol = solve(prob, ....)
- Optimisations: res = optimize(....)
Both are more complex data objects, which do not seem to recover well from any of the options I tried. Storing only the dataframes/arrays will lose most of the added value, e.g. the interpolation/function behaviour of ODE solver results.
Is there a way to achieve that? In my case it’s great that I can brag that my optimisation only takes hours rather than days compared to MATLAB, but if I cannot store the data and reload it, that’s kind of a moot point if I need to use the meta information that’s part of the solution.
Have you tried the Serialization standard library (Serialization · The Julia Language)?
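A sketch of what that would look like (sol stands for whichever solution object you have, and the file name is arbitrary):

```julia
using Serialization

serialize("sol.jls", sol)   # dump the full solution object, not just arrays

# In a fresh session, load the same packages first (e.g. the ODE or
# optimisation packages that define the solution's type), then:
sol2 = deserialize("sol.jls")
```

Because the object's type is defined by the packages, they must be loaded, at compatible versions, before deserializing.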
I haven’t tried that one yet, I have to admit. Well, I’m currently in the process of setting up a test case, so I may come back later to confirm whether it works.
I just wanted to see if there is one that is
a) supposed to be able to store this kind of object
b) proven to work
What about the incompatibility between versions of Julia that was mentioned in this thread? For the objects I need to store that would most likely not be an issue, since the object struct will change between versions anyway. But will a dataframe I save in Julia 1.6 still be readable in 1.7 or beyond? CSV and HDF5 are great for long term storage, but for the same reasons, they are quite limited with respect to what can be stored.
I guess what I’m after is a simple “dump workspace” and “reload workspace”. And from what I’ve read so far “serialize” and “deserialize” are my best bet there - BTW, is there a “save all workspace” option in serialize?
In my opinion, Serialization has proven to be significantly more stable between Julia versions than many other packages that claim to be so.
Thanks.
Serialize does the job I need it to do. I don’t want to store the result in perpetuity.
The only problem left is that the ODE solution is based on ModelingToolkit and carries all the preliminary equations to calculate the state variables that are not part of the actual solution vector.
That in itself is not a problem, but I use driver functions which are @registered, and when I deserialize the solution it complains that these are not defined. This is easy to solve: you just need to ensure that the functions are registered again before you load the ODE result back. Then everything works.
I am wondering if it would make sense to introduce something like
for serialization.