A future for JLD2?

Hi everyone,

Julia shines at numerical computing and inevitably we need to write data to disk.
There are a few binary formats for Julia out there, such as Serialization, HDF5.jl, MAT.jl, BSON.jl, JLD.jl,
and JLD2.jl, that all work but also all have their limitations
(speed, file size, bugs, binary dependencies, stability…).

My personal favourite has always been JLD2 because it is fast, supports compression,
can store custom structs, and because it is implemented in pure Julia.
Sadly it has not seen any development in recent years and a few bugs have accumulated.
As of May 29 it has officially been marked as unmaintained.
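
For anyone who has not used it, the typical workflow looks roughly like this (a minimal sketch; the file and type names are made up):

```julia
using JLD2

struct Result                  # a custom struct; no manual conversion needed
    x::Vector{Float64}
    label::String
end

r = Result(randn(10), "run-1")
@save "result.jld2" r          # writes an HDF5-compatible .jld2 file
@load "result.jld2" r          # loads `r` back into the current scope
```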

In my opinion it would be a shame to let JLD2 go, which is why I am writing this call for help.

I have already invested some time in understanding the code base and trying to fix some problems.
See for example:

https://github.com/JuliaIO/JLD2.jl/pull/198

https://github.com/JuliaIO/JLD2.jl/pull/197

https://github.com/JuliaIO/JLD2.jl/pull/196

https://github.com/JuliaIO/JLD2.jl/issues/195#issuecomment-657185277

https://github.com/JuliaIO/JLD2.jl/issues/55#issuecomment-654091106

One of the biggest roadblocks is issue #55. I think that I have made some progress in uncovering the root of the problem, but I need help from someone with more knowledge of memory mapping and network file systems (NFS).

Would you be interested in using JLD2 if it gets brought up to speed again?

I would, of course, appreciate any help in terms of coding, but also review
and, in particular, discussion!

What do you think?


Would you be interested in using JLD2 if it gets brought up to speed again?

Yes, it would be great if we could rely on JLD2 again.

In case you make any progress, I would be happy to act as a beta tester.

/Paul S


Thanks for posting this @JonasIsensee. I am also a fan of JLD2, for the same reasons you stated. I would be sad to see it abandoned. Having said that, it would be good to know what some of the folks who have put a lot of time into the JuliaIO org think the future holds for saving custom Julia structs to disk before really jumping into JLD2. For example, is there a plan for JLD to be maintained for the foreseeable future? If so, maybe we should just live with the binary dependency of HDF5 and let development be focused there. Or is it recommended that people use the Serialization std lib for short-term storage of custom struct data but use a format that isn’t Julia-specific for long-term storage?


Personally I’m also still a big fan of JLD2, despite it being maintenance-only at this point, and would love to see it revived. IMO one of the top 5 advantages of Julia vs. Python is the ease of serialization: while Python’s pickle is super brittle by comparison, essentially everything in Julia, including arbitrary user-defined types, closures, etc., is robustly serializable out-of-the-box. It’s a bit of a shame not to absolutely hammer this point by having a great medium-term to-disk serialization option which, importantly, still at least loads your data even if type definitions change, which can be expected in the medium term (this is what JLD2 does, modulo bit-rot slowly introducing issues here and there). I don’t know enough to say which librar(ies) it makes most sense to converge on, but it would just be great to have something dependable moving forward.


I think that automatic externalization for all Julia types in a format that is stable in the long run is a very difficult task because of the complexity of Julia’s type system. It is not necessarily impossible, but takes a lot of work.

You can relax one of the requirements though to get a solution:

  1. if stability is not required (ephemeral storage), just serialize with Serialization.
  2. if simple types are sufficient (arrays, tables, or dictionaries), use one of the well-known formats (CSV, JSON).
  3. for more complex types, write a pair of conversion routines to convert to simple types (a sketch follows below).
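
For point 3, a minimal sketch of such a pair (all names here are hypothetical):

```julia
struct Experiment
    name::String
    samples::Vector{Float64}
end

# struct -> simple types (strings, arrays, Dicts), e.g. for JSON serialization
to_simple(e::Experiment) = Dict("name" => e.name, "samples" => e.samples)

# simple types -> struct, for loading
from_simple(d) = Experiment(d["name"], Vector{Float64}(d["samples"]))
```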

It is worth devoting some thought to the actual specs of the problem, and then picking from one of the nice existing libraries. I personally like


I don’t disagree that it’s difficult, and I certainly can’t really speak to it since I’m not working on it, but JLD2 (and BSON too, fwiw) was/is basically already really close to this. In my experience with JLD2, crazy complex things serialize just fine, and if a type changes by the time of deserialization, you get back a ReconstructedType so you can at least still easily get your data out. The fact that it was so close to perfect always surprised me when it stopped being actively maintained.


You are right, some requirements can sometimes be relaxed.
However, not all problems are the same.

In my applications, text-based formats such as CSV or JSON are not an option due to file size,
and because I don’t want rounding errors just from saving and loading my data.
An additional difficulty, which makes it hard to work with C libraries, is support for missing.
I tend to have large arrays with eltype Union{Missing, Float64}.
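
For concreteness, the kind of array I mean (a small sketch; the file name is made up):

```julia
using JLD2

# a large array where some entries are genuinely unknown
x = Vector{Union{Missing, Float64}}(missing, 1_000_000)
x[1:2:end] .= randn(500_000)

@save "data.jld2" x   # a text format would inflate this and round the values
```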

Another point, of course, is speed.
And with all those things, as @marius311 already said, JLD2 is/was already really close.


Maybe encode it as NaN.

This is not optimal. There are applications where NaN (i.e. the result is known to not exist) and missing (i.e. we don’t know the result) have to be distinguished.
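
For example, once missing is encoded as NaN, it can no longer be told apart from a genuine NaN that was already in the data (a tiny illustration):

```julia
x = Union{Missing, Float64}[1.0, NaN, missing]
y = coalesce.(x, NaN)    # encode missing as NaN

# y == [1.0, NaN, NaN]: the genuine NaN and the former missing
# are now indistinguishable, so information has been lost
```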

Also, in my experience, switching from JLD2 to BSON slows things down significantly for large files. Moreover, some applications might require saving more than 2 GB of data at once, which also rules out BSON.


Related to the suggestion of writing conversion routines for complex types:

What I find somewhat daunting about that prospect is having to write those converters for deeply nested structs.

The complication is that, in those cases, a lot of the conversion would just be housekeeping (traversing the nested struct). In my use cases, the actual conversion would usually be trivial (most of the “leaf types” are easily handled by any of the packages used for loading and saving).

My question is: is there a package that handles the housekeeping part of the conversion? What I have in mind is a function that traverses the nested struct, calls a conversion function when a “leaf type” is encountered, and returns a nested Dict. And, of course, a complementary function that traverses a nested Dict and copies the leaf types back into a nested struct.

I would even be willing to ensure that what I save is all mutable. And I could provide an initialized object that loaded values are copied into. That would cover probably 90% of my use cases.
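
A rough sketch of what I am asking for (to_dict / from_dict! are made-up names; from_dict! assumes the nesting is all mutable, as described above):

```julia
isleaf(x) = x isa Union{Number, String, AbstractArray{<:Number}}

# nested struct -> nested Dict, converting only at the leaves
function to_dict(x)
    isleaf(x) && return x
    Dict(string(f) => to_dict(getfield(x, f)) for f in fieldnames(typeof(x)))
end

# nested Dict -> already-constructed mutable object
function from_dict!(obj, d::Dict)
    for f in fieldnames(typeof(obj))
        v = d[string(f)]
        if v isa Dict
            from_dict!(getfield(obj, f), v)
        else
            setfield!(obj, f, convert(fieldtype(typeof(obj), f), v))
        end
    end
    return obj
end
```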

Maybe you missed it above, but I think the right approach is StructTypes.

Maybe I am misreading the StructTypes docs, but this doesn’t seem to handle the housekeeping part (recursion through the nested structs).

An example of StructTypes being used to serialize / deserialize a nested struct would be helpful.
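
Roughly what I would hope such an example looks like (a sketch, assuming JSON3.jl as the consumer of the StructTypes declarations):

```julia
using StructTypes, JSON3

struct Inner
    a::Int
end

struct Outer
    name::String
    inner::Inner
end

# one declaration per type -- this is the part that does not recurse
# automatically and has to be repeated for every nested type
StructTypes.StructType(::Type{Inner}) = StructTypes.Struct()
StructTypes.StructType(::Type{Outer}) = StructTypes.Struct()

s = JSON3.write(Outer("demo", Inner(1)))   # {"name":"demo","inner":{"a":1}}
o = JSON3.read(s, Outer)                   # back to an Outer
```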

I’ve been using JLD2 for a long time and rarely had any issues with it; it would certainly be good to keep it up to date, even with the current functionality only.


No, you need to define the method for all the types of the fields of all data structures involved. If a data type happens to be in another package, you need to either commit type piracy (not recommended) or convince the maintainer of the other package to depend on StructTypes and add the method (good luck).


That’s one reason why I was hoping for a deserializer that populates the fields of an already constructed object.

Working with nested dictionaries is also not very fast.
JLD2 is great in that it produces an inline representation for immutable structs.
When you have lots of structs this can be much faster because of less indirection.
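
For example (a sketch; the names are made up):

```julia
using JLD2

struct Point            # immutable and isbits: JLD2 can store these inline
    x::Float64
    y::Float64
end

pts = [Point(rand(), rand()) for _ in 1:1_000_000]
@save "points.jld2" pts  # one contiguous block, no per-element indirection
```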

If you want the smallest file size: HDF5 (the format JLD is based on) supports compression, but historically not the SPDP filter (and thus I’m not sure JLD[2] supports it yet), which is now possible and gives about 30% better compression (though 10x faster alternatives also exist):

https://userweb.cs.txstate.edu/~burtscher/research/SPDP/

SPDP is a fast, lossless, unified compression/decompression filter for HDF5 that has been designed for both 32-bit single-precision (float) and 64-bit double-precision (double) floating-point data. It also works on other data.

The paper on it is really interesting: https://userweb.cs.txstate.edu/~mb92/papers/dcc18.pdf

Abstract: Scientific computing produces, transfers, and stores massive amounts of single- and double-precision floating-point data, making this a domain that can greatly benefit from data compression. To gain insight into what makes an effective lossless compression algorithm for such data, we generated over nine million algorithms and selected the one that yields the highest compression ratio on 26 datasets.
[…]
We named the resulting algorithm SPDP, which is an abbreviation for “Single Precision Double Precision”. It is brand new […] Only Zstd performs better. On average, SPDP outperforms Blosc, bzip2, FastLZ, LZ4, LZO, and Snappy by at least 30% in terms of compression ratio. However, it tends to be slower.
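
For comparison, the built-in deflate filter is already reachable from Julia today (a sketch assuming a recent HDF5.jl; using SPDP would additionally require registering its third-party HDF5 filter):

```julia
using HDF5

A = rand(Float64, 1000, 1000)
h5open("compressed.h5", "w") do f
    # compression in HDF5 requires chunked datasets
    d = create_dataset(f, "A", datatype(A), dataspace(A);
                       chunk=(100, 100), deflate=3)
    write(d, A)
end
```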

We should consider supporting/using:
https://juliahub.com/ui/Packages/TurboPFor_jll/4zXB1/0.0.1+0

i.e. this package: https://github.com/powturbo/TurboPFor-Integer-Compression (Fastest Integer Compression)

and: https://github.com/powturbo/Turbo-Transpose (Transpose: SIMD Integer+Floating Point Compression Filter). [The first part is important since e.g. the current fastest supercomputer is ARM-based, and the coming Macs will be too.]

ALL TurboTranspose functions now available under 64 bits ARMv8 including NEON SIMD. […]

  • Dynamic CPU detection and JIT scalar/sse/avx2 switching
  • 100% C (C++ headers), usage as simple as memcpy […]
  • more efficient, up to 10 times! faster than Bitshuffle
  • better compression (w/ lz77) and
    10 times! faster than one of the best floating-point compressors SPDP
  • can compress/decompress (w/ lz77) better and faster than other domain specific floating point compressors
    […]
    eTp4Lzt = lossy compression with allowed error = 0.0001

See also: https://arxiv.org/pdf/1503.00638.pdf

a nearly lossless rounding step which compares the precision of the data to a generalized and calibration-independent form of the radiometer equation. This allows the precision of the data to be reduced in a way that has an insignificant impact on the data. The newly developed Bitshuffle lossless compression algorithm is subsequently applied

Also interesting: https://github.com/Ed-von-Schleck/shoco (shoco is a compressor for small text strings):

for very small strings, it will always be better than standard compressors.

Does this work cross-platform? Having something stable for a given Julia version across different OSs seems like a reasonable solution.

It is not guaranteed to, but I think it may work in practice if both are either 32 or 64 bit and the endianness matches. But I don’t think you want to rely on this, it could break any time without prior announcement.
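
For reference, the stdlib usage in question (a minimal sketch, assuming the question refers to the Serialization stdlib):

```julia
using Serialization

state = (params = rand(3), epoch = 42)  # any Julia object
serialize("state.jls", state)           # binary format tied to the Julia version
state2 = deserialize("state.jls")
```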

In practice, I would just choose a file format that fits the problem. CSV, JSON/BSON, HDF5, Feather… so many to choose from, with various advantages for each.


Would people be interested in Julia bindings to adios2? (Disclaimer: I’ve been a core contributor over the years.) We had a request for Julia bindings a while ago, but it didn’t take off. I can work on those if there is enough interest.

Info about adios2:
open access paper: “ADIOS 2: The Adaptable Input/Output System” (SoftwareX, 2020)
repo: https://github.com/ornladios/ADIOS2
docs: https://adios2.readthedocs.io
