A future for JLD2?

Tamas_Papp · July 15, 2020, 3:13pm

I think it is generally recognized that HDF5 is an important format, there is no question about that.

However, due to the complexity of the full format spec, very few languages have native implementations; most wrap the C library, which is what HDF5.jl already does, and HDF5.jl is actively maintained.

mbauman · July 15, 2020, 3:27pm

I’m not aware of any other implementations of HDF5. Everyone wraps the one and only implementation which is effectively the “standard.”

Tamas_Papp · July 15, 2020, 3:35pm

There are a few, but not many, because it is a daunting task. E.g.

https://github.com/jamesmudd/jhdf

From a practical perspective, yes, the C implementation is pretty much the standard.

jling · July 15, 2020, 3:58pm

https://iopscience.iop.org/article/10.1088/1742-6596/1085/3/032020/pdf

Highlights: Fig. 1, 2, 6. If only ROOT had an I/O-only library, more people would probably use it, as it is really good (somehow).

bionicinnovations1 · July 19, 2020, 1:46am

Issues like this can put new adopters of Julia off. It makes it look like the language is underdeveloped and unstable. I use Julia for data analysis, so I don’t have a strong computing/programming background. Any threat to packages I rely on heavily always comes with a reasonable degree of anxiety for us, and makes me question why I didn’t invest in developing our tools in Python rather than Julia. I decided to take the risk of investing our tool development in Julia because I hoped it would one day surpass Python.

Clearly an emerging language has a pathway to continual improvement. However, if any packages are “let go”, there needs to be a real effort to ensure (or at least give the impression) that a strong alternative package will: 1) be as good or better, and 2) read the legacy formats directly, so that old files can still be loaded and behave as they did under the old version. Those two things will make new adopters of Julia feel confident they have made the right choice.

JonasIsensee, I am very grateful that you have raised this and are trying to do what you can to save JLD2. It is clear from the thread that many others also support this. For the reasons above, I hope others will see the urgency of continuing to maintain good packages such as JLD2, especially if there is no plan for an alternative. I think this is necessary if Julia wants to continue to gain popularity.

xiaodai · July 19, 2020, 2:01am

Have you tried JDF.jl? I would appreciate comments and suggestions if it doesn’t satisfy your needs. It’s much faster than CSV if you read/load the data a lot.

roble · July 19, 2020, 5:28am

Hi @xiaodai
The speed and ease of use of JDF.jl are great. Out of interest, I had a look at the internals and saw that each column is a file. The metadata is serialized using Serialization.jl. That made me wonder why you would use Serialization in the first place, as it is not guaranteed to be readable across Julia versions.

Wouldn’t it make sense to use JSON for serializing the metadata? Please don’t get me wrong. I admire your work and am considering using JDF.jl for my next projects.

Best Regards

xiaodai · July 19, 2020, 5:55am

Yeah. I will be moving away from Julia Serialisation and move towards JSON and BSON for metadata in the next version.
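
For readers unfamiliar with the mechanism being discussed, here is a minimal sketch of a metadata round trip through the Serialization standard library. The `meta` dictionary and its keys are hypothetical and are not JDF.jl's actual metadata layout.

```julia
using Serialization

# Hypothetical metadata for a column-per-file table (illustrative keys only).
meta = Dict("columns" => ["a", "b"], "rows" => 3)

# Write the metadata into an in-memory buffer and read it back.
buf = IOBuffer()
serialize(buf, meta)
seekstart(buf)
restored = deserialize(buf)

println(restored == meta)
```

The round trip is convenient because it needs no schema, but the byte format belongs to Julia itself. The Julia documentation has cautioned that serialized data may not be readable by a different Julia version, which is the concern raised above and the motivation for a text format such as JSON.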

roble · July 19, 2020, 6:13am

That’s good news
When thinking about the metadata format, my proposal is surely not fully thought out. You would surely want something lightweight and fast, without a big dependency.
As far as I know, your package is already a good choice for storing numerical or ML data.

Tamas_Papp · July 19, 2020, 6:48am

Most packages are developed by volunteers, and come with no guarantees. It is unclear who should be making this “real effort”, and how you plan to incentivize them.

Generally, if you really depend on a package for your work, it makes sense to become familiar with the internals so that you can fix things yourself. The best way to do this is to make PRs, benefit from reviews, and gradually become familiar with the codebase. Many package authors are willing to grant commit rights to the repo to people who invest this kind of effort, at which stage the problem is more or less solved.

This needs to be done proactively: when packages become abandoned, it is much more difficult to pick up the pieces because of bit rot and lack of interest.

tamasgal · July 19, 2020, 8:04am

I do not want to spread negativity due to my personal trauma (hello CERN ROOT) but I’d like to give you my two cents.

Currently, I see JLD2 as a data format which should be used for internal purposes (e.g. saving the state of a Julia programme) but not for shareable data, so my thoughts below focus more on extending its scope.

I am very reluctant to use a data format which is tied to a specific language/framework when it comes to shareable data. I might be heavily biased due to my awful prior experiences with high-energy physics data stored in the ROOT format, which a couple of years ago was more or less only readable (efficiently) using a huge C++ framework with an extremely questionable design – luckily there are alternatives nowadays. The most annoying thing was that experiments with very simple data layouts picked up the ROOT format (carried over from people working at CERN), which introduced a “vendor lock-in” for no good reason, just to store data which could have been stored in a simple table-like structure in e.g. CSV… carried over the whole lifetime of these experiments.

So in the case of JLD2, I fear that people might end up using it as a container for scientific data and others will be forced to use Julia to access the data (I assume Julia will be very successful in the near future, but still). I have not used JLD much myself (I had massive trouble opening files which I had saved years ago, so I ditched it), so I have little idea how it’s actually implemented, but I’d definitely welcome an implementation which is self-descriptive, meaning that all the information needed to reconstruct the data structures is encoded inside the HDF5 format structure.
As others pointed out, ROOT has extraordinary performance in reading and writing very complex data and it’s also self-descriptive, however its implementation is so complex (and not documented) that it took many years until people volunteered to implement I/O libraries in other languages. Also, writing ROOT files is still not possible (well, there are some basic write features in some libraries but it’s extremely limited).

On the other hand, I am sure that JLD2 can be quite successful if it turns out to be something which is usable in other languages too, as a more general purpose binary format, to e.g. have an alternative to ROOT or protobuf or whatever. HDF5 offers a nice foundation but it’s still a pain to deal with complex (nested, ragged) data structures. I tend to follow the commonly known advice: “do not develop your own binary format”, so building upon HDF5 is in my opinion a good idea. E.g. the general I/O, compression and chunking etc. come for free. It would be awesome if there was a nice concept of storing complex data in a performant way using the HDF5 concepts, along with at least one other I/O library in another language (like Python).

Do you think that JLD2 would fill such a gap – a general purpose binary format for scientific data – and is it worth going in this direction, or is the plan to keep it tied to Julia?

JonasIsensee · July 19, 2020, 10:47am

Thank you @tamasgal,
reporting bad experiences is valuable to me, so we can try to stay away from those patterns.

I agree with you that, right now, JLD2’s purpose is internal / short-to-medium-term storage, and I would not recommend publishing data in that format.

I agree, but in my opinion this is largely a question of tooling rather than a problem with the library/spec itself. On the contrary, I think with the HDF5.Group structure you can quite easily represent nested structures.

About JLD2:
JLD2 implements a subset of the HDF5 spec but with additional julia-specific features for encoding type information.
Nonetheless, JLD2 files are already valid HDF5 files.

julia> using JLD2, HDF5

julia> numbers = rand(5)
5-element Array{Float64,1}:
 0.35527512738045885
 0.31367145115382966
 0.49535895899812266
 0.4038095087610356
 0.4229428296143374

julia> hello = "world"
"world"

julia> @save "test.jld2" numbers hello

julia> f = h5open("test.jld2", "r")
HDF5 data file: test.jld2

julia> names(f)
2-element Array{String,1}:
 "hello"
 "numbers"

julia> read(f, "hello")
5-element Array{String,1}:
 "world"
 "\x7f"
 "\xbd\x03"
 ""
 ""

julia> read(f, "numbers")
5-element Array{Float64,1}:
 0.35527512738045885
 0.31367145115382966
 0.49535895899812266
 0.4038095087610356
 0.4229428296143374

~ h5dump test.jld2
HDF5 "test.jld2" {
GROUP "/" {
   DATASET "hello" {
      DATATYPE  H5T_STRING {
         STRSIZE 5;
         STRPAD H5T_STR_NULLPAD;
         CSET H5T_CSET_UTF8;
         CTYPE H5T_C_S1;
      }
      DATASPACE  SIMPLE { ( 5 ) / ( 5 ) }
      DATA {
      (0): "world", "\000\000\000\000\000", "\000\000\000\000\000",
      (3): "\000\000\000\000\000", "\000\000\000\000A"
      }
   }
   DATASET "numbers" {
      DATATYPE  H5T_IEEE_F64LE
      DATASPACE  SIMPLE { ( 5 ) / ( 5 ) }
      DATA {
      (0): 0.355275, 0.313671, 0.495359, 0.40381, 0.422943
      }
   }
}
}

Of course, as you can see with the string, at the moment you can retrieve the data but it is not fully straightforward.
When Julia structs get involved it gets a bit more complicated.

julia> struct S; x::Int; y::Float64; z::String; end

julia> s = S(1, 2.0, "3")
S(1, 2.0, "3")

julia> @save "test2.jld2" s

~ h5dump test2.jld2
HDF5 "test2.jld2" {
GROUP "/" {
   GROUP "_types" {
      DATATYPE "00000001" H5T_COMPOUND {
         H5T_STRING {
            STRSIZE H5T_VARIABLE;
            STRPAD H5T_STR_NULLPAD;
            CSET H5T_CSET_UTF8;
            CTYPE H5T_C_S1;
         } "name";
         H5T_VLEN { H5T_REFERENCE { H5T_STD_REF_OBJECT }} "parameters";
      }
         ATTRIBUTE "julia_type" {
            DATATYPE  "/_types/00000001"
            DATASPACE  SCALAR
            DATA {
            (0): {
                  "Core.DataType",
                  ()
               }
            }
         }
      DATATYPE "00000002" H5T_COMPOUND {
         H5T_STD_I64LE "x";
         H5T_IEEE_F64LE "y";
         H5T_STRING {
            STRSIZE H5T_VARIABLE;
            STRPAD H5T_STR_NULLPAD;
            CSET H5T_CSET_UTF8;
            CTYPE H5T_C_S1;
         } "z";
      }
         ATTRIBUTE "julia_type" {
            DATATYPE  "/_types/00000001"
            DATASPACE  SCALAR
            DATA {
            (0): {
                  "Main.S",
                  ()
               }
            }
         }
   }
   DATASET "s" {
      DATATYPE  "/_types/00000002"
      DATASPACE  SCALAR
      DATA {
      (0): {
            1,
            2,
            "3"
         }
      }
   }
}
}

In a way this stuff is still self-descriptive. At least JLD2 can typically reconstruct types that are not defined in the current session but in a different language this could be a major undertaking.

So my dream for the future would be to

1) Improve JLD2 for Julia-internal usage (there are plenty of outstanding issues)
2) Build and combine existing tooling to make it easy to produce HDF5-compatible structures for long-term storage and publication (e.g. unrolling nested structures into groups)

Long term dream
3) Provide a more fully featured HDF5 implementation to also be able to read HDF5 files not produced with JLD2
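
To make "unrolling nested structures into groups" concrete, here is a minimal sketch in plain Julia. It is not JLD2 or HDF5.jl code: the types `Inner` and `Outer` and the functions `isleaf`, `unroll!` and `unroll` are hypothetical names, and the result is only a dictionary of group-style paths that some writer could then turn into HDF5 groups and datasets.

```julia
# Hypothetical example types; not part of JLD2 or HDF5.jl.
struct Inner
    a::Int
    b::Vector{Float64}
end

struct Outer
    name::String
    inner::Inner
end

# Leaves are values assumed to map directly onto common HDF5 datatypes.
isleaf(x) = x isa Union{Number, AbstractString, AbstractArray{<:Number}}

# Walk the fields recursively, recording one "/"-separated path per leaf.
function unroll!(out::Dict{String,Any}, x, prefix::String)
    if isleaf(x)
        out[prefix] = x
    else
        fieldcount(typeof(x)) == 0 && error("cannot unroll value at path '$prefix'")
        for field in fieldnames(typeof(x))
            unroll!(out, getfield(x, field), string(prefix, "/", field))
        end
    end
    return out
end

unroll(x) = unroll!(Dict{String,Any}(), x, "")

flat = unroll(Outer("run1", Inner(1, [1.0, 2.0])))

# Print the paths in a stable order.
for path in sort(collect(keys(flat)))
    println(path, " => ", flat[path])
end
```

Each struct becomes a group and each leaf field becomes a dataset name, so a reader in another language only needs the path names, not the Julia type definitions. The Julia type information is deliberately dropped in this sketch.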

xiaodai · July 19, 2020, 10:56am

I wrote the Parquet.jl writer and I learnt a bit about Thrift and other formats. And I gotta say, I am not a fan. Simple JSON/BSON is more than enough for data format. Being easy to read is key!

xiaodai · July 19, 2020, 10:58am

Personally, I would consider consulting if ppl REALLY need it. But mostly, ppl just want to use it for free. If a company needs it, they should consider paying for the maintainer’s time.

Tamas_Papp · July 19, 2020, 2:11pm

Technically yes, since HDF5 is so versatile. But in practice, it’s not much use: reconstructing the data in any language other than Julia would require manually mapping types anyway.

Practically, for ephemeral storage we have Serialization, while long-term storage will always require some kind of mapping to types of an external data format (HDF5, JSON, etc).

It may seem that formats like JLD2 target some intermediate use case: not quite ephemeral, but not long term; somewhat specific to Julia, but at the same time allowing extraction of data in other languages with a bit of work.

Yet I am skeptical about this since data storage can easily turn “long term” without explicit intent, and then recovery will be possible, but require manual work again. I realize that it would be nice to just be able to save and load data without a hassle, but I think it is worth putting in the extra effort to specify explicit externalization conventions and use existing, common formats. Just my 2 cents.

lewis · July 19, 2020, 2:12pm

@Tamas_Papp

You are a great contributor to Julia and many packages.

But, I sort of disagree that anyone relying on a package should be able to maintain it. If Julia wants to be popular, then many of its users will be domain experts (epidemiology, meteorology, biochemistry, economics, etc.) who can code to a pragmatic degree but lack formal CS training in data structures and algorithms.

It seems like some packages (for some value of “some”) are broadly useful and are challenging enough to implement well that Julia Computing or the Julia board need to make sure that such packages are maintained. Serialization/de-serialization seems like a reasonable candidate.

While a “perfect” match to any/all Julia data structures is very appealing, cross-platform/cross-language compatibility seems even more appealing. So, some requirements may need to be relaxed. The challenge seems to be around arbitrary Julia types. In some cases a type may be a convenience renaming of a composite type that has a reasonable cross-language representation (structs, dictionaries, most arrays). It’s just hard to tell when serialization will run up against an exception as some big data blob is being parsed and converted. Handling types whose value encodings aren’t consistent across languages (missing, bigint, ??) gets tricky too. It seems like only a pure Julia serialization/de-serialization can “do it all,” as in every conceivable Julia type.

Probably the two cases (all Julia types vs. cross-language types) should be separated or possibly the pure Julia thing degrades to cross-platform/cross-language with a parameter choice and replaces the “impossibles” with something that can be identified on the receiving end with a message that the replacement happened. At some point it does become the user’s responsibility to limit data content being serialized to stuff that has a cross-language representation if that’s what said user needs. Everything can’t be auto-magically converted.
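
The "degrade and flag the impossibles" idea can be sketched in a few lines of plain Julia. Everything here is an assumption made for illustration: the function name `portable`, the `"__unsupported__"` marker key, and the choice of which types count as portable are not taken from any existing package.

```julia
# Types assumed (for this sketch) to have an obvious cross-language representation.
const PortableScalar = Union{Bool, Int32, Int64, Float32, Float64, AbstractString}

# Convert `x` into nested Dict/Vector/scalar data. Anything else is replaced by a
# marker, and its path is pushed onto `replaced` so the caller can report it.
function portable(x, replaced::Vector{String}, path::String="")
    if x isa PortableScalar
        return x
    elseif x isa AbstractVector
        return Any[portable(v, replaced, string(path, "[", i, "]")) for (i, v) in enumerate(x)]
    elseif x isa AbstractDict
        return Dict{String,Any}(string(k) => portable(v, replaced, string(path, "/", k)) for (k, v) in x)
    else
        push!(replaced, path)
        return Dict{String,Any}("__unsupported__" => string(typeof(x)))
    end
end

replaced = String[]
data = Dict("n" => 3, "big" => big(2)^100, "obs" => [1.0, missing])
out = portable(data, replaced)

# Tell the user which values were swapped out.
for path in sort(replaced)
    println("replaced value at ", path)
end
```

The receiving side sees ordinary dictionaries, arrays and scalars, plus an identifiable marker wherever a value such as a `BigInt` or `missing` had no agreed encoding. Whether to replace, skip, or raise an error for such values is a policy choice that this sketch leaves to a single branch.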

It really sounds like what’s missing here is the pure Julia case, though perhaps JLD is ok for that. And it may not be fair to expect the pure Julia thing to also be the cross-language thing. It also seems that cross-language necessarily cannot represent all Julia objects, and there are existing tools that cover popular cross-language types. It seems like BSON is perhaps the most robust and needs to have its size limit raised a lot. For flat data tables, Julia is probably well served, with the concern being mostly about performance.

Seems like separating the two cases makes solving this more feasible. Once there is a pure Julia thing that people like, it is always possible to write a tiny function that deserializes the Julia stuff and re-serializes with one of Julia’s good cross-language serializers. Whether the user has deep CS training or not, she knows her data best and knows what her collaborators using other languages expect in the dataset she is providing to the group.

Tamas_Papp · July 19, 2020, 2:26pm

That wasn’t implied; you have other options. One can either trust that others will maintain it, or pay someone to do it, etc.

Software is not special in this respect: if you want to keep operating a car or any other machine that can break down, you either do the maintenance yourself, or pay someone to do it.

Free software is a bit special because you may be able to free-ride (meaning this in a good sense), but there are no guarantees — it may be great while it lasts, but can disappear from one moment to another.

Neither entity is special in the sense that they can just magic packages into being maintained. They have the same options as the other players, and eventually all of them cost money, which is available in finite amounts and has other uses which may have larger payoffs (from their perspective).

Possibly, which is why we already have

https://docs.julialang.org/en/v1/stdlib/Serialization/

Incidentally, you may be surprised how few people here have CS degrees, or any “formal” training in CS. A lot of Julia’s “infrastructure” packages are written and maintained by scientists who are trained in some other field, including physics, neuroscience, mathematics, biology, economics, etc. They just had a problem to solve and realized that no one else will do it for them.
