[ANN] JLD2 v0.2.0

Hi all!

I’ve put in some work to bring JLD2 back up to speed and I am happy to announce JLD2 v0.2.0!

This release covers a handful of changes / bugfixes:

  • You can now store Union types that have UnionAll fields e.g. Union{Int, Vector} (#206)

  • Previously, immutable structs that contained references to objects of the same type could not be stored. This is now possible (#196)

  • CI on AppVeyor and CodeCoverage work again (thanks in part to @lhupe)

  • New Magic Bytes that differ more strongly from JLD (#213)

  • No longer list FileIO as a dependency but rather use Requires to load the code on import (#217 @lhupe)

New Features

  • A new and improved saving macro syntax (#198)
# Option passing:

hello = "world"
@save "test.jld2" {compress=true} hello
@save "test.jld2" {compress=true, iotype=IOStream} hello

# Assignment syntax
@save "test.jld2" bye=hello
@save "test.jld2" hello randomnumber=rand(10)
  • Better error messages and handling (#225 @lhupe):
    We no longer try to open any given string as a path but instead check whether the string represents a valid file path first to give better error messages. Additionally, if the standard IO type MmapIO fails for some reason, we attempt to open the file with IOStream instead by default.

  • Inline Union Arrays (#221)
    Arrays with a Union eltype where the Union has two isbits member types are now stored inline in an interleaved fashion. This makes storing e.g. Vector{Union{Missing, Float64}} a lot more efficient AND allows for compression!
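
As a rough illustration of which element types fall under this rule, the sketch below asks Julia itself whether a Union is an "isbits union". It uses `Base.isbitsunion`, an unexported Base helper; whether JLD2 uses this exact predicate internally is an assumption, not something stated in this announcement.

```julia
# Check which Union element types Julia stores inline ("isbits unions").
# Base.isbitsunion is an unexported Base helper; that JLD2 applies the same
# test internally is an assumption made for illustration.
candidates = (
    Union{Missing, Float64},   # both members are isbits
    Union{Nothing, Int32},     # both members are isbits
    Union{Missing, String},    # String is not isbits
)

for T in candidates
    println(T, " => ", Base.isbitsunion(T))
end

# A vector with such an eltype keeps its values inline rather than as pointers.
u = Union{Missing, Float64}[1.0, missing, 3.0]
println(eltype(u), ", ", count(ismissing, u), " missing value(s)")
```

Element types for which the check returns false (such as the `String` case) would presumably not benefit from the inline storage described above.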

Here are some attempts at storing very boring, maximally compressible data.

Summary
julia> @time using JLD2
[ Info: Precompiling JLD2 [033835bb-8acc-5ee8-8aae-3f567f8a3819]
4.090665 seconds (2.15 M allocations: 114.999 MiB, 0.28% gc time)

julia> u = Union{Float64, Missing}[zeros(10^6);];

julia> @time @save "test.jld2" u
4.864063 seconds (12.90 M allocations: 650.482 MiB, 5.46% gc time)

julia> @time @save "test.jld2" u
0.063690 seconds (5.20 k allocations: 15.539 MiB)

julia> @time @save "testcompressed.jld2" {compress=true} u
0.117254 seconds (5.64 k allocations: 32.728 MiB, 5.53% gc time)

julia> using JLD

julia> u = Union{Float64, Missing}[zeros(10^6);];

julia> @time @save "test.jld" u
61.623914 seconds (18.44 M allocations: 724.206 MiB, 0.41% gc time)

julia> @time @save "test.jld" u
58.175550 seconds (16.00 M allocations: 602.762 MiB, 0.37% gc time)

julia> @time JLD.jldopen(f->f["u"]=u, "testcompressed.jld", "w", compress=true)
65.954995 seconds (23.46 M allocations: 973.688 MiB, 0.74% gc time)

julia> using BSON

julia> u = Union{Float64, Missing}[zeros(10^6);];

julia> @time BSON.@save "test.bson" u
1.511677 seconds (11.02 M allocations: 418.530 MiB, 26.47% gc time)

julia> @time using Serialization
0.000514 seconds (624 allocations: 40.344 KiB)

julia> @time serialize("juliaserializer", u)
0.488430 seconds (1.91 M allocations: 65.454 MiB, 1.83% gc time)

File sizes when storing Union{Float64, Missing}[zeros(10^6);]

  • Serialization: 8.6M (~0.4s)
  • BSON: 16M (~1.4s)
  • JLD: 300M / 300M (~58s)
  • JLD2: 8.6M / 23K (uncompressed / compressed) (~5s on the first call, ~0.1s on subsequent calls)

Sure, we didn’t actually store any interesting data here but still,
no one else seems to be able to compress isbits union arrays.

Now the same again but with really no data

File sizes when storing Union{Float64, Missing}[missing for i=1:10^6]

  • Serialization: 3.9M
  • BSON: 126M
  • JLD: 307M
  • JLD2: 8.6M / 14K
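
For a sense of where the uncompressed JLD2 size of roughly 8.6M might come from, here is a back-of-envelope estimate. It assumes (this is not taken from the JLD2 source) that each element costs the 8-byte Float64 payload plus one extra byte recording which Union member is present, and that the reported sizes are in MiB.

```julia
# Back-of-envelope size of 10^6 Union{Float64, Missing} elements.
# Assumption (not confirmed by the JLD2 source): 8 payload bytes plus
# one tag byte per element, regardless of whether the value is missing.
n = 10^6
payload_bytes = n * sizeof(Float64)   # 8 bytes per element
tag_bytes     = n * 1                 # assumed one byte per element
total_mib     = (payload_bytes + tag_bytes) / 2^20

println("estimated size: ", round(total_mib, digits = 1), " MiB")
```

Under that assumption the estimate does not depend on the values stored, which would be consistent with the uncompressed JLD2 file having the same size for all zeros and all missings.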

Surprisingly, the file size is much larger for BSON when the array consists of missings only.
Serialization outputs a smaller file though.
(An apology to the other libraries: I’m aware that this comparison is entirely unfair as it was specifically designed to highlight this particular feature. The applicability and advantage of this will vary and definitely be smaller in real world applications.)

Remarks on Compatibility

This release contains some breaking changes in the file format. However, care was taken that files written with older versions of JLD2 can still be read! If you find yourself unable to read older files, please report an issue.
In the same way, it is not unlikely that there will be more changes to the format in the future, but
I am hopeful that I won’t have to break the ability to read old files.

Best,
Jonas


I have to point out that Jonas is seriously underselling the work he has done here. In addition to the new features highlighted above he has attacked the issue backlog like a pitbull. He has addressed 78 different issues since July, some dating back three years.

So, speaking as someone who loves the convenience of JLD2 but was burned by data loss before Jonas came onboard, I just want to say thank you for adopting a promising but dying package and nurturing it back to life.


JLSO.jl not benchmarked

Wow. Nicely done.

Loading a DataFrame back now works nicely and is much faster than JDF.jl for small datasets. More tests are needed, but check this out:

using JLD2
using JDF
using DataFrames
using Random: randstring
using WeakRefStrings

df = DataFrame([collect(1:100) for i = 1:3000])

df[!, :int_missing] =
    rand([rand(rand([UInt, Int, Float64, Float32, Bool])), missing], nrow(df))

df[!, :missing] .= missing
df[!, :strs] = [randstring(8) for i = 1:nrow(df)]
df[!, :stringarray] = StringVector([randstring(8) for i = 1:nrow(df)])

df[!, :strs_missing] = [rand([missing, randstring(8)]) for i = 1:nrow(df)]
df[!, :stringarray_missing] =
    StringVector([rand([missing, randstring(8)]) for i = 1:nrow(df)])
df[!, :symbol_missing] = [rand([missing, Symbol(randstring(8))]) for i = 1:nrow(df)]
df[!, :char] = getindex.(df[!, :strs], 1)
df[!, :char_missing] = allowmissing(df[!, :char])
df[rand(1:nrow(df), 10), :char_missing] .= missing

@time JDF.save("a.jdf", df)

@save "plsdel.jld2" df

df = nothing

@time @load "plsdel.jld2" df

@time df2 = JDF.load("a.jdf")

isequal(df, df2)

As far as I understand, JLSO is not a serializer but wraps other serializers such as BSON or the builtin serializer to store data.
Therefore it did not make much sense to include it here.


Thanks a lot for this great work Jonas and Lukas. Especially the provided benchmarks look very impressive and quite appealing.

So appealing that I think I need to ask the following question(s), provided that my plan is to work with the FileIO interface (and thus special syntaxes offered by individual packages are not a bonus):

  • Is there a clear choice for a package for saving and loading binary data, including user-defined structs? Said differently, is there a “clearly better option”?
  • If not, is there a guide on how we can make the choice?

I’ve just gone through the READMEs of BSON.jl and JLD2.jl. BSON.jl seems to highlight that it can “save anything”, but I don’t see JLD2 saying that it cannot do the same. BSON.jl is lightweight, but looking at the Project.toml of JLD2, I’d say that it is pretty lightweight as well. And your post here shows huge file size benefits for JLD2.

Is there any consensus?


Hi @Datseris,

At this point JLD2 cannot yet save “anything”, but I hope to get it to that point in the future.

It can serialize custom structs but there are some limitations that I am aware of.
JLD2 cannot (yet) store & load closures and MethodInstances / anonymous functions.

It should be possible to implement this but it is not done yet.
If those are features that you need, e.g. because some of the structs you want to store have closure or function fields, then JLD2 is not the best choice right now.

Storing closures is not very easy:
Suppose you have a closure that returns multiple anonymous functions that share the closed-over variables. How do you disentangle that when serializing, and how do you make sure that after loading they still reference the same variable?
I don’t have an answer to that yet - neither in implementation nor what the best behaviour would be.
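
To make the problem concrete, here is a minimal sketch in plain Julia (no JLD2 involved) of two functions sharing one closed-over variable. The names `make_counter`, `increment`, `current` and `total` are made up for illustration.

```julia
# Two closures created by one call share the captured variable `total`.
function make_counter()
    total = 0
    increment() = (total += 1)   # mutates the captured variable
    current() = total            # reads the same captured variable
    return increment, current
end

inc, cur = make_counter()
inc()
inc()
println(cur())   # `cur` observes the updates made through `inc`

# A second call creates an independent pair with its own `total`.
inc2, cur2 = make_counter()
println(cur2())
```

A serializer that wrote `inc` and `cur` as two unrelated objects would have to restore this sharing on load; otherwise calling the loaded `inc` would presumably no longer change what the loaded `cur` returns.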

Thanks, that is helpful. I didn’t get:

so you mean something like

struct A{F}
    f::F
end
a = A(cos)

cannot be saved, even though cos is not an anonymous function?

It can be saved and loaded as long as the function is defined, but behaviour gets odd when that is not the case. (It really just stores the function name and not the methods; see the example below.)

julia> using JLD2

julia> struct A{F}; x::F; end

julia> f(x) = x
f (generic function with 1 method)

julia> a = A(f)
A{typeof(f)}(f)

julia> @save "test.jld2" a

-bash-4.2$ julia
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.4.1 (2020-04-14)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |

julia> using JLD2

julia> struct A{F}; x::F; end

julia> @load "test.jld2"
┌ Warning: type A{Main.#f} does not exist in workspace; reconstructing
└ @ JLD2 ~/.julia/dev/JLD2/src/data.jl:1156
1-element Array{Symbol,1}:
 :a

julia> a
JLD2.ReconstructedTypes.var"##A{Main.#f}#253"()

-bash-4.2$ julia
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.4.1 (2020-04-14)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |

julia> using JLD2

julia> struct A{F}; x::F; end

julia> f(x) = x^2
f (generic function with 1 method)

julia> @load "test.jld2"
1-element Array{Symbol,1}:
 :a

julia> a
A{typeof(f)}(f)

julia> a.x(3)
9