I have this gist:
Simply reading in a 30 MB file of keys => randn(500) with JLD2
is 40x slower than with Arrow.jl.
How? How can JLD2's generality cost a 40x difference in speed?
In the example I also test append speed: Arrow takes the same time regardless of file size, while the JLD2 runtime goes up slightly with file size. (I disabled compression.)
Am I missing something? Are there other alternatives even better than Arrow.jl?
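For readers without the gist, here is a minimal sketch of what an Arrow append can look like. It is not the benchmark code from the gist: the column names and values are made up, and it assumes that `Arrow.append` only works on data written in the IPC stream format (`file=false`).

```julia
using Arrow

path = joinpath(mktempdir(), "cache.arrow")

# Assumption: Arrow.append requires the IPC stream format, hence file=false here.
Arrow.write(path, (key = ["a", "b"], value = [1.0, 2.0]); file=false)

# Append one more batch with the same schema to the end of the stream.
Arrow.append(path, (key = ["c"], value = [3.0]))

combined = Arrow.Table(path)
n_rows = length(combined.key)
println("rows after append: ", n_rows)
```

Since the new batch goes at the end of the stream, this would be consistent with the append time not depending on how large the file already is.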
Arrow is optimized for speed for 2D table like data. JLD2 is completely generic for any kind of data. So this result is expected.
Do you want to re-phrase your question, for example to:
“Would it be possible to optimize the speed of JLD2 for 2D table like data?”
Yeah, I would be fine if we only lost 2-3x in speed, not 40x.
Also, in the example I store a Dict, not “2D table like data”. I convert it back and forth, and Arrow.jl is still 40x faster.
To me it seems like we are storing simple primitive types and JLD2.jl is somehow overcomplicating things.
I am starting to question whether there is a use case JLD2 is efficient for.
I use JLD2 for small data sets (usually structs of something) where the speed doesn’t matter because it is below 0.1s anyways. Very convenient and easy to use.
One aspect of JLD2 is that it uses an (oldish) generic data format, HDF5, which is at least readable by other languages. This might limit the room for optimization compared to the newer .arrow format.
I use JLD2 for small data sets …
Nice idea, going to work this way from now on, if no better idea comes up.
It’s a bit hard to play with your snippet, it’s quite large.
But also try BenchmarkTools.jl instead of @elapsed. I tried to change it but suddenly all tests were failing.
And don’t forget to interpolate the variables into the expression: @btime $x.^2 instead of @btime x.^2.
Something like this:
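using JLD2, BenchmarkTools

# create_test_data and create_update_data are the helpers from the gist.
function d()
    N = 10_000
    test_data = create_test_data(N)
    update_key, update_value = create_update_data()
    tmp_dir = mktempdir()
    println("\nTesting with N=$N entries")
    filename = joinpath(tmp_dir, "full_save_test.jld2")
    # Write every entry plus the update into a fresh, uncompressed file.
    function g(filename, update_key, update_value, test_data)
        jldopen(filename, "w", compress=false) do file
            for (k, v) in test_data
                file[k] = v
            end
            file[update_key] = update_value
        end
    end
    # Interpolate with $ so the benchmark receives the values directly.
    @btime $g($filename, $update_key, $update_value, $test_data)
end
d()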
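The interpolation advice can also be seen in isolation. This is only a minimal sketch; the array `x` and its size are made up for illustration:

```julia
using BenchmarkTools

x = rand(1_000)

# Without interpolation, x is accessed as a non-constant global inside the benchmark.
t_global = @belapsed x .^ 2

# With interpolation, the value of x is spliced into the benchmarked expression.
t_interp = @belapsed $x .^ 2

println("global: ", t_global, " s, interpolated: ", t_interp, " s")
```

Both variants return the minimum elapsed time in seconds, so the two numbers can be compared directly.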
I tried @belapsed instead of @elapsed. I would say it is not that practical for testing append: appending to the same file multiple times measures something other than what we want, and fixing that is quite hacky.
The results could be more accurate with @belapsed if we fixed every issue with it, but at most that would mean a 20-50% speedup, nowhere close to 40x.
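One way around the repeated-append problem might be to give every sample its own copy of the file. The sketch below assumes BenchmarkTools' `setup`, `teardown` and `evals` options and JLD2's `"a+"` open mode; the file contents, key names and the `append_one` helper are made up and are not taken from the gist.

```julia
using BenchmarkTools, JLD2

# Build a base file once; the key names and sizes are arbitrary.
base = joinpath(mktempdir(), "base.jld2")
jldopen(base, "w") do file
    for i in 1:200
        file["key$i"] = randn(500)
    end
end

# The operation under test: reopen the file and add a single entry.
function append_one(path)
    jldopen(path, "a+") do file
        file["new_key"] = randn(500)
    end
end

# setup runs before every sample and evals=1 makes each sample a single call,
# so every measured append starts from an untouched copy of the base file.
t_append = @belapsed append_one(path) setup=(path = tempname() * ".jld2"; cp($base, path)) teardown=(rm(path)) evals=1

println("append time: ", t_append, " s")
```

The copy itself happens in `setup`, so it should not count towards the reported time.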
Maintainer of JLD2 here.
Your timings are fine, in my eyes.
However:
In your example Arrow does NOT fully load your data. Arrow.Table(filename) |> DataFrame uses SentinelArrays to give you a view into the file that looks like a Vector{String}. That is fast but you would pay the allocation cost when actually wanting to work with the data. (A better benchmark should also include the conversion to Vector{String}.)
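A small sketch of the difference between the lazy column and a materialized one. The column content is made up, and an in-memory buffer stands in for the file:

```julia
using Arrow

# Made-up column of strings, written to an in-memory Arrow buffer.
tbl = (value = ["alpha", "beta", "gamma"],)
io = Arrow.tobuffer(tbl)

lazy = Arrow.Table(io).value    # array-like view backed by the Arrow buffer
eager = Vector{String}(lazy)    # allocates an ordinary Julia vector of Strings

println(typeof(lazy))
println(typeof(eager))
```

Only the second step pays for allocating the individual strings, which is the cost a read benchmark would need to include to be comparable.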
Your application only uses plain types and has a structure that appears just more suitable for Arrow.
JLD2 has a different feature set focusing on compound structures and nested hierarchical data. This leaves performance on the table when data is quite homogeneous (but not isbits) like yours. (Still, looking at the benchmarks, this particular use case could be sped up considerably)
Here’s an example where JLD2 is more powerful in my eyes.
Example from the Arrow docs:
using Intervals
table = (col = [
Interval{Closed,Unbounded}(1,nothing)
],)
JLD2: works out of the box
save("test.jld2", "intervalvec", table)
load("test.jld2", "intervalvec")
Arrow: here’s the suggested solution from their docs
const NAME = Symbol("JuliaLang.Interval")
ArrowTypes.arrowname(::Type{Interval{T, L, R}}) where {T, L, R} = NAME
const LOOKUP = Dict(
"Closed" => Closed,
"Unbounded" => Unbounded
)
ArrowTypes.arrowmetadata(::Type{Interval{T, L, R}}) where {T, L, R} = string(L, ".", R)
function ArrowTypes.JuliaType(::Val{NAME}, ::Type{NamedTuple{names, types}}, meta) where {names, types}
L, R = split(meta, ".")
return Interval{fieldtype(types, 1), LOOKUP[L], LOOKUP[R]}
end
ArrowTypes.fromarrow(::Type{Interval{T, L, R}}, first, last) where {T, L, R} = Interval{L, R}(first, R == Unbounded ? nothing : last)
io = Arrow.tobuffer(table)
tbl = Arrow.Table(io)
At the risk of sounding like a broken record: jld2 is intended for a slightly cross-platform serialization of Julia data structures, while Arrow.jl is intended for serializing data.
This is a big difference! Very often, you probably want to store data, not datastructures.
Especially if you want to share files with the public (or want to load public data!), jld2 is about as appropriate as the good old curl ... | /bin/sh pattern or the old-style “download and run this .exe to get the data” (cf. Security issue: Type confusion, convert called during deserialization · Issue #117 · JuliaIO/JLD2.jl · GitHub: loading a jld2 file executes arbitrary code with user permissions; this is currently a documented unavoidable feature and not a bug).
Really nice findings!
I fixed the Arrow.jl read timing to not use SentinelArrays:
t_arrow_read = @elapsed read_dict = begin
df = Arrow.Table(filename) |> DataFrame
OrderedDict(k => Vector{String}(v) for (k,v) in zip(df.key, df.value))
end
# checking the types.
@show typeof(read_dict), typeof(first(values(read_dict)))
Now only 5-8x faster read times, we are getting closer definitely!
I am going to check your example where you mean we could have JLD2 come out faster, I am super curious!
This is a good point.
Data always has some structure.
When it is meaningfully possible to represent it using tables then that’s the way to go. (Especially if you want to distribute it to the public…)
If it’s nested and small (and byte accuracy for floating point values is not needed) then you can use something like JSON.
JLD2 is useful when you want to have nested (or recursive) structures where you care about floating point accuracy and / or object identity preservation. (So structure is part of the information content.)
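To illustrate what object identity preservation means here, a minimal sketch with made-up data, where two fields refer to the same array:

```julia
using JLD2

shared = [1.0, 2.0, 3.0]
data = (a = shared, b = shared)   # both fields point at the same array

path = joinpath(mktempdir(), "identity.jld2")
jldsave(path; data)

loaded = jldopen(file -> file["data"], path, "r")

# If identity is preserved, both fields are again one and the same object.
same_object = loaded.a === loaded.b
println(same_object)
```

A plain-text format like JSON would write the array out twice, so after reading it back the two fields would be independent copies.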
JLD2 also has recently gained experimental plain modes that are safe but lack some features. (Every struct becomes a NamedTuple …)
(Here documentation on that is badly lacking - more to come)
If you’re interested, you may open an issue over on github.
Quick summary: I have spent time on improving the trade-off between run time & compile time for JLD2.
Since I did not consider a workload like yours, that probably made things worse for your type of application.
With the same poc:
julia> using JLD2, FileIO
julia> load("poc.jld2"; plain=true);
[51438] signal 11 (1): Segmentation fault
in expression starting at REPL[2]:1
unknown function (ip: 0x73cefa12cde4)
unknown function (ip: (nil))
Allocations: 1318825 (Pool: 1318741; Big: 84); GC: 2
Segmentation fault (core dumped)
That is for sure better than without plain (which simply pops a shell), but it ain’t good either.
Oh, that’s fun. Did I know about this? Hmm, that should be considered a bug, I think.
My usecase comes from storing the embeddings of ~100 julia package functions with their hash and their embedding vectors. (This can be a 100+ MB cachefile.)
Just to “memoize” the embedding calls, so I don’t accidentally pay again for the very same embeddings. Basically a stored Dict{String, Vector}(hash => (vector of 1024 sized Float32))
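A rough sketch of such a cache layer, just to make the shape concrete. `fake_embed` and `cached_embed!` are hypothetical names; the first one stands in for the paid embedding call, and SHA-256 is used as the hash only as an assumption about what a stable key could look like.

```julia
using JLD2, SHA

# Hypothetical stand-in for the real (paid) embedding API call.
fake_embed(text::String) = rand(Float32, 1024)

# Return the cached vector for `text`, computing it only on a cache miss.
function cached_embed!(cache::Dict{String,Vector{Float32}}, text::String)
    key = bytes2hex(sha256(text))   # stable across sessions, unlike Base.hash
    get!(cache, key) do
        fake_embed(text)
    end
end

cache = Dict{String,Vector{Float32}}()
v1 = cached_embed!(cache, "sum(xs)")
v2 = cached_embed!(cache, "sum(xs)")   # second call is served from the cache

# Persist the whole Dict and read it back.
path = joinpath(mktempdir(), "embeddings.jld2")
jldsave(path; cache)
restored = jldopen(file -> file["cache"], path, "r")
println(length(restored), " cached embedding(s)")
```

Saving the whole Dict on every update is the simple variant; the append question in this thread is about avoiding exactly that rewrite.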
I plan to do a PR with this cache layer to PromptingTools.jl so hopefully embeddings will be simpler from julia.
Consider testing again with your target types.
Strings are very different from floats. Every instance has a variable size and JLD2 can’t store them inline. The length of the lowest-level objects is also important.
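The difference between the two element types can be checked directly in Base. This is only an illustration of the fixed-size versus variable-size point, not of JLD2 internals:

```julia
# Float32 is a fixed-size bits type, so a Vector{Float32} is one contiguous block.
float_is_bits = isbitstype(Float32)    # true
float_size = sizeof(Float32)           # 4 bytes for every value

# String is not a bits type and every instance can have a different size.
string_is_bits = isbitstype(String)    # false
short_size = sizeof("a")               # 1 byte
long_size = sizeof("embedding")        # 9 bytes

println((float_is_bits, float_size, string_is_bits, short_size, long_size))
```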
However, your application seems to indeed be limited to rather flat / tabular data.
In that case Arrow may just be the better option.
Let’s just say that yesterday I tried to rewrite multiple places from JLD2 to Arrow, and WOW, how many conveniences JLD2 gives compared to Arrow… now I feel like switching away was a mistake in many cases.
I wonder if the append operation in jld2 could be more efficient in any way, so it more closely matches Arrow append performance.