Why JLD2.jl is 40x slower than Arrow.jl

Sixzero · November 5, 2024, 7:57am

Just checked HDF5.jl for comparison.

The results compared to JLD2:

HDF5 appends 2-5x faster when files are big, but 2-5x slower when files are smaller.
A full read with HDF5 is 10x slower. This is quite bad.
The file size is 2x what JLD2 generates.

JonasIsensee · November 5, 2024, 8:06am

FYI: see the "Advanced Usage" page of the JLD2 (Julia Data Format) documentation
(but it probably won’t help much here)

JLD2 needs to parse some of a file's contents before it can append anything. So if you first read and then append, you can save time by keeping the file open between the two steps. (Open it in "r+" mode right away.)

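using JLD2

# fn is the path to an existing JLD2 file
f = jldopen(fn, "r+")  # "r+" opens for both reading and writing
try
    read_data = JLD2.loadnesteddict(f)  # read the existing contents

    # append here while the file is still open
finally
    close(f)  # always close the file, even if reading or appending throws
end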
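
To make the "# append" step concrete, here is a minimal sketch of reading and then appending through one open handle. The file path and the dataset names (first_entry, second_entry) are made up for illustration, and it assumes that appending means writing a dataset under a name that is not in the file yet.

```julia
using JLD2

fn = joinpath(mktempdir(), "cache.jld2")  # hypothetical cache file

# Create a small file so there is something to read and append to.
jldsave(fn; first_entry = [1, 2, 3])

f = jldopen(fn, "r+")
try
    existing = keys(f)                 # names already stored in the file
    if !("second_entry" in existing)
        f["second_entry"] = [4, 5, 6]  # append under a new name
    end
finally
    close(f)
end

# Reopen read-only to check what ended up in the file.
stored_keys = jldopen(fn, "r") do g
    sort(collect(keys(g)))
end
```

Checking the existing names first matters because writing to a name that is already present raises an error instead of overwriting.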

JonasIsensee · November 5, 2024, 8:08am

Also, in your use case, JLD2 and HDF5 should be compatible, e.g. you can write files with JLD2 and read (edit?) them with HDF5. (The reverse, writing with HDF5 and reading with JLD2, is also possible, but editing is not.)
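
A rough sketch of the JLD2-to-HDF5 direction, assuming both packages are installed and the data is a plain numeric array (the simplest case). The file name and the dataset name "values" are hypothetical; I would not assume that custom Julia structs read back as cleanly.

```julia
using JLD2, HDF5

fn = joinpath(mktempdir(), "compat.jld2")  # hypothetical file name

# Write a plain numeric array with JLD2.
jldsave(fn; values = [1.0, 2.0, 3.0])

# Read the same dataset back through HDF5.jl.
values_from_hdf5 = h5open(fn, "r") do h
    read(h["values"])
end
```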

Sixzero · November 5, 2024, 9:32am

No, I am fine with the narrowest compatibility. Caches can be regenerated anywhere, anytime; it just needs to be efficient.

bernhard · November 25, 2024, 9:59am

This post made me curious, so I ran some tests on my rather plain data (all DataFrames).

The numbers > 1.0 in the three rightmost columns are all in favour of Arrow.jl in my case.
This was measured using @elapsed. File size is as reported by stat, so the largest JLD2 file is 3.7 GB.
I note that the identifier column is intentionally partially hidden.
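
For anyone who wants to repeat this kind of measurement, a minimal sketch could look like the following. The DataFrame is made-up stand-in data, not the data from this post, and the ratios are assumed to be JLD2 divided by Arrow, so that values above 1.0 favour Arrow.jl.

```julia
using DataFrames, Arrow, JLD2

# Hypothetical stand-in data; the real tables are not shown in this thread.
df = DataFrame(id = 1:100_000, value = rand(100_000))

dir = mktempdir()
arrow_path = joinpath(dir, "data.arrow")
jld2_path  = joinpath(dir, "data.jld2")

# Write timings in seconds. The first call of each function includes
# compilation time, so repeat the measurement for a fairer comparison.
t_write_arrow = @elapsed Arrow.write(arrow_path, df)
t_write_jld2  = @elapsed jldsave(jld2_path; df)

# Read timings: load each file back into a DataFrame.
t_read_arrow = @elapsed df_arrow = DataFrame(Arrow.Table(arrow_path))
t_read_jld2  = @elapsed df_jld2 = jldopen(f -> f["df"], jld2_path, "r")

# Ratios JLD2 / Arrow: values above 1.0 favour Arrow.jl (assumed convention).
write_ratio = t_write_jld2 / t_write_arrow
read_ratio  = t_read_jld2 / t_read_arrow
size_ratio  = stat(jld2_path).size / stat(arrow_path).size

println((; write_ratio, read_ratio, size_ratio))
```

For small tables, @elapsed on a single call is noisy. A benchmarking package would give steadier numbers, at the cost of rewriting the files many times.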
