JLD takes too long reading names from a file

JLD seems to take an awfully long time just reading the names present in a JLD file (not reading values, just names).

julia> outfile = jldopen("/tmp/blah.jl", "w")
Julia data file version 0.1.1: /tmp/blah.jl

julia> for i in 1:10000
   write(outfile, "$(rand())", rand())
   end
julia> names(outfile) #compile (though it doesn't seem to make any difference)
julia> @time names(outfile)
44.540928 seconds (40.15 k allocations: 1.688 MB)

When the file contains ~75,000 entries, this names() call takes several hours.

By contrast,

  time h5ls /tmp/blah.jl

takes only

 real	0m0.284s
 user	0m0.236s
 sys	0m0.030s
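
The two data points above (about 44 s for 10,000 entries, hours for ~75,000) suggest the cost grows faster than linearly. A minimal sketch for checking that on smaller files, assuming the same JLD version as above, where names() works on an open file; time_names is a made-up helper name, not part of JLD:

```julia
using JLD

# Hypothetical helper: write n scalar entries to a fresh file and
# return the seconds spent in a single names() call on it.
function time_names(n)
    path = tempname()
    file = jldopen(path, "w")
    for i in 1:n
        write(file, "entry$i", rand())
    end
    t = @elapsed names(file)
    close(file)
    rm(path)
    return t
end

time_names(10)   # warm-up on a tiny file so compilation is not timed

for n in (1000, 2000, 4000)
    println(n, " entries: ", time_names(n), " s")
end
```

If doubling n roughly quadruples the time, the listing is quadratic in the number of entries.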

What is going on? How do I fix this? (I'm willing to mess around with the internals of JLD if necessary)

A quick investigation with @profile and ProfileView.view(C=true) shows that most of the time is spent within HDF5, especially in the H5Gget_objname_by_idx function (see https://github.com/JuliaIO/HDF5.jl/blob/0366bb050d8ded8dff2d8f148818151610bbb75b/src/HDF5.jl#L987 where the call originates).

Since profiling seems to indicate that the time is not spent within Julia, either h5ls is using a different API to access the information or HDF5.jl is using the API the wrong way.
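
As long as the listing itself is slow, one possible workaround is to keep your own index: collect the names while writing and store them as a single ordinary dataset, so that reading them back is one read() rather than a walk over the whole group. This is only a sketch, assuming a reasonably recent Julia and JLD; the "_index" dataset name is made up for illustration and is not something JLD maintains for you.

```julia
using JLD

path = tempname()
keys_written = String[]

jldopen(path, "w") do file
    for i in 1:100
        key = "entry$i"
        write(file, key, rand())
        push!(keys_written, key)   # remember each name as it is written
    end
    # Store the full list of names as one dataset (hypothetical name).
    write(file, "_index", keys_written)
end

# Later: recover the names with a single read instead of names(file).
index = jldopen(path, "r") do file
    read(file, "_index")
end
```

The obvious drawback is that the index has to be kept in sync by hand whenever entries are added to the file.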

FWIW:

using JLD2

Base.names(jld::JLD2.JLDFile) = keys(jld.datasets)

outfile = jldopen("/tmp/blah.jl", "w");
for i in 1:10000
    write(outfile, "$(rand())", rand())
end
close(outfile)
julia> @time infile = jldopen("/tmp/blah.jl", "r");
  0.000445 seconds (141 allocations: 11.354 KiB)

julia> @time names(infile);
  0.008764 seconds (3.38 k allocations: 169.777 KiB)

https://github.com/simonster/JLD2.jl
