Loading data sometimes very slow on HPC system

The only other reason JLD2 could be slow the first time and then fast is if it needed to compile a lot of specialized methods. Did you check the naive @time results to see whether the first run is dominated by compile time?
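Something like this would show it (the file name is just a placeholder); on recent Julia versions @time also prints the percentage of time spent on compilation:

using JLD2, FileIO

@time data = load("results.jld2")   # first call in a fresh session includes any method compilation
@time data = load("results.jld2")   # second call shows the pure I/O cost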

Okay, I tried to save the same data (a NamedTuple that includes vectors of vectors and DataFrames where each row is a few scalars and a vector) using HDF5 (with a Dict instead of a NamedTuple), which threw an error:

Fatal error:
ERROR: Type Array does not have a definite size.

So I assumed the error came from the Vector{Vector{T}} present in my data (that is what one of the DataFrame columns is). After converting each of these into a Matrix{T}, I saved everything again using JLD2 (now without any DataFrames), and it seems like that really was the culprit…
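For the record, a minimal sketch of the kind of conversion I did (column and file names are placeholders; the inner vectors must all have the same length):

using DataFrames, JLD2

df = DataFrame(id = 1:3, signal = [rand(100) for _ in 1:3])   # Vector{Vector{Float64}} column

# Collect the vector-of-vectors column into one 100x3 Matrix before saving:
signals = reduce(hcat, df.signal)
jldsave("converted.jld2"; id = df.id, signals = signals)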

So this is great: I now avoid vectors of vectors and can load the data quickly from any node, even on the first load. What I don’t understand is: why does JLD2 show this behavior? If I start a new Julia session on a node where the data was already loaded once in another Julia session, the load is fast. Compilation of specialized methods is therefore not the issue? Or is there some cache of these specialized methods that other nodes don’t have access to? During the slow load, the CPU is only at ~10%. It would be great to understand why this happens, and maybe put a warning somewhere in the documentation?

Great catch! An example:

using JLD2, FileIO, BenchmarkTools

const vecs = [rand(100) for _ in 1:1000];
const mat = stack(vecs); # 100x1000 matrix

jldsave("vecs.jld2", vecs=vecs)
jldsave("mat.jld2", mat=mat)

@btime load("vecs.jld2", "vecs"); # 2.931 ms (13559 allocations: 1.50 MiB)
@btime load("mat.jld2", "mat"); # 167.320 Îźs (144 allocations: 792.68 KiB)

Just an idea: HDF5 can only store plain n-dimensional arrays. So if you ask JLD2 to store a vector of vectors, it cannot just “dump” it into an HDF5 dataset, because it is not an n-dimensional array. Maybe it flattens the vector of vectors and stores it as a single vector along with the individual vector lengths? In any case, there must be some overhead.


Hi @fgerick,
speaking from experience, this is most definitely a problem arising from the network file system,
and JLD2 isn’t really at fault here. (If a new Julia session on the same node is quick afterwards, how could it be?)

To work around it you can try the following (a short sketch of both options follows this list):

  • using iotype=IOStream to stop relying on MmapIO, which doesn’t give any advantage on network file systems anyway.

  • copying the files onto your node-scratch or /tmp and then opening the local copy with JLD2
    (remember to clean up after yourself…).
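A minimal sketch of both workarounds, assuming a file called results.jld2 that contains a dataset named vecs (both names are placeholders):

using JLD2, FileIO

# (1) Read through a plain IOStream instead of memory-mapped I/O:
data = jldopen("results.jld2", "r"; iotype=IOStream) do f
    f["vecs"]
end

# (2) Copy the file to node-local storage first, then load the local copy:
localdir = mktempdir()                          # or your cluster's node-scratch directory
localfile = joinpath(localdir, "results.jld2")
cp("results.jld2", localfile)
data = load(localfile, "vecs")
rm(localdir; recursive = true)                  # clean up after yourself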

JLD2 / HDF5 stores Vector{Vector} structures as a vector of references to arrays stored elsewhere in the file (how else would you do it?).
Therefore your Matrix version requires loading just a single contiguous chunk of memory, while the vector-of-vectors version involves a lot of indirection (essentially random access) while loading.
Depending on your file system, this can be problematic.

Typically, network file systems try to be smart and transfer as little data as possible.
When you open the JLD2 file, the file system will only send you small chunks of the file at a time.
This is great if you want to read small bits of big files, but it behaves poorly if you want to load the whole file at once.
At worst, that means a new chunk is requested for every element of the vector of vectors.
When your data is a big matrix, JLD2 will directly request the whole thing and it is much faster.


Thank you for the explanation. I still don’t quite understand how and why the network file system remembers the access for some time on one node, but then again I don’t really understand how these file systems work in the first place. In any case, I will simply avoid these “chunked up” data structures from now on. Copying hundreds of GB each time does not seem like a sensible thing to do just to be able to use Vector{Vector{T}} types.