DataFrames and serialization

I had to send a DataFrame (ca. 7000x1500, which in my opinion is not even that large) to several worker processes (ca. 40), and the whole process was a pain:

@everywhere df = $df

took 3 minutes!

A colleague told me to write everything to disk and read it with the worker processes:

JLD.@save "df.jld" df

I had to interrupt this one because I lost patience.

BSON.@save "df.bson" df
@everywhere BSON.@load "df.bson" df

Saving finished but was very slow, and reading the file into memory on the worker processes caused such heavy swapping that I had to kill all the processes to keep the machine from freezing.

Finally I tried JLD2 and it saved me:

JLD2.@save "df.jld2" df
@everywhere JLD2.@load "df.jld2" df

took 13s!

So what is everyone else doing wrong? I have seen some older issues about serializing DataFrames, most of which were closed. And most importantly: how is it possible that this is faster over a (networked) filesystem than within the RAM of a single machine?
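One way to narrow this down might be to time the serialization step on its own, with no workers and no disk involved. The following is only a rough sketch: `roundtrip` is a hypothetical helper name, the random `Float64` table is a stand-in for my real data (whose column types may well differ and may be what actually matters), and it assumes DataFrames is installed. It uses the `Serialization` standard library, which as far as I understand is what Distributed relies on when moving values between processes.

```julia
using Serialization   # standard library
using DataFrames      # assumed to be installed

# Hypothetical helper: serialize `x` into an in-memory buffer,
# read it back, and report how long each step took.
function roundtrip(x)
    io = IOBuffer()

    t0 = time()
    serialize(io, x)
    t_ser = time() - t0

    bytes = take!(io)   # the raw serialized payload

    t0 = time()
    y = deserialize(IOBuffer(bytes))
    t_de = time() - t0

    return y, t_ser, t_de, length(bytes)
end

# Stand-in data: Float64 columns only, NOT the real DataFrame.
df = DataFrame(rand(7000, 1500), :auto)

df2, t_ser, t_de, nbytes = roundtrip(df)

println("serialize:   ", round(t_ser, digits=2), " s")
println("deserialize: ", round(t_de, digits=2), " s")
println("payload:     ", round(nbytes / 2^20, digits=1), " MiB")
```

If the in-memory round trip is already slow for the real data, the time is going into serialization itself rather than into the transfer. If it is fast, the cost is more likely in how the value gets shipped to each of the workers.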

The same issue appeared with the return values of a pmap.
