I had to send a DataFrame (ca. 7000x1500, in my opinion not even that large) to several worker processes (ca. 40), and the whole process was a pain:
@everywhere df = $df
took 3 minutes!
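For context, the setup looked roughly like this (a minimal sketch using current Distributed/DataFrames syntax, with random floats as a stand-in for my real data):

using Distributed
addprocs(40)                             # ca. 40 worker processes

@everywhere using DataFrames

df = DataFrame(rand(7000, 1500), :auto)  # stand-in for the real 7000x1500 data

# $df interpolates the master's DataFrame into the expression
# that is shipped to and evaluated on every worker
@time @everywhere df = $df               # -> ca. 3 minutes for me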
A colleague told me to write everything to disk and have the worker processes read it back:
JLD.@save "df.jld" df
I had to interrupt this because I lost patience.
BSON.@save "df.bson" df
@everywhere BSON.@load "df.bson" df
Saving finished but was very slow, and reading the file into memory on the worker processes caused such heavy swapping that I had to kill all the processes to keep the machine from freezing.
Finally I tried JLD2, and it saved me:
JLD2.@save "df.jld2" df
@everywhere JLD2.@load "df.jld2" df
took 13s!
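In case someone wants to reproduce the comparison, the round trip can be timed roughly like this (a sketch; it assumes every worker sees the same path on the (networked) filesystem):

using Distributed, JLD2
@everywhere using JLD2

# df defined on the master as above:
# save once, then load in parallel on all workers
t = @elapsed begin
    JLD2.@save "df.jld2" df
    @everywhere JLD2.@load "df.jld2" df
end
println("JLD2 round trip took $(t)s")    # ca. 13s for me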
So what is everyone else doing wrong? I have seen some older issues about serializing DataFrames, most of which were closed. And most importantly: how is it possible that this is faster over a (networked) filesystem than within the RAM of a single machine?
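To narrow down where the time goes, one could time Julia's own serializer in isolation, with no network or disk involved (an untested sketch):

using Serialization, DataFrames

# df as above
buf = IOBuffer()
@time serialize(buf, df)        # cost of turning the DataFrame into bytes
seekstart(buf)
@time df2 = deserialize(buf)    # cost of reconstructing it

If serialize/deserialize alone are already slow for a DataFrame of this shape, that would at least suggest why an in-memory transfer to 40 workers can lose to one JLD2 write plus parallel reads.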
The same issue appeared with the return values of a pmap…
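i.e. a pattern roughly like this (a hypothetical sketch, not my actual job), where every task builds a DataFrame that then has to be serialized back to the master:

using Distributed
@everywhere using DataFrames

# each task's DataFrame return value travels back to the master
# through the same serialization machinery
results = pmap(1:40) do i
    DataFrame(rand(7000, 40), :auto)    # hypothetical per-task result
end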