I had to send a DataFrame (ca. 7000x1500, in my opinion not even that large) to several worker processes (ca. 40), and the whole process was a pain:
@everywhere df = $df
took 3 minutes!
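For context, the setup looked roughly like this (a minimal sketch using current Distributed/DataFrames syntax, with random floats as a stand-in for my real data):

using Distributed
addprocs(40)                             # ca. 40 worker processes

@everywhere using DataFrames

df = DataFrame(rand(7000, 1500), :auto)  # stand-in for the real 7000x1500 data

# $df interpolates the master's DataFrame into the expression
# that is shipped to and evaluated on every worker
@time @everywhere df = $df               # -> ca. 3 minutes for me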
A colleague told me to write everything to disk and have the worker processes read it back:
JLD.@save "df.jld" df
I had to interrupt this because I lost patience.
BSON.@save "df.bson" df
@everywhere BSON.@load "df.bson" df
Saving finished but was very slow, and reading the file into memory on the worker processes caused such heavy swapping that I had to kill all the processes to keep the machine from freezing.
Finally I tried JLD2, and it saved me:
JLD2.@save "df.jld2" df
@everywhere JLD2.@load "df.jld2" df
took 13s!
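In case someone wants to reproduce the comparison, the round trip can be timed roughly like this (a sketch; it assumes every worker sees the same path on the (networked) filesystem):

using Distributed, JLD2
@everywhere using JLD2

# df defined on the master as above:
# save once, then load in parallel on all workers
t = @elapsed begin
    JLD2.@save "df.jld2" df
    @everywhere JLD2.@load "df.jld2" df
end
println("JLD2 round trip took $(t)s")    # ca. 13s for me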
So what is everyone else doing wrong? I have seen some older issues about serializing DataFrames, most of which were closed. And most importantly: how is it possible that this is faster over a (networked) filesystem than within the RAM of a single machine?
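To narrow down where the time goes, one could time Julia's own serializer in isolation, with no network or disk involved (an untested sketch):

using Serialization, DataFrames

# df as above
buf = IOBuffer()
@time serialize(buf, df)        # cost of turning the DataFrame into bytes
seekstart(buf)
@time df2 = deserialize(buf)    # cost of reconstructing it

If serialize/deserialize alone are already slow for a DataFrame of this shape, that would at least suggest why an in-memory transfer to 40 workers can lose to one JLD2 write plus parallel reads.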
The same issue appeared with the return values of a pmap…
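i.e. a pattern roughly like this (a hypothetical sketch, not my actual job), where every task builds a DataFrame that then has to be serialized back to the master:

using Distributed
@everywhere using DataFrames

# each task's DataFrame return value travels back to the master
# through the same serialization machinery
results = pmap(1:40) do i
    DataFrame(rand(7000, 40), :auto)    # hypothetical per-task result
end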