The state of DataFrames.jl H2O benchmark

Just to elaborate a little bit on this: one of my current projects involves wrangling a number of large-ish in-memory tables, each with between 20 and 60 million rows and between 5 and 25 columns, most of which are string columns.

The analysis requires quite a lot of join operations as well as groupbys on subsets of those tables. When I originally wrote the code, it took over an hour to run from top to bottom, and I had to write out intermediate results multiple times so that I could restart the Julia process and free up memory.

@oxinabox then told me to use ShortString types for my string data and to convert the columns into PooledArrays. The results were nothing short of magical: processing times on almost all operations went down by between 50 and 90%, and I can now run the whole analysis in about 15 minutes, without having to do any restarts.
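For anyone wanting to try this, here is a minimal sketch of the conversion on a toy table. The table, the column names and the choice of `ShortString15` are made up for illustration and are not my actual code; `ShortString15` only fits strings of up to 15 bytes, so check your data first and pick a larger type (or leave the column as `String`) if it does not fit.

```julia
using DataFrames, PooledArrays, ShortStrings

# Hypothetical toy table standing in for the real data.
df = DataFrame(id = 1:6,
               city = ["Berlin", "Paris", "Berlin", "Rome", "Paris", "Berlin"])

# ShortString15 stores up to 15 bytes inline in a fixed-size value,
# so make sure every string in the column fits before converting.
@assert all(ncodeunits.(df.city) .<= 15)

# Convert each value to a ShortString15, then pool the column so that
# each distinct value is stored once and rows hold small integer references.
df.city = PooledArray(ShortString15.(df.city))

# Joins and groupbys are written exactly as before.
counts = combine(groupby(df, :city), nrow => :n)
```

How much this helps will depend on the data: pooling pays off most when a column has few distinct values relative to its length.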

As Bogumil mentions, there is ongoing work and discussion around how to improve GC issues in the data ecosystem, and potentially more widely, but in the meantime I’d highly recommend trying out ShortStrings and PooledArrays for string-heavy DataFrame workflows!
