The state of DataFrames.jl H2O benchmark

nilshg · May 17, 2021, 11:53am

Just to elaborate a little bit on this: one of my current projects involves wrangling a number of large-ish in-memory tables, with between 20 and 60 million rows and between 5 and 25 columns, most of which are string columns.

The analysis requires quite a lot of join operations as well as groupbys on subsets of those tables. When I originally wrote the code, it took over an hour to run from top to bottom, and I had to write out intermediate results multiple times so that I could restart the Julia process and free up memory.

@oxinabox then told me to use ShortString types for my string data and to convert the columns into PooledArrays. The results were nothing short of magical: processing times on almost all operations went down by between 50 and 90%, and overall I can now run the analysis in about 15 minutes, without having to do any restarts.
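For anyone wanting to try this, here is a minimal sketch of the conversion, not my actual code. The table, the column names, and the choice of `ShortString15` (strings of up to 15 bytes) are made up for illustration, and how much you gain will depend on your data; pooling in particular pays off mostly when a column has few distinct values.

```julia
using DataFrames, PooledArrays, ShortStrings

# Hypothetical toy table; the real tables had tens of millions of rows.
df = DataFrame(
    city  = rand(["Berlin", "London", "Paris", "Madrid"], 1_000),
    value = rand(1_000),
)

# Step 1: store each string inline as a fixed-size value (up to 15 bytes here)
# instead of a separately heap-allocated String.
short = ShortString15.(df.city)

# Step 2: pool the column, so each distinct value is stored once and the rows
# hold small integer references into that pool.
df.city = PooledArray(short)

# Grouping (and joining) then works on the converted column as usual.
counts = combine(groupby(df, :city), nrow => :count)
```

Pick the ShortString size from the longest string in the column; values that do not fit the chosen type cannot be converted, so check the maximum length first.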

As Bogumil mentions, there is ongoing work, as well as discussion, on how to improve GC issues in the data ecosystem and potentially more widely. In the meantime, I’d highly recommend trying out ShortStrings and PooledArrays for string-heavy DataFrame workflows!
