JuliaDB Benchmarks

joshday · February 12, 2019, 4:09pm

I’ve started gathering some reproducible benchmarks in the JuliaData/JuliaDB_Benchmarks repository on GitHub (huge shout-out to https://github.com/oxinabox/DataDeps.jl for simplifying the reproducible part).

I would happily accept PRs of more examples! If you even just have a pointer to an interesting dataset you think JuliaDB would work well for, I’d love to hear about it!

anon67531922 · February 12, 2019, 4:16pm

It is really great to have this, but it definitely highlights some room for improvement.

nalimilan · February 12, 2019, 4:25pm

It would be interesting to add JuliaDB to the Database-like ops benchmark.

piever · February 12, 2019, 8:01pm

I’m actually not sure. It may well be that groupby in pandas special-cases mean and sum to use online algorithms, i.e. it does not extract the vector corresponding to each group but just computes the sum while iterating through the data (or so I understood from the discussion comparing performance with DataFrames). In that case the correct performance comparison would be with groupreduce, which indeed performs quite well. I’d be curious to see what happens with a custom user-defined “reducing” function in groupby, where this optimization is no longer available.
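To make the distinction concrete, here is a minimal sketch in plain Python (standard library only) of the two strategies. It is not how pandas or JuliaDB are implemented; the function names `groupby_materialize` and `groupreduce_online`, the `median` helper, and the sample data are all hypothetical and only illustrate the idea of building group vectors versus folding values into per-group accumulators.

```python
from collections import defaultdict


def groupby_materialize(keys, values, f):
    """Collect each group's values into a list, then apply f to that list."""
    groups = defaultdict(list)
    for k, v in zip(keys, values):
        groups[k].append(v)
    return {k: f(vs) for k, vs in groups.items()}


def groupreduce_online(keys, values, op, init):
    """Fold each value into a per-group accumulator in a single pass.

    No per-group vector is ever built, so memory use is one accumulator per key.
    """
    acc = {}
    for k, v in zip(keys, values):
        acc[k] = op(acc.get(k, init), v)
    return acc


def median(vs):
    """A function that needs the whole group at once (no simple binary fold)."""
    s = sorted(vs)
    n = len(s)
    mid = n // 2
    return s[mid] if n % 2 else (s[mid - 1] + s[mid]) / 2


# Hypothetical sample data.
keys = ["a", "b", "a", "c", "b", "a"]
values = [1, 2, 3, 4, 5, 6]

# Both strategies agree for a reduction such as sum.
sums_materialized = groupby_materialize(keys, values, sum)
sums_online = groupreduce_online(keys, values, lambda x, y: x + y, 0)

# The median cannot be written as a simple fold, so it takes the materializing path.
medians = groupby_materialize(keys, values, median)
```

For sum the two approaches give the same answer, so a benchmark that only uses sum or mean may mostly be measuring whether the library takes the single-pass path. A function like the median above forces the group vectors to be built, which is the case I would like to see timed.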

xiaodai · February 12, 2019, 8:11pm

The Fannie Mae data requires a login, and it’s large at almost 2 billion rows.
