JuliaDB Benchmarks

I’ve started gathering some reproducible benchmarks in the JuliaData/JuliaDB_Benchmarks repository on GitHub (huge shout-out to https://github.com/oxinabox/DataDeps.jl for simplifying the reproducible part).

I would happily accept PRs with more examples! Even if you just have a pointer to an interesting dataset you think JuliaDB would work well for, I’d love to hear about it!

It is really great to have this, but it definitely highlights some room for improvement :sweat_smile:

It would be interesting to add JuliaDB to the Database-like ops benchmark.

I’m actually not sure. It may well be that groupby in pandas special-cases mean and sum to use online algorithms, i.e. it does not extract the vector corresponding to each group but just computes the sum while iterating through the data (or so I understood from the discussion comparing performance with DataFrames). In that case the correct performance comparison would be with groupreduce, which indeed performs quite well. I’d be curious to see what happens with a custom user-defined “reducing” function in groupby, where this optimization is no longer available.
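To make the distinction concrete, here is a minimal pure-Python sketch of the two strategies. It is not JuliaDB or pandas code, and the names `group_then_apply` and `group_reduce` are made up for illustration. It only shows why an online reduction can skip building per-group vectors, while an arbitrary user-defined function cannot.

```python
from collections import defaultdict

def group_then_apply(keys, values, f):
    """Materialize each group's values as a list, then call f on that list."""
    groups = defaultdict(list)
    for k, v in zip(keys, values):
        groups[k].append(v)
    return {k: f(vs) for k, vs in groups.items()}

def group_reduce(keys, values, op):
    """Fold values into one running accumulator per key; no per-group list is built."""
    acc = {}
    for k, v in zip(keys, values):
        acc[k] = op(acc[k], v) if k in acc else v
    return acc

# Hypothetical toy data, for illustration only.
keys = ["a", "b", "a", "b", "a"]
values = [1, 2, 3, 4, 5]

# A sum can be computed either way; the online version needs a single pass
# and one accumulator per key.
sums_materialized = group_then_apply(keys, values, sum)
sums_online = group_reduce(keys, values, lambda x, y: x + y)

# A function that needs the whole group at once (here, the median) is not a
# pairwise reduction, so only the materializing path applies.
def median(vs):
    s = sorted(vs)
    n = len(s)
    return s[n // 2] if n % 2 else (s[n // 2 - 1] + s[n // 2]) / 2

medians = group_then_apply(keys, values, median)
```

If pandas really does take the second path for sum and mean, a benchmark that goes through the first path on the JuliaDB side would be measuring a different algorithm rather than the same one implemented more slowly.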

The Fannie Mae data requires a login, but it’s large, at almost 2 billion rows.