JuliaDB Benchmarks



I’ve started gathering some reproducible benchmarks at https://github.com/joshday/JuliaDB_Benchmarks (huge shout out to https://github.com/oxinabox/DataDeps.jl for simplifying the reproducible part).

I would happily accept PRs of more examples! If you even just have a pointer to an interesting dataset you think JuliaDB would work well for, I’d love to hear about it!


It is really great to have this, but it definitely highlights some room for improvement :sweat_smile:


It would be interesting to add JuliaDB to https://h2oai.github.io/db-benchmark.


I’m actually not sure. It may well be that groupby in pandas special-cases mean and sum to use online algorithms, i.e. it doesn’t extract the vector corresponding to each group but just computes the sum while iterating through the data (or so I understood from the discussion comparing performance with DataFrames). In that case the correct performance comparison would be with groupreduce, which indeed performs quite well. I’d be curious to see what happens with a custom user-defined “reducing” function in groupby, where this optimization is no longer available.
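To make the distinction concrete, here is a minimal plain-Python sketch of the two strategies. It is only an illustration of the idea, not how pandas or JuliaDB implement it internally; the function names `group_then_apply` and `group_reduce` are made up for this example.

```python
import statistics
from collections import defaultdict


def group_then_apply(keys, values, f):
    """Collect each group's values into a list, then call f on every list."""
    groups = defaultdict(list)
    for k, v in zip(keys, values):
        groups[k].append(v)
    return {k: f(vs) for k, vs in groups.items()}


def group_reduce(keys, values, op):
    """Keep one running accumulator per key; no per-group vector is built."""
    acc = {}
    for k, v in zip(keys, values):
        acc[k] = op(acc[k], v) if k in acc else v
    return acc


keys = ["a", "b", "a", "b", "a"]
values = [1, 2, 3, 4, 5]

# Both strategies give the same sums, but the second makes a single pass
# without allocating a list per group.
sums_apply = group_then_apply(keys, values, sum)
sums_reduce = group_reduce(keys, values, lambda x, y: x + y)

# A median needs the whole group at once, so the online shortcut
# does not apply and the groups have to be materialized.
medians = group_then_apply(keys, values, statistics.median)
```

A sum (or a mean, tracked as a running sum and count) fits the second pattern, which is why comparing a special-cased pandas groupby against a general group-then-apply path may not be like for like.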


The Fannie Mae data requires a login, though it’s large at almost 2 billion rows.