I would happily accept PRs of more examples! If you even just have a pointer to an interesting dataset you think JuliaDB would work well for, I’d love to hear about it!
It is really great to have this, but it definitely highlights some room for improvement.
I’m actually not sure: it may well be that groupby in pandas special-cases sum to use an online algorithm, i.e. it computes the sum while iterating through the data rather than first extracting the vector corresponding to each group (or so I understood from the discussion comparing performance with DataFrames). In that case the correct performance comparison would be with groupreduce, which indeed performs quite well. I’d be curious to see what happens with a custom user-defined “reducing” function in groupby, where this optimization is no longer available.
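To make the distinction concrete, here is a minimal pure-Python sketch of the two strategies. The names `group_reduce`, `group_apply`, and `spread` are made up for illustration. This is not how pandas or JuliaDB implement things internally, just the general idea of folding values into a per-key accumulator versus collecting each group before calling a function on it.

```python
from collections import defaultdict


def group_reduce(op, keys, values):
    """Online strategy: fold each value into a per-key accumulator in one pass."""
    acc = {}
    for k, v in zip(keys, values):
        # No per-group vector is ever built; only one running value per key.
        acc[k] = op(acc[k], v) if k in acc else v
    return acc


def group_apply(f, keys, values):
    """Materializing strategy: collect each group's values, then call f on them."""
    groups = defaultdict(list)
    for k, v in zip(keys, values):
        groups[k].append(v)
    return {k: f(vs) for k, vs in groups.items()}


def spread(vs):
    """A custom function that needs the whole group at once (hypothetical example)."""
    return max(vs) - min(vs)


keys = ["a", "b", "a", "c", "b", "a"]
values = [1, 2, 3, 4, 5, 6]

online_sums = group_reduce(lambda x, y: x + y, keys, values)
materialized_sums = group_apply(sum, keys, values)
spreads = group_apply(spread, keys, values)
```

Both routes give the same sums, but only the first avoids allocating a list per group. An arbitrary user-defined function such as `spread` is not written as a pairwise fold, so it has to take the materializing path, which is where I would expect any special-casing advantage to disappear.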
The Fannie Mae data requires a login, and it’s large at almost 2 billion rows.