Seems like DuckDB has resurrected the H2O benchmark and plans to keep re-running it: The Return of the H2O.ai Database-like Ops Benchmark - DuckDB
@bkamins any chance this effort could be implemented for general Tables.jl? It would be super nice to reuse the work in other table types and contexts.
DuckDB and Polars are so fast
FlexiJoins.jl is the general type-stable joining package. It doesn't constrain the dataset type, and is based on the regular Julia collections interface (not on Tables.jl, but many tables are implicitly supported).
The generic code path is implemented to be reasonably fast, but it doesn't have all those special cases DataFrames.jl does, such as integer or low-cardinality optimizations.
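To make the "generic path vs. special cases" distinction concrete, here is a rough, language-neutral sketch in Python. It is not the actual code of FlexiJoins.jl or DataFrames.jl; the function names `hash_join` and `int_range_join` are made up for illustration, and rows are assumed to be plain dicts. The generic version hashes arbitrary keys, while the specialised one assumes integer keys in a small range and replaces hashing with direct indexing.

```python
from collections import defaultdict


def hash_join(left, right, key):
    """Generic inner join: works for any hashable key values."""
    index = defaultdict(list)
    for row in right:
        index[row[key]].append(row)
    # .get() avoids creating empty entries for keys missing on the right
    return [{**l, **r} for l in left for r in index.get(l[key], ())]


def int_range_join(left, right, key):
    """Inner join specialised for integer keys that span a small range.

    Hypothetical illustration: buckets are addressed directly by
    (key - lo), so no hashing is needed.
    """
    if not right:
        return []
    lo = min(row[key] for row in right)
    hi = max(row[key] for row in right)
    buckets = [[] for _ in range(hi - lo + 1)]
    for row in right:
        buckets[row[key] - lo].append(row)
    out = []
    for l in left:
        k = l[key] - lo
        if 0 <= k < len(buckets):  # keys outside the range have no match
            out.extend({**l, **r} for r in buckets[k])
    return out


left = [{"id": 3, "x": "a"}, {"id": 5, "x": "b"}, {"id": 9, "x": "c"}]
right = [{"id": 5, "y": 1.0}, {"id": 3, "y": 2.0}, {"id": 5, "y": 3.0}]

# Both strategies return the same rows; only the lookup mechanism differs.
assert hash_join(left, right, "id") == int_range_join(left, right, "id")
```

The specialised path only pays off when the key range is small relative to the row count, which is presumably why it is a special case rather than the default.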
I'm very familiar with Arrow (feather) as an on-disk file spec. I don't think in this case it has much to do with Arrow or not; representing a column in memory (just for Julia's own use) doesn't need the fancy spec/schema/row groups of Arrow.
On a separate note, we have known for a while that Arrow is the future, and Pandas 2.0 is moving towards Arrow as a backend as well. Arrow (feather) handling is another area Julia needs to improve, but that's a different topic.
I would be fine if DataFrames.jl used Arrow batches as its in-memory representation and shared the column schema. But given Tables.jl and Arrow.jl, we can do zero-copy with other ecosystems as long as Arrow.jl is good enough, so again it shouldn't matter.
This benchmark has been updated again: Results
Does anyone know how DuckDB can run so fast? What would it take to make DataFrames.jl rise to the top?
Wow, I didn't realize there's been no commit to DF in 3 months
If we can find community members willing to contribute/review PRs, we can move the development of the package forward.
The other perspective is, I haven't opened an issue on that repo since 2021, even though it's a pretty standard package all over the place