The state of DataFrames.jl H2O benchmark

Seems like DuckDB has resurrected the H2O benchmark and plans to keep re-running it: The Return of the H2O.ai Database-like Ops Benchmark - DuckDB


@bkamins any chance this effort could be implemented for general Tables.jl? It would be super nice to reuse the work in other table types and contexts.

DuckDB and Polars are so fast


FlexiJoins.jl is a general, type-stable joining package. It doesn’t constrain the dataset type, and it is based on the regular Julia collections interface (not on Tables.jl, though many table types are implicitly supported).

The generic code path is implemented to be reasonably fast, but it doesn’t have all the special cases that DataFrames.jl does, such as integer-key or low-cardinality optimizations.
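To make the "regular collections" point concrete, here is a minimal sketch of a key-based join over plain vectors of NamedTuples. The toy data and variable names are made up for illustration, and the `by_key(:name)` form is written from memory of the package README, so treat the exact call as an assumption and check it against the current FlexiJoins.jl docs.

```julia
using FlexiJoins

# Hypothetical toy data: plain vectors of NamedTuples, no table package involved.
objects = [(name = "A", ra = 10.0), (name = "B", ra = 20.0)]
measurements = [
    (name = "A", flux = 1.5),
    (name = "A", flux = 1.7),
    (name = "C", flux = 0.3),
]

# Join the two collections on the `name` field.
# Passing a NamedTuple of datasets is assumed to label each side of a match,
# so every result element exposes the matched pair as `.O` and `.M`.
matched = innerjoin((O = objects, M = measurements), by_key(:name))

for row in matched
    println(row.O.name, " => ", row.M.flux)
end
```

Only object "A" has matching measurements here, so "B" and "C" drop out of the inner join.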

Interestingly, both DuckDB and Polars use Arrow column-oriented storage.

I’m very familiar with Arrow (Feather) as an on-disk file spec. I don’t think Arrow is what matters in this case: representing a column in memory (just for Julia’s own use) doesn’t need the spec, schema, or row groups that Arrow brings.

On a separate note, we have known for a while that Arrow is the future, and Pandas 2.0 is moving towards Arrow as a backend as well. Arrow (Feather) handling is another area where Julia needs to improve :frowning: but that’s a different topic.

I would be :100: fine if DataFrames.jl used Arrow record batches as its in-memory representation and shared the column schema. But given Tables.jl and Arrow.jl, we can already do zero-copy exchange with other ecosystems as long as Arrow.jl is good enough, so again it shouldn’t matter.
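As a rough sketch of the Tables.jl plus Arrow.jl route mentioned above: write a table to an Arrow file, read it back as an `Arrow.Table`, and wrap it in a `DataFrame` without copying the columns. The file path and column names are hypothetical, and this shows only the Julia side of the exchange, not an actual handoff to another ecosystem.

```julia
using Arrow, DataFrames

# Hypothetical example table.
df = DataFrame(id = [1, 2, 3], v = [1.0, 2.0, 3.0])

# Write it in the Arrow format to a temporary file.
path = tempname() * ".arrow"
Arrow.write(path, df)

# Arrow.Table is a Tables.jl source whose columns are backed by the file's bytes.
tbl = Arrow.Table(path)

# copycols=false asks DataFrames.jl to wrap the Arrow-backed columns as-is
# rather than materializing new Julia Vectors.
df2 = DataFrame(tbl; copycols = false)

println(typeof(df2.v))
```

Columns wrapped this way are read-only; pass `copycols = true` (or call `copy(df2)`) when ordinary mutable vectors are needed.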


This benchmark has been updated again: Results
