The state of DataFrames.jl H2O benchmark

brucala · May 17, 2023, 8:00am

Seems like DuckDB has resurrected the H2O benchmark and plans to keep re-running it: The Return of the H2O.ai Database-like Ops Benchmark - DuckDB

juliohm · May 22, 2023, 7:53pm

@bkamins any chance this effort could be implemented for general Tables.jl? It would be super nice to reuse the work in other table types and contexts.

jling · May 22, 2023, 8:42pm

DuckDB and Polars are so fast

aplavin · May 22, 2023, 8:47pm

FlexiJoins.jl is the general type-stable joining package. It doesn’t constrain the dataset type, and is based on the regular Julia collections interface (not on Tables.jl, but many tables are implicitly supported).

The generic code path is implemented to be reasonably fast, but it doesn’t have all the special cases DataFrames.jl does, such as integer or low-cardinality optimizations.
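To make the difference between a generic code path and an integer special case concrete, here is a rough sketch in Python. It is not how FlexiJoins.jl or DataFrames.jl implement joins. The function names `hash_join` and `dense_int_join` and the `lo`/`hi` range arguments are hypothetical, chosen only for illustration. It assumes the right-hand keys are integers in a small known range.

```python
from collections import defaultdict


def hash_join(left, right, key):
    """Generic inner join: works for any hashable key."""
    index = defaultdict(list)
    for row in right:
        index[key(row)].append(row)
    # Look up each left row in the hash index built from the right side.
    return [(l, r) for l in left for r in index.get(key(l), ())]


def dense_int_join(left, right, key, lo, hi):
    """Specialized inner join, assuming right-side keys are ints in [lo, hi].

    A plain list indexed by (key - lo) replaces the hash table,
    so no hashing is needed for lookups.
    """
    buckets = [[] for _ in range(hi - lo + 1)]
    for row in right:
        buckets[key(row) - lo].append(row)
    out = []
    for l in left:
        k = key(l)
        if lo <= k <= hi:  # left keys outside the range cannot match
            out.extend((l, r) for r in buckets[k - lo])
    return out


left = [(1, "a"), (2, "b"), (3, "c"), (7, "z")]
right = [(1, "x"), (3, "y"), (3, "w")]
first = lambda row: row[0]

generic = hash_join(left, right, first)
special = dense_int_join(left, right, first, lo=1, hi=3)
```

Both functions return the same matched pairs. The specialized version only applies when the key range is known and small enough to allocate one bucket per possible key, which is the kind of trade-off a generic package can reasonably skip.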

dmbates · May 22, 2023, 9:01pm

Interestingly both DuckDB and Polars use Arrow column-oriented storage.

jling · May 22, 2023, 9:10pm

I’m very familiar with Arrow (feather) as an on-disk file spec. I don’t think in this case it has much to do with whether Arrow is used or not: representing a column in memory (just for Julia’s own use) doesn’t need Arrow’s fancy spec/schema/row groups.

On a separate note, we have known for a while that Arrow is the future, and Pandas 2.0 is moving towards Arrow as a backend as well. Arrow (feather) handling is another area where Julia needs to improve, but that’s a different topic.

I would be fine if DataFrames.jl used Arrow batches as its in-memory representation and shared the column schema. But given Tables.jl and Arrow.jl, we can do zero-copy with other ecosystems as long as Arrow.jl is good enough, so again it shouldn’t matter.

rdavis120 · November 5, 2023, 9:33pm

This benchmark has been updated again: Results

tk3369 · July 21, 2024, 5:15pm

Does anyone know how DuckDB can run so fast? What would it take to make DataFrames.jl rise to the top?

jling · July 21, 2024, 10:44pm

Wow, I didn’t realize there has been no commit to DF in 3 months

bkamins · July 22, 2024, 4:42am

If we can find community members willing to contribute/review PRs, we can move the development of the package forward.

ChrisRackauckas · July 22, 2024, 12:22pm

The other perspective is, I haven’t opened an issue on that repo since 2021, even though it’s a pretty standard package all over the place

_micro · December 30, 2024, 6:56pm

CSV.jl is segfaulting in the most recent run of the benchmark.

bkamins · December 31, 2024, 6:13am

Thank you for reporting. This is tracked in Segfault when reading CSV from within the R environment · Issue #55765 · JuliaLang/julia · GitHub.

viralbshah · January 1, 2025, 8:45am

I noted this in the issue, but it loads fine for me on my mac arm64 with both 1.10.7 and 1.11.2.
