There has been recent discussion about some benchmarking by the DuckDB folks with good results for DuckDB (“Well they would say that, wouldn’t they?” M. Rice-Davies) and Polars.
Especially now that Polars supports reading and writing the Arrow IPC file format for its DataFrame and Series types, it is reasonable to write a Julia DataFrame with Arrow.jl then read and manipulate it in Polars.
I have been using PyCall.jl and its `pyimport` function, as in

```julia
pl = pyimport("polars")
```

to access the Python bindings within a Julia session.
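For concreteness, here is a minimal sketch of the round-trip described above: write an Arrow IPC file from a Julia DataFrame, then read and summarise it from Polars via PyCall. It assumes the `polars` Python package is installed in the environment PyCall uses; note also that Polars renamed its group-by method across versions (`groupby` in older releases, `group_by` in newer ones).

```julia
using Arrow, DataFrames, PyCall

pl = pyimport("polars")   # assumes polars is installed for PyCall's Python

# Write a Julia DataFrame to an Arrow IPC file
df = DataFrame(id = [1, 1, 2], x = [0.5, 1.5, 2.5])
path = tempname() * ".arrow"
Arrow.write(path, df)

# Read it back as a Polars DataFrame and summarise
pdf = pl.read_ipc(path)
agg = pdf.group_by("id").agg(pl.col("x").sum())   # `groupby` on older Polars
```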
I am not that familiar with the internals of Python nor the distinctions between PyCall.jl and PythonCall.jl.
Would there be an inherent advantage, perhaps related to zero-copy, in using PythonCall.jl to access the Python bindings for Polars?
Would it be feasible/worthwhile creating a Julia package to access the Rust entry points for Polars? At pola-rs · GitHub there are bindings in Python, R, Node.js and pyo3. Is it that much of a stretch to produce a Julia package to access the polars Rust functions?
But also your opening paragraph does not really make much sense to me. What does DuckDB have to do with this, specifically? If you want to use DuckDB, just do `using DuckDB, DBInterface` (it is annoying that we need that second package).
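For reference, a minimal sketch of that route, assuming DuckDB.jl is installed; the query API comes from DBInterface, which is why the second package is needed.

```julia
using DuckDB, DBInterface, DataFrames

con = DBInterface.connect(DuckDB.DB, ":memory:")   # in-memory database
DBInterface.execute(con, "CREATE TABLE t AS SELECT range AS i FROM range(5)")
res = DataFrame(DBInterface.execute(con, "SELECT sum(i) AS s FROM t"))
DBInterface.close!(con)
```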
Really I think DataFrames and Polars are close competitors. I’m not sure that a “Polars.jl” is really needed or worth the likely large amount of effort to make it happen.
I was thinking of the benchmarks mentioned in recent posts on https://discourse.julialang.org/t/the-state-of-dataframes-jl-h2o-benchmark/ when I mentioned DuckDB. The latest round of the benchmarks was run by the DuckDB folks to show the speed of their most recent version, and the results also show that Polars is very fast on these benchmarks. The mention of DuckDB was only to provide motivation, through their benchmarks, for interfacing to Polars.
It is certainly possible to use PyCall or PythonCall and the Python bindings for Polars to do the types of summaries shown in the benchmarks. But I don’t know if that route will cause copies of large objects to be made. I believe that the existing Arrow.Table and Arrow.write functions in Arrow.jl can provide for zero-copy exchange of Arrow tables between Julia and Rust, but I don’t know much about the Rust end of that.
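On the Julia side, the zero-copy part can be sketched as follows; as I understand it, `Arrow.Table` memory-maps the file and wraps the Arrow buffers directly, and `DataFrame(tbl; copycols=false)` avoids copying them into fresh Julia vectors (with the columns then being read-only Arrow vectors).

```julia
using Arrow, DataFrames

path = tempname() * ".arrow"
Arrow.write(path, (a = collect(1:3), b = ["x", "y", "z"]))

tbl = Arrow.Table(path)              # memory-maps; no copy of the buffers
df  = DataFrame(tbl; copycols=false) # wrap the Arrow vectors directly
```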
The point of interfacing to Polars is not to replace DataFrames.jl but rather to enhance it for cases of large complicated joins and summaries.
Note that DataFrames.jl is noticeably slow only in some cases; typically it is comparable (even if somewhat slower). And I could give you benchmarks where DataFrames.jl is faster than Polars (one relevant case is running code on a laptop rather than on a 40-core server; on a laptop, DataFrames.jl's performance is relatively more attractive).
My current thinking is:
- we know in what cases we need to improve DataFrames.jl (mostly operations where you have many small groups; we are quite good when there are few large groups), and we plan to improve the cases where we are really slow;
- still, it should be expected that in the long run DuckDB/Polars will be faster on average; the reason is that many more people are willing to work on the performance of those frameworks (I opened a GSoC/JSoC call for performance improvements to DataFrames.jl this year and it did not attract any interest);
- if I have to choose a fallback between DuckDB and Polars in my work, I choose DuckDB, because:
  i. we already have a package, DuckDB.jl, that allows me to do it;
  ii. it is in general faster (not by much, but faster);
  iii. it is enough that I know SQL to use it (Polars also has an SQL interface, but for DuckDB, SQL is the native interface).
Also, in general, can you please report, as DataFrames.jl issues, cases where you find DataFrames.jl to be exceedingly slow in practice? Focusing on concrete classes of cases is the easiest way for us to improve.
I guess DataFrames.jl doesn't automatically use multi-threading, which is fine; I wish the H2O benchmark had a single-threaded variant.
But I guess they're coming from a big-database perspective, where there's no point in not doing everything multi-threaded. Indeed, I think it's pretty rare to need to group by many smaller dataframes in parallel, rather than to have just one big "database".
Roughly yes. They come from a "big database" perspective, and I think it makes sense.
In terms of DataFrames.jl:
- originally it was developed as single-threaded;
- we are slowly adding multi-threading in different places, so things improve, but this is work in progress and has not had top priority (we put more priority on single-threaded performance, functionality coverage, and API flexibility); and, e.g., until Julia fully resolves issues with GC in multi-threaded code we will have problems (GC is not an issue for DuckDB and Polars).
Will DTables.jl be able to address multi-threading? What is the GC problem?
I think what's missing from Pluto and JuliaHub, and the Julia ecosystem as a whole, is a GUI table editor backed by SQLite (or DuckDB, if support is developed). I should be able to connect an editor like DBeaver to enter tables and map joins, while using a Pluto notebook to analyze the data and for visualization/publishing.
What if I were doing something like employee evaluations, and I wanted to study the performance of individual employees and of the staff as a whole? Would rows be better for the former and columns for the latter? Would a vector or graph database be better suited to evaluating each as needed?