Direct interface to Polars Rust library

There has been recent discussion about some benchmarking by the DuckDB folks with good results for DuckDB (“Well they would say that, wouldn’t they?” M. Rice-Davies) and Polars.

Especially now that Polars supports reading and writing the Arrow IPC file format for its DataFrame and Series types, it is reasonable to write a Julia DataFrame with Arrow.jl then read and manipulate it in Polars.

I have been using PyCall.jl and its pyimport function as

pl = pyimport("polars")

to access the Python bindings within a Julia session.
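A minimal sketch of that round trip (the file name and column names are just illustrative, and it assumes Polars is installed in the Python environment PyCall points at):

```julia
using Arrow, DataFrames, PyCall

# Build a table in Julia and write it in the Arrow IPC *file* format.
# As far as I understand Arrow.jl's defaults, file = true forces the file
# format, which is what Polars' read_ipc expects (read_ipc_stream handles
# the stream format).
df = DataFrame(id = 1:5, value = rand(5))
Arrow.write("example.arrow", df; file = true)

pl = pyimport("polars")
pdf = pl.read_ipc("example.arrow")   # now a Polars DataFrame on the Python side
println(pdf.shape)
```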

I am not that familiar with the internals of Python nor the distinctions between PyCall.jl and PythonCall.jl.

  • Would there be an inherent advantage, perhaps related to zero-copy, in using PythonCall.jl to access the Python bindings for Polars?

  • Would it be feasible/worthwhile creating a Julia package to access the Rust entry points for Polars? At pola-rs · GitHub there are bindings for Python (built with pyo3), R and Node.js. Is it that much of a stretch to produce a Julia package that accesses the Polars Rust functions?

I know of one direct path – jlrs

But also your opening paragraph does not really make much sense to me. What does DuckDB have to do with this, specifically? If you want to use DuckDB, just do using DuckDB, DBInterface (annoying that we need that second one).
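For completeness, a minimal sketch of that route (an in-memory database and a trivial query, just to show the shape of the API):

```julia
using DuckDB, DBInterface, DataFrames

# ":memory:" gives an in-memory database; pass a file path to persist it.
con = DBInterface.connect(DuckDB.DB, ":memory:")

results = DBInterface.execute(con, "SELECT 42 AS answer")
DataFrame(results)   # DuckDB results satisfy the Tables.jl interface
```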

Really I think DataFrames and Polars are close competitors. I’m not sure that a “Polars.jl” is really needed or worth the likely large amount of effort to make it happen.

I was thinking of the benchmarks mentioned in recent posts on https://discourse.julialang.org/t/the-state-of-dataframes-jl-h2o-benchmark/ when I mentioned DuckDB. The latest version of the benchmarks were performed by the DuckDB folks to show the speed of their most recent version. The benchmark results also show that Polars is very fast on these benchmarks. The mention of DuckDB was only to provide motivation, through their benchmarks, for interfacing to Polars.

It is certainly possible to use PyCall or PythonCall and the Python bindings for Polars to do the types of summaries shown in the benchmarks. But I don’t know if that route will cause copies of large objects to be made. I believe that the existing Arrow.Table and Arrow.write functions in Arrow.jl can provide for zero-copy exchange of Arrow tables between Julia and Rust, but I don’t know much about the Rust end of that.
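For the other direction, a rough sketch of what I have in mind (assuming a reasonably recent Polars with write_ipc; as far as I understand, Arrow.Table memory-maps the file rather than copying the column data):

```julia
using Arrow, DataFrames, PyCall

pl = pyimport("polars")

# Produce a result in Polars and write it out in the Arrow IPC file format.
pdf = pl.DataFrame(Dict("g" => [1, 1, 2], "x" => [1.0, 2.0, 3.0]))
pdf.write_ipc("result.arrow")

# Read it back in Julia; Arrow.Table wraps the (memory-mapped) buffers,
# and copycols = false keeps the DataFrame from copying them either.
tbl = Arrow.Table("result.arrow")
jdf = DataFrame(tbl; copycols = false)
```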

The point of interfacing to Polars is not to replace DataFrames.jl but rather to enhance it for cases of large complicated joins and summaries.

Note that DataFrames.jl is noticeably slow only in some cases. Typically it is comparable (even if slower). And I could give you benchmarks where DataFrames.jl is faster than Polars (one relevant case is when you run the code on a laptop rather than on a 40-core server; on a laptop DataFrames.jl’s performance is relatively more attractive).

My current thinking is:

  • we know in which cases we need to improve DataFrames.jl (it is mostly operations where you have many small groups; we are quite good if there are few large groups); in the cases where we are really slow we plan to improve
  • still, it should be expected that in the long run DuckDB/Polars will be faster on average; the reason is that many more people are willing to work on the performance of these frameworks (I opened a GSoC/JSoC call for performance improvements to DataFrames.jl this year and it did not attract interest)
  • if I have to choose between DuckDB and Polars in my work, I choose DuckDB as a fallback; the reasons are that
    i. we already have a package, DuckDB.jl, that allows me to do it
    ii. it is in general faster (not by much, but faster)
    iii. it is enough that I know SQL to use it (Polars also has an SQL interface, but for DuckDB SQL is the native interface); a small sketch of that route is below.
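What I mean by point iii, roughly (register_data_frame is the relevant entry point, as far as I remember the API; names and sizes are illustrative):

```julia
using DuckDB, DBInterface, DataFrames

con = DBInterface.connect(DuckDB.DB, ":memory:")
df = DataFrame(g = rand(1:3, 1_000), x = rand(1_000))

# Expose the Julia DataFrame to DuckDB under the name "df" so it can be
# queried with plain SQL (no copy into the database should be needed).
DuckDB.register_data_frame(con, df, "df")

res = DBInterface.execute(con, "SELECT g, SUM(x) AS total FROM df GROUP BY g")
DataFrame(res)
```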

Also, in general, could you please report as DataFrames.jl issues the cases where in practice you find DataFrames.jl to be exceedingly slow? Focusing on concrete classes of cases is the easiest way for us to improve.


I guess DF.jl doesn’t automatically use multi-threading, which is fine; I wish the H2O DB benchmark had a single-threaded variant.

But I guess they’re coming from a big-database perspective, where there’s no point in not doing everything multi-threaded. Indeed, I think it’s pretty rare to need to group many smaller dataframes in parallel rather than having just one big “database”.

Roughly yes. They come from a “big database” perspective, and I think it makes sense.

In terms of DataFrames.jl:

  • originally it was developed as single-threaded;
  • we are slowly adding multi-threading in different places, so things are improving, but this is work in progress and has not had top priority (we have put more priority on single-threaded performance, functionality coverage, and API flexibility); also, until Julia fully resolves issues with GC in multi-threaded code we will have problems there (GC is not an issue for DuckDB and Polars). If you want to compare single- and multi-threaded runs yourself, see the sketch below.
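A rough sketch of such a comparison (recent DataFrames.jl releases accept a threads keyword in combine and friends; sizes are illustrative):

```julia
using DataFrames

# Number of threads this session was started with
# (e.g. `julia --threads=1` vs `julia --threads=auto`).
@show Threads.nthreads()

df = DataFrame(g = rand(1:1_000, 1_000_000), x = rand(1_000_000))
gdf = groupby(df, :g)

# Time the same grouped reduction with and without multi-threading.
@time combine(gdf, :x => sum; threads = true)
@time combine(gdf, :x => sum; threads = false)
```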

The DuckDB.jl repository linked below has been archived since 2022, and is not compatible with the latest versions of DBeaver.

https://github.com/kimmolinna/DuckDB.jl

I have moved my question to a separate topic, since it appears to be a usage issue.
Connecting to DuckDB

Will DTables.jl be able to address multithreading? What is the GC problem?

I think what’s missing from Pluto and JuliaHub, and the Julia ecosystem as a whole, is a GUI table editor. With SQLite (or DuckDB, if it’s developed further), I should be able to connect an editor like DBeaver to enter tables and map joins, while using a Pluto notebook to analyze the data and for visualization/publishing.

DuckDB.jl now lives inside the official DuckDB repository.


If you use allocated objects in the columns of your table and have billions of rows, then GC often starts to be the most expensive part of executing data transformations.
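A rough way to see that effect is to watch the “% gc time” part of the @time report on a table whose column holds heap-allocated objects (sizes and names below are just illustrative; scale n up and the GC share grows):

```julia
using DataFrames, Random

n = 1_000_000   # scale this up towards billions to see GC dominate
df = DataFrame(g = rand(1:100_000, n),
               s = [randstring(8) for _ in 1:n])   # String is heap-allocated

GC.gc()
# @time reports the fraction of time spent in garbage collection.
@time combine(groupby(df, :g), :s => length => :count)
```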

How do I check that the version of DuckDB is the same in DBeaver and in Julia?
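One way to compare, assuming an open DuckDB.jl connection (the same PRAGMA version query can be run from DBeaver’s SQL editor):

```julia
using DuckDB, DBInterface, DataFrames

con = DBInterface.connect(DuckDB.DB, ":memory:")

# DuckDB reports its own version through SQL, so running this statement in
# both places lets you compare the two.
DataFrame(DBInterface.execute(con, "PRAGMA version"))
```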

In case you hit an issue: I have not been able to work with DuckDB since the 0.9.1 release, and there is an open issue which others have encountered as well.

Also there is a Polars.jl being worked on.

I got DuckDB to work in Julia and in DBeaver, but I can’t communicate between the two. What happened to JuliaDB?

Would there be a benefit to making Julia able to call Rust libraries, the same way it can call C or Fortran libraries?
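Julia can already call any Rust library that exposes a C ABI through the usual ccall/@ccall machinery; a minimal sketch with a hypothetical crate and library path (jlrs, mentioned earlier in the thread, aims at richer two-way integration, as I understand it):

```julia
# Hypothetical Rust side, compiled as a cdylib with `cargo build --release`:
#
#   #[no_mangle]
#   pub extern "C" fn add_one(x: i64) -> i64 {
#       x + 1
#   }
#
# Julia can then call the exported C-ABI symbol directly:
result = @ccall "/path/to/libmylib.so".add_one(41::Int64)::Int64
@show result   # 42
```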

What if I was doing something like employee evaluations, and I wanted to study the performance of individual employees and the staff as a whole? Would rows be better for the former and columns for the latter? Would a vector or graph database be better suited to evaluating each as needed?