Pola.rs vs DataFrames.jl

klwlevy · February 23, 2023, 4:33pm

There seems to be quite a lot of activity in the Python community around using pola.rs to get fast dataframes with the help of Rust. As a frequent DataFrames.jl user, I would be curious to hear @bkamins's opinion on this “competitor”.

Developer explaining the package

pola.rs homepage

bkamins · February 23, 2023, 4:59pm

Polars is an excellent package that is actively developed, and it has decent performance.

The comparison you shared contains an error: DataFrames.jl is released under the MIT License.

tbeason · February 23, 2023, 5:00pm

That table is mostly meaningless. It only reflects that Polars attracts a wider audience, which is obvious since it is a package for both Rust and Python!

I think DataFrames.jl alone is somewhat competitive with Polars, but less feature complete (by design?). DataFrames.jl in combination with other packages from the ecosystem, like CSV.jl, Arrow.jl, or a few other processing packages, makes our setup in Julia nearly as good, though perhaps less user friendly.

Polars has some cool built-in functions, like groupby_dynamic, which would be nice to have.

Another example: I have been using DuckDB lately, and it has a solid (and improving) Polars integration, while the Julia integration is lagging behind a bit.

bkamins · February 23, 2023, 5:23pm

We are open to adding any features that are doable in DataFrames.jl. If something is needed, could you please open an issue to discuss how it would fit?

Palli · February 23, 2023, 5:33pm

Yes, Polars seems excellent (and Pandas too in some ways; it still has more features, though I dislike much of its syntax, at least in Python).

There is a Julia wrapper for Pandas, but not (yet) for Polars. By that I mean a wrapper around the Rust code; you can already call the Python package, which gives you indirect access (is the Python wrapper complete?). I suppose a wrapper could be made, though supporting all of Polars would take some work. But would it make sense, and would it make sense to use it together with DataFrames.jl? I don’t see any inherent reason for Polars to be better than Julia (faster, or with more features), so would a wrapper make sense just for, say, some additional features?

bkamins · February 23, 2023, 5:48pm

Inherent reasons:

Polars is Arrow-based only, which makes it less general but lets it be optimized more for this specific case (strings are one of the major areas; note that Pandas 2.0 is also going this way, for the same reason);
Polars has lazy evaluation; DataFrames.jl does not have it and will likely not have it;
On the other hand, DataFrames.jl can be optimized more easily than Polars for custom Julia types (not present in Arrow) and for executing functions defined in Julia.
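
To make the lazy-evaluation point concrete: a lazy engine records the requested operations as a plan and runs them only when the result is asked for, which lets it fuse steps and avoid intermediate tables. The following is a minimal, language-neutral sketch in plain Python. The `LazyTable` class and the `load_rows` helper are hypothetical, invented here for illustration; this is not how Polars or DataFrames.jl is implemented.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class LazyTable:
    """Hypothetical lazy table: it only records operations until collect()."""
    source: Callable[[], Iterable[dict]]   # called only inside collect()
    predicates: tuple = ()                 # recorded row filters
    columns: Optional[tuple] = None        # recorded projection, None = all

    def filter(self, predicate):
        # Record the filter; no data is read here.
        return LazyTable(self.source, self.predicates + (predicate,), self.columns)

    def select(self, *names):
        # Record the projection; no data is read here.
        return LazyTable(self.source, self.predicates, names)

    def collect(self):
        # Execute the whole plan in a single pass over the source,
        # without materializing an intermediate table per step.
        result = []
        for row in self.source():
            if all(p(row) for p in self.predicates):
                if self.columns is None:
                    result.append(dict(row))
                else:
                    result.append({name: row[name] for name in self.columns})
        return result


def load_rows():
    # Hypothetical data source standing in for a file or database scan.
    return [
        {"id": 1, "city": "Oslo", "amount": 10.0},
        {"id": 2, "city": "Lima", "amount": 25.0},
        {"id": 3, "city": "Oslo", "amount": 40.0},
    ]


# Building the query touches no data; only collect() reads the source.
query = LazyTable(load_rows).filter(lambda r: r["amount"] > 15).select("id", "city")
rows = query.collect()
print(rows)
```

An eager data frame would instead build a filtered table first and a projected table second. A real lazy engine can go much further than this toy (for example by reordering or pruning steps), which is only possible because the whole plan is known before execution.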

I think a wrapper would make sense, so that users have a choice.

A general comment:

I maintain DataFrames.jl because I believed Julia users need a decent data frame package. However, I would have no problem if in the future some better package replaced it, PROVIDED THAT it would not introduce limitations that Julia users would not want to accept. Simply put, 5 years ago I believed that, without an investment in DataFrames.jl development, preprocessing data in Julia was a pain, and I needed it. Having said that, I am convinced that DataFrames.jl is pretty feature complete (not 100%, but pretty close; as an example, Tidier.jl has recently started being developed with DataFrames.jl as a backend, and I have not received any significant complaint that we are missing something really important that would hinder porting the tidyverse to Julia).
In general, I believe that data frame functionality is super important, but as a “side kick”, as I call it: data pre- and post-processing is almost never the thing one wants to do; it is something one has to do. These are support processes that need to be there and need to be good enough. I think the key thing for Julia is to make sure we provide the right “core” packages that perform real data science operations (like ML, optimization, simulation, etc.) at a best-in-class level.

tbeason · February 23, 2023, 6:05pm

I don’t think that groupby_dynamic fits well within DataFrames.jl. It would be better for a time series or panel data package (right now I just do it manually in 2 steps).
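
For readers unfamiliar with the idea, a two-step version of time-window grouping can be sketched roughly as follows: first derive a window key for each row, then run an ordinary group-by on that key. This is a plain-Python illustration under stated assumptions, not necessarily the two steps meant above. The helper names `window_start` and `group_by_window` are hypothetical, and the sketch only covers non-overlapping, fixed-width windows with a sum aggregation.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def window_start(ts, every, origin):
    """Step 1: map a timestamp to the start of its fixed-width window."""
    steps = (ts - origin) // every   # timedelta // timedelta gives an int
    return origin + steps * every


def group_by_window(rows, every, origin):
    """Step 2: ordinary group-by on the derived window key (sum per window)."""
    totals = defaultdict(float)
    for ts, value in rows:
        totals[window_start(ts, every, origin)] += value
    return dict(sorted(totals.items()))


# Hypothetical sample data: (timestamp, value) pairs.
origin = datetime(2023, 2, 23)
rows = [
    (datetime(2023, 2, 23, 0, 10), 1.0),
    (datetime(2023, 2, 23, 0, 50), 2.0),
    (datetime(2023, 2, 23, 1, 5), 4.0),
]
hourly = group_by_window(rows, timedelta(hours=1), origin)
for start, total in hourly.items():
    print(start, total)
```

In a data frame library the same pattern would typically be a column transformation followed by a regular grouped aggregation. Overlapping windows need more than this, since one row can then belong to several groups.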

jling · February 23, 2023, 6:06pm

A tangent, and I already asked about this on Slack: it would be really great if one day something with Query.jl syntax could give us graph optimization (including making intermediate variables) before computation. Of course, this is not in the scope of DataFrames.jl.

klwlevy · February 23, 2023, 6:36pm

I absolutely love DataFrames.jl, and I am very impressed with @bkamins's constant efforts to make it better AND to take the time to explain the features to a wider audience, as in this thread, for example. Looking forward to tomorrow’s blog post.

Superblog
