Pola.rs vs DataFrames.jl

There seems to be quite a lot of activity in the Python community around using pola.rs to get fast dataframes with the aid of Rust. As a frequent DataFrames.jl user, I would be curious to hear @bkamins's opinion on this “competitor”.

Developer explaining the package

pola.rs homepage

2 Likes

Polars is an excellent package that is actively developed and has decent performance.

There is an error in the comparison you shared: DataFrames.jl has the MIT License.

7 Likes

That table is mostly meaningless. It only reflects that Polars attracts a wider audience, which is obvious since it is a package for both Rust and Python!

I think DataFrames.jl alone is somewhat competitive with Polars, but less feature complete (by design?). DataFrames.jl in combination with other packages from the ecosystem, like CSV.jl, Arrow.jl, or a few other processing packages, makes our setup in Julia nearly as good, though perhaps less user friendly.

Polars has some cool built-in functions, like groupby_dynamic, which would be nice to have.

Another example: I have been using DuckDB lately, and it has a solid (and improving) Polars integration, while the Julia integration is lagging behind a bit.

1 Like

We are open to adding any features that are doable in DataFrames.jl. If something is needed, could you please open an issue to discuss how it would fit?

2 Likes

Yes, Polars seems excellent (and Pandas in some ways too; it still has more features, though I dislike much of its syntax, at least in Python).

There is a wrapper for Pandas, but not (yet) for Polars. I mean a wrapper for the Rust code; you can call the Python package and get indirect access that way (is the Python wrapper complete?). I suppose one could be made, though supporting all of it would take some work. But would it make sense, and would it make sense to use it with DataFrames.jl? I don’t see any inherent reason for Polars to be better than Julia (faster, or with more features), so would a wrapper make sense, say, just for some additional features?

Inherent reasons:

  • Polars is Arrow-based only, which means it is less general but can be optimized more for this specific case (strings are one of the major areas; note that Pandas 2.0 also goes this way for the same reason);
  • Polars has lazy evaluation; DataFrames.jl does not have it and will likely not have it;
  • On the other hand, DataFrames.jl can be optimized more easily than Polars for custom Julia types (not present in Arrow) and for the execution of functions defined in Julia.
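
To make the lazy evaluation point concrete, here is a toy sketch in plain Python. It is not how Polars is implemented, and `LazyTable` and its methods are hypothetical names invented for this illustration. The idea it shows: operations are only recorded when called, and the work happens in a single pass over the data when `collect()` is invoked, which is what gives a lazy engine the chance to inspect and optimize the whole query first.

```python
class LazyTable:
    """Toy lazy table (hypothetical): records steps, runs them on collect()."""

    def __init__(self, rows, steps=()):
        self._rows = rows            # list of dicts, one dict per row
        self._steps = tuple(steps)   # recorded (kind, argument) pairs

    def filter(self, predicate):
        # Only record the step; no rows are touched here.
        return LazyTable(self._rows, self._steps + (("filter", predicate),))

    def select(self, *columns):
        # Only record the step; no rows are touched here.
        return LazyTable(self._rows, self._steps + (("select", columns),))

    def explain(self):
        # The recorded plan, in the order the steps will be applied.
        return [kind for kind, _ in self._steps]

    def collect(self):
        # Apply all recorded steps row by row in a single pass,
        # without building an intermediate table per step.
        result = []
        for row in self._rows:
            keep = True
            for kind, arg in self._steps:
                if kind == "filter":
                    if not arg(row):
                        keep = False
                        break
                else:  # "select"
                    row = {name: row[name] for name in arg}
            if keep:
                result.append(row)
        return result


rows = [
    {"city": "Oslo", "temp": 3, "wind": 7},
    {"city": "Rome", "temp": 18, "wind": 2},
    {"city": "Cairo", "temp": 27, "wind": 4},
]

# Building the query does no work yet; it only records the plan.
query = LazyTable(rows).filter(lambda r: r["temp"] > 10).select("city")
plan = query.explain()

# The computation happens here.
warm = query.collect()
```

An eager data frame would instead materialize the filtered table first and then the selected one. A real lazy engine goes further than this sketch and rewrites the recorded plan before running it.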

I think it would make sense, so that users have a choice.


A general comment:

  • I maintain DataFrames.jl because I believed Julia users need a decent data frame package. However, I would have no problem if in the future some better package replaced it, PROVIDED THAT it would not introduce limitations that Julia users would not want to accept. Simply put, 5 years ago I believed that, without an investment in DataFrames.jl development, preprocessing data in Julia was a pain, and I needed it. Having said that, I am convinced that DataFrames.jl is pretty feature complete (not 100%, but pretty close; as an example, Tidier.jl has recently started being developed with DataFrames.jl as a backend, and I have not received any significant complaint that we miss something really important that would hinder porting the tidyverse to Julia).
  • In general I believe that data frame functionality is super important, but as a “side kick”, as I call it (i.e. data pre- and post-processing is almost never the thing one wants to do; it is something one has to do :smile:. These are support processes that need to be there and need to be good enough). I think that the key thing about Julia is to make sure we provide the right “core” packages that do real data science operations best in class (like ML, optimization, simulation, etc.).

20 Likes

I don’t think that groupby_dynamic fits well within DataFrames.jl. It would be better for a time series or panel data package (right now I just do it manually in 2 steps).
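
One possible reading of "manually in 2 steps" is: first derive a window key for every row, then run an ordinary group-by on that key. The sketch below shows this in plain Python rather than DataFrames.jl; the function names `window_start` and `group_by_window`, the sample data, and the choice of the mean as the aggregate are all assumptions made for illustration. It only covers non-overlapping, fixed-width windows, which is a simplification of what a dedicated dynamic group-by offers.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def window_start(ts, every, origin):
    """Step 1: map a timestamp to the start of its fixed-width window."""
    n = (ts - origin) // every  # number of whole windows since origin
    return origin + n * every


def group_by_window(records, every, origin):
    """Step 2: group on the derived window key and take the mean per group."""
    groups = defaultdict(list)
    for ts, value in records:
        groups[window_start(ts, every, origin)].append(value)
    return {start: sum(vals) / len(vals) for start, vals in sorted(groups.items())}


# Hypothetical sample data: (timestamp, measurement) pairs.
records = [
    (datetime(2023, 1, 1, 0, 5), 1.0),
    (datetime(2023, 1, 1, 0, 50), 3.0),
    (datetime(2023, 1, 1, 1, 10), 10.0),
    (datetime(2023, 1, 1, 2, 59), 4.0),
]

hourly_mean = group_by_window(
    records, every=timedelta(hours=1), origin=datetime(2023, 1, 1)
)
```

Keeping the window key as a separate derived column is what lets the second step reuse a general-purpose group-by unchanged.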

A tangent, and I already asked this on Slack: it would be really great if one day something with Query.jl syntax could give us graph optimization (including creating intermediate variables) before computation. Of course, that is not in the scope of DataFrames.jl.

1 Like

I absolutely love DataFrames.jl and I am very impressed with @bkamins's constant efforts to make it better AND to take the time to explain the features to a wider audience, as in this thread for example. Looking forward to tomorrow’s blog post.

Superblog

10 Likes