Release announcements for DataFrames.jl

I’m a bit surprised by the change to String. The bike shed was so much more beautiful with the interned blue string of Symbol. Is there a short rationale for the change?

Are Strings the future? Should we get used to typing "col_name" instead of :col_name in anticipation of an eventual deprecation?

Should we get used to typing "col_name" instead of :col_name in anticipation of an eventual deprecation?

No - both strings and Symbols are allowed and will be allowed.

In general, the policy after the 0.21 release is not to break functionality in DataFrames.jl (this is not carved in stone of course, but I will oppose any breaking change without a very good justification). The objective is that all code that works under 0.21 without printing deprecation warnings should keep working in the long term.

Is there a short rationale for the change?

The rationale is the following:

  1. we allow Symbols and will keep allowing them; in fact, internally we work and will keep working with Symbols, as they are faster.
  2. it is very convenient to be able to work with strings: compare e.g. df."some column" with the ways you had to use in the past to get a column with a space in its name (the simplest is df.var"some column", so it is not that bad, but most users do not know that this is possible)
  3. People migrating from R/Python naturally tend to use strings, and DataFrames.jl is an entry-level package; explaining the difference between strings and Symbols, and why you are not allowed to do string operations on Symbols, is not something you want to do to non-computer scientists during an introductory lecture.

We are aware that this comes at the cost of being a bit overly flexible, but again, strings are just an opt-in feature, so if you do not like them, do not use them (I do not use them in my own code except in very special cases).
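
To make the comparison concrete, here is a minimal sketch of the equivalent ways to reach a column whose name contains a space. The data frame and its column names are made up for illustration.

```julia
using DataFrames

# Hypothetical example data; one column name contains a space
df = DataFrame("some column" => [1, 2, 3], "x" => [4, 5, 6])

by_string  = df."some column"                 # string property access
by_var     = df.var"some column"              # Symbol access via the var"" macro
by_index   = df[!, "some column"]             # string indexing
by_symbol  = df[!, Symbol("some column")]     # explicit Symbol conversion
```

All four forms return the same underlying column vector, so the choice between strings and Symbols here is a matter of convenience.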

I love it! I found that working with columns with spaces was very verbose due to the requirement for explicit Symbol conversion. (I wasn’t aware of the var trick, that’s pretty cool!)

Nice.

Is there any benchmark showing the speed (and memory consumption) of reshaping with DataFrames.jl vs R’s data.table?
I mean converting big dataframes from long to wide and vice versa.

Database-like ops benchmark (and you should probably look at the bars from the second run, as the first run includes compilation time)

These benchmarks are incredible… congrats!

But those benchmarks don’t include reshaping, just “grouping by” and “joining”.

I don’t think there is any benchmark for re-shaping, unfortunately.
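
In the absence of a published reshaping benchmark, one can time the round trip locally. This is only a rough sketch: the table, its size n and its column names are hypothetical, and plain @elapsed is used rather than a proper benchmarking harness. The first calls act as a warm-up so that compilation time is not included in the timings.

```julia
using DataFrames

n = 100_000  # hypothetical table size
wide = DataFrame(id = 1:n, a = rand(n), b = rand(n))

# First calls: produce the results and warm up compilation
long  = stack(wide, [:a, :b])                      # wide -> long
wide2 = unstack(long, :id, :variable, :value)      # long -> wide

# Second calls: timed without compilation overhead
t_stack   = @elapsed stack(wide, [:a, :b])
t_unstack = @elapsed unstack(long, :id, :variable, :value)
println("stack: ", t_stack, " s, unstack: ", t_unstack, " s")
```

The numbers depend heavily on the machine and the shape of the data, so they say nothing about how DataFrames.jl compares with data.table.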

What will be the relationship between DataFrames and DataFramesMeta over time? Will DataFrames incorporate the functionalities of DataFramesMeta in the future?

Honestly, from what I hear, it has already incorporated quite a few in select.

They will be separate packages. @pdeffebach is working towards a major upgrade of DataFramesMeta.jl functionality. Here is the discussion: “Plan for DataFrames 1.0”, Issue #148 in JuliaData/DataFramesMeta.jl on GitHub.

DataFrames.jl 0.22.0 release is out. You can check out the detailed release notes here and the release here.

First let me thank the people who worked on it. There are very many contributors, so here I list only those who contributed a merged PR between the 0.21 and 0.22 releases (the list is very long even with this filter): Alexey Stukalov, Arsh Sharma, Baurzhan Muftakhidinov, Bogumił Kamiński, Daniel Molina, David Nies, Jacob Quinn, Jonas Schulze, Kevin Bonham, Logan Kilpatrick, Matthieu Gomez, Milan Bouchet-Valat, Morten Piibeleht, Nicholas Ritchie, Nick Eubank, Nils Gudat, Okon Samuel, Paulito Palmes, Peter Deffebach, Peter Shintech, Ronan Arraes Jardim Chagas, Takafumi Arakaki, Tom Kwong, Tyler Beason, Wolf Thomsen, Zhuo Jia Dai.

The 0.22 release is intended to be the last release before 1.0; our intention is to avoid further breaking changes and to make the 1.0 release relatively soon. Therefore you can safely assume that what works under 0.22 and is not deprecated (remember to use --depwarn=error in production code) will work long-term.
Also please keep in mind that display changes are not considered to be breaking.

The major changes in this release are (I am listing only breaking changes, as there are dozens of additions of functionalities — too many to list here):

  • the package is precompiled aggressively (this means that precompilation takes ~30 seconds when the package is installed), but “time to first data frame” will be reduced
  • PrettyTables.jl is now the default back-end to print DataFrames to text/plain; the print option splitcols was removed and the output format was changed
  • the list of provided DataFrame constructors has been significantly restricted
  • the rules for transformations passed to select/select!, transform/transform!, and combine have been made consistent and more flexible; in particular now it is allowed to return multiple columns from a transformation function
  • The dependency on CategoricalArrays.jl is deprecated (which means that in 1.0 release we will completely drop this dependency; this should also help with latency in particular, though CategoricalArrays.jl got much better in this area recently)
  • in joins, passing NaN or real or imaginary -0.0 in the on column now throws an error; passing missing throws an error unless the matchmissing=:equal keyword argument is passed
  • unstack now produces row and column keys in the order of their first appearance and has two new keyword arguments allowmissing and allowduplicates
  • in describe the specification of custom aggregation is now function => name; old name => function order is now deprecated
  • All(args...) is deprecated, use Cols(args...) instead (except that All() is still allowed)
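
A few of the changes above are easier to see in code. The following is a minimal sketch, not an excerpt from the release notes; all data frames and column names are invented for illustration.

```julia
using DataFrames

df = DataFrame(g = [1, 1, 2], x = [1.0, 2.0, 3.0], y = [10, 20, 30])

# One transformation returning several columns: the function returns a
# NamedTuple and the AsTable target spreads it into columns lo and hi.
ranges = combine(groupby(df, :g),
                 :x => (x -> (lo = minimum(x), hi = maximum(x))) => AsTable)

# describe with a custom aggregation in the new `function => name` order
stats = describe(df, :mean, sum => :total)

# Cols(...) in place of the deprecated All(args...)
sel = select(df, Cols(:g, :y))

# missing in the `on` column is an error unless matchmissing = :equal is passed
left   = DataFrame(id = [1, missing], a = ["p", "q"])
right  = DataFrame(id = [1, missing], b = ["r", "s"])
joined = innerjoin(left, right, on = :id, matchmissing = :equal)

# unstack keeps row and column keys in the order of their first appearance
long = DataFrame(id = [2, 2, 1, 1], key = ["b", "a", "b", "a"], value = 1:4)
wide = unstack(long, :id, :key, :value)
```

Here wide has the columns id, b, a and the rows for id 2 and then id 1, which is the order in which those keys first occur in long.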

What is planned for the future (without guarantees about what will make it into the 1.0 release, as many of these things are hard and experimental; I am listing here only a limited number of things, see issues/PRs in the package repository for a complete view):

  • remove all deprecations
  • improve join performance
  • use multithreading in split-apply-combine
  • add proprow specifier in transformations (like nrow but calculating proportions)
  • add RowNumber virtual source column in transformations
  • add AsVector wrapper (like AsTable but passing arguments as a vector to a function)
  • add where function (like filter but consistent with other transformation functions)
  • more flexible stack/unstack (in particular unstacking on multiple columns and multiple values)

Ecosystem changes:

  • If you are maintaining a package that has DataFrames.jl as a dependency please update the Project.toml to allow 0.22 version
  • I will update the tutorials soon (I will post when it is done; however, first some packages need to be updated to allow DataFrames.jl 0.22).
  • It is also recommended to update the dependency on CategoricalArrays.jl to the 0.9 release of this package, as it significantly reduces the number of introduced method invalidations.

I hope you will enjoy using new DataFrames.jl!

the list of provided DataFrame constructors has been significantly restricted

Could you please expand on this point a little?

The details are here https://github.com/JuliaData/DataFrames.jl/pull/2464. The most important rule we have now: if you pass a single positional argument to DataFrame then it is considered to be a Tables.jl table (with some minor exceptions that are generally considered obvious corner cases in which it is desirable to behave differently).
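
As an illustration of the single-positional-argument rule, here is a small sketch with made-up data. A vector of NamedTuples and a NamedTuple of vectors are both Tables.jl tables, so each can be passed as the only positional argument; the keyword and Pair constructors are shown only for comparison.

```julia
using DataFrames

# Row table: a vector of NamedTuples, passed as a single positional argument
from_rows = DataFrame([(a = 1, b = "x"), (a = 2, b = "y")])

# Column table: a NamedTuple of vectors, also a single positional argument
from_cols = DataFrame((a = [1, 2], b = ["x", "y"]))

# For comparison: keyword and Pair constructors build the same data frame
from_kwargs = DataFrame(a = [1, 2], b = ["x", "y"])
from_pairs  = DataFrame(:a => [1, 2], :b => ["x", "y"])
```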

While we are at it, check out the Database-like ops benchmark for the performance of the 0.22 release on groupby. Thanks to @nalimilan:

  • we are now competitive on small data frames (0.5GB); essentially, among “normal” packages we are on par with data.table and faster than everything else (I am excluding here and below things like ClickHouse or cuDF), and we are not using multithreading yet (@nalimilan is working on it)
  • medium size (5GB): we are on par with data.table for small/medium number of groups; but we are still slower if you have very many groups
  • large size (50GB): the same situation as for medium size - we are very good for small number of groups but have problems with cases when there are a lot of small groups (again - we need to think if we can improve here in the future)

That’s really impressive! I’m curious if there are any plans to expand utilities like groupby to work on other Tables? Most notably, arrays of named tuples and NT-like objects. Basically, a DataFrame is a canonical example of column-based table, and Array{NamedTuple{...}} is the same for row-based. Would be great if they could enjoy the same operations, e.g. groupby, written in a performant way. Do you think it’s feasible to reuse existing implementation for this?

The API could be reused, if we think we like it. The implementation can be reused by any column-storage table, but row-storage tables will require a different implementation to be fast I think.

Could someone explain what’s behind the deprecation of CategoricalArrays, and what the suggested approach to categorical data is now? The notes say that transform(df, cols .=> categorical .=> cols) is now preferred to the previous use of the categorical function, but that seems awkward, and I also wonder if we should be doing something else entirely.

I looked at the pull request associated with the deprecation, and a chain of prior pulls that were referenced. They showed that removing the dependence was a goal, but I didn’t see why.

The rationale is the following. CategoricalArrays.jl functionality is orthogonal to DataFrames.jl functionality. Therefore DataFrames.jl should not depend on CategoricalArrays.jl.

This does not mean that it is discouraged to use these packages in combination. Actually there are loads of tests that test the integration.

The point is that essentially all functionality of DataFrames.jl can be implemented in terms of DataAPI.jl without depending on CategoricalArrays.jl directly. The only user-visible case where this does not hold is the categorical function. The current design of Julia does not allow for conditional dependencies, so we had to deprecate categorical(df, cols). The call transform(df, cols .=> categorical .=> cols) does what categorical(df, cols) did previously, hence the recommendation. I know it is longer, but it could not be worked around.

Additionally, in the past CategoricalArrays.jl was a major offender in terms of the number of method invalidations it introduced, and this was the original significant reason for the decoupling. With the CategoricalArrays.jl 0.9 release this is no longer the case, but it turned out that we only need to sacrifice the categorical method to drop the dependency, so it was judged to be worth it, as DataFrames.jl is a kind of “core” package and should be as lightweight as possible.
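
For reference, the recommended replacement can be sketched as follows. The data frame and its column names are hypothetical; the one thing the example assumes is that CategoricalArrays.jl is loaded explicitly by the user.

```julia
using DataFrames
using CategoricalArrays  # loaded explicitly to get `categorical`

# Hypothetical data with two string columns to convert
df = DataFrame(id = 1:4, grp = ["a", "b", "a", "c"], lvl = ["lo", "hi", "hi", "lo"])
cols = [:grp, :lvl]

# Replaces the deprecated categorical(df, cols): each listed column is
# passed through `categorical` and written back under its own name.
df2 = transform(df, cols .=> categorical .=> cols)
```

The untouched id column is carried over as is, and the original df is not modified, because transform returns a new data frame.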

Thanks for the info. Does that mean that a module that does its own using CategoricalArrays can continue to use, e.g., the categorical! function with a DataFrame?