Release announcements for DataFrames.jl

DataFrames.jl 1.2.0 is out. Here you can find the release notes. I have also written a blog post explaining the key user-visible changes it introduces.

DataFrames.jl 1.3.0 is out.

It is a major release, much bigger than recent ones. I hope we have managed to fill all the key gaps in the package and make it feature complete.

Development towards 1.4.0 will continue by adding additional features requested by the users. I expect to have this release around JuliaCon 2022 (unless something unexpected happens).

Here you can find the detailed release notes. See also NEWS.md for a list of relevant changes in the package.

Let me briefly summarize the most important changes and additions (in total, 125 PRs were merged since the 1.2.2 release, which is a lot). The summary is brief and assumes you know the functionality of the package; I will soon write a blog post explaining these changes for newcomers:

  • in groupby users now have more control over the resulting group order (previously groupby produced, by default, the group ordering that is fastest to create, which is unintuitive in certain use cases; the sort keyword argument has been improved and gives the user more control when desired);
  • if a SubDataFrame was created with the : column selector (i.e. it contains all columns of its parent) then you can add new columns to such a data frame in all functions (the filtered-out rows get filled with missing);
  • delete! is deprecated in favor of deleteat!, fixing an inconsistency with how these functions are used in Julia Base;
  • leftjoin! is added allowing for in-place joining of data frames (and it is fast)
  • in source .=> transformation .=> destination form of the transformation minilanguage the Cols, Between, All and Not selectors support broadcasting;
  • fix a bug in the handling of keyword arguments in sorting-related functions that in some cases allowed passing tuples (support for which was removed in the 1.0 release) and in other cases led to a stack overflow;
  • transformations having a form AsTable(...) => ByRow(sum) (and other standard reduction functions) are now fast even when many columns are selected (solving a long standing performance bottleneck)
  • in the DataFrames.jl 1.4 release, on Julia 1.7 or newer, broadcasting assignment into an existing column of a data frame will replace it. Under Julia 1.6 or older it will remain an in-place operation (this is an unfortunate difference in behavior between versions of Julia; it is impossible to implement it differently due to limitations of Julia Base, which is why the discrepancy is clearly announced now and the change will take effect in DataFrames.jl 1.4).
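
A minimal sketch of a few of the items above, assuming DataFrames.jl 1.3 or newer. The data frames and column names (`df`, `rows`, `left`, `right`, `nums`, `:total`) are made up for illustration only.

```julia
using DataFrames

# groupby: sort=true orders groups by key; sort=false keeps order of appearance.
df = DataFrame(id = [3, 1, 2, 1], x = [10, 20, 30, 40])
gdf = groupby(df, :id; sort = true)
group_keys = [k.id for k in keys(gdf)]

# deleteat! (replacing the deprecated delete!) removes rows in place.
rows = DataFrame(a = [1, 2, 3])
deleteat!(rows, 1)

# leftjoin! adds the columns of `right` to `left` in place;
# rows of `left` without a match get `missing`.
left = DataFrame(id = [1, 2, 3], x = [10, 20, 30])
right = DataFrame(id = [1, 3], y = ["a", "c"])
leftjoin!(left, right; on = :id)

# Row-wise reduction over several columns via AsTable(...) => ByRow(sum).
nums = DataFrame(a = [1, 2], b = [3, 4])
totals = transform(nums, AsTable(:) => ByRow(sum) => :total)
```

Note that `leftjoin!` requires every row of the left table to match at most one row of the right table, since it cannot add rows in place.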

Before I wrap up let me thank everyone who contributed towards this release!

Hi, I didn’t notice the announcement of the underlying change in Julia 1.7. Can you give me a pointer?

https://github.com/JuliaLang/julia/pull/39473

The point is that in Julia 1.6 x.y .= z first takes y from x and then broadcasts z into it, while since Julia 1.7 the operation can be handled as a whole (not in two separate steps).

This is similar to the x[y] .= z pattern, which has existed for a long time: Julia treats this expression as a whole rather than first making the x[y] selection and then broadcasting z into it (which clearly would not be useful).

The consequence for DataFrames.jl is that when you write df.col .= value, on Julia 1.7 we have full control over how this expression is resolved.
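
To make the consequence concrete, here is a small sketch assuming DataFrames.jl 1.4 or newer running on Julia 1.7 or newer (under Julia 1.6 the first assignment would instead try to write `1.5` into the existing integer vector and fail). The names `df` and `old` are illustrative.

```julia
using DataFrames

df = DataFrame(x = [1, 2, 3])
old = df.x                 # reference to the original Vector{Int}

# Handled as a whole: a new column is allocated and replaces :x,
# so the element type is free to change.
df.x .= 1.5
new_eltype = eltype(df.x)

# The original vector is no longer the column and is left untouched.
replaced = df.x !== old

# To write into the existing column in place, use the `:` row selector.
current = df.x
df[:, :x] .= 2.0
in_place = df.x === current
```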

Thanks!

The following tutorials were updated to DataFrames.jl 1.3:

Since the list is long please open an issue if there is some bug in them.

After many months of hard work DataFrames.jl 1.4.0 has been released. There were 98 PRs included in this release (not including patch release commits, and we had 6 such releases), authored by: Alex Arslan, alfaromartino, anand jain, Bogumił Kamiński, Eric Hanson, jariji, Joseph Wilson, Lilith Orion Hafner, Martijn Visser, Milan Bouchet-Valat, Mo-Gul, musvaage, reumle, Rik Huijzer, Ronan Arraes Jardim Chagas, Stefan Krastanov, Yakir Luc Gagnon (I used the names provided in GitHub commits). Numerous people also opened issues and took part in the discussions. I would like to thank them all. Among them @nalimilan must be mentioned, as he reviewed every PR that was made.

This is one of the biggest releases made. The number of PRs is large, but most importantly several important improvements were made. You can find all changes in the 1.4 release and 1.3.x patch releases in NEWS.md. Some of the changes involved hundreds of comments and discussions and changes in the whole JuliaData ecosystem.

Here let me highlight major changes (I drop minor improvements and bug fixes as there are too many of them):

  • DataFrames.jl 1.4.0 requires at least Julia 1.6; if you use an older version of Julia, use DataFrames.jl 1.3.6, which is in maintenance mode (so if you find any bugs please report them and I will make a patch release);
  • unstack now supports the combine keyword argument, which turns this function into a pivot table (allowing for aggregation of data when unstacking);
  • added full support for the “data frame as a collection of rows” functionality, with the functions: reverse!, permute!, invpermute!, shuffle, shuffle!, resize!, keepat!, pop!, popfirst!, popat!, pushfirst!, insert!, prepend!;
  • on Julia 1.7 or newer broadcasting assignment into an existing column of a data frame replaces it. Under Julia 1.6 or older it is an in-place operation (#3022);
  • added special syntax for eachindex, groupindices, and proprow to the transformation mini-language;
  • DataFrame is now a mutable struct; this change makes DataFrame objects serialized under earlier versions of DataFrames.jl incompatible with version 1.4;
  • added table-level and column-level metadata support.
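
A short sketch of three of the additions above, assuming DataFrames.jl 1.4 or newer. The data, the metadata keys (`"source"`, `"unit"`) and their values are invented for the example.

```julia
using DataFrames

# unstack with combine: duplicate (id, key) pairs are aggregated,
# which makes unstack behave like a pivot table.
long = DataFrame(id  = [1, 1, 1, 2],
                 key = ["a", "a", "b", "a"],
                 val = [1, 2, 3, 4])
wide = unstack(long, :id, :key, :val; combine = sum)

# Table-level and column-level metadata.
metadata!(long, "source", "survey"; style = :note)
colmetadata!(long, :val, "unit", "kg"; style = :note)
src  = metadata(long, "source")
unit = colmetadata(long, :val, "unit")

# Data frame as a collection of rows.
rows = DataFrame(a = [1, 2, 3])
pushfirst!(rows, (a = 0,))   # add a row at the front
removed = pop!(rows)         # remove and return the last row
```

Here the `(1, "a")` pair occurs twice, so without `combine` the call would raise an error about duplicate entries; with `combine = sum` the two values are added.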

Between DataFrames.jl 1.4.0 and 1.3.0 the following major compatibility changes are made:

  • Compat.jl 4.2 from 3.17
  • PrettyTables.jl 2.1 from 0.12, 1
  • SnoopPrecompile.jl 1 from no dependency

These changes might cause version conflicts when adding packages (minor and patch version bumps of other dependencies were also made, but they are unlikely to make the Julia package manager complain).

On my blog I will be publishing posts in the coming weeks that explain the most important changes in DataFrames.jl 1.4.0 using practical examples.

Also soon all curated tutorials will be updated to DataFrames.jl 1.4.0. I will make a post when this is done.

Many thanks, Bogumił, and all contributors who have worked steadily on this release! DataFrames.jl is a workhorse delivering no-nonsense, dependable, efficient results. I am also grateful for its excellent documentation, both in-package and through other resources such as your blog. It is a delight, and one of the chief reasons I made the transition from R to Julia.

For anyone needing to convert DataFrame objects from version 1.3 to 1.4 (e.g. if you serialized your objects for short-term storage):

The easiest solution is to call Tables.columntable on the DataFrame objects created under DataFrames.jl 1.3 and serialize the result. Next, upgrade DataFrames.jl to 1.4, deserialize the NamedTuple, and convert it back to a DataFrame.
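
The round trip could look roughly like the sketch below. It runs both steps in a single session for illustration; in practice step 1 runs under DataFrames.jl 1.3 and step 2 after the upgrade. It assumes Tables.jl is available in the environment, and the file path and example data are placeholders.

```julia
using DataFrames, Serialization, Tables

df = DataFrame(a = 1:3, b = ["x", "y", "z"])
path = tempname()    # placeholder location for the serialized file

# Step 1 (under DataFrames.jl 1.3): store the columns as a plain
# NamedTuple of vectors, which does not depend on DataFrame internals.
serialize(path, Tables.columntable(df))

# Step 2 (after upgrading to DataFrames.jl 1.4): read the NamedTuple
# back and rebuild the data frame from it.
nt = deserialize(path)
restored = DataFrame(nt)
```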

All tutorials referenced in Introduction · DataFrames.jl have been updated to DataFrames.jl 1.4.0.

Some conclusions from the process:

  • we have accumulated over the years a lot of curated tutorials. I am really convinced, after going through them while updating, that if someone carefully studies them it is sufficient to confidently work with DataFrames.jl.
  • the updating process mostly required adding new functionalities and fixing the explanation of the broadcasting assignment rule (PR #3022, “make broadcasting assignment consistent with !”); other than that it was smooth (which is a good sign :smile:).
  • PrettyTables.jl with the HTML backend works really nicely; thank you @Ronis_BR for working on it (I have opened some issues about things I noticed when going through loads of outputs; they can be used as ideas for further improvements).

Minor typos in the documentation (the examples are not displayed correctly, ```jldoctest was missing)

https://dataframes.juliadata.org/stable/lib/functions/#Base.invpermute!
https://dataframes.juliadata.org/stable/lib/functions/#Base.permute!
https://dataframes.juliadata.org/stable/lib/functions/#Random.shuffle
https://dataframes.juliadata.org/stable/lib/functions/#Random.shuffle!

I corrected them, but I don’t know how modifications to the documentation work. So, let me know if everything is fine.

Thanks for your hard work!!

DataFrames.jl 1.5.0 is out.

You can find a list of all changes since 1.4.4 here and a summary of the most important additions in NEWS.md.

Here let me briefly summarize the most important things that will affect almost everyone using DataFrames.jl:

  • DataFrames.jl is Julia 1.9 ready; we have improved precompilation so that things will be more snappy;
  • groupby now fully supports all kinds of sorting options that allow for specifying the resulting group order;
  • joining functions now support the order keyword argument, allowing the user to specify the order of the rows in the produced table (this is a long-requested convenience feature);
  • improved Cols column selector (allowing any set operation to be performed on the passed arguments and allowing multiple predicate functions that perform column selection to be passed).
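
A brief sketch of the last three items, assuming DataFrames.jl 1.5 or newer. The tables and column names are invented; the `sort` NamedTuple and the `operator` keyword are used here as I understand them from the package documentation.

```julia
using DataFrames

left  = DataFrame(id = [3, 1, 2], l = ["c", "a", "b"])
right = DataFrame(id = [1, 2, 3], r = [10, 20, 30])

# order=:left keeps the row order of the left table in the result.
joined = innerjoin(left, right; on = :id, order = :left)

# sort accepts a NamedTuple of sorting options, e.g. reversed key order.
gdf = groupby(left, :id; sort = (rev = true,))
group_keys = [k.id for k in keys(gdf)]

# Cols with operator=intersect selects the columns matched by all selectors.
wide = DataFrame(x1 = 1, x2 = 2, y1 = 3)
picked = names(wide, Cols(r"x", r"1"; operator = intersect))
```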

The precompilation support in DataFrames.jl has two modes:

  • full precompilation;
  • no precompilation.

The default is full precompilation. In this mode the package should precompile in around 50 seconds and then its load time should be around 1.8 seconds. The benefit of full precompilation is that later commonly used functions do not need to be compiled so that you will have a more responsive experience.

The no precompilation mode disables precompilation. Then the package precompiles in around 5 seconds, and its load time is under 1 second. The downside is that later every function needs to be compiled when it is used.

To give you a flavor of the difference, the following example code:

```julia
using DataFrames
df = DataFrame(rand(5, 3), :auto)
combine(df, :x1 => sum)
combine(df, All() .=> minimum)
df.id = [1, 1, 2, 2, 2];
gdf = groupby(df, :id);
transform(gdf, AsTable(Cols(r"x")) => ByRow(sum))
```

runs in around 4.4 seconds without precompilation and 2.4 seconds with precompilation (note that timings include package load time).

Instructions for turning precompilation on or off are given here. Note that this can be done on a per-project-environment basis.

All tutorials listed in Introduction · DataFrames.jl are now updated to DataFrames.jl 1.5.0 and Julia 1.9.

DataFrames.jl 1.6.0 has just been released (so it can be field tested by users before JuliaCon2023 :smile:).

This release focused mostly on code cleanup, improving API consistency, and integration issues. You can find the list of user-visible changes here and of all changes here.

I want to highlight three changes (the first two are things that are likely to be often used in daily work with DataFrames.jl; the third potentially could break some existing code - this is unlikely, but users should be aware of the risk):

  • Improvement of the convenience of using the Not selector: it now allows passing multiple positional arguments that are treated as if they were wrapped in Cols and does not throw an error when a vector of duplicate indices is passed when making column selection
  • DataFrame constructor now allows passing column names that replace the names generated by default
  • All Tables.AbstractRow subtypes are now treated in the same way as DataFrameRow in all operations; this could be minimally breaking if users relied on Tables.AbstractRow being treated as a scalar by combine (the change follows requests arguing that treating Tables.AbstractRow as a scalar borders on being a bug)
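
As a quick illustration of the first item, here is a sketch assuming DataFrames.jl 1.6 or newer; the data frame and its column names are made up.

```julia
using DataFrames

df = DataFrame(a = 1, b = 2, c = 3)

# Multiple positional arguments to Not are treated as if wrapped in Cols.
kept = names(select(df, Not(:a, :b)))

# A vector with duplicate entries no longer throws during column selection.
kept_dup = names(select(df, Not([:a, :a])))
```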

The list of functionalities planned for the 1.7 release can be found in the 1.7 milestone on GitHub.

From strength to strength. Comprehensive write-up. Does Bogumil sleep?

This write-up about DataFrames.jl in the JOSS journal is outstanding. The detailed discussion of the design choices is very informative. I have been using DataFrames.jl for many years, but this article adds a new perspective on various nuances in the package.

Kudos to Bogumil and Milan.

I think he had a brief nap back in 2014, but not since then.