Release announcements for DataFrames.jl

Is there any “cheat sheet” newer or more complete than this one?
https://www.ahsmart.com/pub/data-wrangling-with-data-frames-jl-cheat-sheet/
It uses DataFrames.jl v0.22.

@Juan - it should be mostly OK. You could open an issue on https://github.com/tk3369/www.ahsmart.com to ask for an update. Thank you!

I am trying to update to DataFrames 1.1.0 using
Pkg.update("DataFrames")
and checking the package with
Pkg.status("DataFrames")
but it's not getting updated. I am still seeing v0.22.7.

Can you please help?

Do ] add DataFrames@1.1.0 and read the error message closely. It will tell you which package is holding back compatibility in your environment.

Updating registry at C:\Users\harne\.julia\registries\General
Updating registry at C:\Users\harne\.julia\registries\JuliaComputingRegistry
Resolving package versions…
ERROR: Unsatisfiable requirements detected for package ScikitLearn [3646fa90]:
ScikitLearn [3646fa90] log:
├─possible versions are: 0.5.0-0.6.3 or uninstalled
├─restricted to versions * by an explicit requirement, leaving only versions 0.5.0-0.6.3
└─restricted by compatibility requirements with DataFrames [a93c6f00] to versions: uninstalled — no versions left
└─DataFrames [a93c6f00] log:
├─possible versions are: 0.11.7-1.1.0 or uninstalled
└─restricted to versions 1.1.0 by an explicit requirement, leaving only versions 1.1.0

Still
Pkg.status("DataFrames")
Status C:\Users\harne\.julia\environments\v1.6\Project.toml
[a93c6f00] DataFrames v0.22.7

After removing the ScikitLearn package, the DataFrames package was updated using your code. Thanks a lot for the help.

Congrats for reaching version 1.0!!!
It’s a major contribution and an important step toward settling the “Is Julia production-ready?” question for general data science use!
Thanks!!!

A promised blog post about filter vs subset is here.

I am upgrading to DataFrames 1.0 this weekend.

Previously, leftjoin(df1, df2, on=:key) resulted in a DataFrame with rows ordered the same as df1. I know it was documented that this could change, but I also bet I wasn’t the only one that had code relying on it.

To the others who were relying on it, what do you do now? Make an index column and sort!?
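
The workaround suggested in the question (an index column plus a sort) could be sketched roughly as follows. The data and the helper column name rowid are made up for illustration, and this is only one possible approach, not an official recommendation.

```julia
using DataFrames

# Hypothetical example data; `rowid` is an illustrative helper column name.
df1 = DataFrame(key=[3, 1, 2], a=["c", "a", "b"])
df2 = DataFrame(key=[1, 2, 3], b=[10, 20, 30])

# Remember the original row order of df1 before joining.
left = copy(df1)
left.rowid = collect(1:nrow(left))

joined = leftjoin(left, df2, on=:key)

# Restore the row order of df1, then drop the helper column.
sort!(joined, :rowid)
select!(joined, Not(:rowid))
```

Working on a copy keeps df1 itself free of the helper column.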

Could you please elaborate on this in https://github.com/JuliaData/DataFrames.jl/issues/2753? I would add the kwargs I discuss there relatively quickly once we reach a consensus on which options we want for which joins.

For those interested (as many ask about it): in The state of DataFrames.jl H2O benchmark - #14 by bkamins I have summarized the conclusions from the latest H2O benchmark.

DataFrames.jl 1.2.0 is out. Here you can find the release notes. I have also written a blog post explaining the key user visible changes it introduces.

DataFrames.jl 1.3.0 is out.

It is a major release, much bigger than recent releases. Hopefully, we have managed to fill in all the key missing parts of the package to make it feature complete.

Development towards 1.4.0 will continue by adding additional features requested by the users. I expect to have this release around JuliaCon 2022 (unless something unexpected happens).

Here you can find the detailed release notes. See also NEWS.md for a list of relevant changes in the package.

Let me briefly summarize the most important changes and additions (in total 125 PRs were merged since the 1.2.2 release, which is a lot). This will be brief, so it assumes you know the functionality of the package; I will soon write a blog post explaining these changes for newcomers:

  • in groupby users now have more control over the resulting group order (previously groupby by default produced the group ordering that is fastest to create, which is unintuitive in certain use cases; now the sort keyword argument is improved and gives the user more control if desired);
  • if a SubDataFrame was created with the : column selector (i.e. it contains all columns of its parent), then you can add new columns to such a data frame in all functions (the filtered-out rows get filled with missing values);
  • delete! is deprecated in favor of deleteat!, fixing the inconsistency with how these functions are used in Julia Base;
  • leftjoin! is added allowing for in-place joining of data frames (and it is fast)
  • in source .=> transformation .=> destination form of the transformation minilanguage the Cols, Between, All and Not selectors support broadcasting;
  • fix a bug in handling of keyword arguments in sorting-related functions that in some cases allowed passing tuples (support for which was removed in the 1.0 release) and in some other cases led to a stack overflow;
  • transformations having a form AsTable(...) => ByRow(sum) (and other standard reduction functions) are now fast even when many columns are selected (solving a long standing performance bottleneck)
  • In the DataFrames.jl 1.4 release, on Julia 1.7 or newer, broadcasting assignment into an existing column of a data frame will replace it. Under Julia 1.6 or older it will be an in-place operation. (This is an unfortunate difference in behavior between versions of Julia. It is impossible to implement it differently due to limitations of Julia Base, which is why this discrepancy is being clearly announced now; the change will take effect in DataFrames.jl 1.4.)
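
A few of the additions listed above can be tried out with a short sketch like the one below. It assumes DataFrames.jl 1.3 or newer; the data and column names are made up for illustration.

```julia
using DataFrames

# Made-up example data.
df = DataFrame(g=["b", "a", "b", "a"], x=[1, 2, 3, 4], y=[10, 20, 30, 40])

# groupby: sort=true asks for groups ordered by the grouping key.
gdf = groupby(df, :g; sort=true)
first_group_key = keys(gdf)[1].g

# AsTable(...) => ByRow(sum): row-wise sum over the selected columns.
df2 = transform(df, AsTable([:x, :y]) => ByRow(sum) => :total)

# leftjoin!: add the columns of `lookup` to df2 in place.
lookup = DataFrame(g=["a", "b"], label=["alpha", "beta"])
leftjoin!(df2, lookup; on=:g)

# deleteat! (the replacement for the deprecated delete!): drop the first row.
deleteat!(df2, 1)
```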

Before I wrap up let me thank everyone who contributed towards this release!

Hi, I didn’t notice the announcement of the underlying change in Julia 1.7. Can you give me a pointer?

https://github.com/JuliaLang/julia/pull/39473

The point is that x.y .= z in Julia 1.6 first takes y from x and then broadcasts z into it, whereas since Julia 1.7 the operation can be handled as a whole (not in two separate steps).

This is similar to x[y] .= z, which has existed for a long time: Julia treats this expression as a whole rather than first selecting x[y] and then broadcasting z into it (which clearly would not be useful).

The consequence in DataFrames.jl is that when you write df.col .= value, on Julia 1.7 or newer we have full control over how this expression is resolved.
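
As a rough illustration of what this means in practice, the sketch below assumes Julia 1.7 or newer together with DataFrames.jl 1.4 or newer (on older versions the first broadcast would be an in-place operation instead). The data is made up.

```julia
using DataFrames

df = DataFrame(a=[1, 2, 3])
old = df.a                # the Vector{Int} currently stored in df

# Assumes Julia 1.7+ and DataFrames.jl 1.4+: the column is replaced by a
# freshly allocated vector, so a Float64 value is accepted and `old` is untouched.
df.a .= 1.5

# Explicit in-place broadcasting into the existing column uses `:` row indexing.
col = df.a
df[:, :a] .= 2.5
```

Here df[:, :a] .= ... is the form to reach for when the existing vector must be reused.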

Thanks!

The following tutorials were updated to DataFrames.jl 1.3:

Since the list is long please open an issue if there is some bug in them.

After many months of hard work DataFrames.jl 1.4.0 has been released. There were 98 PRs included in this release (not including patch release commits, and we had 6 such releases) authored by: Alex Arslan, alfaromartino, anand jain, Bogumił Kamiński, Eric Hanson, jariji, Joseph Wilson, Lilith Orion Hafner, Martijn Visser, Milan Bouchet-Valat, Mo-Gul, musvaage, reumle, Rik Huijzer, Ronan Arraes Jardim Chagas, Stefan Krastanov, Yakir Luc Gagnon; I used the names provided on GitHub commits. There were also numerous people who opened issues and took part in the discussion. I would like to thank them all. Among them @nalimilan must be mentioned as he reviewed every PR that was made.

This is one of the biggest releases made. The number of PRs is large, but most importantly several important improvements were made. You can find all changes in the 1.4 release and 1.3.x patch releases in NEWS.md. Some of the changes involved hundreds of comments and discussions and changes in the whole JuliaData ecosystem.

Here let me highlight major changes (I drop minor improvements and bug fixes as there are too many of them):

  • DataFrames.jl 1.4.0 requires at least Julia 1.6. If you use an older version of Julia, DataFrames.jl 1.3.6 should be used; it is in maintenance mode (so if any bugs are found please report them and I will make a patch release)
  • unstack now supports combine keyword argument, which turns this function into a pivot-table (allowing for aggregation of data when unstacking)
  • add full support for “data frame as a collection of rows” functionality, adding functions: reverse!, permute!, invpermute!, shuffle, shuffle!, resize!, keepat!, pop!, popfirst!, popat!, pushfirst!, insert!, prepend!
  • On Julia 1.7 or newer broadcasting assignment into an existing column of a data frame replaces it. Under Julia 1.6 or older it is an in place operation. (#3022)
  • Add special syntax for eachindex, groupindices, and proprow to transformation mini-language
  • DataFrame is now a mutable struct; this change makes DataFrame objects serialized under earlier versions of DataFrames.jl incompatible with version 1.4
  • added table-level and column-level metadata support.
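
Some of the features listed above can be explored with a small sketch like this one. It assumes DataFrames.jl 1.4 or newer; the data, metadata keys, and metadata values are made up for illustration.

```julia
using DataFrames

# unstack with combine: duplicate (id, var) pairs are aggregated with sum.
long = DataFrame(id=[1, 1, 1, 2], var=["a", "a", "b", "a"], val=[1, 2, 3, 4])
wide = unstack(long, :id, :var, :val; combine=sum)

# Table-level and column-level metadata (keys and values are made up).
metadata!(wide, "source", "example data"; style=:note)
colmetadata!(wide, :a, "unit", "kg"; style=:note)
source = metadata(wide, "source")
unit = colmetadata(wide, :a, "unit")

# A data frame as a collection of rows.
rows = DataFrame(n=[1, 2, 3])
pushfirst!(rows, (n=0,))   # n is now 0, 1, 2, 3
keepat!(rows, 1:3)         # n is now 0, 1, 2
reverse!(rows)             # n is now 2, 1, 0
```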

Between DataFrames.jl 1.3.0 and 1.4.0 the following major compatibility changes were made:

  • Compat.jl 4.2 from 3.17
  • PrettyTables.jl 2.1 from 0.12, 1
  • SnoopPrecompile.jl 1 from no dependency

These changes might cause version conflicts when adding packages (minor and patch version changes were also made for other packages, but they are unlikely to make the Julia package manager complain).

On my blog I will be releasing posts in the coming weeks explaining the most important changes in DataFrames.jl 1.4.0 using practical examples.

Also soon all curated tutorials will be updated to DataFrames.jl 1.4.0. I will make a post when this is done.

Many thanks, Bogumił, and all contributors who have worked steadily on this release! DataFrames.jl is a workhorse delivering no-nonsense, dependable, efficient results. I am also grateful for its excellent documentation, both in-package and through other resources such as your blog. It is a delight, and one of the chief reasons I made the transition from R to Julia.

For anyone needing to convert DataFrame objects from version 1.3 to 1.4 (e.g. if you serialized your objects for short-term storage):

The easiest solution is to use Tables.columntable on DataFrame objects created under DataFrames.jl 1.3 and serialize the resulting NamedTuple. Next, upgrade DataFrames.jl to 1.4, deserialize the NamedTuple, and transform it back to a DataFrame.
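
The round trip described above might look roughly like the sketch below. It runs both steps in a single session purely for illustration; in a real migration step 1 would run under DataFrames.jl 1.3 and step 2 after the upgrade. The data and the temporary file are made up.

```julia
using DataFrames, Serialization

# Tables.jl is a dependency of DataFrames.jl; reaching it through DataFrames
# avoids having to add it to the project separately (`using Tables` also works
# if it is installed).
Tables = DataFrames.Tables

df = DataFrame(a=[1, 2], b=["x", "y"])
path = tempname()

# Step 1 (under DataFrames.jl 1.3): store a NamedTuple of vectors, not the DataFrame.
serialize(path, Tables.columntable(df))

# Step 2 (after upgrading to DataFrames.jl 1.4): read it back and rebuild.
restored = DataFrame(deserialize(path))
```

Because the serialized object is a plain NamedTuple of vectors, it does not depend on the internal layout of the DataFrame struct.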
