Release announcements for DataFrames.jl

I make type mistakes very often using ==. I wish there were a safer equality operator that would error unless the types of the operands were compatible.

So with version 1.1.0, is it also possible to use
subset(df, :x => ==(0))
or only
subset(df, :x => ByRow(==(0)))
?

Only

subset(df, :x => ByRow(==(0)))

The following

subset(df, :x => ==(0))

is a somewhat meaningless comparison (a whole vector against a single number) and can lead to hard-to-catch bugs, which is why it’s disallowed.

Or you can write filter(:x => ==(0), df). This is the crucial difference between filter (which applies the function to one element at a time) and subset (which passes it a whole vector).

We have discussed removing filter support, as its syntax is inconsistent with the rest of the DataFrames.jl minilanguage (as normally :x => fun means passing a whole vector to fun), but the use-case we discuss here is frequent enough that we decided to keep the inconsistency.
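
To make the difference concrete, here is a minimal sketch. The data frame and its values are made up for illustration, and it assumes DataFrames.jl 1.1 or newer:

```julia
using DataFrames

# Made-up example data, only for illustration.
df = DataFrame(x = [0, 1, 0, 2], y = ["a", "b", "c", "d"])

# subset hands the whole :x vector to the function, so the element-wise
# predicate has to be wrapped in ByRow.
by_subset = subset(df, :x => ByRow(==(0)))

# filter applies the predicate to one element of :x at a time.
by_filter = filter(:x => ==(0), df)

# Without ByRow the comparison is "vector == number", which is a single Bool.
whole_vector = df.x == 0

# subset rejects that form; record whether it threw instead of crashing.
rejected = try
    subset(df, :x => ==(0))
    false
catch
    true
end
```

Both by_subset and by_filter keep the same two rows; only the way the predicate is applied differs.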

Agreed. I started adding typeassert to be sure

typeassert(id, eltype(df.id))
only_id = filter(:id => ==(id), df)

but this isn’t a DataFrames problem, I think. The equality is from Julia base.
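
A small sketch of why the check helps, using a plain vector as a stand-in for df.id (the values are made up):

```julia
id = "42"            # a String, e.g. read from a text file
ids = [41, 42, 43]   # Vector{Int}, standing in for df.id

# == across unrelated types does not error, it is simply false.
silent_mismatch = any(==(id), ids)

# typeassert throws a TypeError when the value is not of the given type.
caught = try
    typeassert(id, eltype(ids))
    false
catch err
    err isa TypeError
end

# It is stricter than ==, though: 42 == 42.0 is true, yet this would throw too.
# typeassert(42.0, eltype(ids))
```

So typeassert catches the String-vs-Int mistake, at the price of also rejecting values that == would happily compare across numeric types.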

8 posts were split to a new topic: Performance of DataFrames’ subset and ByRow

Is there any “cheat sheet” newer or more complete than this one?
https://www.ahsmart.com/pub/data-wrangling-with-data-frames-jl-cheat-sheet/
It uses DataFrames.jl v0.22.

@Juan - it should be mostly OK. You could open an issue on https://github.com/tk3369/www.ahsmart.com to ask for an update. Thank you!

I am trying to update to DataFrames 1.1.0 using
Pkg.update("DataFrames")
and checking the package with
Pkg.status("DataFrames"), but it's not getting updated. I am still seeing v0.22.7.

Can you please help?

Do ] add DataFrames@1.1.0 and read the error message closely. It will tell you which package is holding back compatibility in your environment.

Updating registry at C:\Users\harne\.julia\registries\General
Updating registry at C:\Users\harne\.julia\registries\JuliaComputingRegistry
Resolving package versions…
ERROR: Unsatisfiable requirements detected for package ScikitLearn [3646fa90]:
ScikitLearn [3646fa90] log:
├─possible versions are: 0.5.0-0.6.3 or uninstalled
├─restricted to versions * by an explicit requirement, leaving only versions 0.5.0-0.6.3
└─restricted by compatibility requirements with DataFrames [a93c6f00] to versions: uninstalled — no versions left
└─DataFrames [a93c6f00] log:
├─possible versions are: 0.11.7-1.1.0 or uninstalled
└─restricted to versions 1.1.0 by an explicit requirement, leaving only versions 1.1.0

Still
Pkg.status("DataFrames")
Status C:\Users\harne\.julia\environments\v1.6\Project.toml
[a93c6f00] DataFrames v0.22.7

After removing the ScikitLearn package, DataFrames was updated using your command. Thanks a lot for the help!
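
For anyone hitting the same thing, the steps above can also be written with the Pkg API instead of the ] prompt. This is only a sketch: it changes the active environment, and Pkg.rm is what worked here, not necessarily what you want if you still need ScikitLearn.

```julia
using Pkg

# Ask for the exact version. If another package's compat bounds forbid it,
# the resolver error names that package (ScikitLearn in the log above).
Pkg.add(name = "DataFrames", version = "1.1.0")

# What was done in this case: drop the package that holds DataFrames back,
# then request the version again.
# Pkg.rm("ScikitLearn")
# Pkg.add(name = "DataFrames", version = "1.1.0")

# Confirm which version ended up in the active environment.
Pkg.status("DataFrames")
```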

Congrats on reaching version 1.0!!!
It’s a major contribution and an important step toward settling the “Is Julia production-ready?” question for ‘general’ data science use!
Thanks!!!

A promised blog post about filter vs subset is here.

I am upgrading to DataFrames 1.0 this weekend.

Previously, leftjoin(df1, df2, on=:key) resulted in a DataFrame with rows ordered the same as df1. I know it was documented that this could change, but I also bet I wasn’t the only one that had code relying on it.

To the others who were relying on it, what do you do now? Make an index column and sort!?
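
Something like this is what I have in mind. It is a sketch with made-up data, and row_id is just a hypothetical helper column name:

```julia
using DataFrames

# Made-up example tables.
df1 = DataFrame(key = [3, 1, 2], a = ["c", "a", "b"])
df2 = DataFrame(key = [1, 2, 3], b = [10, 20, 30])

# Remember the original row order of df1 in a helper column
# (on a copy, so df1 itself is left untouched).
left = copy(df1)
left.row_id = collect(1:nrow(left))

joined = leftjoin(left, df2, on = :key)

# Restore the order of df1 and drop the helper column again.
sort!(joined, :row_id)
select!(joined, Not(:row_id))
```

This works regardless of what row order leftjoin itself produces, at the cost of an extra column and a sort.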

Could you please elaborate on this in https://github.com/JuliaData/DataFrames.jl/issues/2753? I would add the kwargs I discuss there relatively quickly once we reach a consensus on which options we want for which joins.

For those interested (as many ask about it): in The state of DataFrames.jl H2O benchmark - #14 by bkamins I have summarized the conclusions from the latest H2O benchmark.

DataFrames.jl 1.2.0 is out. Here you can find the release notes. I have also written a blog post explaining the key user visible changes it introduces.

DataFrames.jl 1.3.0 is out.

It is a major release, much bigger than the recent ones. Hopefully, we have managed to fill in all the key missing parts of the package to make it feature complete.

Development towards 1.4.0 will continue by adding additional features requested by the users. I expect to have this release around JuliaCon 2022 (unless something unexpected happens).

Here you can find the detailed release notes. See also NEWS.md for a list of relevant changes in the package.

Let me briefly summarize the most important changes and additions (in total, 125 PRs were merged since the 1.2.2 release, which is a lot). This summary will be brief, so it assumes you know the functionality of the package; I will soon write a blog post explaining these changes for newcomers:

  • in groupby users now have more control over the resulting group order (previously groupby by default produced the group ordering that is fastest to create, which is unintuitive in certain use cases; now the sort keyword argument is improved and allows more control if the user wants it);
  • if a SubDataFrame was created with the : column selector (i.e. it contains all columns of its parent) then you can add new columns to such a data frame in all functions (the filtered-out rows get filled with missing values);
  • delete! is deprecated in favor of deleteat!, fixing the inconsistency with what these functions are used for in Julia Base;
  • leftjoin! is added allowing for in-place joining of data frames (and it is fast)
  • in source .=> transformation .=> destination form of the transformation minilanguage the Cols, Between, All and Not selectors support broadcasting;
  • fix a bug in handling of keyword arguments in sorting-related functions that in some cases allowed passing tuples (support for which was removed in the 1.0 release) and in some other cases led to a stack overflow;
  • transformations having a form AsTable(...) => ByRow(sum) (and other standard reduction functions) are now fast even when many columns are selected (solving a long standing performance bottleneck)
  • in the DataFrames.jl 1.4 release, on Julia 1.7 or newer, broadcasting assignment into an existing column of a data frame will replace it. Under Julia 1.6 or older it will be an in-place operation (this is an unfortunate difference in behavior between versions of Julia; it is impossible to implement it differently due to limitations of Julia Base, which is why a clear announcement of this discrepancy is made now and the change will be made effective in DataFrames.jl 1.4).
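
A few of the items above, sketched on made-up data. This assumes DataFrames.jl 1.3 or newer and only shows the simplest form of each feature:

```julia
using DataFrames

# Made-up example data.
df = DataFrame(g = ["b", "a", "b", "a"], x = 1:4, y = [10, 20, 30, 40])

# groupby: sort = true orders the groups by key.
gd = groupby(df, :g, sort = true)
first_key = gd[1].g[1]

# AsTable(...) => ByRow(sum): row-wise sum over the selected columns.
totals = transform(df, AsTable([:x, :y]) => ByRow(sum) => :total)

# deleteat! replaces the deprecated delete! for dropping rows in place
# (done on a copy here so df stays intact).
trimmed = deleteat!(copy(df), 1)

# leftjoin! adds the columns of `right` to `left` in place;
# rows of `left` without a match get missing.
left = DataFrame(g = ["b", "a", "c"])
right = DataFrame(g = ["a", "b"], label = ["A", "B"])
leftjoin!(left, right, on = :g)
```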

Before I wrap up let me thank everyone who contributed towards this release!

Hi, I didn’t notice the announcement of the underlying change in Julia 1.7. Can you give me a pointer?