Release announcements for DataFrames.jl

It’s ShapML here that’s holding you back.

A small announcement: DataFrames.jl 1.1.0 has just been released. After the 1.0 release, minor releases should not be problematic for users, so the upgrade should be smooth.

The reason we decided to go for a 1.1 release so soon (and not just a patch release) is the behavior of the subset function, which was introduced only in the 1.0 release. We quickly got user feedback (many thanks to matthieugomez for actively pushing it on GitHub) about one unintended corner case. The details are here. Also, this week I will write a longer explanation of this change on my blog.

The short explanation is the following. Currently this errors:

julia> df = DataFrame(x=zeros(3))
3×1 DataFrame
 Row │ x       
     │ Float64 
─────┼─────────
   1 │     0.0
   2 │     0.0
   3 │     0.0

julia> subset(df, :x => ==(0))
ERROR: ArgumentError: functions passed to `subset` must return an AbstractVector.

In the 1.0 release this unintentionally worked: the condition ==(0) was applied to the whole vector df.x, producing the scalar false, which was then broadcast over all rows. While logically consistent, this was clearly error-prone and unintended. Most likely the user wanted to use ByRow, like this:

julia> subset(df, :x => ByRow(==(0)))
3×1 DataFrame
 Row │ x       
     │ Float64 
─────┼─────────
   1 │     0.0
   2 │     0.0
   3 │     0.0

Now, as you can see, we safeguard user code against such mistakes.
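
To make the difference concrete, here is a minimal sketch in plain Julia (no DataFrames.jl needed). The variable names are only for illustration; it shows what ==(0) returns when it receives a whole vector versus when it is applied element by element, which is what ByRow arranges for each row.

```julia
x = zeros(3)

# Applying ==(0) to the whole vector compares the vector itself with the number 0,
# so the result is a single Bool, not one value per element
whole = (==(0))(x)

# Broadcasting applies ==(0) to each element separately and returns a Bool vector
byrow = (==(0)).(x)
```

Here whole is a single false, while byrow holds one true per element, and only the latter is a valid row mask for subset.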

The change sat on the border between a bug fix and a functionality change, so we decided to make the 1.1.0 release quickly, to make sure that no user code comes to depend on the unintended behavior that was accepted in the 1.0 release.

I make type mistakes very often using ==. I wish there were a safer equality operator that would error unless the types of the operands were compatible.

So with version 1.1.0, is it still possible to use
subset(df, :x => ==(0))
or only
subset(df, :x => ByRow(==(0)))
?

Only

subset(df, :x => ByRow(==(0)))

The following

subset(df, :x => ==(0))

is a somewhat meaningless comparison and can lead to hard to catch bugs, which is why it’s disallowed.

Or you can write filter(:x => ==(0), df). This is the crucial difference between filter (which works on a single element) and subset (which takes a whole vector).
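
As a rough side-by-side sketch (the example data is made up), the three calls below should select the same rows. The last one is a whole-vector predicate written with broadcasting, which is the form subset expects when ByRow is not used.

```julia
using DataFrames

df = DataFrame(x = [0.0, 1.0, 0.0])

# filter passes one element of :x at a time to the predicate
a = filter(:x => ==(0), df)

# subset passes the whole column, so the predicate is wrapped in ByRow
b = subset(df, :x => ByRow(==(0)))

# equivalent whole-vector form: the function returns a Bool vector
c = subset(df, :x => x -> x .== 0)
```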

We have discussed removing filter support, as its syntax is inconsistent with the rest of the DataFrames.jl minilanguage (as normally :x => fun means passing a whole vector to fun), but the use-case we discuss here is frequent enough that we decided to keep the inconsistency.

Agreed. I started adding typeassert to be sure

typeassert(id, eltype(df.id))
only_id = filter(:id => ==(id), df)

but this isn’t a DataFrames problem, I think. The equality is from Julia base.
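
A small sketch of that pattern, with made-up data and a hypothetical helper name (filter_by_id is not part of DataFrames.jl): comparing an Int column with a String silently matches nothing, while typeassert turns the mismatch into an error. Note that typeassert is strict, so for example an Int32 id would also be rejected against an Int64 column.

```julia
using DataFrames

df = DataFrame(id = [1, 2, 3], v = ["a", "b", "c"])

# Comparing Int ids with a String is always false, so no rows match and no error is raised
silent = filter(:id => ==("2"), df)

# Hypothetical helper: check the type of `id` first, then filter
function filter_by_id(df, id)
    typeassert(id, eltype(df.id))  # throws a TypeError on mismatch
    return filter(:id => ==(id), df)
end

only_id = filter_by_id(df, 2)
```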

8 posts were split to a new topic: Performance of DataFrames’ subset and ByRow

Is there any “cheat sheet” newer or more complete than this one?
https://www.ahsmart.com/pub/data-wrangling-with-data-frames-jl-cheat-sheet/
It uses DataFrames.jl v0.22.

@Juan - it should be mostly OK. You could open an issue on GitHub (tk3369/www.ahsmart.com) to ask for an update. Thank you!

I am trying to update to DataFrames 1.1.0 using
Pkg.update("DataFrames")
and checking the package with
Pkg.status("DataFrames")
but it is not getting updated. I am still seeing v0.22.7.

Can you please help?

Do ] add DataFrames@1.1.0 and read the error message closely. It will tell you which package is holding back compatibility in your environment.

Updating registry at C:\Users\harne\.julia\registries\General
Updating registry at C:\Users\harne\.julia\registries\JuliaComputingRegistry
Resolving package versions…
ERROR: Unsatisfiable requirements detected for package ScikitLearn [3646fa90]:
 ScikitLearn [3646fa90] log:
 ├─possible versions are: 0.5.0-0.6.3 or uninstalled
 ├─restricted to versions * by an explicit requirement, leaving only versions 0.5.0-0.6.3
 └─restricted by compatibility requirements with DataFrames [a93c6f00] to versions: uninstalled — no versions left
   └─DataFrames [a93c6f00] log:
     ├─possible versions are: 0.11.7-1.1.0 or uninstalled
     └─restricted to versions 1.1.0 by an explicit requirement, leaving only versions 1.1.0

Still
Pkg.status("DataFrames")
Status C:\Users\harne\.julia\environments\v1.6\Project.toml
[a93c6f00] DataFrames v0.22.7

After removing the ScikitLearn package, DataFrames was updated using your command. Thanks a lot for the help.

Congrats for reaching version 1.0!!!
It’s a major contribution and an important step toward settling the “Is Julia production-ready?” question for general data science use!
Thanks!!!

The promised blog post about filter vs subset is here.

I am upgrading to DataFrames 1.0 this weekend.

Previously, leftjoin(df1, df2, on=:key) resulted in a DataFrame with rows ordered the same as df1. I know it was documented that this could change, but I also bet I wasn’t the only one that had code relying on it.

To the others who were relying on it, what do you do now? Make an index column and sort!?
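
For reference, the index-column workaround could look roughly like this. It is only a sketch with made-up data, and the helper column name :_row is arbitrary: record the original row position of df1, join, sort on it, then drop it.

```julia
using DataFrames

df1 = DataFrame(key = [3, 1, 2], a = ["c", "a", "b"])
df2 = DataFrame(key = [1, 2, 3], b = [10, 20, 30])

# Remember the original row position of df1 in a temporary column
df1_idx = copy(df1)
df1_idx._row = collect(1:nrow(df1_idx))

joined = leftjoin(df1_idx, df2, on = :key)

# Restore the order of df1 and drop the temporary column
sort!(joined, :_row)
select!(joined, Not(:_row))
```

Because each key in df2 is unique here, the result has one row per row of df1, in the original order of df1.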

Could you please elaborate on this in the issue “add a keyword to allow specifying target column order in joins” (Issue #2753, JuliaData/DataFrames.jl on GitHub)? I would add the kwargs I discuss there relatively quickly once we reach a consensus on which options we want for which joins.

For those interested (as many ask about it): in “The state of DataFrames.jl H2O benchmark” (post #14 by bkamins) I have summarized the conclusions from the latest H2O benchmark.

DataFrames.jl 1.2.0 is out. Here you can find the release notes. I have also written a blog post explaining the key user-visible changes it introduces.
