Release announcements for DataFrames.jl

DataFrames.jl 1.0 is out.

You can find a summary of what has changed since the last release here.
If you want to see the details of the changes since the 0.22 release, they are listed here, along with the list of PR contributors.

Many thanks to everyone who contributed over the years to make this release happen.

The plan for the coming days is to update all tutorials listed here to the latest release.

76 Likes

Thanks for bringing such an integral piece of the ecosystem to 1.0. It means a lot.

4 Likes

A big congratulations on this achievement. DataFrames is such an essential tool for a lot of data scientists out there. :muscle:

7 Likes

First of all, congrats on and thank you for the 1.0 release!

Any link where we can read more about this design? (The linked PR seems to mostly discuss Julia 1.6/1.7 differences, not why df.col .= value being in-place is undesirable.) Intuitively I’d expect this broadcast to be in-place, just like normal_vector .= value. To be clear, I’ve learned to trust you and the rest of the DataFrames team to make careful and wise decisions, so I trust this is right - just want to understand.

1 Like

The rules are described here. Be warned, though, that people say clicking this link is like playing chess with Mikhail Tal, whose motto was :smiley: :

β€œYou must take your opponent into a deep, dark forest where 2+2=5 and the path leading out is only wide enough for one.”

Now back to business. There are two layers to the issue.

Layer one is the mental model. If you see df.col you should be able to know with confidence that it will do exactly the same thing as writing df[!, :col]. It is a basic principle that these two operations should be the same. They were (and under Julia 1.6 still are) inconsistent, which means that users have to learn exceptions where the two differ.

Layer two is that for indexing, a data frame is a collection of columns (similarly to e.g. select/transform/subset/combine, but as opposed to other operations like sort/filter/dropmissing/unique, where we tend to look at it as a collection of rows - I warned you that this is a deep dark forest; the short story is that for some operations people find a column-oriented view more natural and for others a row-oriented one). Clearly, for indexing, if you write df.col this is column oriented. Why? Because e.g. if you write:

df.col .= 1

you would like this operation to work unconditionally. In particular, if df is missing column :col you want it created (which is clearly not in-place) - and I hope you agree that most people will want it to work. So think of df.col .= 1 as broadcasting into df, not into column :col of this data frame (essentially you are broadcasting into a vector of vectors, as this is the underlying structure that holds the columns of a DataFrame).
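For illustration, here is a minimal sketch of that (my own example; it assumes DataFrames.jl 1.0+ running on Julia 1.7 or newer, where df.col .= value is handled by the data frame itself):

using DataFrames

df = DataFrame(a=1:3)
df.b .= 1    # column :b does not exist, so it is created and filled with 1
             # (clearly not an in-place update of an existing vector)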

Now what is the benefit? Before moving forward think of what result you would expect from the following operation:

df = DataFrame(a=1:3)
df.a .= 'x'
df

Now scroll down:

julia> df = DataFrame(a=1:3)
3Γ—1 DataFrame
 Row β”‚ a     
     β”‚ Int64 
─────┼───────
   1 β”‚     1
   2 β”‚     2
   3 β”‚     3

julia> df.a .= 'x'
3-element Vector{Int64}:
 120
 120
 120

julia> df
3Γ—1 DataFrame
 Row β”‚ a     
     β”‚ Int64 
─────┼───────
   1 β”‚   120
   2 β”‚   120
   3 β”‚   120

Although it is consistent with the broadcasting rules of Julia Base for vectors, I assume this is not what most people want when they write df.col .= 'x'. I bet that a majority would expect a vector of 'x'. Similarly you have:

julia> df = DataFrame(a='a':'c')
3Γ—1 DataFrame
 Row β”‚ a    
     β”‚ Char 
─────┼──────
   1 β”‚ a
   2 β”‚ b
   3 β”‚ c

julia> df.a .= 1
3-element Vector{Char}:
 '\x01': ASCII/Unicode U+0001 (category Cc: Other, control)
 '\x01': ASCII/Unicode U+0001 (category Cc: Other, control)
 '\x01': ASCII/Unicode U+0001 (category Cc: Other, control)

julia> df
3Γ—1 DataFrame
 Row β”‚ a    
     β”‚ Char 
─────┼──────
   1 β”‚ \x01
   2 β”‚ \x01
   3 β”‚ \x01

Sadly, we have just failed to create a column of a constant term for e.g. a linear regression model (although, again, it is consistent with Julia broadcasting rules).

Also most likely we do not want an error thrown in this case:

julia> df = DataFrame(a=1:3)
3Γ—1 DataFrame
 Row β”‚ a     
     β”‚ Int64 
─────┼───────
   1 β”‚     1
   2 β”‚     2
   3 β”‚     3

julia> df.a .= "a"
ERROR: MethodError: Cannot `convert` an object of type String to an object of type Int64

if you are in the middle of a 10-step @chain pipeline.

Such considerations are the second layer of why, under Julia 1.7, we prefer to make df.col .= 1 replace the column rather than update it in place.

We are aware that being a replace rather than an in-place operation sacrifices some speed (which I bet 99% of users will never notice), but in exchange you get lower surprise (you are sure to get what you most likely expect and the operation will not error) and higher consistency (you know that df.col and df[!, :col] are just aliases).

Finally, we have made sure you can do in-place broadcasting if you want - just write df[:, :col] .= value.
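To make the contrast concrete, here is a small sketch (again my own example, assuming Julia 1.7 or newer):

using DataFrames

df = DataFrame(a=1:3)    # :a is a Vector{Int64}

df.a .= 'x'              # :a is replaced by a fresh Vector{Char} filled with 'x'
                         # (no conversion of 'x' to 120)
df[:, :a] .= 'y'         # in-place: broadcasts 'y' into that existing Vector{Char}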

16 Likes

Pardon my ignorance on the development cycle of Julia packages, but when can we use DataFrames 1.0, given that we are on version v0.20.2? I am really enthusiastic about trying out some of the new features of DataFrames as well as getting accustomed to the newer way of doing things!

You can upgrade with ] up DataFrames.

3 Likes

Since version 0.20.2 there have already been 0.21 and 0.22 releases before the 1.0 release. This means that you are going to miss the deprecation messages for things in your old code that were removed or changed.

Also try doing ]up DataFrames@1 to force the package version upgrade. If this errors, you will see which packages are holding you back.

2 Likes

This is known to the devs, but if you have ScikitLearn.jl (or a related package) then you’ll be blocked. If you’re doing data stuff, it’s likely this might be an issue for you. There’s a PR to update the dependency there https://github.com/cstjean/ScikitLearn.jl/pull/96 but the main contributor is swamped, so it might take a while (unless someone else picks it up).

3 Likes

According to this benchmark, it seems the performance of DataFrames 1.0 has not improved much:
https://h2oai.github.io/db-benchmark/

The benchmark has a bug that I introduced in the way we read in the files from disk. We will announce when it is time to check the benchmarks. Here is the PR in which we discuss the fix: enable multithreading in Julia by bkamins Β· Pull Request #196 Β· h2oai/db-benchmark Β· GitHub.

Also, last week and this week I posted some example benchmarks of DataFrames.jl 1.0 vs 0.22.7 vs data.table, as I know that a lot of people are looking into this. My examples are of course less comprehensive, but they show more of a β€œtypical” usage scenario (working on e.g. a laptop), while the H2O benchmarks are more server oriented (100+ GB of RAM and a 40-core machine).

10 Likes

Hi bkamins, I think tlienart might be on to something with the suggestion that a data-science-related package might be blocking the DataFrames 1.0 upgrade, as I can install DataFrames 1.0 in its own environment. If that is indeed the case, it’s no big deal; I can just wait it out. However, these are the packages that I have installed, with MLJ being in its own environment.

[69666777] Arrow v0.2.4
[a93c6f00] DataFrames v0.20.2
[7806a523] DecisionTree v0.10.10
[f6006082] EvoTrees v0.4.9
[587475ba] Flux v0.8.3
[7073ff75] IJulia v1.23.2
[682c06a0] JSON v0.21.1
[7acf609c] LightGBM v0.5.2
[eb30cadb] MLDatasets v0.5.6
[b8a86587] NearestNeighbors v0.4.8
[612083be] Queryverse v0.6.2
[ce6b1742] RDatasets v0.7.5
[0aa819cd] SQLite v1.1.4
[8523bd24] ShapML v0.3.0
[c4f8c510] UMAP v0.1.8
[112f6efa] VegaLite v2.4.1
[44d3d7a6] Weave v0.10.7

Edit1: I was able to upgrade to DataFrames v0.21.8 by removing ShapML

Edit2: I was able to upgrade to DataFrames v1.0.1 by removing Queryverse

However, I’m going to keep Queryverse since it’s hard to work without Queryverse, DataFrames, or MLJ. Btw thank you tlienart for your hard work on MLJ; it’s the most delightful and elegant ML framework to use.

It’s ShapML here that’s holding you back.

2 Likes

A small announcement: DataFrames.jl 1.1.0 has just been released. After the 1.0 release, minor releases should not be problematic for users, so the upgrade should be smooth.

The reason we decided to go for a 1.1 release so soon (and not just a patch release) is the behavior of the subset function, which was introduced only in the 1.0 release. We got fast user feedback (many thanks to matthieugomez for pushing it actively on GitHub) about one corner case that was unintended. The details are here. Also, this week I will write a longer explanation of this change on my blog.

The short explanation is the following. Currently this errors:

julia> df = DataFrame(x=zeros(3))
3Γ—1 DataFrame
 Row β”‚ x       
     β”‚ Float64 
─────┼─────────
   1 β”‚     0.0
   2 β”‚     0.0
   3 β”‚     0.0

julia> subset(df, :x => ==(0))
ERROR: ArgumentError: functions passed to `subset` must return an AbstractVector.

while in the 1.0 release it unintentionally worked, because the condition ==(0) was applied to the whole vector df.x, producing a scalar false, which was then broadcasted. While logically correct, this was clearly very error prone and unintended. Most likely the user wanted to use ByRow like this:

julia> subset(df, :x => ByRow(==(0)))
3Γ—1 DataFrame
 Row β”‚ x       
     β”‚ Float64 
─────┼─────────
   1 β”‚     0.0
   2 β”‚     0.0
   3 β”‚     0.0

Now, as you can see, we safeguard user code against such mistakes.
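For intuition, here is my own illustration of what the whole-vector form evaluates to outside of subset:

x = zeros(3)
(==(0))(x)      # the whole vector is compared with the scalar 0: a single false
(==(0)).(x)     # what ByRow(==(0)) does per row: true for every element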

The change was on the border between a bug fix and a functionality change, so we decided to make the 1.1.0 release quickly to make sure that no user code comes to depend on the unintended behavior that was accepted in the 1.0 release.

8 Likes

I make type mistakes very often using ==. I wish there were a safer equality operator that would error unless the types of the operands were compatible.

2 Likes

So now, with the 1.1.0 version, is it also possible to use
subset(df, :x => ==(0))
or only
subset(df, :x => ByRow(==(0)))
?

Only

subset(df, :x => ByRow(==(0)))

The following

subset(df, :x => ==(0))

is a somewhat meaningless comparison and can lead to hard-to-catch bugs, which is why it’s disallowed.

4 Likes

Or you can write filter(:x => ==(0), df). This is the crucial difference between filter (which works on an element) and subset (which takes a whole vector).

We have discussed removing filter support, as its syntax is inconsistent with the rest of the DataFrames.jl minilanguage (normally :x => fun means passing the whole vector to fun), but the use case we discuss here is frequent enough that we decided to keep the inconsistency.
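To put the two spellings side by side, a quick sketch (my own example):

using DataFrames

df = DataFrame(x=zeros(3))

filter(:x => ==(0), df)           # filter: the predicate gets one element of :x at a time
subset(df, :x => ByRow(==(0)))    # subset: :x => fun would get the whole vector,
                                  # hence ByRow for an element-wise test

Both calls return the full 3Γ—1 data frame here.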

1 Like

Agreed. I started adding typeassert to be sure

typeassert(id, eltype(df.id))
only_id = filter(:id => ==(id), df)

but this isn’t a DataFrames problem, I think. The equality operator is from Julia Base.
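If you want to go a step further than typeassert, a purely hypothetical helper (just a sketch, not something provided by DataFrames.jl or Base) could look like this:

# strict_eq is a hypothetical name: it refuses to compare values of
# unrelated types instead of silently returning false
strict_eq(x::T, y::T) where {T} = x == y
strict_eq(x::Number, y::Number) = x == y
strict_eq(x, y) = throw(ArgumentError("refusing to compare $(typeof(x)) with $(typeof(y))"))

only_id = filter(:id => v -> strict_eq(v, id), df)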

8 posts were split to a new topic: Performance of DataFrames’ subset and ByRow