Release announcements for DataFrames.jl

Here is a list. In general, though, what we mostly need to do for Julia 1.0 is:

  • finish deprecation of many things (if you look at src/deprecated.jl it is currently massive)
  • polish inconsistencies in the API provided

So what is left is mostly clean-up (e.g. removing stackdf and meltdf and deciding what to do with this functionality). I assume that major new functionality (like metadata storage or the possibility to define an index for a data frame) will come after 1.0. Of course we are open to PRs with new functionality - here we are only talking about where the people who contribute most to the package will focus their efforts.

3 Likes

any performance-related change / new best practice?

1 Like

We have small performance improvements (like with vector of Bool indexing), but this PR is mostly about functionality, such as allowing Not or Regex for indexing - this, we hope, boosts “developer performance”.
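As a rough illustration of the Not and Regex column selectors mentioned above, here is a minimal sketch. The data frame and its column names are made up for this example, and it assumes a DataFrames.jl version that already supports these selectors.

```julia
using DataFrames

# Hypothetical data, only for illustration.
df = DataFrame(x1 = [1, 2, 3], x2 = [4, 5, 6], y = ['a', 'b', 'c'])

# Every column except y.
no_y = df[:, Not(:y)]

# Every column whose name matches the regular expression.
only_x = df[:, r"^x"]

println(size(no_y))
println(size(only_x))
```

Both selections keep all rows and only narrow the set of columns.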

Interestingly in some places we are faster than Base, e.g. in certain broadcasting operations (see here), but I would not rely on this as a general rule as it is hard to beat @mbauman in this field :smile:.

5 Likes

What a coincidence (someone has just asked if you can broadcast over a DataFrame):
https://stackoverflow.com/questions/57044789/julia-apply-function-to-every-cell-within-a-dataframe-without-loosing-column-n
:smile:

3 Likes

Is there a discussion somewhere on why there’s this strange API change? I’m having a really hard time seeing why df[!, :column] = v is better than df[:column] = v or df.column = v.

3 Likes

The discussion is in the PR comments:

https://github.com/JuliaData/DataFrames.jl/pull/1866

I think these are great changes. The distinction between copying and non-copying access should lead to much cleaner code, and consequently fewer bugs. I find the use of ! especially elegant.

2 Likes

I’m also having a really hard time seeing why df[!, :column] = v is better than df[:column] = v or df.column = v .

I did not find the relevant discussion in the linked PR. Is there perhaps a corresponding issue where these changes were discussed and motivated, so that I could read up on this?

1 Like

Is there an error in the examples? Here:

julia> df .+ 1
3×3 DataFrame
│ Row │ x1    │ x2    │ y    │
│     │ Int64 │ Int64 │ Char │
├─────┼───────┼───────┼──────┤
│ 1   │ 2     │ 3     │ 'b'  │
│ 2   │ 3     │ 4     │ 'c'  │
│ 3   │ 4     │ 5     │ 'd'  │

julia> df .+= ones(Int, size(df))
3×3 DataFrame
│ Row │ x1    │ x2    │ y    │
│     │ Int64 │ Int64 │ Char │
├─────┼───────┼───────┼──────┤
│ 1   │ 4     │ 6     │ 'b'  │
│ 2   │ 6     │ 8     │ 'c'  │
│ 3   │ 8     │ 10    │ 'd'  │

I don’t see why the second operation should yield e.g. [4,6,8] for x1, and running the code myself (after updating DataFrames) I can’t reproduce this either - shouldn’t it just be the same as the first operation?

As far as I understand, only df[:column] is deprecated. So you should be fine :wink:

df.column = v is still valid and calls the “unsafe” method df[!, :column]

Indeed - this was a bug in the example (I ran the updating operation several times while testing the code and forgot to reset the df data frame before copy-pasting). Sorry for that.

1 Like

df.column is allowed and is the same as df[!, :column].

Now the reasons why df[:column] and df[columns_vector] are disallowed (currently deprecated) are the following (various points were raised by various people - I will comment below on which are most important for me):

  • consistency 1: a data frame is a two-dimensional container, so technically it should be indexed using both row and column indices; in particular, in Base a single index, as in matrix[idx], performs linear indexing (as opposed to column selection, which is what data frames did)
  • consistency 2: even if we wanted to allow single-index indexing, using this index to select columns goes against the “collection of rows” understanding of a data frame in other functions, like filter or sort
  • safety 1: ultimately df[:col] and df[col_vector] are unsafe and discouraged syntax, because they give you access to the “raw” underlying columns (without copying them); this is the major source of bug reports we get from users, e.g. they do df2 = df[col_vector], then push! a row to df2 or sort df2, and in consequence corrupt the consistency of df. Sometimes this unsafe operation is desirable (it is very fast, because it is non-copying), so we want to keep allowing it, but we decided that a clear visual signal should warn the user immediately that this is an unsafe operation; using ! is a standard in Julia to indicate an operation that might lead to mutation of the argument (and this is essentially what we do here: by writing df[!, :col] we extract a “private” column of df)
  • safety 2: another very common pattern that led to bugs was writing something like df[df.x1 .< 0.25] where people thought they were selecting rows (this kind of bug was also reported); it actually selected columns of the data frame, while df[df.x1 .< 0.25, :] was the intended syntax
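The row-selection pattern from the “safety 2” point can be sketched as follows. The data is made up for illustration. The point is only that the Boolean mask goes in the row position and : selects all columns.

```julia
using DataFrames

# Hypothetical data, only for illustration.
df = DataFrame(x1 = [0.1, 0.5, 0.2], x2 = [1, 2, 3])

# Two indices: a Bool mask for rows, and : for all columns.
small = df[df.x1 .< 0.25, :]

println(small)
```

Here the mask picks the first and third rows, so small has two rows and both columns.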

For me, personally, the “safety” reasons were more important than the “consistency” reasons, but both are valid. Simply put, we want the syntax to help users write fewer bugs. We still provide the df.col syntax, as we acknowledge it is a convenient way to pick a column (and unless you programmatically generate :col you do not need to write df[!, :col] if you do not want to).

Finally - we have decided that df[!, :col] is not that much more verbose than df[:col] - it is only 2 characters more. We could have left it out and deprecated df[:col] in favor of getproperty(df, :col) and df[col_vector] in favor of select(df, col_vector, copycols=false), but this would be overkill, so we decided to add a special syntax using !.

Also, I think that having the ! syntax will help new users better understand the nature of a data frame. As opposed to a Base container like a Matrix, a data frame is a nested structure, so you have to clearly distinguish:

  • df[!, col] which “accesses” the internal structure of a data frame
  • df[:, col] which is essentially the same, but copying, so we treat a data frame as “a whole” (not mutating its internal structure)
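A minimal sketch of that distinction when reading a column, using a made-up data frame: ! returns the stored vector itself, while : returns a copy.

```julia
using DataFrames

# Hypothetical data, only for illustration.
df = DataFrame(a = [1, 2, 3])

col_ref = df[!, :a]    # the vector stored inside df, no copy
col_copy = df[:, :a]   # a freshly allocated copy

println(col_ref === df.a)    # identity check against the stored column
println(col_copy === df.a)

col_ref[1] = 100       # writes through to df
col_copy[2] = 200      # only changes the copy

println(df.a)
```

After these two assignments only the write through col_ref is visible in df.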

What I want to say is best seen with an example. This is the old (deprecated) behavior:

julia> df = DataFrame()
0×0 DataFrame


julia> df[:, :a] = [1,2,3]
┌ Warning: `setindex!(df::DataFrame, v::AbstractVector, ::Colon, col_ind::ColumnIndex)` is deprecated, use `begin
│     df[!, col_ind] = v
│     df
│ end` instead.
│   caller = top-level scope at none:0
└ @ Core none:0
3-element Array{Int64,1}:
 1
 2
 3

julia> df
3×1 DataFrame
│ Row │ a     │
│     │ Int64 │
├─────┼───────┤
│ 1   │ 1     │
│ 2   │ 2     │
│ 3   │ 3     │

As you can see, you could add a column to a data frame using :, which was inconsistent. You mutated the “internals” of the data frame using :, which should not allow this. Now it is clear that if you want to do such an operation you should use ! and not :, which: a) is consistent, b) warns you that you are going to significantly change the internal structure of the data frame.

With the ! and : distinction we were able to write a consistent set of rules for what each operation does here, which is very easy to remember (at least this is what I think). The only thing you have to know is that : works like for matrices and ! is mutating.

This also means that we will be able (after the deprecation period) to significantly reduce the size of the code base used for defining indexing (far fewer methods need to be specified, as the rules are simpler).

24 Likes

Good reasons, I suppose. I believe I also misunderstood which functions were being deprecated – df.col is a really important syntax for me and I assumed it was part of the giant yellow screen of deprecation warnings I had yesterday. Thanks for the writeup!

Thanks for the explanation - the new changes, while initially surprising, make a lot of sense once explained.

Can I ask a clarification about one thing you wrote?

: works like for matrices and ! is mutating.

You show how colon-based modification like df[:, :a] = [1,2,3] is deprecated. But doesn’t that kind of modification work just fine for matrices?

julia> x = rand(3,3)
3×3 Array{Float64,2}:
 0.676306  0.899877  0.620036
 0.585781  0.294398  0.942301
 0.338273  0.508919  0.673037

julia> x[:,1] = [1,2,3]
3-element Array{Int64,1}:
 1
 2
 3

julia> x
3×3 Array{Float64,2}:
 1.0  0.899877  0.620036
 2.0  0.294398  0.942301
 3.0  0.508919  0.673037

What is deprecated is that a : row index:

  • creates a new column
  • replaces an existing column

In the parlance of matrices, what I have written is the equivalent of:

julia> x = ones(2,3)
2×3 Array{Float64,2}:
 1.0  1.0  1.0
 1.0  1.0  1.0

julia> x[:, 4] = [1, 2]
ERROR: BoundsError: attempt to access 2×3 Array{Float64,2} at index [Base.Slice(Base.OneTo(2)), 4]

It fails for matrices, but used to work for data frames.

You will still be able to write:

julia> df = DataFrame(ones(2,3))
2×3 DataFrame
│ Row │ x1      │ x2      │ x3      │
│     │ Float64 │ Float64 │ Float64 │
├─────┼─────────┼─────────┼─────────┤
│ 1   │ 1.0     │ 1.0     │ 1.0     │
│ 2   │ 1.0     │ 1.0     │ 1.0     │

julia> df[:, :x1] = [10, 20]
┌ Warning: `setindex!(df::DataFrame, v::AbstractVector, ::Colon, col_ind::ColumnIndex)` is deprecated, use `begin
│     df[!, col_ind] = v
│     df
│ end` instead.
│   caller = top-level scope at none:0
└ @ Core none:0
2-element Array{Int64,1}:
 10
 20

julia> df
2×3 DataFrame
│ Row │ x1    │ x2      │ x3      │
│     │ Int64 │ Float64 │ Float64 │
├─────┼───────┼─────────┼─────────┤
│ 1   │ 10    │ 1.0     │ 1.0     │
│ 2   │ 20    │ 1.0     │ 1.0     │

but in the target syntax the column :x1 will not be replaced but updated “in place” like this:

julia> df = DataFrame(ones(2,3))
2×3 DataFrame
│ Row │ x1      │ x2      │ x3      │
│     │ Float64 │ Float64 │ Float64 │
├─────┼─────────┼─────────┼─────────┤
│ 1   │ 1.0     │ 1.0     │ 1.0     │
│ 2   │ 1.0     │ 1.0     │ 1.0     │

julia> df[1:2, :x1] = [10, 20]
2-element Array{Int64,1}:
 10
 20

julia> df
2×3 DataFrame
│ Row │ x1      │ x2      │ x3      │
│     │ Float64 │ Float64 │ Float64 │
├─────┼─────────┼─────────┼─────────┤
│ 1   │ 10.0    │ 1.0     │ 1.0     │
│ 2   │ 20.0    │ 1.0     │ 1.0     │

The fact that : did something different than axes(df, 1) or 1:nrow(df) when used as a row index was a big inconsistency that will get rectified.
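The difference between updating a column in place and replacing it can be checked directly. This is a small sketch with made-up data: assigning through a row range keeps the existing Float64 vector and converts the new values, while assigning through ! swaps in the new vector as it is.

```julia
using DataFrames

# Hypothetical data, only for illustration.
df = DataFrame(x1 = [1.0, 1.0], x2 = [1.0, 1.0])
old = df.x1                      # keep a handle on the original vector

df[1:2, :x1] = [10, 20]          # in place: values are converted to Float64
same_vector = df.x1 === old
eltype_in_place = eltype(df.x1)

df[!, :x1] = [10, 20]            # replacement: the Int vector becomes the column
replaced = df.x1 !== old
eltype_replaced = eltype(df.x1)

println((same_vector, eltype_in_place, replaced, eltype_replaced))
```

The in-place update keeps the column type, and the ! assignment changes it to the type of the assigned vector.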

4 Likes

This might be a bit much, but I just wanted to commend everyone involved in making these changes. I have wanted to see them for a while (as someone who uses arrays at least as often as dataframes, I really felt the inconsistencies both in the indexing and copying behaviors), and I was worried that fear of upsetting users might prevent them from ever being included. Not only that, but the deprecation warnings make the deprecated usage extremely easy to spot without immediately breaking anything. In my opinion these changes are characteristic of the careful consideration and cool, dispassionate thinking that one so often finds in the Julia community, a precedent set by the design of the language itself. It’s refreshing to see people sacrificing momentary convenience and knee-jerk reactions for long-term simplicity and clarity. Well done.

28 Likes

Thank you so much for detailed explanations. Written this way the new changes make so much sense and I’m now a big fan of them :slight_smile:

2 Likes

What is the difference between using
df[!, cols]
and
@view df[!, col]

(if the former doesn’t copy anything anyway)?

You mean semantically, or in terms of implementation? The @view will give you a SubDataFrame, so it is an extra level of indirection, but otherwise they will both share structure and you can modify the underlying df.
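A small sketch of that difference, with made-up data and column names: df[!, cols] yields a new DataFrame that holds the same column vectors, while the @view form yields a SubDataFrame that wraps the parent. Writes through either one reach df.

```julia
using DataFrames

# Hypothetical data, only for illustration.
df = DataFrame(a = [1, 2, 3], b = [4, 5, 6])

shared = df[!, [:a, :b]]       # a new DataFrame holding the same vectors
sdf = @view df[:, [:a, :b]]    # a SubDataFrame wrapping df

println(typeof(shared))
println(typeof(sdf))

sdf[1, :a] = 10                # goes through the view into df
shared[2, :a] = 20             # goes through the shared vector into df

println(df.a)
```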

My understanding is that the ! syntax substitutes for view in many cases.

Related to this, for matrices: M[:, col] copies and @view M[:, col] does not.

Why do we need df[!, cols] alongside @view df[:, col] for non-copy access?

I know that technically the latter gives a SubSomething type. But that’s fine for matrix views as well.

Please enlighten me :slight_smile:

@bkamins already explained this above, twice. I don’t think I have anything to add to that.