What if `df.col .= v` was in-place?

You may or may not be surprised to learn that for a DataFrame .= isn’t always in-place. It surprised me for sure, but I didn’t follow the announcements so that’s on me.

What if it was always in-place for existing columns though? And what if normal assignment only created an alias if explicitly declared (with copy on assignment otherwise, with fill for scalars)? Would that be at all breaking?

I’m curious to learn if you have used or seen:

  • df.col .= v on an existing column where you need it to not be in-place

  • df.col = v where you need v === df.col

  • Either of the above with df[!, :col]
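
To check which of these cases your own code relies on, here is a minimal sketch of the current behavior as I understand it. It assumes DataFrames.jl 1.4 or later on Julia 1.7 or later; older combinations broadcast in place. The column name and values are made up for illustration.

```julia
using DataFrames

df = DataFrame(col = [1, 2, 3])

# Case 1: broadcast assignment into an existing column.
old = df.col              # alias of the stored column (no copy)
df.col .= 0               # assumed: allocates a new column and replaces the old one
replaced = df.col !== old

# Case 2: plain assignment stores the right-hand side without copying it.
v = [10, 20, 30]
df.col = v
aliased = df.col === v

# Case 3: the `:` row selector broadcasts into the stored vector in place.
df[:, :col] .= 7
inplace = (df.col === v) && (v == [7, 7, 7])

@show replaced aliased inplace
```

If `replaced` prints `false` on your setup, you are on a version where `df.col .= v` still writes in place.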

Here’s a tentative implementation you can run against your unit tests if you’d like: “In-place broadcast assignment” by gustafsson, Pull Request #3206 in JuliaData/DataFrames.jl.

I’m far from an expert DataFrames user but it seems unexpected to me that df.col .= v is not in-place.

Here are two non-exhaustive reasons for the current behavior, as far as I understand it.

  • DataFrames.jl wishes to allow df.newcol .= 1 and df.newcol .= 1:3 to conveniently create new columns. Both of these have to allocate. If we were to allow df.existingcol .= 1 to not allocate, then there would be more complicated rules for users to learn.
  • DataFrames.jl doesn’t want new users to worry about conversion rules. If df.x is a Vector{Int}, we do not want df.x .= 'a' to silently convert the character to an Int.

There are multiple competing goals, and DataFrames.jl chose a behavior that satisfies some constraints but clearly doesn’t fall in line with everyone’s intuition.
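
The conversion in the second point can be reproduced with a plain Base vector, no DataFrames needed. This is only a sketch of what in-place broadcasting does to a value of a different type; the variable names are made up.

```julia
x = [1, 2, 3]              # Vector{Int}
x .= 'a'                   # in place: each slot gets convert(Int, 'a')
@show x eltype(x)          # the Char ends up stored as its code point, 97

y = fill('a', length(x))   # allocating instead: the eltype follows the value
@show y eltype(y)
```

An in-place `.=` has to keep the existing element type, while an allocating one is free to pick the type of the right-hand side.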

As this is a poll:

For the first point, it would be better to throw an error, consistent with Julia arrays syntax, when the array does not exist.
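
For comparison, this is roughly what Base does for a property that does not exist, using a NamedTuple as a stand-in for a table. The error type differs across Julia versions, so the sketch only records that something was thrown.

```julia
nt = (a = [1, 2, 3],)

nt.a .= 0                  # existing property: the vector is updated in place

result = try
    nt.b .= 0              # missing property: the lookup fails before any broadcast
    "no error"
catch err
    typeof(err)
end

@show nt.a result
```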

Regarding the second point, df.x .= 'a' does promote to Int (DataFrames v1.3.6).

This is the point of this poll. In DataFrames.jl 1.4 it does not promote because this promotion was confusing users.

Similarly, users expected df.x .= value to work even if :x is not present in the data frame, exactly as df[:, :x] = value and similar operations do. Writing df.x = vector or df.x .= value is AFAICT currently the most common way to add columns to a data frame.

Now, the design idea behind the non-in-place behavior of df.x .= value is that we wanted this operation to always produce the same value stored in the :x column, no matter whether :x was previously present in df or not. This is exactly how df.x = vector currently behaves (it replaces whatever is or is not present in column :x with vector).
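
A small sketch of that invariant, assuming DataFrames.jl 1.4 or later on Julia 1.7 or later (on older versions the existing column would keep its Float64 element type). The data frame names are made up.

```julia
using DataFrames

with_x    = DataFrame(x = [1.5, 2.5, 3.5])   # :x already present, eltype Float64
without_x = DataFrame(y = 1:3)               # :x absent

with_x.x    .= 1
without_x.x .= 1

# The stored column should not depend on what was there before.
same_values = with_x.x == without_x.x
same_eltype = eltype(with_x.x) == eltype(without_x.x)
@show same_values same_eltype
```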

Tomorrow I will write a longer blog post about the reasoning behind this design.

Shouldn’t the deviations from Julia’s base syntax be decorated with macros?

A fake example for a new df.x column about to be created:
@df df.x .= value

From the docs

To get a copy of the column you can use german[:, :Sex] or german[:, "Sex"]. In this case changing the vector returned by this operation does not affect the data stored in the german data frame.

Reading this, I now find the following behavior confusing

df = DataFrame(:A => [1,2,3,4])
df[:, :A] .= 5
df[!, :A] == [5, 5, 5, 5] #true

Is it intended that df[:, col] returns a copy on RHS but not on LHS? It seems to me the desired behavior is that users can modify columns whether or not they exist, and currently this requires allocation. Maybe this is a good opportunity to further distinguish df[!, col] from df[:, col] where

df[!, col_that_does_not_exist] .= value

will raise an error, and

df[:, col_that_does_exist] .= value

will not modify df

Edit: OK, after thinking about this more I am starting to understand the complexity. It is hard to make both operations consistent, especially if df[:, c] can be used as an lvalue.

The behavior of df[:, :A] was chosen for consistency with matrices. This can indeed give somewhat complex rules, probably if getindex returned views things would have been simpler for data frames.
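
The matrix behavior being mirrored looks like this in plain Base Julia (a sketch with made-up values): indexing with `:` on the right-hand side copies, while the same expression on the left of `.=` writes into the parent.

```julia
M = [1 2; 3 4]

c = M[:, 1]        # right-hand side: getindex returns a copy of the column
c[1] = 100
unchanged = M[1, 1] == 1

M[:, 1] .= 5       # left-hand side of .=: writes into M itself
@show unchanged M
```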

BTW, let me stress that the DataFrames.jl behavior won’t change in a breaking way before it reaches 2.0, which isn’t currently planned and hopefully won’t have to happen soon as stability is essential for users.

I agree with this. If you have this dataframe,

df = DataFrame(a=1:2)

then both of the following throw an ArgumentError:

df[!, :b]
df.b

Upon reading the Julia manual section on customizing the broadcasting interface, one would conclude that df[!, :b] .= 1 and df.b .= 1 must throw an ArgumentError. There’s no way to get around the fact that df[!, :b] and df.b always error and thus do not return an object.

So, in order to get around this conundrum, DataFrames has overloaded two internal Julia functions: Base.dotview and Base.dotgetproperty. Unfortunately, these functions are internal and are not a part of any documented Julia interface. So attempting to reason about df.b .= 1 based on prior Julia programming experience is a fruitless exercise. One just has to accept that DataFrames is special.
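
One way to see where those two functions come in is to ask Julia how it lowers the expressions. This sketch assumes Julia 1.7 or later (where, as far as I know, Base.dotgetproperty was introduced); `df` is just an unbound symbol here, so DataFrames does not need to be loaded.

```julia
# Lowering only rewrites syntax; nothing is evaluated, so `df` need not exist.
lowered_prop  = string(Meta.lower(Main, :(df.b .= 1)))
lowered_index = string(Meta.lower(Main, :(df[!, :b] .= 1)))

# Check which internal hooks the lowered code refers to.
uses_dotgetproperty = occursin("dotgetproperty", lowered_prop)
uses_dotview        = occursin("dotview", lowered_index)
@show uses_dotgetproperty uses_dotview
```

Printing `lowered_prop` shows the full lowered form, including the `Base.materialize!` call that receives whatever the hook returns.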

The more I work with tables and other data structures, the more I prefer being explicit about whether something like x[:property] = ... or x.property = ... inserts a new value or replaces an existing one. This often makes for more reliable and unambiguous code.

For tables, this approach seems a good fit for the Accessors.jl interface with its @set/@insert macros (and corresponding functions).
An actual working example that shows consistency between getters/setters/inserters:

julia> using StructArrays, AccessorsExtra

julia> tbl = StructArray(a=1:3, b=[:x, :y, :z])
3-element StructArray(::UnitRange{Int64}, ::Vector{Symbol}) with eltype NamedTuple{(:a, :b), Tuple{Int64, Symbol}}:
 (a = 1, b = :x)
 (a = 2, b = :y)
 (a = 3, b = :z)

julia> tbl.b
3-element Vector{Symbol}:
 :x
 :y
 :z

julia> @delete tbl.b
3-element StructArray(::UnitRange{Int64}) with eltype NamedTuple{(:a,), Tuple{Int64}}:
 (a = 1,)
 (a = 2,)
 (a = 3,)

# @set: update existing column
julia> @set tbl.b = 1:3
3-element StructArray(::UnitRange{Int64}, ::UnitRange{Int64}) with eltype NamedTuple{(:a, :b), Tuple{Int64, Int64}}:
 (a = 1, b = 1)
 (a = 2, b = 2)
 (a = 3, b = 3)

# ... but not create a new one
julia> @set tbl.c = 1:3
ERROR: ArgumentError: Failed to assign properties (:c,) to object with properties (:a, :b).

# @insert: create a new column
julia> @insert tbl.c = 1:3
3-element StructArray(::UnitRange{Int64}, ::Vector{Symbol}, ::UnitRange{Int64}) with eltype NamedTuple{(:a, :b, :c), Tuple{Int64, Symbol, Int64}}:
 (a = 1, b = :x, c = 1)
 (a = 2, b = :y, c = 2)
 (a = 3, b = :z, c = 3)

x-ref about the current design of indexing into a single column of a data frame: DataFrames.jl indexing rules | Blog by Bogumił Kamiński.
