Indexing DataFrame with : does not generate a copy

The DataFrames.jl documentation says

Since df[!, :col] does not make a copy, changing the elements of the column vector returned by this syntax will affect the values stored in the original df. To get a copy of the column use df[:, :col]: changing the vector returned by this syntax does not change df.

Here, df is an instance of DataFrame. I thought the above excerpt of the documentation meant the following. df[!, :col] returns a view of the column named col of df and therefore changing its contents changes the contents of df itself. On the other hand, df[:, :col] returns a copy of the same column and thus changing its contents does not alter the contents of df.

However, df[:, :col] also seems to change the contents of df:

(@v1.7) pkg> st DataFrames
      Status `~/.julia/environments/v1.7/Project.toml`
  [a93c6f00] DataFrames v1.3.2

julia> using DataFrames

julia> df = DataFrame(A=1:3, B=4:6)
3×2 DataFrame
 Row │ A      B     
     │ Int64  Int64 
─────┼──────────────
   1 │     1      4
   2 │     2      5
   3 │     3      6

julia> df[:, :A] .= 0  # intend to change copy of column A by using : instead of !
3-element view(::Vector{Int64}, :) with eltype Int64:
 0
 0
 0

julia> df  # column A of df has changed!
3×2 DataFrame
 Row │ A      B     
     │ Int64  Int64 
─────┼──────────────
   1 │     0      4
   2 │     0      5
   3 │     0      6

What did I do wrong here? Is my interpretation of the documentation incorrect?

Since I’ve tripped over the same issue, I’ll illustrate below and then paste what @bkamins patiently explained in a Slack forum. The difference comes down to getindex versus setindex!.

using DataFrames


df = DataFrame(A=1:3, B=4:6)

### getindex (: used in df on the right hand side of =)
copyOfA = df[:, :A]

copyOfA .= 0

copyOfA

# note that :A has not changed
df


### setindex!/broadcasting assignment (: used in df on the left-hand side of = or .=)

df[:, :A] .= 0

# now df has changed
df

From Slack discussion:

The issue is that you are reading the documentation of getindex,
while we were discussing the behavior of setindex!/broadcasting assignment context.

The design principle in DataFrames.jl is to provide every option
the user might ask for, as it is a low-level package.
(High-level packages like DataFramesMeta.jl or DataFrameMacros.jl
expose less to the user, but under a simpler API.)

In getindex context a user can ask for the following behaviors:
get a vector “as is”, without copying; this is achieved with df[!, :a] and df.a;
get a vector with copying; that is achieved with df[:, :a];
get a view of the vector; this is achieved with view(df, :, :a) or view(df, !, :a)
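
The three getindex behaviors listed above can be checked with identity comparisons (===). This is only a small sketch, assuming DataFrames.jl 1.x; the variable names stored, copied and viewed are made up for illustration.

```julia
using DataFrames

df = DataFrame(a=1:3, b=4:6)

stored = df[!, :a]        # the vector stored in df, no copy is made
copied = df[:, :a]        # a freshly allocated copy of the column
viewed = view(df, :, :a)  # a view onto the stored vector

@show stored === df.a          # same object as the stored column
@show copied === df.a          # a different object ...
@show copied == df.a           # ... holding equal values
@show parent(viewed) === df.a  # the view wraps the stored column

copied[1] = 100  # changes only the copy
viewed[2] = 200  # writes through to df
stored[3] = 300  # writes through to df
@show df.a
```

Only the writes through stored and viewed should show up in df; the write to copied should not.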

In setindex!/broadcasting assignment context
the user might ask for the following:

set an existing vector in-place,
 this is done by df[:, :a] = vec and df[:, :a] .= something

replace an existing vector without copying:
 df[!, :a] = vec or df.a = vec

replace an existing vector with copying:
 df[!, :a] .= vec and df[!, :a] .= something

add a new vector without copying
 df[!, :new] = vec or df.new = vec

add a new vector with copying
df[:, :new] = vec or df[:, :new] .= something
or 
df[!, :new] .= something

(this last rule breaks things a bit as ! and : behave here
in the same way, but it was added for user convenience)
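
The setindex! rules above can be checked the same way, by comparing the stored column with === before and after each assignment. This is a sketch assuming DataFrames.jl 1.x; old, v and newcol are illustrative names, and the assertions reflect the behavior described in the rules rather than anything beyond them.

```julia
using DataFrames

df = DataFrame(a=1:3, b=4:6)
old = df.a  # the vector currently stored as column a

# Set an existing vector in place: the stored vector object is kept.
df[:, :a] = [10, 20, 30]
@assert df.a === old
df[:, :a] .= 0
@assert df.a === old

# Replace an existing vector without copying: df now stores v itself.
v = [7, 8, 9]
df[!, :a] = v
@assert df.a === v
@assert df.a !== old

# Replace an existing vector with copying: a new vector is allocated.
df[!, :a] .= v
@assert df.a !== v
@assert df.a == v

# Add a new vector without copying.
newcol = [1.0, 2.0, 3.0]
df[!, :c] = newcol
@assert df.c === newcol

# Add a new vector with copying.
df[:, :d] = newcol
@assert df.d !== newcol
@assert df.d == newcol
```

If all assertions pass silently, each form behaved as the corresponding rule says.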

These are selected basic rules. All the rules of indexing
are described in Indexing · DataFrames.jl.

In short: the range of behaviors users might want is very vast,
so we needed to introduce both : and ! to cover every option.

As an additional explanation let me show you what happens in normal Julia matrices:

julia> x = [1 2; 3 4]
2×2 Matrix{Int64}:
 1  2
 3  4

julia> x[:, 1] .= 0 # broadcasting assignment, x is changed
2-element view(::Matrix{Int64}, :, 1) with eltype Int64:
 0
 0

julia> x
2×2 Matrix{Int64}:
 0  2
 0  4

julia> y = x[:, 1] # copy of first column of x
2-element Vector{Int64}:
 0
 0

julia> y .= 10 # now y is a copy, so y is changed, but x is not changed
2-element Vector{Int64}:
 10
 10

julia> y
2-element Vector{Int64}:
 10
 10

julia> x
2×2 Matrix{Int64}:
 0  2
 0  4

DataFrames.jl works the same way here.
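
For comparison, here is the same distinction with the view written out explicitly, using only Base Julia. It is a small sketch; col_copy and col_view are illustrative names. Indexing on the right-hand side gives a copy, while x[:, 1] .= 0 acts like broadcasting into a view of x, as the view(...) type printed in the session above suggests.

```julia
x = [1 2; 3 4]

col_copy = x[:, 1]        # getindex: an independent copy of the column
col_view = @view x[:, 1]  # an explicit view into x

col_copy .= 10            # changes only the copy
@assert x == [1 2; 3 4]

col_view .= 0             # writes through to x, like x[:, 1] .= 0
@assert x == [0 2; 0 4]
```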
