Copy vs view of DataFrame column?

Hello, do you know an example in which a copy, df[:,:a], should be used instead of a view with df[!,:a]?
Thanks !

Please don’t revive a three-year-old thread like this. Make a new post instead.

That said, if you have a function that modifies a vector in-place, you will want to do df[:, :a] instead of df[!, :a].

function make_ten!(x)
    x[1] = 10
end

make_ten!(df[!, :a]) # modifies df
make_ten!(df[:, :a]) # does not modify df

Your statement and the comment in your example contradict each other?

Statement should read

if you have a function that modifies a vector in-place, you will want to do df[!, :a] instead of df[:, :a].

using DataFrames
df = DataFrame(rand(10,3),["a","b","c"])
function make_ten!(x)
    x[1] = 10
end
make_ten!(df[:, :a])

Using colon, df is not updated:

julia> df[!,:a]
10-element Vector{Float64}:
 0.6133749814711238
 0.9022335210145525
 0.5930273630651568
 0.25727987475397907
 0.5368154177958848
 0.9789575335373208
 0.07748891516310152
 0.8386526410191439
 0.9929637176775048
 0.5485057586874986

This is assuming you don’t want to modify df. If you want the changes the function makes to persist inside the data frame, then yes, use !.

Thanks @pdeffebach and @cchderrick. Sorry I revived a thread. Is it a problem because it was old? I thought that it was a very related question. From now on I will always create new threads.
Regarding the question, thanks for the answers. Would it be good if the compiler gave me a warning or error when I try to do the following:

make_ten!(df[:, :a])

?
Because what I should use is either

make_ten(df[:, :a])

or

make_ten!(df[!, :a])

I mean so as to avoid mistakes.
Do you agree?

Not sure what you mean by “mistake”.

There’s nothing wrong with any of the three operations you give.

make_ten!(df[:, :a]) # Saves memory, doesn't modify data frame
make_ten(df[:, :a]) # Ultra-safe, doesn't modify data frame
make_ten!(df[!, :a]) # Saves memory, modifies data frame

There’s no “right answer”; it depends on what you want to do. And no, Julia’s compiler generally does not do this sort of thing.
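To make the three cases concrete, here is a minimal sketch. Note that make_ten (without the !) is never defined in this thread, so the version below is an assumed non-mutating counterpart that copies its argument first; make_ten! is a variant of the earlier function that also returns x.

```julia
using DataFrames

# Variant of the mutating function from earlier in the thread (returns x).
function make_ten!(x)
    x[1] = 10
    return x
end

# Hypothetical non-mutating counterpart (an assumption, not from the thread):
# it works on its own copy of the input.
make_ten(x) = make_ten!(copy(x))

df = DataFrame(a = [1, 2, 3])

make_ten!(df[:, :a])          # mutates a fresh copy of the column
first_after_copy = df.a[1]    # 1: df is unchanged

make_ten(df[:, :a])           # copies the copy, then mutates that
first_after_safe = df.a[1]    # 1: df is unchanged

make_ten!(df[!, :a])          # mutates the vector stored inside df
first_after_bang = df.a[1]    # 10: df is changed
```

Only the last call changes df, because only df[!, :a] hands the function the vector that df itself holds.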

Thanks. By mistake I mean that I use

make_ten!(df[:, :a])

thinking that it will mutate df because the ! in the function name.

Now, with your explanation, I see that there is a memory benefit in using that particular operation.
Thanks!

Once you do df[:, :a], that object (a vector), knows nothing about df. It has no connection at all with the data frame.

The same goes for df[!, :a] in the sense that its behavior does not depend on being from a data frame. However, df still shares its memory with df[!, :a].
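One way to check this in the REPL is with ===, which tests whether two values are the very same object. A small sketch (the data frame and variable names here are made up for illustration):

```julia
using DataFrames

df = DataFrame(a = [1, 2, 3])

stored = df[!, :a]   # the vector held by df, no copy made
copied = df[:, :a]   # a freshly allocated copy

same_object = stored === df.a       # true: same memory as the column
separate_object = copied !== df.a   # true: independent vector

copied[1] = 99   # does not touch df
stored[1] = 10   # df.a is now [10, 2, 3]
```

Neither vector "knows" about df; the only difference is whether df holds the same object.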

So if within the same scope I do:

df = DataFrame(x=[1,2,3], y=[4,5,6])
df[:,:x]=[7,8,9]

is like doing nothing, right? I mean because the new information of df[:,:x] is lost automatically.

No, that’s assigning (setindex) not retrieving (getindex).

df[:, :x] = [7,8,9]

does modify the data frame, since you are using setindex!, i.e. assigning the column.

df[:, :x] = ...

and df[:, :x] on its own do different things. But the intuition is the same with ! and : when you are doing setindex!:

julia> df = DataFrame(a = [1, 2, 3]);

julia> x = [5, 6, 7];

julia> y = [8,9, 10];

julia> df[:, :x] = x;

julia> df[!, :y] = y;

julia> x[1] = 100;

julia> df
3×3 DataFrame
 Row │ a      x      y     
     │ Int64  Int64  Int64 
─────┼─────────────────────
   1 │     1      5      8
   2 │     2      6      9
   3 │     3      7     10

julia> y[1] = 100;

julia> df
3×3 DataFrame
 Row │ a      x      y     
     │ Int64  Int64  Int64 
─────┼─────────────────────
   1 │     1      5    100
   2 │     2      6      9
   3 │     3      7     10

My recommendation is the following (of course it is only a recommendation):

  • by default always use df[:, :col] as it is safer
  • use df[!, :col] if
    1. speed or memory consumption is important for you (and you are sure that if you modify the data you extracted you will not mess up the source data)
    2. or you want to modify the contents of the column in-place

This applies to getting the column. For setting a column the difference is that df[!, :col] = ... replaces the column, while df[:, :col] = ... updates it in-place. The difference mostly matters when you want to assign values of other type than originally stored in the given column.

Thanks. What is the meaning of “updates in place”?
If I do df[!, :col] = y
and then I modify y, will df be modified also?

This is a bit nuanced, but yes, df[:, :x] = ... will modify the vector stored in column :x directly. The reasoning for this behavior is complicated, but it derives from the fact that this is how Base Julia matrices behave.

julia> df = DataFrame(x=[1,2,3], y=[4,5,6])
3×2 DataFrame
 Row │ x      y     
     │ Int64  Int64 
─────┼──────────────
   1 │     1      4
   2 │     2      5
   3 │     3      6

julia> t = df[!, :x];

julia> df[:, :x] = [40, 50, 60];

julia> t
3-element Vector{Int64}:
 40
 50
 60
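The comparison with Base Julia matrices can be sketched without DataFrames at all. This is only an illustration of the analogy, with a made-up matrix: assigning with a colon writes into the existing storage and converts the new values to the matrix's element type.

```julia
A = [1 2; 3 4]          # Matrix{Int64}
col = view(A, :, 1)     # a view into the first column of A

A[:, 1] = [10.0, 30.0]  # values are converted to Int and written in place

col == [10, 30]         # true: the view sees the update
eltype(A)               # Int64: the matrix keeps its element type
```

Here col plays the role of t in the data frame example: it refers to the same storage, so it sees the values written by the colon assignment.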

Thanks, I will have to practice all of this in the REPL because it’s all new for me.

Thanks, I think I am starting to understand but have some problem with the behavior in the assignments:
df[ ,:x] = ...
Could you give an example of this affirmation: “The difference mostly matters when you want to assign values of other type than originally stored in the given column.”?
Best.

It’s very niche, but the most dramatic example is characters and integers. df[:, :b] = ... will convert the new values to the column’s existing element type (like Julia arrays), while df[!, :b] = ... will keep the type of the new values. See:

julia> using DataFrames

julia> df = DataFrame(a = [1, 2, 3], b = [4, 5, 6]);

julia> x = ['p', 'u', 'z'];

julia> df[:, :x] = x;

julia> df[!, :b] = x;

julia> df[:, :a] = x;

julia> df
3×3 DataFrame
 Row │ a      b     x    
     │ Int64  Char  Char 
─────┼───────────────────
   1 │   112  p     p
   2 │   117  u     u
   3 │   122  z     z

Thanks, I see.