Adding multiple new columns to dataframe?

Hi

Previously, one could do the following to add new columns to a dataframe:

df = DataFrame()
df[[:col1, :col2, :col3]] = []

But nowadays, this yields the following error message:

ERROR: MethodError: no method matching setindex!(::DataFrame, ::Array{Any,1}, ::Array{Symbol,1})

What is the new way to add new columns to a dataframe?

Hmm, I’m not sure that ever worked. What did you expect df.col1 to return after you ran that code?

Do you want a for loop?

df = DataFrame()
for c in [:col1, :col2, :col3]
    df[:, c] = []
end

Note the :. In more recent versions of DataFrames you need to specify both dimensions when indexing a data frame. df[:col] and df[[:col1, :col2]] are both deprecated.
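
To illustrate the two-dimensional indexing mentioned above, here is a minimal sketch, assuming a recent DataFrames.jl release. The data frame contents and the variable names are made up for illustration. It shows that ! returns the stored column itself while : returns a copy.

```julia
using DataFrames

# Hypothetical example data
df = DataFrame(col1 = [1, 2, 3], col2 = ["a", "b", "c"])

v_nocopy = df[!, :col1]       # the stored column itself, no copy
v_copy   = df[:, :col1]       # a fresh copy of the column

v_nocopy === df.col1          # true: same object as the stored column
v_copy === df.col1            # false: independent copy

sub = df[:, [:col1, :col2]]   # new data frame holding copies of both columns
```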

The closest to the syntax you want is:

julia> df = DataFrame()
0×0 DataFrame

julia> insertcols!(df, ([:col1, :col2, :col3] .=> Ref([]))...)
0×3 DataFrame

julia> describe(df)
3×7 DataFrame
 Row │ variable  mean     min      median   max      nmissing  eltype
     │ Symbol    Nothing  Nothing  Nothing  Nothing  Int64     DataType
─────┼──────────────────────────────────────────────────────────────────
   1 │ col1                                                 0  Any
   2 │ col2                                                 0  Any
   3 │ col3                                                 0  Any

I’m very certain it worked prior to the new pkg update.
Right now, I also have to deal with changing from df[:col] and df[[:col1, :col2]] to df[!, :col] and df[!, [:col1, :col2]] in my code.

I expect it to return an empty dataframe with the columns I specified:

0×3 DataFrame

I intend to populate that dataframe row by row in the next step, hence why I need an empty dataframe with specified columns.

For the record, your for-loop solution works, but I prefer @bkamins's one-liner.

Thank you!

  1. You do not need to know the list of columns upfront, since push! has a cols=:union option that handles this.
  2. If you know the list of columns upfront, then it is easier to write:
DataFrame([:c1, :c2, :c3] .=> Ref([]))

or

DataFrame(fill([], 3), [:c1, :c2, :c3])

(in my original answer I assumed you did not know the list of columns when creating the data frame)
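
As a rough sketch of the first point, assuming a recent DataFrames.jl release, rows can be pushed into a data frame that starts with no columns at all. The column names and values are made up, and NamedTuples are used here instead of Dicts only so that the column order is predictable.

```julia
using DataFrames

df = DataFrame()   # no columns declared upfront

# The first row creates the columns col1 and col2
push!(df, (col1 = 1, col2 = "a"); cols = :union)

# The second row lacks col2 and introduces col3;
# cols = :union adds the new column and fills the gaps with missing
push!(df, (col1 = 2, col3 = 3.5); cols = :union)
```

After these two calls the data frame has two rows and three columns, with missing in col2 of the second row and in col3 of the first row.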

Very useful suggestions!

  1. Currently, I’m pushing a 1×n Array{Any, 2} row by row, but I’m going to switch to a Dict instead, which I think is a much safer approach now that I know cols=:union exists.

  2. I actually have both cases, one where DataFrame([:c1, :c2, :c3] .=> Ref([])) is most useful and another where insertcols!(df, ([:col1, :col2, :col3] .=> Ref([]))...) is most suitable.

Thank you!

What does the ‘=>’ operator do? Is that DataFrames-specific or a Julia operator?
P.S. Is Ref() necessary to make sure that each new column doesn’t get the same empty list?

This is a Julia operator that constructs a Pair. Example:

julia> ([:a, :b, :c] .=> Ref([]))
3-element Array{Pair{Symbol,Array{Any,1}},1}:
 :a => []
 :b => []
 :c => []

I can confirm that you need the Ref([]), otherwise you get a DimensionMismatch:

julia> ([:a, :b, :c] .=> [])
ERROR: DimensionMismatch("arrays could not be broadcast to a common size; got a dimension with lengths 3 and 0")
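
The behaviour above can be reproduced with Base Julia alone. The following is a small sketch with illustrative variable names. It also shows a one-element tuple as another way to shield a value from broadcasting.

```julia
colnames = [:a, :b, :c]

# Ref wraps the empty vector so broadcasting treats it as a single value
pairs_ref = colnames .=> Ref([])

# A one-element tuple has the same shielding effect
pairs_tuple = colnames .=> ([],)

# Without shielding, [] is a zero-length collection that cannot be
# matched against the three names, so broadcasting throws
err = try
    colnames .=> []
    nothing
catch e
    e
end
```

Here err ends up holding the DimensionMismatch exception instead of stopping the script.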

What does the ‘=>’ operator do? Is that DataFrames-specific or a Julia operator?

It is Pair syntax from Julia Base. Here you have a list of its usages in DataFrames.jl: https://bkamins.github.io/julialang/2020/07/17/pair.html

Is Ref() necessary to make sure that each new column doesn’t get the same empty list?

Ref is necessary for the broadcasting to work. DataFrames.jl automatically takes care that the column [] is copied and not reused (you could turn it off with copycols=false, but in your case do not do this).
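
A minimal sketch of this copying behaviour, assuming a recent DataFrames.jl release and the default copycols=true. The pushed row values are made up.

```julia
using DataFrames

ps = [:c1, :c2, :c3] .=> Ref([])
ps[1].second === ps[2].second   # true: the pairs share one empty vector

df = DataFrame(ps)              # copycols=true is the default
df.c1 === df.c2                 # false: each column is its own copy

push!(df, (1, "x", 2.0))        # rows can therefore be appended safely
```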

That’s very funny. I’m not sure I’ve ever used ‘=>’. Is that possible?! lol.

I do remember the Ref([]) operation now; I had a similar problem where a broadcasted assignment didn’t work without it.

Thank you.

The original usage of => is to create Dicts:

julia> Dict(1=>2, 3=>4)
Dict{Int64,Int64} with 2 entries:
  3 => 4
  1 => 2

Great, thank you for taking the time to write that, it’s really useful (and thank you also for, you know, taking the time to write DataFrames!)

Also, shame on me. Julia’s new and improved help mode can tell me what ‘=>’ is (the online documentation doesn’t work well for operators).

help?> =>
search: =>

  Pair(x, y)
  x => y

  Construct a Pair object with type Pair{typeof(x), typeof(y)}. The elements
  are stored in the fields first and second. They can also be accessed via
  iteration (but a Pair is treated as a single "scalar" for broadcasting
  operations).
...
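
The points in that docstring can be tried out with a short sketch in Base Julia. The variable names and values are illustrative only.

```julia
p = :col1 => [1, 2, 3]     # same as Pair(:col1, [1, 2, 3])

p.first                    # :col1
p.second                   # [1, 2, 3]

k, v = p                   # iteration yields first, then second

# A Pair acts as a scalar in broadcasting, so it is repeated per element
repeated = tuple.([10, 20], p)

# A vector of pairs can be turned into a Dict
d = Dict([p, :col2 => [4, 5, 6]])
```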

Very nice blog btw. A lot of useful tips!

If you need any ideas for future blog posts, may I suggest writing about how to optimize performance when using DataFrames?

For example, I noticed in the previous pkg version that the difference in computation time between filter(row -> row.col1 == x, df) and df[df[:col1] .== x, :] was very significant. For a particular DataFrame, I measured the average time over 1000 runs, and the results were:

  • filter(row -> row.col1 == x, df) -> 9.1429996 ms
  • df[df[:col1] .== x, :] -> 0.0309999 ms

This performance difference is documented in the filter docstring. The performant syntax is:

filter(:col1 => ==(x), df)
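
For comparison, here is a small sketch, assuming a recent DataFrames.jl release, that puts the three forms side by side on made-up data. It only checks that they select the same rows and makes no timing claims. The mask form is written with df.col1, since df[:col1] is the deprecated syntax.

```julia
using DataFrames

# Hypothetical example data
df = DataFrame(col1 = [1, 2, 1, 3], col2 = ["a", "b", "c", "d"])
x = 1

by_row   = filter(row -> row.col1 == x, df)   # generic row-wise predicate
by_col   = filter(:col1 => ==(x), df)         # predicate applied to one column
by_index = df[df.col1 .== x, :]               # boolean mask indexing
```

All three return a new data frame with the rows where col1 equals x.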