Adding multiple new columns to dataframe?

LasseKatten · November 30, 2020, 2:55pm

Hi

Previously, one could do the following to add new columns to a dataframe:

df = DataFrame()
df[[:col1, :col2, :col3]] = []

But nowadays, this yields the following error message:

ERROR: MethodError: no method matching setindex!(::DataFrame, ::Array{Any,1}, ::Array{Symbol,1})

What is the new way to add new columns to a dataframe?

pdeffebach · November 30, 2020, 3:00pm

Hmm, I’m not sure that ever worked. What did you expect df.col1 to return after you ran that code?

Do you want a for loop?

df = DataFrame()
for c in [:col1, :col2, :col3]
    df[:, c] = []
end

Note the :. In more recent versions of DataFrames you need to specify both dimensions when indexing a data frame. df[:col] and df[[:col1, :col2]] are both deprecated.
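To illustrate the two-dimensional indexing described above, here is a minimal sketch, assuming a recent DataFrames.jl release (the column names and values are made up for illustration). Using ! as the row selector returns the stored column without copying, while : returns a copy:

```julia
using DataFrames

# Hypothetical example data, not taken from the thread.
df = DataFrame(col1 = [1, 2, 3], col2 = ["a", "b", "c"])

# `!` selects all rows without copying: this is the vector stored in df.
v_nocopy = df[!, :col1]

# `:` selects all rows and returns a fresh copy of the column.
v_copy = df[:, :col1]

# Selecting several columns with `!` gives a new DataFrame that shares
# its column vectors with df.
sub = df[!, [:col1, :col2]]
```

Mutating v_nocopy would change df, whereas mutating v_copy would not.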

bkamins · November 30, 2020, 3:03pm

The closest to the syntax you want is:

julia> df = DataFrame()
0×0 DataFrame

julia> insertcols!(df, ([:col1, :col2, :col3] .=> Ref([]))...)
0×3 DataFrame

julia> describe(df)
3×7 DataFrame
 Row │ variable  mean     min      median   max      nmissing  eltype
     │ Symbol    Nothing  Nothing  Nothing  Nothing  Int64     DataType
─────┼──────────────────────────────────────────────────────────────────
   1 │ col1                                                 0  Any
   2 │ col2                                                 0  Any
   3 │ col3                                                 0  Any

LasseKatten · November 30, 2020, 3:12pm

I’m very certain it worked prior to the new pkg update.
Right now, I also have to deal with changing from df[:col] and df[[:col1, :col2]] to df[!, :col] and df[!, [:col1, :col2]] in my code.

I expect it to return an empty dataframe with the columns I specified:

0×3 DataFrame

I intend to populate that dataframe row by row in the next step, hence why I need an empty dataframe with specified columns.

For the record, your for-loop solution works, but I prefer @bkamins's one-liner.

Thank you!

bkamins · November 30, 2020, 3:22pm

You do not need to know the list of columns upfront, as push! has a cols=:union option that will handle this.
If you know the list of columns upfront, then it is easier to write:

DataFrame([:c1, :c2, :c3] .=> Ref([]))

or

DataFrame(fill([], 3), [:c1, :c2, :c3])

(in my original answer I thought you did not know the list of columns when creating the data frame)
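As a rough sketch of the push! route, assuming a DataFrames.jl version that supports cols=:union (the row values here are invented for illustration), rows can be pushed onto a data frame that starts with no columns:

```julia
using DataFrames

# Start with no columns at all.
df = DataFrame()

# The first row defines col1 and col2.
push!(df, (col1 = 1, col2 = "a"); cols = :union)

# The second row lacks col2 and introduces col3; cols=:union adds the new
# column and fills the gaps in both rows with `missing`.
push!(df, (col1 = 2, col3 = 3.5); cols = :union)
```

Named tuples are used here so that the column order is predictable; push! also accepts a Dict as the row, in which case the column order follows the Dict's iteration order.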

LasseKatten · November 30, 2020, 3:39pm

Very useful suggestions!

Currently, I’m pushing a 1xn Array{Any, 2} row-by-row, but I’m going to switch it to a Dict instead, which I think is a much safer approach, now that I know cols=:union exists.
I actually have both cases, one where DataFrame([:c1, :c2, :c3] .=> Ref([])) is most useful and another where insertcols!(df, ([:col1, :col2, :col3] .=> Ref([]))...) is most suitable.

Thank you!

purplishrock · November 30, 2020, 3:40pm

what does the ‘=>’ operator do ? Is that DataFrames specific or a Julia operator ?
p.s. is Ref() necessary to make sure that each new column doesn’t get the same empty list ?

LasseKatten · November 30, 2020, 3:42pm

This is a Julia operator used to create the data type Pair. Example:

julia> ([:a, :b, :c] .=> Ref([]))
3-element Array{Pair{Symbol,Array{Any,1}},1}:
 :a => []
 :b => []
 :c => []

I can confirm that you need the Ref([]), otherwise you get a DimensionMismatch:

julia> ([:a, :b, :c] .=> [])
ERROR: DimensionMismatch("arrays could not be broadcast to a common size; got a dimension with lengths 3 and 0")

bkamins · November 30, 2020, 3:44pm

what does the ‘=>’ operator do ? Is that DataFrames specific or a Julia operator ?

It is Pair syntax from Julia Base. Here you have a list of its usages in DataFrames.jl: How is => used in DataFrames.jl? | Blog by Bogumił Kamiński

is Ref() necessary to make sure that each new column doesn’t get the same empty list ?

Ref is necessary for the broadcasting to work. DataFrames.jl automatically takes care that the column [] is copied and not reused (you could turn this off with copycols=false, but in your case do not do this).
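A small sketch of both points, assuming the default copycols=true behaviour of the DataFrame constructor (the name column_pairs is only for illustration):

```julia
using DataFrames

# Ref([]) makes broadcasting treat the empty vector as a single scalar,
# so all three pairs refer to the very same [] object.
column_pairs = [:c1, :c2, :c3] .=> Ref([])

# The constructor copies each column by default, so every column of df
# ends up with its own independent empty vector.
df = DataFrame(column_pairs)

println(column_pairs[1].second === column_pairs[2].second)  # same object in the pairs
println(df.c1 === df.c2)                                    # distinct objects in df
```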

purplishrock · November 30, 2020, 3:44pm

That’s very funny. I’m not sure I’ve ever used ‘=>’. Is that possible ?! lol.

I do remember the Ref([]) operation now, I had a similar problem where a broadcasted assignment didn’t work without it.

Thank you.

bkamins · November 30, 2020, 3:45pm

The original usage of => is to create Dicts:

julia> Dict(1=>2, 3=>4)
Dict{Int64,Int64} with 2 entries:
  3 => 4
  1 => 2

purplishrock · November 30, 2020, 3:53pm

Great, thank you for taking the time to write that, it’s really useful (and thank you also for, you know, taking the time to write DataFrames !)

Also, shame on me. Julia’s new and improved help mode can tell me what ‘=>’ is (the online documentation doesn’t work well for operators).

help?> =>
search: =>

  Pair(x, y)
  x => y

  Construct a Pair object with type Pair{typeof(x), typeof(y)}. The elements
  are stored in the fields first and second. They can also be accessed via
  iteration (but a Pair is treated as a single "scalar" for broadcasting
  operations).
...

LasseKatten · November 30, 2020, 4:03pm

Very nice blog btw. A lot of useful tips!

If you need any ideas for a future blog post, may I suggest writing about how to optimize performance when using DataFrames?

For example, with the previous pkg version I noticed that the difference in computation time between filter(row -> row.col1 == x, df) and df[df[:col1 ] .== x, :] was very significant. For a particular DataFrame, I measured the average time over 1000 runs, and the results were:

filter(row -> row.col1 == x, df) → 9.1429996 ms
df[df[:col1 ] .== x, :] → 0.0309999 ms

bkamins · November 30, 2020, 4:18pm

This performance difference is documented in the filter docstring. The performant syntax is:

filter(:col1 => ==(x), df)
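For comparison, here is a minimal sketch of the three forms discussed, on made-up data and assuming a recent DataFrames.jl version (so the mask uses df.col1 rather than the deprecated df[:col1]). All three select the same rows; no timings are shown, since they depend on the data and the package version:

```julia
using DataFrames

# Hypothetical example data.
df = DataFrame(col1 = [1, 2, 1, 3], col2 = ["a", "b", "c", "d"])
x = 1

# Row-wise predicate: the function receives each row as a whole.
by_row = filter(row -> row.col1 == x, df)

# Column => predicate: the function receives only the values of col1.
by_column = filter(:col1 => ==(x), df)

# Boolean mask built by broadcasting, then used for row indexing.
by_mask = df[df.col1 .== x, :]
```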
