Concatenate DataFrame columns dynamically

Hello, amazing Julia community! :smiley:

I’m stumped this morning on what I thought would be a fairly trivial task: I am trying to write a function that creates a new column in a DataFrame, the values of which are simply the concatenation of some other columns in the DataFrame. If I know the columns upfront, it is indeed a very trivial task. However, I want this function to take a DataFrame as one argument and then the column numbers to concatenate as another…something like this:

function add_column(df::DataFrame, colname::Symbol, cols_to_concat::Vector{Integer})
    df[!,colname] = df[!, 1] .* df[!, 2] .* df[!, 3]
end

# OR 

function add_column(df::DataFrame, colname::Symbol, cols_to_concat::Vector{Integer})
    df[!,colname] = ["$(row[cols_to_concat[1]])$(row[cols_to_concat[2]])" for row in eachrow(df)]
end

I’m thinking there may be a metaprogramming solution to this? It seems to me that the issue is figuring out how to dynamically construct this piece df[!, 1] .* df[!, 2] .* df[!, 3] or the alternative version in the second function. I think I can dynamically construct the df[!, n] pieces fairly easily, like this:

dfcols = [Symbol("df[!, $n]") for n in cols_to_concat]

but I’m not sure then how to piece them together with the .* in between…

It looks like I can do this:

cols_to_concat = [1,3,5]
dfcols = [i == length(cols_to_concat) ? Symbol("df[!, $n]") : Symbol("df[!, $n] .* ") for (i, n) in enumerate(cols_to_concat)]

but I am still unable to figure out the last piece. I tried:

@eval dfcols...

but that doesn’t work. If I simply print it out, I get the desired output:

julia> println(dfcols...)
df[!, 1] .* df[!, 3] .* df[!, 5]

I’m probably misunderstanding the question, but does this help:

using DataFrames

df = DataFrame(a = ["aa" for _ in 1:10], b = ["bb" for _ in 1:10], 
               c = ["cc" for _ in 1:10], d = ["dd" for _ in 1:10])


function add_column(df::DataFrame, colname::Symbol, cols_to_concat::Vector)
   df[!,colname] = (df[!, names(df)[cols_to_concat[1]]] 
                .* df[!, names(df)[cols_to_concat[2]]] 
                .* df[!, names(df)[cols_to_concat[3]]])
end

add_column(df, :add, [1, 4, 4])

Output:

10Γ—5 DataFrame
β”‚ Row β”‚ a      β”‚ b      β”‚ c      β”‚ d      β”‚ add    β”‚
β”‚     β”‚ String β”‚ String β”‚ String β”‚ String β”‚ String β”‚
β”œβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 1   β”‚ aa     β”‚ bb     β”‚ cc     β”‚ dd     β”‚ aadddd β”‚
β”‚ 2   β”‚ aa     β”‚ bb     β”‚ cc     β”‚ dd     β”‚ aadddd β”‚
β”‚ 3   β”‚ aa     β”‚ bb     β”‚ cc     β”‚ dd     β”‚ aadddd β”‚
β”‚ 4   β”‚ aa     β”‚ bb     β”‚ cc     β”‚ dd     β”‚ aadddd β”‚
β”‚ 5   β”‚ aa     β”‚ bb     β”‚ cc     β”‚ dd     β”‚ aadddd β”‚
β”‚ 6   β”‚ aa     β”‚ bb     β”‚ cc     β”‚ dd     β”‚ aadddd β”‚
β”‚ 7   β”‚ aa     β”‚ bb     β”‚ cc     β”‚ dd     β”‚ aadddd β”‚
β”‚ 8   β”‚ aa     β”‚ bb     β”‚ cc     β”‚ dd     β”‚ aadddd β”‚
β”‚ 9   β”‚ aa     β”‚ bb     β”‚ cc     β”‚ dd     β”‚ aadddd β”‚
β”‚ 10  β”‚ aa     β”‚ bb     β”‚ cc     β”‚ dd     β”‚ aadddd β”‚

Hi! I need this to work though when there are more or less than 3 columns. Your function will concatenate the first three columns that are given in the cols_to_concat vector, but I need to write the function so that the cols_to_concat vector could contain 1, 3, 12, etc., columns that need to be concatenated…

I tried this, but it doesn’t work:

function add_column(df::DataFrame, colname::Symbol, cols_to_concat::Vector{Int64})
    dfcols = [i == length(cols_to_concat) ? Symbol("df[!, $n]") : Symbol("df[!, $n] .* ") for (i, n) in enumerate(cols_to_concat)]
    left_side = Symbol("$(df).$(colname)")
    right_side = Symbol(string(dfcols...))
    @eval $left_side = $right_side
    return df
end

julia> add_column(df1, :new, [1,3,5,6])
ERROR: UndefVarError: df[!, 1] .* df[!, 3] .* df[!, 5] .* df[!, 6] not defined
Stacktrace:
 [1] top-level scope at REPL[20]:1
 [2] eval at .\boot.jl:330 [inlined]
 [3] add_column(::DataFrame, ::Symbol, ::Array{Int64,1}) at .\REPL[18]:5
 [4] top-level scope at REPL[20]:1

Not perfect, but you will get the idea:

function add_column(df::DataFrame, colname::Symbol, cols_to_concat::Vector)
	exstring = "df[!, :" * string(colname) *" ] = ("
	exstring *=  "df[!, names(df)["*string(cols_to_concat[1])*"]]"
	for col in 2:length(cols_to_concat)
		exstring *=  ".* df[!, names(df)["*string(cols_to_concat[col])*"]]"
	end
	exstring *= ")"
	eval(Meta.parse(exstring))
end
1 Like

using eval and parse is way overthinking this.

function add_column!(df::DataFrame, colname::Symbol, cols_to_concat::Vector)
	t = df[!, cols_to_concat[1]]
	for col in cols_to_concat[2:end]
		t = t .* df[!, col]
	end
	df[!, colname] = t
	return df
end

You could also use reduce for this.

function add_column2!(df::DataFrame, colname::Symbol, cols_to_concat::Vector)
	function concat(x, y)
		x .* y
	end
	df[!, colname] =reduce(concat, eachcol(df[!, cols_to_concat]))
	return df
end
1 Like

using eval and parse is way overthinking this.

But it’s so cool!!! :stuck_out_tongue_closed_eyes:

Seriously though, thanks so much for this brilliantly simple solution. Let me see if I can walk through how it actually works:

function add_column!(df::DataFrame, colname::Symbol, cols_to_concat::Vector)
	t = df[!, cols_to_concat[1]]             #1
	for col in cols_to_concat[2:end]         #2
		t = t .* df[!, col]                  #3
	end                                      #4
	df[!, colname] = t                       #5
	return df                                #6
end

On line 1, we are grabbing column number cols_to_concat[1] (whatever it may be) and assigning it to a variable t. Then, on lines 2 - 4, we are looping through the rest of the columns in cols_to_concat and updating the value of t on each iteration of the loop by concatenating an additional column until we arrive at the end. Finally, on lines 5 and 6 we create the new column in df, assign it the final value of t and return df…why didn’t I think of that?! :upside_down_face:

Basically, I was trying to figure out how to do it all at once when there was no need - your way simply concatenates the columns in an iterative fashion :+1::+1:

I’m glad I could help. You have the right interpretation.

I think it might be the case that the second solution I posted is faster. But the compiler might be smart and treat the two functions as equal.

Awesome, thanks again!

Just a general note here because I think this is a good example for a broader point. If you strip away everything about the text being stored in a DataFrame, dynamic column selection, etc… and just imagine that each element was stored in the same vector of strings (a much simpler problem), you would likely arrive immediately at the solution. This is what is nice about Julia, no need to get fancy, even when the problem looks fancy.

text = ["aa"; "bb"; "cc"]
function replconcat(str)
    final = str[1]
    for i=2:length(str)
        final = final .* str[i]
    end
    final
end
replconcat(text)

So to apply it to your problem just map appropriately, which is to think about doing this for each row of a DataFrame, which is made easier by the fact that the operation does not necessarily have to be done with a loop as @pdeffebach showed.