question about creating new columns in data frame from existing columns,



Hi, I’m a little confused on what to do in this coding situation and would appreciate some help!

I’m trying to create new columns in a dataframe (D) as a linear combination of the existing columns in that dataframe, given a vector of coefficients (v).

Basically: If I have a vector of Float64 numbers, v_i, and a 20 column data frame, D, of float64 values, how can I concatenate columns to the data frame D, where I am doing an element wise multiplication:
D[:New] = v_1D_1 + v_2D_2 + v_3D_3 + … v_20D_20 ; where D_i = ith column of D

So far I have started to write the function:

for i in 1:length(v)
D[:New] += v[i]*(D[:column_names_of_D[i]])

But I am unsure how to call the D[:column_names_of_D[i]]) part…


I would collect them as a matrix and then use matrix * vector multiplication, eg

using DataFrames

Return columns that start with `prefix` as a matrix.
function cols2matrix(df::DataFrame, prefix)
    matching_names = sort(filter(name -> startswith(String(name), String(prefix)),
    hcat(getindex.(df, matching_names)...)

# make a dataframe
df = DataFrame(v_1 = randn(10), v_2 = randn(10), v_3 = randn(10))

M = cols2matrix(df, "v_")

M * [1, 2, 3]

But I would suggest that storing this kind of data in a matrix may be best in the first place.


Just a note about the hcat command. Matrix() works just find on DataFrames and is probably more idiomatic.

I also am gonna work on a package that overloads some useful string commands, or writes wrappers for them, for common string operations that would be used on the names of dataframes columns.


Excellent idea, as it allows more modular code:

using DataFrames

colnames_with_prefix(df::DataFrame, prefix) =
    sort(filter(name -> startswith(String(name), String(prefix)), names(df)))

df = DataFrame(v_1 = randn(10), v_2 = randn(10), v_3 = randn(10),
               a = rand(1:20, 10))

Matrix(df[colnames_with_prefix(df, :v_)]) * [1, 2, 3]


Thanks Tamas! :slight_smile: