Best method for creating many columns in dataframe for dummy variables

Hi,

I’m working on a problem that requires creating a lot of dummy variables (30+) based on dates and I’m wondering if my approach is optimal.

Currently I am creating each variable as a set with the dates that correspond to the variable:

variable_1 = Set([
    Date("2022-05-01"),
    Date("2023-05-03"),
    Date("2024-05-02"),
])

Then I compute a Boolean for whether the date in each row is in the set:

df[:, :variable_1] = in.(df[!, :DATE], Ref(variable_1))

This approach seems fine for a few variables, but it doesn’t really seem scalable. It also doesn’t seem like a great approach for code organization, but that’s a separate issue.

I thought about creating a set of sets or a vector of sets and looping over it to create new columns with the Boolean values, but this doesn’t seem to work as intended:

Variables = [variable_1, variable_2]

for x in Variables
    df[:, x] = in.(df[!, :DATE], Ref(Variables))
end

Also, even if the above approach worked, I don’t think it would give the columns appropriate names. I can’t seem to extract the names from my vector, only the values (e.g. typing Variables[1] gives me the values in variable_1, but does not return the name “variable_1”).

Any help appreciated.

Edit: I figured out the looping bit; this works:

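for vals in Variables
    col_location = ncol(df) + 1  # append after the last existing column
    # makeunique=true renames repeated :auto columns instead of throwing an error
    insertcols!(df, col_location, :auto => in.(df[!, :DATE], Ref(vals)), makeunique=true)
end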

This would save a lot of time. However, I’m still wondering: what is the best way to get appropriate column names here?

Thanks

In contrast to R, you cannot recover the name of a variable that was passed to a function (at least not without writing a macro).
Instead, it’s better to provide the names explicitly in some data structure, e.g., a dictionary:

newcols = Dict(:variable_1 => Set([...]), :variable_2 => Set([...]))
for (col, vals) in pairs(newcols)
    df[:, col] = in.(df[!, :DATE], Ref(vals))
end
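
For reference, here is a minimal end-to-end sketch of the same idea that can be run as is. The sample dates in `df` and the contents of `:variable_2` are made up for illustration; only the `:DATE` column name and the `variable_1` dates come from the question. It uses a vector of `name => Set` pairs instead of a `Dict`, on the assumption that you want the new columns in a fixed order, since a `Dict` does not guarantee iteration order.

```julia
using DataFrames, Dates

# Hypothetical sample data; only the :DATE column name is taken from the question.
df = DataFrame(DATE = [Date("2022-05-01"), Date("2022-06-15"), Date("2023-05-03")])

# Each pair maps a column name to the set of dates that should be flagged.
# A Vector of pairs is iterated in the order written, unlike a Dict.
newcols = [
    :variable_1 => Set([Date("2022-05-01"), Date("2023-05-03"), Date("2024-05-02")]),
    :variable_2 => Set([Date("2022-06-15")]),  # made-up dates for illustration
]

for (col, vals) in newcols
    # Broadcast `in` over the dates; Ref(vals) makes the whole Set act as a scalar.
    df[!, col] = in.(df.DATE, Ref(vals))
end
```

Adding another dummy variable then only means adding one more pair to `newcols`, which also keeps all the date definitions in one place.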