Hi,
I’m working on a problem that requires creating a lot of dummy variables (30+) based on dates and I’m wondering if my approach is optimal.
Currently I am creating each variable as a set with the dates that correspond to the variable:
variable_1 = Set([
    Date("2022-05-01"),
    Date("2023-05-03"),
    Date("2024-05-02"),
])
Then I compute a Boolean column indicating whether the date in each row is in the set:
df[:, :variable_1] = in.(df[!, :DATE], Ref(variable_1))
This approach seems fine for a few variables, but it doesn’t seem scalable. It also doesn’t seem like a great approach for code organization, but that’s a separate issue.
I thought about creating a set of sets or a vector of sets and looping over it to create new columns with the Boolean values, but this doesn’t seem to work as intended:
Variables = [variable_1, variable_2]
for x in Variables
    df[:, x] = in.(df[!, :DATE], Ref(Variables))
end
Also, even if the above approach worked, I don’t think it would give the columns appropriate names. I can’t seem to extract the names from my vector, only the values (e.g. typing Variables[1] only returns the values in variable_1, not the name “variable_1”).
Any help appreciated.
Edit:
I figured out the looping bit, this works:
for x in Variables
    # Append a Boolean column at the end; the name is a placeholder (:auto),
    # and makeunique=true adds a suffix when that name is already taken
    insertcols!(df, ncol(df) + 1, :auto => in.(df[!, :DATE], Ref(x)), makeunique=true)
end
This would save a lot of time. However, I’m still wondering: what is the best way to get appropriate column names here?
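One idea I have not tried on my real data: since a vector of sets loses the variable names, the names could be stored next to the sets, for example in a Dict that maps a Symbol to its Set. Below is a minimal sketch of that idea using only Dates. The names `dates`, `date_sets` and `columns` are made up for illustration, `dates` stands in for df[!, :DATE], and the dates in the second set are invented.

```julia
using Dates

# Hypothetical stand-in for df[!, :DATE]
dates = [Date("2022-05-01"), Date("2022-05-02"), Date("2024-05-02")]

# Hypothetical container: each name is stored next to its set of dates
date_sets = Dict(
    :variable_1 => Set([Date("2022-05-01"), Date("2023-05-03"), Date("2024-05-02")]),
    :variable_2 => Set([Date("2022-05-02")]),  # invented dates, for illustration only
)

# One Boolean vector per name. With a DataFrame, the assumption is that
# df[!, name] = in.(df[!, :DATE], Ref(dateset)) inside a loop would
# create a column carrying that name.
columns = Dict(name => in.(dates, Ref(dateset)) for (name, dateset) in date_sets)
```

A Dict does not guarantee iteration order, so if the column order matters, a vector of name => set pairs might be the better container.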
Thanks