I have columns c1,c2, …, cN in a dataframe and I would like to create a column c_new that takes a unique numeric value for each unique combination of values of c1,…, cN.
Is there a function to achieve this?
(In Stata I would use egen c_new = group(c1-cN))
My clunky code for two columns is:
function egen_group2(mydf::DataFrame, colname1::String, colname2::String, target_colname::String)
    #=
    creates a unique ID for each combination of colname1 and colname2
    =#
    egen_values_dict = Dict()
    col1 = Vector(mydf[:, colname1])
    col2 = Vector(mydf[:, colname2])
    n = length(mydf[:, colname1])
    @assert n == length(mydf[:, colname2])
    target_col = zeros(Int64, n)
    egen_idx_value = 1
    for i = 1:n
        key = (col1[i], col2[i])
        if !haskey(egen_values_dict, key)
            egen_values_dict[key] = egen_idx_value
            egen_idx_value += 1
        end
        target_col[i] = egen_values_dict[key]
    end
    mydf[:, target_colname] = target_col
end
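For what it's worth, the same dictionary idea extends to any number of columns. Below is a minimal sketch in plain Base Julia; the name group_ids is made up for illustration, and it assumes the columns are passed as equal-length vectors. IDs are handed out in order of first appearance.

```julia
# Hypothetical helper (not part of any package): one integer ID per distinct
# row-wise combination of values, numbered in order of first appearance.
function group_ids(cols::AbstractVector...)
    n = length(first(cols))
    @assert all(length(c) == n for c in cols) "all columns must have the same length"
    ids = Dict{Any,Int}()            # combination of values => ID
    out = Vector{Int}(undef, n)
    for (i, key) in enumerate(zip(cols...))
        # get! stores the next unused ID the first time a combination is seen
        # and returns the stored ID on every later occurrence
        out[i] = get!(ids, key, length(ids) + 1)
    end
    return out
end

c1 = [2, 1, 2, 1]
c2 = ["b", "a", "b", "b"]
ids = group_ids(c1, c2)
```

With a data frame, the columns could presumably be passed by splatting, e.g. group_ids(eachcol(df[:, [:c1, :c2]])...).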
One small comment: groupby(my_df, [:c1, :c2, :c3]) has an undefined order of the group indices. If you pass the sort keyword argument to groupby, you can control how the numbers are assigned.
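As a rough sketch of what that could look like with DataFrames.jl: the toy data and the column name c_new are made up for illustration, and sort = true is assumed to order the groups by the values of the grouping columns.

```julia
using DataFrames

# Toy data, made up for illustration
df = DataFrame(c1 = [2, 1, 2, 1], c2 = ["b", "a", "b", "b"])

# sort = true orders the groups by the values of the grouping columns,
# so the numbering does not depend on the internal grouping algorithm
gdf = groupby(df, [:c1, :c2]; sort = true)

# groupindices gives, for each row of the parent data frame, the index of its group
df.c_new = groupindices(gdf)
```

Rows with the same (c1, c2) pair get the same number, and the numbers follow the sorted order of the pairs.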
I may have misunderstood the request, but wouldn't this work?
unique(string.(eachcol(df)...))
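One caveat with this line: unique returns the list of distinct combinations, not one ID per row. Concatenating values into strings can also merge combinations that are actually different. A tiny Base-only illustration, with made-up data:

```julia
# Two rows that differ in both columns
c1 = ["a", "ab"]
c2 = ["bc", "c"]

# Concatenation loses the boundary between the columns,
# so both rows turn into the same string
as_strings = string.(c1, c2)

# Tuples keep the values separate, so the rows stay distinct
as_tuples = tuple.(c1, c2)

n_string_keys = length(unique(as_strings))
n_tuple_keys = length(unique(as_tuples))
```

Using tuples as keys avoids the problem.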
Now I think I understand better what the request was.
I read the docs on GroupedDataFrame and found this example, which illustrates what has already been proposed.
I wanted to offer a derivation of the requested column using mostly functions from Base, even if it is a little naive …
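# one tuple per row, holding that row's values across all columns
v = tuple.(eachcol(df)...)
# permutation that sorts the rows, so equal combinations become adjacent
sp = sortperm(v)
vsp = v[sp]
# positions (in sorted order) where a new combination starts, plus an end sentinel
starts = [1; [i for i in 2:nrow(df) if vsp[i] != vsp[i-1]]; nrow(df) + 1]
# group number repeated once for every row in its run
gidx = reduce(vcat, [fill(i, n) for (i, n) in enumerate(diff(starts))])
# pair each group number with its original row index and restore the original order
df.new_col = last.(sort(tuple.(sp, gidx), by = first))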
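The same sort-based idea can be written a bit more compactly. This is only a sketch: the function name sorted_group_ids is made up, it takes plain vectors rather than a data frame, and it uses isequal so that missing values compare equal to each other.

```julia
# Hypothetical helper: group IDs numbered by the sorted order of the combinations.
function sorted_group_ids(cols::AbstractVector...)
    v = collect(zip(cols...))        # one tuple per row
    sp = sortperm(v)                 # permutation that sorts the rows
    vsp = v[sp]
    # true wherever a new combination starts in sorted order
    isnew = [i == 1 || !isequal(vsp[i], vsp[i-1]) for i in eachindex(vsp)]
    gidx = cumsum(isnew)             # running count of combinations seen so far
    return gidx[invperm(sp)]         # map the IDs back to the original row order
end

c1 = [2, 1, 2, 1]
c2 = ["b", "a", "b", "b"]
ids = sorted_group_ids(c1, c2)
```

Here cumsum replaces the starts/diff/fill steps, and indexing with invperm(sp) replaces the final sort by original position.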