Tag each unique combination of column values in DataFrames

I have columns c1,c2, …, cN in a dataframe and I would like to create a column c_new that takes a unique numeric value for each unique combination of values of c1,…, cN.

If there a function to achieve this?

(In Stata I would use gen c_new = group(c1-cN))

My clunky code for two columns is:

function egen_group2(mydf::DataFrame, colname1::String, colname2::String, target_colname::String)
	#=
		creates a unique ID for each combination of colname1 and colname2
	=#

	egen_values_dict = Dict()
	col1 = Vector(mydf[:,colname1])
	col2 = Vector(mydf[:,colname2])

	n = length(mydf[:,colname1])
	@assert n == length(mydf[:,colname2])
	target_col = zeros(Int64, n)
	egen_idx_value = 1

        for i=1:n
		key = (col1[i], col2[i])
		if ~haskey(egen_values_dict, key)
			egen_values_dict[key] = egen_idx_value
			egen_idx_value += 1
		end
		target_col[i] = egen_values_dict[key]
	end

	mydf[:,target_colname] = target_col
end

Great question!

I’m not sure I have a great solution, and it looks others have requested the same based on this github issue.

The proposed solution by Bogumil is groupindices)

julia> df = DataFrame(a = rand(1:5, 100), b = rand(11:15, 100));

julia> gd = groupby(df, [:a, :b]);

julia> groupindices(gd)
100-element Vector{Union{Missing, Int64}}:
  7
 17
 17
 22
 17
  7
  7
  6
 11
 11
  ⋮
 12
 10
 14
 19
  1
  9
 20
 12
 18
  7

But as you can see there is still some discussion of adding a convenience function.

2 Likes

Nice! A bit more compactly as:

my_df.c_new = groupby(my_df, [:c1, :c2, :c3]) |> groupindices

One small comment groupby(my_df, [:c1, :c2, :c3]) has an undefined order of the indices. If you pass sort kwarg to groupby you can control how the numbers are assigned.

4 Likes

I probably didn’t get the sense of the request, but this isn’t right?

unique(string.(eachcol(df)...))

Now I think I understand better what the request was.
I tried to read the docs related to the groupeddataframe module and I found this example that explains what has already been proposed.


df = DataFrame(a = rand(1:5, 100), b = rand(11:15, 100))
g=groupby(df,[:a,:b])
select(g, groupindices => :gid)

I wanted to submit a deduction of the searched column made with functions of the Base module, even if a little naive, …

v=tuple.(eachcol(df)...)
sp=sortperm(v)
vsp=v[sp]
starts=[1;[i for i in 2:nrow(df) if vsp[i]!=vsp[i-1]];nrow(df)+1]
gidx=reduce(vcat,[fill(i,n) for (i,n) in enumerate(diff(starts))])
df.new_col=last.(sort(tuple.(sp,gidx), by=first))