Tag each unique combination of column values in DataFrames

Gabriel_Kreindler · February 11, 2022, 2:39am

I have columns c1,c2, …, cN in a dataframe and I would like to create a column c_new that takes a unique numeric value for each unique combination of values of c1,…, cN.

If there a function to achieve this?

(In Stata I would use gen c_new = group(c1-cN))

My clunky code for two columns is:

function egen_group2(mydf::DataFrame, colname1::String, colname2::String, target_colname::String)
	#=
		creates a unique ID for each combination of colname1 and colname2
	=#

	egen_values_dict = Dict()
	col1 = Vector(mydf[:,colname1])
	col2 = Vector(mydf[:,colname2])

	n = length(mydf[:,colname1])
	@assert n == length(mydf[:,colname2])
	target_col = zeros(Int64, n)
	egen_idx_value = 1

        for i=1:n
		key = (col1[i], col2[i])
		if ~haskey(egen_values_dict, key)
			egen_values_dict[key] = egen_idx_value
			egen_idx_value += 1
		end
		target_col[i] = egen_values_dict[key]
	end

	mydf[:,target_colname] = target_col
end

pdeffebach · February 11, 2022, 2:48am

Great question!

I’m not sure I have a great solution, and it looks others have requested the same based on this github issue.

The proposed solution by Bogumil is groupindices)

julia> df = DataFrame(a = rand(1:5, 100), b = rand(11:15, 100));

julia> gd = groupby(df, [:a, :b]);

julia> groupindices(gd)
100-element Vector{Union{Missing, Int64}}:
  7
 17
 17
 22
 17
  7
  7
  6
 11
 11
  ⋮
 12
 10
 14
 19
  1
  9
 20
 12
 18
  7

But as you can see there is still some discussion of adding a convenience function.

Gabriel_Kreindler · February 11, 2022, 3:21am

Nice! A bit more compactly as:

my_df.c_new = groupby(my_df, [:c1, :c2, :c3]) |> groupindices

bkamins · February 11, 2022, 6:36am

One small comment groupby(my_df, [:c1, :c2, :c3]) has an undefined order of the indices. If you pass sort kwarg to groupby you can control how the numbers are assigned.

rocco_sprmnt21 · February 23, 2022, 2:59pm

I probably didn’t get the sense of the request, but this isn’t right?

unique(string.(eachcol(df)...))

Now I think I understand better what the request was.
I tried to read the docs related to the groupeddataframe module and I found this example that explains what has already been proposed.


df = DataFrame(a = rand(1:5, 100), b = rand(11:15, 100))
g=groupby(df,[:a,:b])
select(g, groupindices => :gid)

rocco_sprmnt21 · February 23, 2022, 9:25pm

I wanted to submit a deduction of the searched column made with functions of the Base module, even if a little naive, …

v=tuple.(eachcol(df)...)
sp=sortperm(v)
vsp=v[sp]
starts=[1;[i for i in 2:nrow(df) if vsp[i]!=vsp[i-1]];nrow(df)+1]
gidx=reduce(vcat,[fill(i,n) for (i,n) in enumerate(diff(starts))])
df.new_col=last.(sort(tuple.(sp,gidx), by=first))

Topic		Replies	Views
Tagging values unique to a group in a dataframe Data	2	164	April 27, 2023
If I have a DataFrame and want to get `unique` columns but treat each column as a General Usage	1	860	February 23, 2021
Reduce DataFrame by unique value General Usage dataframes	5	1036	October 14, 2022
Generating group id column in DataFrames? Data dataframes	17	1307	July 19, 2022
Counts of unique values per group in a DataFrame Data question , dataframes	3	10124	May 25, 2020

Tag each unique combination of column values in DataFrames

Related topics