# Recoding variables and counting and removing duplicate rows in dataframes

Hi all-

I seeking solutions for two problems I have encountered while manipulating dataframes. Iβve been struggling to think of a good solution. Although I can probably duct tape some solutions, I was wondering whether there are utilities for these operations or elegant solutions.

Problem 1

I need to recode multiple columns in a dataframe into a variable such that each unique combination of variables in the old columns is assigned a new unique value in the variable. In the following example, unique combinations of values in columns a and b are recoded into the column new_indicator:

``````using DataFrames

df = DataFrame(a=[1,1,2,2,1],b=[1,2,1,2,1],new_indicator=[1,2,3,4,1])
``````

Output:

``````5Γ3 DataFrame
β Row β a     β b     β new_indicator β
β     β Int64 β Int64 β Int64         β
βββββββΌββββββββΌββββββββΌββββββββββββββββ€
β 1   β 1     β 1     β 1             β
β 2   β 1     β 2     β 2             β
β 3   β 2     β 1     β 3             β
β 4   β 2     β 2     β 4             β
β 5   β 1     β 1     β 1             β
``````

Problem 2

I have a second problem in which I want to remove duplicate rows (defined by a set of columns) and create a new column for the number of duplicates. Here is an example:

Current data

``````using DataFrames

df = DataFrame(a=[1,1,2,2,3,3],b=[1,1,2,2,1,1],c=[1,1,2,2,1,2])
6Γ3 DataFrame
β Row β a     β b     β c     β
β     β Int64 β Int64 β Int64 β
βββββββΌββββββββΌββββββββΌββββββββ€
β 1   β 1     β 1     β 1     β
β 2   β 1     β 1     β 1     β
β 3   β 2     β 2     β 2     β
β 4   β 2     β 2     β 2     β
β 5   β 3     β 1     β 1     β
β 6   β 3     β 1     β 2     β
``````

desired data:

``````df = DataFrame(a=[1,2,3,3],b=[1,2,1,1],c=[1,2,1,2],counts=[2,2,1,1])

4Γ4 DataFrame
β Row β a     β b     β c     β counts β
β     β Int64 β Int64 β Int64 β Int64  β
βββββββΌββββββββΌββββββββΌββββββββΌβββββββββ€
β 1   β 1     β 1     β 1     β 2      β
β 2   β 2     β 2     β 2     β 2      β
β 3   β 3     β 1     β 1     β 1      β
β 4   β 3     β 1     β 2     β 1      β
``````

Problem 1 could by done by numbering the rows of a grouped dataframe and then joining:

``````df = DataFrame(a=[1,1,2,2,1],b=[1,2,1,2,1])

newdf = DataFrame([g[1, :] for g in groupby(df, [:a, :b])])
newdf[!, :new_indicator] = 1:nrow(newdf)

join(df, newdf, on = [:a, :b], kind = :left)
``````

This probably isnβt very performant though - Iβve done something similar before based on keeping a dictionary which has the combinations of `a` and `b` as keys and increments a counter as the value for each key looking through the dataframe row by row. Something like this:

``````function counter(v::AbstractVector{T}) where T
d = Dict{T, Int}()
return [d[el] = get(d, el, 0) + 1 for el in v]
end
``````

where you could generate a column based on the combination of `a` and `b` that you then count.

Problem 2 is a simple groupby with `length`:

``````df = DataFrame(a=[1,1,2,2,3,3],b=[1,1,2,2,1,1],c=[1,1,2,2,1,2])
by(df, [:a, :b, :c], count = :a => length)
``````
1 Like

Thank you. I came up with a similar solution for the first problem, but your second solution is much more elegantβ¦ and Iβm surprised I didnβt make the connection to the examples in the docs. I figured at least one of these was a simple fix. Thanks again!

1 Like

This seems to be quite a bit more performant than the above code for problem #1:

``````df = DataFrame(a=[1,1,2,2,1],b=[1,2,1,2,1])
n = unique([(row.a, row.b) for row in eachrow(df)])
df.new_indicator = [findfirst(x -> (row.a, row.b) == x, n) for row in eachrow(df)]

julia> df
5Γ3 DataFrame
β Row β a     β b     β new_indicator β
β     β Int64 β Int64 β Int64         β
βββββββΌββββββββΌββββββββΌββββββββββββββββ€
β 1   β 1     β 1     β 1             β
β 2   β 1     β 2     β 2             β
β 3   β 2     β 1     β 3             β
β 4   β 2     β 2     β 4             β
β 5   β 1     β 1     β 1             β
``````

And the benchmarking:

``````julia> @btime begin
n = unique([(row.a, row.b) for row in eachrow(df)])
df.new_indicator = [findfirst(x -> (row.a, row.b) == x, n) for row in eachrow(df)]
end
4.900 ΞΌs (57 allocations: 2.42 KiB)

julia> @btime begin
newdf = DataFrame([g[1, :] for g in groupby(df, [:a, :b])])
newdf[!, :new_indicator] = 1:nrow(newdf)
join(df, newdf, on = [:a, :b], kind = :left)
end
32.800 ΞΌs (306 allocations: 20.92 KiB)
``````
1 Like