Recoding variables and counting and removing duplicate rows in dataframes

Hi all-

I seeking solutions for two problems I have encountered while manipulating dataframes. I’ve been struggling to think of a good solution. Although I can probably duct tape some solutions, I was wondering whether there are utilities for these operations or elegant solutions.

Problem 1

I need to recode multiple columns in a dataframe into a variable such that each unique combination of variables in the old columns is assigned a new unique value in the variable. In the following example, unique combinations of values in columns a and b are recoded into the column new_indicator:

using DataFrames

df = DataFrame(a=[1,1,2,2,1],b=[1,2,1,2,1],new_indicator=[1,2,3,4,1])

Output:

5Γ—3 DataFrame
β”‚ Row β”‚ a     β”‚ b     β”‚ new_indicator β”‚
β”‚     β”‚ Int64 β”‚ Int64 β”‚ Int64         β”‚
β”œβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 1   β”‚ 1     β”‚ 1     β”‚ 1             β”‚
β”‚ 2   β”‚ 1     β”‚ 2     β”‚ 2             β”‚
β”‚ 3   β”‚ 2     β”‚ 1     β”‚ 3             β”‚
β”‚ 4   β”‚ 2     β”‚ 2     β”‚ 4             β”‚
β”‚ 5   β”‚ 1     β”‚ 1     β”‚ 1             β”‚

Problem 2

I have a second problem in which I want to remove duplicate rows (defined by a set of columns) and create a new column for the number of duplicates. Here is an example:

Current data

using DataFrames

df = DataFrame(a=[1,1,2,2,3,3],b=[1,1,2,2,1,1],c=[1,1,2,2,1,2])
6Γ—3 DataFrame
β”‚ Row β”‚ a     β”‚ b     β”‚ c     β”‚
β”‚     β”‚ Int64 β”‚ Int64 β”‚ Int64 β”‚
β”œβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 1   β”‚ 1     β”‚ 1     β”‚ 1     β”‚
β”‚ 2   β”‚ 1     β”‚ 1     β”‚ 1     β”‚
β”‚ 3   β”‚ 2     β”‚ 2     β”‚ 2     β”‚
β”‚ 4   β”‚ 2     β”‚ 2     β”‚ 2     β”‚
β”‚ 5   β”‚ 3     β”‚ 1     β”‚ 1     β”‚
β”‚ 6   β”‚ 3     β”‚ 1     β”‚ 2     β”‚

desired data:

df = DataFrame(a=[1,2,3,3],b=[1,2,1,1],c=[1,2,1,2],counts=[2,2,1,1])

4Γ—4 DataFrame
β”‚ Row β”‚ a     β”‚ b     β”‚ c     β”‚ counts β”‚
β”‚     β”‚ Int64 β”‚ Int64 β”‚ Int64 β”‚ Int64  β”‚
β”œβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 1   β”‚ 1     β”‚ 1     β”‚ 1     β”‚ 2      β”‚
β”‚ 2   β”‚ 2     β”‚ 2     β”‚ 2     β”‚ 2      β”‚
β”‚ 3   β”‚ 3     β”‚ 1     β”‚ 1     β”‚ 1      β”‚
β”‚ 4   β”‚ 3     β”‚ 1     β”‚ 2     β”‚ 1      β”‚

Thanks in advance.

Problem 1 could by done by numbering the rows of a grouped dataframe and then joining:

df = DataFrame(a=[1,1,2,2,1],b=[1,2,1,2,1])

newdf = DataFrame([g[1, :] for g in groupby(df, [:a, :b])])
newdf[!, :new_indicator] = 1:nrow(newdf)

join(df, newdf, on = [:a, :b], kind = :left)

This probably isn’t very performant though - I’ve done something similar before based on keeping a dictionary which has the combinations of a and b as keys and increments a counter as the value for each key looking through the dataframe row by row. Something like this:

function counter(v::AbstractVector{T}) where T
    d = Dict{T, Int}()
    return [d[el] = get(d, el, 0) + 1 for el in v]
end

where you could generate a column based on the combination of a and b that you then count.

Problem 2 is a simple groupby with length:

df = DataFrame(a=[1,1,2,2,3,3],b=[1,1,2,2,1,1],c=[1,1,2,2,1,2])
by(df, [:a, :b, :c], count = :a => length)
3 Likes

Thank you. I came up with a similar solution for the first problem, but your second solution is much more elegant… and I’m surprised I didn’t make the connection to the examples in the docs. I figured at least one of these was a simple fix. Thanks again!

1 Like

This seems to be quite a bit more performant than the above code for problem #1:

df = DataFrame(a=[1,1,2,2,1],b=[1,2,1,2,1])
n = unique([(row.a, row.b) for row in eachrow(df)])
df.new_indicator = [findfirst(x -> (row.a, row.b) == x, n) for row in eachrow(df)]

julia> df
5Γ—3 DataFrame
β”‚ Row β”‚ a     β”‚ b     β”‚ new_indicator β”‚
β”‚     β”‚ Int64 β”‚ Int64 β”‚ Int64         β”‚
β”œβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 1   β”‚ 1     β”‚ 1     β”‚ 1             β”‚
β”‚ 2   β”‚ 1     β”‚ 2     β”‚ 2             β”‚
β”‚ 3   β”‚ 2     β”‚ 1     β”‚ 3             β”‚
β”‚ 4   β”‚ 2     β”‚ 2     β”‚ 4             β”‚
β”‚ 5   β”‚ 1     β”‚ 1     β”‚ 1             β”‚

And the benchmarking:

julia> @btime begin
           n = unique([(row.a, row.b) for row in eachrow(df)])
           df.new_indicator = [findfirst(x -> (row.a, row.b) == x, n) for row in eachrow(df)]
       end
  4.900 ΞΌs (57 allocations: 2.42 KiB)

julia> @btime begin
           newdf = DataFrame([g[1, :] for g in groupby(df, [:a, :b])])
           newdf[!, :new_indicator] = 1:nrow(newdf)
           join(df, newdf, on = [:a, :b], kind = :left)
       end
  32.800 ΞΌs (306 allocations: 20.92 KiB)
1 Like