Recoding variables and counting and removing duplicate rows in dataframes

Christopher_Fisher · March 12, 2020, 4:24pm

Hi all-

I am seeking solutions for two problems I have encountered while manipulating dataframes. I’ve been struggling to think of a good solution. Although I can probably duct-tape something together, I was wondering whether there are utilities for these operations or more elegant solutions.

Problem 1

I need to recode multiple columns in a dataframe into a single new variable such that each unique combination of values in the old columns is assigned a unique value in the new variable. In the following example, unique combinations of values in columns a and b are recoded into the column new_indicator:

using DataFrames

df = DataFrame(a=[1,1,2,2,1],b=[1,2,1,2,1],new_indicator=[1,2,3,4,1])

Output:

5×3 DataFrame
│ Row │ a     │ b     │ new_indicator │
│     │ Int64 │ Int64 │ Int64         │
├─────┼───────┼───────┼───────────────┤
│ 1   │ 1     │ 1     │ 1             │
│ 2   │ 1     │ 2     │ 2             │
│ 3   │ 2     │ 1     │ 3             │
│ 4   │ 2     │ 2     │ 4             │
│ 5   │ 1     │ 1     │ 1             │

Problem 2

I have a second problem in which I want to remove duplicate rows (defined by a set of columns) and create a new column for the number of duplicates. Here is an example:

Current data

using DataFrames

df = DataFrame(a=[1,1,2,2,3,3],b=[1,1,2,2,1,1],c=[1,1,2,2,1,2])
6×3 DataFrame
│ Row │ a     │ b     │ c     │
│     │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1   │ 1     │ 1     │ 1     │
│ 2   │ 1     │ 1     │ 1     │
│ 3   │ 2     │ 2     │ 2     │
│ 4   │ 2     │ 2     │ 2     │
│ 5   │ 3     │ 1     │ 1     │
│ 6   │ 3     │ 1     │ 2     │

Desired data

df = DataFrame(a=[1,2,3,3],b=[1,2,1,1],c=[1,2,1,2],counts=[2,2,1,1])

4×4 DataFrame
│ Row │ a     │ b     │ c     │ counts │
│     │ Int64 │ Int64 │ Int64 │ Int64  │
├─────┼───────┼───────┼───────┼────────┤
│ 1   │ 1     │ 1     │ 1     │ 2      │
│ 2   │ 2     │ 2     │ 2     │ 2      │
│ 3   │ 3     │ 1     │ 1     │ 1      │
│ 4   │ 3     │ 1     │ 2     │ 1      │

Thanks in advance.

nilshg · March 12, 2020, 4:46pm

Problem 1 could be done by numbering the rows of a grouped dataframe and then joining:

df = DataFrame(a=[1,1,2,2,1],b=[1,2,1,2,1])

newdf = DataFrame([g[1, :] for g in groupby(df, [:a, :b])])
newdf[!, :new_indicator] = 1:nrow(newdf)

join(df, newdf, on = [:a, :b], kind = :left)
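For readers on a newer DataFrames.jl release, where `join(...; kind = :left)` has been replaced by `leftjoin`, the same group numbering can be sketched without any join. This is a minimal sketch, assuming DataFrames.jl 1.x; the numbers follow whatever group order `groupby` produces, which is not guaranteed to be order of first appearance.

```julia
using DataFrames

df = DataFrame(a=[1,1,2,2,1], b=[1,2,1,2,1])

# groupindices returns, for every row of the parent dataframe,
# the number of the group that row belongs to.
ids = groupindices(groupby(df, [:a, :b]))

# Store the group numbers as the recoded column.
df.new_indicator = ids
```

Rows that share the same (a, b) combination receive the same number, so rows 1 and 5 end up with the same indicator.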

This probably isn’t very performant, though. I’ve done something similar before by keeping a dictionary that has the combinations of a and b as keys and, while looping through the dataframe row by row, increments a counter stored as the value for each key. Something like this:

function counter(v::AbstractVector{T}) where T
    d = Dict{T, Int}()
    return [d[el] = get(d, el, 0) + 1 for el in v]
end

where you could generate a column based on the combination of a and b that you then count.
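Note that `counter` as written returns a running occurrence count for each element (for the five (a, b) pairs in the example it gives 1, 1, 1, 1, 2) rather than a label per combination. A variant of the same dictionary idea that yields the indicator from Problem 1 might look like the sketch below. The name `first_seen_ids` is hypothetical, and the code uses only Base Julia.

```julia
# Hypothetical helper: give each distinct element an integer id
# in order of first appearance.
function first_seen_ids(v::AbstractVector{T}) where T
    ids = Dict{T, Int}()
    # get! stores length(ids) + 1 the first time a key is seen
    # and returns the already stored id on later occurrences.
    return [get!(ids, el, length(ids) + 1) for el in v]
end

a = [1, 1, 2, 2, 1]
b = [1, 2, 1, 2, 1]

# Combine the two columns into one vector of tuples, then recode it.
new_indicator = first_seen_ids(collect(zip(a, b)))
# new_indicator == [1, 2, 3, 4, 1]
```

Each element costs one dictionary lookup, so the work grows linearly with the number of rows.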

Problem 2 is a simple groupby with length:

df = DataFrame(a=[1,1,2,2,3,3],b=[1,1,2,2,1,1],c=[1,1,2,2,1,2])
by(df, [:a, :b, :c], count = :a => length)
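`by` has since been deprecated and removed from DataFrames.jl. On a recent release the same count can be written with `groupby` and `combine`. A minimal sketch, assuming DataFrames.jl 1.x; the column name `counts` is taken from the desired output in the question.

```julia
using DataFrames

df = DataFrame(a=[1,1,2,2,3,3], b=[1,1,2,2,1,1], c=[1,1,2,2,1,2])

# One output row per distinct (a, b, c) combination;
# nrow counts how many original rows fall into each group.
counts_df = combine(groupby(df, [:a, :b, :c]), nrow => :counts)
```

The result has one row per unique combination, so the duplicates are removed and counted in a single step.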

Christopher_Fisher · March 12, 2020, 5:01pm

Thank you. I came up with a similar solution for the first problem, but your second solution is much more elegant… and I’m surprised I didn’t make the connection to the examples in the docs. I figured at least one of these was a simple fix. Thanks again!

mthelm85 · March 12, 2020, 7:29pm

This seems to be quite a bit more performant than the above code for problem #1:

df = DataFrame(a=[1,1,2,2,1],b=[1,2,1,2,1])
n = unique([(row.a, row.b) for row in eachrow(df)])
df.new_indicator = [findfirst(x -> (row.a, row.b) == x, n) for row in eachrow(df)]

julia> df
5×3 DataFrame
│ Row │ a     │ b     │ new_indicator │
│     │ Int64 │ Int64 │ Int64         │
├─────┼───────┼───────┼───────────────┤
│ 1   │ 1     │ 1     │ 1             │
│ 2   │ 1     │ 2     │ 2             │
│ 3   │ 2     │ 1     │ 3             │
│ 4   │ 2     │ 2     │ 4             │
│ 5   │ 1     │ 1     │ 1             │

And the benchmarking:

julia> @btime begin
           n = unique([(row.a, row.b) for row in eachrow(df)])
           df.new_indicator = [findfirst(x -> (row.a, row.b) == x, n) for row in eachrow(df)]
       end
  4.900 μs (57 allocations: 2.42 KiB)

julia> @btime begin
           newdf = DataFrame([g[1, :] for g in groupby(df, [:a, :b])])
           newdf[!, :new_indicator] = 1:nrow(newdf)
           join(df, newdf, on = [:a, :b], kind = :left)
       end
  32.800 μs (306 allocations: 20.92 KiB)
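These timings come from a five-row dataframe, so they may not reflect behavior on larger data. One way to make such comparisons easier to repeat is to wrap each approach in a non-mutating function and interpolate the dataframe with `$`, as BenchmarkTools recommends for global variables. The sketch below assumes BenchmarkTools.jl is installed; the function name `indicator_findfirst` is hypothetical.

```julia
using DataFrames, BenchmarkTools

df = DataFrame(a=[1,1,2,2,1], b=[1,2,1,2,1])

# Non-mutating version of the findfirst approach, so every
# benchmark sample sees the same two-column input.
function indicator_findfirst(df)
    combos = unique([(row.a, row.b) for row in eachrow(df)])
    return [findfirst(==((row.a, row.b)), combos) for row in eachrow(df)]
end

# $df interpolates the global so that global-variable access is not timed.
@btime indicator_findfirst($df);
```

Because `findfirst` scans the list of unique combinations for every row, the cost grows with both the number of rows and the number of distinct combinations.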
