Counts of unique values per group in a DataFrame

Hi,

I’m trying to determine unique counts of values in a column per group in a DataFrame.

As an example, given the following:

using DataFrames

julia> df = DataFrame(lab = [repeat(["Lab1"], 4)...; repeat(["Lab2"], 5)...], value = ['a','a','b','a','a','b','c','c','c'])
9×2 DataFrame
│ Row │ lab    │ value │
│     │ String │ Char  │
├─────┼────────┼───────┤
│ 1   │ Lab1   │ 'a'   │
│ 2   │ Lab1   │ 'a'   │
│ 3   │ Lab1   │ 'b'   │
│ 4   │ Lab1   │ 'a'   │
│ 5   │ Lab2   │ 'a'   │
│ 6   │ Lab2   │ 'b'   │
│ 7   │ Lab2   │ 'c'   │
│ 8   │ Lab2   │ 'c'   │
│ 9   │ Lab2   │ 'c'   │

I would like to get

5×3 DataFrame
│ Row │ lab    │ value │ count │
│     │ String │ Char  │ Int64 │
├─────┼────────┼───────┼───────┤
│ 1   │ Lab1   │ 'a'   │ 3     │
│ 2   │ Lab1   │ 'b'   │ 1     │
│ 3   │ Lab2   │ 'a'   │ 1     │
│ 4   │ Lab2   │ 'b'   │ 1     │
│ 5   │ Lab2   │ 'c'   │ 3     │

I’ve gotten as far as

julia> combine(grouped, :value => (vals -> keys(counter(vals))) => :value, :value => (vals -> values(counter(vals))) => :count)
2×3 DataFrame
│ Row │ lab    │ value           │ count     │
│     │ String │ Base.KeySet…    │ Base.Val… │
├─────┼────────┼─────────────────┼───────────┤
│ 1   │ Lab1   │ ['a', 'b']      │ [3, 1]    │
│ 2   │ Lab2   │ ['a', 'c', 'b'] │ [1, 3, 1] │

However, I would prefer

  1. not to call counter twice
  2. actually split the counts into separate rows

Can someone help?

Thanks!
Kevin

1 Like
julia> combine(groupby(df, [:lab, :value]), nrow => :count)
5×3 DataFrame
│ Row │ lab    │ value │ count │
│     │ String │ Char  │ Int64 │
├─────┼────────┼───────┼───────┤
│ 1   │ Lab1   │ 'a'   │ 3     │
│ 2   │ Lab1   │ 'b'   │ 1     │
│ 3   │ Lab2   │ 'a'   │ 1     │
│ 4   │ Lab2   │ 'b'   │ 1     │
│ 5   │ Lab2   │ 'c'   │ 3     │
2 Likes

See Query.jl solution with a @groupby command that lets you group by multiple columns:

julia> df |> @groupby({_.lab, _.value}) |> @map({lab=key(_)[1], val=key(_)[2], count=length(_)})
5x3 query result
lab  │ val │ count
─────┼─────┼──────
Lab1 │ 'a' │ 3
Lab1 │ 'b' │ 1
Lab2 │ 'a' │ 1
Lab2 │ 'b' │ 1
Lab2 │ 'c' │ 3

In Query, how do would you put this in a function? With dataframes it’s

function combinecount(df, groupvars, newvar)
    @pipe df |> 
        groupby(df, groupvars) |> 
        combine(nrow => newvar)
end