How to count the number of categories present in a column of a DataFrame


In a DataFrame with columns :A, :B, :C, where :B contains categorical data, I would like to count the number of categories present in each of the sub-DataFrames obtained through groupby(df, :A), and finally get the list of codes in column :A where more than one category is present.

Thank you very much for any advice,

Welcome! Can you post a sample dataframe? It’ll be easier to help you then.

Here is an example:
Starting with the following data
using DataFrames
df = DataFrame(A = ["a","a","a","b","b","b","c","c"],
               B = ["X","Y","Z","Y","Y","Z","X","X"],
               C = [2,3,5,2,10,7,5,1])
For each value in column A, I would like to count the number of distinct values in column B. The result would be something like:
a 3
b 2
c 1
It could be tuples, arrays, a dictionary…

Hope this is clearer!
Thanks for your attention.

Try something like

julia> combine(groupby(df, :A), :B=>length∘unique)
3×2 DataFrame
│ Row │ A      │ B_function │
│     │ String │ Int64      │
│ 1   │ a      │ 3          │
│ 2   │ b      │ 2          │
│ 3   │ c      │ 1          │
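
The original question also asked for the list of codes in :A with more than one category. A minimal sketch of that last step, assuming DataFrames.jl is installed; the column name `n_categories` and the variable names are illustrative choices, not something from the thread:

```julia
using DataFrames

df = DataFrame(A = ["a","a","a","b","b","b","c","c"],
               B = ["X","Y","Z","Y","Y","Z","X","X"],
               C = [2,3,5,2,10,7,5,1])

# Count the distinct B values per group of A.
# `n_categories` is an illustrative output column name.
counts = combine(groupby(df, :A), :B => (length ∘ unique) => :n_categories)

# Keep only the codes in A that have more than one distinct B value.
multi = counts[counts.n_categories .> 1, :A]
```

With the sample data, `multi` should contain "a" and "b", since "c" only ever appears with "X".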

edit - sorry was thinking of a different API

You could also try a countmap (API · OnlineStats Docs) or, if the data is small, call "unique" on the requisite column and get the length.
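
For small data, the same counts can be obtained without any package. The following is only a sketch using Base Julia; the Dict-of-Sets layout and the variable names are assumptions made for illustration, not something suggested in the thread:

```julia
A = ["a","a","a","b","b","b","c","c"]
B = ["X","Y","Z","Y","Y","Z","X","X"]

# Map each code in A to the set of distinct B values seen with it.
seen = Dict{String, Set{String}}()
for (a, b) in zip(A, B)
    push!(get!(seen, a, Set{String}()), b)
end

# Number of distinct categories per code.
ncat = Dict(a => length(s) for (a, s) in seen)

# Codes with more than one category, sorted for a stable order.
multi = sort([a for (a, n) in ncat if n > 1])
```

A dictionary matches the "could be a dictionary" form asked for in the question, at the cost of losing the row order of the original table.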

Thanks a lot, simple, elegant, efficient!

this is beautiful

Hi, what is the dot between length and unique? How do I type it?

It is function composition; type it with \circ<tab>. You can always copy and paste a Unicode symbol into the REPL help mode to find out how to type it.
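
To make the composition concrete, here is a small sketch in Base Julia; the names `x`, `count_unique`, and `count_unique_ascii` are made up for the example:

```julia
x = ["Y", "Y", "Z"]

# `f ∘ g` builds a new function that applies g first, then f.
count_unique = length ∘ unique

# The composed function gives the same result as nesting the calls by hand.
same = count_unique(x) == length(unique(x))

# ASCII-only alternative: an anonymous function.
count_unique_ascii = v -> length(unique(v))
```

Both forms can be passed to combine in the same way, so the anonymous function is an option if typing ∘ is inconvenient.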