Split-Apply-Combine on many columns at once? Looking for equivalent to Stata's collapse

Hi all,

I’ve got a dataset with multiple rows per individual that I would like to collapse to one row per individual. I want to retain several kinds of information for each individual: some columns (e.g. sex, race) are static across that individual's observations, for others I need the maximum value within each individual, for others the mean, etc. The upshot is that I need to collapse a bunch of different columns at once, applying a different operation to each, some numerical and others categorical. I would normally do this in Stata with something like:

collapse (mean) test_score (max) wins_scholarship (first) gender race, by(id) 

Still trying to get the hang of split-apply-combine, so this may be obvious, but I would appreciate any help.

It's fairly similar. Here is the syntax for plain DataFrames.jl:

combine(groupby(df,:id), :test_score => mean, :wins_scholarship => maximum, [:gender,:race] .=> first)

The output names will be autogenerated, e.g. :test_score_mean. You can append => :newcolname to any of those pairs to name the result manually.
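For anyone following along, here is a minimal sketch of the renaming form. The toy data is made up for illustration, with column names borrowed from the Stata example in the question, and :avg_score is just a hypothetical output name.

```julia
using DataFrames, Statistics

# Hypothetical toy data; column names follow the Stata example above.
df = DataFrame(
    id = [1, 1, 2, 2],
    test_score = [80.0, 90.0, 70.0, 75.0],
    wins_scholarship = [0, 1, 0, 0],
    gender = ["F", "F", "M", "M"],
    race = ["A", "A", "B", "B"],
)

collapsed = combine(
    groupby(df, :id),
    :test_score => mean => :avg_score,                 # explicit output name
    :wins_scholarship => maximum,                      # autogenerated name: wins_scholarship_maximum
    [:gender, :race] .=> first .=> [:gender, :race],   # keep the original column names
)
```

The last argument broadcasts over both the input and output names, so each column is collapsed with first and keeps its own name instead of getting a _first suffix.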

Awesome - thank you!

So I’m getting this error when I try your code on my data:

MethodError: objects of type SubArray{Union{Missing, Float64},1,Array{Union{Missing, Float64},1},Tuple{Array{Int64,1}},false} are not callable
Use square brackets for indexing an Array.

I assume this is about the type of the underlying data? Another possibility is that you can't apply the same function to a group of columns listed in square brackets, as you did with first in your example. Or maybe you have to be careful about grouping columns with different data types?

I’m guessing you might have missed the .=> (note the period).

This broadcasts first to each column:

[:gender,:race] .=> first

This tries to call first(df.gender, df.race), i.e. it passes both columns to a single call:

[:gender,:race] => first
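To see the difference without any data, you can inspect what each expression builds. This is a small sketch in plain Julia; the variable names per_column and together are only illustrative.

```julia
# With the dot, `=>` is broadcast over the vector: one pair per column.
per_column = [:gender, :race] .=> first
@show per_column == [:gender => first, :race => first]

# Without the dot, there is a single pair whose left-hand side is the whole vector,
# so the function on the right receives all listed columns as separate arguments.
together = [:gender, :race] => first
@show together isa Pair
@show together.first
```

The dotless form is what you want when the function really takes several columns at once, for example [:a, :b] => ((a, b) -> sum(a .* b)) with hypothetical columns :a and :b.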

Oh yes - I’m dumb. That was it. Thanks so much!

Hi everyone, I am also trying to aggregate. It works with sum and maximum, but not with mean. I am pasting the code that works, but if I change a parameter to, for example, :Faltas => mean∘skipmissing, it says that mean isn’t found.

Thanks

Found my answer: I hadn't loaded the Statistics standard library (using Statistics), which is where mean is defined.
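In case it helps someone else, a minimal sketch of the working version. Only :Faltas comes from my data; the :grupo column and the values are made up for illustration.

```julia
using DataFrames
using Statistics  # provides `mean`; sum and maximum live in Base, which is why they worked without it

# Hypothetical data with a missing value in the aggregated column.
df = DataFrame(grupo = ["a", "a", "b", "b"], Faltas = [1, missing, 3, 5])

# skipmissing drops the missing entries in each group before mean is applied.
result = combine(groupby(df, :grupo), :Faltas => mean∘skipmissing => :Faltas_mean)
```

The explicit => :Faltas_mean at the end just keeps the output column name short.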