Dataframes: Split combined result to different columns

Hello everyone,

I’m pretty new to Julia and struggle with a data wrangling problem. Say, I’ve got the following dataframe:

using DataFrames
df = DataFrame(ID = [1,1,1,2,2,2,3,3],
               OP = [:X,:X,:Y,:X,:Y,:Y,:X,:Y],
               COUNT = [5, 10, 2, 7, 2, 0, 1, 2])

What I want to get is for each ID the count according to OP X or Y. What I can do is to use groupby and combine to aggregate the data:

gdf = groupby(df, [:ID, :OP])
result_df = combine(gdf, :COUNT => sum)

This creates a dataframe, in which for each ID I get two rows - row 1 contains the count for X and row 2 the count for Y. Is there an easy way to get the counts for X and Y in two different columns? Instead of result_df created in the last code chunk I’d like to get the following dataframe.

optimal_DF = DataFrame(ID = [1,2,3],
                       X_COUNT = [15,7,1],
                       Y_COUNT = [2,2,2])

If there is an easy and idiomatic way to do this, I’d like to hear about it.

Regards,

Thomas

I think you want unstack. See the following

julia> @chain df begin 
           groupby([:ID, :OP])
           @combine :count_sum = sum(:COUNT)
           unstack(:ID, :OP, :count_sum)
       end
3×3 DataFrame
 Row │ ID     X       Y      
     │ Int64  Int64?  Int64? 
─────┼───────────────────────
   1 │     1      15       2
   2 │     2       7       2
   3 │     3       1       2

You could do

julia> combine(groupby(df, :ID),
           [:OP, :COUNT] => ((op, c) -> sum((op .== :X).*c)) => :X_COUNT,
           [:OP, :COUNT] => ((op, c) -> sum((op .== :Y).*c)) => :Y_COUNT)
3×3 DataFrame
 Row │ ID     X_COUNT  Y_COUNT 
     │ Int64  Int64    Int64   
─────┼─────────────────────────
   1 │     1       15        2
   2 │     2        7        2
   3 │     3        1        2

The alternative is to unstack after combine.

EDIT: Just saw Peter’s answer, that’s the unstack version.

Hi,

thanks to both of your for the quick answers - it works like a charm!

Regards,

Thomas

1 Like