Combining elements from multiple rows by conditionals into columns with DataFramesMeta

Hello,
I want to create a new dataframe with a column computed from one or more columns of the initial dataframe.

using DataFramesMeta
df1 = DataFrame(category = [1,0,1,0],  group=[1,1,0,0], measure = [401, 402, 200, 300]);
df2 = @somemagic :group == 1  :result = :measure_false - :measure_true 
## I want this output:
# df2 =  DataFrame(category = [1,0], difference = [201, 102]);

In the example I want to compute a function of :measure conditional to the value in :group, this has to happen separately for all the elements that belong to category.

The warranty is only that there are 2 elements for category, or as many elements that can be passed to the func.

I can achieve the same by creating separated dataframes:

func(x,y) = x-y

sort!(df, :category)
aa = @rsubset df :group == 1
bb = @rsubset df :group !== 1
zz = func.(aa.measure, bb.measure)

But I would like something in the macro style

Thanks for your help

func(category0, category1) = category1 - category0

select(
  innerjoin(groupby(df1, :group, sort=true)..., on = :category, makeunique=true), 
  :category, r"measure" => func => [:difference]
)

gives:

2Γ—2 DataFrame
 Row β”‚ category  difference 
     β”‚ Int64     Int64      
─────┼──────────────────────
   1 β”‚        1         201
   2 β”‚        0         102

This also generalizes to more than two categories with appropriate func.

This option doesn’t use macros (others more familiar with DataFrameMeta might help with this).

Another option with unstack:

select(
  unstack(df1, :group, :measure, renamecols=x->Symbol(:_, x)), 
  :category, [:_0, :_1] => func => [:difference]
)

(with the same result)

2 Likes

I didn’t understand the role of the :group variable and if it comes into play.
As far as I understand, I would do this.


gdf=groupby(df1,:category)
combine(gdf,:measure=>x->-(x...))

#or


combine(gdf,:measure=>Base.splat(-)=>:res)

Quickly sifting through the help of the DataframesMeta pkg I find this macro that seems provide the same result.


func(x,y) = x-y
@by df1 :category begin
    :res=func(:measure...)
end

Perhaps, I found a role for :group to play :grin:


df1[df1.group.==0,:group].=-1
using LinearAlgebra
@by df1 :category begin
    :res=dot(:measure, :group)
end

Hello!
Thanks for your advise. In the specific case here, it works!

But in general, I would like to do operations (as in, choose the arguments in func) based on the value in group. In this case, it is only the order (group==0 is subtracted to group==1) , but I could have several values for group field.

The solution proposed by @Dan accounts for this. But it does not use the DataFramesMacro.

If you provide a minimal but more complete and general example, someone can try to provide you with the solution in the form you ask.
What specifically does it mean that group can have many values?
How would these values then be combined (the β€˜-’ function is neither associative nor commutative)?

Hello,

Thanks! This is the type of operation I was looking for.

It took me some time to understand what the two nested functions do, I will explain it here so to facilitate the future readers.

If someone knows how to do the same operations with DataFramesMeta it d be great.

Alessio

1. groupby + innerjoin

func(category0, category1) = category1 - category0

select(
  innerjoin(groupby(df1, :group, sort=true)..., on = :category, makeunique=true), 
  :category, r"measure" => func => [:difference]
)

The groupby function creates a list of dataframes, each for a value of :group.
innerjoin stack the newly created dataframes using :category as unique id. The makeunique argument is pivotal, and it works like this:

  • makeunique : if false (the default), an error will be raised if duplicate names are found in columns not joined on; if true, duplicate names will be suffixed with _i (i starting at 1 for the first duplicate).

The resulting dataframe has columns:
[:category, :group, :group_1, :measure, :measure_1]

The select function uses a regular expression to parse the columns that start with β€œmeasure”.

The order of the stacked data frames is used to pass the columns in order. In this case, it is easy because it is just two :group, but how to have more control on it?

2. unstack

select(
  unstack(df1, :group, :measure, renamecols=x->Symbol(:_, x)), 
  :category, [:_0, :_1] => func => [:difference]
)

The problem of arranging columns based on their name is easier solved with unstack.

unstack creates new columns based on the values in :group in the initial dataframe. I suggest reading the documentation because the positional arguments are not trivial.

The resulting dataframe has columns:
[:category, :_0, :_1]

Where _0 and _1 are derived by the values in :group. This is convenient because the following select function can leverage the named columns:

select(unstacked_df, :category, [:_0, :_1] => func => [:difference]

1 Like

Mate, the minimal example was clear enough. The solution proposed by @Dan was accurate in this regard.

In this case :group can have 0 or 1. If :group was 1,2,3,...10. Then I would like control on the columns used in the select function.

Surely I can order them and be consistent, but it is not a robust solution.
The solution with unstack (see my answer) works well and allows to run an arbitrary complex function based on :group, as in

select(:category, [:_1, :_3, :_1, :_2] => func => [:result])

Maybe now I understand better.
Not quite what you’re looking for, but something that goes a little way in that direction.

I don’t know if it is possible to reproduce with the dataframesmeta macros (I don’t know the package) the expansion into many columns of a namedtuple made by the AsTable function.

julia> func(x,y)=x-y
func (generic function with 1 method)

julia> sdf=@chain df begin
           @by :category  begin
               :res=[(;zip(Symbol.("g_",:group),:measure)...)]
           end
           select(:category,:res=>AsTable)
           @transform :diff= func(:g_1, :g_0)
       end
2Γ—4 DataFrame
 Row β”‚ category  g_1    g_0    diff  
     β”‚ Int64     Int64  Int64  Int64
─────┼───────────────────────────────
   1 β”‚        0    402    300    102
   2 β”‚        1    401    200    201
1 Like

the use of the indexin() function allows you to associate the measures in an orderly manner with the names of the selected groups, on which to apply a generic function.

julia> df = DataFrame(category = [1,0,1,0,1,0],  group=[1,1,0,0,-9,-9], measure = [401, 402, 200, 300,-100,-200])
6Γ—3 DataFrame
 Row β”‚ category  group  measure 
     β”‚ Int64     Int64  Int64
─────┼──────────────────────────
   1 β”‚        1      1      401
   2 β”‚        0      1      402
   3 β”‚        1      0      200
   4 β”‚        0      0      300
   5 β”‚        1     -9     -100
   6 β”‚        0     -9     -200

julia> @chain df begin
       @by :category begin
           :g= [:group]
           :m= [:measure]
           @kwarg ungroup = false
       end
       @transform :diff_1_9= func(:m[1][[indexin([1,-9],:g[1])]...]...)
       end
2Γ—4 DataFrame
 Row β”‚ category  g           m                 diff_1_9 
     β”‚ Int64     SubArray…   SubArray…         Int64
─────┼──────────────────────────────────────────────────
   1 β”‚        0  [1, 0, -9]  [402, 300, -200]       602
   2 β”‚        1  [1, 0, -9]  [401, 200, -100]       501

julia> @chain df begin
       @by :category begin
           :g= [:group]
           :m= [:measure]
           @kwarg ungroup = false
       end
       @select begin
           :category
            :diff_1_9= func(:m[1][[indexin([1,-9],:g[1])]...]...)
       end
       end
2Γ—2 DataFrame
 Row β”‚ category  diff_1_9 
     β”‚ Int64     Int64
─────┼────────────────────
   1 β”‚        0       602
   2 β”‚        1       501
1 Like

Maybe this works?

julia> @chain df begin
           groupby(:category)
           @combine begin
               :x = first(:measure[:group .== 1]) - first(:measure[:group .== 0])
           end
       end
2Γ—2 DataFrame
 Row β”‚ category  x     
     β”‚ Int64     Int64 
─────┼─────────────────
   1 β”‚        0    102
   2 β”‚        1    201
1 Like

This is precisely what I was looking for.

Thanks.