Trouble translating some dplyr code to DataFramesMeta

In the dplyr version you have all your income and profit stuff in a single mutate command, applied to a grouped data frame. In your DataFramesMeta.jl version, they are separate commands.

This is important, as @transform(grouped_df, ...) will return a non-grouped data frame, similar to the keyword argument .groups = "drop" in dplyr.

Use the keyword argument ungroup = false in @transform to keep the result grouped, so that the next @transform still operates per group:

julia> df = DataFrame(order = [1, 1, 1, 2, 2, 2], income = [10, 20, 30, 50, 60, 70]);

julia> @chain df begin
           groupby(:order)
           @transform(:total_income = sum(:income); ungroup = false)
           @transform(:income_frac = :income ./ sum(:income))
       end
6×4 DataFrame
 Row │ order  income  total_income  income_frac 
     │ Int64  Int64   Int64         Float64     
─────┼──────────────────────────────────────────
   1 │     1      10            60     0.166667
   2 │     1      20            60     0.333333
   3 │     1      30            60     0.5
   4 │     2      50           180     0.277778
   5 │     2      60           180     0.333333
   6 │     2      70           180     0.388889
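
If you want to check what each call returns, here is a minimal sketch. It assumes DataFrames.jl and DataFramesMeta.jl are loaded; the names gdf, res_default and res_grouped are just for illustration.

```julia
using DataFrames, DataFramesMeta

df = DataFrame(order = [1, 1, 1, 2, 2, 2], income = [10, 20, 30, 50, 60, 70])
gdf = groupby(df, :order)

# Default behaviour: the grouping is dropped and a plain DataFrame comes back
res_default = @transform(gdf, :total_income = sum(:income))
println(typeof(res_default) <: DataFrame)

# With ungroup = false the result is still a GroupedDataFrame,
# so a following @transform keeps working group by group
res_grouped = @transform(gdf, :total_income = sum(:income); ungroup = false)
println(typeof(res_grouped) <: GroupedDataFrame)
```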

As I’m sure you noticed when writing the code, in DataFramesMeta.jl you can’t use a column you just created within the same @transform call. As an alternative, you can use the @astable macro-flag in DataFramesMeta.jl, which lets you create many new columns in the same scope.

However, there is a downside to this. In the following mutate call from dplyr:

  mutate(total_income = sum(wholesale_income),
         profit = total - total_income,
         profit_count = sum(profit > 0))

the total_income = sum(wholesale_income) is a scalar that is “spread” across all rows of the group.

In DataFramesMeta.jl, this kind of “spreading” isn’t allowed when multiple columns are returned from an @astable macro-flag. (Really, this limitation is in DataFrames.jl.) So inside an @astable block you can’t return a scalar and a vector together. You would have to do something like this:

@transform df @astable begin
    s = sum(:income)
    :total_income = fill(s, length(:income))
    :income_frac = :income ./ :total_income
end
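
The block above runs on the whole data frame. Applied per group, the same pattern might look like the sketch below. It reuses the df from the earlier example, and gdf and res are illustrative names, not anything from your code.

```julia
using DataFrames, DataFramesMeta

df = DataFrame(order = [1, 1, 1, 2, 2, 2], income = [10, 20, 30, 50, 60, 70])
gdf = groupby(df, :order)

res = @transform gdf @astable begin
    # s is an ordinary local variable, it does not become a column
    s = sum(:income)
    # expand the scalar by hand so every returned column is a vector
    :total_income = fill(s, length(:income))
    # :total_income here refers to the column created on the line above
    :income_frac = :income ./ :total_income
end
```

Inside each group, length(:income) is the number of rows in that group, so the filled vector has the right length.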

So your best bet is to use the ungroup = false keyword argument. Maybe there is something DataFramesMeta.jl can do to make things simpler.
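
For the mutate call quoted above, a chained version could look roughly like this. I don't have your data, so the values below are made up and only the column names (wholesale_income, total) are taken from your dplyr snippet; I'm also assuming total is a regular per-row column.

```julia
using DataFrames, DataFramesMeta

# Hypothetical data, only the column names come from the dplyr code
df = DataFrame(order = [1, 1, 2, 2],
               wholesale_income = [10, 20, 30, 40],
               total = [40, 25, 90, 60])

result = @chain df begin
    groupby(:order)
    # keep the grouping so the later steps are still computed per group
    @transform(:total_income = sum(:wholesale_income); ungroup = false)
    @transform(:profit = :total .- :total_income; ungroup = false)
    # last step: let the grouping drop and get a plain DataFrame back
    @transform(:profit_count = sum(:profit .> 0))
end
```

Each new column gets its own @transform, so a later step can use the column from the step before it.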
