Trouble translating some dplyr code to DataFramesMeta

I’m quite new to DataFrames and tried to translate some of my R code to Julia.
Here is my R code:

df |>
  left_join(df2, by="customer") |>
  left_join(df3, by="order") |>
  left_join(df4, by="item") |>
  group_by(order) |>
  mutate(total_income = sum(wholesale_income),
         profit = total - total_income,
         profit_count = sum(profit > 0)) |>
  arrange(profit_count) |>
  first()

This code merges some data frames, then creates some new variables and outputs the first row.

This is what I tried in Julia:

@chain df begin
    leftjoin(df2, on=:customer, matchmissing=:equal)
    leftjoin(df3, on=:order, matchmissing=:equal)
    leftjoin(df4, on=:item, matchmissing=:equal)
    groupby(:order)
    @transform(:total_income = sum(skipmissing(:wholesale_income)))
    @transform(:profit = :total .- :total_income)
    @transform(:profit_count = sum(skipmissing(:profit) .> 0))
    @orderby(:profit_count)
    first
end

It looks pretty similar, but the results are very different, especially the profit counts.
This may be because of the missing values. It is still a mystery to me how these are treated in Julia; in R this seems to happen automatically.


If you’re more comfortable working with R, you might find TidierData.jl useful (TidierOrg/TidierData.jl on GitHub). It is a 100% Julia implementation of the dplyr and tidyr R packages.

In the dplyr version you have all your income and profit stuff in a single mutate command, applied to a grouped data frame. In your DataFramesMeta.jl version, they are separate commands.

This is important, because @transform(grouped_df, ...) returns a non-grouped data frame by default, similar to the keyword argument .groups = "drop" in dplyr’s summarise().

To keep the grouping, pass the keyword argument ungroup = false to @transform:

julia> df = DataFrame(order = [1, 1, 1, 2, 2, 2], income = [10, 20, 30, 50, 60, 70]);

julia> @chain df begin
           groupby(:order)
           @transform(:total_income = sum(:income); ungroup = false)
           @transform(:income_frac = :income ./ sum(:income))
       end
6×4 DataFrame
 Row │ order  income  total_income  income_frac 
     │ Int64  Int64   Int64         Float64     
─────┼──────────────────────────────────────────
   1 │     1      10            60     0.166667
   2 │     1      20            60     0.333333
   3 │     1      30            60     0.5
   4 │     2      50           180     0.277778
   5 │     2      60           180     0.333333
   6 │     2      70           180     0.388889
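To see the difference the keyword makes, here is a small sketch that reuses the toy df from above. The variable names gdf, dropped, kept and whole are made up for illustration. It only checks what type comes back, and what a follow-up sum sees once the grouping is gone.

```julia
using DataFrames, DataFramesMeta

df = DataFrame(order = [1, 1, 1, 2, 2, 2], income = [10, 20, 30, 50, 60, 70])
gdf = groupby(df, :order)

# Default: the per-group sum is computed, but the result is a plain DataFrame.
dropped = @transform(gdf, :total_income = sum(:income))

# With ungroup = false the result is still a GroupedDataFrame.
kept = @transform(gdf, :total_income = sum(:income); ungroup = false)

dropped isa DataFrame        # true
kept isa GroupedDataFrame    # true

# A second @transform on the ungrouped result sums over all six rows,
# so the fractions are relative to the overall income, not the order's income.
whole = @transform(dropped, :income_frac = :income ./ sum(:income))
```

This is what happens in the chain from the question: after the first @transform, every later sum runs over the whole data frame rather than per order.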

As I’m sure you noticed when writing the code, in DataFramesMeta.jl, you can’t use columns you just created in the same block. As an alternative, you can use the @astable macro-flag in DataFramesMeta.jl, which allows you to create many new columns in the same scope.

However, there is a downside to this. In the following mutate call from dplyr

  mutate(total_income = sum(wholesale_income),
         profit = total - total_income,
         profit_count = sum(profit > 0))

the total_income = sum(wholesale_income) is a scalar that is “spread” across all values in the data frame.

In DataFramesMeta.jl, this kind of “spreading” isn’t allowed when multiple columns are returned from an @astable macro-flag. (Really, this limitation is in DataFrames.jl.) So inside an @astable block you can’t return both a scalar and a vector. You would have to do something like this:

@transform df @astable begin
    s = sum(:income)
    :total_income = fill(s, length(:income))
    :income_frac = :income ./ :total_income
end
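Applied to the columns from the question, the same pattern might look like the sketch below. The data is made up for illustration and contains no missing values, and the names gdf and result are my own. Every scalar is expanded by hand with fill so that all returned columns have the group's length.

```julia
using DataFrames, DataFramesMeta

# Toy data, invented for this example; column names follow the question.
df = DataFrame(
    order = [1, 1, 2, 2],
    total = [100, 100, 50, 50],
    wholesale_income = [30, 40, 20, 40],
)

gdf = groupby(df, :order)

result = @transform gdf @astable begin
    n = length(:wholesale_income)
    # Scalars must be expanded manually inside @astable.
    :total_income = fill(sum(:wholesale_income), n)
    # Columns created earlier in the block can be reused here.
    :profit = :total .- :total_income
    :profit_count = fill(sum(:profit .> 0), n)
end
```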

So your best bet is to use the ungroup = false keyword argument. Maybe there is something DataFramesMeta.jl can do to make things simpler.
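For the chain in the question, that suggestion could be sketched as follows. The data is a made-up toy example without missing values, so the skipmissing calls are left out here; with real data you would decide separately how to treat missings. Only the ungroup = false arguments differ from the original attempt.

```julia
using DataFrames, DataFramesMeta

# Toy data, invented for this example; column names follow the question.
df = DataFrame(
    order = [1, 1, 2, 2],
    total = [100, 100, 50, 50],
    wholesale_income = [30, 40, 20, 40],
)

result = @chain df begin
    groupby(:order)
    # Keep the grouping so the following steps are still computed per order.
    @transform(:total_income = sum(:wholesale_income); ungroup = false)
    @transform(:profit = :total .- :total_income; ungroup = false)
    # Last grouped step: the default ungroup = true returns a plain DataFrame.
    @transform(:profit_count = sum(:profit .> 0))
    @orderby(:profit_count)
    first
end
```

Here order 1 has a profit of 30 on both rows and order 2 has a profit of -10, so the first row after sorting belongs to order 2 with a profit_count of 0.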
