Trouble translating some dplyr code to DataFramesMeta

wasser_m · December 13, 2023, 9:39am

I’m quite new to DataFrames and tried to translate some of my R code to Julia.
Here is my R code:

df |>
  left_join(df2, by="customer") |>
  left_join(df3, by="order") |>
  left_join(df4, by="item") |>
  group_by(order) |>
  mutate(total_income = sum(wholesale_income),
         profit = total - total_income,
         profit_count = sum(profit > 0)) |>
  arrange(profit_count) |>
  first()

This code merges some data frames, then creates some new variables and outputs the first row.

This is what I tried in Julia:

@chain df begin
    leftjoin(df2, on=:customer, matchmissing=:equal)
    leftjoin(df3, on=:order, matchmissing=:equal)
    leftjoin(df4, on=:item, matchmissing=:equal)
    groupby(:order)
    @transform(:total_income = sum(skipmissing(:wholesale_income)))
    @transform(:profit = :total .- :total_income)
    @transform(:profit_count = sum(skipmissing(:profit) .> 0))
    @orderby(:profit_count)
    first
end

It looks pretty similar but the results are very different, especially the profit counts.
This may be because of the missing values. It is still a mystery to me how these are treated in Julia; in R this seems to happen automatically.

alfaromartino · December 13, 2023, 11:46am

If you’re more comfortable working with R, you might find TidierData.jl useful (GitHub: TidierOrg/TidierData.jl). It is a 100% Julia implementation of the dplyr and tidyr R packages.

pdeffebach · December 13, 2023, 5:28pm

In the dplyr version you have all your income and profit stuff in a single mutate command, applied to a grouped data frame. In your DataFramesMeta.jl version, they are separate commands.

This is important, as @transform(grouped_df, ...) returns a non-grouped data frame by default, similar to the keyword argument .groups = "drop" in dplyr's summarise.

Use the keyword argument ungroup = false in @transform to keep the grouping:

julia> df = DataFrame(order = [1, 1, 1, 2, 2, 2], income = [10, 20, 30, 50, 60, 70]);

julia> @chain df begin
           groupby(:order)
           @transform(:total_income = sum(:income); ungroup = false)
           @transform(:income_frac = :income ./ sum(:income))
       end
6×4 DataFrame
 Row │ order  income  total_income  income_frac 
     │ Int64  Int64   Int64         Float64     
─────┼──────────────────────────────────────────
   1 │     1      10            60     0.166667
   2 │     1      20            60     0.333333
   3 │     1      30            60     0.5
   4 │     2      50           180     0.277778
   5 │     2      60           180     0.333333
   6 │     2      70           180     0.388889
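
Applied to the chain from the question, this could look roughly like the sketch below. The `joined` data frame and its values are made up for illustration and stand in for the result of the three joins; only the column names come from the original post. Every grouped step except the last passes ungroup = false, so profit_count is computed per order rather than over the whole table.

```julia
using DataFrames, DataFramesMeta  # DataFramesMeta re-exports @chain

# Hypothetical toy data standing in for the joined tables in the question.
joined = DataFrame(
    order = [1, 1, 2, 2],
    total = [100, 100, 80, 80],
    wholesale_income = [30, missing, 50, 40],
)

result = @chain joined begin
    groupby(:order)
    # Keep the grouping so the next steps still operate per order.
    @transform(:total_income = sum(skipmissing(:wholesale_income)); ungroup = false)
    @transform(:profit = :total .- :total_income; ungroup = false)
    # Last grouped step: the default ungroup = true returns a plain DataFrame.
    @transform(:profit_count = sum(:profit .> 0))
    @orderby(:profit_count)
    first
end
```

With this toy data, order 2 has no positive profit, so it sorts first. Without ungroup = false, the profit_count step would run on the ungrouped table and count positive profits across all orders, which may explain the different profit counts.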

As I’m sure you noticed when writing the code, in DataFramesMeta.jl, you can’t use columns you just created in the same block. As an alternative, you can use the @astable macro-flag in DataFramesMeta.jl, which allows you to create many new columns in the same scope.

However, there is a downside to this. In the following mutate call from dplyr

  mutate(total_income = sum(wholesale_income),
         profit = total - total_income,
         profit_count = sum(profit > 0))

total_income = sum(wholesale_income) yields a scalar that is “spread” across all rows of each group.

In DataFramesMeta.jl, this kind of “spread-ing” isn’t allowed when multiple columns are returned from an @astable macro-flag. (Really, this limitation is in DataFrames.jl.) So inside an @astable block you can’t return a scalar and a vector. You would have to do something like this:

@transform df @astable begin
    s = sum(:income)
    :total_income = fill(s, length(:income))
    :income_frac = :income ./ :total_income
end
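
For grouped data, the same pattern might be sketched as follows, reusing the small df from the earlier example. The names gdf, out, and s are only for illustration. The block runs once per group, so fill has to use the group's length.

```julia
using DataFrames, DataFramesMeta

df = DataFrame(order = [1, 1, 1, 2, 2, 2], income = [10, 20, 30, 50, 60, 70])

gdf = groupby(df, :order)
out = @transform gdf @astable begin
    s = sum(:income)                          # per-group scalar, not a column
    :total_income = fill(s, length(:income))  # spread manually within the group
    :income_frac = :income ./ s
end
```

This should give the same total_income and income_frac columns as the ungroup = false version above, at the cost of the explicit fill.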

So your best bet is to use the ungroup = false keyword argument. Maybe there is something DataFramesMeta.jl can do to make things simpler.
