Simple transformation of a dataframe with grouped data

I am trying to transpose this R script to Julia without success:

library(dplyr)

culture <- tibble(
  ind = c(1, 1, 1, 2, 2, 2),
  year = c(2015, 2015, 2015, 2016, 2016, 2016),
  value = c(10, 20, 30, 26, 34, 67)
)

culture_w <- culture %>%
  group_by(year) %>%
  mutate(total = sum(value),
         w = value / total)

I tried the following solution (and many others) but it gives me an error saying that :total is not in the dataframe.

using DataFrames, DataFramesMeta

culture = DataFrame(
    ind = [1, 1, 1, 2, 2, 2],
    year = [2015, 2015, 2015, 2016, 2016, 2016],
    value = [10, 20, 30, 26, 34, 67]
)
culture_w = @by culture [:year, :ind] begin
    :total = sum(:value)
    :w = :value ./ :total
end

Yes, unlike dplyr, DataFramesMeta.jl does not allow using previously-created variables in the same call.

You would think the solution is to use the @astable macro-flag, which lets you create multiple intermediate variables and return a NamedTuple. But this is not the solution

julia> @by culture [:year, :ind] begin
           @astable begin
               :total = sum(:value)
               :w = :value ./ :total
           end
       end
ERROR: ArgumentError: mixing single values and vectors in a named tuple is not allowed

This is because :total is a scalar while :w is a Vector. When returning a NamedTuple you can’t mix scalars and vectors. See some discussion here.

I think what you want is a @chain with multiple @transform calls, the first one takes in a GroupedDataFrame but preserves the number of rows in the data frame (unlike dplyr, the output drops the grouping by default, which I think is very convenient).

julia> @chain culture begin
           @groupby [:year, :ind]
           @transform :total = sum(:value) # This gets "spread"
           @rtransform :w = :value / :total
       end
6Γ—5 DataFrame
 Row β”‚ ind    year   value  total  w        
     β”‚ Int64  Int64  Int64  Int64  Float64  
─────┼──────────────────────────────────────
   1 β”‚     1   2015     10     60  0.166667
   2 β”‚     1   2015     20     60  0.333333
   3 β”‚     1   2015     30     60  0.5
   4 β”‚     2   2016     26    127  0.204724
   5 β”‚     2   2016     34    127  0.267717
   6 β”‚     2   2016     67    127  0.527559

It’s interesting to know why this does not work in Tidier.jl:

using DataFrames, Chain, Tidier

culture = DataFrame(
    ind = [1, 1, 1, 2, 2, 2],
    year = [2015, 2015, 2015, 2016, 2016, 2016],
    value = [10, 20, 30, 26, 34, 67]
)

culture_w = @chain culture begin
           @group_by(year)
           @mutate(total = sum(value), w = value / total)
       end

You mean it does not work in Tidier.jl either? That’s expected. Both DataFramesMeta.jl and Tidier.jl just call DataFrames.transform(df, ...) and the restriction from not re-using columns created previously is in DataFrames.jl

It’s also relevant to know why this doesn’t work. DataFrames.jl parallelizes code when transform() supplies multiple transformations on grouped data frames.

See for details: Functions Β· DataFrames.jl

So while it’s nice syntactic sugar in R to be able to reuse variables in the same mutate() call in R, Julia takes the stance that it’s better to have code run faster. It’s easy enough to chain together separate @mutate() macros if you really want to add multiple columns sequentially, like this:

culture_w = @chain culture begin
    @group_by(year)
    @mutate(total = sum(value))
    @mutate(w = value / total)
end