Simple transformation of a dataframe with grouped data

I am trying to transpose this R script to Julia without success:

library(dplyr)

culture <- tibble(
  ind = c(1, 1, 1, 2, 2, 2),
  year = c(2015, 2015, 2015, 2016, 2016, 2016),
  value = c(10, 20, 30, 26, 34, 67)
)

culture_w <- culture %>%
  group_by(year) %>%
  mutate(total = sum(value),
         w = value / total)

I tried the following solution (and many others) but it gives me an error saying that :total is not in the dataframe.

using DataFrames, DataFramesMeta

culture = DataFrame(
    ind = [1, 1, 1, 2, 2, 2],
    year = [2015, 2015, 2015, 2016, 2016, 2016],
    value = [10, 20, 30, 26, 34, 67]
)
culture_w = @by culture [:year, :ind] begin
    :total = sum(:value)
    :w = :value ./ :total
end

Yes, unlike dplyr, DataFramesMeta.jl does not allow using previously-created variables in the same call.

You would think the solution is to use the @astable macro-flag, which lets you create multiple intermediate variables and return a NamedTuple. But this is not the solution

julia> @by culture [:year, :ind] begin
           @astable begin
               :total = sum(:value)
               :w = :value ./ :total
           end
       end
ERROR: ArgumentError: mixing single values and vectors in a named tuple is not allowed

This is because :total is a scalar while :w is a Vector. When returning a NamedTuple you can’t mix scalars and vectors. See some discussion here.

I think what you want is a @chain with multiple @transform calls, the first one takes in a GroupedDataFrame but preserves the number of rows in the data frame (unlike dplyr, the output drops the grouping by default, which I think is very convenient).

julia> @chain culture begin
           @groupby [:year, :ind]
           @transform :total = sum(:value) # This gets "spread"
           @rtransform :w = :value / :total
       end
6Γ—5 DataFrame
 Row β”‚ ind    year   value  total  w        
     β”‚ Int64  Int64  Int64  Int64  Float64  
─────┼──────────────────────────────────────
   1 β”‚     1   2015     10     60  0.166667
   2 β”‚     1   2015     20     60  0.333333
   3 β”‚     1   2015     30     60  0.5
   4 β”‚     2   2016     26    127  0.204724
   5 β”‚     2   2016     34    127  0.267717
   6 β”‚     2   2016     67    127  0.527559
2 Likes

It’s interesting to know why this does not work in Tidier.jl:

using DataFrames, Chain, Tidier

culture = DataFrame(
    ind = [1, 1, 1, 2, 2, 2],
    year = [2015, 2015, 2015, 2016, 2016, 2016],
    value = [10, 20, 30, 26, 34, 67]
)

culture_w = @chain culture begin
           @group_by(year)
           @mutate(total = sum(value), w = value / total)
       end
2 Likes

You mean it does not work in Tidier.jl either? That’s expected. Both DataFramesMeta.jl and Tidier.jl just call DataFrames.transform(df, ...) and the restriction from not re-using columns created previously is in DataFrames.jl

3 Likes

It’s also relevant to know why this doesn’t work. DataFrames.jl parallelizes code when transform() supplies multiple transformations on grouped data frames.

See for details: Functions Β· DataFrames.jl

So while it’s nice syntactic sugar in R to be able to reuse variables in the same mutate() call in R, Julia takes the stance that it’s better to have code run faster. It’s easy enough to chain together separate @mutate() macros if you really want to add multiple columns sequentially, like this:

culture_w = @chain culture begin
    @group_by(year)
    @mutate(total = sum(value))
    @mutate(w = value / total)
end
4 Likes