With DataFrames, best practice for applying function across columns, where we also need to reference, in a second argument, the same column for each function call?

This is somewhat difficult to put into words, however, I have two examples of this type of operation and I am unsure how best to tackle it in Julia.

  1. Take a weighted sum of a column within a grouping variable; repeat this operation for many columns. Here is a snippet of code I have that does this, but I am wondering if there is a cleaner way:
nms = setdiff(names(tmpdf), ["probx","type"]);
tmpdf = transform(tmpdf, nms.=> (x-> x .* tmpdf.probx) .=> nms)
tmpdf = groupby(tmpdf, :type)
tmpdf = combine(tmpdf, [:enr, :cont, :graduate, :probx] .=> sum)
  1. Take a weighted mean, within group for a column; repeat for many columns. Here is a snippet of code I am trying, but it has an error in finding probx
tmp3df = groupby(tmp3df, [:type])
nms3 = setdiff(names(tmp3df), ["probx","type"]);
tmp3df = transform(tmp3df, nms3 .=> (x -> mean(x,weights = tmp3df.probx)) .=> nms3)

It seems there is an awkwardness in referencing the same column for an operation across multiple other columns when using DataFrames. At a minimum I’d like to get #2. bug free, but hopefully also someone more experienced has a good approach for these types of operations? They show up constantly in my work.

Here is a minimal, complete example for what I think you are trying to accomplish:

using DataFrames, Statistics
df = DataFrame(x = repeat(1 : 3; outer = 4), y = repeat(22.0 :23.0; outer = 6), z = 1.0:12.0, wt = 2.0:13.0);
gdf = groupby(df, :x);
nms = [:y, :z];
nmPairs = [[nm, :wt] for nm in nms];
wt_mean(x,y) = mean(x, FrequencyWeights(y));
combine(gdf, nmPairs .=> wt_mean .=> nms)
3Γ—3 DataFrame
 Row β”‚ x      y        z
     β”‚ Int64  Float64  Float64
─────┼─────────────────────────
   1 β”‚     1  22.6154  7.23077
   2 β”‚     2  22.4     8.0
   3 β”‚     3  22.5882  8.82353

I replaced mean(x; weights) b/c that method seems undefined (at least in Statistics.jl).

3 Likes

Thanks, I was struggling with how to get that nmParis idea to work. I can adjust this now for all my purposes. This seems like the natural solution for sure!

Julia is still so awkward to me coming from R and more traditional languages like fortran, I feel like a waif

1 Like

The DataFrames minilanguage also allows the use of a vector of col=>fun=>names

wt_mean(x,y) = x'*y/sum(y);
combine(gdf, [[c,:wt]=>wt_mean=>string(c) for c in [:y,:z]])

PS
I had to redefine the weighted average because my system does not have the FrequncyWeigth function.

2 Likes

Woah, this is awesome as well, I like it. Another tool in the toolbox for me, thanks.

Another way, a bit forced to tell the truth, where the difficulty of managing the many columns is moved to the fun of the triple col=>fun=>ncol

wt_m(mx)=[c'*mx[end]/sum(mx[end]) for c in mx[1:end-1]]'
combine(gdf, [:y,:z,:wt] =>((x...)->wt_m(x))=>[:y_m,:z_m])
combine(gdf, Cols(Not(:x)) =>((x...)->wt_m(x))=>[:y_m,:z_m])

Made me curious as a useR, how would you have done it in R? Maybe someone here could show an intuitive Julia translation of that. (As said DataFrames.jl has its own kind of minilanguage so it’s another layer of difference on top of Julia itself.)

1 Like

E.g. in R you could do quite easily:

library(dplyr)
data(mtcars)
mtcars$weight = runif(nrow(mtcars))
mtcars %>%
  group_by(cyl) %>% 
  summarize(across(-weight, ~weighted.mean(., weight)))

In my opinion, the first suggestion by Hendri best translates this type of construction into Julia in a way that respects how Julia/DataFrames is structured

And there is a nice dplyr β†’ Julia translation page, but it lacks the more complicated examples like this one. At the end of the day it can’t be exhaustive and should be focused on the basic operations so I’m not arguing it should have been included or anything. Comparison with Python/R/Stata Β· DataFrames.jl

3 Likes

Im going to ping @kdpsingh for the weighted mean part, because I cant get that atm, but TidierData.jl may be something you are interested in
for example

using TidierData, TidierFiles
mtcars = read_file("https://gist.githubusercontent.com/seankross/a412dfbd88b3db70b74b/raw/5f23f993cd87c283ce766e7ac6b329ee7cc2e1d1/mtcars.csv");

@chain mtcars begin 
    @group_by(cyl)
    @summarize(across((!model),(mean, median)))
end
3Γ—23 DataFrame
 Row β”‚ cyl    mpg_mean  cyl_mean  disp_mean  hp_mean   drat_mean  wt_mean  qsec_mean  vs_mean   am_mean   g β‹―
     β”‚ Int64  Float64   Float64   Float64    Float64   Float64    Float64  Float64    Float64   Float64   F β‹―
─────┼───────────────────────────────────────────────────────────────────────────────────────────────────────
   1 β”‚     6   19.7429       6.0    183.314  122.286     3.58571  3.11714    17.9771  0.571429  0.428571    β‹―
   2 β”‚     4   26.6636       4.0    105.136   82.6364    4.07091  2.28573    19.1373  0.909091  0.727273
   3 β”‚     8   15.1          8.0    353.1    209.214     3.22929  3.99921    16.7721  0.0       0.142857
                                                                                           13 columns omitted

[Comparison to DF.jl - TidierData.jl]

2 Likes

This situation (applying multiple functions to many columns) seems different to me than the one exposed by the OP

It can be done directly in DataFrames

julia> gdf = groupby(mtcars, :cyl);

julia> combine(gdf, Cols(Not([:model,:cyl])) .=>[mean maximum])
3Γ—21 DataFrame
 Row β”‚ cyl    mpg_mean  disp_mean  hp_mean   drat_mean  wt_mean  qsec_mean  vs_ β‹―
     β”‚ Int64  Float64   Float64    Float64   Float64    Float64  Float64    Flo β‹―
─────┼───────────────────────────────────────────────────────────────────────────
   1 β”‚     4   26.6636    105.136   82.6364    4.07091  2.28573    19.1373  0.9 β‹―
   2 β”‚     6   19.7429    183.314  122.286     3.58571  3.11714    17.9771  0.5  
   3 β”‚     8   15.1       353.1    209.214     3.22929  3.99921    16.7721  0.0

Wanting to apply the OP request to the cars dataset. Assuming that the last column is the weights one, you could do this way


julia> gdf = groupby(mtcars, :cyl);

julia> wt_m(mx)=[c'*mx[end]/sum(mx[end]) for c in mx[1:end-1]]'
wt_m (generic function with 1 method)

julia> combine(gdf, Cols(Not([:model,:cyl])) =>((x...)->wt_m(x))=>x->x[1:end-1])
3Γ—10 DataFrame
 Row β”‚ cyl    mpg      disp     hp        drat     wt       qsec     vs         β‹―
     β”‚ Int64  Float64  Float64  Float64   Float64  Float64  Float64  Float64    β‹―
─────┼───────────────────────────────────────────────────────────────────────────
   1 β”‚     4  26.3941  109.182   84.1765  4.09588  2.32535  19.0665  0.882353   β‹―
   2 β”‚     6  19.7375  165.575  130.375   3.755    3.03313  17.3075  0.416667    
   3 β”‚     8  14.6837  354.82   228.898   3.29612  4.04441  16.4931  0.0
                                                    
julia> combine(gdf, Cols(Not([:model,:cyl])) =>((x...)->wt_m(x))=>x->x[1:end-1] .* "_wm")
3Γ—10 DataFrame
 Row β”‚ cyl    mpg_wm   disp_wm  hp_wm     drat_wm  wt_wm    qsec_wm  vs_wm      β‹―
     β”‚ Int64  Float64  Float64  Float64   Float64  Float64  Float64  Float64    β‹―
─────┼───────────────────────────────────────────────────────────────────────────
   1 β”‚     4  26.3941  109.182   84.1765  4.09588  2.32535  19.0665  0.882353   β‹―
   2 β”‚     6  19.7375  165.575  130.375   3.755    3.03313  17.3075  0.416667    
   3 β”‚     8  14.6837  354.82   228.898   3.29612  4.04441  16.4931  0.0
                                                                2 columns omitted

2 Likes

Thanks for sharing the R syntax. This nearly works in TidierData.jl but currently runs into an error because it doesn’t cleanly translate into DataFrames.jl code due to a variable scoping issue.

Clearly, as others have shared, the result is achievable in DataFrames.jl directly. In either case, I’ll file this as an issue in TidierData so we can more or less get the same dplyr syntax working in Julia.

1 Like

In Julia, from a certain version onwards, the following syntax is possible

julia> (x...,y)=(1,2,3,4)
(1, 2, 3, 4)

julia> x
(1, 2, 3)

julia> y
4

I tried, expecting it to be possible, to use it in the following way within the minilanguage.
But perhaps because I did not use the appropriate form or perhaps for some other reason it does not seem to work.
I wonder if this is an inconsistency or if my expectation is not well founded.


g(x,wt)=x->[c' * wt/sum(wt) for c in x]'

combine(gdf, [Cols(Not([:model,:cyl,:wt]))...,:wt] => (x...,wt)->g(x,wt) => AsTable )


ERROR: syntax: invalid "..." on non-final argument around c:\Users\sprmn\.julia\environments\v1.10.5\DF_func_2_rows.jl:140
Stacktrace:
 [1] top-level scope
   @ c:\Users\sprmn\.julia\environments\v1.10.5\DF_func_2_rows.jl:140

Another problem (?) that I encounter if I try to go the following way is that no method matching iterate(::Cols{Tuple{InvertedIndex{Vector{String}}}})

julia> mw(wt)=x->[c'*wt/sum(wt) for c in x]'
mw (generic function with 1 method)

julia> combine(gdf, ["wt",names(mtcars, x->!in(x, ["model","cyl","wt"]) )...] => ((wt,x...)->mw(wt)(x)) => names(mtcars, x->!in(x, ["model","cyl","wt"]) ) )
3Γ—10 DataFrame
 Row β”‚ cyl    mpg      disp     hp        drat     qsec     vs       β‹―
     β”‚ Int64  Float64  Float64  Float64   Float64  Float64  Float64  β‹―
─────┼────────────────────────────────────────────────────────────────
   1 β”‚     4  25.935   110.35    83.3945  4.03126  19.3804  0.914887 β‹―
   2 β”‚     6  19.6458  185.239  121.558   3.56917  18.122   0.621219  
   3 β”‚     8  14.8064  362.125  209.373   3.20566  16.8853  0.0       
                                                     3 columns omitted

julia> combine(gdf, ["wt",Cols(Not(["model","cyl","wt"]) )...] => ((wt,x...)->mw(wt)(x)) => names(mtcars, x->!in(x, ["model","cyl","wt"]) ) )
ERROR: MethodError: no method matching iterate(::Cols{Tuple{InvertedIndex{Vector{String}}}})

Closest candidates are:
  iterate(::XLSX.SheetRowStreamIterator)
   @ XLSX C:\Users\sprmn\.julia\packages\XLSX\vzApD\src\stream.jl:78  
  iterate(::XLSX.SheetRowStreamIterator, ::Union{Nothing, XLSX.SheetRowStreamIteratorState})
   @ XLSX C:\Users\sprmn\.julia\packages\XLSX\vzApD\src\stream.jl:78  
  iterate(::Pkg.Resolve.NodePerm, Any...)
   @ Pkg C:\Users\sprmn\.julia\juliaup\julia-1.10.6+0.x64.w64.mingw32\share\julia\stdlib\v1.10\Pkg\src\Resolve\maxsum.jl:240
  ...

Stacktrace:
 [1] top-level scope
   @ c:\Users\sprmn\.julia\environments\v1.10.5\DF_func_2_rows.jl:148