Normalized nrow in a GroupedDataFrame

Storopoli · April 29, 2021, 7:51am

What is the best way to do a “normalized” nrow in a gdf?

using DataFrames
using Chain
df = DataFrame(id=1:6,
               name=["Aaron Aardvark", "Belen Barboza",
                     "春 陈", "Даниил Дубов",
                     "Elżbieta Elbląg", "Felipe Fittipaldi"],
               age=[50, 45, 40, 35, 30, 25],
               eye=["blue", "brown", "hazel", "blue", "green", "brown"],
               grade_1=[95, 90, 85, 90, 95, 90],
               grade_2=[75, 80, 65, 90, 75, 95],
               grade_3=[85, 85, 90, 85, 80, 85])
@chain df begin
    groupby(:eye)
    combine(nrow => :n,  x -> nrow(x) / nrow(df))
end

That outputs:

4×3 DataFrame
 Row │ eye     n      x1       
     │ String  Int64  Float64  
─────┼─────────────────────────
   1 │ blue        2  0.333333
   2 │ brown       2  0.333333
   3 │ hazel       1  0.166667
   4 │ green       1  0.166667

But if I try to rename the x1 column, I get something strange:

@chain df begin
    groupby(:eye)
    combine(nrow => :n,  x -> nrow(x) / nrow(df) => :perc)
end
4×3 DataFrame
 Row │ eye     n      x1              
     │ String  Int64  Pair…           
─────┼────────────────────────────────
   1 │ blue        2  0.333333=>:perc
   2 │ brown       2  0.333333=>:perc
   3 │ hazel       1  0.166667=>:perc
   4 │ green       1  0.166667=>:perc

nilshg · April 29, 2021, 7:59am

That’s operator precedence for you:

julia> @chain df begin
           groupby(:eye)
           combine(nrow => :n,  :name => (x -> length(x) / nrow(df)) => :perc)
       end
4×3 DataFrame
 Row │ eye     n      perc     
     │ String  Int64  Float64  
─────┼─────────────────────────
   1 │ blue        2  0.333333
   2 │ brown       2  0.333333
   3 │ hazel       1  0.166667
   4 │ green       1  0.166667

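(Note the parentheses around the anonymous function. Without them, => binds more tightly than ->, so the => :perc ends up inside the function body and the function returns a Pair.)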
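
The parsing difference can be checked in plain Julia without DataFrames. This is only a small sketch; the names `halve_then_pair` and `named` are made up for illustration:

```julia
# Without parentheses, the whole `x / 2 => :perc` expression is the function body,
# so calling the function returns a Pair instead of a number.
halve_then_pair = x -> x / 2 => :perc
result = halve_then_pair(1.0)
println(result isa Pair)        # the function itself produced the Pair

# With parentheses, the arrow function ends at `)`, and `=>` builds a
# `function => name` Pair around it.
named = (x -> x / 2) => :perc
println(named isa Pair)         # the Pair wraps the function
println(first(named)(1.0))      # the wrapped function returns a plain number
```

That is why the unparenthesized version filled the x1 column with `0.333333=>:perc` values rather than renaming it.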

Storopoli · April 29, 2021, 8:02am

Thank you!

fredrikekre · April 29, 2021, 8:10am

Why does the first case work? I don’t see anything in the docs for passing just a function like that, only cols => function or cols => function => newcols. And why is that column called x1?

julia> @chain df begin
           groupby(:eye)
           combine(y -> 1)
       end
4×2 DataFrame
 Row │ eye     x1    
     │ String  Int64 
─────┼───────────────
   1 │ blue        1
   2 │ brown       1
   3 │ hazel       1
   4 │ green       1

julia> @chain df begin
           groupby(:eye)
           combine(y -> 1, x -> 2)
       end
ERROR: ArgumentError: duplicate output column name: :x1

pdeffebach · April 29, 2021, 12:22pm

It’s list item 7 here. You can pass a function that accepts a SubDataFrame. But I guess the automatic naming isn’t perfect, so you get an error when it tries to create :x1 twice.
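
One way to sidestep the auto-generated :x1 name is to give every transformation an explicit output name with the `cols => function => newcol` form used earlier in the thread. A minimal sketch, assuming a trimmed-down copy of the data above (the `:mean_age` column is just an illustrative second output, not something from the original question):

```julia
using DataFrames

# Trimmed-down version of the data frame from the first post.
df = DataFrame(eye=["blue", "brown", "hazel", "blue", "green", "brown"],
               age=[50, 45, 40, 35, 30, 25])
gdf = groupby(df, :eye)

# Every transformation names its own output column, so nothing falls back
# to the automatic :x1 name and no duplicate-name error can occur.
res = combine(gdf,
              nrow => :n,
              :age => (a -> length(a) / nrow(df)) => :perc,
              :age => (a -> sum(a) / length(a)) => :mean_age)
```

Here `a` is the vector of `age` values for one group, so `length(a)` plays the role of the group's row count.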
