Create grouped dataframe by properties of a given column?

phantom · April 22, 2024, 9:44pm

Hi just wondering if there a way to group a DataFrame by properties within a column. E.g. Suppose

julia> df = DataFrame(date = Date(2024,3,25): Day(1):Date(2024,4,5))
12×1 DataFrame
 Row │ date       
     │ Date       
─────┼────────────
   1 │ 2024-03-25
   2 │ 2024-03-26
   3 │ 2024-03-27
   4 │ 2024-03-28
   5 │ 2024-03-29
   6 │ 2024-03-30
   7 │ 2024-03-31
   8 │ 2024-04-01
   9 │ 2024-04-02
  10 │ 2024-04-03
  11 │ 2024-04-04
  12 │ 2024-04-05

And I want to groupby yearmonth.(df.date). Can this be accomplished without creating a new column in the dataframe? May be trivial but I just wanted to make sure I wasn’t missing something similar to the by kwarg in searchsortedfirst. Thanks!

rocco_sprmnt21 · April 22, 2024, 10:08pm

not with the groupby() function of DataFrames.
But you could with SplitApllyCombine group()

julia> using SplitApplyCombine

julia> df = DataFrame(date = Date(2024,12,29): Day(1):Date(2025,1,5))
8×1 DataFrame
 Row │ date       
     │ Date
─────┼────────────
   1 │ 2024-12-29
   2 │ 2024-12-30
   3 │ 2024-12-31
   4 │ 2025-01-01
   5 │ 2025-01-02
   6 │ 2025-01-03
   7 │ 2025-01-04
   8 │ 2025-01-05

julia> group( yearmonth, df.date)
2-element Dictionaries.Dictionary{Tuple{Int64, Int64}, Vector{Date}}
 (2024, 12) │ [Date("2024-12-29"), Date("2024-12-30"), Date("2024-12-31")]        
  (2025, 1) │ [Date("2025-01-01"), Date("2025-01-02"), Date("2025-01-03"), Date("…

julia> group(yearmonth, month,  df.date)
2-element Dictionaries.Dictionary{Tuple{Int64, Int64}, Vector{Int64}}
 (2024, 12) │ [12, 12, 12]
  (2025, 1) │ [1, 1, 1, 1, 1]

julia> group(year, month,  df.date)
2-element Dictionaries.Dictionary{Int64, Vector{Int64}}
 2024 │ [12, 12, 12]
 2025 │ [1, 1, 1, 1, 1]

phantom · April 22, 2024, 11:46pm

Awesome thanks so much, this is super helpful!

So if I wanted to keep it in a DataFrame format my only option would be create an additional column. e.g.

df.yearmonth = yearmonth.(df.date)

or

transform!(df, :date => (x ->yearmonth.(x)) => :yearmonth)

Just as a quick follow up if you have a moment- I believe I get a different result without the ( ) in the expression. i.e.

transform!(df, :date => x -> yearmonth.(x)  => :yearmonth)

is that without the ( ) Julia takes the entire :date column as x for each row of the DataFrame.

hence

julia> transform!(df, :date => (x ->yearmonth.(x)) => :yearmonth)
12×2 DataFrame
 Row │ date        yearmonth 
     │ Date        Tuple…    
─────┼───────────────────────
   1 │ 2024-03-25  (2024, 3)
   2 │ 2024-03-26  (2024, 3)
   3 │ 2024-03-27  (2024, 3)
   4 │ 2024-03-28  (2024, 3)
   5 │ 2024-03-29  (2024, 3)
   6 │ 2024-03-30  (2024, 3)
   7 │ 2024-03-31  (2024, 3)
   8 │ 2024-04-01  (2024, 4)
   9 │ 2024-04-02  (2024, 4)
  10 │ 2024-04-03  (2024, 4)
  11 │ 2024-04-04  (2024, 4)
  12 │ 2024-04-05  (2024, 4)

whereas without the ( )

julia> transform!(df, :date => x ->yearmonth.(x) => :yearmonth)
12×2 DataFrame
 Row │ date        date_function                     
     │ Date        Pair…                             
─────┼───────────────────────────────────────────────
   1 │ 2024-03-25  [(2024, 3), (2024, 3), (2024, 3)…
   2 │ 2024-03-26  [(2024, 3), (2024, 3), (2024, 3)…
   3 │ 2024-03-27  [(2024, 3), (2024, 3), (2024, 3)…
   4 │ 2024-03-28  [(2024, 3), (2024, 3), (2024, 3)…
   5 │ 2024-03-29  [(2024, 3), (2024, 3), (2024, 3)…
   6 │ 2024-03-30  [(2024, 3), (2024, 3), (2024, 3)…
   7 │ 2024-03-31  [(2024, 3), (2024, 3), (2024, 3)…
   8 │ 2024-04-01  [(2024, 3), (2024, 3), (2024, 3)…
   9 │ 2024-04-02  [(2024, 3), (2024, 3), (2024, 3)…
  10 │ 2024-04-03  [(2024, 3), (2024, 3), (2024, 3)…
  11 │ 2024-04-04  [(2024, 3), (2024, 3), (2024, 3)…
  12 │ 2024-04-05  [(2024, 3), (2024, 3), (2024, 3)…

Assuming my interpretation is correct (please feel free to correct me if i’m wrong), I’m still a little confused as to why date_function is assigned as the name of the new column when the () are removed?

bertschi · April 23, 2024, 6:31am

That is due to an unfortunate precedence between => and ->, meaning that :date => x -> yearmonth.(x) => :yearmont gets parsed at :date => (x -> (yearmonth.(x) => :yearmonth)). Using a smaller example, you can see that the new column actually contains a Pair and has the name chosen by default as no new name was given (parsed as :name => function):

julia> transform(df[1:1, :], :date => x -> yearmonth.(x)  => :yearmonth)
1×2 DataFrame
 Row │ date        date_function            
     │ Date        Pair…                    
─────┼──────────────────────────────────────
   1 │ 2024-12-29  [(2024, 12)]=>:yearmonth

rocco_sprmnt21 · April 23, 2024, 2:39pm

I vaguely remember that a long time ago this request was asked (by me?) to Bkamins.
I would say that at the moment it is the most natural solution, even if, if you wanted to push things a little, you could avoid adding a “service column”.
If you show us what you want to do(*) with the groups you want to define, you may get more specific proposals.

(*)Groupings are usually done because you then want to apply some transformation that acts on the individual groups

phantom · April 23, 2024, 6:02pm

Thanks to both of you for the incredibly helpful clarifications. I ended up using the group method to get a dictionary of the different yearmonths in the data which worked really well!

rocco_sprmnt21 · April 23, 2024, 7:15pm

If you don’t want to depend on other packages, you could use something like this

groupbyf(f,data)=[f(ym)=>@view data[findall(d->f(ym)==f(d), data)] for ym in unique(f,data)]

rocco_sprmnt21 · April 24, 2024, 6:47am

function groupviewbyf(f::Function, data)
    fdata=f.(data)
    spi=sortperm(fdata)
    #spi=sortperm(data, by=f)
    l=1
    r=0
    ym=f(data[spi[1]])
    vr= Vector{Tuple{typeof(ym), Vector{Int}}}()
    for i  in eachindex(data)
        if f(data[spi[i]])== ym
            r+=1
        else
            ym=f(data[spi[i]])
            push!(vr,(ym,spi[l:r]))
            l=i
            r+=1
        end
    end
    push!(vr,(ym,spi[l:r]))
    [k=>view(data,v) for (k,v) in vr]
    #vr
end

This is a (non-functional) version that appears to be more efficient.
I’m posting it mainly to ask a question.

Why do the following two expressions

    fdata=f.(data)
    spi=sortperm(fdata)

perform so much better than the next one (which does the same thing)?

    spi=sortperm(data, by=f)

phantom · April 26, 2024, 1:55pm

Would it have to do with how sortperm handles the by kwarg? If it’s handled the same way you mentioned here.

Then could sortperm be computing f(data) multiple times to sort the data whereas with

f.(data) is calculated just once?

rocco_sprmnt21 · April 26, 2024, 3:26pm

I would say no. f.(data) maps f to each data element, similar to how f does searchsortedfirst(data, x; by=f)

In fact I would say almost the opposite, theoretically(*) the f is applied by searchsortedfirst only until the value is reached, not on the entire vector.

(*) by this I mean what I IMAGINE (but I don’t know directly from reading/studying it) is the strategy of the function.
The only doubt is whether f is applied to x (the objective value) only once at the beginning or every time the comparison is made

Topic		Replies	Views
Groupby on an expression or a vector? New to Julia	21	574	June 11, 2024
Group DataFrames by a function of a column Data package	4	1206	December 11, 2019
DataFrame Groupby New to Julia dataframes	2	2148	April 26, 2018
Create a GroupedDataFrame by the relations of rows rather than the values of the rows in a column, e.g `groupby` consecutive dates? New to Julia question , dataframes , grouped-data	14	709	March 29, 2023
Tidying data: DataFrame New to Julia	12	1311	June 5, 2020

Create grouped dataframe by properties of a given column?

Related topics