Create grouped dataframe by properties of a given column?

Hi just wondering if there a way to group a DataFrame by properties within a column. E.g. Suppose

julia> df = DataFrame(date = Date(2024,3,25): Day(1):Date(2024,4,5))
12×1 DataFrame
 Row │ date       
     │ Date       
─────┼────────────
   1 │ 2024-03-25
   2 │ 2024-03-26
   3 │ 2024-03-27
   4 │ 2024-03-28
   5 │ 2024-03-29
   6 │ 2024-03-30
   7 │ 2024-03-31
   8 │ 2024-04-01
   9 │ 2024-04-02
  10 │ 2024-04-03
  11 │ 2024-04-04
  12 │ 2024-04-05

And I want to groupby yearmonth.(df.date). Can this be accomplished without creating a new column in the dataframe? May be trivial but I just wanted to make sure I wasn’t missing something similar to the by kwarg in searchsortedfirst. Thanks!

not with the groupby() function of DataFrames.
But you could with SplitApllyCombine group()

julia> using SplitApplyCombine

julia> df = DataFrame(date = Date(2024,12,29): Day(1):Date(2025,1,5))
8×1 DataFrame
 Row │ date       
     │ Date
─────┼────────────
   1 │ 2024-12-29
   2 │ 2024-12-30
   3 │ 2024-12-31
   4 │ 2025-01-01
   5 │ 2025-01-02
   6 │ 2025-01-03
   7 │ 2025-01-04
   8 │ 2025-01-05

julia> group( yearmonth, df.date)
2-element Dictionaries.Dictionary{Tuple{Int64, Int64}, Vector{Date}}
 (2024, 12) │ [Date("2024-12-29"), Date("2024-12-30"), Date("2024-12-31")]        
  (2025, 1) │ [Date("2025-01-01"), Date("2025-01-02"), Date("2025-01-03"), Date("…

julia> group(yearmonth, month,  df.date)
2-element Dictionaries.Dictionary{Tuple{Int64, Int64}, Vector{Int64}}
 (2024, 12) │ [12, 12, 12]
  (2025, 1) │ [1, 1, 1, 1, 1]

julia> group(year, month,  df.date)
2-element Dictionaries.Dictionary{Int64, Vector{Int64}}
 2024 │ [12, 12, 12]
 2025 │ [1, 1, 1, 1, 1]
1 Like

Awesome thanks so much, this is super helpful!

So if I wanted to keep it in a DataFrame format my only option would be create an additional column. e.g.

df.yearmonth = yearmonth.(df.date)

or

transform!(df, :date => (x ->yearmonth.(x)) => :yearmonth)

Just as a quick follow up if you have a moment- I believe I get a different result without the ( ) in the expression. i.e.

transform!(df, :date => x -> yearmonth.(x)  => :yearmonth)

is that without the ( ) Julia takes the entire :date column as x for each row of the DataFrame.

hence

julia> transform!(df, :date => (x ->yearmonth.(x)) => :yearmonth)
12×2 DataFrame
 Row │ date        yearmonth 
     │ Date        Tuple…    
─────┼───────────────────────
   1 │ 2024-03-25  (2024, 3)
   2 │ 2024-03-26  (2024, 3)
   3 │ 2024-03-27  (2024, 3)
   4 │ 2024-03-28  (2024, 3)
   5 │ 2024-03-29  (2024, 3)
   6 │ 2024-03-30  (2024, 3)
   7 │ 2024-03-31  (2024, 3)
   8 │ 2024-04-01  (2024, 4)
   9 │ 2024-04-02  (2024, 4)
  10 │ 2024-04-03  (2024, 4)
  11 │ 2024-04-04  (2024, 4)
  12 │ 2024-04-05  (2024, 4)

whereas without the ( )

julia> transform!(df, :date => x ->yearmonth.(x) => :yearmonth)
12×2 DataFrame
 Row │ date        date_function                     
     │ Date        Pair…                             
─────┼───────────────────────────────────────────────
   1 │ 2024-03-25  [(2024, 3), (2024, 3), (2024, 3)…
   2 │ 2024-03-26  [(2024, 3), (2024, 3), (2024, 3)…
   3 │ 2024-03-27  [(2024, 3), (2024, 3), (2024, 3)…
   4 │ 2024-03-28  [(2024, 3), (2024, 3), (2024, 3)…
   5 │ 2024-03-29  [(2024, 3), (2024, 3), (2024, 3)…
   6 │ 2024-03-30  [(2024, 3), (2024, 3), (2024, 3)…
   7 │ 2024-03-31  [(2024, 3), (2024, 3), (2024, 3)…
   8 │ 2024-04-01  [(2024, 3), (2024, 3), (2024, 3)…
   9 │ 2024-04-02  [(2024, 3), (2024, 3), (2024, 3)…
  10 │ 2024-04-03  [(2024, 3), (2024, 3), (2024, 3)…
  11 │ 2024-04-04  [(2024, 3), (2024, 3), (2024, 3)…
  12 │ 2024-04-05  [(2024, 3), (2024, 3), (2024, 3)…

Assuming my interpretation is correct (please feel free to correct me if i’m wrong), I’m still a little confused as to why date_function is assigned as the name of the new column when the () are removed?

That is due to an unfortunate precedence between => and ->, meaning that :date => x -> yearmonth.(x) => :yearmont gets parsed at :date => (x -> (yearmonth.(x) => :yearmonth)). Using a smaller example, you can see that the new column actually contains a Pair and has the name chosen by default as no new name was given (parsed as :name => function):

julia> transform(df[1:1, :], :date => x -> yearmonth.(x)  => :yearmonth)
1×2 DataFrame
 Row │ date        date_function            
     │ Date        Pair…                    
─────┼──────────────────────────────────────
   1 │ 2024-12-29  [(2024, 12)]=>:yearmonth

I vaguely remember that a long time ago this request was asked (by me?) to Bkamins.
I would say that at the moment it is the most natural solution, even if, if you wanted to push things a little, you could avoid adding a “service column”.
If you show us what you want to do(*) with the groups you want to define, you may get more specific proposals.

(*)Groupings are usually done because you then want to apply some transformation that acts on the individual groups

Thanks to both of you for the incredibly helpful clarifications. I ended up using the group method to get a dictionary of the different yearmonths in the data which worked really well!

If you don’t want to depend on other packages, you could use something like this

groupbyf(f,data)=[f(ym)=>@view data[findall(d->f(ym)==f(d), data)] for ym in unique(f,data)]
1 Like
function groupviewbyf(f::Function, data)
    fdata=f.(data)
    spi=sortperm(fdata)
    #spi=sortperm(data, by=f)
    l=1
    r=0
    ym=f(data[spi[1]])
    vr= Vector{Tuple{typeof(ym), Vector{Int}}}()
    for i  in eachindex(data)
        if f(data[spi[i]])== ym
            r+=1
        else
            ym=f(data[spi[i]])
            push!(vr,(ym,spi[l:r]))
            l=i
            r+=1
        end
    end
    push!(vr,(ym,spi[l:r]))
    [k=>view(data,v) for (k,v) in vr]
    #vr
end

This is a (non-functional) version that appears to be more efficient.
I’m posting it mainly to ask a question.

Why do the following two expressions

    fdata=f.(data)
    spi=sortperm(fdata)

perform so much better than the next one (which does the same thing)?

    spi=sortperm(data, by=f)
1 Like

Would it have to do with how sortperm handles the by kwarg? If it’s handled the same way you mentioned here.

Then could sortperm be computing f(data) multiple times to sort the data whereas with

f.(data) is calculated just once?

I would say no. f.(data) maps f to each data element, similar to how f does searchsortedfirst(data, x; by=f)

In fact I would say almost the opposite, theoretically(*) the f is applied by searchsortedfirst only until the value is reached, not on the entire vector.

(*) by this I mean what I IMAGINE (but I don’t know directly from reading/studying it) is the strategy of the function.
The only doubt is whether f is applied to x (the objective value) only once at the beginning or every time the comparison is made

1 Like