Create a GroupedDataFrame based on relations between rows rather than the values in a column, e.g. `groupby` consecutive dates?

Hi! Hope all is well. Suppose there is the following DataFrame:

 dr = collect(Date(2023,1,1):Day(1):Date(2023,1,10));
 df = DataFrame(data = rand((1:10),7), dates=rand(dr,7)); 
 sort!(df, :dates, rev = true)

7×2 DataFrame
 Row │ data   dates      
     │ Int64  Date       
─────┼───────────────────
   1 │     1  2023-01-10
   2 │    10  2023-01-10
   3 │     3  2023-01-07
   4 │     8  2023-01-05
   5 │     9  2023-01-03
   6 │     7  2023-01-02
   7 │     5  2023-01-01

What would be the best way to perform an operation on the data that occurs either on the same day or on consecutive dates, e.g. take the difference between data on the same and consecutive dates, such that

 Row │ data   dates       diff
     │ Int64  Date        Int64?
─────┼────────────────────────────
   1 │     1  2023-01-10       -9
   2 │    10  2023-01-10  missing
   3 │     3  2023-01-07  missing
   4 │     8  2023-01-05  missing
   5 │     9  2023-01-03        2
   6 │     7  2023-01-02        2
   7 │     5  2023-01-01  missing

Ideally I would have liked to avoid the missing values by grouping the DataFrame via consecutive dates and then running a function on each grouped DataFrame via transform. As groupby(df, :dates) would only place data of the same date in a group, I was just wondering if there was a way to use groupby to create a GroupedDataFrame of consecutive Dates?
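Something like this sketch is what I have in mind (the :run column is just a made-up marker for runs of same-or-consecutive dates, assuming df stays sorted by :dates in descending order):

 # hypothetical run id: rows whose dates are equal or one day apart share an id
 df.run = cumsum([true; diff(df.dates) .< -Day(1)])
 gdf = groupby(df, :run)   # each group would be one run of consecutive dates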

I assume your data is sorted by :dates. In this case the efficient way to do it would be to use a loop (or a reduction function, which is equivalent but might be harder to implement).

1 Like

Thanks so much! So I couldn’t use groupby directly. Given millions of rows, would an efficient approach look something like this?

function cnsctvdiff(days, data)
    newcol = [missing; diff(data::Vector{Int64})]
    # dates are sorted in descending order, so days[d-1] >= days[d]
    for d = Iterators.rest(eachindex(days::Vector{Date}), 1)
        (days[d-1] - days[d] > Day(1)) && (newcol[d] = missing)
    end
    return newcol
end

df.cdff = cnsctvdiff(df.dates,df.data)

Are type declarations being used correctly here to speed up performance? If operations are to be performed later on df.cdff, would a better approach be to break df down into separate DataFrames so that each column is type stable?

e.g. creating an additional column and then using groupby:

function dateID(days)
    ID = ones(Int64, length(days))
    # dates are sorted in descending order, so days[d-1] >= days[d]
    for d = Iterators.rest(eachindex(days::Vector{Date}), 1)
        (days[d-1] - days[d] > Day(1)) && (ID[d:end] .+= 1)
    end
    return ID
end

df.ID = dateID(df.dates) 

  1. You do not need to add type declarations; Julia will specialize the code automatically.
  2. A loop within a function that gets columns passed as its positional arguments would be efficient (and you roughly get the fastest possible result).

A minor issue is that [missing; diff(data::Vector{Int64})] allocates twice (so it could be optimized), but it should be fast enough.
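For example, one way to avoid the double allocation would be to fill a preallocated vector in a single pass, roughly like this (a sketch, name made up, not benchmarked):

function cnsctvdiff_onepass(days, data)
    # single allocation: a Union{Missing,Int} vector initialized to missing
    newcol = Vector{Union{Missing,Int}}(missing, length(data))
    for d in 2:length(data)
        # dates sorted in descending order: keep the diff only when the rows
        # are on the same day or on consecutive days
        if days[d-1] - days[d] <= Day(1)
            newcol[d] = data[d] - data[d-1]
        end
    end
    return newcol
end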

3 Likes

Thanks! Sorry, is the following a correct understanding of when type declarations improve performance?

  1. If a function takes a column of a DataFrame as an argument then type declarations do not improve performance.

  2. If a function takes an entire DataFrame as an argument then type declarations on each column will improve performance.

The docs on argument type declarations note that they generally do not enhance performance. But you pointed out in a previous post about efficiency when iterating over the columns of a DataFrame that the loop should live inside a function, with the columns passed in as positional arguments.

My background is minimal enough that I wasn’t sure whether the type declarations there improved performance because the loop was entirely outside of a function, or because the entire DataFrame was being passed as an argument to a function.

Assuming the latter, why does a function barrier work for a column of a DataFrame but not the entire DataFrame? Wouldn’t the type stability issue affect both the DataFrame object and the DataFrame column?

This is correct.

why does a function barrier work for a column of a DataFrame but not the entire DataFrame?

Because when you pass a concrete column, a method specific to that column type is compiled by Julia. If you pass a whole DataFrame, no such specialization happens and Julia has to assume that vectors of any type could be stored inside the data frame.
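Roughly, the difference looks like this (a sketch; count_gaps is a made-up helper):

using DataFrames, Dates

# passing the whole data frame: inside this function df.dates is only known
# to be an AbstractVector, so every iteration is dynamically dispatched
function count_gaps_df(df)
    n = 0
    for i in 2:nrow(df)
        n += (df.dates[i-1] - df.dates[i] > Day(1))
    end
    return n
end

# passing the column: a method specialized for Vector{Date} gets compiled,
# so the loop body is type stable (this is the function barrier)
function count_gaps(dates)
    n = 0
    for i in 2:length(dates)
        n += (dates[i-1] - dates[i] > Day(1))
    end
    return n
end

count_gaps(df.dates)   # specialized on the concrete column type
count_gaps_df(df)      # same result, but relies on runtime dispatch inside the loop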

2 Likes

Method using transform!:

julia> df = DataFrame(data = rand((1:10),7), dates=rand(dr,7));

julia> sort!(df, :dates, rev = true);

julia> transform!(df, 
  [:data, :dates] => ((d,t)->[i==length(d) || t[i]-t[i+1]>Day(1) ? 
  missing : d[i+1]-d[i] for i in 1:length(d)]) => :diff)
7×3 DataFrame
 Row │ data   dates       diff    
     │ Int64  Date        Int64?  
─────┼────────────────────────────
   1 │     5  2023-01-10        4
   2 │     9  2023-01-09       -3
   3 │     6  2023-01-08  missing 
   4 │     3  2023-01-04        5
   5 │     8  2023-01-04       -2
   6 │     6  2023-01-03       -4
   7 │     2  2023-01-02  missing 
1 Like

using DataFrames, Dates, IterTools

# append a sentinel row dated two days earlier so the last real row also gets paired
push!(df, merge(copy(last(df)), (dates = last(df).dates - Day(2),)))

pdf = collect(partition(zip(df.data, df.dates), 2, 1))

deleteat!(df, nrow(df))  # drop the sentinel again

df.diff = [r < Day(2) ? l : missing for (l, r) in [p .- f for (p, f) in pdf]]
1 Like

I’ll use the function defined by @Dan as an example to ask a question: what are the differences in terms of performance between the following forms, and if there are any, why (or in what cases) would one of them be preferable?


function md1(d, D)
    #d,D=df.data, df.dates
    diff=Vector{Union{Int, Missing}}(undef, length(d))
    for i in eachindex(d)
        diff[i]=i==length(d) || D[i]-D[i+1]>Day(1) ? 
        missing : d[i+1]-d[i]
    end
    diff
end


function md2(d, D)
    diff=Vector{Union{Int, Missing}}(undef, length(d))
    for i in 1:length(d)-1
        diff[i]=D[i]-D[i+1]>Day(1) ? 
        missing : d[i+1]-d[i]
    end
    diff[end]=missing
    diff
end

transform!(df, 
  [:data, :dates] => ((d,t)->[i==length(d) || t[i]-t[i+1]>Day(1) ? 
  missing : d[i+1]-d[i] for i in 1:length(d)]) => :diff) 

transform!(df, 
  [:data, :dates] => ((d,t)->[[t[i]-t[i+1]>Day(1) ? 
  missing : d[i+1]-d[i] for i in 1:length(d)-1];missing]) => :diff) 


transform!(df, [:data, :dates] => md1 => :diff) 

df.diff=md1(df.data, df.dates)

df.diff=md2(df.data, df.dates)

2 Likes

Thanks so much for the incredibly informative responses! It was really helpful to see the different ways transform could be used. @rocco_sprmnt21, I didn’t see that big a difference in time between md1 and md2 or between the different transform methods, but in accordance with @bkamins’s initial reply, looping through with a function was significantly faster than using transform in this case.

dr = (Date(2023,3,1):Day(1):Date(2023,3,11)); 
df = DataFrame(data=rand(collect(1:10),10), dates=rand(collect(dr),10)); 
sort!(df, :dates, rev= true); 

julia> @benchmark $df.diff = md1(df.data, df.dates)
BenchmarkTools.Trial: 10000 samples with 829 evaluations.
 Range (min … max):  150.584 ns …   6.606 μs  ┊ GC (min … max): 0.00% … 97.49%
 Time  (median):     152.946 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   167.090 ns ± 262.051 ns  ┊ GC (mean ± σ):  6.65% ±  4.13%

    ▄▇█▆▄▂▁   ▁▃▄▂      ▃▅▅▅▄▃▃▃▂▂▂▂▃▂▂▂▁▁▁                     ▂
  ▄▇█████████▇█████▇▆▅▅▇██████████████████████████▇▇▇▆▆▆▆▆▅▅▃▄▄ █
  151 ns        Histogram: log(frequency) by time        171 ns <

 Memory estimate: 192 bytes, allocs estimate: 2.

julia> @benchmark $transform(df, 
         [:data, :dates] => ((d,t)->[i==length(d) || t[i]-t[i+1]>Day(1) ? 
         missing : d[i]-d[i+1] for i in 1:length(d)]) => :diff)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  13.583 μs …  7.552 ms  ┊ GC (min … max): 0.00% … 99.26%
 Time  (median):     14.166 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   15.198 μs ± 75.392 μs  ┊ GC (mean ± σ):  4.93% ±  0.99%

    ▂▆█▇▅▂                                                     
  ▂▄██████▇▇▅▄▃▃▃▃▃▃▃▃▃▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  13.6 μs         Histogram: frequency by time        18.5 μs <

 Memory estimate: 9.27 KiB, allocs estimate: 174.

I understood that your data was several million rows. Perhaps some differences between md1 and md2 may show up for large data frames; for example, md2 avoids doing the i==length(d) check on every iteration. My doubt is whether this makes any difference at large sizes.
I’m curious what the real structure of your data is. How do you have millions of records with different dates? Evidently in many cases you have many records with the same date. The consecutive data you want to compare are measurements of what?

I take this opportunity to propose another “toy solution”

grp=IterTools.groupby(isequal∘last ,zip(df.data,df.dates))

ggrp=IterTools.groupby(let c=last(first(first(grp))); x->c= (c-last(first(x))) > Day(1) ? last(first(x)) : c end, grp)

collect(Iterators.flatten([diff(first.(g));missing] for g in Iterators.flatten.(ggrp)))

1 Like

Another way to get rid of the i == length(d) check, and the corresponding branching (the short-circuit || and the ? : operator), would be:

abs(t[i]-t[i*(i!=length(d))+1]) > Day(1)

This assumes abs(t[length(d)]-t[1]) > Day(1), which should be the case when at least 2 distinct days are recorded. The resulting expression:

transform(df, 
  [:data, :dates] => ((d,t)->[abs(t[i]-t[i*(i!=length(d))+1]) > Day(1) ? 
  missing : d[i]-d[i+1] for i in 1:length(d)]) => :diff)

Maybe a bit more branch elimination would help:

transform(df, 
  [:data, :dates] => ((d,t)->[ifelse(abs(t[i]-t[i*(i!=length(d))+1]) > Day(1), 
  missing, d[i]-d[i*(i!=length(d))+1]) for i in 1:length(d)]) => :diff)
1 Like

@rocco_sprmnt21 Sorry, I didn’t think about the size of the DataFrame and the overhead involved. Yes, the actual data set contains maybe 10^2–10^3 data points daily. Testing with the following larger DataFrame, md1 and md2 are still very close; transform is within the same order of magnitude as md1 and md2, but allocates more memory.

dr = Date(2020,1,1):Day(1):Date(2030,12,31)
df3 = DataFrame(data = rand(collect(1:100), 10^8), dates = rand(collect(dr), 10^8));
sort!(df3, :dates; rev = true);

julia> @benchmark $df3.diff = md2(df3.data,df3.dates)
BenchmarkTools.Trial: 25 samples with 1 evaluation.
 Range (min … max):  185.305 ms … 272.337 ms  ┊ GC (min … max): 0.00% … 20.56%
 Time  (median):     194.374 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   206.284 ms ±  23.500 ms  ┊ GC (mean ± σ):  5.64% ±  8.82%

     ▄█▄▁     ▁                           ▁                      
  ▆▁▆████▆▁▁▁▁█▆▁▆▁▁▁▁▁▁▆▁▁▁▁▁▁▁▁▁▁▁▁▆▁▁▁▁█▁▁▁▁▆▁▁▁▁▁▁▁▁▁▁▁▁▁▁▆ ▁
  185 ms           Histogram: frequency by time          272 ms <

 Memory estimate: 858.31 MiB, allocs estimate: 4.

julia> @benchmark $df3.diff = md1(df3.data,df3.dates)
BenchmarkTools.Trial: 24 samples with 1 evaluation.
 Range (min … max):  193.103 ms … 245.439 ms  ┊ GC (min … max): 0.00% … 22.36%
 Time  (median):     202.986 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   210.611 ms ±  17.124 ms  ┊ GC (mean ± σ):  5.66% ±  9.03%

      ▃   █  ▃  ▃                                           ▃    
  ▇▁▇▇█▁▇▇█▁▇█▁▁█▁▁▁▁▁▇▇▇▁▁▇▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▇▁▁▁▁▁▁▁▇▁▁█▁▇ ▁
  193 ms           Histogram: frequency by time          245 ms <

 Memory estimate: 858.31 MiB, allocs estimate: 4.

(I couldn’t get the toy solution to return the desired result on test data, and I didn’t understand it well enough to adjust it. But as-is it ran meaningfully slower than the above solutions you proposed, benchmarking at 2.649 seconds with the same DataFrame.)

@Dan, is the reason d[i]-d[i*(i!=length(d))+1] is used in the transform, as opposed to d[i]-d[i+1], that the ifelse() function evaluates all of its arguments regardless of the condition? The docs mention this can improve performance but don’t go into detail about when that holds true. Here ifelse() actually appears slightly slower than using the ? operator.
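A toy example of what I mean (made up, just to illustrate the eager evaluation):

 x = [1, 2, 3]; i = 3
 i == length(x) ? missing : x[i+1]         # fine: ? : short-circuits, x[i+1] is never evaluated
 ifelse(i == length(x), missing, x[i+1])   # BoundsError: ifelse evaluates x[i+1] regardless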

julia> @benchmark $transform(df3, 
                [:data, :dates] => ((d,t)->[ifelse(abs(t[i]-t[i*(i!=length(d))+1]) > Day(1), 
                missing, d[i]-d[i*(i!=length(d))+1]) for i in 1:length(d)]) => :diff)
BenchmarkTools.Trial: 7 samples with 1 evaluation.
 Range (min … max):  563.228 ms …    1.204 s  ┊ GC (min … max):  0.00% … 17.49%
 Time  (median):     707.159 ms               ┊ GC (median):     5.23%
 Time  (mean ± σ):   738.541 ms ± 221.995 ms  ┊ GC (mean ± σ):  10.88% ± 10.20%

  █  ▁         ▁  ▁   ▁                                       ▁  
  █▁▁█▁▁▁▁▁▁▁▁▁█▁▁█▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  563 ms           Histogram: frequency by time           1.2 s <

 Memory estimate: 3.91 GiB, allocs estimate: 191.


julia> @benchmark $transform(df3, 
                [:data, :dates] => ((d,t)->[i==length(d) || t[i]-t[i+1]>Day(1) ? 
         missing : d[i]-d[i+1] for i in 1:length(d)])=> :diff)
BenchmarkTools.Trial: 8 samples with 1 evaluation.
 Range (min … max):  519.620 ms …    1.257 s  ┊ GC (min … max):  0.00% … 17.73%
 Time  (median):     568.886 ms               ┊ GC (median):     8.79%
 Time  (mean ± σ):   664.064 ms ± 248.113 ms  ┊ GC (mean ± σ):  11.89% ±  9.84%

  █                                                              
  █▁▁▇▁▇▁▁▁▁▁▇▁▁▇▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▇ ▁
  520 ms           Histogram: frequency by time          1.26 s <

 Memory estimate: 3.91 GiB, allocs estimate: 191.

Yes, the toy solution is no good; I had been too hasty.
This should work better.
It’s certainly not useful for the problem, though; it’s just an exercise in finding a “functional” solution.


grp=IterTools.groupby(isequal∘last ,zip(df.data,df.dates))

ggrp=IterTools.groupby(let c=first(enumerate(grp)); x->c= last(last(last(c)))-last(last(last(x))) == Day(last(first(x))-first(c)) ? c : x end, enumerate(grp))

fl=Iterators.flatten
cdiff=collect(fl([diff(first.(fl(last.(g))));missing] for g in ggrp))

1 Like

Thanks for the clarification!