Create a GroupedDataFrame based on relations between rows rather than the values in a column, e.g. `groupby` consecutive dates?

Hi! Hope all is well. Suppose there is the following DataFrame:

 dr = collect(Date(2023,1,1):Day(1):Date(2023,1,10));
 df = DataFrame(data = rand((1:10),7), dates=rand(dr,7)); 
 sort!(df, :dates, rev = true)

7×2 DataFrame
 Row │ data   dates      
     │ Int64  Date       
─────┼───────────────────
   1 │     1  2023-01-10
   2 │    10  2023-01-10
   3 │     3  2023-01-07
   4 │     8  2023-01-05
   5 │     9  2023-01-03
   6 │     7  2023-01-02
   7 │     5  2023-01-01

What would be the best way to perform an operation on the data that occurs either on the same day or on consecutive dates, e.g. take the difference between data on the same and consecutive dates, such that

 Row │ data   dates       diff
     │ Int64  Date        Int64?
─────┼────────────────────────────
   1 │     1  2023-01-10       -9
   2 │    10  2023-01-10  missing
   3 │     3  2023-01-07  missing
   4 │     8  2023-01-05  missing
   5 │     9  2023-01-03        2
   6 │     7  2023-01-02        2
   7 │     5  2023-01-01  missing

Ideally I would have liked to avoid the missing values by grouping the DataFrame via consecutive dates and then running a function on each grouped DataFrame via transform. As groupby(df, :dates) would only place data of the same date in a group, I was just wondering if there was a way to use groupby to create a GroupedDataFrame of consecutive Dates?
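Something like this sketch is what I have in mind (the :run column is just a made-up marker for runs of same-or-consecutive dates, assuming df stays sorted by :dates in descending order):

 # hypothetical run id: rows whose dates are equal or one day apart share an id
 df.run = cumsum([true; diff(df.dates) .< -Day(1)])
 gdf = groupby(df, :run)   # each group would be one run of consecutive dates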

I assume your data is sorted by :dates. In this case the efficient way to do it would be to use a loop (or a reduction function, which is equivalent but might be harder to implement).

1 Like

Thanks so much! So I couldn’t use groupby directly. Given millions of rows, would an efficient approach look something like this?

function cnsctvdiff(days, data)
    newcol = [missing; diff(data::Vector{Int64})]
    # dates are sorted in descending order, so days[d-1] >= days[d]
    for d = Iterators.rest(eachindex(days::Vector{Date}), 1)
        (days[d-1] - days[d] > Day(1)) && (newcol[d] = missing)
    end
    return newcol
end

df.cdff = cnsctvdiff(df.dates,df.data)

Are type declarations being used correctly here to speed up performance? If operations are to be performed later on df.cdff, would a better approach be to break df down into separate DataFrames so that each column is type stable?

e.g. creating an additional column and then using groupby:

function dateID(days)
    ID = ones(Int64, length(days))
    # dates are sorted in descending order, so days[d-1] >= days[d]
    for d = Iterators.rest(eachindex(days::Vector{Date}), 1)
        (days[d-1] - days[d] > Day(1)) && (ID[d:end] .+= 1)
    end
    return ID
end

df.ID = dateID(df.dates) 

  1. You do not need to add type declarations; Julia will specialize the code automatically.
  2. A loop within a function that gets columns passed as its positional arguments would be efficient (and you roughly get the fastest possible result).

A minor issue is that [missing; diff(data::Vector{Int64})] allocates twice (so it could be optimized), but it should be fast enough.
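For example, one way to avoid the double allocation would be to fill a preallocated vector in a single pass, roughly like this (a sketch, name made up, not benchmarked):

function cnsctvdiff_onepass(days, data)
    # single allocation: a Union{Missing,Int} vector initialized to missing
    newcol = Vector{Union{Missing,Int}}(missing, length(data))
    for d in 2:length(data)
        # dates sorted in descending order: keep the diff only when the rows
        # are on the same day or on consecutive days
        if days[d-1] - days[d] <= Day(1)
            newcol[d] = data[d] - data[d-1]
        end
    end
    return newcol
end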

3 Likes

Thanks! Sorry, is the following a correct understanding of when type declarations improve performance?

  1. If a function takes a column of a DataFrame as an argument then type declarations do not improve performance.

  2. If a function takes an entire DataFrame as an argument then type declarations on each column will improve performance.

The docs on argument type declarations note that they generally do not enhance performance. But you pointed out in a previous post about efficiency when iterating over the columns of a DataFrame that the loop should live inside a function, with the columns passed in as positional arguments.

My background is minimal enough that I wasn’t sure whether the type declarations there improved performance because the loop was entirely outside of a function, or because the entire DataFrame was being passed as an argument to a function.

Assuming the latter, why does a function barrier work for a column of a DataFrame but not the entire DataFrame? Wouldn’t the type stability issue affect both the DataFrame object and the DataFrame column?

This is correct.

why does a function barrier work for a column of a DataFrame but not the entire DataFrame?

Because when you pass a concrete column, a method specific to that column type is compiled by Julia. If you pass a whole DataFrame, no such specialization happens and Julia has to assume that vectors of any type could be stored inside the data frame.
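Roughly, the difference looks like this (a sketch; count_gaps is a made-up helper):

using DataFrames, Dates

# passing the whole data frame: inside this function df.dates is only known
# to be an AbstractVector, so every iteration is dynamically dispatched
function count_gaps_df(df)
    n = 0
    for i in 2:nrow(df)
        n += (df.dates[i-1] - df.dates[i] > Day(1))
    end
    return n
end

# passing the column: a method specialized for Vector{Date} gets compiled,
# so the loop body is type stable (this is the function barrier)
function count_gaps(dates)
    n = 0
    for i in 2:length(dates)
        n += (dates[i-1] - dates[i] > Day(1))
    end
    return n
end

count_gaps(df.dates)   # specialized on the concrete column type
count_gaps_df(df)      # same result, but relies on runtime dispatch inside the loop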

2 Likes

Method using transform!:

julia> df = DataFrame(data = rand((1:10),7), dates=rand(dr,7));

julia> sort!(df, :dates, rev = true);

julia> transform!(df, 
  [:data, :dates] => ((d,t)->[i==length(d) || t[i]-t[i+1]>Day(1) ? 
  missing : d[i+1]-d[i] for i in 1:length(d)]) => :diff)
7×3 DataFrame
 Row │ data   dates       diff    
     │ Int64  Date        Int64?  
─────┼────────────────────────────
   1 │     5  2023-01-10        4
   2 │     9  2023-01-09       -3
   3 │     6  2023-01-08  missing 
   4 │     3  2023-01-04        5
   5 │     8  2023-01-04       -2
   6 │     6  2023-01-03       -4
   7 │     2  2023-01-02  missing 
1 Like

using DataFrames, Dates, IterTools

# append a sentinel row dated two days earlier so the last real row also gets paired
push!(df, merge(copy(last(df)), (dates = last(df).dates - Day(2),)))

pdf = collect(partition(zip(df.data, df.dates), 2, 1))

deleteat!(df, nrow(df))  # drop the sentinel again

df.diff = [r < Day(2) ? l : missing for (l, r) in [p .- f for (p, f) in pdf]]
1 Like

I’ll use the function defined by @Dan as an example to ask a question: what are the differences in terms of performance between the following forms, and if there are any, why (or in what cases) would one of them be preferable?


function md1(d, D)
    #d,D=df.data, df.dates
    diff=Vector{Union{Int, Missing}}(undef, length(d))
    for i in eachindex(d)
        diff[i]=i==length(d) || D[i]-D[i+1]>Day(1) ? 
        missing : d[i+1]-d[i]
    end
    diff
end


function md2(d, D)
    diff=Vector{Union{Int, Missing}}(undef, length(d))
    for i in 1:length(d)-1
        diff[i]=D[i]-D[i+1]>Day(1) ? 
        missing : d[i+1]-d[i]
    end
    diff[end]=missing
    diff
end

transform!(df, 
  [:data, :dates] => ((d,t)->[i==length(d) || t[i]-t[i+1]>Day(1) ? 
  missing : d[i+1]-d[i] for i in 1:length(d)]) => :diff) 

transform!(df, 
  [:data, :dates] => ((d,t)->[[t[i]-t[i+1]>Day(1) ? 
  missing : d[i+1]-d[i] for i in 1:length(d)-1];missing]) => :diff) 


transform!(df, [:data, :dates] => md1 => :diff) 

df.diff=md1(df.data, df.dates)

df.diff=md2(df.data, df.dates)

2 Likes

Thanks so much for the incredibly informative responses! It was really helpful to see the different ways transform could be used. @rocco_sprmnt21, I didn’t see that big a difference in time between md1 and md2 or between the different transform methods, but in accordance with @bkamins’s initial reply, looping through with a function was significantly faster than using transform in this case.

dr = (Date(2023,3,1):Day(1):Date(2023,3,11)); 
df = DataFrame(data=rand(collect(1:10),10), dates=rand(collect(dr),10)); 
sort!(df, :dates, rev= true); 

julia> @benchmark $df.diff = md1(df.data, df.dates)
BenchmarkTools.Trial: 10000 samples with 829 evaluations.
 Range (min … max):  150.584 ns …   6.606 μs  ┊ GC (min … max): 0.00% … 97.49%
 Time  (median):     152.946 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   167.090 ns ± 262.051 ns  ┊ GC (mean ± σ):  6.65% ±  4.13%

    ▄▇█▆▄▂▁   ▁▃▄▂      ▃▅▅▅▄▃▃▃▂▂▂▂▃▂▂▂▁▁▁                     ▂
  ▄▇█████████▇█████▇▆▅▅▇██████████████████████████▇▇▇▆▆▆▆▆▅▅▃▄▄ █
  151 ns        Histogram: log(frequency) by time        171 ns <

 Memory estimate: 192 bytes, allocs estimate: 2.

julia> @benchmark $transform(df, 
         [:data, :dates] => ((d,t)->[i==length(d) || t[i]-t[i+1]>Day(1) ? 
         missing : d[i]-d[i+1] for i in 1:length(d)]) => :diff)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  13.583 μs …  7.552 ms  ┊ GC (min … max): 0.00% … 99.26%
 Time  (median):     14.166 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   15.198 μs ± 75.392 μs  ┊ GC (mean ± σ):  4.93% ±  0.99%

    ▂▆█▇▅▂                                                     
  ▂▄██████▇▇▅▄▃▃▃▃▃▃▃▃▃▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  13.6 μs         Histogram: frequency by time        18.5 μs <

 Memory estimate: 9.27 KiB, allocs estimate: 174.

I understood that your data was several million rows. Perhaps some differences between md1 and md2 may show up for large data frames; for example, md2 avoids doing the i==length(d) check on every iteration. My doubt is whether this makes any difference at large sizes.
I’m curious what the real structure of your data is. How do you have millions of records with different dates? Evidently in many cases you have many records with the same date. The consecutive data you want to compare are measurements of what?

I take this opportunity to propose another “toy solution”

grp=IterTools.groupby(isequal∘last ,zip(df.data,df.dates))

ggrp=IterTools.groupby(let c=last(first(first(grp))); x->c= (c-last(first(x))) > Day(1) ? last(first(x)) : c end, grp)

collect(Iterators.flatten([diff(first.(g));missing] for g in Iterators.flatten.(ggrp)))

1 Like

Another way to get rid of the i == length(d) check, and the corresponding branching (the short-circuit || and the ? : operator), would be:

abs(t[i]-t[i*(i!=length(d))+1]) > Day(1)

This assumes abs(t[length(d)]-t[1]) > Day(1), which should be the case when at least 2 distinct days are recorded. The resulting expression:

transform(df, 
  [:data, :dates] => ((d,t)->[abs(t[i]-t[i*(i!=length(d))+1]) > Day(1) ? 
  missing : d[i]-d[i+1] for i in 1:length(d)]) => :diff)

Maybe a bit more branch elimination would help:

transform(df, 
  [:data, :dates] => ((d,t)->[ifelse(abs(t[i]-t[i*(i!=length(d))+1]) > Day(1), 
  missing, d[i]-d[i*(i!=length(d))+1]) for i in 1:length(d)]) => :diff)
1 Like

@rocco_sprmnt21 Sorry, I didn’t think about the size of the DataFrame and the overhead involved. Yes, the actual data set contains maybe 10^2–10^3 data points daily. Testing with the following larger DataFrame, md1 and md2 are still very close; transform is within the same order of magnitude as md1 and md2, but allocates more memory.

dr = Date(2020,1,1):Day(1):Date(2030,12,31)
df3 = DataFrame(data = rand(collect(1:100), 10^8), dates = rand(collect(dr), 10^8));
sort!(df3, :dates; rev = true);

julia> @benchmark $df3.diff = md2(df3.data,df3.dates)
BenchmarkTools.Trial: 25 samples with 1 evaluation.
 Range (min … max):  185.305 ms … 272.337 ms  ┊ GC (min … max): 0.00% … 20.56%
 Time  (median):     194.374 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   206.284 ms ±  23.500 ms  ┊ GC (mean ± σ):  5.64% ±  8.82%

     ▄█▄▁     ▁                           ▁                      
  ▆▁▆████▆▁▁▁▁█▆▁▆▁▁▁▁▁▁▆▁▁▁▁▁▁▁▁▁▁▁▁▆▁▁▁▁█▁▁▁▁▆▁▁▁▁▁▁▁▁▁▁▁▁▁▁▆ ▁
  185 ms           Histogram: frequency by time          272 ms <

 Memory estimate: 858.31 MiB, allocs estimate: 4.

julia> @benchmark $df3.diff = md1(df3.data,df3.dates)
BenchmarkTools.Trial: 24 samples with 1 evaluation.
 Range (min … max):  193.103 ms … 245.439 ms  ┊ GC (min … max): 0.00% … 22.36%
 Time  (median):     202.986 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   210.611 ms ±  17.124 ms  ┊ GC (mean ± σ):  5.66% ±  9.03%

      ▃   █  ▃  ▃                                           ▃    
  ▇▁▇▇█▁▇▇█▁▇█▁▁█▁▁▁▁▁▇▇▇▁▁▇▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▇▁▁▁▁▁▁▁▇▁▁█▁▇ ▁
  193 ms           Histogram: frequency by time          245 ms <

 Memory estimate: 858.31 MiB, allocs estimate: 4.

(I couldn’t get the toy solution to return the desired result on test data, and I didn’t understand it well enough to adjust it. But as-is it ran meaningfully slower than the above solutions you proposed, benchmarking at 2.649 seconds with the same DataFrame.)

@Dan, is the reason d[i]-d[i*(i!=length(d))+1] is used in the transform, as opposed to d[i]-d[i+1], that the ifelse() function evaluates all of its arguments regardless of the condition? The docs mention this can improve performance but don’t go into detail about when that holds true. Here ifelse() actually appears slightly slower than using the ? operator.
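A toy example of what I mean (made up, just to illustrate the eager evaluation):

 x = [1, 2, 3]; i = 3
 i == length(x) ? missing : x[i+1]         # fine: ? : short-circuits, x[i+1] is never evaluated
 ifelse(i == length(x), missing, x[i+1])   # BoundsError: ifelse evaluates x[i+1] regardless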

julia> @benchmark $transform(df3, 
                [:data, :dates] => ((d,t)->[ifelse(abs(t[i]-t[i*(i!=length(d))+1]) > Day(1), 
                missing, d[i]-d[i*(i!=length(d))+1]) for i in 1:length(d)]) => :diff)
BenchmarkTools.Trial: 7 samples with 1 evaluation.
 Range (min … max):  563.228 ms …    1.204 s  ┊ GC (min … max):  0.00% … 17.49%
 Time  (median):     707.159 ms               ┊ GC (median):     5.23%
 Time  (mean ± σ):   738.541 ms ± 221.995 ms  ┊ GC (mean ± σ):  10.88% ± 10.20%

  █  ▁         ▁  ▁   ▁                                       ▁  
  █▁▁█▁▁▁▁▁▁▁▁▁█▁▁█▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  563 ms           Histogram: frequency by time           1.2 s <

 Memory estimate: 3.91 GiB, allocs estimate: 191.


julia> @benchmark $transform(df3, 
                [:data, :dates] => ((d,t)->[i==length(d) || t[i]-t[i+1]>Day(1) ? 
         missing : d[i]-d[i+1] for i in 1:length(d)])=> :diff)
BenchmarkTools.Trial: 8 samples with 1 evaluation.
 Range (min … max):  519.620 ms …    1.257 s  ┊ GC (min … max):  0.00% … 17.73%
 Time  (median):     568.886 ms               ┊ GC (median):     8.79%
 Time  (mean ± σ):   664.064 ms ± 248.113 ms  ┊ GC (mean ± σ):  11.89% ±  9.84%

  █                                                              
  █▁▁▇▁▇▁▁▁▁▁▇▁▁▇▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▇ ▁
  520 ms           Histogram: frequency by time          1.26 s <

 Memory estimate: 3.91 GiB, allocs estimate: 191.

Yes, the toy solution is no good; I had been too hasty.
This should work better.
It’s certainly not useful for the problem, though; it’s just an exercise in finding a “functional” solution.


grp=IterTools.groupby(isequal∘last ,zip(df.data,df.dates))

ggrp=IterTools.groupby(let c=first(enumerate(grp)); x->c= last(last(last(c)))-last(last(last(x))) == Day(last(first(x))-first(c)) ? c : x end, enumerate(grp))

fl=Iterators.flatten
cdiff=collect(fl([diff(first.(fl(last.(g))));missing] for g in ggrp))

1 Like

Thanks for the clarification!