Using DataFrames to split overlapping datetime intervals into non overlapping intervals


test = DataFrame(Time1 = [DateTime(2022, 1, 1, 10, 0, 0), DateTime(2022, 1, 1, 10, 5, 0), DateTime(2022, 1, 1, 10, 14, 0)], Time2 = [DateTime(2022, 1, 1, 10, 15, 0),  DateTime(2022, 1, 1, 10, 10, 0), DateTime(2022, 1, 1, 10, 20, 0)], 
Value = [10, 15, 20])

image
Given the test dataframe above, I want to convert it to every single time range that can be constructed with the Value column summed up for all the times that lie in that time range as shown in the result below:

result = DataFrame(Time1 = [DateTime(2022, 1, 1, 10, 0, 0), DateTime(2022, 1, 1, 10, 5, 0), DateTime(2022, 1, 1, 10, 10, 0), DateTime(2022, 1, 1, 10, 14, 0), DateTime(2022, 1, 1, 10, 15, 0)], 
Time2 = [DateTime(2022, 1, 1, 10, 5, 0), DateTime(2022, 1, 1, 10, 10, 0), DateTime(2022, 1, 1, 10, 14, 0), DateTime(2022, 1, 1, 10, 15, 0), DateTime(2022, 1, 1, 10, 20, 0)], ValueSum = [10, 25, 10, 30,20 ])

image

Any idea how to achieve this?

I faced a similar problem on a dataset with similar format as yours. You can find a session resolution algorithm that I proposed in this demo paper: A Large-scale Disease Outbreak Analytics System based on Wi-Fi Session Logs | IEEE Conference Publication | IEEE Xplore

And here is the code to resolve the sessions in SQL: https://github.com/NUS-IDS/cofi/blob/master/bin/02_stationary_view_v2.py

I reckon the algorithm could be adapted to Julia.

Here’s at least a start on generating the result dataframe:

julia> all_times = sort(unique([test.Time1; test.Time2]));

julia> time_pairs = [(all_times[i], all_times[i+1]) for i ∈ 1:length(all_times)-1];

julia> res = DataFrame(start = first.(time_pairs), ending = last.(time_pairs))
5Γ—2 DataFrame
 Row β”‚ start                ending
     β”‚ DateTime             DateTime
─────┼──────────────────────────────────────────
   1 β”‚ 2022-01-01T10:00:00  2022-01-01T10:05:00
   2 β”‚ 2022-01-01T10:05:00  2022-01-01T10:10:00
   3 β”‚ 2022-01-01T10:10:00  2022-01-01T10:14:00
   4 β”‚ 2022-01-01T10:14:00  2022-01-01T10:15:00
   5 β”‚ 2022-01-01T10:15:00  2022-01-01T10:20:00

You can then go through the original dataframe and check overlap row by row, although whether that is feasible or whether you need a smarter algorithm will pobably depend on how large your actual data is…

1 Like
julia> res.ValueSum=cumsum(last.(sort(mapreduce(((s,e,v),)->[(s,v),(e,-v)] ,vcat, zip(test.Time1, test.Time2, test.Value)))))[1:end-1] 
5-element Vector{Int64}:
 10
 25
 10
 30
 20

julia> res
5Γ—3 DataFrame
 Row β”‚ start                ending               ValueSum 
     β”‚ DateTime             DateTime             Int64
─────┼────────────────────────────────────────────────────
   1 β”‚ 2022-01-01T10:00:00  2022-01-01T10:05:00        10
   2 β”‚ 2022-01-01T10:05:00  2022-01-01T10:10:00        25
   3 β”‚ 2022-01-01T10:10:00  2022-01-01T10:14:00        10
   4 β”‚ 2022-01-01T10:14:00  2022-01-01T10:15:00        30
   5 β”‚ 2022-01-01T10:15:00  2022-01-01T10:20:00        20
1 Like

I have some doubts about the presence of unique, in case there are more intervals with the same start.
I have a mental model of considering the timestamps as stations on a railway line. A little train leaves from the first station and at each station (startTime/stoptime) people get on and/or off, each with her weight. The ValueSum is the load that the train carries for each leg.

julia> test = DataFrame(Time1 = [DateTime(2022, 1, 1, 10, 0, 0), DateTime(2022, 1, 1, 10, 5, 0), DateTime(2022, 1, 1, 10, 5, 0),DateTime(2022, 1, 1, 10, 14, 0)], Time2 = [DateTime(2022, 1, 1, 10, 15, 0),  DateTime(2022, 1, 1, 10, 10, 0),DateTime(2022, 1, 1, 10, 14, 0), DateTime(2022, 1, 1, 10, 20, 0)],
       Value = [10, 15,5, 20])
4Γ—3 DataFrame
 Row β”‚ Time1                Time2                Value 
     β”‚ DateTime             DateTime             Int64
─────┼─────────────────────────────────────────────────
   1 β”‚ 2022-01-01T10:00:00  2022-01-01T10:15:00     10
   2 β”‚ 2022-01-01T10:05:00  2022-01-01T10:10:00     15
   3 β”‚ 2022-01-01T10:05:00  2022-01-01T10:14:00      5
   4 β”‚ 2022-01-01T10:14:00  2022-01-01T10:20:00     20

julia> dfr=DataFrame(t1=sort([test.Time1; test.Time2])[1:end-1],t2=sort([test.Time1; test.Time2])[2:end])
7Γ—2 DataFrame
 Row β”‚ t1                   t2
     β”‚ DateTime             DateTime
─────┼──────────────────────────────────────────
   1 β”‚ 2022-01-01T10:00:00  2022-01-01T10:05:00
   2 β”‚ 2022-01-01T10:05:00  2022-01-01T10:05:00
   3 β”‚ 2022-01-01T10:05:00  2022-01-01T10:10:00
   4 β”‚ 2022-01-01T10:10:00  2022-01-01T10:14:00
   5 β”‚ 2022-01-01T10:14:00  2022-01-01T10:14:00
   6 β”‚ 2022-01-01T10:14:00  2022-01-01T10:15:00
   7 β”‚ 2022-01-01T10:15:00  2022-01-01T10:20:00

julia> dfr.ValueSum=cumsum(last.(sort(mapreduce(((s,e,v),)->[(s,v),(e,-v)] ,vcat, zip(test.Time1, test.Time2, test.Value)))))[1:end-1] 
7-element Vector{Int64}:
 10
 15
 30
 15
 10
 30
 20

julia> dfr
7Γ—3 DataFrame
 Row β”‚ t1                   t2                   ValueSum 
     β”‚ DateTime             DateTime             Int64
─────┼────────────────────────────────────────────────────
   1 β”‚ 2022-01-01T10:00:00  2022-01-01T10:05:00        10
   2 β”‚ 2022-01-01T10:05:00  2022-01-01T10:05:00        15
   3 β”‚ 2022-01-01T10:05:00  2022-01-01T10:10:00        30
   4 β”‚ 2022-01-01T10:10:00  2022-01-01T10:14:00        15
   5 β”‚ 2022-01-01T10:14:00  2022-01-01T10:14:00        10
   6 β”‚ 2022-01-01T10:14:00  2022-01-01T10:15:00        30
   7 β”‚ 2022-01-01T10:15:00  2022-01-01T10:20:00        20

julia> filter(r->r.t1!=r.t2,dfr)
5Γ—3 DataFrame
 Row β”‚ t1                   t2                   ValueSum 
     β”‚ DateTime             DateTime             Int64
─────┼────────────────────────────────────────────────────
   1 β”‚ 2022-01-01T10:00:00  2022-01-01T10:05:00        10
   2 β”‚ 2022-01-01T10:05:00  2022-01-01T10:10:00        30
   3 β”‚ 2022-01-01T10:10:00  2022-01-01T10:14:00        15
   4 β”‚ 2022-01-01T10:14:00  2022-01-01T10:15:00        30
   5 β”‚ 2022-01-01T10:15:00  2022-01-01T10:20:00        20

You may also try FlexiJoins, and innerjoin the original table to this one by @nilshg with final intervals. Use something like innerjoin((res, test), by_predicate(x->x.start..x.end, βŠ†, x->x.Time1..x.Time2), groupby=1) and sum afterwards. Not tested rn, maybe the groupby argument doesn’t work for dataframes – then just group and aggregate afterwards.

Wow you guys are amazing. Thanks @rocco_sprmnt21 @nilshg @aplavin , will play around with the solutions and see which one works well!

@aplavin I see that you have a new package IntervalUnion. Do you think that would help create non overlapping intervals?
I haven’t had a chance to work on a solution to the question posted here but I saw your package on twitter so wondered if the package would help

IntervalUnions.jl is for representing unions of intervals with corresponding set operations, like in, union, etc. It should be possible to add its efficient support to FlexiJoins, even though I don’t have actual plans to do that in the near future.

But as I understand IntervalUnions is not really related to this particular problem discussed here.