What would be the best way to perform an operation on data that occurs either on the same day or on consecutive dates, e.g. taking the difference between :data values on the same or consecutive dates?
Ideally I would have liked to avoid the missing values by grouping the DataFrame by consecutive dates and then running a function on each group via transform. Since groupby(df, :dates) would only place data of the same date in a group, I was wondering if there is a way to use groupby to create a GroupedDataFrame of consecutive dates?
I assume your data is sorted by :dates. In this case the efficient way to do it would be to use a loop (or a reduction function, which is equivalent but might be harder to implement).
Thanks so much! So I couldn’t use groupby directly. Given millions of rows, would an efficient approach look something like this?
function cnsctvdiff(days, data)
    newcol = [missing; diff(data::Vector{Int64})]
    for d in Iterators.rest(eachindex(days::Vector{Date}), 1)
        (days[d] - days[d-1] > Day(1)) && (newcol[d] = missing)
    end
    return newcol
end

df.cdff = cnsctvdiff(df.dates, df.data)
Are type declarations being used correctly here to speed up performance? If operations are to be performed later on df.cdff, would a better approach be to break df down into separate DataFrames so that each column is type stable?
E.g. creating an additional column and then using groupby:
function dateID(days)
    ID = ones(Int64, length(days))
    for d in Iterators.rest(eachindex(days::Vector{Date}), 1)
        (days[d] - days[d-1] > Day(1)) && (ID[d:end] .+= 1)
    end
    return ID
end

df.ID = dateID(df.dates)
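A quick standalone check of what the ID column comes out as on a small sorted date vector (dateID is repeated here so the snippet runs on its own; Iterators.drop(_, 1) skips the first index, as Iterators.rest(_, 1) does for a OneTo range):

```julia
using Dates

# dateID as above, repeated so this snippet is self-contained
function dateID(days)
    ID = ones(Int, length(days))
    for d in Iterators.drop(eachindex(days), 1)
        # start a new group whenever a gap of more than one day appears
        days[d] - days[d-1] > Day(1) && (ID[d:end] .+= 1)
    end
    return ID
end

days = [Date(2023, 3, 1), Date(2023, 3, 1), Date(2023, 3, 2),
        Date(2023, 3, 5), Date(2023, 3, 6)]
dateID(days)  # -> [1, 1, 1, 2, 2]: one ID per run of consecutive dates
```

With the column added, groupby(df, :ID) then yields one group per run of consecutive dates.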
You do not need to add type declarations; Julia will handle this automatically.
A loop within a function that gets columns passed as its positional arguments would be efficient (and you roughly get the fastest possible result).
A minor issue is that [missing; diff(data::Vector{Int64})] allocates twice (so it could be optimized), but this is a minor point - it should be fast enough.
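For reference, a single-pass variant that allocates only once might look something like this (a sketch; the name cnsctvdiff1 is made up):

```julia
using Dates

# Sketch of a one-allocation variant: preallocate the result and fill it
# in a single pass instead of building [missing; diff(data)] first.
function cnsctvdiff1(days::AbstractVector{Date}, data::AbstractVector{<:Integer})
    newcol = Vector{Union{Missing, eltype(data)}}(undef, length(data))
    isempty(newcol) && return newcol
    newcol[1] = missing
    for d in Iterators.drop(eachindex(days), 1)
        # keep the difference only when the previous row is the same day
        # or the day before
        newcol[d] = days[d] - days[d-1] > Day(1) ? missing : data[d] - data[d-1]
    end
    return newcol
end
```

Behaviour should match the original on sorted ascending dates; the only change is avoiding the intermediate vector from diff plus the vcat.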
Thanks! Sorry, is the following a correct understanding of when type declarations improve performance?
If a function takes a column of a DataFrame as an argument, then type declarations do not improve performance.
If a function takes an entire DataFrame as an argument, then type declarations on each column will improve performance.
The docs on argument type declarations note that they generally do not enhance performance. But as you pointed out in a previous post about efficiency in iterating over columns of a DataFrame that:
and also that
My background is so minimal that I wasn’t sure whether type declarations here improved performance because the loop was entirely outside of a function, or because the entire DataFrame was being passed as an argument into a function.
Assuming the latter, why does a function barrier work for a column of a DataFrame but not the entire DataFrame? Wouldn’t the type stability issue affect both the DataFrame object and the DataFrame col?
why does a function barrier work for a column of a DataFrame but not the entire DataFrame?
Because when you pass a concrete column, the method specific to that column type is compiled by Julia. If you pass a whole DataFrame then no such specialization happens, and Julia has to assume that you could have stored vectors of any type inside the data frame.
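This can be illustrated without DataFrames.jl at all (the container and function names below are made up for the example):

```julia
# A toy container whose columns are stored with an abstract element type,
# mimicking why code that reaches into a DataFrame cannot be specialized.
struct FakeDF
    cols::Dict{Symbol, AbstractVector}
end

# Type-unstable: the compiler only knows cols[name] isa AbstractVector,
# so element operations go through dynamic dispatch.
sum_inside(d::FakeDF, name::Symbol) = sum(d.cols[name])

# Function barrier: the column arrives as a positional argument, so Julia
# compiles a method specialized for the concrete column type.
sum_barrier(col::AbstractVector) = sum(col)

d = FakeDF(Dict(:x => collect(1:1_000)))
sum_inside(d, :x)        # correct, but not specialized
sum_barrier(d.cols[:x])  # specialized for Vector{Int}
```

Both calls return the same value; the difference is only in how much the compiler can specialize the inner work.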
using DataFrames, Dates, IterTools
push!(df, merge(last(df), (dates=last(df).dates-Day(2),)))
pdf=partition(zip(df.data,df.dates),2,1)
df.diff=[r<Day(2) ? l : missing for (l,r) in [p.-f for (p,f) in pdf]]
I use the function defined by @Dan as an example to ask a question
What are the differences in terms of performance between the following shapes?
and, if there are, why (or in what cases) would one of these be preferable?
function md1(d, D)
    # d, D = df.data, df.dates
    diff = Vector{Union{Int, Missing}}(undef, length(d))
    for i in eachindex(d)
        diff[i] = i == length(d) || D[i] - D[i+1] > Day(1) ?
                  missing : d[i+1] - d[i]
    end
    diff
end
function md2(d, D)
    diff = Vector{Union{Int, Missing}}(undef, length(d))
    for i in 1:length(d)-1
        diff[i] = D[i] - D[i+1] > Day(1) ?
                  missing : d[i+1] - d[i]
    end
    diff[end] = missing
    diff
end
transform!(df,
[:data, :dates] => ((d,t)->[i==length(d) || t[i]-t[i+1]>Day(1) ?
missing : d[i+1]-d[i] for i in 1:length(d)]) => :diff)
transform!(df,
[:data, :dates] => ((d,t)->[[t[i]-t[i+1]>Day(1) ?
missing : d[i+1]-d[i] for i in 1:length(d)-1];missing]) => :diff)
transform!(df, [:data, :dates] => md1 => :diff)
df.diff=md1(df.data, df.dates)
df.diff=md2(df.data, df.dates)
Thanks so much for the incredibly informative responses! Really helpful to see the different ways transform could be used. @rocco_sprmnt21, I didn’t see that big a difference in time between md1 and md2, or between the different transform methods, but in accordance with @bkamins’s initial reply, looping through with a function was significantly faster than using transform in this case.
dr = (Date(2023,3,1):Day(1):Date(2023,3,11));
df = DataFrame(data=rand(collect(1:10),10), dates=rand(collect(dr),10));
sort!(df, :dates, rev= true);
julia> @benchmark $df.diff = md1(df.data, df.dates)
BenchmarkTools.Trial: 10000 samples with 829 evaluations.
Range (min … max): 150.584 ns … 6.606 μs ┊ GC (min … max): 0.00% … 97.49%
Time (median): 152.946 ns ┊ GC (median): 0.00%
Time (mean ± σ): 167.090 ns ± 262.051 ns ┊ GC (mean ± σ): 6.65% ± 4.13%
▄▇█▆▄▂▁ ▁▃▄▂ ▃▅▅▅▄▃▃▃▂▂▂▂▃▂▂▂▁▁▁ ▂
▄▇█████████▇█████▇▆▅▅▇██████████████████████████▇▇▇▆▆▆▆▆▅▅▃▄▄ █
151 ns Histogram: log(frequency) by time 171 ns <
Memory estimate: 192 bytes, allocs estimate: 2.
julia> @benchmark $transform(df,
[:data, :dates] => ((d,t)->[i==length(d) || t[i]-t[i+1]>Day(1) ?
missing : d[i]-d[i+1] for i in 1:length(d)]) => :diff)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 13.583 μs … 7.552 ms ┊ GC (min … max): 0.00% … 99.26%
Time (median): 14.166 μs ┊ GC (median): 0.00%
Time (mean ± σ): 15.198 μs ± 75.392 μs ┊ GC (mean ± σ): 4.93% ± 0.99%
▂▆█▇▅▂
▂▄██████▇▇▅▄▃▃▃▃▃▃▃▃▃▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
13.6 μs Histogram: frequency by time 18.5 μs <
Memory estimate: 9.27 KiB, allocs estimate: 174.
I understood that your data was several million rows. Perhaps some differences between md1 and md2 may appear for large data frames. For example, md2 avoids doing the i == length(d) check on each iteration. My doubt is whether this makes any difference at large sizes.
I’m curious what the real structure of your data is. How do you have millions of records with different dates? Evidently you have many records with the same date in many cases. The consecutive data you want to compare are surveys of what?
I take this opportunity to propose another “toy solution”
grp=IterTools.groupby(isequal∘last ,zip(df.data,df.dates))
ggrp=IterTools.groupby(let c=last(first(first(grp))); x->c= (c-last(first(x))) > Day(1) ? last(first(x)) : c end, grp)
collect(Iterators.flatten([diff(first.(g));missing] for g in Iterators.flatten.(ggrp)))
Another way to get rid of i == length(d) check, and the corresponding branching short-circuit || and ? : operators would be:
abs(t[i]-t[i*(i!=length(d))+1]) > Day(1)
This assumes abs(t[length(d)]-t[1]) > Day(1), which should be the case when at least 2 days are recorded. The resulting expression:
transform(df,
[:data, :dates] => ((d,t)->[abs(t[i]-t[i*(i!=length(d))+1]) > Day(1) ?
missing : d[i]-d[i+1] for i in 1:length(d)]) => :diff)
Maybe a bit more branch elimination would help:
transform(df,
[:data, :dates] => ((d,t)->[ifelse(abs(t[i]-t[i*(i!=length(d))+1]) > Day(1),
missing, d[i]-d[i*(i!=length(d))+1]) for i in 1:length(d)]) => :diff)
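The index trick can be checked in isolation (wrapnext is a made-up helper mirroring the inline expression):

```julia
# i * (i != n) + 1 is i + 1 for every in-range index and wraps the last
# index back to 1, which is why the t[end] vs t[1] assumption is needed.
wrapnext(i, n) = i * (i != n) + 1

n = 5
[wrapnext(i, n) for i in 1:n]  # -> [2, 3, 4, 5, 1]
```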
@rocco_sprmnt21 Sorry, I didn’t think about the size of the DataFrame and the overhead involved. Yes, the actual data set contains maybe 10^2 to 10^3 data points daily. Testing with the following larger DataFrame, md1 and md2 are still very close. transform has the same order of magnitude as md1 and md2, but allocates more memory.
dr = Date(2020,1,1):Day(1):Date(2030,12,31)
df3 = DataFrame( data =rand(collect(1:100),10^8),dates = rand(collect(dr),10^8));
sort!(df3, :dates; rev = true);
julia> @benchmark $df3.diff = md2(df3.data,df3.dates)
BenchmarkTools.Trial: 25 samples with 1 evaluation.
Range (min … max): 185.305 ms … 272.337 ms ┊ GC (min … max): 0.00% … 20.56%
Time (median): 194.374 ms ┊ GC (median): 0.00%
Time (mean ± σ): 206.284 ms ± 23.500 ms ┊ GC (mean ± σ): 5.64% ± 8.82%
▄█▄▁ ▁ ▁
▆▁▆████▆▁▁▁▁█▆▁▆▁▁▁▁▁▁▆▁▁▁▁▁▁▁▁▁▁▁▁▆▁▁▁▁█▁▁▁▁▆▁▁▁▁▁▁▁▁▁▁▁▁▁▁▆ ▁
185 ms Histogram: frequency by time 272 ms <
Memory estimate: 858.31 MiB, allocs estimate: 4.
julia> @benchmark $df3.diff = md1(df3.data,df3.dates)
BenchmarkTools.Trial: 24 samples with 1 evaluation.
Range (min … max): 193.103 ms … 245.439 ms ┊ GC (min … max): 0.00% … 22.36%
Time (median): 202.986 ms ┊ GC (median): 0.00%
Time (mean ± σ): 210.611 ms ± 17.124 ms ┊ GC (mean ± σ): 5.66% ± 9.03%
▃ █ ▃ ▃ ▃
▇▁▇▇█▁▇▇█▁▇█▁▁█▁▁▁▁▁▇▇▇▁▁▇▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▇▁▁▁▁▁▁▁▇▁▁█▁▇ ▁
193 ms Histogram: frequency by time 245 ms <
Memory estimate: 858.31 MiB, allocs estimate: 4.
(I couldn’t get the toy solution to return the desired result on test data, and I didn’t understand it well enough to adjust it. As written, it ran meaningfully slower than the solutions proposed above, benchmarking at 2.649 seconds with the same DataFrame.)
@Dan, is d[i]-d[i*(i!=length(d))+1] used in the transform instead of d[i]-d[i+1] because the ifelse() operator evaluates all of its arguments before the condition is checked? The docs mention ifelse() can improve performance but don’t go into detail about when that holds. Here ifelse() actually appears slightly slower than the ? : operator.
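(As a quick sanity check that ifelse really does evaluate both value arguments before choosing - a small sketch with made-up names:)

```julia
# `cond ? a : b` only evaluates the taken branch; `ifelse(cond, a, b)`
# evaluates both arguments first, trading extra work for no branch.
calls = Ref(0)
bump() = (calls[] += 1; 99)

x = true ? 0 : bump()        # branch not taken: bump() never runs
n_after_ternary = calls[]    # -> 0

y = ifelse(true, 0, bump())  # bump() runs even though its value is unused
n_after_ifelse = calls[]     # -> 1
```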
julia> @benchmark $transform(df3,
[:data, :dates] => ((d,t)->[ifelse(abs(t[i]-t[i*(i!=length(d))+1]) > Day(1),
missing, d[i]-d[i*(i!=length(d))+1]) for i in 1:length(d)]) => :diff)
BenchmarkTools.Trial: 7 samples with 1 evaluation.
Range (min … max): 563.228 ms … 1.204 s ┊ GC (min … max): 0.00% … 17.49%
Time (median): 707.159 ms ┊ GC (median): 5.23%
Time (mean ± σ): 738.541 ms ± 221.995 ms ┊ GC (mean ± σ): 10.88% ± 10.20%
█ ▁ ▁ ▁ ▁ ▁
█▁▁█▁▁▁▁▁▁▁▁▁█▁▁█▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
563 ms Histogram: frequency by time 1.2 s <
Memory estimate: 3.91 GiB, allocs estimate: 191.
julia> @benchmark $transform(df3,
[:data, :dates] => ((d,t)->[i==length(d) || t[i]-t[i+1]>Day(1) ?
missing : d[i]-d[i+1] for i in 1:length(d)])=> :diff)
BenchmarkTools.Trial: 8 samples with 1 evaluation.
Range (min … max): 519.620 ms … 1.257 s ┊ GC (min … max): 0.00% … 17.73%
Time (median): 568.886 ms ┊ GC (median): 8.79%
Time (mean ± σ): 664.064 ms ± 248.113 ms ┊ GC (mean ± σ): 11.89% ± 9.84%
█
█▁▁▇▁▇▁▁▁▁▁▇▁▁▇▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▇ ▁
520 ms Histogram: frequency by time 1.26 s <
Memory estimate: 3.91 GiB, allocs estimate: 191.
Yes, the toy solution is no good. I had been too hasty.
This should work better.
It’s certainly not useful for the problem, though - it’s just an exercise in finding a “functional” solution:
grp=IterTools.groupby(isequal∘last ,zip(df.data,df.dates))
ggrp=IterTools.groupby(let c=first(enumerate(grp)); x->c= last(last(last(c)))-last(last(last(x))) == Day(last(first(x))-first(c)) ? c : x end, enumerate(grp))
fl=Iterators.flatten
cdiff=collect(fl([diff(first.(fl(last.(g))));missing] for g in ggrp))