I have written the piece of code below. It takes a vector of DateTime values, a vector of Float64 values, and an integer g, and computes the mean of the Float64 values grouped by DateTime at a time granularity of g seconds.
In this example, the data starts on Jan. 1st, 2024 at midnight and runs through Jan. 7th in increments of 11 seconds. I choose g = 15 and first need to find out how many rows the grouped data will have, which I do with a while-loop. Once I have the number of rows, I create an array to store the output and compute the mean per group with a for-loop. Specifically, I begin with the first available DateTime and increment it by 15 seconds on each iteration. I then group the data (or should I say filter? Subset?) and compute the mean, assigning the result to the array I created. You’ll notice I also use an if-statement to deal with groups/filtrations/subsets that are empty.
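(If my arithmetic is right, the span between the first and last timestamp is 518,408 seconds, so I expect floor(518408 / 15) = 34,560 groups.)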
This works as intended, but it takes a lot of time. Can you think of a way to make this code more efficient so it runs faster? All help is appreciated.
# Load the packages the code relies on
using Dates, DataFrames, Statistics

# Data as is
timestamp = collect(DateTime(2024,01,01,00,00,00):Second(11):DateTime(2024,01,07,00,00,10));
data = rand(length(timestamp))
# Define granularity by which to aggregate data
g = 15 # 15s
# Put the raw data in a DataFrame
x = DataFrame(date=timestamp, value=data)
# Count the number of groups to be aggregated with a while-loop;
# row_count then sets the dimensions of the agg_data array
test = true
row_count = 0
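# a will walk forward through the time span in g-second steps; b is the end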
a = first(x.date)
b = last(x.date)
while test
    if a + Second(g) <= b
        row_count += 1
        a += Second(g)
    else
        test = false
    end
end
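# Sanity check: I believe the same count falls out of integer division of the
# covered time span (in milliseconds) by the granularity (in milliseconds)
@assert row_count == Dates.value(last(x.date) - first(x.date)) ÷ (1000 * g)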
# Initialize the output array: column 1 holds the interval start (DateTime),
# column 2 the interval mean (Float64), hence the Any element type
agg_data = Array{Any}(undef, row_count, 2)
start = first(x.date)
for i in 1:row_count
    # Define the time interval and filter the data down to it
    agg_data[i, 1] = start
    x_subset = x[start .<= x.date .< start + Second(g), :]
    # If the interval is empty, i.e. nrow == 0, assume its mean is zero
    if nrow(x_subset) != 0
        agg_data[i, 2] = mean(x_subset.value)
    else
        agg_data[i, 2] = 0
    end
    # Update start
    start += Second(g)
end
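For reference, one alternative I have been toying with (but have not benchmarked) is to floor each timestamp down to its 15-second bucket and let groupby/combine do the aggregation; the names bucket and mean_value below are just mine. Note that, unlike my loop, this produces rows only for buckets that actually contain data, so the empty-bucket-means-zero convention would need extra handling.

# Sketch of a grouped alternative (untested for speed on my end)
x.bucket = floor.(x.date, Second(g)) # round each timestamp down to its bucket
agg = combine(groupby(x, :bucket), :value => mean => :mean_value)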