I have written the piece of code below. It takes a vector of DateTime, a vector of Float64, and an integer g as input, and calculates the mean of the Float64 values grouped by DateTime at a given time granularity g.
In this example, the data starts on Jan. 1st, 2024 at midnight and runs for seven days in increments of 11 seconds. I choose g = 15 and set out to find how many rows the grouped data will have, which I do with a while-loop. Once I have the number of rows, I create an array to store the output and compute the mean per group, using a for-loop. Specifically, I begin with the first available DateTime and increment it by 15 seconds. Then I group the data (or should I say filter? subset?) and compute the mean, assigning the results to the array I created. You’ll notice I also use an if-statement to deal with groups/filtrations/subsets that are empty.
This works as intended, but it takes a lot of time. Can you think of a way to make this code more efficient so it runs faster? All help is appreciated.
# Packages used below
using Dates, DataFrames, Statistics

# Data as is
timestamp = collect(DateTime(2024,01,01,00,00,00):Second(11):DateTime(2024,01,07,00,00,10));
data = rand(length(timestamp));
# Define granularity by which to aggregate data
g = 15 # 15s
# Define number of groups to be aggregated.
# With row_count, set dimensions of agg_data Array
x = DataFrame(date=timestamp, value=data)
test = true
row_count = 0
a = first(x.date)
b = last(x.date)
while test
    if a + Second(g) <= b
        row_count = row_count + 1
        a = a + Second(g)
    else
        test = false
    end
end
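# A loop-free alternative I considered (a sketch, not benchmarked): subtracting
# two DateTimes yields a Millisecond period, so the number of whole g-second
# steps is one integer division. row_count_direct is just a name I made up.
row_count_direct = div(Dates.value(last(x.date) - first(x.date)), 1000 * g)
@assert row_count_direct == row_count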
# Initialize arrays and parameters
agg_data = Array{Any}(undef, row_count, 2)
start = first(x.date)
for i in 1:row_count
    # Define the time interval and filter the data
    agg_data[i, 1] = start
    x_subset = x[start .<= x.date .< start + Second(g), :]
    # If the interval is empty (i.e., nrow = 0), assume the mean for the
    # time interval is zero.
    if nrow(x_subset) != 0
        agg_data[i, 2] = mean(x_subset.value)
    else
        agg_data[i, 2] = 0
    end
    # Update start
    start = start + Second(g)
end
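One direction I have been wondering about but have not benchmarked (bucket, agg_fast, and mean_value are names I made up): instead of filtering the DataFrame once per interval, round every timestamp down to its g-second bucket with floor and let groupby/combine aggregate everything in one pass. Note that this sketch skips empty intervals instead of writing zeros for them, and it keeps the trailing partial interval, so its output is not row-for-row identical to agg_data.

# Sketch: one-pass aggregation via floor + groupby/combine
x.bucket = floor.(x.date, Second(g))
agg_fast = combine(groupby(x, :bucket), :value => mean => :mean_value)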