I have a data frame with two columns. Column time
has datetime
elements while column values
has Float64
elements. I want to group the column time
by year, month, day, and hour to calculate the average of the values.
I created a short example that helps drive the point home. I first create some data at step (2). I didn’t know how to create DateTime
objects, so I created strings that I later transform. Then, I extract the parameters for the groups (step (4)) and add them as new columns in the data frame (step (5)). My final (failed) attempt is to use @groupby
.
# (1) Pkgs
using DataFrames, Dates, Timezone, Query
# (2) Creating some data
time = ["2017-01-01 00:00:00", "2017-01-01 00:20:00", "2017-01-01 00:40:00", "2017-01-01 01:00:00", "2017-01-01 01:20:00", "2017-01-01 01:40:00"]
values = [100, 200, 300, 500, 600, 700]
df = DataFrame(time = time, values = values)
# (3) Transform column time to datetime.
datetime = DateTime.(df.time, DateFormat("y-m-d HH:MM:SSzzzz"))
df.time = datetime
# (4) Extract year, month, day, hour from datetime
year_ = year.(df.time_index)
month_ = month.(df.time_index)
day_ = day.(df.time_index)
hour_ = hour.(df.time_index)
# (5) Insert this vectors as columns in df
insertcols!(df, 1, :year_ => year_)
insertcols!(df 2, :month_ => month_)
insertcols!(df, 3, :day_ => day)
insertcols!(df, 4, :hour_ => hour_)
# (6) My attempt to use groupby
x |> @groupby(df,[:year_, :month_, :day_, :hour_])
I’d expect to have a simpler, faster, and easier-to-read way to do this. Creating individual vectors to later join them to the data frame as columns seems a bit of extra work. Yet, bare in mind, I’m still a Julia novice and I’m simply trying to achieve goals without focusing on early code optimization.
In R
, I know how to group steps 4, 5, and 6 into one single command with the help of packages dplyr
and lubridate
.
aux <-
df %>%
group_by(year(time_index),
month(time_index),
day(time_index),
hour(time_index)) %>%
summarize(avg_values = mean(values)) %>%
ungroup()
where aux
would be a vector of length 2 with elements c(200, 600)
, in R
notation.
Does anyone know how I’d accomplish this? Thanks a lot.