Summarizing Float64 values after grouping by different time scales from DateTime object

I have a data frame with two columns: time holds DateTime elements and values holds Float64 elements. I want to group the rows by the year, month, day, and hour of time and calculate the average of values within each group.

I created a short example that helps drive the point home. I first create some data in step (2). I didn’t know how to create DateTime objects directly, so I created strings that I transform in step (3). Then, I extract the parameters for the groups (step (4)) and add them as new columns in the data frame (step (5)). My final (failed) attempt, in step (6), is to use @groupby.

# (1) Pkgs
using DataFrames, Dates, Query

# (2) Creating some data
time = ["2017-01-01 00:00:00", "2017-01-01 00:20:00", "2017-01-01 00:40:00", "2017-01-01 01:00:00", "2017-01-01 01:20:00", "2017-01-01 01:40:00"]
values = [100, 200, 300, 500, 600, 700]

df = DataFrame(time = time, values = values)

# (3) Transform column time to DateTime (the strings carry no time zone)
datetime = DateTime.(df.time, DateFormat("y-m-d HH:MM:SS"))
df.time = datetime

# (4) Extract year, month, day, hour from datetime
year_ = year.(df.time)
month_ = month.(df.time)
day_ = day.(df.time)
hour_ = hour.(df.time)

# (5) Insert these vectors as columns in df
insertcols!(df, 1, :year_ => year_)
insertcols!(df, 2, :month_ => month_)
insertcols!(df, 3, :day_ => day_)
insertcols!(df, 4, :hour_ => hour_)

# (6) My attempt to use groupby
x |> @groupby(df,[:year_, :month_,  :day_, :hour_])

I’d expect there to be a simpler, faster, and easier-to-read way to do this. Creating individual vectors only to add them to the data frame as columns seems like extra work. Yet, bear in mind, I’m still a Julia novice and I’m simply trying to achieve goals without focusing on early code optimization.

In R, I know how to group steps 4, 5, and 6 into one single command with the help of packages dplyr and lubridate.

  aux <-
    df %>%
    group_by(year(time_index),
             month(time_index),
             day(time_index),
             hour(time_index)) %>%
    summarize(avg_values = mean(values)) %>%
    ungroup()

where aux would have two rows whose avg_values are c(200, 600), in R notation.

Does anyone know how I’d accomplish this? Thanks a lot.

Judging by the R code, I assume you want to compute a summary of the values for every hour? This is a time series problem, so you can use TSFrames.jl:

julia> using TSFrames, Dates, Statistics

julia> time = [
    DateTime(2017, 01, 01, 0, 0),
    DateTime(2017, 01, 01, 0, 20),
    DateTime(2017, 01, 01, 0, 40),
    DateTime(2017, 01, 01, 01, 00),
    DateTime(2017, 01, 01, 01, 20),
    DateTime(2017, 01, 01, 01, 40)
];
julia> values = [100, 200, 300, 500, 600, 700];

julia> ts = TSFrame(values, time)
6×1 TSFrame with DateTime Index
 Index                x1
 DateTime             Int64
────────────────────────────
 2017-01-01T00:00:00    100
 2017-01-01T00:20:00    200
 2017-01-01T00:40:00    300
 2017-01-01T01:00:00    500
 2017-01-01T01:20:00    600
 2017-01-01T01:40:00    700

## compute hourly mean
julia> apply(ts, Hour(1), mean)
2×1 TSFrame with DateTime Index
 Index                x1_mean
 DateTime             Float64
──────────────────────────────
 2017-01-01T00:00:00    200.0
 2017-01-01T01:00:00    600.0

Here, Hour(1) specifies the period to summarise on and mean is the summary function.
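
For comparison, the same hourly mean can probably be reached with DataFrames.jl alone, which is closer to the dplyr pipeline in the question. The following is only a minimal sketch, not part of the answer above: it assumes the sample data from the question, and the names stamps, hour_start, and avg_values are made up for illustration. Rounding each timestamp down to the start of its hour gives a single grouping key that already encodes year, month, day, and hour.

```julia
using DataFrames, Dates, Statistics

# Sample data from the question, parsed straight into DateTime values.
stamps = DateTime.(
    ["2017-01-01 00:00:00", "2017-01-01 00:20:00", "2017-01-01 00:40:00",
     "2017-01-01 01:00:00", "2017-01-01 01:20:00", "2017-01-01 01:40:00"],
    dateformat"yyyy-mm-dd HH:MM:SS",
)
df = DataFrame(time = stamps, values = [100, 200, 300, 500, 600, 700])

# Round each timestamp down to the start of its hour.
# hour_start is a hypothetical column name chosen for this sketch.
transform!(df, :time => ByRow(t -> floor(t, Hour(1))) => :hour_start)

# Group by the truncated timestamp and average the values in each group.
hourly = combine(groupby(df, :hour_start), :values => mean => :avg_values)
```

If separate year, month, day, and hour columns are needed, the same groupby and combine pattern should work with the columns created in step (5) of the question as the grouping keys.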

REF: TSFrames API docs

Thanks a lot for your feedback, @chiraganand
