How to carry out a sum over data indexed by time and grouped by year and month?

Hello, everyone

I am new to Julia and I’m trying to carry out the task of grouping a variable by year and month and then summing for each month and year along the duration of a time-index variable.

I know how to do this in R, and I provide a small reproducible example below. In a nutshell, the variable x is indexed by date. With that, I can get monthly sums of x for the years 2021 and 2022 by grouping my data by year and month. With the help of dplyr and lubridate, I can then easily sum all x within each month.

How would I do that in Julia? While I don’t expect someone to fully reproduce this example, I’d appreciate some directions as to which packages I should look into. From what I’ve gathered, the main options are Queryverse and DataFramesMeta.

# Libraries
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(lubridate)
#> 
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#> 
#>     date, intersect, setdiff, union

# Generate fake data
x = runif(2*365,0, 50)

# Generate a sequence of dates
my_seq = seq(as.Date("2021/1/1"), by = "day", length.out = length(x))

# Set a dataframe
my_data = tibble(my_seq, x)

# Use dplyr to group data points by year and month and sum them
# for each month.
monthly_x =
  my_data |>
  mutate(year = year(my_seq), 
         month = month(my_seq)) |>
  group_by(year, month) |>
  summarise(sum_of_x = sum(x))
#> `summarise()` has grouped output by 'year'. You can override using the
#> `.groups` argument.

# Check January 2021
sum(x[1:31]) == monthly_x$sum_of_x[1]
#> [1] TRUE
#
#Created on 2022-04-26 by the reprex package (v2.0.1)

It’s much the same:

using DataFrames, Dates

my_data[:, :year] = year.(my_data.my_seq)
my_data[:, :month] = month.(my_data.my_seq)
combine(groupby(my_data, [:year, :month]), :x => sum => :sum_of_x)
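
For anyone who wants to run this end to end, here is a minimal self-contained sketch. It assumes DataFrames.jl is installed and rebuilds the fake data from the R example in the question; the fixed seed and the `≈` comparison are additions for illustration only.

```julia
using DataFrames, Dates, Random

Random.seed!(1)  # fixed seed, only so the sketch is repeatable

# Two years of daily fake data, mirroring the R example
my_seq = collect(Date(2021, 1, 1):Day(1):Date(2022, 12, 31))
x = 50 .* rand(length(my_seq))
my_data = DataFrame(my_seq = my_seq, x = x)

# Add the grouping columns, then sum x within each (year, month) group
my_data.year = year.(my_data.my_seq)
my_data.month = month.(my_data.my_seq)
monthly_x = combine(groupby(my_data, [:year, :month]), :x => sum => :sum_of_x)

# Check January 2021 (≈ rather than == to allow for floating-point rounding)
monthly_x.sum_of_x[1] ≈ sum(x[1:31])
```

The result should have one row per (year, month) pair, in order of first appearance, so the first row corresponds to January 2021.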

There are various packages that allow for a more tidyverse-style chaining of commands, e.g. Chain.jl (which also has a useful comparison between the different chaining packages in its README, so check out the GitHub page).

Here is an example with DataFramesMeta.jl and Chain.jl:

using Chain, DataFramesMeta, Dates

@chain my_data begin
    @rtransform :year = year(:my_seq)
    @rtransform :month = month(:my_seq)
    groupby([:year, :month])
    @combine :sum_of_x = sum(:x)
end

Thanks, @nilshg

Thanks, @pdeffebach

Alternatively, with arrays/dictionaries instead of dataframes:

julia> using Dates, DataPipes, SplitApplyCombine, StructArrays

# create table:
julia> data = [
    (; my_seq, x=rand())
    for my_seq in range(Date(2021, 1, 1), step=Day(1), length=50)
] |> StructArray

# compute sums:
julia> @p data |>
           groupview((year=year(_.my_seq), month=month(_.my_seq))) |>
           map(sum(_.x))
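
If you would rather avoid extra packages altogether, the same grouping can be sketched with a plain `Dict` keyed by (year, month). This is only an illustration using Base and the Dates standard library; the names `dates`, `xs`, and `sums` are made up for the example.

```julia
using Dates

# 50 days of fake data, matching the length used above
dates = Date(2021, 1, 1):Day(1):Date(2021, 2, 19)
xs = rand(length(dates))

# Accumulate a running sum per (year, month) key
sums = Dict{Tuple{Int,Int},Float64}()
for (d, x) in zip(dates, xs)
    key = (year(d), month(d))
    sums[key] = get(sums, key, 0.0) + x
end

sums[(2021, 1)]  # sum of x over January 2021
```

Since the range covers 1 January to 19 February 2021, `sums` ends up with two keys, `(2021, 1)` and `(2021, 2)`.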