Hi all,
Just wanted to post a little snippet I learned from @kevbonham about how to create user-defined grouped data frames. What I mean by this, is I was working with time series data recently and wanted to generate groups for when dates were continuous. So, for example, if I was given data with the following dates:
using Dates
dates = [
Date("2020-03-18"),
Date("2020-03-19"),
Date("2020-04-18"),
Date("2020-04-19"),
Date("2020-04-25"),
Date("2020-04-26"),
]
I would want groupings that contain date based groups like this:
2×1 SubDataFrame
Row │ dates
│ Date
─────┼────────────
1 │ 2020-03-18
2 │ 2020-03-19
2×1 SubDataFrame
Row │ dates
│ Date
─────┼────────────
1 │ 2020-04-18
2 │ 2020-04-19
2×1 SubDataFrame
Row │ dates
│ Date
─────┼────────────
1 │ 2020-04-25
2 │ 2020-04-26
So how to get this? You create your own custom grouping filter! In my case, I did something like this:
group_axis = [1]
counter = 1
for i = 1:(length(dates) - 1)
if (dates[i + 1] - dates[i]) != Day(1)
counter += 1
push!(group_axis, counter)
else
push!(group_axis, counter)
end
end
Where I look for continuous segments in my dates. The group_axis
in this case defines each different grouping by assigning it an integer value. From there, I can then do something like this to get the desired groups I want:
using DataFrames
date_groups =
groupby(DataFrame([group_axis, dates], [:groups, :dates]), :groups)
Which produce the following:
GroupedDataFrame with 3 groups based on key: groups
First Group (2 rows): groups = 1
Row │ groups dates
│ Int64 Date
─────┼────────────────────
1 │ 1 2020-03-18
2 │ 1 2020-03-19
⋮
Last Group (2 rows): groups = 3
Row │ groups dates
│ Int64 Date
─────┼────────────────────
1 │ 3 2020-04-25
2 │ 3 2020-04-26
Similar methods can be done for custom group filtering you need to do. I wanted to provide this as an example to show what one could do to approach the problem of creating custom grouped dataframes. For reference, here is the full code example:
using DataFrames
using Dates
dates = [
Date("2020-03-18"),
Date("2020-03-19"),
Date("2020-04-18"),
Date("2020-04-19"),
Date("2020-04-25"),
Date("2020-04-26"),
]
group_axis = [1]
counter = 1
for i = 1:(length(dates) - 1)
if (dates[i + 1] - dates[i]) != Day(1)
counter += 1
push!(group_axis, counter)
else
push!(group_axis, counter)
end
end
date_groups =
groupby(DataFrame([group_axis, dates], [:groups, :dates]), :groups)
~ tcp