Creating User-Defined Grouped DataFrames

Hi all,

Just wanted to post a little snippet I learned from @kevbonham about how to create user-defined grouped data frames. What I mean by this, is I was working with time series data recently and wanted to generate groups for when dates were continuous. So, for example, if I was given data with the following dates:

using Dates

dates = [
    Date("2020-03-18"),
    Date("2020-03-19"),
    Date("2020-04-18"),
    Date("2020-04-19"),
    Date("2020-04-25"),
    Date("2020-04-26"),
]

I would want groupings that contain date based groups like this:

2×1 SubDataFrame
 Row │ dates
     │ Date
─────┼────────────
   1 │ 2020-03-18
   2 │ 2020-03-19

2×1 SubDataFrame
 Row │ dates
     │ Date
─────┼────────────
   1 │ 2020-04-18
   2 │ 2020-04-19

2×1 SubDataFrame
 Row │ dates
     │ Date
─────┼────────────
   1 │ 2020-04-25
   2 │ 2020-04-26

So how to get this? You create your own custom grouping filter! In my case, I did something like this:

    group_axis = [1]
    counter = 1
    for i = 1:(length(dates) - 1)
        if (dates[i + 1] - dates[i]) != Day(1)
            counter += 1
            push!(group_axis, counter)
        else
            push!(group_axis, counter)
        end
    end

Where I look for continuous segments in my dates. The group_axis in this case defines each different grouping by assigning it an integer value. From there, I can then do something like this to get the desired groups I want:

using DataFrames

    date_groups =
        groupby(DataFrame([group_axis, dates], [:groups, :dates]), :groups)

Which produce the following:

GroupedDataFrame with 3 groups based on key: groups
First Group (2 rows): groups = 1
 Row │ groups  dates
     │ Int64   Date
─────┼────────────────────
   1 │      1  2020-03-18
   2 │      1  2020-03-19
⋮
Last Group (2 rows): groups = 3
 Row │ groups  dates
     │ Int64   Date
─────┼────────────────────
   1 │      3  2020-04-25
   2 │      3  2020-04-26

Similar methods can be done for custom group filtering you need to do. I wanted to provide this as an example to show what one could do to approach the problem of creating custom grouped dataframes. For reference, here is the full code example:

using DataFrames
using Dates

dates = [
    Date("2020-03-18"),
    Date("2020-03-19"),
    Date("2020-04-18"),
    Date("2020-04-19"),
    Date("2020-04-25"),
    Date("2020-04-26"),
]
    group_axis = [1]
    counter = 1
    for i = 1:(length(dates) - 1)
        if (dates[i + 1] - dates[i]) != Day(1)
            counter += 1
            push!(group_axis, counter)
        else
            push!(group_axis, counter)
        end
    end
    date_groups =
        groupby(DataFrame([group_axis, dates], [:groups, :dates]), :groups)

~ tcp :deciduous_tree:

2 Likes

With help from @bkamins and here is an even more concise example of the above code snippet:

using DataFrames
using Dates

dates = [
    Date("2020-03-18"),
    Date("2020-03-19"),
    Date("2020-04-18"),
    Date("2020-04-19"),
    Date("2020-04-25"),
    Date("2020-04-26"),
]

group_axis = cumsum([1; diff(dates) .!= Day(1)])

    date_groups =
        groupby(DataFrame([group_axis, dates], [:groups, :dates]), :groups)
6 Likes