Bug in DataFrames grouping

Here is a reproducible example; I explain the bug at the end. I am copying and pasting my code here, so some of the steps may not be needed, but the code is self-contained, not very complicated, and documented. It looks like a lot, but I tend to be verbose to make it easier on everyone.

using Distributions
using Statistics
using DataFrames
using CSV
using HTTP
using Dates

## get the data
contstates = ("AL", "AZ", "AR", "CA", "CO", "CT", "DE", "DC", "FL", "GA", "ID", "IL", "IN", "IA", "KS", "KY", "LA", "ME", "MD", "MA", "MI", "MN", "MS", "MO", "MT", "NE", "NV", "NH", "NJ", "NM", "NY", "NC", "ND", "OH", "OK", "OR", "PA", "RI", "SC", "SD", "TN", "TX", "UT", "VT", "VA", "WA", "WV", "WI", "WY")
f = download("https://covidtracking.com/api/v1/states/daily.csv") |> CSV.File |> DataFrame!
f = f[:, (1:4)]   # select only the first four columns
f.date = Date.(string.(f.date), DateFormat("yyyymmdd"))  # convert the yyyymmdd date column to Date values
filter!(row -> row[:state] in contstates, f)  # remove unwanted states
sort!(f, [:state, :date])  # sort the data

gd = groupby(f, :state)  ## SET UP A GROUPED DATA FRAME based on state

Next I do some operations on the grouping:

function calc_incidence(cuminc)
    _tmp = circshift(cuminc, 1)
    _tmp[1] = 0
    cuminc - _tmp
end
transform!(gd, :positive => calc_incidence => :incidence)
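
To make the helper's behavior concrete, here is a minimal sketch that runs calc_incidence on a made-up cumulative series (the numbers are illustrative, not taken from the dataset). It only assumes the function definition from the post, repeated so the snippet runs on its own.

```julia
# Same definition as in the post, repeated here for self-containment.
function calc_incidence(cuminc)
    _tmp = circshift(cuminc, 1)  # shift right by one; the last value wraps to the front
    _tmp[1] = 0                  # drop the wrapped value so the first day keeps its full count
    cuminc - _tmp                # element-wise difference = new cases per day
end

cumulative = [1, 3, 6, 10]       # made-up cumulative counts
daily = calc_incidence(cumulative)
println(daily)                   # prints [1, 2, 3, 4]
```

For a plain vector this is equivalent to `[cumulative[1]; diff(cumulative)]`.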

The transform! function modifies the original f data frame and adds the incidence column for each group (i.e. for each state). Next I just want a summary of the grouped (state) data:

# for each state, get the cumulative positive count on the last day
f_summary = combine([:positive] => (p) -> (positive=p[end]), gd)

Okay, this should give me one value per state (and it does, but the grouping gets messed up). The result is:

49×2 DataFrame
│ Row │ date       │ positive_function │
│     │ Date       │ Int64             │
├─────┼────────────┼───────────────────┤
│ 1   │ 2020-03-07 │ 74212             │
│ 2   │ 2020-03-06 │ 36259             │
│ 3   │ 2020-03-04 │ 152944            │
⋮
│ 46  │ 2020-01-22 │ 49247             │
│ 47  │ 2020-03-04 │ 49669             │
│ 48  │ 2020-03-06 │ 5550              │
│ 49  │ 2020-03-07 │ 2346              │

Why did it give me arbitrary dates? gd is grouped on state. I expected the result to be:

49×2 DataFrame
│ Row │ state  │ positive_function │
│     │ String │ Int64             │
├─────┼────────┼───────────────────┤
│ 1   │ AL     │ 74212             │
│ 2   │ AR     │ 36259             │
│ 3   │ AZ     │ 152944            │
⋮
│ 46  │ WA     │ 49247             │
│ 47  │ WI     │ 49669             │
│ 48  │ WV     │ 5550              │
│ 49  │ WY     │ 2346              │
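
A side note on the positive_function column name in both tables: in Julia, `(positive=p[end])` without a trailing comma is a parenthesized assignment that returns the bare value, not a NamedTuple. That is presumably why the output column gets an auto-generated name instead of `positive`. The sketch below shows the difference in plain Julia; the names scalar_fn and named_fn are made up for illustration, and this is unrelated to the grouping problem itself.

```julia
scalar_fn = p -> (positive = p[end])   # parenthesized assignment: returns the bare value
named_fn  = p -> (positive = p[end],)  # trailing comma: returns a NamedTuple

x = [10, 25, 40]                       # made-up cumulative counts

println(scalar_fn(x))                  # prints 40
println(named_fn(x))                   # prints (positive = 40,)
```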

It seems that if I call groupby directly inside the transform!/combine calls, things work:

#gd = groupby(f, :state)
transform!(groupby(f, :state), :positive => calc_incidence => :incidence)
f_summary = combine([:positive] => (p) -> (positive=p[end]), groupby(f, :state))

julia> f_summary
49×2 DataFrame
│ Row │ state  │ positive_function │
│     │ String │ Int64             │
├─────┼────────┼───────────────────┤
│ 1   │ AL     │ 74212             │
│ 2   │ AR     │ 36259             │
│ 3   │ AZ     │ 152944            │
⋮
│ 46  │ WA     │ 49247             │
│ 47  │ WI     │ 49669             │
│ 48  │ WV     │ 5550              │
│ 49  │ WY     │ 2346              │

transform! has an ungroup option that defaults to true. So, it adds your column but then undoes the grouping. I think if you set it to false this might work.

This did not fix the problem:

transform!(gd, :positive => calc_incidence => :incidence, ungroup=false)

Yeah, this seems like a bug somehow… You can see that the grouping does in fact switch from state to date when you call transform!: call keys(gd) before and after it. It is unrelated to what you are doing inside transform!, because I swapped out your function for a number of different things and got the same behavior. You should file an issue.
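
The before/after check described above can be sketched on a tiny in-memory table, which may also serve as a starting point for a smaller report. The data below is made up, and whether the grouping columns actually change depends on the installed DataFrames version: on a version without the bug, both printed lines should show the state column. The sketch uses groupcols(gd) to print the grouping columns; keys(gd) shows the same information per group.

```julia
using DataFrames

# Made-up stand-in for the downloaded data: two states, two days each.
df = DataFrame(state = ["AL", "AL", "AZ", "AZ"],
               date = [1, 2, 1, 2],
               positive = [10, 25, 5, 12])

gd = groupby(df, :state)

cols_before = groupcols(gd)   # grouping columns before the in-place transform
transform!(gd, :positive => (p -> [p[1]; diff(p)]) => :incidence)
cols_after = groupcols(gd)    # grouping columns after the in-place transform

println("before: ", cols_before)
println("after:  ", cols_after)
```

If the two lines differ, the grouped data frame has lost track of its grouping column, which is the behavior reported in this thread.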

Okay, thanks! Is there a smaller reproducible example you can share with me for the bug report?

I just used your code.

Ha! Interesting: if you run the transform! line several times in a row, it keeps flipping back and forth between being grouped by state and grouped by date.

Yes! I noticed that too. I just didn’t want to include that in the post. I ran my code again to reproduce and saw that it was “fixed”. I ran it again to double check, and the keys were switched again!

Very interesting.