Bug in DataFrames grouping

affans · July 24, 2020, 2:33am

Here is a reproducible example, and I explain the bug at the end. I am just copying/pasting my code here, and some of the steps may not be needed. But it’s fully contained code, not very complicated, and documented. It seems like a lot, but I tend to be very verbose to make it easier on everyone.

using Distributions
using Statistics
using DataFrames
using CSV
using HTTP
using Dates

## get the data
contstates = ("AL", "AZ", "AR", "CA", "CO", "CT", "DE", "DC", "FL", "GA", "ID", "IL", "IN", "IA", "KS", "KY", "LA", "ME", "MD", "MA", "MI", "MN", "MS", "MO", "MT", "NE", "NV", "NH", "NJ", "NM", "NY", "NC", "ND", "OH", "OK", "OR", "PA", "RI", "SC", "SD", "TN", "TX", "UT", "VT", "VA", "WA", "WV", "WI", "WY")
f = download("https://covidtracking.com/api/v1/states/daily.csv") |> CSV.File |> DataFrame!
f = f[:, (1:4)]   # select only the first four columns
f.date = Date.(string.(f.date), DateFormat("yyyymmdd")) # convert the date column 
filter!(row -> row[:state] in contstates, f)  # remove unwanted states
sort!(f, [:state, :date])  # sort the data

gd = groupby(f, :state)  ## SET UP A GROUPED DATA FRAME based on state

Next I do some operations on the grouping:

function calc_incidence(cuminc)
    _tmp = circshift(cuminc, 1)
    _tmp[1] = 0
    cuminc - _tmp
end
transform!(gd, :positive => calc_incidence => :incidence)
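To make it clearer what calc_incidence does to a single group, here is a small self-contained check on a made-up cumulative series (the numbers are hypothetical, not taken from the real data):

```julia
# Same function as above, repeated so this snippet runs on its own
function calc_incidence(cuminc)
    _tmp = circshift(cuminc, 1)  # shift every value one position to the right
    _tmp[1] = 0                  # there is no previous day for the first entry
    cuminc - _tmp                # cumulative count minus the previous day's count
end

cum = [1, 3, 6, 10]              # hypothetical cumulative counts for one state
inc = calc_incidence(cum)
println(inc)                     # daily new cases: [1, 2, 3, 4]
```

So the first day keeps its cumulative value and every later day becomes the difference from the day before.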

The transform! function modifies the original f dataframe and adds the incidence column for each group (i.e. for each state). Next, I just want to get a summary of the grouped (state) data:

# for each state, get the cumulative positive count on the last day
f_summary = combine([:positive] => (p) -> (positive=p[end]), gd)

Okay, this should give me one value per state (and it does, but the grouping gets messed up). The result is

49×2 DataFrame
│ Row │ date       │ positive_function │
│     │ Date       │ Int64             │
├─────┼────────────┼───────────────────┤
│ 1   │ 2020-03-07 │ 74212             │
│ 2   │ 2020-03-06 │ 36259             │
│ 3   │ 2020-03-04 │ 152944            │
⋮
│ 46  │ 2020-01-22 │ 49247             │
│ 47  │ 2020-03-04 │ 49669             │
│ 48  │ 2020-03-06 │ 5550              │
│ 49  │ 2020-03-07 │ 2346              │

Why did it give me arbitrary dates? gd is grouped on state. I expected the result to be

49×2 DataFrame
│ Row │ state  │ positive_function │
│     │ String │ Int64             │
├─────┼────────┼───────────────────┤
│ 1   │ AL     │ 74212             │
│ 2   │ AR     │ 36259             │
│ 3   │ AZ     │ 152944            │
⋮
│ 46  │ WA     │ 49247             │
│ 47  │ WI     │ 49669             │
│ 48  │ WV     │ 5550              │
│ 49  │ WY     │ 2346              │

affans · July 24, 2020, 2:38am

Seems like if I directly call groupby in the transform/combine functions, things work:

#gd = groupby(f, :state)
transform!(groupby(f, :state), :positive => calc_incidence => :incidence)
f_summary = combine([:positive] => (p) -> (positive=p[end]), groupby(f, :state))

julia> f_summary
49×2 DataFrame
│ Row │ state  │ positive_function │
│     │ String │ Int64             │
├─────┼────────┼───────────────────┤
│ 1   │ AL     │ 74212             │
│ 2   │ AR     │ 36259             │
│ 3   │ AZ     │ 152944            │
⋮
│ 46  │ WA     │ 49247             │
│ 47  │ WI     │ 49669             │
│ 48  │ WV     │ 5550              │
│ 49  │ WY     │ 2346              │

tbeason · July 24, 2020, 2:41am

transform! has an ungroup option that defaults to true. So, it adds your column but then undoes the grouping. I think if you set it to false this might work.
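A rough sketch of what the ungroup keyword changes, on a tiny made-up table. The data and the :running column name are hypothetical, and I am using the non-mutating transform here rather than transform! so the original table is left alone:

```julia
using DataFrames

# Hypothetical toy data, not the real COVID table
df = DataFrame(state = ["AL", "AL", "AZ"], positive = [1, 3, 2])
gd = groupby(df, :state)

# Default (ungroup=true): the result is a plain DataFrame
ungrouped = transform(gd, :positive => cumsum => :running)

# ungroup=false: the result stays grouped
still_grouped = transform(gd, :positive => cumsum => :running; ungroup=false)

println(typeof(ungrouped))
println(still_grouped isa GroupedDataFrame)
```

Note that this only affects the value that is returned. It says nothing about what happens to an existing gd object that you keep using afterwards.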

affans · July 24, 2020, 2:51am

This did not fix the problem:

transform!(gd, :positive => calc_incidence => :incidence, ungroup=false)

tbeason · July 24, 2020, 3:14am

Yeah, this seems like a bug somehow… You can see that the grouping does in fact switch to date rather than state when you call transform!. Call keys(gd) before and after it. It is unrelated to what you are doing in the transform!, because I swapped out your function for a number of different things and got the same behavior. You should file an issue.
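Something along these lines is how one could compare the keys before and after. This is only a sketch on made-up data (the table and the :running column are hypothetical), and I am not claiming that this toy table triggers the problem on any particular DataFrames version:

```julia
using DataFrames

# Hypothetical toy data standing in for the real table
df = DataFrame(date = [1, 2, 1, 2],
               state = ["AL", "AL", "AZ", "AZ"],
               positive = [1, 3, 2, 5])
gd = groupby(df, :state)

# Snapshot the group keys as strings so later changes to gd cannot affect them
keys_before = string.(collect(keys(gd)))
transform!(gd, :positive => cumsum => :running)
keys_after = string.(collect(keys(gd)))

println(keys_before)
println(keys_after)
println(keys_before == keys_after ? "grouping unchanged" : "grouping changed")
```

If the two snapshots differ, the grouping was altered by the transform! call itself and not by the function passed to it.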

affans · July 24, 2020, 3:14am

Okay, thanks! Is there a smaller reproducible example you can share with me for the bug report?

tbeason · July 24, 2020, 3:17am

I just used your code.

tbeason · July 24, 2020, 3:25am

Ha! Interesting: if you run the transform! line several times in a row, it keeps flipping back and forth between being grouped by state and grouped by date.

affans · July 24, 2020, 3:34am

Yes! I noticed that too. I just didn’t want to include that in the post. I ran my code again to reproduce and saw that it was “fixed”. I ran it again to double check, and the keys were switched again!

Very interesting.
