Here is a reproducible example; I explain the bug at the end. I am copying and pasting my code as-is, so some of the steps may not be needed, but the code is self-contained, not very complicated, and documented. It looks like a lot, but I tend to be verbose to make it easier on everyone.
using Distributions
using Statistics
using DataFrames
using CSV
using HTTP
using Dates
## get the data
contstates = ("AL", "AZ", "AR", "CA", "CO", "CT", "DE", "DC", "FL", "GA", "ID", "IL", "IN", "IA", "KS", "KY", "LA", "ME", "MD", "MA", "MI", "MN", "MS", "MO", "MT", "NE", "NV", "NH", "NJ", "NM", "NY", "NC", "ND", "OH", "OK", "OR", "PA", "RI", "SC", "SD", "TN", "TX", "UT", "VT", "VA", "WA", "WV", "WI", "WY")
f = download("https://covidtracking.com/api/v1/states/daily.csv") |> CSV.File |> DataFrame!
f = f[:, (1:4)] # select only the first four columns
f.date = Date.(string.(f.date), DateFormat("yyyymmdd")) # convert the date column
filter!(row -> row[:state] in contstates, f) # remove unwanted states
sort!(f, [:state, :date]) # sort the data
gd = groupby(f, :state) ## SET UP A GROUPED DATA FRAME based on state
Next I do some operations on the grouping:
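function calc_incidence(cuminc)
    _tmp = circshift(cuminc, 1) # shift the cumulative series by one day
    _tmp[1] = 0                 # the first day has no previous value
    cuminc - _tmp               # daily incidence = today minus yesterday
end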
transform!(gd, :positive => calc_incidence => :incidence)
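To make the per-group step concrete, here is a minimal sketch of what calc_incidence does to a single cumulative series. It repeats the function from above so that it runs on its own; the vector cumulative is made-up data for illustration, not taken from the real dataset.

```julia
# Same logic as calc_incidence above: day-over-day differences of a cumulative series.
function calc_incidence(cuminc)
    _tmp = circshift(cuminc, 1)  # shift every value one position later; the last value wraps to the front
    _tmp[1] = 0                  # overwrite the wrapped-around value so day 1 keeps its full count
    cuminc - _tmp
end

cumulative = [2, 5, 5, 9]        # hypothetical cumulative positives for one state
incidence = calc_incidence(cumulative)
println(incidence)               # prints [2, 3, 0, 4]
```

Summing the incidence back up with cumsum recovers the cumulative series, which is a quick way to sanity-check the result.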
The transform! function modifies the original f dataframe and adds the incidence column for each group (i.e. for each state). Next, I simply want to get a summary of the grouped (state) data:
# for each state, get the incidence on last day + the total cumulative f
f_summary = combine([:positive] => (p) -> (positive=p[end]), gd)
Okay, this should give me one value per state (and it does, but the grouping column gets messed up). The result is:
49×2 DataFrame
│ Row │ date │ positive_function │
│ │ Date │ Int64 │
├─────┼────────────┼───────────────────┤
│ 1 │ 2020-03-07 │ 74212 │
│ 2 │ 2020-03-06 │ 36259 │
│ 3 │ 2020-03-04 │ 152944 │
⋮
│ 46 │ 2020-01-22 │ 49247 │
│ 47 │ 2020-03-04 │ 49669 │
│ 48 │ 2020-03-06 │ 5550 │
│ 49 │ 2020-03-07 │ 2346 │
Why did it give me arbitrary dates? The gd is grouped on state. I expected the result to be:
49×2 DataFrame
│ Row │ state │ positive_function │
│ │ String │ Int64 │
├─────┼────────┼───────────────────┤
│ 1 │ AL │ 74212 │
│ 2 │ AR │ 36259 │
│ 3 │ AZ │ 152944 │
⋮
│ 46 │ WA │ 49247 │
│ 47 │ WI │ 49669 │
│ 48 │ WV │ 5550 │
│ 49 │ WY │ 2346 │