Create Running Total Columns in a Data Frame for Multiple Variables with Dynamic Column Name Creation

I’ve got some sports data in a dataframe I’m playing with that has data on teams throughout a season and I have a number of performance measures I’d like to create cumulative sums for. I found some great suggestions for how to do this here, but I’m struggling with how to conveniently iterate through the columns I want to create because 1) the colon in the name indexing of columns makes things tricky and 2) I’m not sure how to iterate over a list of column names and use those column names to dynamically create new column names for the cumulative sums. Lastly - I would love to do these cumulative sums so that they were lagged by one time period (i.e. the value for the current row is not included in the running total). Here’s what I’ve got so far:

#To Pull the Data
using HTTP, CSV, DataFrames
url = "https://projects.fivethirtyeight.com/nfl-api/nfl_elo.csv"
f38 = CSV.read(HTTP.get(url).body)

For creating just one column of cumulative sums, this works great:

f38[:, :cumul_elo1_pre] .= 0.0
for subdf in groupby(f38,[:team1, :season])
    subdf[:, :cumul_elo1_pre] .= cumsum(subdf.elo1_pre)    
end

But if I’d like to do the same thing for all these columns, how can I do it without writing everything out, but iterating?

cumul_vars = ["elo1_pre", "elo2_pre", "qbelo1_pre", "qbelo2_pre", "qb1_value_pre", "qb2_value_pre", 
              "qb1_adj", "qb2_adj", "score1", "score2"];

Any suggestions?

1 Like

What language are you coming from? This looks stata-esque.

The solution here is to do combine. Read the DataFrames docs here for more info.

cumul_vars = ["elo1_pre", "elo2_pre", "qbelo1_pre", "qbelo2_pre", "qb1_value_pre", "qb2_value_pre", "qb1_adj", "qb2_adj", "score1", "score2"]

combine(groupby(df, [:team, :season]), cumul_vars .=> cumsum)

Actually, I think you want to keep existing columns, right? Then you would do

transform(groupby(df, [:team, :season]), cumul_vars .=> cumsum)
2 Likes

Did you know that you can use strings as column names? So this would work

#To Pull the Data
using HTTP, CSV, DataFrames
url = "https://projects.fivethirtyeight.com/nfl-api/nfl_elo.csv"
f38 = CSV.read(HTTP.get(url).body)


cumul_vars = ["elo1_pre", "elo2_pre", "qbelo1_pre", "qbelo2_pre", "qb1_value_pre", "qb2_value_pre",
              "qb1_adj", "qb2_adj", "score1", "score2"];

combine(groupby(f38, ["team1", "season"]), cumul_vars .=> cumsum)
1 Like

I am indeed coming from Stata - trying to shed those old habits, but not succeeding. Many thanks for this! Is it possible to do a version that is lagged by one time unit?

And thanks @xiaodai - that’s super helpful to know.

you can use ShiftedArrays to get a lag so presumably you want

transform(groupby(df, [:team, :season]), cumul_vars .=>  cumsum ∘ lag)
1 Like

Great - I will look into that. Thank you!

This works great - just two (hopefully) quick follow-ups on this:

  1. I did this successfully for team 1, but I’d like to do it as well for their opponent (team 2), but modifying the code for team2 would produce the same variable names (it’s iterating over the same column set, just for different grouped dfs). I know I can add another => to the transform statement to specify a new column name, but that doesn’t seem to be possible for a list of variables like I have in cumul_vars here. Any suggestions for how to modifying the resulting column names and be able to execute this for the opposing team as well?

  2. To do the lag, I went simple and just subtracted the current observation’s value from the cumulative variable with the following code, which works just fine but issues a syntax warning and is definitely a very Stata way of doing things. So in the interest of retraining my brain and to learn better coding practices, I wonder if you have advice on a better way to write it.

for var in cumul_vars
    full_data["$(var)_cumsum"] = full_data[!, "$(var)_cumsum"] .- full_data[!, "$(var)"]
end

you can keep broadcasting the => operator.

julia> df = DataFrame(rand(2,2));

julia> input_vars = ["x1", "x2"];

julia> fun(x) =  x .+ 1;

julia> output_vars = [xi * "_new_variable_fun" for xi in input_vars];

julia> transform(df, input_vars .=> fun .=> output_vars)
2×4 DataFrame
│ Row │ x1       │ x2       │ x1_new_variable_fun │ x2_new_variable_fun │
│     │ Float64  │ Float64  │ Float64             │ Float64             │
├─────┼──────────┼──────────┼─────────────────────┼─────────────────────┤
│ 1   │ 0.516845 │ 0.606987 │ 1.51685             │ 1.60699             │
│ 2   │ 0.453556 │ 0.965991 │ 1.45356             │ 1.96599             │

Please read the syntax warning! You are getting it from the first indexing, you need a [:,"$(var)_cumsum"], notice the new :.

I’m not sure I have a better way of doing that cumsun thing. You could certainly write a small function and use transform, but then you have to have an input like [[:x1, :x2], [:y1, :y2]] .=> f which is complicated.

The indexing behavior is definitely a bit ugly… perhaps that can be fixed with metaprogramming.

1 Like