How to compute a "cumulative" in a dataframe (without a for loop)

sylvaticus · February 17, 2017, 4:10pm

I am trying to add a new column with a cumulative value in regard to an other column., e.g.:

In:

df = DataFrame(region = ["US","US","US","US","EU","EU","EU","EU"],
               year   = [2010,2011,2012,2013,2010,2011,2012,2013],
               value  = [3,3,2,2,2,2,1,1])

Out:

	region	year	value
1	US	2010	3
2	US	2011	3
3	US	2012	2
4	US	2013	2
5	EU	2010	2
6	EU	2011	2
7	EU	2012	1
8	EU	2013	1

I did try something like:

In:

df[:cumValue] = copy(df[:value])
[r[:cumValue] .= df[(df[:region] .== r[:region]) & (df[:year] .== (r[:year]-1)),:cumValue] for r in eachrow(df) if r[:year] != minimum(df[:year])]

But I get the following error:
Out:
LoadError: MethodError: no method matching broadcast!(::Base.#identity, ::Int64, ::DataArrays.DataArray{Int64,1})

amellnik · February 17, 2017, 4:23pm

Use cumsum.

df = DataFrame(region = ["US","US","US","US","EU","EU","EU","EU"],
               year   = [2010,2011,2012,2013,2010,2011,2012,2013],
               value  = [3,3,2,2,2,2,1,1])
df[:cumValue] = cumsum(df[:value])
df

region	year	value	cumValue
1	US	2010	3	3
2	US	2011	3	6
3	US	2012	2	8
4	US	2013	2	10
5	EU	2010	2	12
6	EU	2011	2	14
7	EU	2012	1	15
8	EU	2013	1	16

It’s not as efficient, but you can also do things like df[:cumValue] = [sum(df[1:i, :value]) for i in 1:nrow(df)], which can be adapted if you need to do something other than sum the value

sylvaticus · February 17, 2017, 4:32pm

Thank you, but I need to subgroup the summing by region (sorry I didn’t specified).

At the end I found:

with a for loop (maybe more clear):

df[:cumValue] = cumsum(df[:value])
for r in eachrow(df)
    if r[:year] != minimum(df[:year])
        value = df[(df[:region] .== r[:region]) & (df[:year] .== (r[:year]-1)),:cumValue]
        r[:cumValue] = value[1] + r[:value] 
    end
end

with list comprehension:

df[:cumValue] = cumsum(df[:value])
[r[:cumValue] = df[(df[:region] .== r[:region]) & (df[:year] .== (r[:year]-1)),:cumValue][1] + r[:value]  for r in eachrow(df) if r[:year] != minimum(df[:year])]

Out:

	region	year	value	cumValue
1	US	2010	3	3
2	US	2011	3	6
3	US	2012	2	8
4	US	2013	2	10
5	EU	2010	2	2
6	EU	2011	2	4
7	EU	2012	1	5
8	EU	2013	1	6

piever · February 17, 2017, 4:57pm

Not sure if it’s the most efficient, but I often find it useful to allocate a new column and then modify it in place with a by.

In this case:

df[:cumsum] = similar(df[:value])
by(df, :region) do dd
       dd[:cumsum] = cumsum(dd[:value])
       return
end

The empty return is to specify that you are not interested in getting any output from this by, you only want the side effect.

ElOceanografo · February 17, 2017, 6:21pm

You can also accomplish this using DataFramesMeta and chaining operations together using the @linq macro and pipe |>,

using DataFramesMeta
df = @linq df |>
  groupby(:region) |>
  transform(cumValue = cumsum(:value))

which yields

8×4 DataFrames.DataFrame
│ Row │ region │ year │ value │ cumvalue │
├─────┼────────┼──────┼───────┼──────────┤
│ 1   │ "EU"   │ 2010 │ 2     │ 2        │
│ 2   │ "EU"   │ 2011 │ 2     │ 4        │
│ 3   │ "EU"   │ 2012 │ 1     │ 5        │
│ 4   │ "EU"   │ 2013 │ 1     │ 6        │
│ 5   │ "US"   │ 2010 │ 3     │ 3        │
│ 6   │ "US"   │ 2011 │ 3     │ 6        │
│ 7   │ "US"   │ 2012 │ 2     │ 8        │
│ 8   │ "US"   │ 2013 │ 2     │ 10       │

I haven’t compared the performance of these various approaches, but for most “normal” sized data sets the difference shouldn’t be worth worrying about. Personally, I find this clearer to read than a “by, do” loop, especially if you start doing a lot of successive transformations and manipulations.

sylvaticus · February 19, 2017, 3:08pm

Thank you. Both @piever and @ElOceanografo solutions are much more efficient (and cleaner) than the “manual” naive solution I found by myself. Good I did ask

SubTer · September 3, 2021, 12:34am

I have a question along this line, especially @ElOceanografo response, been playing with DataFramesMeta, which is awesome so far, but stuck on trying to get more conditional parameters to work.
So in the example in this thread, how would I go about adding a condition to the transform(CumValue) statement if say I wanted to have another column that added only for the last 2 years, and not everything in the groupby(:region)? like CumvalueTrailing2Years type of column.

thanks!

pdeffebach · September 3, 2021, 12:42am

RollingFunctions.jl will handle this, I think. Something like

@chain df begin 
    groupby(:state)
    @transform :y_rolling_mean = rollmean(:y, 12)
end

But if your data is in dates and isn’t like, every day or every week, meaning x[i-5] always meaning the same thing, then I think things are harder, and I don’t know what exactly the solution would be.

SubTer · September 3, 2021, 12:47am

thats a cool package, didnt know it existed!

but I dont see it allowing for product like functions, or rolling sums. ie. it doesnt have cumprod or cumsum equivalent that i can see in the docs. (might be blind…)

pdeffebach · September 3, 2021, 12:48am

It seems to support having your own functions, see here. I’m sure we could help with an MWE, though I haven’t worked too much with this function.

SubTer · September 3, 2021, 12:54am

hmm… how would that look like code wise?
transform(columnname = rolling(cumsum(:value), 2) <–?

pdeffebach · September 3, 2021, 12:59am

It would be rolling(cumsum, :value, 2).

EDIT: I guess the function in rolling needs to return a scalar. So I’m not 100% sure what the solution to this problem is.

SubTer · September 3, 2021, 1:07am

yeah was about to say that, as i run into that issue trying to execute it… hoping there is an elegant solution with dataframesmeta here…

pdeffebach · September 3, 2021, 1:08am

I think this is possible. Can you give an example of what you want, say with the input x = [1, 2, 3, 4, 5, 6]?

SubTer · September 3, 2021, 1:20am

just using the original posters’ df here, example of what I am trying to get to:

df9 = DataFrame(region = ["US","US","US","US","EU","EU","EU","EU"],
               year   = [2010,2011,2012,2013,2010,2011,2012,2013],
               value  = [3,3,2,2,2,2,1,1]) 
  df9 = @linq df9 |>
  groupby(:region) |>
  transform(cumValueTrailing2YR = rolling(cumsum, :value, 2)

and hopefully this type of solution works on cumprod, other functions. But at the base of it, i have a bunch of datasets with Date / Values / GroupTypes, where I need to add columns that are trailing in nature doing cumulative product function.

with example above, you can just do “=cumprod(:value)”, but then thats just doing it by your grouping function, and no way I know of to give is depedency on another column group, say Year in that above df.

pdeffebach · September 3, 2021, 1:27am

Please use triple backticks

```
like this
```

to format code.

I mean, given the input

x = [1, 2, 3, 4, 5, 6]

What kind of output do you want? Because I am having trouble understanding what kind of output you actually want. Forget about grouping for the time being, I think.

SubTer · September 3, 2021, 1:33am

oh the output would just be another column in the df, thats simply cumulative sum of last 2 years per group. So you need 2 inputs, you cant have a single vector input in this problem. cumsum is dependent on a column thats not part of the groupby.

pdeffebach · September 3, 2021, 1:37am

But not every entry in the array can be a cumulative sum of a different window, right?

I can’t imagine how that would work without it just bring a rolling sum, not a cumsum.

SubTer · September 3, 2021, 1:44am

depends on the condition of the function, you would end up with earlier parts of the array as null (or defaulted value of some sort).

so if X = [ 2001, 2002, 2003, 2004], and Y = [ 2, 3, 6, 4] and you are rolling sum every 2 years, your theoretical column Z = [ null, 5, 9, 10], as for 2001, there is no last 2 years to sum.

pdeffebach · September 3, 2021, 1:55am

That’s just rollsum, then

julia> using RollingFunctions

julia> Y = [ 2, 3, 6, 4];

julia> rolling(sum, Y, 2)
3-element Vector{Float64}:
  5.0
  9.0
 10.0

I see what you mean about the missing values. It’s unfortunate this isn’t supported. I will file an issue to add it. You can do

Z = [missing; rolling(sum, Y, 2))]

to get the behavior you want.

Topic		Replies	Views
Nested Task error: Bad window span - what? Data question , dataframes	6	799	September 13, 2021
Create a GroupedDataFrame by the relations of rows rather than the values of the rows in a column, e.g `groupby` consecutive dates? New to Julia question , dataframes , grouped-data	14	709	March 29, 2023
Run multiple instances of transform on specific column combinations of a GroupedDataFrame in DataFrames mini language New to Julia question , dataframes	22	702	December 23, 2022
Why I get 'RefValue{SubArray{Int64' and not "simply" 'SubArray{Int64' Data	13	810	January 11, 2021
Cummulative death by day by country Data question , dataframes , piping	29	1999	June 29, 2020

How to compute a "cumulative" in a dataframe (without a for loop)

Related topics