How to compute a "cumulative" in a dataframe (without a for loop)

I am trying to add a new column with a cumulative value in regard to an other column., e.g.:

In:

df = DataFrame(region = ["US","US","US","US","EU","EU","EU","EU"],
               year   = [2010,2011,2012,2013,2010,2011,2012,2013],
               value  = [3,3,2,2,2,2,1,1]) 

Out:

	region	year	value
1	US	2010	3
2	US	2011	3
3	US	2012	2
4	US	2013	2
5	EU	2010	2
6	EU	2011	2
7	EU	2012	1
8	EU	2013	1

I did try something like:

In:

df[:cumValue] = copy(df[:value])
[r[:cumValue] .= df[(df[:region] .== r[:region]) & (df[:year] .== (r[:year]-1)),:cumValue] for r in eachrow(df) if r[:year] != minimum(df[:year])]

But I get the following error:
Out:
LoadError: MethodError: no method matching broadcast!(::Base.#identity, ::Int64, ::DataArrays.DataArray{Int64,1})

Use cumsum.

df = DataFrame(region = ["US","US","US","US","EU","EU","EU","EU"],
               year   = [2010,2011,2012,2013,2010,2011,2012,2013],
               value  = [3,3,2,2,2,2,1,1])
df[:cumValue] = cumsum(df[:value])
df
region	year	value	cumValue
1	US	2010	3	3
2	US	2011	3	6
3	US	2012	2	8
4	US	2013	2	10
5	EU	2010	2	12
6	EU	2011	2	14
7	EU	2012	1	15
8	EU	2013	1	16

It’s not as efficient, but you can also do things like df[:cumValue] = [sum(df[1:i, :value]) for i in 1:nrow(df)], which can be adapted if you need to do something other than sum the value

1 Like

Thank you, but I need to subgroup the summing by region (sorry I didn’t specified).

At the end I found:

  • with a for loop (maybe more clear):
df[:cumValue] = cumsum(df[:value])
for r in eachrow(df)
    if r[:year] != minimum(df[:year])
        value = df[(df[:region] .== r[:region]) & (df[:year] .== (r[:year]-1)),:cumValue]
        r[:cumValue] = value[1] + r[:value] 
    end
end
  • with list comprehension:
df[:cumValue] = cumsum(df[:value])
[r[:cumValue] = df[(df[:region] .== r[:region]) & (df[:year] .== (r[:year]-1)),:cumValue][1] + r[:value]  for r in eachrow(df) if r[:year] != minimum(df[:year])]    

Out:

	region	year	value	cumValue
1	US	2010	3	3
2	US	2011	3	6
3	US	2012	2	8
4	US	2013	2	10
5	EU	2010	2	2
6	EU	2011	2	4
7	EU	2012	1	5
8	EU	2013	1	6

Not sure if it’s the most efficient, but I often find it useful to allocate a new column and then modify it in place with a by.

In this case:

df[:cumsum] = similar(df[:value])
by(df, :region) do dd
       dd[:cumsum] = cumsum(dd[:value])
       return
end

The empty return is to specify that you are not interested in getting any output from this by, you only want the side effect.

2 Likes

You can also accomplish this using DataFramesMeta and chaining operations together using the @linq macro and pipe |>,

using DataFramesMeta
df = @linq df |>
  groupby(:region) |>
  transform(cumValue = cumsum(:value))

which yields

8Γ—4 DataFrames.DataFrame
β”‚ Row β”‚ region β”‚ year β”‚ value β”‚ cumvalue β”‚
β”œβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 1   β”‚ "EU"   β”‚ 2010 β”‚ 2     β”‚ 2        β”‚
β”‚ 2   β”‚ "EU"   β”‚ 2011 β”‚ 2     β”‚ 4        β”‚
β”‚ 3   β”‚ "EU"   β”‚ 2012 β”‚ 1     β”‚ 5        β”‚
β”‚ 4   β”‚ "EU"   β”‚ 2013 β”‚ 1     β”‚ 6        β”‚
β”‚ 5   β”‚ "US"   β”‚ 2010 β”‚ 3     β”‚ 3        β”‚
β”‚ 6   β”‚ "US"   β”‚ 2011 β”‚ 3     β”‚ 6        β”‚
β”‚ 7   β”‚ "US"   β”‚ 2012 β”‚ 2     β”‚ 8        β”‚
β”‚ 8   β”‚ "US"   β”‚ 2013 β”‚ 2     β”‚ 10       β”‚

I haven’t compared the performance of these various approaches, but for most β€œnormal” sized data sets the difference shouldn’t be worth worrying about. Personally, I find this clearer to read than a β€œby, do” loop, especially if you start doing a lot of successive transformations and manipulations.

3 Likes

Thank you. Both @piever and @ElOceanografo solutions are much more efficient (and cleaner) than the β€œmanual” naive solution I found by myself. Good I did ask :wink:

1 Like

I have a question along this line, especially @ElOceanografo response, been playing with DataFramesMeta, which is awesome so far, but stuck on trying to get more conditional parameters to work.
So in the example in this thread, how would I go about adding a condition to the transform(CumValue) statement if say I wanted to have another column that added only for the last 2 years, and not everything in the groupby(:region)? like CumvalueTrailing2Years type of column.

thanks!

RollingFunctions.jl will handle this, I think. Something like

@chain df begin 
    groupby(:state)
    @transform :y_rolling_mean = rollmean(:y, 12)
end

But if your data is in dates and isn’t like, every day or every week, meaning x[i-5] always meaning the same thing, then I think things are harder, and I don’t know what exactly the solution would be.

thats a cool package, didnt know it existed!

but I dont see it allowing for product like functions, or rolling sums. ie. it doesnt have cumprod or cumsum equivalent that i can see in the docs. (might be blind…)

It seems to support having your own functions, see here. I’m sure we could help with an MWE, though I haven’t worked too much with this function.

hmm… how would that look like code wise?
transform(columnname = rolling(cumsum(:value), 2) <–?

It would be rolling(cumsum, :value, 2).

EDIT: I guess the function in rolling needs to return a scalar. So I’m not 100% sure what the solution to this problem is.

yeah was about to say that, as i run into that issue trying to execute it… hoping there is an elegant solution with dataframesmeta here…

I think this is possible. Can you give an example of what you want, say with the input x = [1, 2, 3, 4, 5, 6]?

just using the original posters’ df here, example of what I am trying to get to:

df9 = DataFrame(region = ["US","US","US","US","EU","EU","EU","EU"],
               year   = [2010,2011,2012,2013,2010,2011,2012,2013],
               value  = [3,3,2,2,2,2,1,1]) 
  df9 = @linq df9 |>
  groupby(:region) |>
  transform(cumValueTrailing2YR = rolling(cumsum, :value, 2)

and hopefully this type of solution works on cumprod, other functions. But at the base of it, i have a bunch of datasets with Date / Values / GroupTypes, where I need to add columns that are trailing in nature doing cumulative product function.

with example above, you can just do β€œ=cumprod(:value)”, but then thats just doing it by your grouping function, and no way I know of to give is depedency on another column group, say Year in that above df.

Please use triple backticks

```
like this
```

to format code.

I mean, given the input

x = [1, 2, 3, 4, 5, 6]

What kind of output do you want? Because I am having trouble understanding what kind of output you actually want. Forget about grouping for the time being, I think.

oh the output would just be another column in the df, thats simply cumulative sum of last 2 years per group. So you need 2 inputs, you cant have a single vector input in this problem. cumsum is dependent on a column thats not part of the groupby.

But not every entry in the array can be a cumulative sum of a different window, right?

I can’t imagine how that would work without it just bring a rolling sum, not a cumsum.

depends on the condition of the function, you would end up with earlier parts of the array as null (or defaulted value of some sort).

so if X = [ 2001, 2002, 2003, 2004], and Y = [ 2, 3, 6, 4] and you are rolling sum every 2 years, your theoretical column Z = [ null, 5, 9, 10], as for 2001, there is no last 2 years to sum.

That’s just rollsum, then

julia> using RollingFunctions

julia> Y = [ 2, 3, 6, 4];

julia> rolling(sum, Y, 2)
3-element Vector{Float64}:
  5.0
  9.0
 10.0

I see what you mean about the missing values. It’s unfortunate this isn’t supported. I will file an issue to add it. You can do

Z = [missing; rolling(sum, Y, 2))]

to get the behavior you want.

1 Like