How to compute a "cumulative" in a dataframe (without a for loop)

question

#1

I am trying to add a new column with a cumulative value in regard to an other column., e.g.:

In:

df = DataFrame(region = ["US","US","US","US","EU","EU","EU","EU"],
               year   = [2010,2011,2012,2013,2010,2011,2012,2013],
               value  = [3,3,2,2,2,2,1,1]) 

Out:

	region	year	value
1	US	2010	3
2	US	2011	3
3	US	2012	2
4	US	2013	2
5	EU	2010	2
6	EU	2011	2
7	EU	2012	1
8	EU	2013	1

I did try something like:

In:

df[:cumValue] = copy(df[:value])
[r[:cumValue] .= df[(df[:region] .== r[:region]) & (df[:year] .== (r[:year]-1)),:cumValue] for r in eachrow(df) if r[:year] != minimum(df[:year])]

But I get the following error:
Out:
LoadError: MethodError: no method matching broadcast!(::Base.#identity, ::Int64, ::DataArrays.DataArray{Int64,1})


#2

Use cumsum.

df = DataFrame(region = ["US","US","US","US","EU","EU","EU","EU"],
               year   = [2010,2011,2012,2013,2010,2011,2012,2013],
               value  = [3,3,2,2,2,2,1,1])
df[:cumValue] = cumsum(df[:value])
df
region	year	value	cumValue
1	US	2010	3	3
2	US	2011	3	6
3	US	2012	2	8
4	US	2013	2	10
5	EU	2010	2	12
6	EU	2011	2	14
7	EU	2012	1	15
8	EU	2013	1	16

It’s not as efficient, but you can also do things like df[:cumValue] = [sum(df[1:i, :value]) for i in 1:nrow(df)], which can be adapted if you need to do something other than sum the value


#3

Thank you, but I need to subgroup the summing by region (sorry I didn’t specified).

At the end I found:

  • with a for loop (maybe more clear):
df[:cumValue] = cumsum(df[:value])
for r in eachrow(df)
    if r[:year] != minimum(df[:year])
        value = df[(df[:region] .== r[:region]) & (df[:year] .== (r[:year]-1)),:cumValue]
        r[:cumValue] = value[1] + r[:value] 
    end
end
  • with list comprehension:
df[:cumValue] = cumsum(df[:value])
[r[:cumValue] = df[(df[:region] .== r[:region]) & (df[:year] .== (r[:year]-1)),:cumValue][1] + r[:value]  for r in eachrow(df) if r[:year] != minimum(df[:year])]    

Out:

	region	year	value	cumValue
1	US	2010	3	3
2	US	2011	3	6
3	US	2012	2	8
4	US	2013	2	10
5	EU	2010	2	2
6	EU	2011	2	4
7	EU	2012	1	5
8	EU	2013	1	6

#4

Not sure if it’s the most efficient, but I often find it useful to allocate a new column and then modify it in place with a by.

In this case:

df[:cumsum] = similar(df[:value])
by(df, :region) do dd
       dd[:cumsum] = cumsum(dd[:value])
       return
end

The empty return is to specify that you are not interested in getting any output from this by, you only want the side effect.


#5

You can also accomplish this using DataFramesMeta and chaining operations together using the @linq macro and pipe |>,

using DataFramesMeta
df = @linq df |>
  groupby(:region) |>
  transform(cumValue = cumsum(:value))

which yields

8Γ—4 DataFrames.DataFrame
β”‚ Row β”‚ region β”‚ year β”‚ value β”‚ cumvalue β”‚
β”œβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 1   β”‚ "EU"   β”‚ 2010 β”‚ 2     β”‚ 2        β”‚
β”‚ 2   β”‚ "EU"   β”‚ 2011 β”‚ 2     β”‚ 4        β”‚
β”‚ 3   β”‚ "EU"   β”‚ 2012 β”‚ 1     β”‚ 5        β”‚
β”‚ 4   β”‚ "EU"   β”‚ 2013 β”‚ 1     β”‚ 6        β”‚
β”‚ 5   β”‚ "US"   β”‚ 2010 β”‚ 3     β”‚ 3        β”‚
β”‚ 6   β”‚ "US"   β”‚ 2011 β”‚ 3     β”‚ 6        β”‚
β”‚ 7   β”‚ "US"   β”‚ 2012 β”‚ 2     β”‚ 8        β”‚
β”‚ 8   β”‚ "US"   β”‚ 2013 β”‚ 2     β”‚ 10       β”‚

I haven’t compared the performance of these various approaches, but for most β€œnormal” sized data sets the difference shouldn’t be worth worrying about. Personally, I find this clearer to read than a β€œby, do” loop, especially if you start doing a lot of successive transformations and manipulations.


#6

Thank you. Both @piever and @ElOceanografo solutions are much more efficient (and cleaner) than the β€œmanual” naive solution I found by myself. Good I did ask :wink: