I am trying to add a new column with a cumulative value in regard to an other column., e.g.:
In:
df = DataFrame(region = ["US","US","US","US","EU","EU","EU","EU"],
year = [2010,2011,2012,2013,2010,2011,2012,2013],
value = [3,3,2,2,2,2,1,1])
Out:
region year value
1 US 2010 3
2 US 2011 3
3 US 2012 2
4 US 2013 2
5 EU 2010 2
6 EU 2011 2
7 EU 2012 1
8 EU 2013 1
I did try something like:
In:
df[:cumValue] = copy(df[:value])
[r[:cumValue] .= df[(df[:region] .== r[:region]) & (df[:year] .== (r[:year]-1)),:cumValue] for r in eachrow(df) if r[:year] != minimum(df[:year])]
But I get the following error:
Out: LoadError: MethodError: no method matching broadcast!(::Base.#identity, ::Int64, ::DataArrays.DataArray{Int64,1})
df = DataFrame(region = ["US","US","US","US","EU","EU","EU","EU"],
year = [2010,2011,2012,2013,2010,2011,2012,2013],
value = [3,3,2,2,2,2,1,1])
df[:cumValue] = cumsum(df[:value])
df
region year value cumValue
1 US 2010 3 3
2 US 2011 3 6
3 US 2012 2 8
4 US 2013 2 10
5 EU 2010 2 12
6 EU 2011 2 14
7 EU 2012 1 15
8 EU 2013 1 16
Itβs not as efficient, but you can also do things like df[:cumValue] = [sum(df[1:i, :value]) for i in 1:nrow(df)], which can be adapted if you need to do something other than sum the value
Thank you, but I need to subgroup the summing by region (sorry I didnβt specified).
At the end I found:
with a for loop (maybe more clear):
df[:cumValue] = cumsum(df[:value])
for r in eachrow(df)
if r[:year] != minimum(df[:year])
value = df[(df[:region] .== r[:region]) & (df[:year] .== (r[:year]-1)),:cumValue]
r[:cumValue] = value[1] + r[:value]
end
end
with list comprehension:
df[:cumValue] = cumsum(df[:value])
[r[:cumValue] = df[(df[:region] .== r[:region]) & (df[:year] .== (r[:year]-1)),:cumValue][1] + r[:value] for r in eachrow(df) if r[:year] != minimum(df[:year])]
Out:
region year value cumValue
1 US 2010 3 3
2 US 2011 3 6
3 US 2012 2 8
4 US 2013 2 10
5 EU 2010 2 2
6 EU 2011 2 4
7 EU 2012 1 5
8 EU 2013 1 6
I havenβt compared the performance of these various approaches, but for most βnormalβ sized data sets the difference shouldnβt be worth worrying about. Personally, I find this clearer to read than a βby, doβ loop, especially if you start doing a lot of successive transformations and manipulations.
Thank you. Both @piever and @ElOceanografo solutions are much more efficient (and cleaner) than the βmanualβ naive solution I found by myself. Good I did ask
I have a question along this line, especially @ElOceanografo response, been playing with DataFramesMeta, which is awesome so far, but stuck on trying to get more conditional parameters to work.
So in the example in this thread, how would I go about adding a condition to the transform(CumValue) statement if say I wanted to have another column that added only for the last 2 years, and not everything in the groupby(:region)? like CumvalueTrailing2Years type of column.
@chain df begin
groupby(:state)
@transform :y_rolling_mean = rollmean(:y, 12)
end
But if your data is in dates and isnβt like, every day or every week, meaning x[i-5] always meaning the same thing, then I think things are harder, and I donβt know what exactly the solution would be.
but I dont see it allowing for product like functions, or rolling sums. ie. it doesnt have cumprod or cumsum equivalent that i can see in the docs. (might be blindβ¦)
and hopefully this type of solution works on cumprod, other functions. But at the base of it, i have a bunch of datasets with Date / Values / GroupTypes, where I need to add columns that are trailing in nature doing cumulative product function.
with example above, you can just do β=cumprod(:value)β, but then thats just doing it by your grouping function, and no way I know of to give is depedency on another column group, say Year in that above df.
What kind of output do you want? Because I am having trouble understanding what kind of output you actually want. Forget about grouping for the time being, I think.
oh the output would just be another column in the df, thats simply cumulative sum of last 2 years per group. So you need 2 inputs, you cant have a single vector input in this problem. cumsum is dependent on a column thats not part of the groupby.
depends on the condition of the function, you would end up with earlier parts of the array as null (or defaulted value of some sort).
so if X = [ 2001, 2002, 2003, 2004], and Y = [ 2, 3, 6, 4] and you are rolling sum every 2 years, your theoretical column Z = [ null, 5, 9, 10], as for 2001, there is no last 2 years to sum.