How to compute a "cumulative" in a dataframe (without a for loop)

ahhh

using RollingFunctions
df9 = DataFrame(region = ["US","US","US","US","EU","EU","EU","EU"],
               year   = [2010,2011,2012,2013,2010,2011,2012,2013],
               value  = [3,3,2,2,2,2,1,1]) 

  df9 = @linq df9 |>
  groupby(:region) |>
  transform(SumValue = ["missing"; rolling(sum, :value, 2)])

I think that works!
I see your point in having to indicate it “missing” or null starting value… the rolling function isnt smart enough to basically help the user out with that.
hmm… will be painful on a large dataset.

Note that missing is not the same as "missing".

It shouldn’t be that painful. There are easy ways to write a small helper function that won’t use up as much memory. But no need to re-write for optimization until you actually see a bottleneck.

true, but “missing” and missing functions the same way, ie. same limitation is it not?

for example, when you want to say 3 prior row roll, you need to literally add another “missing”, or it wont work.

df9 = DataFrame(region = ["US","US","US","US","EU","EU","EU","EU"],
               year   = [2010,2011,2012,2013,2010,2011,2012,2013],
               value  = [3,3,2,2,2,2,1,1]) 

  df9 = @linq df9 |>
  groupby(:region) |>
  transform(SumProd = [missing; missing; rolling(prod, :value, 3)])

in a large dataset where you might not know the exact amount, or have like last 30 periods rolls, this would become impractical, or am i missing some easy fix?

No. They do not function the same way. Using the value missing will have much better performance than the string "missing". And there are lots and lots of utilities to make working with missing simple compared to the string "missing".

No, I would just do

transform(SumProd = vcat(fill(missing, t-1), rolling(prod, :value, t))`

or write a helper function to do that for you.

julia> function rolling_pad(f, x, t)
       z = rolling(f, x, t)
       return vcat(fill(missing, t-1), z)
       end
rolling_pad (generic function with 1 method)

julia> rolling_pad(sum, Y, 2)
4-element Vector{Union{Missing, Float64}}:
   missing
  5.0
  9.0
 10.0

julia> rolling_pad(sum, Y, 3)
4-element Vector{Union{Missing, Float64}}:
   missing
   missing
 11.0
 13.0
1 Like

gotcha, much appreciated!

participants in this thread
Please let me know of questions you may have on the use of RollingFunctions.jl.


This package supports two kinds of applicative windows sliding over sequential data.

  • the sequential data (index ordered) contains (N) values, here indexed 1:N.
  • the operational window size (W) is given and W <= N.

The familiar kind rolls over data and with each advance applies a function to the subsequence spanned by the window. We call this approach a rolling function.

  • a rolling max-of-n or mean-of-n taken over a data sequence.
  • a windowed volatility calculation applied over a time series.

vanilla rolling functions take
a sequence of N (length) values and a window span W (count of indicies).
and return
a sequence of N - W + 1 elements (calculated summary values)


We offer a second kind of applicative window; one that preserves the length of the given data sequence in the length of the value sequence that results. We call this approach a running function.

  • for most of the data, the corresponding rolling function applies
  • to obtain the remaining values, a tapering version of the function is applied.

vanilla running functions take
a sequence of N (length) values and a window span W (count of indicies).
and return
a sequence of N elements (calculated summary values)


There is more :slight_smile: see the README

  • arbitrary and normalized weights may be used within a window
  • there are an assortment of predefined rolling and running functions
  • you may define your own rolling / running functions easily
  • covariance and correlation is available for paired data sequences
  • you may define your own rolling / running functions of two data streams
4 Likes

I’m not sure I understand what you’re trying to do, but if it’s “in every row, put the sum of the value column for the two rows above, except when there aren’t two above, in which case put missing”, you can combine RollingFunctions with ShiftedArrays like this:

julia> using DataFrames, ShiftedArrays, RollingFunctions

julia> df = DataFrame(region = ["US","US","US","US","EU","EU","EU","EU"],
                      year   = [2010,2011,2012,2013,2010,2011,2012,2013],
                                     value  = [3,3,2,2,2,2,1,1]);

julia> transform(groupby(df, :region),
           :value => (x -> running(sum, lag(x, 1), 2)) => :cumval)
8×4 DataFrame
 Row │ region  year   value  cumval    
     │ String  Int64  Int64  Float64?  
─────┼─────────────────────────────────
   1 │ US       2010      3  missing   
   2 │ US       2011      3  missing   
   3 │ US       2012      2        6.0
   4 │ US       2013      2        5.0
   5 │ EU       2010      2  missing   
   6 │ EU       2011      2  missing   
   7 │ EU       2012      1        4.0
   8 │ EU       2013      1        3.0

So EU in 2013 is 3 (sum of 2012 and 2011), in 2012 is 4 (sum of 2010 and 2011) while in 2011 it’s missing as there’s no 2009 observation.

2 Likes

Apologies for not reading the documentation more closely! Indeed, OP wants

running(sum, x, 2)
1 Like

@pdeffebach & @JeffreySarnoff - running function works nicely! Much appreciated for the help!

will ReadME more on the RollingFunctions.jl, that package looks like it will come in handy very often!

**also, discord not allowing me to reply until i wait 24 hours because i replied too much already on day 1 is a tad frustrating i must say.

almost for fun



using ShiftedArrays
function runsh(arr, sh)
    if sh>0
    d=cumsum(arr).-ShiftedArray(cumsum(arr),sh,default=0)
    ShiftedArray(collect(ShiftedArray(d,1-sh)),sh)
    else
        return arr
    end
end
                     
transform(groupby(df, :region),
           :value => (x -> runsh(x, 2)) => :cumval)

PS
without the collect function between the two nested shiftdearray the result is not as expected.
Can anyone explain why?

PPS
note tha runsh (arr, sh) also works for sh = 0 and keeps the data type of arr
and
it works even if sh is greater than the length of some of the groups

hey @JeffreySarnoff - qq (i hope) on the RollingFunctions package. I am running into an issue when using Dataframes & Rollingfunction, in similar way that was done in this thread. Specifically I get an error:
"
nested task error:
Bad window span (12) for length 10.
"
when doing this:
"
transform(Trailing12 = running(prod, (:Growth .+1), 12).-1)
"

the odd thing about it is that when I pass a df to it with sample data, it runs just fine, then I pass the full dataset which is about 100x larger, I get the Bad Window span error. Cant quite figure out why that would be happening, as the data set is the same, just more of it… ie. 200k rows vs 2k rows. Hoping it’s something simple happening here?

seems like whenever the df is over 8200 x 8, or so, it starts to show that error message… hm…

I need a minimal working (or failing) example.

hmm…not super sure how to share a failing example…data is private, but cant replicate this error with a random filled df.
doesnt seem to be a pure size issue, as I did a quick test on that right now, doing this:

bigDf = DataFrame(i = 1:1000000, Growth = rand(1000000), Categories = rand(["a", "b", "c", "d", "f", "g"], 1000000))

TimedDfTest = @linq bigDf |>
              groupby(:Categories) |>
              transform(Trailing12 = running(prod, (:Growth .+1), 12).-1)

and it worked just fine.
yet with my actual data it seems to result in that odd error message the moment i go over 8200 rows in the dataframe being passed to it.

is the error message itself indicative of anything potentially?

Think i know whats going on… but not sure how to work around it best.

Timeddf = @linq dfData |>
          groupby(:ID) |>
          transform(Trailing12 = running(prod, (:Growth .+1), 12).-1) 

the 12 in there, there are ID categories that only have up to 10 rows, so it gets broken the moment it tries to do 12 passes where only 10 rows exist for the groupby category.
Hmmm… is there a simple way to tell it to ignore those? alternatively I guess i can filter out original df to ensure those dont occur.

I will let a DataFrames expert answer that. As a general principal, keeping the data flow clean at each transformation place is good practice.

so below fails same way as I have error with my dataset, I think it’s the way running function is trying to use the Groupby, where i will only be 1 row for all of the dataframe, and trying to ask it do running 2 prior crashes the process.

bigDf = DataFrame(i = 1:100, Growth = rand(100), Categories = rand(["a", "b", "c", "d", "f", "g"], 100))

TimedDfTest = @linq bigDf |>
               groupby(:i) |>
                transform(Trailing12 = running(prod, (:Growth .+1), 2).-1)

The error does not have to do with DataFrames, here is an MWE


julia> running(prod, [1], 2)
ERROR: 
	Bad window span (2) for length 1.

Stacktrace:
 [1] nrolled
   @ ~/.julia/packages/RollingFunctions/4Jh9c/src/support.jl:20 [inlined]
 [2] running(fun::Function, data::Vector{Int64}, windowspan::Int64)
   @ RollingFunctions ~/.julia/packages/RollingFunctions/4Jh9c/src/run/running.jl:8
 [3] top-level scope
   @ REPL[21]:100: 

You can’t have a window size longer than the length of the vector.

if i understand linq action correctly here, it loops through the df, by given group, and allows the transform function to run with it’s specified action (prod or sum in this case).

the error also doesnt fully make sense to me, as what makes the window size here?

The only thing the grouped transform is doing here is passing sub-vectors to the given function. See the following

julia> df = DataFrame(a = [1, 1, 2, 2], b = rand(4));

julia> @chain df begin 
           groupby(:a)
           @transform :x = begin 
               @show typeof(:b)
               :b
           end
       end
typeof(var"##267") = SubArray{Float64, 1, Vector{Float64}, Tuple{SubArray{Int64, 1, Vector{Int64}, Tuple{UnitRange{Int64}}, true}}, false}
typeof(var"##267") = SubArray{Float64, 1, Vector{Float64}, Tuple{SubArray{Int64, 1, Vector{Int64}, Tuple{UnitRange{Int64}}, true}}, false}
4×3 DataFrame
 Row │ a      b         x        
     │ Int64  Float64   Float64  
─────┼───────────────────────────
   1 │     1  0.829674  0.829674
   2 │     1  0.101632  0.101632
   3 │     2  0.926839  0.926839
   4 │     2  0.561204  0.561204

This SubArray type has nothing to do with DataFrames. It behaves just like a normal Julia vector.

If a group has just one row in it, you will get a command equivalent to the MWE I gave above.

It’s in your code, the 2 in the call below taken from your code above.

running(prod, (:Growth .+1), 2)

are Julia arrays known to have odd issues?
in my working data example, I have the code working completely fine with a data set I load into a df, then I add another 200 rows to the df with an additional ID, and I start getting the Bad window span errors.

I know it’s not a data issue at this point, I’ve taken that new ID & associated data set, and run it through the linq transform code by itself, and it works fine. So i am completely stumped by Julia array all of a sudden thinks window span is smaller than the vector - as it’s not, I’ve manually looked through the data, and obviously it works fine on it’s own with the same code.