RollingFunctions with a variable window width

Derek_Vetsch · September 21, 2022, 8:34pm

How do I go about applying a running function to a dataframe column with variable window length?
Consider for example:

using DataFrames, RollingFunctions, Statistics, Chain

df = DataFrame(:a => rand(100), :b => repeat([ x for x in 1:10 if isodd(x) ], 20))

Now say I want to use df.b to determine the windowspan argument to running, but I am in a long @chain… Here is my intuition, but it throws a MethodError because I’m feeding a vector to running as the windowspan:

@chain df begin
    # a lot of code
    # I know there is `runstd` but my real use has a bit more to it
    transform([:a, :b] => (a, b) -> running(x -> std(x), a, b))
end

bkamins · September 21, 2022, 8:36pm

Use RollingFunctions.jl (JeffreySarnoff/RollingFunctions.jl on GitHub): roll a window over data; apply a function over the window.

Derek_Vetsch · September 21, 2022, 8:43pm

Sorry I had prematurely hit enter on my post - my first post has been edited with the rest of the content

bkamins · September 21, 2022, 8:52pm

I am not fully clear what you want. Can you write what you want without using transform, but just assuming a and b are just variables? Then I can help you translate this to operation specification syntax.

Derek_Vetsch · September 21, 2022, 8:58pm

sure thing:

a = rand(100)
b = repeat([x for x in 1:10 if isodd(x)], 20)

running(x -> std(x), a, b) # This doesn't work

For example, if b is 6, then I want running(x -> std(x), a, 6)

bkamins · September 21, 2022, 9:02pm

But my problem is that in your case b is a vector, do you mean you want:

running.(std, Ref(a), b)

i.e. apply std to a for each value in vector b?
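To make that reading concrete, here is a minimal sketch of what broadcasting over b with Ref(a) produces: one complete rolling result per element of b, not one window width per element of a. The helper trailing is a hypothetical stand-in for a fixed-width rolling function (it is not part of RollingFunctions.jl), and it assumes windows are truncated at the start of the data.

```julia
using Statistics

# Hypothetical stand-in for a fixed-width rolling function: applies f to the
# trailing window of width w ending at each index, truncated at the start.
trailing(f, a, w) = [f(view(a, max(1, i - w + 1):i)) for i in eachindex(a)]

a = [1.0, 2.0, 4.0, 8.0]
b = [1, 2]

# Ref(a) shields a from broadcasting, so the whole vector is used for each w in b.
# The result is a vector of vectors: length(b) results, each of length(a).
per_width = trailing.(mean, Ref(a), b)
```

With two widths and four data points, per_width holds two vectors of four values each.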

Derek_Vetsch · September 21, 2022, 9:08pm

But my problem is that b is a vector

This is the problem I am running into in the DataFrames context. For example, if we go back to this:

df = DataFrame(:a => rand(100), :b => repeat([ x for x in 1:10 if isodd(x) ], 20))

Then what I really want is to say “if b is 2, do running(x -> std(x), a, 2), if b is 4, then do running(x -> std(x), a, 4), etc”. Is it a problem that I’ve laid out my data like this in general?

To put it another way, if I were writing this in dplyr, what I would do is:

library(dplyr)
library(zoo)
df %>%
    mutate(
        foo = rollapply(data = a, width = b, FUN = sd)
    )

bkamins · September 21, 2022, 9:24pm

OK - so you want to apply different window spans for different elements of a and the spans should come from b? Then I think RollingFunctions.jl does not have it implemented currently. @JeffreySarnoff probably can confirm this.

Currently the simplest thing is probably to either:

- write custom code (this will be the most efficient), or
- compute several vectors with different fixed rolling windows and then, for each element of the output, pick the value from the correct vector (this would work if you have only a few distinct window sizes, but it is a workaround).
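The second option might look roughly like the sketch below. It is only an illustration: fixed_window is a hypothetical stand-in for a fixed-width rolling function (with RollingFunctions.jl one would presumably call running(f, a, w) there instead), pick_from_fixed is a made-up name, and windows are assumed to be trailing and truncated at the start of the data.

```julia
using Statistics

# Hypothetical stand-in for a fixed-width rolling function (truncated at the start).
fixed_window(f, a, w) = [f(view(a, max(1, i - w + 1):i)) for i in eachindex(a)]

function pick_from_fixed(f, a, b)
    # One full rolling result per distinct window size.
    by_width = Dict(w => fixed_window(f, a, w) for w in unique(b))
    # For each position, pick the value computed with that position's width.
    return [by_width[w][i] for (i, w) in enumerate(b)]
end

a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
b = [2, 3, 3, 2, 3, 2]
result = pick_from_fixed(mean, a, b)
```

The cost grows with the number of distinct widths, since every width triggers a full pass over a even though only some positions use it.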

Derek_Vetsch · September 21, 2022, 9:29pm

Thanks for this. I’ll make a PR to RollingFunctions if I write anything of value.

Dan · September 21, 2022, 11:18pm

Actually, after digging into RollingFunctions.jl, it turns out you can get the desired effect. Try the following:

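using RollingFunctions, Statistics

# d1 is the current window of a; d2 is the matching window of widths,
# so d2[end] is the width requested at the window's last position.
ragged_run(f, a, b) =
    running((d1, d2) -> (w = min(length(d1), Int64(d2[end]));
                         f(@view d1[end-w+1:end])),
            a, float.(b), maximum(b))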

Now ragged_run(std, a, b) should work, and it is actually pretty efficient (though not as efficient as a bespoke function, which should take about the same amount of code).
This is possible because of some extra features quietly lurking in RollingFunctions.jl I discovered while checking the package just now.

bkamins · September 22, 2022, 6:23am

Excellent. As noted - @JeffreySarnoff is probably the best person to discuss adding features/improving documentation of the package.

JeffreySarnoff · September 23, 2022, 11:19pm

Present. What may I do?

bkamins · September 24, 2022, 8:11am

In R rollapply(data = a, width = b, FUN = sd) accepts b to be a vector, in which case it performs rolling operation with variable window width specified by b individually for each observation. In RollingFunctions.jl b is currently fixed for all observations. Thank you!

JeffreySarnoff · September 24, 2022, 11:41am

Do you intend that
rollapply(data = [1,2,3,4,5,6,7,8], width = [2,3,3], fn=mean)
return [mean(1,2), mean(3,4,5), mean(6,7,8)] or something different?

rocco_sprmnt21 · October 7, 2022, 6:44am

As I understand it, width is the same size as data, and
rollapply(data = [1,2,3,4,5,6], width = [2,3,3,2,3,2], fn = mean)
should return
[mean(1), mean(1,2), mean(1,2,3), mean(3,4), mean(3,4,5), mean(5,6)]


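using DataFrames, Statistics

# Apply f to the trailing window ending at each index i, with width w[i];
# windows are truncated at the start of the data.
function runvarw(f, d, w)
    # A comprehension lets the output element type follow f
    # (e.g. Float64 when taking the mean of an integer vector).
    return [f(view(d, max(1, i - e + 1):i)) for (i, e) in enumerate(w)]
end

transform(df, [:a,:b]=>(x,y)->runvarw(mean, x, y))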
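As a quick sanity check of the semantics described in this post, the sketch below reruns the six-element example. variable_window is a hypothetical helper that mirrors the function above so the snippet stands alone. It also shows a caveat relevant to the original std question: a window with a single element has no sample standard deviation, so std returns NaN there.

```julia
using Statistics

# Hypothetical helper mirroring the function above: trailing window of width
# w[i] ending at index i, truncated at the start of the data.
variable_window(f, d, w) = [f(view(d, max(1, i - wi + 1):i)) for (i, wi) in enumerate(w)]

data  = [1, 2, 3, 4, 5, 6]
width = [2, 3, 3, 2, 3, 2]

means = variable_window(mean, data, width)  # windows: 1, 1:2, 1:3, 3:4, 3:5, 5:6
sds   = variable_window(std, data, width)   # the first window has only one element
```

Here means matches the expected list, [1.0, 1.5, 2.0, 3.5, 4.0, 5.5], while sds[1] is NaN because the first window is truncated to a single observation.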

JeffreySarnoff · October 8, 2022, 1:28am

width is the span of the window; the length of the data tends to be greater.
Please explain the use case.

Derek_Vetsch · October 8, 2022, 11:34am

If I had to make an assumption, it would be that a and b are recyclable to a common size - in other words, that length(a) % length(b) == 0. I think R’s “recycle everything” approach leads to some hard to detect bugs.

One additional plus of assuming that is that it works really well in a DataFrames context - if you want to apply a RollingFunction over one column and use another column’s values as windows, for example.
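A minimal sketch of that assumption, with strict recycling in place of R's permissive kind: the widths are repeated only when the data length is an exact multiple of their length, and anything else is an error. recycle_widths is a hypothetical name, not something RollingFunctions.jl provides.

```julia
# Hypothetical helper: recycle widths b to length n, but only when n is an
# exact multiple of length(b); otherwise fail loudly instead of recycling silently.
function recycle_widths(b, n)
    n % length(b) == 0 ||
        throw(ArgumentError("length $(length(b)) does not divide $n"))
    return repeat(b, n ÷ length(b))
end

recycle_widths([2, 3, 3], 6)   # [2, 3, 3, 2, 3, 3]
```

In the DataFrames case the two columns already have the same length, so the recycled vector is simply b itself.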

bkamins · October 8, 2022, 11:57am

In the use case that @rocco_sprmnt21 describes, width is a vector of window sizes with the same length as the data vector. Each window applies to an individual data point.

aplavin · October 8, 2022, 12:10pm

Applying a rolling function is a special case of joining the dataset to itself. With FlexiJoins.jl:

julia> using StructArrays

# source table
julia> tbl = (i=1:100, a=rand(100), b=repeat([ x for x in 1:10 if isodd(x) ], 20)) |> StructArray
100-element StructArray(::UnitRange{Int64}, ::Vector{Float64}, ::Vector{Int64}) with eltype NamedTuple{(:i, :a, :b), Tuple{Int64, Float64, Int64}}:
 (i = 1, a = 0.09206398653048098, b = 1)
 (i = 2, a = 0.7537127847894282, b = 3)
 (i = 3, a = 0.8240248053711017, b = 5)
 (i = 4, a = 0.11441701169052021, b = 7)
 (i = 5, a = 0.16367689060640156, b = 9)
 (i = 6, a = 0.9349278192831513, b = 1)
 (i = 7, a = 0.9762688379519132, b = 3)
 (i = 8, a = 0.3151435725496834, b = 5)
...

julia> using Statistics, DataPipes, FlexiJoins

julia> @p let
    # join tbl to itself, so that L.i ∈ R.i ± R.b
    innerjoin((L=tbl, R=tbl), by_pred(:i, ∈, x -> x.i ± x.b); groupby=:R)
    # aggregate L.a with std()
    map((;_.R..., a_runstd=std(_.L.a)))
end
100-element StructArray(::Vector{Int64}, ::Vector{Float64}, ::Vector{Int64}, ::Vector{Float64}) with eltype NamedTuple{(:i, :a, :b, :a_runstd), Tuple{Int64, Float64, Int64, Float64}}:
 (i = 1, a = 0.09206398653048098, b = 1, a_runstd = 0.4678563520128315)
 (i = 2, a = 0.7537127847894282, b = 3, a_runstd = 0.3662641250205392)
 (i = 3, a = 0.8240248053711017, b = 5, a_runstd = 0.3861777662491635)
 (i = 4, a = 0.11441701169052021, b = 7, a_runstd = 0.37359812683740595)
 (i = 5, a = 0.16367689060640156, b = 9, a_runstd = 0.35945720455705743)
 (i = 6, a = 0.9349278192831513, b = 1, a_runstd = 0.4576830686002579)
 (i = 7, a = 0.9762688379519132, b = 3, a_runstd = 0.3961279439019275)
 (i = 8, a = 0.3151435725496834, b = 5, a_runstd = 0.357603729530411)

It’s somewhat less efficient than very specialized solutions, but these joins are much easier to generalize to other similar problems.
As you see, this rolling function application is built from general-purpose basic building blocks.

JeffreySarnoff · October 8, 2022, 2:05pm

I already handle non-common sizes without resorting to R’s “anything is everything” approach. If datalength % windowsize (or, perhaps, % some composite of the window sizes) is nonzero, then the remainder will act as missings placed at either the front or the end of the data (user selectable; thank you @bkamins). I would still like to see a real-world use case where having distinct window sizes for many distinct points in the data is important.
