RollingFunctions with a variable window width

How do I go about applying a running function to a dataframe column with variable window length?
Consider for example:

using Chain, DataFrames, RollingFunctions, Statistics  # Chain for @chain, Statistics for std

df = DataFrame(:a => rand(100), :b => repeat([ x for x in 1:10 if isodd(x) ], 20))

Now say I want to use df.b to determine the windowspan argument to running, but I am in a long @chain. Here is my intuition, but it throws a MethodError because I’m passing a vector to running as the windowspan:

@chain df begin
    # a lot of code
    # I know there is `runstd` but my real use has a bit more to it
    transform([:a, :b] => (a, b) -> running(x -> std(x), a, b))
end

Use RollingFunctions.jl (GitHub: JeffreySarnoff/RollingFunctions.jl): roll a window over data; apply a function over the window.

Sorry, I hit enter prematurely on my post. My first post has now been edited to include the rest of the content.

I am not fully clear on what you want. Can you write what you want without using transform, assuming a and b are plain variables? Then I can help you translate it to operation specification syntax.

sure thing:

a = rand(100)
b = repeat([x for x in 1:10 if isodd(x)], 20)

running(x -> std(x), a, b) # This doesn't work

For example, if b is 6, then I want running(x -> std(x), a, 6)

But my problem is that in your case b is a vector. Do you mean you want:

running.(std, Ref(a), b)

i.e. apply std to a for each value in vector b?

But my problem is that b is a vector

This is the problem I am running into in the DataFrames context. For example, if we go back to this:

df = DataFrame(:a => rand(100), :b => repeat([ x for x in 1:10 if isodd(x) ], 20))

Then what I really want is to say “if b is 3, do running(x -> std(x), a, 3); if b is 5, do running(x -> std(x), a, 5); etc.” Is it a problem that I’ve laid out my data like this in general?

To put it another way, if I were writing this in dplyr, what I would do is:

library(dplyr)
library(zoo)
df %>%
    mutate(
        foo = rollapply(data = a, width = b, FUN = sd)
    )

OK - so you want to apply different window spans for different elements of a and the spans should come from b? Then I think RollingFunctions.jl does not have it implemented currently. @JeffreySarnoff probably can confirm this.

Currently, the simplest options are probably to either:

  1. write custom code (this will be the most efficient)
    or
  2. compute several vectors with different fixed rolling windows and then, for each element of the output, pick the value from the correct vector (this works if you have only a few distinct window sizes, but it is a workaround)
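
Option 2 can be sketched in plain Julia roughly as follows. This is only an illustration under stated assumptions: `fixed_trailing` and `pick_by_width` are hypothetical helper names (not part of RollingFunctions.jl), the windows are trailing, and windows at the start of the data are truncated rather than padded.

```julia
using Statistics

# Hypothetical helper: trailing window of fixed width w, truncated at the start.
fixed_trailing(f, a, w) = [f(@view a[max(1, i - w + 1):i]) for i in eachindex(a)]

# Hypothetical helper: one fixed-width pass per distinct width, then pick,
# for each position i, the value from the pass whose width equals b[i].
function pick_by_width(f, a, b)
    passes = Dict(w => fixed_trailing(f, a, w) for w in unique(b))
    return [passes[b[i]][i] for i in eachindex(a, b)]
end

a = [1.0, 2.0, 4.0, 7.0, 11.0, 16.0]
b = [1, 3, 1, 3, 1, 3]
result = pick_by_width(mean, a, b)
```

The cost grows with the number of distinct widths, since each one needs a full pass over the data, which is why this only makes sense for a handful of window sizes.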

Thanks for this. I’ll make a PR to RollingFunctions if I write anything of value.

Actually, after digging into RollingFunctions.jl, it turns out you can get the desired effect. Try the following:

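using RollingFunctions, Statistics

# Trailing window whose width at each position comes from b.
# The two-vector form of running hands the function matching windows of a and b;
# the last element of the b window is the width for the current position.
ragged_run(f, a, b) =
    running((d1, d2) -> (w = min(length(d1), Int(d2[end]));
                         f(@view d1[end-w+1:end])),
            a, float.(b), maximum(b))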

Now ragged_run(std, a, b) should work, and it is fairly efficient (though not as efficient as a bespoke function, which would take about the same amount of code).
This is possible because of some extra features quietly lurking in RollingFunctions.jl, which I discovered while checking the package just now.

Excellent. As noted - @JeffreySarnoff is probably the best person to discuss adding features/improving documentation of the package.

Present. What may I do?

In R, rollapply(data = a, width = b, FUN = sd) accepts a vector for b, in which case it performs the rolling operation with a variable window width specified by b individually for each observation. In RollingFunctions.jl the window width is currently fixed for all observations. Thank you!

Do you intend that
rollapply(data = [1,2,3,4,5,6,7,8], width = [2,3,3], fn=mean)
return [mean(1,2), mean(3,4,5), mean(6,7,8)] or something different?
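
For comparison, the block-wise reading asked about here (widths that tile the data rather than one width per observation) could be sketched as below. `blockwise` is a hypothetical name used only for illustration; it is not a RollingFunctions.jl or zoo function, and the requirement that the widths sum to the data length is an assumption of this sketch.

```julia
using Statistics

# Hypothetical helper: split `data` into consecutive blocks whose lengths are
# given by `widths`, and apply `f` to each block.
function blockwise(f, data, widths)
    sum(widths) == length(data) ||
        throw(ArgumentError("widths must sum to length(data)"))
    stops = cumsum(widths)           # last index of each block
    starts = stops .- widths .+ 1    # first index of each block
    return [f(@view data[s:e]) for (s, e) in zip(starts, stops)]
end

blocks = blockwise(mean, [1, 2, 3, 4, 5, 6, 7, 8], [2, 3, 3])
# blocks are 1:2, 3:5 and 6:8, so blocks == [1.5, 4.0, 7.0]
```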

As I understand it, width is the same size as data, and
rollapply(data = [1,2,3,4,5,6], width = [2,3,3,2,3,2], fn = mean)
should return
[mean(1), mean(1,2), mean(1,2,3), mean(3,4), mean(3,4,5), mean(5,6)]


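using DataFrames, Statistics

# Apply f over a trailing window that ends at position i and has width w[i];
# windows are truncated at the start of the data.
function runvarw(f, d, w)
    # A comprehension lets the element type follow f's results,
    # so integer data with a float-valued f works too.
    return [f(@view d[max(1, i - e + 1):i]) for (i, e) in enumerate(w)]
end

transform(df, [:a, :b] => (x, y) -> runvarw(mean, x, y))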
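
As a quick check of the per-observation reading against the small example above (data 1 to 6 with widths [2,3,3,2,3,2]), here is a self-contained sketch. `trailing_apply` is a hypothetical stand-in name; it assumes trailing windows truncated at the start of the data.

```julia
using Statistics

# Hypothetical helper: trailing window ending at i with width w[i],
# truncated at the start of the data.
trailing_apply(f, d, w) =
    [f(@view d[max(1, i - w[i] + 1):i]) for i in eachindex(d, w)]

data  = [1, 2, 3, 4, 5, 6]
width = [2, 3, 3, 2, 3, 2]
means = trailing_apply(mean, data, width)
# windows: [1], [1,2], [1,2,3], [3,4], [3,4,5], [5,6]
# means == [1.0, 1.5, 2.0, 3.5, 4.0, 5.5]
```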

width is the span of the window; the length of the data tends to be greater.
Please explain the use case.

If I had to make an assumption, it would be that a and b are recyclable to a common size - in other words, that length(a) % length(b) == 0. I think R’s “recycle everything” approach leads to some hard to detect bugs.

One additional plus of assuming that is that it works really well in a DataFrames context - if you want to apply a RollingFunction over one column and use another column’s values as windows, for example.
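
One way to make that assumption explicit, rather than recycling silently as R does, is sketched below. `recycle_widths` is a hypothetical helper name, not an existing function in any package mentioned here.

```julia
# Hypothetical helper: repeat `b` up to the length of `a`, but only when the
# lengths divide evenly; otherwise fail loudly instead of recycling silently.
function recycle_widths(a, b)
    n, m = length(a), length(b)
    (m > 0 && n % m == 0) ||
        throw(DimensionMismatch("length(a) = $n is not a multiple of length(b) = $m"))
    return repeat(b, n ÷ m)
end

widths = recycle_widths(zeros(10), [1, 3, 5, 7, 9])
# widths == [1, 3, 5, 7, 9, 1, 3, 5, 7, 9]
```

In a DataFrames context the two columns already have the same length, so the check passes trivially and the widths are used as they are.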

In the use case that @rocco_sprmnt21 describes, width is a vector of windows with the same length as the data vector. Each window applies to an individual data point.

Applying a rolling function is a special case of joining the dataset to itself. With FlexiJoins.jl:

julia> using StructArrays

# source table
julia> tbl = (i=1:100, a=rand(100), b=repeat([ x for x in 1:10 if isodd(x) ], 20)) |> StructArray
100-element StructArray(::UnitRange{Int64}, ::Vector{Float64}, ::Vector{Int64}) with eltype NamedTuple{(:i, :a, :b), Tuple{Int64, Float64, Int64}}:
 (i = 1, a = 0.09206398653048098, b = 1)
 (i = 2, a = 0.7537127847894282, b = 3)
 (i = 3, a = 0.8240248053711017, b = 5)
 (i = 4, a = 0.11441701169052021, b = 7)
 (i = 5, a = 0.16367689060640156, b = 9)
 (i = 6, a = 0.9349278192831513, b = 1)
 (i = 7, a = 0.9762688379519132, b = 3)
 (i = 8, a = 0.3151435725496834, b = 5)
...

julia> using Statistics, DataPipes, FlexiJoins

julia> @p let
    # join tbl to itself, so that L.i ∈ R.i ± R.b
    innerjoin((L=tbl, R=tbl), by_pred(:i, ∈, x -> x.i ± x.b); groupby=:R)
    # aggregate L.a with std()
    map((;_.R..., a_runstd=std(_.L.a)))
end
100-element StructArray(::Vector{Int64}, ::Vector{Float64}, ::Vector{Int64}, ::Vector{Float64}) with eltype NamedTuple{(:i, :a, :b, :a_runstd), Tuple{Int64, Float64, Int64, Float64}}:
 (i = 1, a = 0.09206398653048098, b = 1, a_runstd = 0.4678563520128315)
 (i = 2, a = 0.7537127847894282, b = 3, a_runstd = 0.3662641250205392)
 (i = 3, a = 0.8240248053711017, b = 5, a_runstd = 0.3861777662491635)
 (i = 4, a = 0.11441701169052021, b = 7, a_runstd = 0.37359812683740595)
 (i = 5, a = 0.16367689060640156, b = 9, a_runstd = 0.35945720455705743)
 (i = 6, a = 0.9349278192831513, b = 1, a_runstd = 0.4576830686002579)
 (i = 7, a = 0.9762688379519132, b = 3, a_runstd = 0.3961279439019275)
 (i = 8, a = 0.3151435725496834, b = 5, a_runstd = 0.357603729530411)

It’s somewhat less efficient than very specialized solutions, but these joins generalize much more easily to other similar problems.
As you see, this rolling function application is built from general-purpose basic building blocks.
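
To make the join semantics concrete without any extra packages, here is a plain-Julia sketch of the same idea. It assumes that `x.i ± x.b` denotes the closed interval from i - b to i + b, so the window is centered on each row rather than trailing. `centered_join_apply` is a hypothetical name, and the data are the first 8 rows printed above.

```julia
using Statistics

# For each row R, gather every row L whose index lies in the closed interval
# [R.i - R.b, R.i + R.b] (clipped to the data), then aggregate L.a with f.
function centered_join_apply(f, a, b)
    n = length(a)
    return [f(@view a[max(1, i - b[i]):min(n, i + b[i])]) for i in eachindex(a, b)]
end

a = [0.09206398653048098, 0.7537127847894282, 0.8240248053711017,
     0.11441701169052021, 0.16367689060640156, 0.9349278192831513,
     0.9762688379519132, 0.3151435725496834]
b = [1, 3, 5, 7, 9, 1, 3, 5]
runstd = centered_join_apply(std, a, b)
```

With only these 8 rows, the rows whose windows fit entirely inside them (for example rows 1, 2 and 6) should reproduce the a_runstd values shown above. Later rows with wide windows will differ, because the full table has 100 rows.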


I already handle non-common sizes without resorting to R’s “anything is everything” approach. If datalength % windowsize (or, perhaps, % a composite of the window sizes) is nonzero, then the remainder acts as missings placed at either the front or the end of the data (user selectable; thank you @bkamins). I would still like to see a real-world use case where having distinct window sizes for many distinct points in the data is important.
