How to compute the rolling standardization using DataFrames

using DataFrames
data = DataFrame(time = rand(1:100, 1_000_000), val=rand(1_000_000))

Consider the above data frame. Say I wish to compute the average val within a rolling window and normalize each val using the rolling average and standard deviation. It’s almost like a moving average, but note that each time period has multiple values.

E.g. say I am looking at a row with (time = 7, val = 0.551) and I wish to normalize val using a 12-time-period mean and standard deviation. I would need to compute the mean and standard deviation for all rows with time in 2:13 (since it’s a 12-period moving window taking 5 values from the past and 6 values from the future), and then standardize the value using them.
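To make the window concrete, here is a minimal sketch for a single row (the value 0.551 and the column names are taken from the example above; this is just an illustration, not an efficient solution):

```julia
using DataFrames, Statistics

data = DataFrame(time = rand(1:100, 1_000_000), val = rand(1_000_000))

# For a row with time == 7, a 12-period window taking 5 values from the
# past and 6 from the future covers times 2:13.
t = 7
window = (t - 5):(t + 6)                       # 2:13
vals = data.val[in.(data.time, Ref(window))]   # all rows whose time falls in the window
m, s = mean(vals), std(vals)
z = (0.551 - m) / s                            # the standardized value
```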

The real problem is slightly more complicated and I would need to do the same for more columns, but the basic idea is the same.

The only way I can think of now is to do a groupby for each time period, which would involve computing it almost 100 times.

Is there a library that has this implemented already?

using DataFrames, Chain, DataFrameMacros, Statistics

data = DataFrame(time = rand(1:100, 1_000_000), val=rand(1_000_000))

function summarise(data, t)
    tmp = @chain data begin
        @subset t - 5 <= :time <= t + 6
        @combine(:meanval = mean(:val), :stdval = std(:val))
    end
    tmp.meanval[1], tmp.stdval[1]
end

df_summ = DataFrame(time = 1:100, mean_std = [summarise(data, t) for t in 1:100])

data_fnl = @chain data begin
    leftjoin(df_summ, on = [:time])
    @transform :val_normalised = (:val - :mean_std[1]) / :mean_std[2]
end

Did you have something similar to DataFrame.rolling from pandas in mind?

While DataFrames is undoubtedly one of the most popular ways to deal with tabular data, it is worth noting that there are alternatives which can be better suited for particular tasks. For example, TimeSeries.jl has a special section devoted to window manipulations (Apply methods · TimeSeries.jl). Alternatively, there is the not-so-frequently-updated but useful package RollingFunctions.jl.
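For the simpler one-value-per-period case, RollingFunctions can compute the rolling statistics directly. A minimal sketch, assuming its `rollmean`/`rollstd` functions, which operate on plain vectors with trailing windows:

```julia
using RollingFunctions

x = rand(100)
w = 12

# rollmean/rollstd return length(x) - w + 1 values, one per complete window
rm = rollmean(x, w)
rs = rollstd(x, w)

# standardize the last value of each window using that window's statistics
z = (x[w:end] .- rm) ./ rs
```

Note that this assumes exactly one observation per time point, which is not quite the situation in the question.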

It’s a bit harder than just rolling, since there are multiple observations per time point.


You probably cannot skip the 100 calculations, but you can make them faster.

The groupby function gives the starts and ends of each group, so instead of using subset you can go through these starts and ends to filter observations.
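One way to sketch that idea (a hypothetical approach, not the only one): collect per-time sufficient statistics in a single groupby pass, then combine them across each window, so the million rows are only scanned once instead of 100 times:

```julia
using DataFrames, Statistics

data = DataFrame(time = rand(1:100, 1_000_000), val = rand(1_000_000))

# One pass over the data: per-time count, sum, and sum of squares.
g = combine(groupby(data, :time),
            :val => length => :n,
            :val => sum => :s,
            :val => (v -> sum(abs2, v)) => :s2)
sort!(g, :time)

# Combine the per-time statistics over the window t-5:t+6 (only 100 rows to scan).
function window_stats(g, t)
    rows = @view g[(g.time .>= t - 5) .& (g.time .<= t + 6), :]
    n, s, s2 = sum(rows.n), sum(rows.s), sum(rows.s2)
    m = s / n
    sd = sqrt((s2 - n * m^2) / (n - 1))   # sample std from sufficient statistics
    m, sd
end

df_summ = DataFrame(time = 1:100,
                    mean_std = [window_stats(g, t) for t in 1:100])
```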

A colleague showed me Spark’s (and SQL’s?) pyspark.sql.Window.rangeBetween (PySpark 3.1.2 documentation), which can do this. Interesting.

@xiodai, have you also taken a look at the RollingTimeWindows.jl package and associated discourse post?


No. But I will check them out now

Author of RollingTimeWindows (and the associated post) here. I’m happy to give you commit access if you want to make any modifications to RollingTimeWindows (e.g. adding fixed-width iteration or functions that live at a higher level of abstraction than the iterator) for your particular use-case – just ping me with your GitHub username :slight_smile:
