# How to compute a rolling standardization using DataFrames

``````using DataFrames
data = DataFrame(time = rand(1:100, 1_000_000), val = rand(1_000_000))
``````

Consider the above data frame: I wish to compute the average `val` within a rolling window and normalize each `val` using the rolling average and standard deviation. It's almost like a moving average, but notice that each time period has multiple values.

E.g. say I am looking at a row with value `(time = 7, val = 0.551)` and I wish to normalize `val` using a 12-time-period mean and standard deviation. I would need to compute the mean and standard deviation over all rows with `time` in 2:13 (since it's a 12-month moving window taking 5 values from the past and 6 from the future), and then standardize every `val` the same way.
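For concreteness, the single-window case described above can be sketched like this (a minimal illustration, not part of the original post; the variable names are my own):

```julia
using DataFrames, Statistics

data = DataFrame(time = rand(1:100, 1_000), val = rand(1_000))

# window for time = 7 with 5 past and 6 future periods: time in 2:13
w = data[2 .<= data.time .<= 13, :val]
m, s = mean(w), std(w)

# standardize a single observation against that window
z = (0.551 - m) / s
```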

The real problem is slightly more complicated and I would need to do the same for more columns, but the basic idea is the same.

The only way I can think of now is to do a `groupby` for each time period (month), which would involve computing it almost 100 times.

Is there a library that has this implemented already?

``````using DataFrames, Chain, DataFrameMacros, Statistics

data = DataFrame(time = rand(1:100, 1_000_000), val = rand(1_000_000))

# mean and std of `val` over the window t-5:t+6
function summarise(data, t)
    tmp = @chain data begin
        @subset t - 5 <= :time <= t + 6
        @combine(:meanval = mean(:val), :stdval = std(:val))
    end
    only(tmp.meanval), only(tmp.stdval)
end

summ = [summarise(data, t) for t in 1:100]
df_summ = DataFrame(time = 1:100, meanval = first.(summ), stdval = last.(summ))

data_fnl = @chain data begin
    leftjoin(df_summ, on = :time)
    @transform :val_normalised = (:val - :meanval) / :stdval
end
``````

Did you have something similar to `DataFrame.rolling` from pandas in mind?

While `DataFrames` is undoubtedly one of the most popular ways to deal with tabular data, it is worth noting that there are alternatives which can be better suited for particular tasks. For example, TimeSeries.jl has a special section devoted to window manipulations (Apply methods · TimeSeries.jl). Alternatively, there is the not-so-frequently-updated but useful package RollingFunctions.jl.

It’s a bit harder than a plain rolling window, since there are multiple observations per time point.


You probably cannot skip the 100 calculations, but you can make them faster.

The `groupby` function gives the starts and ends of each group, so instead of using `@subset` you can iterate over these starts and ends to filter observations.
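One way to realize this suggestion (a sketch under my own assumptions, not code from the original post): aggregate per-time count, sum, and sum of squares once with `groupby`/`combine`, then form each window's mean and std from cumulative sums, so the raw rows are only scanned a single time. This assumes every time value in the range actually occurs, so that row index and time coincide after sorting; the column names are my own.

```julia
using DataFrames, Statistics

data = DataFrame(time = rand(1:100, 1_000_000), val = rand(1_000_000))

# one pass over the raw rows: per-time count, sum, and sum of squares
per_t = combine(groupby(data, :time),
                nrow => :n,
                :val => sum => :s,
                :val => (v -> sum(abs2, v)) => :ss)
sort!(per_t, :time)

# cumulative sums let each window be assembled in O(1)
cn, cs, css = cumsum(per_t.n), cumsum(per_t.s), cumsum(per_t.ss)

T = nrow(per_t)
meanval = Vector{Float64}(undef, T)
stdval  = Vector{Float64}(undef, T)
for i in 1:T
    lo, hi = max(i - 5, 1), min(i + 6, T)          # 5 past, 6 future periods
    n  = cn[hi]  - (lo > 1 ? cn[lo-1]  : 0)
    s  = cs[hi]  - (lo > 1 ? cs[lo-1]  : 0.0)
    ss = css[hi] - (lo > 1 ? css[lo-1] : 0.0)
    meanval[i] = s / n
    stdval[i]  = sqrt((ss - s^2 / n) / (n - 1))    # sample standard deviation
end

df_summ = DataFrame(time = per_t.time, meanval = meanval, stdval = stdval)
```

`df_summ` can then be `leftjoin`ed back onto `data` exactly as in the earlier answer.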

A colleague showed me Spark's (and SQL's?) `pyspark.sql.Window.rangeBetween` (PySpark 3.1.2 documentation), which can do this. Interesting.

@xiodai, have you also taken a look at the RollingTimeWindows.jl package and associated discourse post?


No, but I will check them out now.

Author of `RollingTimeWindows` (and the associated post) here. I’m happy to give you commit access if you want to make any modifications to `RollingTimeWindows` for your particular use case (e.g. adding fixed-width iteration, or functions that live at a higher level of abstraction than the iterator) – just ping me with your GitHub username.