How to compute the rolling standardization using DataFrames

using DataFrames
data = DataFrame(time = rand(1:100, 1_000_000), val=rand(1_000_000))

Consider the above data frame. Say I wish to compute the average val within a rolling window and normalize each val using the rolling average and standard deviation. It’s almost like a moving average, but note that each time period has multiple values.

E.g. say I am looking at a row with (time = 7, val = 0.551) and I wish to normalize val using a 12-time-period mean and standard deviation. I would need to compute the mean and standard deviation for all rows with time in 2:13 (since it’s a 12-period moving window taking 5 values from the past and 6 values from the future), and then standardize the value using them.
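To make the window concrete, here is a minimal sketch for a single row (the value 0.551 and the column names are taken from the example above; this is just an illustration, not an efficient solution):

```julia
using DataFrames, Statistics

data = DataFrame(time = rand(1:100, 1_000_000), val = rand(1_000_000))

# For a row with time == 7, a 12-period window taking 5 values from the
# past and 6 from the future covers times 2:13.
t = 7
window = (t - 5):(t + 6)                       # 2:13
vals = data.val[in.(data.time, Ref(window))]   # all rows whose time falls in the window
m, s = mean(vals), std(vals)
z = (0.551 - m) / s                            # the standardized value
```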

The real problem is slightly more complicated and I would need to do the same for more columns, but the basic idea is the same.

The only way I can think of now is to do a groupby for each time period, which would involve computing it almost 100 times.

Is there a library that has this implemented already?

using DataFrames, Chain, DataFrameMacros, Statistics

data = DataFrame(time = rand(1:100, 1_000_000), val=rand(1_000_000))

function summarise(data, t)
    tmp = @chain data begin
        @subset t - 5 <= :time <= t + 6
        @combine(:meanval = mean(:val), :stdval = std(:val))
    end
    tmp.meanval[1], tmp.stdval[1]
end

df_summ = DataFrame(time = 1:100, mean_std = [summarise(data, t) for t in 1:100])

data_fnl = @chain data begin
    leftjoin(df_summ, on = [:time])
    @transform :val_normalised = (:val - :mean_std[1]) / :mean_std[2]
end

Did you have something similar to DataFrame.rolling from pandas in mind?

While DataFrames is undoubtedly one of the most popular ways to deal with tabular data, it is worth noting that there are alternatives which can be better suited for particular tasks. For example, TimeSeries.jl has a special section devoted to window manipulations (Apply methods · TimeSeries.jl). Alternatively, there is the not-so-frequently-updated but useful package RollingFunctions.jl.
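For the simpler one-value-per-period case, RollingFunctions can compute the rolling statistics directly. A minimal sketch, assuming its `rollmean`/`rollstd` functions, which operate on plain vectors with trailing windows:

```julia
using RollingFunctions

x = rand(100)
w = 12

# rollmean/rollstd return length(x) - w + 1 values, one per complete window
rm = rollmean(x, w)
rs = rollstd(x, w)

# standardize the last value of each window using that window's statistics
z = (x[w:end] .- rm) ./ rs
```

Note that this assumes exactly one observation per time point, which is not quite the situation in the question.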

It’s a bit harder than just rolling, since there are multiple observations per time point.


You probably cannot skip the 100 calculations, but you can make them faster.

The groupby function gives the starts and ends of each group, so instead of using subset you can go through these starts and ends to filter observations.
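One way to sketch that idea (a hypothetical approach, not the only one): collect per-time sufficient statistics in a single groupby pass, then combine them across each window, so the million rows are only scanned once instead of 100 times:

```julia
using DataFrames, Statistics

data = DataFrame(time = rand(1:100, 1_000_000), val = rand(1_000_000))

# One pass over the data: per-time count, sum, and sum of squares.
g = combine(groupby(data, :time),
            :val => length => :n,
            :val => sum => :s,
            :val => (v -> sum(abs2, v)) => :s2)
sort!(g, :time)

# Combine the per-time statistics over the window t-5:t+6 (only 100 rows to scan).
function window_stats(g, t)
    rows = @view g[(g.time .>= t - 5) .& (g.time .<= t + 6), :]
    n, s, s2 = sum(rows.n), sum(rows.s), sum(rows.s2)
    m = s / n
    sd = sqrt((s2 - n * m^2) / (n - 1))   # sample std from sufficient statistics
    m, sd
end

df_summ = DataFrame(time = 1:100,
                    mean_std = [window_stats(g, t) for t in 1:100])
```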

A colleague showed me Spark’s (and SQL’s?) pyspark.sql.Window.rangeBetween (PySpark 3.1.2 documentation), which can do this. Interesting.

@xiodai, have you also taken a look at the RollingTimeWindows.jl package and associated discourse post?


No. But I will check them out now

Author of RollingTimeWindows (and the associated post) here. I’m happy to give you commit access if you want to make any modifications to RollingTimeWindows (e.g. adding fixed-width iteration or functions that live at a higher level of abstraction than the iterator) for your particular use-case – just ping me with your GitHub username :slight_smile:
