Calculation of rolling standard deviation by group

ejlongman · January 26, 2023, 3:48am

Hello,

I have a dataset with year, quarter, firm and EPS. I want to calculate rolling standard deviation over only the 8 previous quarters, by firm. The following code works to create lags by group, and the really terrible solution that comes to mind is making 8 lags then calculating standard deviation across them. There’s a better way…

transform!(groupby(qr_crsp_cmp, :gvkey), [:gvkey,:fyearq, :fqtr,:EPS] => ((w,y, x,z) -> ifelse.(coalesce.(w .- lag(w) .== 0 , false), lag(z,8), missing))  => :EPS_lag8);

std_dev = std(filter(x -> !ismissing(x),EPS))

Dan · January 26, 2023, 5:18am

Here is one method to try (thanks to FlexiJoin package):

julia> using DataFrames, StatsBase, StructArrays, IntervalSets, FlexiJoins

julia> df = DataFrame(gvkey = sample(["APPL","MSFT","GOOG","AMZN"],20),
       fyearq=sample([2021,2022],20), 
       fqtr=sample([1,2,3,4],20),
       EPS=round.(rand(20); digits=2))
20×4 DataFrame
 Row │ gvkey   fyearq  fqtr   EPS     
     │ String  Int64   Int64  Float64 
─────┼────────────────────────────────
   1 │ APPL      2022      4     0.41
   2 │ APPL      2021      4     0.72
   3 │ GOOG      2021      1     0.83
   4 │ MSFT      2021      4     0.48
   5 │ GOOG      2022      2     0.66
...
julia> transform!(df, [:fyearq, :fqtr] => ByRow((y,q) -> y+(q-1)*0.25) => :yq)                    
20×5 DataFrame                                                                                              
 Row │ gvkey   fyearq  fqtr   EPS      yq                                                                   
     │ String  Int64   Int64  Float64  Float64                                                              
─────┼─────────────────────────────────────────                                                                1 │ APPL      2022      4     0.41  2022.75                                                              
   2 │ APPL      2021      4     0.72  2021.75                                                              
   3 │ GOOG      2021      1     0.83  2021.0                                                               
   4 │ MSFT      2021      4     0.48  2021.75                                                              
   5 │ GOOG      2022      2     0.66  2022.25                                                              
...
julia> sa = StructArray(NamedTuple.(eachrow(df)))
20-element StructArray(::Vector{String}, ::Vector{Int64}...:
 (gvkey = "APPL", fyearq = 2022, fqtr = 4, EPS = 0.41, yq = 2022.75)
 (gvkey = "APPL", fyearq = 2021, fqtr = 4, EPS = 0.72, yq = 2021.75)
 (gvkey = "GOOG", fyearq = 2021, fqtr = 1, EPS = 0.83, yq = 2021.0)
 (gvkey = "MSFT", fyearq = 2021, fqtr = 4, EPS = 0.48, yq = 2021.75)
 (gvkey = "GOOG", fyearq = 2022, fqtr = 2, EPS = 0.66, yq = 2022.25)
...
julia> dfout = DataFrame(map(x->(; x.O..., qtr_cnt=length(x.M), EPS_sum=sum(t->t.EPS,x.M), EPS_ssq=sum(t->t.
EPS^2, x.M)),leftjoin((O=sa, M=sa), by_key(:gvkey) & by_pred(x->((x.yq - 1.76)..(x.yq + 0.01)), ∋, :yq); groupby=:O)))
20×8 DataFrame
 Row │ gvkey   fyearq  fqtr   EPS      yq       qtr_cnt  EPS_sum  EPS_ssq 
     │ String  Int64   Int64  Float64  Float64  Int64    Float64  Float64 
─────┼────────────────────────────────────────────────────────────────────
   1 │ AMZN      2021      3     0.17  2021.5         3     0.76   0.3134
   2 │ GOOG      2022      2     0.96  2022.25        5     3.4    2.6422
   3 │ MSFT      2021      4     0.41  2021.75        4     1.93   1.0687
   4 │ APPL      2021      1     0.33  2021.0         2     0.4    0.1138
   5 │ APPL      2021      1     0.07  2021.0         2     0.4    0.1138
...

Now, in the final DataFrame, there are the qtr_cnt, EPS_sum and EPS_ssq (sum-of-squares) for the appropriate previous 2 years from each datapoint. This should be enough to calculate the standard deviation.

Note, a key difficulty in these questions is getting some data to play with. Had the question included a pasteble generation of a bit of data, I’m sure it would be a huge help for readers.

bkamins · January 26, 2023, 7:58am

Is this what you want?

julia> df = allcombinations(DataFrame, quarter=1:4, year=2021:2023, company=["A", "B"]);

julia> df.v = rand(24);

julia> using RollingFunctions

julia> transform!(groupby(df, :company), :v => x -> runstd(x, 8))
24×5 DataFrame
 Row │ quarter  year   company  v          v_function
     │ Int64    Int64  String   Float64    Float64
─────┼────────────────────────────────────────────────
   1 │       1   2021  A        0.953399    0.0
   2 │       2   2021  A        0.323781    0.445208
   3 │       3   2021  A        0.592177    0.315948
   4 │       4   2021  A        0.578366    0.258939
   5 │       1   2022  A        0.263201    0.273147
   6 │       2   2022  A        0.502612    0.244844
   7 │       3   2022  A        0.764003    0.239605
   8 │       4   2022  A        0.0296855   0.292338
   9 │       1   2023  A        0.491008    0.228926
  10 │       2   2023  A        0.903478    0.273221
  11 │       3   2023  A        0.281872    0.282655
  12 │       4   2023  A        0.98276     0.334773
  13 │       1   2021  B        0.888162    0.0
  14 │       2   2021  B        0.974804    0.0612648
  15 │       3   2021  B        0.317977    0.356847
  16 │       4   2021  B        0.0854368   0.433345
  17 │       1   2022  B        0.434297    0.379923
  18 │       2   2022  B        0.860623    0.364132
  19 │       3   2022  B        0.827672    0.343982
  20 │       4   2022  B        0.210523    0.350858
  21 │       1   2023  B        0.25318     0.341578
  22 │       2   2023  B        0.774294    0.306995
  23 │       3   2023  B        0.882941    0.330944
  24 │       4   2023  B        0.244069    0.303342

rocco_sprmnt21 · January 26, 2023, 9:20am

you could also define your own rolling std and then do everything with the minilanguage

std(itr)=length(itr)==1 ? 0 : sqrt(sum((itr .- sum(itr)/length(itr)).^2) / (length(itr) - 1))
rstd(w)=itr->[std(itr[max(1,c-w+1):c]) for c in eachindex(itr)]
transform(groupby(df, :company), :v => rstd(8))

ejlongman · January 26, 2023, 3:07pm

Thank you!! This works well. I amended it to handle missing obs and groups with fewer than 8 obs. Do you know how to change the title of v_function? When I add a rename line, it says that “v_function” does not exist even though it appears when I print the dataframe.

#check that this looks at 8 nonmissings. If less than that in group, looks at smaller amt. 
transform!(groupby(qr_crsp_cmp[.!ismissing.(qr_crsp_cmp[!,:change4q]),:], :gvkey), :change4q => x -> if length(x) >= 8 runstd(x, 8) else runstd(x, length(x)) end)

rename!(qr_crsp_cmp,:change4q_function => :a)

bkamins · January 26, 2023, 3:20pm

You can pass output column name explicitly, eg.:

:v => rstd(8) => :some_new_name

Dan · January 26, 2023, 3:31pm

Many times datasets are incomplete in strange ways (though CRSP is well curated). So if using rolling/running functions, it is best to verify data has contiguous year/quarter data with missings when missing, and not just skipping over time periods.
For example, consider a company going private and back public (e.g. TWTR), then a window of 8 rows might include a gap and make it spill way over 2 years.

rocco_sprmnt21 · January 26, 2023, 4:21pm

could you check if this formulation handles all “your” cases?

julia> df = allcombinations(DataFrame, quarter=1:4, year=2021:2023, company=["A", "B"]);

julia> df.v=[rand(3); missing; rand(12);missing;rand(7)]        
24-element Vector{Union{Missing, Float64}}:
...

julia> delete!(df,3)
...

julia> delete!(df,21)
...
julia> std(itr)=length(itr)==1 ? 0 : sqrt(sum((itr .- sum(itr)/length(itr)).^2) / (length(itr) - 1))
std (generic function with 1 method)

julia> std_sm(itr)=std(collect(skipmissing(itr)))
std_sm (generic function with 1 method)

julia> rstd(w)=itr->[std_sm(itr[max(1,c-w+1):c]) for c in eachindex(itr)]
rstd (generic function with 1 method)

julia> transform(groupby(df, :company), :v => rstd(8)=>:new_name)
22×5 DataFrame
 Row │ quarter  year   company  v               new_name        
     │ Int64    Int64  String   Float64?        Real
─────┼────────────────────────────────────────────────────      
   1 │       1   2021  A              0.332982  0
   2 │       2   2021  A              0.226966  0.0749649       
   3 │       4   2021  A        missing         0.0749649       
   4 │       1   2022  A              0.423359  0.0983002       
   5 │       2   2022  A              0.65792   0.183554        
   6 │       3   2022  A              0.4826    0.162216        
   7 │       4   2022  A              0.248065  0.162034        
   8 │       1   2023  A              0.160807  0.17244
   9 │       2   2023  A              0.786538  0.234018        
  10 │       3   2023  A              0.412299  0.217572        
  11 │       4   2023  A              0.380783  0.203048        
  12 │       1   2021  B              0.976114  0
  13 │       2   2021  B              0.115877  0.60828
  14 │       3   2021  B              0.295775  0.453732        
  15 │       4   2021  B              0.214436  0.390696        
  16 │       1   2022  B        missing         0.390696        
  17 │       2   2022  B              0.156643  0.3555
  18 │       3   2022  B              0.748149  0.356778        
  19 │       4   2022  B              0.197173  0.336201        
  20 │       1   2023  B              0.86732   0.305504        
  21 │       3   2023  B              0.160867  0.299666        
  22 │       4   2023  B              0.750582  0.327118

Topic		Replies	Views
How to compute the rolling standardization using DataFrames General Usage packages , dataframes	8	3007	July 21, 2021
How to compute a "cumulative" in a dataframe (without a for loop) Data question , dataframes	44	9518	September 11, 2021
Query.jl: Group dataframe along two variables General Usage	4	648	May 2, 2020
RollingFunctions with a variable window width New to Julia dataframes	19	1353	October 8, 2022
Nested Task error: Bad window span - what? Data question , dataframes	6	794	September 13, 2021

Calculation of rolling standard deviation by group

Related topics