Calculation of rolling standard deviation by group

Hello,

I have a dataset with year, quarter, firm and EPS. I want to calculate rolling standard deviation over only the 8 previous quarters, by firm. The following code works to create lags by group, and the really terrible solution that comes to mind is making 8 lags then calculating standard deviation across them. There’s a better way…

transform!(groupby(qr_crsp_cmp, :gvkey), [:gvkey,:fyearq, :fqtr,:EPS] => ((w,y, x,z) -> ifelse.(coalesce.(w .- lag(w) .== 0 , false), lag(z,8), missing))  => :EPS_lag8);

std_dev = std(filter(x -> !ismissing(x),EPS))

Here is one method to try (thanks to FlexiJoin package):

julia> using DataFrames, StatsBase, StructArrays, IntervalSets, FlexiJoins

julia> df = DataFrame(gvkey = sample(["APPL","MSFT","GOOG","AMZN"],20),
       fyearq=sample([2021,2022],20), 
       fqtr=sample([1,2,3,4],20),
       EPS=round.(rand(20); digits=2))
20Γ—4 DataFrame
 Row β”‚ gvkey   fyearq  fqtr   EPS     
     β”‚ String  Int64   Int64  Float64 
─────┼────────────────────────────────
   1 β”‚ APPL      2022      4     0.41
   2 β”‚ APPL      2021      4     0.72
   3 β”‚ GOOG      2021      1     0.83
   4 β”‚ MSFT      2021      4     0.48
   5 β”‚ GOOG      2022      2     0.66
...
julia> transform!(df, [:fyearq, :fqtr] => ByRow((y,q) -> y+(q-1)*0.25) => :yq)                    
20Γ—5 DataFrame                                                                                              
 Row β”‚ gvkey   fyearq  fqtr   EPS      yq                                                                   
     β”‚ String  Int64   Int64  Float64  Float64                                                              
─────┼─────────────────────────────────────────                                                                1 β”‚ APPL      2022      4     0.41  2022.75                                                              
   2 β”‚ APPL      2021      4     0.72  2021.75                                                              
   3 β”‚ GOOG      2021      1     0.83  2021.0                                                               
   4 β”‚ MSFT      2021      4     0.48  2021.75                                                              
   5 β”‚ GOOG      2022      2     0.66  2022.25                                                              
...
julia> sa = StructArray(NamedTuple.(eachrow(df)))
20-element StructArray(::Vector{String}, ::Vector{Int64}...:
 (gvkey = "APPL", fyearq = 2022, fqtr = 4, EPS = 0.41, yq = 2022.75)
 (gvkey = "APPL", fyearq = 2021, fqtr = 4, EPS = 0.72, yq = 2021.75)
 (gvkey = "GOOG", fyearq = 2021, fqtr = 1, EPS = 0.83, yq = 2021.0)
 (gvkey = "MSFT", fyearq = 2021, fqtr = 4, EPS = 0.48, yq = 2021.75)
 (gvkey = "GOOG", fyearq = 2022, fqtr = 2, EPS = 0.66, yq = 2022.25)
...
julia> dfout = DataFrame(map(x->(; x.O..., qtr_cnt=length(x.M), EPS_sum=sum(t->t.EPS,x.M), EPS_ssq=sum(t->t.
EPS^2, x.M)),leftjoin((O=sa, M=sa), by_key(:gvkey) & by_pred(x->((x.yq - 1.76)..(x.yq + 0.01)), βˆ‹, :yq); groupby=:O)))
20Γ—8 DataFrame
 Row β”‚ gvkey   fyearq  fqtr   EPS      yq       qtr_cnt  EPS_sum  EPS_ssq 
     β”‚ String  Int64   Int64  Float64  Float64  Int64    Float64  Float64 
─────┼────────────────────────────────────────────────────────────────────
   1 β”‚ AMZN      2021      3     0.17  2021.5         3     0.76   0.3134
   2 β”‚ GOOG      2022      2     0.96  2022.25        5     3.4    2.6422
   3 β”‚ MSFT      2021      4     0.41  2021.75        4     1.93   1.0687
   4 β”‚ APPL      2021      1     0.33  2021.0         2     0.4    0.1138
   5 β”‚ APPL      2021      1     0.07  2021.0         2     0.4    0.1138
...

Now, in the final DataFrame, there are the qtr_cnt, EPS_sum and EPS_ssq (sum-of-squares) for the appropriate previous 2 years from each datapoint. This should be enough to calculate the standard deviation.

Note, a key difficulty in these questions is getting some data to play with. Had the question included a pasteble generation of a bit of data, I’m sure it would be a huge help for readers.

3 Likes

Is this what you want?

julia> df = allcombinations(DataFrame, quarter=1:4, year=2021:2023, company=["A", "B"]);

julia> df.v = rand(24);

julia> using RollingFunctions

julia> transform!(groupby(df, :company), :v => x -> runstd(x, 8))
24Γ—5 DataFrame
 Row β”‚ quarter  year   company  v          v_function
     β”‚ Int64    Int64  String   Float64    Float64
─────┼────────────────────────────────────────────────
   1 β”‚       1   2021  A        0.953399    0.0
   2 β”‚       2   2021  A        0.323781    0.445208
   3 β”‚       3   2021  A        0.592177    0.315948
   4 β”‚       4   2021  A        0.578366    0.258939
   5 β”‚       1   2022  A        0.263201    0.273147
   6 β”‚       2   2022  A        0.502612    0.244844
   7 β”‚       3   2022  A        0.764003    0.239605
   8 β”‚       4   2022  A        0.0296855   0.292338
   9 β”‚       1   2023  A        0.491008    0.228926
  10 β”‚       2   2023  A        0.903478    0.273221
  11 β”‚       3   2023  A        0.281872    0.282655
  12 β”‚       4   2023  A        0.98276     0.334773
  13 β”‚       1   2021  B        0.888162    0.0
  14 β”‚       2   2021  B        0.974804    0.0612648
  15 β”‚       3   2021  B        0.317977    0.356847
  16 β”‚       4   2021  B        0.0854368   0.433345
  17 β”‚       1   2022  B        0.434297    0.379923
  18 β”‚       2   2022  B        0.860623    0.364132
  19 β”‚       3   2022  B        0.827672    0.343982
  20 β”‚       4   2022  B        0.210523    0.350858
  21 β”‚       1   2023  B        0.25318     0.341578
  22 β”‚       2   2023  B        0.774294    0.306995
  23 β”‚       3   2023  B        0.882941    0.330944
  24 β”‚       4   2023  B        0.244069    0.303342
1 Like

you could also define your own rolling std and then do everything with the minilanguage

std(itr)=length(itr)==1 ? 0 : sqrt(sum((itr .- sum(itr)/length(itr)).^2) / (length(itr) - 1))
rstd(w)=itr->[std(itr[max(1,c-w+1):c]) for c in eachindex(itr)]
transform(groupby(df, :company), :v => rstd(8))

Thank you!! This works well. I amended it to handle missing obs and groups with fewer than 8 obs. Do you know how to change the title of v_function? When I add a rename line, it says that β€œv_function” does not exist even though it appears when I print the dataframe.

#check that this looks at 8 nonmissings. If less than that in group, looks at smaller amt. 
transform!(groupby(qr_crsp_cmp[.!ismissing.(qr_crsp_cmp[!,:change4q]),:], :gvkey), :change4q => x -> if length(x) >= 8 runstd(x, 8) else runstd(x, length(x)) end)

rename!(qr_crsp_cmp,:change4q_function => :a)

You can pass output column name explicitly, eg.:

:v => rstd(8) => :some_new_name

Many times datasets are incomplete in strange ways (though CRSP is well curated). So if using rolling/running functions, it is best to verify data has contiguous year/quarter data with missings when missing, and not just skipping over time periods.
For example, consider a company going private and back public (e.g. TWTR), then a window of 8 rows might include a gap and make it spill way over 2 years.

2 Likes

could you check if this formulation handles all β€œyour” cases?

julia> df = allcombinations(DataFrame, quarter=1:4, year=2021:2023, company=["A", "B"]);

julia> df.v=[rand(3); missing; rand(12);missing;rand(7)]        
24-element Vector{Union{Missing, Float64}}:
...

julia> delete!(df,3)
...

julia> delete!(df,21)
...
julia> std(itr)=length(itr)==1 ? 0 : sqrt(sum((itr .- sum(itr)/length(itr)).^2) / (length(itr) - 1))
std (generic function with 1 method)

julia> std_sm(itr)=std(collect(skipmissing(itr)))
std_sm (generic function with 1 method)

julia> rstd(w)=itr->[std_sm(itr[max(1,c-w+1):c]) for c in eachindex(itr)]
rstd (generic function with 1 method)

julia> transform(groupby(df, :company), :v => rstd(8)=>:new_name)
22Γ—5 DataFrame
 Row β”‚ quarter  year   company  v               new_name        
     β”‚ Int64    Int64  String   Float64?        Real
─────┼────────────────────────────────────────────────────      
   1 β”‚       1   2021  A              0.332982  0
   2 β”‚       2   2021  A              0.226966  0.0749649       
   3 β”‚       4   2021  A        missing         0.0749649       
   4 β”‚       1   2022  A              0.423359  0.0983002       
   5 β”‚       2   2022  A              0.65792   0.183554        
   6 β”‚       3   2022  A              0.4826    0.162216        
   7 β”‚       4   2022  A              0.248065  0.162034        
   8 β”‚       1   2023  A              0.160807  0.17244
   9 β”‚       2   2023  A              0.786538  0.234018        
  10 β”‚       3   2023  A              0.412299  0.217572        
  11 β”‚       4   2023  A              0.380783  0.203048        
  12 β”‚       1   2021  B              0.976114  0
  13 β”‚       2   2021  B              0.115877  0.60828
  14 β”‚       3   2021  B              0.295775  0.453732        
  15 β”‚       4   2021  B              0.214436  0.390696        
  16 β”‚       1   2022  B        missing         0.390696        
  17 β”‚       2   2022  B              0.156643  0.3555
  18 β”‚       3   2022  B              0.748149  0.356778        
  19 β”‚       4   2022  B              0.197173  0.336201        
  20 β”‚       1   2023  B              0.86732   0.305504        
  21 β”‚       3   2023  B              0.160867  0.299666        
  22 β”‚       4   2023  B              0.750582  0.327118