# Calculation of rolling standard deviation by group

Hello,

I have a dataset with year, quarter, firm and EPS. I want to calculate rolling standard deviation over only the 8 previous quarters, by firm. The following code works to create lags by group, and the really terrible solution that comes to mind is making 8 lags then calculating standard deviation across them. There’s a better way…

``````transform!(groupby(qr_crsp_cmp, :gvkey), [:gvkey,:fyearq, :fqtr,:EPS] => ((w,y, x,z) -> ifelse.(coalesce.(w .- lag(w) .== 0 , false), lag(z,8), missing))  => :EPS_lag8);

``````
``````std_dev = std(filter(x -> !ismissing(x),EPS))
``````

Here is one method to try (thanks to FlexiJoin package):

``````julia> using DataFrames, StatsBase, StructArrays, IntervalSets, FlexiJoins

julia> df = DataFrame(gvkey = sample(["APPL","MSFT","GOOG","AMZN"],20),
fyearq=sample([2021,2022],20),
fqtr=sample([1,2,3,4],20),
EPS=round.(rand(20); digits=2))
20×4 DataFrame
Row │ gvkey   fyearq  fqtr   EPS
│ String  Int64   Int64  Float64
─────┼────────────────────────────────
1 │ APPL      2022      4     0.41
2 │ APPL      2021      4     0.72
3 │ GOOG      2021      1     0.83
4 │ MSFT      2021      4     0.48
5 │ GOOG      2022      2     0.66
...
julia> transform!(df, [:fyearq, :fqtr] => ByRow((y,q) -> y+(q-1)*0.25) => :yq)
20×5 DataFrame
Row │ gvkey   fyearq  fqtr   EPS      yq
│ String  Int64   Int64  Float64  Float64
─────┼─────────────────────────────────────────                                                                1 │ APPL      2022      4     0.41  2022.75
2 │ APPL      2021      4     0.72  2021.75
3 │ GOOG      2021      1     0.83  2021.0
4 │ MSFT      2021      4     0.48  2021.75
5 │ GOOG      2022      2     0.66  2022.25
...
julia> sa = StructArray(NamedTuple.(eachrow(df)))
20-element StructArray(::Vector{String}, ::Vector{Int64}...:
(gvkey = "APPL", fyearq = 2022, fqtr = 4, EPS = 0.41, yq = 2022.75)
(gvkey = "APPL", fyearq = 2021, fqtr = 4, EPS = 0.72, yq = 2021.75)
(gvkey = "GOOG", fyearq = 2021, fqtr = 1, EPS = 0.83, yq = 2021.0)
(gvkey = "MSFT", fyearq = 2021, fqtr = 4, EPS = 0.48, yq = 2021.75)
(gvkey = "GOOG", fyearq = 2022, fqtr = 2, EPS = 0.66, yq = 2022.25)
...
julia> dfout = DataFrame(map(x->(; x.O..., qtr_cnt=length(x.M), EPS_sum=sum(t->t.EPS,x.M), EPS_ssq=sum(t->t.
EPS^2, x.M)),leftjoin((O=sa, M=sa), by_key(:gvkey) & by_pred(x->((x.yq - 1.76)..(x.yq + 0.01)), ∋, :yq); groupby=:O)))
20×8 DataFrame
Row │ gvkey   fyearq  fqtr   EPS      yq       qtr_cnt  EPS_sum  EPS_ssq
│ String  Int64   Int64  Float64  Float64  Int64    Float64  Float64
─────┼────────────────────────────────────────────────────────────────────
1 │ AMZN      2021      3     0.17  2021.5         3     0.76   0.3134
2 │ GOOG      2022      2     0.96  2022.25        5     3.4    2.6422
3 │ MSFT      2021      4     0.41  2021.75        4     1.93   1.0687
4 │ APPL      2021      1     0.33  2021.0         2     0.4    0.1138
5 │ APPL      2021      1     0.07  2021.0         2     0.4    0.1138
...
``````

Now, in the final DataFrame, there are the `qtr_cnt`, `EPS_sum` and `EPS_ssq` (sum-of-squares) for the appropriate previous 2 years from each datapoint. This should be enough to calculate the standard deviation.

Note, a key difficulty in these questions is getting some data to play with. Had the question included a pasteble generation of a bit of data, I’m sure it would be a huge help for readers.

3 Likes

Is this what you want?

``````julia> df = allcombinations(DataFrame, quarter=1:4, year=2021:2023, company=["A", "B"]);

julia> df.v = rand(24);

julia> using RollingFunctions

julia> transform!(groupby(df, :company), :v => x -> runstd(x, 8))
24×5 DataFrame
Row │ quarter  year   company  v          v_function
│ Int64    Int64  String   Float64    Float64
─────┼────────────────────────────────────────────────
1 │       1   2021  A        0.953399    0.0
2 │       2   2021  A        0.323781    0.445208
3 │       3   2021  A        0.592177    0.315948
4 │       4   2021  A        0.578366    0.258939
5 │       1   2022  A        0.263201    0.273147
6 │       2   2022  A        0.502612    0.244844
7 │       3   2022  A        0.764003    0.239605
8 │       4   2022  A        0.0296855   0.292338
9 │       1   2023  A        0.491008    0.228926
10 │       2   2023  A        0.903478    0.273221
11 │       3   2023  A        0.281872    0.282655
12 │       4   2023  A        0.98276     0.334773
13 │       1   2021  B        0.888162    0.0
14 │       2   2021  B        0.974804    0.0612648
15 │       3   2021  B        0.317977    0.356847
16 │       4   2021  B        0.0854368   0.433345
17 │       1   2022  B        0.434297    0.379923
18 │       2   2022  B        0.860623    0.364132
19 │       3   2022  B        0.827672    0.343982
20 │       4   2022  B        0.210523    0.350858
21 │       1   2023  B        0.25318     0.341578
22 │       2   2023  B        0.774294    0.306995
23 │       3   2023  B        0.882941    0.330944
24 │       4   2023  B        0.244069    0.303342
``````
1 Like

you could also define your own rolling std and then do everything with the minilanguage

``````std(itr)=length(itr)==1 ? 0 : sqrt(sum((itr .- sum(itr)/length(itr)).^2) / (length(itr) - 1))
rstd(w)=itr->[std(itr[max(1,c-w+1):c]) for c in eachindex(itr)]
transform(groupby(df, :company), :v => rstd(8))

``````

Thank you!! This works well. I amended it to handle missing obs and groups with fewer than 8 obs. Do you know how to change the title of v_function? When I add a rename line, it says that “v_function” does not exist even though it appears when I print the dataframe.

``````#check that this looks at 8 nonmissings. If less than that in group, looks at smaller amt.
transform!(groupby(qr_crsp_cmp[.!ismissing.(qr_crsp_cmp[!,:change4q]),:], :gvkey), :change4q => x -> if length(x) >= 8 runstd(x, 8) else runstd(x, length(x)) end)

rename!(qr_crsp_cmp,:change4q_function => :a)
``````

You can pass output column name explicitly, eg.:

``````:v => rstd(8) => :some_new_name
``````

Many times datasets are incomplete in strange ways (though CRSP is well curated). So if using rolling/running functions, it is best to verify data has contiguous year/quarter data with `missing`s when missing, and not just skipping over time periods.
For example, consider a company going private and back public (e.g. TWTR), then a window of 8 rows might include a gap and make it spill way over 2 years.

2 Likes

could you check if this formulation handles all “your” cases?

``````julia> df = allcombinations(DataFrame, quarter=1:4, year=2021:2023, company=["A", "B"]);

julia> df.v=[rand(3); missing; rand(12);missing;rand(7)]
24-element Vector{Union{Missing, Float64}}:
...

julia> delete!(df,3)
...

julia> delete!(df,21)
...
julia> std(itr)=length(itr)==1 ? 0 : sqrt(sum((itr .- sum(itr)/length(itr)).^2) / (length(itr) - 1))
std (generic function with 1 method)

julia> std_sm(itr)=std(collect(skipmissing(itr)))
std_sm (generic function with 1 method)

julia> rstd(w)=itr->[std_sm(itr[max(1,c-w+1):c]) for c in eachindex(itr)]
rstd (generic function with 1 method)

julia> transform(groupby(df, :company), :v => rstd(8)=>:new_name)
22×5 DataFrame
Row │ quarter  year   company  v               new_name
│ Int64    Int64  String   Float64?        Real
─────┼────────────────────────────────────────────────────
1 │       1   2021  A              0.332982  0
2 │       2   2021  A              0.226966  0.0749649
3 │       4   2021  A        missing         0.0749649
4 │       1   2022  A              0.423359  0.0983002
5 │       2   2022  A              0.65792   0.183554
6 │       3   2022  A              0.4826    0.162216
7 │       4   2022  A              0.248065  0.162034
8 │       1   2023  A              0.160807  0.17244
9 │       2   2023  A              0.786538  0.234018
10 │       3   2023  A              0.412299  0.217572
11 │       4   2023  A              0.380783  0.203048
12 │       1   2021  B              0.976114  0
13 │       2   2021  B              0.115877  0.60828
14 │       3   2021  B              0.295775  0.453732
15 │       4   2021  B              0.214436  0.390696
16 │       1   2022  B        missing         0.390696
17 │       2   2022  B              0.156643  0.3555
18 │       3   2022  B              0.748149  0.356778
19 │       4   2022  B              0.197173  0.336201
20 │       1   2023  B              0.86732   0.305504
21 │       3   2023  B              0.160867  0.299666
22 │       4   2023  B              0.750582  0.327118
``````