# Initializing OnlineStats Series with statistics

Premise:
I am running experiments that involve Monte Carlo sampling on clusters, and I am collecting the mean and variance using OnlineStats.
The runs take a long time and hit timeout errors on our clusters, so I want to save the state and resume in another job.
Since the number of samples is huge, I want to store only the mean, the variance and the sample size (not the whole data) in order to restart.

I am aware of the algorithm described in "Online estimation of variance with limited memory" on Cross Validated, but if I were willing to implement this myself, I would not be using OnlineStats.

Question:
If I do

using OnlineStats
mystat = Series(Mean(),Variance())
fit!(mystat, rand(10))

I get

Series
├─ Mean: n=10 | value=0.542621
└─ Variance: n=10 | value=0.077484

If I could save mystat and load it back as a “Julia variable”, the way MATLAB does, that would be fine, but it seems to be tricky: see "What is the preferred way to save variables?" (#17 by FHell).

I know that value(mystat) gives the mean and variance, and nobs(mystat) gives the sample size. I can save these to a .txt file and read them back in another run.

But given the mean, variance, and sample size, I don’t know how to create a Series “mystat” holding the same information, so that I can merge! it in another run of my experiment.

I would think as a worst case you can just rebuild the structs exactly as they were. See the definition in the code for Variance below.

https://github.com/joshday/OnlineStatsBase.jl/blob/master/src/stats.jl#L476

Ah thank you!

So with
mystat = Series(Mean(),Variance());
fit!(mystat, rand(10))
I get
Series
├─ Mean: n=10 | value=0.446827
└─ Variance: n=10 | value=0.0938919

I guessed that you meant something like

mymean = Mean(value(mystat)[1], EqualWeight() , nobs(mystat));
myvar = Variance(value(mystat)[2], value(mystat)[1], EqualWeight() , nobs(mystat));

mypreviousstat = Series(mymean,myvar)
With this I get the correct mean, but the variance is not correct:
Series
├─ Mean: n=10 | value=0.446827
└─ Variance: n=10 | value=0.104324

I may be misunderstanding some definition of the sample variance, but the field σ2 looks like the variance, and I assumed the struct uses the same definition of sample variance that value returns. What did I miss?

Update:
myvar = Variance(value(mystat)[2]*(nobs(mystat)-1)/nobs(mystat), value(mystat)[1], EqualWeight() , nobs(mystat))

This worked. So it seems the field σ2 in the struct is the biased estimator of the variance (the one with 1/(sample size)), while value(mystat)[2] gives the unbiased sample variance (the one with 1/(sample size - 1))?

OnlineStats author here. Yes, the Variance struct stores the biased variance because it simplifies the update code in fit!.
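For reference, here is how the two estimators relate, with $n$ the sample size, $\bar{x}$ the sample mean, $\hat{\sigma}^2$ the biased variance stored in the struct, and $s^2$ the unbiased variance that value returns (the symbols are just notation chosen for this note, not names from the package):

$$
\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^2,
\qquad
s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2,
\qquad
\hat{\sigma}^2 = \frac{n-1}{n}\,s^2 .
$$

The numbers in this thread are consistent with that: storing $s^2 = 0.0938919$ directly in the biased slot with $n = 10$ makes value report $\frac{10}{9}\cdot 0.0938919 \approx 0.104324$, which is the wrong variance printed above, whereas storing $\frac{9}{10}\cdot 0.0938919$ recovers $0.0938919$.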

If I’m understanding your problem correctly, you could also serialize the Variance in one process and then deserialize it in another.
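A minimal sketch of that serialize/deserialize route, assuming Julia's Serialization standard library is acceptable for the checkpoint and that the whole Series (not just the Variance) is saved. The file name, the sample counts and the variable names stat_a, stat_b and restored are made up for illustration. Note that the Serialization format is not guaranteed to stay readable across Julia or package versions, so this is probably best suited to short-lived checkpoints written and read with the same environment.

```julia
using OnlineStats
using Serialization  # Julia standard library

# First job: accumulate statistics, then checkpoint before the time limit.
stat_a = Series(Mean(), Variance())
fit!(stat_a, rand(1000))
checkpoint = joinpath(mktempdir(), "mystat.jls")  # hypothetical file location
serialize(checkpoint, stat_a)

# Later job: restore the checkpoint and fold in the newly collected samples.
restored = deserialize(checkpoint)
stat_b = Series(Mean(), Variance())
fit!(stat_b, rand(500))
merge!(restored, stat_b)

println(nobs(restored))   # total number of observations from both jobs
println(value(restored))  # (mean, unbiased variance) over all samples
```

In a real run the checkpoint path would of course be a persistent location on the cluster rather than a temporary directory.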