Initializing OnlineStats Series with statistics

I am running experiments that involve Monte Carlo sampling on clusters, and I am collecting the mean and variance using OnlineStats.
The runs take a long time and I get timeout errors on our clusters, so I want to save the state and resume in another job.
Since the number of samples is huge, I want to store only the mean, the variance, and the sample size (not the whole data) in order to restart.

I am aware of the algorithm described in "Online estimation of variance with limited memory" on Cross Validated, but if I were willing to implement this myself, I would not be using OnlineStats.

If I do

using OnlineStats
mystat = Series(Mean(),Variance())
fit!(mystat, rand(10))

I get

├─ Mean: n=10 | value=0.542621
└─ Variance: n=10 | value=0.077484

If I could save mystat and load it back as a "Julia variable", as in MATLAB, that would be fine, but it seems to be tricky: see "What is the preferred way to save variables?" (#17 by FHell).

I know value(mystat) gives the mean and variance, and nobs(mystat) gives the sample size, which I can save to a .txt file and read back in another run.

But given the mean, variance, and the sample size, I don’t know how to create Series “mystat” with the same information, so that I can merge! in another run of my experiment.

I would think as a worst case you can just rebuild the structs exactly as they were. See the definition in the code for Variance below.

Ah thank you!

So with
mystat = Series(Mean(),Variance());
fit!(mystat, rand(10))
I get
├─ Mean: n=10 | value=0.446827
└─ Variance: n=10 | value=0.0938919

I guessed that you meant something like

mymean = Mean(value(mystat)[1], EqualWeight() , nobs(mystat));
myvar = Variance(value(mystat)[2], value(mystat)[1], EqualWeight() , nobs(mystat));

mypreviousstat = Series(mymean,myvar)
With this I get the correct mean, but the variance is not correct.
├─ Mean: n=10 | value=0.446827
└─ Variance: n=10 | value=0.104324

I may be misunderstanding some definition of the sample variance, but \sigma2 looks like the variance, and I assumed the function uses the same definition of sample variance. What did I miss?

myvar = Variance(value(mystat)[2]*(nobs(mystat)-1)/nobs(mystat), value(mystat)[1], EqualWeight() , nobs(mystat))

That worked. So the field \sigma2 in the struct is the biased estimator of the variance (the one with 1/(sample size)), and value(mystat)[2] gives the unbiased sample variance (the one with 1/(sample size - 1))?
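
For reference, the relation between the two estimators can be written out as below. The symbols are notation chosen here, not OnlineStats names: s^2 is the unbiased sample variance and \hat{\sigma}^2 the biased one. The last line only reproduces the numbers from the failed attempt above, where the unbiased value was passed in as \sigma2 and then rescaled again on output.

```latex
s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2,
\qquad
\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{n-1}{n}\, s^2 .

% Passing s^2 = 0.0938919 as \sigma2 with n = 10 gives the wrong output:
\frac{n}{n-1} \cdot 0.0938919 = \frac{10}{9} \cdot 0.0938919 \approx 0.104324 .
```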

OnlineStats author here. Yes, the Variance struct stores the biased variance because it simplifies the update code in fit!.
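
Given that, a text-file round trip along the lines discussed above might look like the following sketch. The helper names save_checkpoint and load_checkpoint are hypothetical, and the positional constructors Mean(μ, weight, n) and Variance(σ2, μ, weight, n) are the ones used earlier in this thread. They depend on the struct field order, which is an internal detail and may differ between OnlineStats versions.

```julia
using OnlineStats

# Hypothetical helper: write mean, unbiased variance, and sample size, one per line.
function save_checkpoint(path, s)
    m, v = value(s)              # for Series(Mean(), Variance()): (mean, unbiased variance)
    open(path, "w") do io
        println(io, m)
        println(io, v)
        println(io, nobs(s))
    end
end

# Hypothetical helper: rebuild the Series from the three saved numbers.
function load_checkpoint(path)
    lines = readlines(path)
    m = parse(Float64, lines[1])
    v = parse(Float64, lines[2])
    n = parse(Int, lines[3])
    biased = v * (n - 1) / n     # the struct stores the biased variance
    # Assumption: field order (value(s), weight, n) as used earlier in the thread.
    return Series(Mean(m, EqualWeight(), n), Variance(biased, m, EqualWeight(), n))
end

# First job: collect some samples and checkpoint.
x = rand(10)
mystat = Series(Mean(), Variance())
fit!(mystat, x)
path = tempname()
save_checkpoint(path, mystat)

# Next job: restore, keep sampling into a fresh Series, then combine.
restored = load_checkpoint(path)
y = rand(10)
newstat = Series(Mean(), Variance())
fit!(newstat, y)
merge!(restored, newstat)
```

After the merge, restored should report n=20 and agree, up to floating-point rounding, with a Series fitted on all 20 samples at once.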

If I’m understanding your problem correctly, you could also serialize the Variance in one process and then deserialize it in another.
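
A minimal sketch of that approach, assuming Julia's Serialization standard library and serializing the whole Series rather than only the Variance. The file path is only a placeholder.

```julia
using OnlineStats
using Serialization              # Julia standard library

# First job: fit and write the whole Series to disk.
x = rand(10)
mystat = Series(Mean(), Variance())
fit!(mystat, x)
path = tempname()                # placeholder; use a path that survives the job
serialize(path, mystat)

# Next job: OnlineStats must be loaded before deserializing.
previous = deserialize(path)
```

The restored object can then be passed to fit! or merge! as usual. Note that the serialization format is not guaranteed to be stable across Julia versions, so this is safest when both jobs use the same Julia and OnlineStats versions.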