Since people keep discovering this post, I think the only two packages worth talking about here are now:
marcpabst / ANOVA.jl requires DataFrames plus external setup and fitting of a linear model (GLM.jl), and I think it handles only fixed factors, with the "type" argument selecting among the variant ways to calculate the sums of squares. I don't know if there's any way to do nested factors with linear models. I'm assuming it's robust to unbalanced data due to its reliance on regression? Updated 8 months ago.
and mine:
BioTurboNick / SimpleANOVA.jl, which does the calculation without requiring a GLM, handles random effects, nesting, and repeated measures, and can take DataFrames, vectors, or multidimensional arrays, but requires balanced data and only does one type of sum of squares. It can also compute contrasts and the omega-squared effect size (a rough usage sketch follows below).
I think these both cover only basic cases, however, and I will leave it to those who know more about statistics to speak to the availability of more general tools.
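For anyone landing here, a rough usage sketch for SimpleANOVA.jl from memory (double-check the README; the multidimensional-array convention and the exact factor-type ordering here are my assumptions):

using SimpleANOVA

# Hypothetical balanced design: 5 replicates of a 3×2 two-factor layout,
# with replicates along the first dimension (assumed convention).
observations = rand(5, 3, 2)
result = anova(observations, [fixed, random])  # assumed signature; see the package docs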
@BioTurboNick having just stumbled on this thread when it popped up due to recent comments, I see you implemented this, but your background is in biology rather than mathematics. My biggest concern is that calculations of sums of squares are numerically unstable, and it's easy to get wrong answers due to floating-point issues. How much were you aware of numerical stability issues when coding your package?
I must admit that at the time I wrote it, I had nearly zero awareness of numerical stability.
I have recently had a headlong crash-course in dealing with numerical stability in another context, so it would be good for me to revisit the package with that in mind.
Hi! Good point! But what about var from StatsBase? It is really just a sum of squares… Do we need to adjust StatsBase? What kind of algorithm would be preferred? (Welford? Weighted incremental?)
##### General central moment
function _moment2(v::RealArray, m::Real; corrected=false)
    n = length(v)
    s = 0.0
    for i = 1:n
        @inbounds z = v[i] - m
        s += z * z
    end
    varcorrection(n, corrected) * s
end
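Note that _moment2 is the two-pass form: it receives the precomputed mean m, which already avoids the worst of the cancellation. Where instability really bites is the naive one-pass textbook formula. A minimal sketch of the cancellation (my own example, not StatsBase code):

using Statistics

# Textbook one-pass formula sum(x^2) - n*mean^2: the two terms nearly cancel,
# so almost all significant digits are lost when the data ride on a large offset.
naive_var(v) = (sum(abs2, v) - length(v) * mean(v)^2) / (length(v) - 1)

x = 1e9 .+ rand(1000)  # variance ≈ 1/12, riding on a huge offset
naive_var(x)           # garbage: rounding error in the ~1e21-sized sums swamps the signal
var(x)                 # ≈ 1/12, as expected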
It seems var works… Can you kindly provide a more robust test?
using StatsBase

v = rand(1000000)
v2 = v .+ 1_000_000_000    # same spread, shifted by a huge offset
v3 = append!(copy(v2), v2) # the shifted data duplicated, doubling the length
tv = rand(8)               # small vector for a sanity check

# Welford's online algorithm
function var2(v)
    M = v[1]
    S = 0.0
    N = length(v)
    for k = 2:N
        x = v[k]
        Mn = M
        M += (x - M) / k
        S += (x - M) * (x - Mn)
    end
    return S / (N - 1)
end

# Welford variant that keeps a running mean of the squared-deviation
# increments instead of their sum, rescaling at the end
function var3(v)
    M = v[1]
    S = 0.0
    N = length(v)
    for k = 2:N
        x = v[k]
        Mn = M
        M = Mn + (x - Mn) / k
        S = S + ((x - Mn) * (x - M) - S) / k
    end
    return S * N / (N - 1)
end

println("check: ", isapprox(var(tv), var2(tv)), " ", isapprox(var(tv), var3(tv)))
println("var: ", var(v) - var(v2), " ", var(v) - var(v3))
println("var2: ", var(v) - var2(v2), " ", var(v) - var2(v3))
println("var3: ", var(v) - var3(v2), " ", var(v) - var3(v3))
Try sampling uniformly on 0 to the max float value. Try sampling from a t distribution with 2 degrees of freedom. Try sampling from a t distribution with 2 dof centered at half the max float. Try sampling from a distribution whose variance is something like 4x machine epsilon, centered at 1/16 of the max float…
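For concreteness, here is a sketch of those cases as I read them (the exact parameters are my guesses at what was meant; assumes Distributions.jl and the var2/var3 definitions above):

using Distributions, StatsBase

cases = [
    "uniform on 0 to floatmax"   => rand(Uniform(0.0, floatmax(Float64)), 10^6),
    "t dist, 2 dof"              => rand(TDist(2), 10^6),
    "t dist, 2 dof, huge center" => rand(TDist(2), 10^6) .+ floatmax(Float64) / 2,
    # std chosen so the variance is ~4x machine epsilon
    "tiny variance, huge center" => floatmax(Float64) / 16 .+ sqrt(4 * eps()) .* randn(10^6),
]

for (name, x) in cases
    println(name, ": var=", var(x), " var2=", var2(x), " var3=", var3(x))
end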
I ran some experiments, and in the worst case, when the variable is exp.(rand(Uniform(1, log(prevfloat(typemax(Float64)))), 10000000)), we get Inf for var and var2, and NaN for var3.
Try sampling from a distribution where variance is something like 4x machine epsilon for center of 1/16 max float
For this, I get Inf for var and 0.0 for var2 and var3.
So var2 and var3 give appropriate results here, but the Inf from var is debatable.
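If I read the implementation right, that Inf comes from the mean, not from the squared deviations: var sums the data before dividing by n, and 10^7 values near floatmax/16 overflow that sum, so the mean is Inf and every squared deviation from it is Inf too. Welford's running mean never forms such a huge intermediate. A minimal sketch:

using Statistics

# 100 copies of floatmax/16: the true total exceeds floatmax, so the
# accumulator overflows; the mean and hence the variance come out Inf
# even though the true variance is exactly 0.
x = fill(floatmax(Float64) / 16, 100)
sum(x)   # Inf
mean(x)  # Inf
var(x)   # Inf
# Welford's update M += (x[k] - M) / k keeps M near floatmax/16 throughout,
# which is why var2 and var3 return a finite (and here correct) 0.0.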