ANOVA Tests in Julia?

Since people keep discovering this post, I think the only two packages worth talking about here are now:

marcpabst / ANOVA.jl requires DataFrames, external setup and fitting of linear model (GLM.jl), and I think is only fixed factors, with “type” argument referring to variant ways to calculate the sums of squares. I don’t know if there’s any way to do nested factors with linear models. I’m assuming robust to unbalanced data due to reliance on regression?, updated 8 months ago

and mine:

BioTurboNick / SimpleANOVA.jl which does a non-GLM-requiring calculation, does random effects, nesting and repeated measures, can use DataFrames or vectors or multidimensional arrays, but requires balanced data and only does one type of sum of squares. Also can do contrasts and omega-square effect size.

12 Likes

I’ve used the ANOVA successfully, but the package seems not maintained anymore, and incompatible with some of the packages.

There is another package MixedAnova, so i am curious that if anyone tried it, and the overall experience of using it.

Is really no any actual package for ANOVA?

1 Like

HypothesisTests.jl has one-way ANOVA now:

https://juliastats.org/HypothesisTests.jl/stable/parametric/#One-way-ANOVA-Test

And GLM.jl’s ftest model claims to be usable for ANOVA:

https://juliastats.org/GLM.jl/stable/api/#GLM.ftest

I think these are both basic cases, however, and I will leave it to those who know more about statistics to speak to the availability of more general tools.

1 Like
1 Like

It seems work. But MixedAnova.jl should be intalled. I think there is ssue with package installation.

Not sure how you define “actual package” but mine does many different variants of ANOVA: https://github.com/BioTurboNick/SimpleANOVA.jl

2 Likes

@BioTurboNick having just stumbled on this thread because it popped up due to recent comments, I see you implemented this but your background is in biology and not mathematics etc. My biggest concern is that calculations of sums of squares are numerically unstable and it’s easily possible to get wrong answers due to numerical computation issues. How much were you aware of numerical stability issues when coding your package?

1 Like

At the time I wrote it, nearly zero awareness of numerical stability I must admit.

I have recently had a headlong crash-course in dealing with numerical stability in another context, so it would be good for me to revisit the package with that in mind.

2 Likes

Hi! Good point! But what about var from StatsBase? This is really just sum of squares… Do we need adjust StatsBase? What kind of algorithm could be prefered? (Welford? Weighted incremental? )

##### General central moment
function _moment2(v::RealArray, m::Real; corrected=false)
    n = length(v)
    s = 0.0
    for i = 1:n
        @inbounds z = v[i] - m
        s += z * z
    end
    varcorrection(n, corrected) * s
end

Seems like a reasonable choice.

Let me know, if I can contribute in some way, when you revisit your package. My implementation of R’s aov shouldn’t suffer from numerical instability.

Do we really have problems with this?

Seems var work… Can you kindly provide more robust test?

using StatsBase

v = rand(1000000)
v2 = v .+ 1000_000_000
v3 = append!(copy(v2), v2)
tv = rand(8)

function var2(v)
    M  = v[1]
    S  = 0.0
    N = length(v)
    for k = 2:N
    x = v[k]
    Mn = M
    M += (x - M) / k
    S += (x - M) * (x - Mn)
    end
  return S / (N - 1)
end

function var3(v)
    M  = v[1]
    S  = 0.0
    N = length(v)
    for k = 2:N
        x  = v[k]
        Mn = M
        M  = Mn + (x - Mn) / k
        S  = S + ((x - Mn) * (x - M) - S) / k
    end
  return S * N / (N - 1)
end

println("check: ", isapprox(var(tv), var2(tv)), " ", isapprox(var(tv), var3(tv)))
println("var: ", var(v) - var(v2), " ", var(v) - var(v3))
println("var2: ", var(v) - var2(v2), " ", var(v) - var2(v3))
println("var3: ", var(v) - var3(v2), " ", var(v) - var3(v3))
check: true true
var: 2.2700591406632498e-12 4.16894628540998e-8
var2: 1.3641951304710354e-8 6.139953312445101e-8
var3: 1.3641966167821096e-8 6.1399576617438e-8

I appreciate the offer! I did some checking and made PR.

https://github.com/BioTurboNick/SimpleANOVA.jl/pull/23

It would be great if you could see if what I did makes any sense, or if I’m missing other sources of instability.

1 Like

Just some thoughts… Haven’t tested yet.

Try sampling uniform on 0 to max float value. Try sampling from a t dist with 2 degrees of freedom. Try sampling from a t dist with 2 dof and center at half max float. Try sampling from a distribution where variance is something like 4x machine epsilon for center of 1/16 max float…

I made some experiments, and in the worst case when variable is exp.(rand(Uniform(1,log(prevfloat(typemax(Float64)))), 10000000)) we have Inf for var and var2 ; NaN for var3.

Try sampling from a distribution where variance is something like 4x machine epsilon for center of 1/16 max float

For this, I have Inf for var and 0.0 for var2 and var3

So for var2 and var3 - these are appropriate results, but Inf for var is discussible.

Since var is giving things of scale x^2, try something like:

var(exp.(rand(Uniform(1,log(prevfloat(typemax(Float64)))/2.1), 10000000)))