ANOVA Tests in Julia?

While examples are not in the docs you can look at ftest.jl in GLM.jl/ftest.jl at master · JuliaStats/GLM.jl · GitHub which has a large commented section that gives explicit examples and results that could probably be put in the docs. There also examples in runtests.jl found in GLM.jl/runtests.jl at master · JuliaStats/GLM.jl · GitHub.

In general, it is used to compare nested models fit separately. ftest(fullmodel,nullmodel), it can also do more than two models at a time. This topic has come up many different times and the discussion general runs between “we need to do this because everyone expects it” and “it is misused and abused so we would rather individuals do explicit model comparisons”. I am not taking a side one way or another. The ftest function explicitly compares nested models. It can be used in a way to give you information for a typical ANOVA table but you will need to explicitly fit the different models to do it.

2 Likes

I don’t think anyone doubts that there needs to be basic ANOVA functionality. But what is the problem with ANOVA.jl?

1 Like

I just realized there are four different repos names ANOVA.jl, but I assume you mean this one.

1 Like

@Taylor_Maxwell I appreciate that. And @mkborregaard I’m looking into ANOVA.jl now.

To both your points, I don’t know what models are supposed to be used, or why. It wasn’t taught in my biostats class or can be found in my otherwise detailed textbook (Zar 4ed) which doesn’t discuss GLM. I don’t know how to map the concepts to each other and I haven’t found good resources to improve my understanding.

For ANOVA.jl specifically, apart from having to define and fit a model, I don’t understand what the “contrasts” are or why it is important to use this EffectsCoding object. Or how to specify fixed and random effects factors, interaction terms, etc.

A suggestion: a final ANOVA package should not require the user to formulate and fit a linear model but should provide output that shows the user how to construct the underlying GLM model used. That would provide a bridge between them.

The best thing would be to look at the example in the comments in https://github.com/JuliaStats/GLM.jl/blob/master/src/ftest.j.

If you have a model: y~1+x1+x2 (call this the full model) and you are interested in “x2” they you would compare the full model to a nested model that excludes “x2” (y~1+x1) which is often called the reduced model. While it looks like you are testing “x2” you are really comparing the two models where the difference between them happens to be the absence of “x2” in the reduced model. A traditional ANOVA table that presents statistics for each variable is really presenting statistics for multiple different full vs reduced model comparisons. The ftest function makes it explicitly clear what specific models you are comparing. I hope this helps.

I have not used ANOVA.jl so I can’t comment on their contrasts and effect coding.

1 Like

I would also suggest reading this comment (Poll: Do we Julians want ANOVAs? - #7 by dmbates) by dmbates (the original author of GLM, MixedModels in julia and the primary mixed effects models package lme4 in R) in an older discussion on ANOVA and the paper he links to.

3 Likes

I appreciate you trying, but starting an explanation that assumes knowledge of linear models is only marginally helpful.

You’re talking to people who know nothing about linear models and may only know ANOVAs as computations on sums of squares and within and across groups, if they even know that much.

There needs to be a translation between the two, if this is the approach taken.

Curiously, one of the papers in the older thread even mentions cases where linear regression doesn’t provide a correct answer for ANOVA (section 2.2): http://www.stat.columbia.edu/~gelman/research/published/AOS259.pdf

3 Likes

Sorry, I wasn’t trying to convince you of anything, I was just trying to answer your question. The thread I linked to had a discussion of why the person who wrote the ftest function decided to do it that way. The comment by dmbates and the paper give some of the rational for it (i.e. why and how ANOVA tables can be misinterpreted). It was not an argument for or against it.

Every basic stats class I have seen introduces regression & ANOVA and unites them in the same general linear model framework and many go further and introduce generalized linear models. The concept of comparing models is not a hard argument to follow. If someone intends to use a linear model it would be useful to know something about it, especially when it is not uncommon for them to be misinterpreted.

I fully expect that some or many packages and/or functions will be written to give someone the traditional ANOVA table because enough people have asked for it.

1 Like

I am not sure I understand what you are asking for here. Do you want people who understand ANOVA to code and maintain a package for people who don’t, with the understanding that it yields conceptually incorrect results some of the time? And you want them to do it for free, with a free software license?

@Tamas_Papp, I think You completely missed the point of @BioTurboNick. While it’s not hard for some of us to transition between the two concepts of GLM and ANOVA (because we’re e.g. trained to do so), most users without stats/math/cs background will find it hard. And they even don’t want to learn that, since for them it may be a ~half year effort which is a time they’d rather spend doing something else.

Let me re-phrase for You what the discussion here is about:

  • I’d like to see a tool that…
  • You can do that using this method… Try…
  • This looks too advanced for me.
  • Have a look here for the rationale of the choices that were made.

We chose julia to assembly, even though with assembly you can the same (and sometimes with even faster run-time). It’s just a matter of convenience for us (to get things done, without spending half a year learning a tool). Yet You don’t see people writing compiler stuff saying “do You want us, who understand assembly to code and maintain a compiler for those of You who don’t (blahblahblah)…”

Sorry, but this comment of Yours a) sounds mildly aggressive and b) brings nothing to the discussion of Anova vs GLM. Unless of course You were willing to produce a simple wrapper of GLM for anova functionality and were asking @BioTurboNick for more details :P.

2 Likes

I don’t think I did — I think I am asking a very relevant question.

Perhaps because people who are working on the compiler are also users of Julia, so the two roles don’t separate. If they didn’t use Julia otherwise, I doubt many of them would work on the compiler (unpaid).

This is not the first thread expressing a demand for ANOVA. What is important to realize that a package will not be written and maintained unless someone who needs the functionality is willing to work on it. IMO the best approach is to ask for help: other people may not do the work for you, but you can get a lot of support with the right attitude.

Which brings me back to the original question: I think that @Taylor_Maxwell linked a very relevant discussion, and I was surprised that @BioTurboNick deemed it “marginally helpful”. A clarification of what he is looking for could be useful then.

2 Likes

@abulak pretty much got it.

@Taylor_Maxwell, every basic stats class you’ve seen? Maybe, but my biostatistics class in 2005-6 did not. I doubt I was unique in having single semester course on statistics that did not introduce GLM. And I apologize, I didn’t think you were trying to convince me of anything, I was just trying to explain how, if you do want to answer how to do ANOVA with linear models, it needs to include an explanation of how the concepts of the ANOVA map onto those models.

If I ask you how to specify a model for ANOVA with one fixed factor and one random factor, what would you say? What about two fixed factors with interaction?

@Tamas_Papp - I’m not demanding it, I’m just emphasizing to you/the world that it is necessary, and why. You’re free to choose otherwise.

But people who know building things for people who don’t (or are learning) is how it always is, isn’t it? Encouraging the latter to try the former is a good thing, but only a subset can be expected to do so. I might even be willing to contribute, later.

And perhaps I’m misunderstanding (which is likely), but it sounded like GLM can also produce bad results for ANOVA for some nested designs?

In any case, the immediate problem seems completely solvable with documentation. If you believe that GLM-ANOVA is better than or equal to traditional ANOVA, and the current solution is to have users specify and fit linear models as input, then there just needs to be a clear explanation of what the models should look like using ANOVA terminology.

I think some of this is a misunderstanding of terminology. The uniting of ANOVA and regression is called the general linear model and it is limited to response/dependent variables that are continuous. The “generalized” linear model which we now call GLM generalized the general linear model to include dependent/response variables that can have a specific set of different distributions (think logistic, poisson, etc) using a canonical link function. In this discussion we are not really about generalized linear models (GLM) but the more basic general linear model. I went and looked at two or three texts used for a basic statistics class and the uniting of ANOVA and regression into the general linear model (not the related generalized linear model) was the primary culmination for each one of them for use in a basic stats course.

That being said, if you are trying to connect this with fixed AND random effects (mixed models) and nesting schemes then you are not talking about traditional ANOVA and functionality for a traditional ANOVA table will not be relevant for your mixed effect models and making most of this discussion fairly mute. That is a different topic altogether. The MixedModels package is what you should be looking at and not GLM and definitely not traditional ANOVA tables.

Back to traditional ANOVA tables, I included the link to the past discussion because if I remember right, LewisHein (who wrote ftest) had originally wanted/intended to put together code in GLM to implement traditional ANOVA functionality. After that discussion and other discussions on github he changed his mind and decided to go the ftest route.

1 Like

There’s definitely some disconnect here.

My book and notes show the only difference between the 3 types of ANOVA (I, II, III) are how you calculate the F statistic.

Model I (all fixed): F = factor MS / error MS
Model II (all random): F = factor MS / interaction MS
Model III (mixed): Each factor’s F statistic uses Model I or Model II equation depending on type.

Are you saying this Model III ANOVA is invalid, then?

FWIW, I spoke with a biostats professor at my undergrad institution. She pointed me to a resource to help me here, but also said it was rare for an undergrad 1-semester course to get to GLM, even if it’s in the textbook.

When you talk to someone about GLM they usually are talking about generalized linear models so you are right, normally generalized linear models are not talked about in a first basic course.

As for mixed effect models (definitely not discussed in a basic stats class) when people are asking for an ANOVA table they are almost exclusively asking for ANOVA for a fixed effects model (and often know nothing about mixed effects). Yes you can do mixed effect models in a ANOVA like framework but it is VERY different than what most people mean when they are talking about ANOVA or asking for an ANOVA table.

The GLM package in Julia does not even address random and/or mixed effect models in the first place. Whether you are working in Julia or R you would would be using entirely different packages (lme4 or other in R or MixedModels in Julia) to fit models with mixed effects. In fact, dmbates (from the previously linked discussion) is the primary author of both of those packages.

By inserting mixed effect models into this discussion about functionality for ANOVA tables in the GLM package is a complete and unnecessary tangent.

Could you please answer my direct question?

There’s definitely some disconnect here.

My book and notes show the only difference between the 3 types of ANOVA (I, II, III) are how you calculate the F statistic.

Model I (all fixed): F = factor MS / error MS
Model II (all random): F = factor MS / interaction MS
Model III (mixed): Each factor’s F statistic uses Model I or Model II equation depending on type.

Are you saying this Model III ANOVA is invalid, then?

Unless the text is wrong, there is no difference in the SS or MS calculations, only the F statistic. Why would they be handled by different packages?

Fixed effect models are not trivially different from fixed effect models. Part of the complexity of mixed models is expressing/defining the nesting structures, making the design matrices and the computational burden for fitting the models (usually with some form of likelihood). I am by no means an expert with linear mixed-effect models and would not do justice to the subject.

I really am not trying to argue with you. I am saying that mixed models are not trivial and are considered more advanced and complicated than fixed effects models. As such, they have always required special packages that are more complicated and require more expertise to do right than an implementation of standard fixed effect linear models. To give you a taste, look up how to fit and interpret mixed effect models with lme4 in R or MixedModels in Julia and compare that to any implementation of standard fixed effects models. While mixed effects with only categorical independent (the x’s) variables is simpler those that include continuous independent/predictor (the x’s) variables, most discussions usually want to have a solution that includes both continuous and categorical independent/predictor variables. Please, just look at these packages first (these are the premier mixed models packages in R and Julia) to get an idea what they are like.

1 Like

I think this discussion is turning a little fruitless.

ANOVA tests are very widely used in sciences based on block experiments, like much of classical ecology. They are particularly nice because you can design the experiment to follow a specific ANOVA test and vice versa. For this reason, ANOVA is possibly the most widely performed statistical procedure in the world after the bivariate linear regression.

I know that there are statisticians that argue about potential pitfalls in ANOVA, and of course anyone should only implement statistical procedures in their own package that they are comfortable with. I don’t think it qualifies ss an argument for not having an ANOVA functionality in Julia at all.

Also, yes, they can be statistically demonstrated to be derivable from a specific type of GLM (using f-tests to compare linear models fit with only categorical dependent variable), but it is a little unclear why that needs to matter. Of course, understanding the statistical tests you are using, and it’s pitfalls, is very important, but as said before this is largely a documentation issue.

When it comes to mixed models, yes, the generalized linear mixed model is a complicated topic, but specifically in the context of ANOVA it is actually well-described and tractable. In base R, for instance, a mixed effects ANOVA is created by taking an lm object and running anova(mylmobject, type = "III"), no need at all to call up lme4 or use REML solvers.

Noone is asking anyone in particular to implement this, but it is good to have this discussion, and it is useful for people who want to contribute to building up the language to know that there is a need for this. Development can usefully start in bringing together the authors of the 4 existing ANOVA.jl packages and having a conversation about how best to alleviate the perceived lack of functionality.

9 Likes

I don’t think that anyone was arguing for not having ANOVA. And of course that would be rather pointless, because people are free to code what they want.

I think part of the problem is that ANOVA is used as an umbrella term, with specific fields having preferred/canonical models and doing ANOVA for those. Simple one- and two-way designs are not complicated to implement, but partitioning, incomplete block designs etc can quickly get complicated.

A quick Github search identifies two actively maintained packages,

which is a standalone design, and

which uses linear models.