ANOVA Tests in Julia?

Juan · November 18, 2018, 1:19am

An Anova analysis gives you the same coefficients but also the F-ratio and posthoc tests.

BioTurboNick · December 7, 2018, 9:39pm

Just came across this topic because I was looking for ANOVA in Julia. I see the note about the ftest function in GLM.jl, but the docstring doesn’t show how to implement it and the GLM online docs don’t either.

I should put a finer point on this thread: ANOVA is one of the basic statistical tests taught to and used by biologists. I think its absence from Julia is going to be a problem for general adoption.

Taylor_Maxwell · December 7, 2018, 10:18pm

While examples are not in the docs you can look at ftest.jl in GLM.jl/ftest.jl at master · JuliaStats/GLM.jl · GitHub which has a large commented section that gives explicit examples and results that could probably be put in the docs. There also examples in runtests.jl found in GLM.jl/runtests.jl at master · JuliaStats/GLM.jl · GitHub.

In general, it is used to compare nested models fit separately. ftest(fullmodel,nullmodel), it can also do more than two models at a time. This topic has come up many different times and the discussion general runs between “we need to do this because everyone expects it” and “it is misused and abused so we would rather individuals do explicit model comparisons”. I am not taking a side one way or another. The ftest function explicitly compares nested models. It can be used in a way to give you information for a typical ANOVA table but you will need to explicitly fit the different models to do it.

mkborregaard · December 7, 2018, 10:22pm

I don’t think anyone doubts that there needs to be basic ANOVA functionality. But what is the problem with ANOVA.jl?

alejandromerchan · December 7, 2018, 10:31pm

I just realized there are four different repos names ANOVA.jl, but I assume you mean this one.

BioTurboNick · December 8, 2018, 3:21am

@Taylor_Maxwell I appreciate that. And @mkborregaard I’m looking into ANOVA.jl now.

To both your points, I don’t know what models are supposed to be used, or why. It wasn’t taught in my biostats class or can be found in my otherwise detailed textbook (Zar 4ed) which doesn’t discuss GLM. I don’t know how to map the concepts to each other and I haven’t found good resources to improve my understanding.

For ANOVA.jl specifically, apart from having to define and fit a model, I don’t understand what the “contrasts” are or why it is important to use this EffectsCoding object. Or how to specify fixed and random effects factors, interaction terms, etc.

A suggestion: a final ANOVA package should not require the user to formulate and fit a linear model but should provide output that shows the user how to construct the underlying GLM model used. That would provide a bridge between them.

Taylor_Maxwell · December 8, 2018, 4:18am

The best thing would be to look at the example in the comments in https://github.com/JuliaStats/GLM.jl/blob/master/src/ftest.j.

If you have a model: y~1+x1+x2 (call this the full model) and you are interested in “x2” they you would compare the full model to a nested model that excludes “x2” (y~1+x1) which is often called the reduced model. While it looks like you are testing “x2” you are really comparing the two models where the difference between them happens to be the absence of “x2” in the reduced model. A traditional ANOVA table that presents statistics for each variable is really presenting statistics for multiple different full vs reduced model comparisons. The ftest function makes it explicitly clear what specific models you are comparing. I hope this helps.

I have not used ANOVA.jl so I can’t comment on their contrasts and effect coding.

Taylor_Maxwell · December 8, 2018, 4:34am

I would also suggest reading this comment (Poll: Do we Julians want ANOVAs? - #7 by dmbates) by dmbates (the original author of GLM, MixedModels in julia and the primary mixed effects models package lme4 in R) in an older discussion on ANOVA and the paper he links to.

BioTurboNick · December 8, 2018, 9:13pm

I appreciate you trying, but starting an explanation that assumes knowledge of linear models is only marginally helpful.

You’re talking to people who know nothing about linear models and may only know ANOVAs as computations on sums of squares and within and across groups, if they even know that much.

There needs to be a translation between the two, if this is the approach taken.

Curiously, one of the papers in the older thread even mentions cases where linear regression doesn’t provide a correct answer for ANOVA (section 2.2): http://www.stat.columbia.edu/~gelman/research/published/AOS259.pdf

Taylor_Maxwell · December 8, 2018, 10:54pm

Sorry, I wasn’t trying to convince you of anything, I was just trying to answer your question. The thread I linked to had a discussion of why the person who wrote the ftest function decided to do it that way. The comment by dmbates and the paper give some of the rational for it (i.e. why and how ANOVA tables can be misinterpreted). It was not an argument for or against it.

Every basic stats class I have seen introduces regression & ANOVA and unites them in the same general linear model framework and many go further and introduce generalized linear models. The concept of comparing models is not a hard argument to follow. If someone intends to use a linear model it would be useful to know something about it, especially when it is not uncommon for them to be misinterpreted.

I fully expect that some or many packages and/or functions will be written to give someone the traditional ANOVA table because enough people have asked for it.

Tamas_Papp · December 9, 2018, 7:08am

I am not sure I understand what you are asking for here. Do you want people who understand ANOVA to code and maintain a package for people who don’t, with the understanding that it yields conceptually incorrect results some of the time? And you want them to do it for free, with a free software license?

abulak · December 9, 2018, 11:14am

@Tamas_Papp, I think You completely missed the point of @BioTurboNick. While it’s not hard for some of us to transition between the two concepts of GLM and ANOVA (because we’re e.g. trained to do so), most users without stats/math/cs background will find it hard. And they even don’t want to learn that, since for them it may be a ~half year effort which is a time they’d rather spend doing something else.

Let me re-phrase for You what the discussion here is about:

I’d like to see a tool that…
You can do that using this method… Try…
This looks too advanced for me.
Have a look here for the rationale of the choices that were made.

We chose julia to assembly, even though with assembly you can the same (and sometimes with even faster run-time). It’s just a matter of convenience for us (to get things done, without spending half a year learning a tool). Yet You don’t see people writing compiler stuff saying “do You want us, who understand assembly to code and maintain a compiler for those of You who don’t (blahblahblah)…”

Sorry, but this comment of Yours a) sounds mildly aggressive and b) brings nothing to the discussion of Anova vs GLM. Unless of course You were willing to produce a simple wrapper of GLM for anova functionality and were asking @BioTurboNick for more details :P.

Tamas_Papp · December 9, 2018, 12:42pm

I don’t think I did — I think I am asking a very relevant question.

Perhaps because people who are working on the compiler are also users of Julia, so the two roles don’t separate. If they didn’t use Julia otherwise, I doubt many of them would work on the compiler (unpaid).

This is not the first thread expressing a demand for ANOVA. What is important to realize that a package will not be written and maintained unless someone who needs the functionality is willing to work on it. IMO the best approach is to ask for help: other people may not do the work for you, but you can get a lot of support with the right attitude.

Which brings me back to the original question: I think that @Taylor_Maxwell linked a very relevant discussion, and I was surprised that @BioTurboNick deemed it “marginally helpful”. A clarification of what he is looking for could be useful then.

BioTurboNick · December 9, 2018, 3:29pm

@abulak pretty much got it.

@Taylor_Maxwell, every basic stats class you’ve seen? Maybe, but my biostatistics class in 2005-6 did not. I doubt I was unique in having single semester course on statistics that did not introduce GLM. And I apologize, I didn’t think you were trying to convince me of anything, I was just trying to explain how, if you do want to answer how to do ANOVA with linear models, it needs to include an explanation of how the concepts of the ANOVA map onto those models.

If I ask you how to specify a model for ANOVA with one fixed factor and one random factor, what would you say? What about two fixed factors with interaction?

@Tamas_Papp - I’m not demanding it, I’m just emphasizing to you/the world that it is necessary, and why. You’re free to choose otherwise.

But people who know building things for people who don’t (or are learning) is how it always is, isn’t it? Encouraging the latter to try the former is a good thing, but only a subset can be expected to do so. I might even be willing to contribute, later.

And perhaps I’m misunderstanding (which is likely), but it sounded like GLM can also produce bad results for ANOVA for some nested designs?

In any case, the immediate problem seems completely solvable with documentation. If you believe that GLM-ANOVA is better than or equal to traditional ANOVA, and the current solution is to have users specify and fit linear models as input, then there just needs to be a clear explanation of what the models should look like using ANOVA terminology.

Taylor_Maxwell · December 9, 2018, 4:05pm

I think some of this is a misunderstanding of terminology. The uniting of ANOVA and regression is called the general linear model and it is limited to response/dependent variables that are continuous. The “generalized” linear model which we now call GLM generalized the general linear model to include dependent/response variables that can have a specific set of different distributions (think logistic, poisson, etc) using a canonical link function. In this discussion we are not really about generalized linear models (GLM) but the more basic general linear model. I went and looked at two or three texts used for a basic statistics class and the uniting of ANOVA and regression into the general linear model (not the related generalized linear model) was the primary culmination for each one of them for use in a basic stats course.

That being said, if you are trying to connect this with fixed AND random effects (mixed models) and nesting schemes then you are not talking about traditional ANOVA and functionality for a traditional ANOVA table will not be relevant for your mixed effect models and making most of this discussion fairly mute. That is a different topic altogether. The MixedModels package is what you should be looking at and not GLM and definitely not traditional ANOVA tables.

Back to traditional ANOVA tables, I included the link to the past discussion because if I remember right, LewisHein (who wrote ftest) had originally wanted/intended to put together code in GLM to implement traditional ANOVA functionality. After that discussion and other discussions on github he changed his mind and decided to go the ftest route.

BioTurboNick · December 9, 2018, 7:01pm

There’s definitely some disconnect here.

My book and notes show the only difference between the 3 types of ANOVA (I, II, III) are how you calculate the F statistic.

Model I (all fixed): F = factor MS / error MS
Model II (all random): F = factor MS / interaction MS
Model III (mixed): Each factor’s F statistic uses Model I or Model II equation depending on type.

Are you saying this Model III ANOVA is invalid, then?

BioTurboNick · December 9, 2018, 9:41pm

FWIW, I spoke with a biostats professor at my undergrad institution. She pointed me to a resource to help me here, but also said it was rare for an undergrad 1-semester course to get to GLM, even if it’s in the textbook.

Taylor_Maxwell · December 9, 2018, 10:01pm

When you talk to someone about GLM they usually are talking about generalized linear models so you are right, normally generalized linear models are not talked about in a first basic course.

As for mixed effect models (definitely not discussed in a basic stats class) when people are asking for an ANOVA table they are almost exclusively asking for ANOVA for a fixed effects model (and often know nothing about mixed effects). Yes you can do mixed effect models in a ANOVA like framework but it is VERY different than what most people mean when they are talking about ANOVA or asking for an ANOVA table.

The GLM package in Julia does not even address random and/or mixed effect models in the first place. Whether you are working in Julia or R you would would be using entirely different packages (lme4 or other in R or MixedModels in Julia) to fit models with mixed effects. In fact, dmbates (from the previously linked discussion) is the primary author of both of those packages.

By inserting mixed effect models into this discussion about functionality for ANOVA tables in the GLM package is a complete and unnecessary tangent.

BioTurboNick · December 10, 2018, 4:36am

Could you please answer my direct question?

There’s definitely some disconnect here.

My book and notes show the only difference between the 3 types of ANOVA (I, II, III) are how you calculate the F statistic.

Model I (all fixed): F = factor MS / error MS
Model II (all random): F = factor MS / interaction MS
Model III (mixed): Each factor’s F statistic uses Model I or Model II equation depending on type.

Are you saying this Model III ANOVA is invalid, then?

Unless the text is wrong, there is no difference in the SS or MS calculations, only the F statistic. Why would they be handled by different packages?

Taylor_Maxwell · December 10, 2018, 5:11am

Fixed effect models are not trivially different from fixed effect models. Part of the complexity of mixed models is expressing/defining the nesting structures, making the design matrices and the computational burden for fitting the models (usually with some form of likelihood). I am by no means an expert with linear mixed-effect models and would not do justice to the subject.

I really am not trying to argue with you. I am saying that mixed models are not trivial and are considered more advanced and complicated than fixed effects models. As such, they have always required special packages that are more complicated and require more expertise to do right than an implementation of standard fixed effect linear models. To give you a taste, look up how to fit and interpret mixed effect models with lme4 in R or MixedModels in Julia and compare that to any implementation of standard fixed effects models. While mixed effects with only categorical independent (the x’s) variables is simpler those that include continuous independent/predictor (the x’s) variables, most discussions usually want to have a solution that includes both continuous and categorical independent/predictor variables. Please, just look at these packages first (these are the premier mixed models packages in R and Julia) to get an idea what they are like.

Topic		Replies	Views
Poll: Do we Julians want ANOVAs? Statistics	14	3716	December 18, 2018
Didn't get consistent result of ANOVA from R and Julia General Usage question , package	2	569	July 4, 2019
Should we mimic R (, MATLAB, ...) functions? General Usage	20	2662	June 28, 2018
ANOVA and TukeyHSD Statistics	4	911	November 4, 2022
Repeated measures ANOVA Statistics	11	1166	May 8, 2021

ANOVA Tests in Julia?

Related topics