ANOVA Tests in Julia?

I think this discussion is turning a little fruitless.

ANOVA tests are very widely used in sciences based on block experiments, like much of classical ecology. They are particularly nice because you can design the experiment to follow a specific ANOVA test and vice versa. For this reason, ANOVA is possibly the most widely performed statistical procedure in the world after the bivariate linear regression.

I know that there are statisticians that argue about potential pitfalls in ANOVA, and of course anyone should only implement statistical procedures in their own package that they are comfortable with. I don’t think it qualifies ss an argument for not having an ANOVA functionality in Julia at all.

Also, yes, they can be statistically demonstrated to be derivable from a specific type of GLM (using f-tests to compare linear models fit with only categorical dependent variable), but it is a little unclear why that needs to matter. Of course, understanding the statistical tests you are using, and it’s pitfalls, is very important, but as said before this is largely a documentation issue.

When it comes to mixed models, yes, the generalized linear mixed model is a complicated topic, but specifically in the context of ANOVA it is actually well-described and tractable. In base R, for instance, a mixed effects ANOVA is created by taking an lm object and running anova(mylmobject, type = "III"), no need at all to call up lme4 or use REML solvers.

Noone is asking anyone in particular to implement this, but it is good to have this discussion, and it is useful for people who want to contribute to building up the language to know that there is a need for this. Development can usefully start in bringing together the authors of the 4 existing ANOVA.jl packages and having a conversation about how best to alleviate the perceived lack of functionality.

9 Likes

I don’t think that anyone was arguing for not having ANOVA. And of course that would be rather pointless, because people are free to code what they want.

I think part of the problem is that ANOVA is used as an umbrella term, with specific fields having preferred/canonical models and doing ANOVA for those. Simple one- and two-way designs are not complicated to implement, but partitioning, incomplete block designs etc can quickly get complicated.

A quick Github search identifies two actively maintained packages,

which is a standalone design, and

which uses linear models.

I definitely am not and have not argued that someone shouldn’t create the functionality for simple fixed effect ANOVAs. All I initially did was try and answer a question that was posed about how to use the ftest function. When that did not seem to be enough, I pointed to specific examples found in the comments of the code and to give further background about the rationale that was used for it I linked to prior discussions that ultimately led LewisHein from initially wanting to do a standard ANOVA table (for fixed effect models) implementation in GLM to instead writing ftest. I wasn’t trying to argue that it should not be done but just trying to give answers about how to use it and why it was written. In further posts I was trying to respond to comments specifically directed at me. In hindsight, I should have stopped right there as it turned into a tangential and confused rabbit hole.

I’m sure I have less biostats experience than you, but I have a bit of Julia experience and I’ve gotten the f-test version of this working in the past (confirmed the results with R to be sure I did it right). Do you want to collaborate on a blog post or something explaining this to other folks in our situation? I agree that people are going to want a function that does the job (even if it might be misused), and it seems like other packages are working on that, but it might be nice to have a resource to point people to, written from the perspective of biologists w/o strong math/stats background.

There’s a Coursera intro to biostats course I got part way through, and their textbook (free if you want) discussed ANOVA in the context of linear models. I recall the book being really good in some places and not great in others, so YMMV.

2 Likes

To give an example of the opposite problem, I come from a pure math background and have an OK understanding of GLMs (and their variant with mixed effects) and what it means to compare two models of which one is a special case of the other and how one could use an ftest to do so. However as soon as my neuroscience / psychology colleagues start talking about one-way or two-way ANOVA within or between of type I, II or III, I get completely lost. I would actually like to see your blog post to be able to do the opposite translation (from the GLM jargon I understand to the ANOVA jargon that still confuses me).

7 Likes

Heh, I’ll keep that in mind. Happy to share a draft with you and you can let me know if there’s something that can be clarified from your perspective.

1 Like

My understanding is that the type I, II, III is historical and comes from SAS. I had to look it up once when learning R. I’d assume that’s why there’s been confusion about mixed models in this discussion because I think that may refer to something different with the SAS terminology.

I’d be willing to help with a blog post, once I’ve wrangled it myself. Thank you for the links.

@mkborregaard - Thank you for that successful intervention :slight_smile:

@piever - Yes, I think your description captures the issue nicely. As a biologist I’m thinking in terms of treatments, replicates, individuals, measurements and was taught there are tables and equations you just plug in to get the answer.

Hopefully the training can merge down the line so there isn’t this separation. Along with computer science, more theoretical stats exposure would be really useful.

@Zach_Christensen - Could be. My stats textbook is a dense reference that cites the original papers describing some of these analyses. Occasionally mentions a computational package and it might have been in SAS.

@Taylor_Maxwell - Apologies for the confusion, didn’t mean anything to get heated.

2 Likes

Hi, I’m also a biologist and it was really hard to get free from the ANOVA model. But I know that’s a hurdle to a lot of people, particularly in agricultural research, and my attempts to move people in my lab away from ANOVA failed. Anyway, I do think having an ANOVA framework is useful, so what you’re doing is important. And I think @Zach_Christensen is correct, that division was at least popularized by SAS. I went to school at NC State, the birthplace of SAS and got to use it for free in my stats classes, but I never used for my research, luckily my advisor let me work with R, even though she had no idea how to run my scripts. Please share your solution when you have it.

1 Like

@BioTurboNick, No apologies necessary, I realized that the thread was turning in a direction (mixed effect models) that I thought was very tangential from what specific type of ANOVA functionality (standard fixed effects) that I thought most people were really referring to and I was not sure exactly how I should answer some of the questions without making it an argument (which is the last thing I want). Obviously I am not the best at wording these things or coming across as the nicest person in the world so apologies from me if I came across poorly.

2 Likes

FWIW, I’ve put together an ANOVA package: https://github.com/BioTurboNick/SimpleANOVA.jl

Takes matrix-structured data, can handle many nested factors (I think) and up to 3 crossed factors, fixed or random. Textbook sums of squares method. 1 function to call.

Output is not polished. Code not stable right yet.

2 Likes

Great! how does it relate to the existing ANOVA packages?

2 Likes

marcpabst/ ANOVA.jl requires DataFrames, external setup and fitting of linear model (GLM.jl), and I think is only fixed factors, with “type” argument referring to variant ways to calculate the sums of squares. I don’t know if there’s any way to do nested factors with linear models. I’m assuming robust to unbalanced data due to reliance on regression?, updated 3 months ago

ZaneMuir/ ANOVA.jl uses vectors for data and factor level vectors, separate functions for different types of ANOVA, only 1 and 2 way available, only fixed factors I believe, no nested factors, repeated measures variants available, no support for unbalanced data, updated 2 months ago, available in Pkg

ibadr/ ANOVA.jl uses DataFrames and linear models with GLM.jl, last updated 2 years ago. marcpabst’s is essentially an improved version of this.

grero/ ANOVA.jl uses vectors to do 2-way ANOVA with factor level vectors, matrix to do repeated measures with a factor level vector, or vectors to do within-subjects repeated measures with factor level vectors in separate functions, last updated 2 years ago

JOfTheAncientGermanSpear/ SimpleAnova.jl has user package factors into groups, restricted to 1-way ANOVA, last updated 3 years ago

13 Likes

Awesome, thanks for this really useful overview!

Any efforts to consolidate the packages? There seems to be some features unique to some and all seem to have missing features.

I’m intending to implement repeated measures and block formats in my package and partial support for unbalanced designs, eventually. Also taking different inputs (DataFrames, vectors).

I don’t know how much use specifying different Sum of Squares types are, based on what I’ve read about them. (Type I makes no sense really, Type II (default) is stronger without interactions, and Type III is valid with interactions but it doesn’t help much to interpret the factor effects with interactions present.)

A further step would be to internally replace the implementation for all-fixed-factor (Model I) designs to GLM, using marcpabst’s code possibly. And then maybe figure out how to do mixed models for the other models?

It’d be nice for the final package to all be in one called ANOVA.jl in Pkg, which ZaneMuir’s has.

But this is a lot of work and I have other things to do right now :slight_smile:

2 Likes

Great review @BioTurboNick! As an author of one of those packages (grero/ANOVA.jl), I recall putting it together both as a training exercise for myself to understand repeated measures ANOVA, and because I needed it for a side project. It’d be happy to contribute whatever I can to a consolidated ANOVA package.

4 Likes

I have just looked into Julia and it seems very interesting.

I am one of those that have been involved in the discussion about sum of squares types.

These discussions have been about categorical explanatory variables. The functions in R (anova and car::Anova) can also take continuous variables as input. Then, the discussions about sum of squares are even more important because it is unlogic that Celsius and Fahrenheit give different results.

Mixing categorical and continuous variables are common when running designed experiments. Second order terms of continuous variables are used when doing response surface modelling. The theory for type II sums of squares extends naturally to polynomial terms. Then A is said to be contained in A^2 similarly to how A is contained in AB. In https://doi.org/10.1080/02664760701594246, we call this Type II* sums of squares.

A problem with the model formula in R and Julia (?) is that second order terms must be defined as new variables instead of having a possibility within the “formula language”.

In R you can, however, write something like “lm(y ~ A + I(A^2))”. I don’t know whether there exists a similar possibility in Julia. So in R it is possible to recognize A as being contained in I(A^2), but this is not the case in the function car::Anova. Celsius and Fahrenheit will give different results.

The function ffmanova::ffmanova can see that A is contained in I(A^2) and the problems are avoided. To see examples run the r code in this message: [R-pkgs] MANOVA for collinear responses with rotation testing (ffmanova) + Synthetic data (RegSDC)

It would have been very nice if Type II* sums of squares could be implemented in Julia.

The r package ffmanova is translated to R from Matlab. Maybe someday I, or someone else, can translate it to Julia

Porting car::Anova which is GPL to Julia would be great. Might be worthwhile to ask John Fox (jfox@mcmaster.ca) for extending the license to a permissive one to be able to work directly from the code without injecting copyleft.

As for keeping track of terms, take a look at https://github.com/JuliaStats/StatsModels.jl/pull/71.

Making use of car::Anova seems fine. To handle power of continuous variables (keeping track of terms), parts of this code need to be rewritten. The problem is the use of attr(terms(mod), “factors”). In ffmanova this matrix has be reformulated by the use of ffmanova::fixModelMatrix (see the documentation). (Unfortunately ModelMatrix has another meaning that model.matrix because the code was originally written in Matlab.)