I’m currently developing a small package that generates ANOVA tables from linear models. In R you have anova() from the stats package which basically performs an type I ANOVA and then there is drop1() which basically (using the right contrast settings) gives you a type III ANOVA. (There is also Anova() from the car package which gives you type II and (different) type III ANOVAS - in short: It’s quite a mess.)
So far I have a function anova(model[, anovatype]) which I think makes the most sense. So the questions here (actually it’s a question regarding package developement in general) is if we should abandon familiar (meaning: R, MATLAB) function structure in favor of a more sensible approach or if we should make it as easy as possible to switch.
I think making it sensible makes the most sense. Take the good ideas from other APIs but don’t just mimic them for ease of porting. Julia is different enough that skin deep similarity buys you less than it might seem.
Julia has an interface for getting the coefficients and printing the results for regressions via StatsModels. imo it would probably be best to have consistency within the julia ecosystem than trying to have consistency across ecosystems. especially because R is so inconsistent in its regression outputs.
I would love to be as consistent with the Julia ecosystem as possible so could you please elaborate how one could use StatsModels in this case? Would something like anova(model::DataFrameRegressionModel) be appropriate?
Interesting. The result would something like fit(AnovaModel, ...) that would internally use one or multiple calls to fit(LinearModel, ...) I guess? And then of course, we would need something like MixedAnovaModel as well.
Definitely go with the style that is more idiomatic to Julia. Specifically, making use of multiple dispatch should be encouraged, and can lead to very nice, fast, and modular interfaces.
In theory, R also has multiple dispatch, but the S3/S4 setup is so complicated that only seasoned developers use it. So R APIs are often a result of various compromises between speed, complexity (for the developer) and tradition.
The cost of switching should be mitigated by good documentation (docstrings, examples, ideally generated documentation with eg Documenter.jl).
Please note that the StatsModels package already has a dropterm function.
Also, I would recommend reading “Exegeses on Linear Models” by Bill Venables (https://www.stats.ox.ac.uk/pub/MASS3/Exegeses.pdf) before implementing any anova functions. As he mentions, all the different “Types” are artificial constructs and holdovers from days when fitting a single linear model was exceedingly difficult and time consuming. Anyone who is not using punched cards for data input should ignore them.
I think it’s best to have well designed “Julian” APIs, and then have convenience packages that provide R, MATLAB, or whatever interfaces, to make it easy for people to more easily move code from those languages.
Thanks @dmbates . I will read the paper as soon as I have some spare time.
But in the end, any ANOVA function would be purely for convenience. If you try to convince a psychologist to switch to (or even consdier using) Julia, he or she will ask “Where is the ANOVA function? Why does it give different results than R / SPSS / STATA?”. In fact, published papers are full of (at least) questionable statistically methods and many scientists don’t really understand them, nor do they care.
And although any criticism is probably justified (especially given your background), I believe that such functions are a necessity. Otherwise people will simply continue to use the R packages that are already in place and widely used.
@pdeffebach that was the reason for my question. GLM.jl works an awful lot like the corresponding R functions, if not the same.
Yes, I agree that a one-stop shop for ANOVA (and probit, logit, etc.) is definitely needed, and thanks to multiple dispatch it doesn’t need its own interface. You might want to also loop in @ Nosferican about Econometrics.jl for how you could work with their package.
Thanks. As I said, any ANOVA function would merely be for convenience. As far as I know, there is no function I can throw my data in and get a nice ANOVA table as output. R on the other hand has a ton of them.
You can fit an AnovaDataFrameRegressionModel the same as a DataFrameRegressionModel with the addition of passing an AnovaOptions() object (see the README). It’s far from beautiful but it works quite well.
Hey @dmbates, just to make sure I understand your point: You’re arguing that introducing a type III ANOVA doesn’t really make sense because if there are indeed any significant interactions, then in most cases you can’t meaningfully interpret main effects. And if there is no significant interaction between factors, then your better off using what is called a type II ANOVA, thus comparing not against the saturated model (including main effects and interactions) but against a model containing only main effects. Is this correct?
There is already GLM.ftest which functions similarly to the anova function in base R. You fit two or more nested models and feed them into ftest and you get an analysis of variance table.
It was a conscious decision on the part of the R Core Development Team not to create a function that allowed for “Types” of sums of squares. John Fox or any others can create such functions but I cannot imagine that they would ever be part of base R.
This is Bill Venables’ point. In a modern computing environment there is no need for this nonsense about Type I and Type III and Type II and even Type IV, whatever that is. Fit the alternative model; fit the null model; compare them - and you’re done.
I realize that in some disciplines it is considered important to speak of model comparisons using all this arcane and nonsensical terminology. But it doesn’t make it meaningful. I have been a statistician for 45 years and am considered reasonably knowledgeable about linear models. If pressed I think I could describe what Type I and Type III mean. I have had Type II explained to me (and it seemed very suspect) but I don’t think I have understood what Type IV means and I have never met anyone who could tell me what it means.
For Julia I hope that the JuliaStats packages never incorporate an anova function that allows for different “Types”. If the user fits both the alternative and the null model then compares them one can hope the the user understands what the two models represent. The problem with the anova “Types” mumbo-jumbo is that the user thinks they know what it means but unless you can describe the models being compared I don’t see how the result is useful.