This is an open invitation to the community. LinRegOutliers is a Julia package for detecting outliers in linear regression. In its current state, the package contains some of the pioneering algorithms from the literature. The scope of the package is direct and indirect (robust) outlier detection methods in linear regression, plus multivariate location and scale estimators.
Implementations of some of the missing methods have been assigned to our newly joined friends. There are more!
If you want to contribute, you are welcome.
This looks very promising. Just one question: the examples all use @formula(y ~ x1 + x2 + x3) etc. Is this required, or would (y,x) work? (where y is a vector and x a matrix)
I made a PR to GLM.jl with Cook’s Distance that I never felt the need to finish (shame on me), perhaps this is a better place for it to live if you don’t have it already.
Not that this isn’t a good feature, but you know that you can generate formulas programmatically, right? If you have a data frame df, you could actually do
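A minimal sketch of one way to build a formula programmatically with StatsModels. The data and column names here are made up, and a NamedTuple of columns stands in for `df`; with a real DataFrame, `propertynames(df)` would supply the column symbols instead of `keys(df)`.

```julia
using StatsModels

# Stand-in for a data frame: a NamedTuple of columns (hypothetical data).
df = (y  = [1.0, 2.0, 3.0, 4.0],
      x1 = [0.5, 1.5, 2.0, 3.5],
      x2 = [1.0, 0.0, 1.0, 0.0])

# Every column except the response becomes a predictor.
predictors = [name for name in keys(df) if name != :y]

# term(:y) on the left-hand side, term(:x1) + term(:x2) + ... on the right.
f = term(:y) ~ sum(term.(predictors))
```

The resulting `f` is a `FormulaTerm`, the same kind of object that `@formula(y ~ x1 + x2)` produces, so it can be passed wherever a formula is expected.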
Agreed, this works. It would still be nice if there were methods for (y,x). It’s as simple as it gets and often as good as the more complicated alternatives.
You are right, since the models are linear, the response vector and a design matrix are required and a tuple of (y, X) is enough. However, construction of design matrices takes time as they are not always simple, because of dummy variables, handling variable slopes and intercepts, interactions etc. The FormulaTerm or @formula works well with lm() in GLM. After all, this is the standard way of describing a regression model for both Julia GLM and R users.
Thanks to Julia's multiple dispatch, we can have many methods for one algorithm.
@Paul_Soderlind opened a new issue for this, and I offered to add second definitions of all methods that accept data as a (y, X) tuple. He said he could help whenever he had time.
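To illustrate the idea, here is a rough sketch of one algorithm with two methods, using only the LinearAlgebra standard library. The names `ToySetting` and `cooks_sketch` are hypothetical and are not the package's API; Cook's distance is used only because it is short to write down.

```julia
using LinearAlgebra

# Hypothetical stand-in for a formula-based regression setting;
# not the package's actual struct.
struct ToySetting
    y::Vector{Float64}
    X::Matrix{Float64}
end

# Method 1: response vector and design matrix given directly.
function cooks_sketch(y::AbstractVector, X::AbstractMatrix)
    n, p = size(X)
    beta = X \ y                      # least-squares coefficients
    e = y - X * beta                  # residuals
    h = diag(X * ((X' * X) \ X'))     # leverages (hat-matrix diagonal)
    s2 = sum(abs2, e) / (n - p)       # residual variance estimate
    return @. e^2 * h / (p * s2 * (1 - h)^2)
end

# Method 2: same algorithm, different input type; it forwards to method 1.
cooks_sketch(s::ToySetting) = cooks_sketch(s.y, s.X)

x = collect(1.0:10.0)
X = hcat(ones(10), x)                 # intercept column plus one regressor
y = 1 .+ 2 .* x
y[10] += 15                           # contaminate the last observation
d = cooks_sketch(y, X)
```

A formula-based method would follow the same pattern: build y and X from the formula and the data, then call the (y, X) method.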
BTW, most of what you import from GLM is actually re-exported from StatsModels.jl (everything except lm). And for StatsModels versions 0.6 or greater, the ModelFrame/ModelMatrix interface is discouraged in favor of doing something like
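A minimal sketch of that schema-based interface, assuming made-up column names and a NamedTuple column table in place of a DataFrame. The intercept is written explicitly in the formula because no model type is passed to `apply_schema` here.

```julia
using StatsModels

# Hypothetical data as a column table (a DataFrame would be used the same way).
data = (y  = [1.0, 3.0, 2.0, 5.0, 4.0],
        x1 = [0.1, 0.4, 0.2, 0.9, 0.7],
        x2 = [2.0, 1.0, 4.0, 3.0, 5.0])

f = @formula(y ~ 1 + x1 + x2)

# Attach the column types found in the data to the formula terms.
concrete_f = apply_schema(f, schema(f, data))

# Response vector and design matrix in one call ...
y, X = modelcols(concrete_f, data)

# ... or separately, from either side of the concrete formula.
y_only = modelcols(concrete_f.lhs, data)
X_only = modelcols(concrete_f.rhs, data)
```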
I don’t think it’s a huge deal to use that interface (and it might simplify some of the code actually), by storing concrete_f in the regression setting struct. You can get just the response or design matrix by calling modelcols(f.lhs, data) or modelcols(f.rhs, data) (respectively).
Yeah, if I have the time I’ll put in a PR. In the meantime, the docs for StatsModels are (I hope) comprehensive enough to get started… (here’s the deep dive)
Okay, to be honest, I’m too focused on writing new methods for the package. I care about having the literature on this subject implemented completely and production-ready in Julia. My friends and people like you could help with performance improvements, code optimizations, aesthetic improvements, etc.
This package requires that the model at hand is of the form y = X\beta + \epsilon and that the estimation is based on least squares or maximum likelihood. Implemented methods take the model either as a @formula object or as X and y, where X is the design matrix and y is the response vector. As far as I know, and correct me if I am wrong, the main estimation process in mixed models is generally handled by the EM algorithm. I have never researched the literature on this topic, but there may be other outlier detection algorithms for such models.
“Outlier” is not a precisely defined term, and deterministic binary classification of whether something is an “outlier” is not a particularly robust method for inference. The advantage is, of course, that it is fast.
It is usually better to assume a mixture distribution (eg adding something with a fat tail), or start with something like a Student-t in the first place for the errors. This, of course, is better suited to Bayesian inference, or EM.
I have to clarify that I am definitely not an expert in statistics, so I might be missing some basic understanding of the issue. Anyway, the MixedModels package uses maximum likelihood to fit the model. Does this mean I can use your approach to detect outliers then?