This is an open invitation to the community. LinRegOutliers is a Julia package for detecting outliers in linear regression. In its current state, the package implements some of the pioneering algorithms from the literature. Its scope covers direct and indirect (robust) outlier detection methods in linear regression, as well as multivariate location and scale estimators.
Implementations of some of the missing methods have been assigned to our newly joined friends. There are more!
If you would like to contribute, you are welcome.
This looks very promising. Just one question: the examples all use @formula(y ~ x1 + x2 + x3) etc. Is this required, or would (y,x) work? (where y is a vector and x a matrix)
I made a PR to GLM.jl with Cook's Distance that I never felt the need to finish (shame on me), perhaps this is a better place for it to live if you don't have it already.
Not that this isn't a good feature, but you know that you can generate formulas programmatically, right? If you have a data frame df, you could actually do
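A minimal sketch of what programmatic formula construction can look like with StatsModels. The column names `y`, `x1`, `x2`, `x3` are assumed for illustration; in practice they might come from `names(df)`.

```julia
using StatsModels

# Hypothetical column names; with a data frame these could be taken from names(df).
response = :y
predictors = [:x1, :x2, :x3]

# term(1) is the intercept, term(:x1) a column; summing the terms builds the right-hand side.
rhs_terms = term.((1, predictors...))
f = term(response) ~ sum(rhs_terms)
```

The resulting `f` is a `FormulaTerm` and can be passed wherever an `@formula(...)` would be accepted.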
Agreed, this works. It would still be nice if there were methods for (y,x). It's as simple as it gets and often as good as the more complicated alternatives.
You are right: since the models are linear, only the response vector and a design matrix are required, so a tuple of (y, X) is enough. However, constructing design matrices takes effort because they are not always simple: there are dummy variables, varying slopes and intercepts, interactions, etc. The FormulaTerm or @formula works well with lm() in GLM. After all, this is the standard way of describing a regression model for both Julia GLM and R users.
Thanks to Julia's multiple dispatch, we can have many methods for one algorithm.
@Paul_Soderlind opened a new issue for this. I offered to add a second definition of every method that accepts the data in the form of a (y, X) tuple. He said he could help whenever he had time.
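To illustrate how one algorithm can carry several methods, here is a rough sketch using Cook's distance, which was mentioned earlier in the thread. The name `cooks_sketch` is hypothetical and is not the package's API; the core method works on a design matrix and a response vector, and a second method for a single regressor simply builds the matrix and forwards to the core. A formula-based method could forward in the same way.

```julia
using LinearAlgebra

# Core method: Cook's distance from a design matrix X (n×p) and response y.
# D_i = e_i^2 / (p * s^2) * h_ii / (1 - h_ii)^2
function cooks_sketch(X::AbstractMatrix, y::AbstractVector)
    n, p = size(X)
    β = X \ y                          # least squares coefficients
    e = y - X * β                      # residuals
    Q = Matrix(qr(X).Q)                # thin Q factor, n×p
    h = vec(sum(abs2, Q; dims = 2))    # diagonal of the hat matrix
    s2 = sum(abs2, e) / (n - p)        # residual variance estimate
    return (e .^ 2 ./ (p * s2)) .* h ./ (1 .- h) .^ 2
end

# Convenience method (illustrative): a single regressor given as a vector.
# It adds an intercept column and dispatches to the core method above.
cooks_sketch(x::AbstractVector, y::AbstractVector) =
    cooks_sketch(hcat(ones(length(x)), x), y)

# Made-up data: a straight line with the last observation shifted upwards.
x = collect(1.0:10.0)
y = 2 .* x .+ 1
y[end] += 30
d = cooks_sketch(x, y)
```

Both calls end up in the same numerical routine, so the algorithm itself is written only once.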
BTW, most of what you import from GLM is actually re-exports of stuff from StatsModels.jl (everything except lm). And for StatsModels versions 0.6 or greater, the ModelFrame/ModelMatrix interface is discouraged in favor of doing something like
I don't think it's a huge deal to use that interface (and it might simplify some of the code actually), by storing concrete_f in the regression setting struct. You can get just the response or design matrix by calling modelcols(f.lhs, data) or modelcols(f.rhs, data) (respectively).
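A small sketch of the schema-based workflow described above, assuming StatsModels 0.6 or later. The toy data and column names are made up, and a NamedTuple of vectors stands in for a data frame.

```julia
using StatsModels

# Hypothetical toy data as a column table (a NamedTuple of vectors).
data = (y  = [1.0, 2.1, 2.9, 4.2],
        x1 = [1.0, 2.0, 3.0, 4.0],
        x2 = [0.5, 0.1, 0.9, 0.3])

f = @formula(y ~ 1 + x1 + x2)

# Infer column types from the data, then turn the formula into a concrete one.
sch = schema(f, data)
concrete_f = apply_schema(f, sch)

# Response vector and design matrix in one call...
y, X = modelcols(concrete_f, data)

# ...or each side separately.
y_only = modelcols(concrete_f.lhs, data)
X_only = modelcols(concrete_f.rhs, data)
```

Storing `concrete_f` would let the same formula be applied to new data later without recomputing the schema.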
Yeah, if I have the time I'll put in a PR. In the meantime, the docs for StatsModels are (I hope) comprehensive enough to get started... (here's the deep dive)
Okay, to be honest, I'm too focused on writing new methods in the package. I care about making the literature on this subject complete and production-ready in Julia. My friends and people like you could help with performance improvements, code optimizations, aesthetic improvements, etc.
Satman et al., (2021). LinRegOutliers: A Julia package for detecting outliers in linear regression. Journal of Open Source Software, 6(57), 2892, https://doi.org/10.21105/joss.02892
This package requires that the model at hand is of the form y = X\beta + \epsilon and that the estimation is based on least squares or maximum likelihood. Implemented methods take the model as a @formula object or as X and y, where X is the design matrix and y is the response vector. As far as I know, and correct me if I am wrong, the main estimation process in mixed models is generally handled by the EM algorithm. I have never researched the literature on this topic, but there may be other outlier detection algorithms for such models.
"Outlier" is not a precisely defined term, and deterministic binary classification of whether something is an "outlier" is not a particularly robust method for inference. The advantage is, of course, that it is fast.
It is usually better to assume a mixture distribution (e.g., adding a component with a fat tail), or to start with something like a Student-t distribution for the errors in the first place. This, of course, is better suited to Bayesian inference, or EM.
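As a rough illustration of the Student-t idea, here is a minimal EM sketch for linear regression with t-distributed errors and fixed degrees of freedom. The function name `tregression_sketch`, the default `ν = 4`, and the fixed iteration count are assumptions made for this example; it is not a method from LinRegOutliers.

```julia
using LinearAlgebra

# EM for y = Xβ + ε with Student-t errors, degrees of freedom ν held fixed.
# E-step: w_i = (ν + 1) / (ν + r_i^2 / σ²); M-step: weighted least squares.
function tregression_sketch(X::AbstractMatrix, y::AbstractVector; ν = 4.0, iters = 50)
    n = length(y)
    β = X \ y                                   # start from ordinary least squares
    σ2 = sum(abs2, y - X * β) / n
    w = ones(n)
    for _ in 1:iters
        r = y - X * β
        w = (ν + 1) ./ (ν .+ r .^ 2 ./ σ2)      # small weight for large residuals
        sw = sqrt.(w)
        β = (sw .* X) \ (sw .* y)               # weighted least squares update
        σ2 = sum(w .* (y - X * β) .^ 2) / n     # scale update
    end
    return β, σ2, w
end

# Made-up data: a noisy line with the last observation shifted upwards.
x = collect(1.0:10.0)
X = hcat(ones(10), x)
y = 2 .* x .+ 1 .+ 0.1 .* sin.(x)
y[end] += 30

β_ols = X \ y
β_t, σ2_t, w = tregression_sketch(X, y)
```

Instead of a yes/no outlier label, each observation ends up with a continuous weight `w[i]`, and the fit is pulled less by observations with small weights.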
I have to clarify that I am definitely not an expert in statistics, so I might be missing some basic understanding of the issue. Anyway, the MixedModels package uses maximum likelihood to fit the formula. Does this mean I can use your approach to detect outliers then?