LinRegOutliers: A Julia package for detecting outliers in linear regression

Hi there,

This is an open invitation to the community. LinRegOutliers is a Julia package for detecting outliers in linear regression. The package currently implements many of the pioneering algorithms from the literature. Its scope covers direct and indirect (robust) outlier detection methods in linear regression, as well as multivariate location and scale estimators.

Implementations of some missing methods have been assigned to our newly joined friends, and there are more to pick up!

Contributions are welcome.

Please visit the repo at the LinRegOutliers GitHub repository.

13 Likes

This looks very promising. Just one question: the examples all use @formula(y ~ x1 + x2 + x3) etc. Is this required, or would (y,x) work? (where y is a vector and x a matrix)

5 Likes

I made a PR to GLM.jl with Cook’s Distance that I never felt the need to finish (shame on me), perhaps this is a better place for it to live if you don’t have it already.

5 Likes

Not that this isn’t a good feature, but you know that you can generate formulas programmatically, right? If you have a data frame df, you could actually do

term("y") ~ sum(term.(names(df, Not("y"))))

to get many covariates at the same time.
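As a quick illustration of the programmatic approach (a hedged sketch: the data frame and its column names here are invented, and it assumes DataFrames and StatsModels are installed):

```julia
using DataFrames, StatsModels

# A toy data frame; the columns are purely illustrative.
df = DataFrame(y = rand(10), x1 = rand(10), x2 = rand(10), x3 = rand(10))

# Build y ~ x1 + x2 + x3 without writing the formula by hand:
# every column except "y" becomes a term on the right-hand side.
f = term("y") ~ sum(term.(names(df, Not("y"))))
```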

2 Likes

Agreed, this works. It would still be nice if there were methods for (y,x). It’s as simple as it gets and often as good as the more complicated alternatives.

6 Likes

We have cooks() in src/diagnostics.jl as a tool in an outlier detection algorithm.

1 Like

You are right: since the models are linear, only a response vector and a design matrix are required, so a (y, X) tuple is enough. However, constructing design matrices takes effort, as they are not always simple: dummy variables, varying slopes and intercepts, interactions, and so on. The FormulaTerm / @formula approach works well with lm() in GLM, and after all, it is the standard way of describing a regression model for both Julia GLM and R users.
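To see why the formula interface pulls its weight, here is a hedged sketch (invented column names, assuming DataFrames and StatsModels) of how a categorical predictor and an interaction expand into extra design-matrix columns:

```julia
using DataFrames, StatsModels

# Toy data with a continuous predictor x and a categorical predictor g.
df = DataFrame(y = rand(6), x = rand(6), g = ["a", "b", "c", "a", "b", "c"])

# A formula with a categorical main effect and an interaction term.
f = @formula(y ~ x + g + x & g)
sch = apply_schema(f, schema(f, df))
y, X = modelcols(sch, df)

# X now has columns for the intercept, x, the dummy-coded levels of g,
# and the x & g interactions -- more than the two raw predictors.
size(X, 2)
```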

This is true, but occasionally a (y, X) interface can be useful too.

If it already exists, perhaps making it part of the API would be nice.

2 Likes

Hi @Tamas_Papp

Thanks to Julia's multiple dispatch, we can have many methods for a single algorithm.

@Paul_Soderlind opened a new issue for this, and I offered to add second definitions of all methods that accept data in the form of a (y, X) tuple. He said he could help whenever he has time.

Thank you :slight_smile:

1 Like

BTW, most of what you import from GLM is actually re-exports of stuff from StatsModels.jl (everything except lm). And for StatsModels 0.6 or later, the ModelFrame/ModelMatrix interface is discouraged in favor of doing something like

concrete_f = apply_schema(f, schema(f, data))
y, X = modelcols(concrete_f, data)

I don’t think it’s a huge deal to use that interface (and it might simplify some of the code actually), by storing concrete_f in the regression setting struct. You can get just the response or design matrix by calling modelcols(f.lhs, data) or modelcols(f.rhs, data) (respectively).
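A minimal end-to-end sketch of that interface (the data frame and column names are invented; assumes DataFrames and StatsModels are available):

```julia
using DataFrames, StatsModels

data = DataFrame(y = rand(8), x1 = rand(8), x2 = rand(8))
f = @formula(y ~ x1 + x2)

# Resolve the schema once and keep the concrete formula around,
# e.g. stored inside the regression-setting struct.
concrete_f = apply_schema(f, schema(f, data))
y, X = modelcols(concrete_f, data)

# The response or the design matrix alone:
y_only = modelcols(concrete_f.lhs, data)
X_only = modelcols(concrete_f.rhs, data)
```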

2 Likes

Okay, noted, thank you. If you want to join, you are welcome.

3 Likes

Yeah if I have the time I’ll put in a PR :slight_smile: In the mean time, the docs for StatsModels are (I hope) comprehensive enough to get started… (here’s the deep dive)

2 Likes

Okay, to be honest, I’m focused mainly on writing new methods for the package. I want the literature on this subject to be production-ready and complete in Julia. Friends and people like you can help with performance improvements, code optimization, aesthetic improvements, etc.

2 Likes

We now have an issue about the (y, X) interface, issue #14. Of course, the needs of the community must be taken into account.

1 Like

Hi folks!

(X, y) style multiple dispatch is now implemented for all but a single method, and the next revision will include

algorithm(X::Matrix, y::Vector)

alongside the original

algorithm(setting::RegressionSetting)

for all methods. FYI.
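A minimal sketch of how the two entry points can coexist via multiple dispatch (the struct, its fields, and the residual computation here are illustrative stand-ins, not the package's actual internals):

```julia
# Illustrative setting type holding the data; the real RegressionSetting
# in LinRegOutliers is built from a formula and a data set.
struct ToySetting
    X::Matrix{Float64}
    y::Vector{Float64}
end

# The core method works directly on (X, y)...
function algorithm(X::Matrix, y::Vector)
    β = X \ y            # ordinary least-squares fit
    return y .- X * β    # residuals, as a stand-in for real diagnostics
end

# ...and the setting-based method simply forwards to it.
algorithm(setting::ToySetting) = algorithm(setting.X, setting.y)
```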

Best!

2 Likes

Hey guys,

I am happy to announce that the research paper for our Julia package LinRegOutliers has just been published in the Journal of Open Source Software.

Here is the citation info and link:

https://joss.theoj.org/papers/10.21105/joss.02892

Satman et al., (2021). LinRegOutliers: A Julia package for detecting outliers in linear regression. Journal of Open Source Software, 6(57), 2892, https://doi.org/10.21105/joss.02892

Do not forget to cite us! :smiling_face_with_three_hearts:

10 Likes

I generally use linear mixed models from the MixedModels.jl package. Can these methods be used for those cases as well?

1 Like

This package requires that the model at hand is of the form y = X\beta + \epsilon and that the estimation is based on least squares or maximum likelihood. The implemented methods take the model either as a @formula object or as X and y, where X is the design matrix and y is the response vector. As far as I know, and correct me if I am wrong, the main estimation process in mixed models is generally handled by the EM algorithm. I have never researched the literature on this topic, but there may be other outlier detection algorithms for such models.

3 Likes

“Outlier” is not a precisely defined term, and deterministic binary classification of whether something is an “outlier” is not a particularly robust method for inference. The advantage is, of course, that it is fast.

It is usually better to assume a mixture distribution (eg adding something with a fat tail), or start with something like a Student-t in the first place for the errors. This, of course, is better suited to Bayesian inference, or EM.

I have to clarify that I am definitely not an expert in statistics, so I might be missing some basic understanding of the issue. Anyway, the MixedModels package uses maximum likelihood to fit the model. Does this mean I can use your approach to detect outliers then?