[ANN] LinRegOutliers: a Julia package for detecting outliers in linear regression

Hi there,

This is an open invitation to the community. LinRegOutliers is a Julia package for detecting outliers in linear regression. In its current state, the package implements several of the pioneering algorithms from the literature. The scope of the package covers direct and indirect (robust) outlier detection methods in linear regression, as well as multivariate location and scale estimators.
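
A hedged usage sketch (createRegressionSetting, smr98, and the bundled hbk dataset reflect my reading of the repository README; check the docs for the authoritative names):

using LinRegOutliers

# Build a regression setting from a formula and the bundled
# Hawkins-Bradu-Kass (hbk) dataset, then run one of the detectors (smr98):
setting = createRegressionSetting(@formula(y ~ x1 + x2 + x3), hbk)
result = smr98(setting)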

Implementations of some of the still-missing methods have been assigned to our newly joined contributors, and there are more to pick up!

If you would like to contribute, you are welcome.

Please visit the LinRegOutliers GitHub repository.

15 Likes

This looks very promising. Just one question: the examples all use @formula(y ~ x1 + x2 + x3) etc. Is this required, or would (y,x) work? (where y is a vector and x a matrix)

5 Likes

I made a PR to GLM.jl with Cook’s Distance that I never felt the need to finish (shame on me), perhaps this is a better place for it to live if you don’t have it already.

https://github.com/JuliaStats/GLM.jl/pull/368

5 Likes

Not that this isn’t a good feature, but you know that you can generate formulas programmatically, right? If you have a data frame df, you could do

term("y") ~ sum(term.(names(df, Not("y")))

to get many covariates at the same time.
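
For the curious, a self-contained version of that trick (the data frame and column names here are made up):

using DataFrames, StatsModels, GLM

df = DataFrame(y = randn(10), x1 = randn(10), x2 = randn(10))
f = term("y") ~ sum(term.(names(df, Not("y"))))   # same as @formula(y ~ x1 + x2)
model = lm(f, df)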

2 Likes

Agreed, this works. It would still be nice if there were methods for (y,x). It’s as simple as it gets and often as good as the more complicated alternatives.

6 Likes

We have cooks() in src/diagnostics.jl as a tool in an outlier detection algorithm.
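
For reference, a generic sketch of Cook's distance for an OLS fit; this is the textbook formula, not necessarily what src/diagnostics.jl does internally:

using LinearAlgebra

function cooks_distance(X::Matrix, y::Vector)
    n, p = size(X)
    H = X * ((X' * X) \ X')         # hat matrix
    h = diag(H)                     # leverages
    e = y - H * y                   # residuals
    s2 = sum(abs2, e) / (n - p)     # residual variance estimate
    return (e .^ 2 ./ (p * s2)) .* (h ./ (1 .- h) .^ 2)
end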

1 Like

You are right: since the models are linear, a response vector and a design matrix are all that is required, so a tuple (y, X) is enough. However, constructing design matrices by hand takes time, as they are not always simple: dummy variables, varying slopes and intercepts, interactions, etc. A FormulaTerm built with @formula works well with lm() in GLM, and after all, this is the standard way of describing a regression model for both Julia GLM and R users.
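
To illustrate with a made-up data frame: a formula expands categorical predictors into dummy variables and generates interaction columns automatically, which is tedious to replicate in a hand-built X:

using DataFrames, GLM

df = DataFrame(y = randn(12), x = randn(12), g = repeat(["a", "b", "c"], 4))
# g is dummy-coded and the x & g interaction columns are created for us:
model = lm(@formula(y ~ x + g + x & g), df)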

This is true, but occasionally a (y, X) interface can be useful too.

If it already exists, perhaps making it part of the API would be nice.

3 Likes

Hi @Tamas_Papp

Thanks to Julia's multiple dispatch, we can have many methods for each algorithm.

@Paul_Soderlind opened a new issue for this, and I offered to add second definitions of all methods that accept data as a (y, X) tuple. He said he could help whenever he has time.

Thank you :slight_smile:

1 Like

BTW, most of what you import from GLM is actually re-exports of stuff from StatsModels.jl (everything except lm). And for StatsModels versions 0.6 or greater, the ModelFrame/ModelMatrix interface is discouraged in favor of doing something like

concrete_f = apply_schema(f, schema(f, data))
y, X = modelcols(concrete_f, data)

I don’t think it would be a huge deal to use that interface (and it might actually simplify some of the code) by storing concrete_f in the regression setting struct. You can get just the response or just the design matrix by calling modelcols(concrete_f.lhs, data) or modelcols(concrete_f.rhs, data), respectively.
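
Spelled out as a self-contained sketch (data and formula made up for illustration):

using StatsModels, DataFrames

data = DataFrame(y = randn(10), x1 = randn(10), x2 = randn(10))
f = @formula(y ~ x1 + x2)
concrete_f = apply_schema(f, schema(f, data))
y, X = modelcols(concrete_f, data)      # response and design matrix together
y2 = modelcols(concrete_f.lhs, data)    # just the response
X2 = modelcols(concrete_f.rhs, data)    # just the design matrix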

2 Likes

Okay, noted, thank you. If you want to join, you are welcome to.

3 Likes

Yeah, if I have the time I’ll put in a PR :slight_smile: In the meantime, the docs for StatsModels are (I hope) comprehensive enough to get started… (here’s the deep dive)

2 Likes

Okay, to be honest, I’m mostly focused on writing new methods for the package. I want the literature on this subject to be production-ready and complete in Julia. My friends, and people like you, can help with performance improvements, code optimization, aesthetic improvements, etc.

2 Likes

We now have an issue about the (y, X) interface (issue #14). Of course, the needs of the community must be taken into account.

1 Like

Hi folks!

(X, y)-style multiple dispatch is now implemented for all but a single method, and the next revision will include

algorithm(X::Matrix, y::Vector)

alongside the original version

algorithm(setting::RegressionSetting)

for all methods. FYI.
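
For anyone curious, this is plain multiple dispatch; a hypothetical sketch (algorithm, designMatrix, and responseVector are placeholder names, not necessarily the package's internals):

# The RegressionSetting method simply delegates to the (X, y) method:
function algorithm(setting::RegressionSetting)
    X = designMatrix(setting)       # hypothetical accessor
    y = responseVector(setting)     # hypothetical accessor
    return algorithm(X, y)
end

function algorithm(X::Matrix, y::Vector)
    # ... the actual outlier detection computation ...
end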

Best!

2 Likes

Hey guys,

I am happy to announce that the research paper for our Julia package LinRegOutliers has just been published in the Journal of Open Source Software.

Here is the citation info and link:

https://joss.theoj.org/papers/10.21105/joss.02892

Satman et al. (2021). LinRegOutliers: A Julia package for detecting outliers in linear regression. Journal of Open Source Software, 6(57), 2892.

Do not forget to cite us! :smiling_face_with_three_hearts:

10 Likes

I generally use linear mixed models from the MixedModels.jl package. Can these methods be used for those cases as well?

1 Like

This package requires that the model at hand be in the form y = X\beta + \epsilon and that estimation be based on least squares or maximum likelihood. Implemented methods take the model either as a @formula object or as X and y, where X is the design matrix and y is the response vector. As far as I know, and correct me if I am wrong, the main estimation process in mixed models is generally handled by the EM algorithm. I have not researched the literature on this topic, but there may be other outlier detection algorithms designed for such models.

3 Likes

“Outlier” is not a precisely defined term, and deterministic binary classification of whether something is an “outlier” is not a particularly robust method for inference. The advantage is, of course, that it is fast.

It is usually better to assume a mixture distribution (e.g. adding a component with a fat tail), or to start with something like Student-t errors in the first place. This, of course, is better suited to Bayesian inference or EM.
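
As a rough illustration of the Student-t error idea, a sketch assuming Turing.jl (the priors are arbitrary demonstration choices):

using Turing, LinearAlgebra

@model function robust_reg(X, y)
    p = size(X, 2)
    β ~ filldist(Normal(0, 10), p)          # coefficients
    σ ~ truncated(Cauchy(0, 2); lower=0)    # scale
    ν ~ Gamma(2, 10)                        # t degrees of freedom
    for i in eachindex(y)
        # fat-tailed errors absorb outlying observations
        y[i] ~ dot(view(X, i, :), β) + σ * TDist(ν)
    end
end

# chain = sample(robust_reg(X, y), NUTS(), 1_000)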

I have to clarify that I am definitely not an expert in statistics, so I might be missing some basic understanding of the issue. Anyway, the MixedModels package uses maximum likelihood to fit the formula. Does this mean I can use your approach to detect outliers, then?