[ANN] LinRegOutliers: a Julia package for detecting outliers in linear regression

Thanks for this package, which looks very useful. I'm interested in exploring possible outliers in a sample of a single random variable, rather than in a regression context. Would this be possible with the package? I have tried setting a formula where the dependent variable is regressed on a constant, and this seems to work in some cases but not others (see example below). Is this a reasonable way to check for outliers in a single random variable with the package? If so, which of the methods the package provides would you recommend for this use case? Thanks!

julia> using LinRegOutliers

julia> reg = createRegressionSetting(@formula(y ~ 1), hbk);

julia> hs93(reg)
Dict{Any, Any} with 3 entries:
  "outliers" => [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
  "t"        => -3.53755
  "d"        => [17.4311, 18.145, 18.502, 17.0741, 17.9665, 17.9665, 19.3944, 18.502, 17…

julia> smr98(reg)
ERROR: ArgumentError: Distance matrix should be symmetric.

When you set a formula using y ~ 1, the model y = constant + epsilon is estimated, and that is still a regression model; I think it is more convenient to use single-variable tools in that situation. smr98 is based on a cluster analysis of the standardized (yhat, residual) pairs, and I think the problem is that all of the yhat values are the same in your example (or it is something else I can't spot at first glance). Some of the other algorithms will work in the univariate case, but I wouldn't say this is a recommended way to use them.
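To see the degeneracy concretely, and as one possible single-variable alternative, here is a minimal sketch using a robust z-score based on the median and MAD. The 1.4826 scaling factor, the 3.5 cutoff, and the use of the y column of the bundled hbk data are illustrative choices, not something provided by LinRegOutliers:

using LinRegOutliers   # provides the hbk example data as a DataFrame
using Statistics       # mean, median

y = hbk[!, :y]

# Under y ~ 1 the OLS fit is just the sample mean, so every fitted value is
# identical and all of the variation ends up in the residuals; standardizing
# the yhat column then divides by a zero standard deviation, which is likely
# what trips up smr98 above.
yhat = fill(mean(y), length(y))
resids = y .- yhat

# Robust z-scores: distance from the median in units of the scaled MAD.
mad_y = 1.4826 * median(abs.(y .- median(y)))
robust_z = (y .- median(y)) ./ mad_y
outliers = findall(abs.(robust_z) .> 3.5)   # illustrative cutoff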

2 Likes

Thanks. I realized that I need to read more about the methods, rather than just using them without checking whether a method suits this use case.

In the two years since the first post, the package has received a great deal of attention. As a re-announcement, we are happy to release v0.8.16, whose latest addition is a Quantile Regression estimator (a short usage sketch follows the list below). Here is the current list of algorithms & estimators:

  • Ordinary Least Squares, Weighted Least Squares, Basic diagnostics
  • Hadi & Simonoff (1993)
  • Kianifard & Swallow (1989)
  • Sebert & Montgomery & Rollier (1998)
  • Least Median of Squares
  • Least Trimmed Squares
  • Minimum Volume Ellipsoid (MVE)
  • MVE & LTS Plot
  • Billor & Chatterjee & Hadi (2006)
  • Pena & Yohai (1995)
  • Satman (2013)
  • Satman (2015)
  • Setan & Halim & Mohd (2000)
  • Least Absolute Deviations (LAD)
  • Quantile Regression Parameter Estimation (quantileregression)
  • Least Trimmed Absolute Deviations (LTA)
  • Hadi (1992)
  • Marchette & Solka (2003) Data Images
  • Satman’s GA based LTS estimation (2012)
  • Fischler & Bolles (1981) RANSAC Algorithm
  • Minimum Covariance Determinant Estimator
  • Imon (2005) Algorithm
  • Barratt & Angeris & Boyd (2020) CCF algorithm
  • Atkinson (1994) Forward Search Algorithm
  • BACON Algorithm (Billor & Hadi & Velleman (2000))
  • Hadi (1994) Algorithm
  • Chatterjee & Mächler (1997)
  • Summary
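As a quick illustration of the newly added estimator, here is a minimal sketch of calling quantileregression on the bundled hbk data. The tau keyword for the target quantile and its default value are my assumptions about the signature, so please check the documentation for the exact interface:

using LinRegOutliers

# The standard three-predictor setting for the hbk data.
reg = createRegressionSetting(@formula(y ~ x1 + x2 + x3), hbk)

result = quantileregression(reg)              # assumed default: the median (tau = 0.5)
upper  = quantileregression(reg, tau = 0.9)   # assumed keyword for the 0.9 quantile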

Thank you to the Julia community for the attention!

6 Likes