[ANN] Generalized Linear Regression package

MLJLinearModels.jl has been around for some time (and is used a lot in the MLJ tutorials) but was never announced explicitly (laziness on my side in getting some form of documentation going).

NOTE: despite its name, the package is not tied to the MLJ universe. It was developed on the side and, of course, integrates well with MLJ, but you could very well use it by itself.

The package offers a unified way to solve minimization problems of the form

L(y, X\theta) + P(\theta)

where y is a response vector, X a feature matrix, L a loss function and P a penalty. A classical example is the lasso regression

\| y - X\theta \|_2^2 + \lambda \|\theta\|_1

which, in the package, is solved with

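using MLJLinearModels

n = 100                  # number of observations
p = 5                    # number of features
X = randn(n, p)          # feature matrix
y = randn(n)             # response vector
lasso = LassoRegression(lambda=0.5)
theta = fit(lasso, X, y) # fitted coefficients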
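To make the objective above concrete, here is a minimal sketch of a proximal gradient (ISTA) iteration for the lasso, written with the standard library only. It is an illustration of the problem being solved, not the package's own solver; the names `soft_threshold`, `lasso_objective` and `ista_lasso` are made up for this example, and the sketch assumes no intercept term and the objective exactly as written above.

```julia
using LinearAlgebra, Random

# Soft-thresholding: the proximal operator of t * ||.||_1 (hypothetical helper)
soft_threshold(z, t) = sign(z) * max(abs(z) - t, 0)

# Lasso objective as written above: ||y - Xθ||₂² + λ ||θ||₁
lasso_objective(X, y, θ, λ) = sum(abs2, y - X * θ) + λ * norm(θ, 1)

# Plain ISTA with a fixed step 1/L, where L = 2 * opnorm(X)^2 is the
# Lipschitz constant of the gradient of the smooth part ||y - Xθ||₂².
function ista_lasso(X, y, λ; iters = 500)
    θ = zeros(size(X, 2))
    L = 2 * opnorm(X)^2
    for _ in 1:iters
        grad = 2 * (X' * (X * θ - y))             # gradient of the smooth part
        θ = soft_threshold.(θ - grad / L, λ / L)  # gradient step, then prox step
    end
    return θ
end

Random.seed!(0)
n, p = 100, 5
X = randn(n, p)
y = randn(n)
λ = 0.5
θ_hat = ista_lasso(X, y, λ)
```

With a fixed step of 1/L the objective never increases from one iterate to the next, so `lasso_objective(X, y, θ_hat, λ)` is at most the value at the zero vector.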

Available models are:

  • (Regression) ridge, lasso, quantile, LAD, and several robust regressions (Huber, Talwar, …)
  • (Classification) logistic regression & multinomial regression

All of these allow l1, l2 and elastic net regularization. The package also offers the infrastructure to design your own model with custom loss and penalty functions.

The package is focused on solving the problem efficiently and does not offer things like encoding, imputation, PCA etc. For those, consider using the package within the MLJ universe.

The package leverages a couple of great packages: IterativeSolvers.jl for matrix-free CG and Optim.jl for LBFGS, Newton and Newton-CG.
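For readers unfamiliar with the term, "matrix-free" means the solver only needs matrix-vector products and never forms X'X. The sketch below is a hand-written conjugate gradient for the ridge normal equations (X'X + λI)θ = X'y using only the standard library. It does not use IterativeSolvers.jl and is not the package's code; `cg_solve` and `ridge_operator` are hypothetical names chosen for this example.

```julia
using LinearAlgebra, Random

# Conjugate gradient for A θ = b with A symmetric positive definite.
# A is only available through the function `apply_A` (v -> A * v).
function cg_solve(apply_A, b; tol = 1e-10, maxiter = 10 * length(b))
    θ = zero(b)
    r = copy(b)            # residual b - A*θ for the starting point θ = 0
    d = copy(r)            # current search direction
    rs = dot(r, r)
    for _ in 1:maxiter
        sqrt(rs) <= tol && break
        Ad = apply_A(d)
        α = rs / dot(d, Ad)               # exact line search along d
        θ .+= α .* d
        r .-= α .* Ad
        rs_new = dot(r, r)
        d .= r .+ (rs_new / rs) .* d      # next conjugate direction
        rs = rs_new
    end
    return θ
end

Random.seed!(0)
n, p = 100, 5
X = randn(n, p)
y = randn(n)
λ = 0.5

# Action of (X'X + λI) on a vector, computed as two matrix-vector products;
# the p x p matrix X'X is never built.
ridge_operator(v) = X' * (X * v) + λ * v
θ_cg = cg_solve(ridge_operator, X' * y)
```

The payoff of this design shows up when p is large: each iteration costs two products with X instead of a p x p factorization.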

See the docs for more information.

Comparison with other Julia packages

There are several Julia packages that offer some of the same functionality. Here’s a short overview of which one you might want to pick:

(*) I’m by no means disparaging those packages, they serve a specific purpose and do it very well, they typically compute more stuff as well.

Finally, there’s GLM.jl, of course, which is more geared towards statistical analysis for small/medium datasets. It lacks a few key things like regularisation, robust regressions etc, but it offers a lot more in terms of statistical pointers (which MLJLinearModels does not, by design).

Comparison with external packages

This package has been tested a fair bit against scikit-learn for the relevant models and against the R package quantreg for quantile regression. It is typically comparable or better in terms of speed and accuracy (I’ll add benchmarks in the future). Of course, this should be taken with a pinch of salt given that these packages are super mature and probably work in a much wider range of settings.

Help!

If you’re interested in that kind of stuff, there’s a lot to be done (and I’m a bit stretched). For instance, at the moment I have not coded the kind of solvers you would want for fat data, or specific solvers for L2-CV. If you’re interested, please get in touch; I’ll be happy to help you get up to speed with the package etc.
Help with documentation / benchmarking / testing / etc is very welcome. If you know of specific models that are missing and could be added (ideally with a working implementation), that would be great. (I worked for a while on ADMM solvers before coming to the conclusion that they typically suck, but if you have examples / implementations that would convince me otherwise, I’m interested.)

PS: thanks to @Albert_Zevelev and @samuel_okon for testing this package early on and their valuable feedback!
