[ANN] Generalized Linear Regression package

MLJLinearModels.jl has already been around for some time (and is used a lot in the MLJ tutorials) but was never announced explicitly (laziness on my part about getting some form of documentation going).

NOTE: despite its name, the package is not tied to the MLJ universe; it was just developed on the side. It integrates well with MLJ, of course, but you can perfectly well use it on its own.

The package offers a unified way to solve minimization problems of the form

L(y, X\theta) + P(\theta)

where y is a response vector, X a feature matrix, L a loss function, and P a penalty. A classical example is lasso regression:

\| y - X\theta \|_2^2 + \lambda \|\theta\|_1

which, in the package, is solved with:

using MLJLinearModels

n = 100
p = 5
X = randn(n, p)
y = randn(n)

lasso = LassoRegression(lambda=0.5)
theta = fit(lasso, X, y)   # with the default fit_intercept=true, theta has p+1 entries

Available models are:

  • (Regression) ridge, lasso, quantile, LAD, and a number of robust regressions (Huber, Talwar, …)
  • (Classification) logistic regression & multinomial regression

All of these support l1, l2 and elastic net regularization, and the package also offers the infrastructure to design your own model with custom loss and penalty functions.
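
For example, sticking to the ready-made constructors (a rough sketch: the keyword names mirror the lambda= pattern above, but check the docs for the exact parametrisation of each model and, for the logistic case, the expected label encoding):

using MLJLinearModels

n, p = 100, 5
X = randn(n, p)

# robust regression with a penalty
huber = HuberRegression(lambda=0.1)
theta_huber = fit(huber, X, randn(n))

# elastic net combines an l2 and an l1 term (see the docs for which norm
# lambda and gamma each scale)
enet = ElasticNetRegression(lambda=0.1, gamma=0.3)
theta_enet = fit(enet, X, randn(n))

# penalised logistic regression; the direct (non-MLJ) API expects labels in {-1, 1}
logreg = LogisticRegression(lambda=1.0)
theta_logreg = fit(logreg, X, rand([-1, 1], n))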

The package is focused on solving the problem efficiently and does not offer things like encoding, imputation, PCA, etc.; for those, consider using it from within the MLJ universe (see the sketch below).
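
Through MLJ the same models show up as LogisticClassifier, LassoRegressor, etc. (as far as I recall); a minimal sketch of the usual MLJ workflow, assuming a table X and a categorical target y:

using MLJ

# load the MLJLinearModels implementation registered with MLJ
LogisticClassifier = @load LogisticClassifier pkg=MLJLinearModels

X, y = @load_iris                      # any Tables.jl table + categorical target will do
mach = machine(LogisticClassifier(), X, y)
fit!(mach)
yhat = predict_mode(mach, X)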

The package leverages a couple of great packages: IterativeSolvers.jl for matrix-free CG and Optim.jl for LBFGS, Newton and Newton-CG.
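
The solver can also be passed explicitly to fit; a minimal sketch (solver names such as LBFGS, NewtonCG and FISTA are the ones the package exports as far as I recall; see the docs for the full list and the default picked per model):

using MLJLinearModels

n, p = 100, 5
X = randn(n, p)

# smooth problems: quasi-Newton / Newton-type solvers backed by Optim.jl
logreg = LogisticRegression(lambda=1.0)
y_clf  = rand([-1, 1], n)
theta_lbfgs  = fit(logreg, X, y_clf; solver=LBFGS())
theta_newton = fit(logreg, X, y_clf; solver=NewtonCG())

# non-smooth penalties (e.g. lasso): proximal gradient methods
lasso = LassoRegression(lambda=0.5)
y_reg = randn(n)
theta_fista = fit(lasso, X, y_reg; solver=FISTA())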

See the docs for more information.

Comparison with other Julia packages

There are a bunch of Julia packages that offer some of this functionality; here's a short overview of which one you might want to pick:

(*) I'm by no means disparaging those packages; they serve a specific purpose and do it very well, and they typically compute more things as well.

Finally, there's of course GLM.jl, which is more geared towards statistical analysis of small/medium datasets. It lacks a few key things like regularisation, robust regressions, etc., but it offers a lot more in terms of statistical diagnostics (which MLJLinearModels does not, by design).

Comparison with external packages

This package has been tested a fair bit against scikit-learn for the relevant models, and against the R package quantreg for quantile regression. It is typically comparable or better in terms of speed and accuracy (I'll add benchmarks in the future). Of course this should be taken with a pinch of salt given that those packages are super mature and probably work in a much wider range of settings.
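
If you want to reproduce that kind of comparison yourself, a rough sketch of the idea via ScikitLearn.jl (note that scikit-learn's C is the inverse of the penalty strength, so the two penalties below are not strictly equivalent; treat this as illustrative rather than a rigorous benchmark):

import MLJLinearModels
using ScikitLearn, BenchmarkTools
@sk_import linear_model: LogisticRegression   # scikit-learn's implementation

n, p = 10_000, 50
X = randn(n, p)
y = rand([-1, 1], n)

# MLJLinearModels, direct API (labels in {-1, 1})
mlj_model = MLJLinearModels.LogisticRegression(lambda=1.0)
@btime MLJLinearModels.fit($mlj_model, $X, $y)

# scikit-learn via ScikitLearn.jl
skl_model = LogisticRegression(C=1.0)
@btime ScikitLearn.fit!($skl_model, $X, $y)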

Help!

If you're interested in this kind of thing, there's a lot to be done (and I'm a bit stretched). For instance, at the moment I have not coded the kind of solvers you would want for fat data, or specific solvers for L2-CV. If you're interested, please get in touch and I'll be happy to help you get up to speed with the package.
Help with documentation / benchmarking / testing / etc. is very welcome. If you know of specific models that are missing and could be added (ideally with a working implementation), that would be great. (I worked for a while on ADMM solvers before coming to the conclusion that they typically suck, but if you have examples / implementations that would convince me otherwise, I'm interested.)

PS: thanks to @Albert_Zevelev and @samuel_okon for testing this package early on and their valuable feedback!


First of all, thanks for the great work in putting this package together. I'm particularly interested in multinomial regression, which I'm testing as an alternative to a multi-class LDA approach. I did find some interesting differences in performance, though, as the following Pluto notebook (hopefully) demonstrates:

### A Pluto.jl notebook ###
# v0.17.7

using Markdown
using InteractiveUtils

# ╔═╡ f2e4564e-7fed-11ec-33b9-e96d36958b87
begin
	using LinearAlgebra
	using MultivariateStats
	using MLJLinearModels
	using Random
	using Statistics
end

# ╔═╡ b923c208-aa3c-46bf-9e76-5f60f453a165
"""
Function to generate all vertices of a simplex in `n` dimensions.
"""
function get_simplex_vertices(n)
    q,r = qr(fill(1.0, n+1))
    points = permutedims(q[:,2:end],[2,1])
    #rescale so that the distance is 1
    points ./= sqrt(2)
end

# ╔═╡ 79d2e3f0-f2bd-4341-8164-495d5a003ca9
"""
Train a multiclass LDA to classify `X` according to the labels in `Y`.
"""
function train_decoder(decoder::Type{MultivariateStats.MulticlassLDA}, X::Matrix{T}, Y::Vector{Int64}) where T <: Real
	ncats = maximum(Y)
	ntrials,nvars = size(X)
	Xp = permutedims(X,[2,1])
	lda = MultivariateStats.fit(MultivariateStats.MulticlassLDA, ncats, Xp, Y)
	Xpp = MultivariateStats.transform(lda, Xp)
	pp = fill(0.0, ncats, ntrials)
	for k in 1:ntrials
		for i in 1:ncats
			d = sum(abs2, Xpp[:,k] - lda.pmeans[:,i])
			pp[i,k] = exp(-d)
		end
	end
	lda, permutedims(pp, [2,1])
end

# ╔═╡ bb90441e-d07b-4b7a-a35c-66ece626b460
"""
Fit a multinomial regression model to classify `X` according to the labels in `Y`.
"""
function train_decoder(decoder::MLJLinearModels.GeneralizedLinearRegression{MLJLinearModels.MultinomialLoss{0}, MLJLinearModels.ScaledPenalty{MLJLinearModels.L2Penalty}}, X::Matrix{T}, Y::Vector{Int64}) where T <: Real
    ncats = maximum(Y)
    ntrials,nvars = size(X)
    m = MLJLinearModels.fit(decoder, X, Y)
    W = reshape(m, nvars+1, ncats)

    # compute performance
    pp = [X fill(1.0, size(X,1))]*W
    W, pp
end

# ╔═╡ 327d08b2-7974-4ed0-87c2-59185ce74cbd
md"""
First we create a very simple example where 3 classes are encoded by the matrix `X` in 126 dimensions.
"""

# ╔═╡ 38b45176-ca11-4722-aa6e-23b6754951d5
X, Y = let
	ntrials = 512
	nvars = 126
	vv = get_simplex_vertices(nvars)
	X = randn(ntrials, nvars)
	μ = fill(0.0, 3, nvars)
	μ[1,:] .= vv[:,1]
	μ[2,:] .= vv[:,2]
	μ[3,:] .= vv[:,3]
	Y = rand(1:3, ntrials)
	for i in 1:ntrials
		X[i,:] .= μ[Y[i], :] .+ 0.3*randn(nvars)
	end
	X, Y
end;

# ╔═╡ 082ef646-2047-4f89-ae44-61defdcb35d4
md"""
We train both a multiclass LDA and a multinomial regression model to classify the data in the example.
"""
# ╔═╡ 5670e758-5828-46ee-bdbb-4563255b7c9e
_, pp1 = train_decoder(MultivariateStats.MulticlassLDA, X, Y);

# ╔═╡ a6d77b77-4b63-4196-aa9a-020c20573217


# ╔═╡ 8fc8adec-d5c1-4cb5-81fb-19ef412bbb3d
_, pp2 = train_decoder(MLJLinearModels.MultinomialRegression(0.5), X, Y);

# ╔═╡ 723fc37b-7a1b-4ccd-b6b4-c6dea04394a7
md"""
Training the multiclass LDA is about 3000 times faster than fitting a regression model (2.3 ms vs 8.2s)
"""

# ╔═╡ abc01219-ea9a-49c4-a30b-b088620761af
let
	ntrials = size(X,1)
	perf1 = mean([argmax(pp1[i,:]).==Y[i] for i in 1:ntrials])
	perf2 = mean([argmax(pp2[i,:]).==Y[i] for i in 1:ntrials])
	perf1, perf2
end

# ╔═╡ 548b94b3-edff-480f-813c-8864e00fd700
md"""
Both approaches classify the data at about 90% accuracy.
"""

I'm looking to run thousands of iterations where I train/fit models to shuffled versions of my data, so I was hoping that something could be done to improve the performance of fitting the multinomial regression models.
Thanks again!

I'm not sure what your question is: you're comparing two models and one trains faster than the other? That doesn't mean the second one is not performant. Also, if you find LDA to be faster and just as accurate in your case, why not just use that?

MNR uses an iterative optimiser to get the solution (which LDA does not, at least not explicitly), meaning that, especially in higher dimensions, I'm not surprised it would be quite a bit more costly.

Two things you could try if you do want to use MNR:

  1. use ScikitLearn.jl's multiclass LogisticRegression and see if you get better timings (if you also see a factor of ~3000 then that means there's an issue with MLJLM and I'll dig into it, but I would expect you won't get that; maybe Sklearn will be a bit faster because it uses more relaxed stopping criteria on the objective, but I'd be fairly surprised if it was 3 orders of magnitude faster)
  2. use dimensionality reduction before you apply MNR, e.g. along the lines of the sketch below
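
For the second option, a rough sketch of what I mean, reusing the X, Y from your notebook (the maxoutdim value is arbitrary, and I'm using the same MultivariateStats calls as your LDA decoder):

using MultivariateStats, MLJLinearModels

# MultivariateStats expects variables × observations
Xp = permutedims(X, [2, 1])

# project onto a handful of principal components before fitting the MNR
pca = MultivariateStats.fit(PCA, Xp; maxoutdim=10)
Xr  = permutedims(MultivariateStats.transform(pca, Xp), [2, 1])

ncats = maximum(Y)
mnr   = MLJLinearModels.MultinomialRegression(0.5)
theta = MLJLinearModels.fit(mnr, Xr, Y)
W     = reshape(theta, size(Xr, 2) + 1, ncats)   # same layout as in your train_decoder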

A final point: you use regularised MNR here with the penalty parameter set to 0.5; did you just guess that value or is it principled? If it's not principled, you might try setting it to zero and see if that helps; if your matrix is not ill-conditioned (and it doesn't look like it is), it won't make much difference in terms of accuracy.

Thanks for getting back to me on this. Firstly, the example I provided was more to illustrate the very different times taken to “train” both models, and I made sure that they both gave the correct answer. I also wanted to explore regularisation because my actual data is not as well-conditioned as in the example I gave (and yes, I did just guess the regularisation parameter; I should probably justify that choice a bit better).

I’ll check out Sklearn as you suggest and see if that speeds things up a bit.

That makes sense. If you do experience significantly different times, please open an issue at the MLJLinearModels repo and we'll try to get to the bottom of it.

The code used by SKL and MLJLM is very similar (though not identical), so I would expect similar timings, within roughly a factor of 10 (SKL tends to stop the optimisation earlier than MLJLM). But if you do see 2 or 3 orders of magnitude of difference, then there's an issue we need to fix.