How to write an objective function that has both a LASSO and a Ridge regularization term in JuMP

hello everyone,
I am trying to use JuMP to minimize an objective function: essentially logistic regression with both LASSO and Ridge regularization terms. Here is the setting:
I have a matrix of inputs, X, having n rows and m columns. So there are n training examples, each with m features. Correspondingly, I have a vector y of length n, which are the true labels for the training examples.

Defining phi(i) = 1/(1 + exp(-Xi*theta)), where Xi is the i-th row of the matrix X, and theta is the vector of parameters, I have the following objective function to minimize over all theta:

L(theta) = -sum_i( y[i]*log(phi(i)) + (1 - y[i])*log(1 - phi(i)) ) + c1*norm(theta, 1) + c2*norm(theta, 2)

Here c1 and c2 are constants.
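For reference, here is a plain-Julia sketch of this objective (no JuMP), which can be handy for sanity-checking whatever an optimizer returns; the names X, y, c1, c2 are as above:

```julia
sigmoid(z) = 1 / (1 + exp(-z))

# L(θ): negative log-likelihood plus the L1 and L2 penalty terms defined above.
function L(θ, X, y; c1 = 1.0, c2 = 1.0)
    ϕ = sigmoid.(X * θ)
    nll = -sum(@. y * log(ϕ) + (1 - y) * log(1 - ϕ))
    return nll + c1 * sum(abs, θ) + c2 * sqrt(sum(abs2, θ))
end
```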

I can optimize over the individual terms of L(theta), after converting the norm minimizations to linear programs with constraints, but I cannot seem to come up with the syntax to write the whole of L(theta) in JuMP’s @objective(model, Min, <some function>).

I would greatly appreciate any help as to how to write this in JuMP. I am open to any other optimization library / wrapper too if it is too complicated to do in JuMP.

Untested, but I think you want something like

using JuMP, Ipopt

function regression(X::Matrix, y::Vector; c1 = 1.0, c2 = 1.0)
    (m, n) = size(X)
    @assert length(y) == m
    model = Model(Ipopt.Optimizer)
    @variable(model, θ[1:n], start = 1.0)
    # Epigraph variables θ_abs[j] >= |θ[j]| linearize the L1 term for Ipopt
    @variable(model, θ_abs[1:n] >= 0)
    @constraint(model, [j = 1:n], θ_abs[j] >= θ[j])
    @constraint(model, [j = 1:n], θ_abs[j] >= -θ[j])
    @NLexpression(
        model, ϕ[i = 1:m], 1 / (1 + exp(sum(-X[i, j] * θ[j] for j = 1:n)))
    )
    @NLobjective(
        model,
        Min,
        -sum(y[i] * log(ϕ[i]) + (1 - y[i]) * log(1 - ϕ[i]) for i = 1:m) +
        c1 * sum(θ_abs[j] for j = 1:n) / m +
        c2 * sqrt(sum(θ[j]^2 for j = 1:n)),
    )
    optimize!(model)
    @assert termination_status(model) == MOI.LOCALLY_SOLVED
    return value.(θ)
end

wow thanks a ton. Let me try it out :slight_smile:

Note that logistic + elastic net (what you’re doing) is implemented in MLJLinearModels, in case you want to compare it with what you get from JuMP :slight_smile:


thanks so much for pointing it out. I did a quick Google search, hoping to use an available library rather than write buggy code myself. The code @odow very kindly wrote works to an extent, but it terminates with the error “Invalid number in NLP function or derivative detected”. That means an infinity of some sort appears somewhere (most likely a divide by zero), but it is hard to debug.

While a quick Google didn’t return any example, would you have some sample code for the implementation of this problem (or something similar) using MLJLinearModels?

Basically you create the model then fit it:

en = LogisticRegression(a, b; penalty=:en)
fit(en, X, y)

where a is the coefficient of the L1 penalty and b of the L2 penalty. These are unscaled (unlike sklearn), so you have to pay attention to that if you want to compare with sklearn (check the logistic tests in the code to see comparisons with sklearn, if you want).

X, y are arrays of floats.

If you want something that works with data frames etc then use MLJ calling MLJLinearModels


wow! That’s quite simple. Is the penalty parameter set to :en just an example, so I’d need to figure out what the penalty should be, or should I also set it to :en?

Well, you said elastic net? (:en is for that.) If you want just L2, write :l2, same for L1; then it takes only one parameter, which is the coefficient for that penalty term.

By the way in my experience Elastic Net usually sucks and L1 is usually what we want but your mileage may vary

thank you very much. The penalty indeed has both LASSO and Ridge terms (I didn’t know that it had a name, elastic net :slight_smile: )


I tried looking at the documentation for MLJLinearModels at its GitHub repository here but couldn’t find much. Can you please point me to its documentation, so I can be more aware of it and use it better? Thanks

What is your question? If you’re looking for official docs, there’s a stub I wrote, but it’s probably not going to be of much help.

However in your case my previous answer should be all you need: specify a :en penalty and two coefficients (one for the l1 and one for l2) and then fit.

If an error is thrown, open an issue on the GitHub repo with an example where it caused you trouble and I’ll help you there :slight_smile:

I see. Thanks.
The question was to be certain that the function I am trying to optimize is indeed the one optimized by this syntax. Second question: if I want to optimize different objectives, how do I set those up…

The elastic net penalty implemented in the package is

A * ||.||_2 + B * ||.||_1

where the first norm is L2 and the second is L1, no scaling.

For other objectives it’s the same concept, there’s a bunch of penalties you can use (Huber, Fair etc)

I know this is probably not as documented as you’d like, but if you look at the tests in the code there are examples for pretty much all the use cases it can handle; if you struggle with a specific one, ask on GitHub and I should be reasonably quick to answer.

thank you :slight_smile:

Also maybe have a look at the constructors, which are all documented and cover the standard use cases.

So I have coded up everything, and here is a setting:
I have

  • a 10,000-by-7 matrix, X
  • a 10,000-by-1 vector, y

When I use the code

model = fit(LogisticRegression(λ, α; penalty=:en), X, y)

for starters, I set both λ and α to 0. The return is a vector of 8 elements, instead of 7. What is it returning?

nice, the 8th term is the intercept; you can turn this off by setting fit_intercept to false

So to be clearer the full equation is

y = \sigma(X\beta + \gamma)

where \gamma is the intercept, \beta are the coefficients for each of your 7 parameters, \sigma is the sigmoid

(the last element in the returned vector is the intercept)
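In code, that prediction can be sketched like this (assuming θ is the 8-element vector returned by fit, with the intercept as its last entry):

```julia
sigmoid(z) = 1 / (1 + exp(-z))

# ŷ = σ(Xβ + γ): split the fitted vector into coefficients β and intercept γ.
predict_proba(θ, X) = sigmoid.(X * θ[1:end-1] .+ θ[end])
```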

Thanks a lot. The program finishes with the following warning: “Proximal GD did not converge in 1000 iterations”. Is there a way to raise the number of maximum iterations?

Yes, but you can also broadly ignore these warnings (there’s an open issue suggesting I should basically hide them).

Note that when you use L1 (which is part of elastic net), setting the scale right is important (this is what it’s complaining about). If you call MLJLM from within MLJ, then you have access to hyperparameter tuning to select these parameters in a principled way.

For instance, in scikit-learn the parameter is divided by 2n, where n is the number of rows in X; so with your 10,000 rows, if you set it to 1 the parameter actually passed is 5e-5. In MLJLM there is no automatic scaling, so you have to do this yourself :slight_smile:
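As a one-line sketch of that conversion (the helper name is made up, not part of either library):

```julia
# Hypothetical helper: convert an sklearn-scaled penalty `a` to the unscaled
# value MLJLinearModels expects, where n is the number of rows in X.
sklearn_to_mljlm(a, n) = a / (2n)

sklearn_to_mljlm(1, 10_000)  # == 5e-5
```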

I see. Still, the actual values of the parameter vector I used to generate this data (values of y given X) were 0.5*ones(7), while the vector returned after the algorithm finishes (1000 iterations) is
8-element Array{Float64,1}:

So, ignoring the intercept, the values of the parameters are not so close to the original ones. Maybe running the algorithm for longer might get it to converge nearer to the actual values of the parameters?