Huber regression using JuMP and MOSEK

Dear JuMP Community,

I am solving a Huber regression problem, using the following second-order conic reformulation of the problem (from the MOSEK tutorials):
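\begin{aligned}
\min_{\beta,\,u,\,v}\quad & \|u\|_2^2 + 2\sigma \sum_{i=1}^{N} v_i \\
\text{s.t.}\quad & -(u+v) \le X\beta - y \le u+v,\\
& 0 \le u \le \sigma\mathbf{1},\qquad v \ge 0,
\end{aligned}

where \sigma > 0 is the Huber threshold (written M in the MOSEK cookbook), and the quadratic term \|u\|_2^2 is then handled with an auxiliary variable t and a second-order cone constraint \|u\|_2^2 \le t.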

My implementation is pretty straightforward:

using JuMP, MosekTools

function huber_regression(X, y; sigma = 1.0)
    N, n = size(X)
    model = Model() #start the model
    set_optimizer(model, MosekTools.Optimizer) #call MOSEK
    @variables(model, begin
        u[1:N] >= 0.0
        v[1:N] >= 0.0
        beta_0
        beta[1:n]
        t #aux for conic constraint
    end)
    #define constraints
    @constraints(model, begin
        -(u + v) .<= beta_0 .+ X*beta .- y
        (u + v) .>= beta_0 .+ X*beta .- y
        u .<= sigma
        [(1+t)/2; (1-t)/2; u] in SecondOrderCone() #models sum(u.^2) <= t
    end)
    @objective(model, Min, (t + 2*sigma*sum(v)))
    JuMP.optimize!(model)
    status, time, obj = termination_status(model), JuMP.solve_time(model), objective_value(model)
    return status, obj, time, value.(model[:beta_0]), value.(model[:beta])
end
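(The last constraint models \|u\|_2^2 \le t, via the identity \|u\|_2^2 \le t \iff \big\|\big(\tfrac{1-t}{2},\, u\big)\big\|_2 \le \tfrac{1+t}{2}.)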

However, I unfortunately get the status “SLOW_PROGRESS” quite often. The data is not ill-conditioned, and this status keeps appearing on both simulated and real-life data. Is there a problem with my implementation? It returns the true objective value (I verified this), but maybe there is something I can do to improve efficiency.

Thank you for your time!

I don’t understand the beta_0 variable in comparison to your math formulation. You could also use a rotated second-order cone instead of the SOC, and you can move sigma to a variable bound.

What happens if you try:

function huber_regression(X, y; sigma = 1.0)
    N, n = size(X)
    model = Model(MosekTools.Optimizer)
    @variables(model, begin
        0 <= u[1:N] <= sigma
        v[1:N] >= 0
        beta_0
        beta[1:n]
        t >= 0
    end)
    @expression(model, residuals, beta_0 .+ X * beta .- y)
    @constraints(model, begin
        -(u + v) .<= residuals
        u + v .>= residuals
        vcat(t, 0.5, u) in RotatedSecondOrderCone() 
    end)
    @objective(model, Min, t + 2 * sigma * sum(v))
    JuMP.optimize!(model)
    status, time, obj = termination_status(model), solve_time(model), objective_value(model)
    return status, obj, time, value(beta_0), value.(beta)
end

Do you have a log of the Mosek solve? Is it printing any warnings?


You may want to compare with the “Huber fitting” example in the OSQP documentation; there is a Julia version in the GitHub repo PaulSoderlind/OSQPExamplesJulia (around line 95).
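For reference, a minimal sketch (not the linked example; the function name is just illustrative) of the same formulation solved as a QP with OSQP through JuMP, where no conic reformulation is needed because OSQP takes the quadratic objective directly:

using JuMP, OSQP

function huber_regression_osqp(X, y; sigma = 1.0)
    N, n = size(X)
    model = Model(OSQP.Optimizer)
    set_silent(model)
    @variables(model, begin
        0 <= u[1:N] <= sigma
        v[1:N] >= 0
        beta_0
        beta[1:n]
    end)
    @expression(model, residuals, beta_0 .+ X * beta .- y)
    @constraints(model, begin
        -(u + v) .<= residuals
        u + v .>= residuals
    end)
    @objective(model, Min, u' * u + 2 * sigma * sum(v)) #quadratic objective, no auxiliary cone needed
    optimize!(model)
    return termination_status(model), objective_value(model), value(beta_0), value.(beta)
end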


Thank you very much for your reply! beta_0 is just an intercept.

  • I was initially constraining sigma in the variable definition. It didn’t help, so I tried moving it to the constraints.
  • I tried dual_optimizer() and I still get the slow-progress status. Similarly, I tried set_attribute(model, "INTPNT_SOLVE_FORM", 0/1/2), and all of them resulted in “SLOW_PROGRESS”.
  • I originally used SecondOrderCone instead of the rotated second-order cone, but I have now tried the rotated one and I still get the “SLOW_PROGRESS” warning.
  • The log is here:

Problem
Name :
Objective sense : minimize
Type : CONIC (conic optimization problem)
Constraints : 24000
Affine conic cons. : 1
Disjunctive cons. : 0
Cones : 0
Scalar variables : 16232
Matrix variables : 0
Integer variables : 0

Optimizer started.
Presolve started.
Linear dependency checker started.
Linear dependency checker terminated.
Eliminator started.
Freed constraints in eliminator : 0
Eliminator terminated.
Eliminator - tries : 1 time : 0.00
Lin. dep. - tries : 1 time : 0.01
Lin. dep. - number : 0
Presolve terminated. Time: 0.22
Problem
Name :
Objective sense : minimize
Type : CONIC (conic optimization problem)
Constraints : 24000
Affine conic cons. : 1
Disjunctive cons. : 0
Cones : 0
Scalar variables : 16232
Matrix variables : 0
Integer variables : 0

Optimizer - threads : 16
Optimizer - solved problem : the primal
Optimizer - Constraints : 16000
Optimizer - Cones : 2
Optimizer - Scalar variables : 32234 conic : 8234
Optimizer - Semi-definite variables: 0 scalarized : 0
Factor - setup time : 0.37 dense det. time : 0.00
Factor - ML order time : 0.21 GP order time : 0.00
Factor - nonzeros before factor : 1.99e+06 after factor : 1.99e+06
Factor - dense dim. : 235 flops : 2.52e+08
ITE PFEAS DFEAS GFEAS PRSTATUS POBJ DOBJ MU TIME
0 3.0e+01 1.0e+01 2.1e+00 0.00e+00 7.071067812e-01 -3.535533906e-01 1.0e+00 0.72
1 2.4e+01 8.1e+00 1.8e+00 -7.36e-01 3.496796289e+03 3.495930137e+03 8.1e-01 0.79
2 9.1e+00 3.1e+00 8.3e-01 -6.60e-01 2.388762684e+04 2.388779469e+04 3.1e-01 0.87
3 5.2e+00 1.7e+00 4.8e-01 -3.31e-01 3.887451222e+04 3.887506579e+04 1.7e-01 0.94
4 3.1e+00 1.1e+00 3.1e-01 -2.10e-01 8.264122109e+04 8.264213991e+04 1.1e-01 1.00
5 4.7e-01 1.6e-01 3.6e-02 5.43e-02 1.402486415e+05 1.402495708e+05 1.6e-02 1.07
6 1.8e-01 6.2e-02 9.0e-03 7.35e-01 1.563431755e+05 1.563435487e+05 6.2e-03 1.15
7 8.8e-02 3.0e-02 3.1e-03 7.70e-01 1.637037993e+05 1.637039920e+05 3.0e-03 1.22
8 2.4e-02 8.0e-03 4.4e-04 9.11e-01 1.675812845e+05 1.675813396e+05 8.0e-04 1.30
9 2.0e-03 6.8e-04 1.2e-05 9.46e-01 1.694074747e+05 1.694074798e+05 6.8e-05 1.38
10 7.4e-04 2.5e-04 2.8e-06 9.72e-01 1.695111212e+05 1.695111233e+05 2.6e-05 1.44
11 1.7e-04 5.7e-05 2.4e-07 9.89e-01 1.695589285e+05 1.695589290e+05 5.9e-06 1.51
12 2.3e-05 7.8e-06 1.7e-08 1.00e+00 1.695711798e+05 1.695711799e+05 8.1e-07 1.59
13 2.0e-05 6.7e-06 1.3e-08 1.00e+00 1.695714524e+05 1.695714525e+05 6.9e-07 1.71
14 1.4e-05 2.3e-05 8.8e-09 9.99e-01 1.695719605e+05 1.695719606e+05 4.8e-07 1.83
15 8.9e-06 6.9e-06 5.6e-09 1.00e+00 1.695723436e+05 1.695723437e+05 3.1e-07 1.95
16 2.4e-06 3.3e-05 4.4e-10 1.00e+00 1.695728785e+05 1.695728785e+05 8.3e-08 2.00
17 2.2e-06 4.9e-05 3.0e-10 1.00e+00 1.695728945e+05 1.695728945e+05 7.6e-08 2.12
18 2.2e-06 4.9e-05 5.5e-11 1.00e+00 1.695728945e+05 1.695728945e+05 7.6e-08 2.26
19 2.2e-06 4.9e-05 3.5e-12 1.00e+00 1.695728947e+05 1.695728948e+05 7.6e-08 2.39
20 2.2e-06 4.9e-05 3.5e-12 1.00e+00 1.695728947e+05 1.695728948e+05 7.6e-08 2.55
21 2.2e-06 4.9e-05 3.5e-12 1.00e+00 1.695728947e+05 1.695728948e+05 7.6e-08 2.71
22 2.2e-06 4.9e-05 1.8e-11 1.00e+00 1.695728952e+05 1.695728952e+05 7.6e-08 2.83
23 2.2e-06 4.9e-05 1.8e-11 1.00e+00 1.695728952e+05 1.695728952e+05 7.6e-08 2.99
24 2.2e-06 4.9e-05 1.8e-11 1.00e+00 1.695728952e+05 1.695728952e+05 7.6e-08 3.17
Optimizer terminated. Time: 3.40

status: SLOW_PROGRESS
solved in time (s): 3.4
objective value: 169572.895

Welcome, @selvi-aras!

Could you please provide an MWE, i.e. one that generates some representative test data, X and y, and then runs the optimization?

If so, I will try it on my robust, non-linear least squares optimizer, using the original robustified least squares formulation, rather than the second-order conic reformulation.

That looks very nice, @Paul_Soderlind

Thank you for your interest in this problem. My original problem was way more complicated, but I simplified everything to an MWE. Here it is:

## Read the related files and import the necessary packages
using JuMP, Random, InvertedIndices, Gurobi, Mosek, MosekTools
function synthetic_data(N, n) #simulate data
    X_raw = randn(N,n) #the numerical data
    X_raw = (X_raw .- minimum(X_raw, dims=1)) ./ (maximum(X_raw, dims=1) .- minimum(X_raw, dims=1)) #min/max standardize
    #generate the labels
    y_raw = X_raw*randn(n) + randn(N).*5 .+ 1.0 #error added at the end, followed by an intercept of "1"
    return X_raw, y_raw
end

function huber_regression_mwe(X, y; solver = "MOSEK", sigma = 1.0, summarize = 0) #huber regression
    N, n = size(X)
    model = Model() #start the model
    @variables(model, begin #joint variables
        0 <= u[1:N] <= sigma
        v[1:N] >= 0.0
        beta_0
        beta[1:n]
    end)
    @expression(model, residuals, beta_0 .+ X*beta .- y)
    @constraints(model, begin
        -(u + v) .<= residuals
        (u + v) .>= residuals
        u .<= sigma #redundant with the variable bound on u above
    end)
    if solver == "MOSEK"
        set_optimizer(model, MosekTools.Optimizer)
        @variable(model, t)
        @constraint(model, vcat(t, 0.5, u) in RotatedSecondOrderCone())
        @objective(model, Min, (t + 2*sigma*sum(v)))
    elseif solver == "Gurobi"
        set_optimizer(model, Gurobi.Optimizer)
        @objective(model, Min, (u'u + 2*sigma*sum(v)))
    end
    set_silent(model)
    JuMP.optimize!(model)
    status, time, obj = termination_status(model), JuMP.solve_time(model), objective_value(model)
    if summarize == 1
        println("status: ", status)
        println("solved in time (s): ", round(time, digits = 3))
        println("objective value: ", round(obj, digits=3))
    end
    #return solution
    return status, obj, time, value.(model[:beta_0]), value.(model[:beta])
end

### MWE
Random.seed!(1); 
N, n = 1000, 10; #size
X_raw, y_raw = synthetic_data(N, n); #simulate data
sigma = 6.0 ##sigma --> CV-later
println("MOSEK: ")
status, obj, time, beta_0, beta_N = huber_regression_mwe(X_raw, y_raw; solver = "MOSEK", sigma = sigma, summarize = 1);
println("Gurobi: ")
status, obj, time, beta_0, beta_N = huber_regression_mwe(X_raw, y_raw; solver = "Gurobi", sigma = sigma, summarize = 1);

And I get:

MOSEK:
status: SLOW_PROGRESS
solved in time (s): 0.103
objective value: 24008.094
Gurobi:
Academic license - for non-commercial use only - expires 2023-09-03
status: OPTIMAL
solved in time (s): 0.044
objective value: 24008.096

Thank you! I now see that this problem isn’t (currently) suitable for my solver, because the number of variables that each residual depends on is large (and probably dynamic). However, it got me to understand a limitation of the package, and therefore the types of problems it’s more suited to. And I may yet add dynamically sized variable blocks! So thank you, and sorry I couldn’t help more.

Great to hear that at least one problem has been resolved in this thread :smiley: I will look into this a bit more; the following is suboptimal in my view:

  1. MOSEK, which solves the SOCP directly, takes longer than its usual performance and reports the slow-progress status.
  2. Gurobi, which solves SOCPs quite fast, requires you to first give a QP to JuMP, and the reformulation from QP → SOCP ends up taking so much time that Gurobi’s efficiency is lost.

P.S. Your solver looks interesting; it would be great to chat later, especially now that I am working extensively on hidden convexity and S-Lemma topics :slight_smile:

Hi everyone,

Do we have any updates here, by any chance? I think the discussion in the MOSEK forum concluded that this is more about JuMP’s modeling than MOSEK’s solution algorithm, but I might be wrong.

Many thanks for your time!

I think the discussion in the MOSEK forum concluded

Link?

that this is more about JuMP’s modeling

It’s not obvious why this would be. JuMP is passing what you’ve written to Mosek. There might be other formulations that are more efficient, but that’s a separate question.

Here is the link, many thanks for your support!

How Julia and Yalmip do this I do not know. If Yalmip employs an eigenvalue decomposition whereas Julia employs a Cholesky factorization, then that could explain what you observe.

https://groups.google.com/g/mosek/c/Ty1WJb2pXG0

I’ve now implemented support for dynamic-sized vectors on the dev branch of NLLSsolver. So you can now solve your problem using that solver as follows:

using NLLSsolver, Random, Static

function synthetic_data(N, n) #simulate data
    X_raw = randn(N, n) #the numerical data
    X_raw = (X_raw .- minimum(X_raw, dims=1)) ./ (maximum(X_raw, dims=1) .- minimum(X_raw, dims=1)) #min/max standardize
    #generate the labels
    y_raw = X_raw*randn(n) + randn(N).*5 .+ 1.0 #error added at the end, followed by an intercept of "1"
    return X_raw, y_raw
end

struct LinearResidual{T} <: NLLSsolver.AbstractResidual
    y::T # Target value
    X::NLLSsolver.DynamicVector{T}
end
NLLSsolver.ndeps(::LinearResidual) = static(1) # Residual depends on 1 variable (the coefficient vector)
NLLSsolver.nres(::LinearResidual) = static(1) # Residual vector has length 1
NLLSsolver.varindices(::LinearResidual) = 1
NLLSsolver.getvars(::LinearResidual{T}, vars::Vector) where T = (vars[1]::NLLSsolver.DynamicVector{T},)
NLLSsolver.computeresidual(res::LinearResidual, w) = res.X' * w - res.y
NLLSsolver.computeresjac(varflags, res::LinearResidual, w) = res.X' * w - res.y, res.X'
const huberrobustifier = NLLSsolver.Huber2oKernel(6.0)
NLLSsolver.robustkernel(::LinearResidual) = huberrobustifier
Base.eltype(::LinearResidual{T}) where T = T

# Generate some test data
Random.seed!(1); 
N, n = 1000, 10; #size
X_raw, y_raw = synthetic_data(N, n); #simulate data

# Create the problem
problem = NLLSsolver.NLLSProblem{NLLSsolver.DynamicVector{Float64}}()
NLLSsolver.addvariable!(problem, zeros(n))
for i in 1:N
    NLLSsolver.addresidual!(problem, LinearResidual(y_raw[i], vec(X_raw[i,:])))
end

# Optimize
result = NLLSsolver.optimize!(problem, NLLSsolver.NLLSOptions(iterator=NLLSsolver.gaussnewton, storecosts=true))
show(result)

I get a solution cost of about 24033, but perhaps there is some difference in our data. :man_shrugging:

Sorry if this is out of scope, but if you want to compare the result, MLJLinearModels implements Huber regression, i.e.:

\arg\min_\theta \sum \rho(X\theta - y) + n\lambda\|\theta\|_2^2 + n\gamma\|\theta\|_1

where the L2 and L1 penalties can be ignored in your case (\lambda=\gamma=0) and \rho is the Huber loss, with \rho(r)=r^2/2 for |r|\le \delta and \delta(|r|-\delta/2) otherwise (note there is a factor of 2 compared to your formulation, but with no penalty this makes no difference). So basically your M is our \delta.
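To see the factor of 2 explicitly: for a single residual r, minimizing u^2 + 2\sigma v over 0 \le u \le \sigma, v \ge 0, u + v \ge |r| gives r^2 when |r| \le \sigma and 2\sigma|r| - \sigma^2 when |r| > \sigma, which is exactly 2\rho(r) with \delta = \sigma. So the conic objective above is twice the Huber objective used here, and the optimal \beta is the same.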

with the example suggested above:

using MLJLinearModels
using Random

# code taken from previous post
function synthetic_data(N, n)
    X_raw = randn(N, n)
    X_raw = (X_raw .- minimum(X_raw, dims=1)) ./ (maximum(X_raw, dims=1) .- minimum(X_raw, dims=1))
    y_raw = X_raw*randn(n) + randn(N).*5 .+ 1.0
    return X_raw, y_raw
end
Random.seed!(1)
N, n = 1000, 10
X_raw, y_raw = synthetic_data(N, n)

# the MLJLM fit
hr = HuberRegression(delta=6.0, lambda=0.0, gamma=0.0)
theta = fit(hr, X_raw, y_raw)
obj = objective(hr, X_raw, y_raw)
obj(theta)  # 12250

This matches the solution cost of around 24k reported earlier, up to the factor of 2 mentioned above. (It’s not an ideal example though, as generating data from a Gaussian won’t really make Huber regression perform significantly differently from OLS, but that’s not really the point, I guess.)

Might be useful for sanity-checking results.