Ipopt does NOT memorize the best solution found so far

The claim in the title is demonstrated by minimizing this function

function g(x) return sin(7 * x) + 1.2 * cos(4.7 * x) + 1.3 * x end

which can be visualized with, e.g., CairoMakie:

using CairoMakie
f = Figure();
Axis(f[1, 1]);
xs = LinRange(0, 9, 500);
lines!(xs, g.(xs))
f

We verify the claim with the following code:

import JuMP, Gurobi, Ipopt
# 1️⃣ with Gurobi
NLP = JuMP.Model(() -> Gurobi.Optimizer())
JuMP.@variable(NLP, 0 <= x <= 9)
JuMP.@objective(NLP, Min, g(x))
JuMP.optimize!(NLP)
JuMP.assert_is_solved_and_feasible(NLP; allow_local = false)
x1 = JuMP.value(x) # 0.6553872496146986 (optimal solution)
g(x1) # -1.3379824147669828 (optimal obj value)
x3 = JuMP.value(x; result = 3) # 4.358805270131485 (suboptimal solution)
g(x3) # 4.801279216797521 (suboptimal obj value)
# 2️⃣ with Ipopt
NLP = JuMP.Model(Ipopt.Optimizer)
JuMP.@variable(NLP, 0 <= x <= 9)
JuMP.@objective(NLP, Min, sin(7 * x) + 1.2 * cos(4.7 * x) + 1.3 * x) # 🍅 i.e. Minimize g(x)
JuMP.set_start_value(x, x1)
JuMP.optimize!(NLP)
JuMP.assert_is_solved_and_feasible(NLP)
r1 = JuMP.value(x) # 0.6542659763135713
g(r1) # -1.3380295865474872
# proceed with another start value
JuMP.set_start_value(x, x3)
JuMP.optimize!(NLP)
JuMP.assert_is_solved_and_feasible(NLP)
r3 = JuMP.value(x) # 4.358805269704956
g(r3) # 4.80127921679752
# remark: the second solve returns an inferior local solution
# Ipopt keeps no memory of the earlier, better point, whereas Gurobi
# always reports the best incumbent found, so this cannot happen with Gurobi
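A minimal workaround is to keep the incumbent yourself across the solves (a sketch continuing from the code above; the loop and the best_x/best_obj names are mine, not an Ipopt feature):

best_x, best_obj = NaN, Inf
for x0 in (x1, x3) # the two start values used above
    global best_x, best_obj # needed if run as a script
    JuMP.set_start_value(x, x0)
    JuMP.optimize!(NLP)
    if JuMP.is_solved_and_feasible(NLP) # accept only successful local solves
        obj = JuMP.objective_value(NLP)
        if obj < best_obj
            best_x, best_obj = JuMP.value(x), obj # memorize the best solution found so far
        end
    end
end
best_x, best_obj # the better local solution, regardless of solve order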

(Additional question: will it be all right to directly write g(x) in the :tomato: line, as is done in the Gurobi code?)

Same answer as Ipopt's "local" optimum can be very substandard - #5 by odow

Additional question: will it be all right to directly write g(x) in the :tomato: line, as is done in the Gurobi code?

Yes. JuMP uses function tracing to compute the underlying expression.
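For instance, the traced objective can be inspected directly (a quick check; `model` is a fresh copy of the Ipopt model from above):

import JuMP, Ipopt
model = JuMP.Model(Ipopt.Optimizer)
JuMP.@variable(model, 0 <= x <= 9)
JuMP.@objective(model, Min, g(x)) # g is called once with the JuMP variable, which builds the expression
JuMP.objective_function(model)    # sin(7 x) + 1.2 cos(4.7 x) + 1.3 x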


I wrote a simple barrier method myself and used it to solve a toy bilinear program.
I introduce a register R to memorize the best solution found so far, so that the method keeps a global incumbent, even though it cannot guarantee a globally optimal point (i.e., it only tries to provide a good primal-side heuristic solution). On this small test case it outperforms Ipopt, given the same starting point.

The bilinear program studied is

using JuMP, Ipopt
m = Model(Ipopt.Optimizer)
@variable(m, -5 <= x <= 3)
@variable(m, -2 <= y <= 4)
@variable(m, -3 <= z <= 2)
@objective(m, Min, x * (y + z))
set_start_value(x, 1.97)
set_start_value(y, 0.04)
set_start_value(z, 0.23)
optimize!(m); JuMP.solution_summary(m; verbose = true)

My barrier algorithm is

function get_d(J, F) # get a direction of improvement via Newton's step
    return -J \ F
end
function bfc(x) # a convex barrier component ✅ x > 0 MUST hold (shown for reference; f_bar below writes the logs out directly)
    -log(x)
end
function f(p) # the objective function to be Minimized
    x, y, z = p
    return x * (y + z)
end
function f_bar(e, p) # `e` = BFC_epsilon ✅ In the context of Minimization
    x, y, z = p
    return x * y + x * z - e * (log(x + 5) + log(3 - x) + log(y + 2) + log(4 - y) + log(z + 3) + log(2 - z))
end
function F(e, p) # "F(p) = 0" is precisely "∇f_bar(p) = 0"
    x, y, z = p
    [
        y + z + e*(1 / (3 - x) + -1 / (5 + x))
        x + e*(1 / (4 - y) + -1 / (2 + y))
        x + e*(1 / (2 - z) + -1 / (3 + z))
    ] # an error Vector
end
function J(e, p)
    x, y, z = p
    [
    e*(1 / ((3 - x)^2) + 1 / ((5 + x)^2)) 1 1
    1 e*(1 / ((2 + y)^2) + 1 / ((4 - y)^2)) 0
    1 0 e*(1 / ((2 - z)^2) + 1 / ((3 + z)^2))
    ] # Jacobian Matrix
end
function get_α(pbm, p, d) # p is an interior point, d is the (alleged) direction of improvement
    α = zero(d) # per-coordinate step lengths; a zero entry means that coordinate stays put
    for r in eachindex(d)
        if d[r] > 0
            α[r] = (pbm[r, 2] - p[r]) /  d[r] # ratio test against the upper bound
        elseif d[r] < 0
            α[r] = (p[r] - pbm[r, 1]) / -d[r] # ratio test against the lower bound
        end
    end
    α .= 0.999 .* α # fraction-to-boundary scaling keeps the iterate strictly interior
end
function initialize_R(p)
    ub = f(p)
    if ub < R.ub[1]
        R.p[:], R.ub[1] = p, ub
        @info "*ub = $ub, p = $p"
    else
        @info "current ub = $ub"
    end
end
plb = [-5, -2, -3]
pub = [3,   4,  2]
pbm = [plb pub]
e = 1.5 # 🟠 [need to tune] initial BFC_epsilon
x = 1.97
y = 0.04
z = 0.23
p = [x, y, z]
R = (p = NaN * ones(3), ub = [Inf]) # Global Solution Register
initialize_R(p)
while true # main procedure
    global e, p # needed when this loop is run as a script (it assigns to globals)
    d = get_d(J(e, p), F(e, p))
    α = get_α(pbm, p, d)
    @. p += α * d # update forcefully
    ub = f(p)
    if ub < R.ub[1]
        R.p[:], R.ub[1] = p, ub
        @info "*ub = $ub, p = $p; derived at e = $e"
    else
        @info "current ub = $ub"
    end
    e < 1e-13 && break
    e *= 0.12 # 🟠 [need to tune]
end
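In formulas, each pass of the loop takes one Newton step on the stationarity condition of the log-barrier subproblem

$$
\min_p \; \bar f_e(p) = f(p) - e \sum_{r=1}^{3} \big( \log(p_r - l_r) + \log(u_r - p_r) \big),
$$

i.e. it solves $J(e, p)\,d = -F(e, p)$ with $F = \nabla \bar f_e$ and $J = \nabla^2 \bar f_e$, scales $d$ by the ratio test in get_α, and then shrinks $e$; here $l$ and $u$ are the box bounds stored in pbm.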

It’s not hard to outperform Ipopt on select problems it wasn’t designed to do well on. But your method is unlikely to scale well to higher dimensions, ~~where second-order methods can do far better~~ EDIT: due to sampling.

You may have better luck with a method like CMA-ES, which uses samples to approximate second-order information. It ranks the best candidates in each generation and updates a covariance matrix (akin to an inverse Hessian), which is sampled for the next generation. It doesn’t memorize all past candidates, only the samples in each generation, but it’s more forgiving of weird landscapes than Ipopt. It also doesn’t scale as well as Ipopt, but probably better than your hand-tuned approach.
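For example (a sketch; I am assuming the CMAEvolutionStrategy.jl package and its minimize/xbest/fbest API here, so check the package docs before relying on it):

using CMAEvolutionStrategy # assumed package
obj(p) = p[1] * (p[2] + p[3]) # the bilinear objective from above
result = minimize(obj, [1.97, 0.04, 0.23], 1.0; # start point and initial step size σ
                  lower = [-5.0, -2.0, -3.0],
                  upper = [3.0, 4.0, 2.0])
xbest(result), fbest(result) # best candidate found and its objective value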


Thanks for the information.
I actually did use a second-order method, I think: the objective is the primal function, I take its gradient, and then use the Jacobian of that gradient.

This is probably true. But the reason might be that I was not using sparse techniques.
Do you have any idea about this?
I guess I need to invoke an autodiff package that works well in conjunction with sparse techniques, but I haven’t figured this out yet. :slightly_smiling_face:
Here is a related link I’m reading now: Advanced tutorial · DifferentiationInterface.jl.
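From what I can tell so far, something like this might compute a sparse Jacobian of my F (an untested sketch based on that tutorial; the detector and coloring packages, and the exact API, are my assumptions):

using DifferentiationInterface
import ForwardDiff
using SparseConnectivityTracer: TracerSparsityDetector
using SparseMatrixColorings: GreedyColoringAlgorithm

backend = AutoSparse(
    AutoForwardDiff();
    sparsity_detector = TracerSparsityDetector(),
    coloring_algorithm = GreedyColoringAlgorithm(),
)
Fp(p) = F(1.5, p) # F from my barrier code, at a fixed e
prep = prepare_jacobian(Fp, backend, [1.97, 0.04, 0.23])
jacobian(Fp, prep, backend, [1.97, 0.04, 0.23]) # returns a sparse matrix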

You’re right, I just saw “Jacobian” and didn’t realize it was the Jacobian of F = 0, so it’s in fact a Hessian.

AFAIK the NLP methods benefit from a few things: sparse linear solves, smart approximation of the Hessian or its inverse to avoid full computation, use of AD when available, and smart line searches. They tend to scale well (I guess roughly N^3) but can’t deal with rugged landscapes. I don’t think your issue is needing AD, since you supplied the derivatives, and many NLP solvers will have trouble with your problem even though Julia makes great use of AD.

My impression is that sampling methods such as yours are difficult to scale. The advantage of something like CMA-ES is that it doesn’t attempt to fully memorize past candidates, but rather incorporates them into the evolution of the mean and covariance. Even older genetic algorithms avoid full memorization.

I didn’t look through the previous thread, but the usual recommendation would be Nonconvex.jl.

Thanks for your recommendation.
The problem I aim to investigate is the one studied in this paper: doi.org/10.1287/ijoc.2022.1163.

It is a disjoint bilinear program. Just as in the LP case, an optimal solution lies on the boundary of the feasible region.
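(For reference, the general form of this class, as I understand it, is

$$
\min_{x \in X, \; y \in Y} \; c^\top x + x^\top A y + d^\top y,
$$

where $X$ and $Y$ are separate polyhedra; for fixed $x$ the problem is an LP in $y$ and vice versa, which is why an optimum is attained at a vertex.)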

What do you mean by a “sampling” method? I don’t quite get the point. My method follows the basic idea of the barrier (interior-point) method.

The problem itself is nonconvex and NP-hard, so the objective is to find a feasible solution that achieves the best cost (rather than to find a locally stationary point). My simplest intuition is that we shouldn’t use an integer-programming method, e.g. the Gurobi solver. My hope is that an NLP method could provide a good heuristic. My knowledge of the so-called “global optimization methods”, e.g. CMA-ES, is heavily limited, because I have never tried them before.

I guess I’ll keep updating this topic. First I will try larger-scale cases; then I’ll compare the algorithms you’ve suggested.

By sampling I mean you evaluate the objective at certain points and actually look at the values (storing them in R). Many NLP methods only use function values and gradients to fit a model, e.g. a local quadratic, and don’t otherwise store or “look at” the samples.

I don’t think most NLP methods will work well for your problem, although I wonder if stochastic gradient descent would actually do okay, since many ML problems have bilinearity.
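A minimal sketch of the idea on your toy instance (plain projected gradient descent; the step size and iteration count are arbitrary choices of mine):

let p = [1.97, 0.04, 0.23], η = 0.01
    grad(q) = [q[2] + q[3], q[1], q[1]] # gradient of x * (y + z)
    project(q) = clamp.(q, [-5.0, -2.0, -3.0], [3.0, 4.0, 2.0]) # project back onto the box
    for _ in 1:1000
        p = project(p - η * grad(p))
    end
    p, p[1] * (p[2] + p[3]) # final point and its objective value
end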