# How is Optim.jl's IPNewton so good compared to (exponentiated) gradient descent?

This question is about implementing an optimization algorithm in Julia and comparing it with `IPNewton` from Optim.jl. It’s kind of broad, so I’m not sure if it fits here. Let me know if it doesn’t.

I wrote some code to minimize a function where some parameters need to be on the probability simplex, so this is constrained minimization:

``````
minimize f(p1, p2, other_stuff)
s.t. p1, p2 >= 0 and p1 + p2 == 1
``````

I know that I can substitute `p2 = 1 - p1` and use something like projected gradient descent to satisfy the non-negativity constraints, but that won’t work for higher-dimensional simplexes.

I wanted to compare `IPNewton` (an interior-point method for constrained optimization) to basic gradient descent. Since some parameters are constrained to the probability simplex, I’m using exponentiated gradient descent (EGD): a serious interior-point optimizer felt like overkill for such a simple constraint, so I wanted to try something simpler.
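For concreteness, the EGD step I mean is the usual multiplicative-weights update: scale each weight by the exponentiated negative gradient, then renormalize so the weights stay on the simplex. A minimal sketch (my notation, not the code below):

``````
# One exponentiated gradient step on the simplex:
# the weights stay positive and are renormalized to sum to 1.
function egd_step(p::AbstractVector, grad::AbstractVector, lr::Real)
    q = p .* exp.(-lr .* grad)
    q ./ sum(q)
end
``````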

The problem is that EGD is much less consistent and much slower than `IPNewton`, and it sometimes even produces completely wrong results. The “optimum” found by EGD also has a higher loss than the one found by `IPNewton`.

Questions: is the interior point method just this good? Is my implementation of EGD wrong? Is gradient descent in general this bad for constrained optimization? Am I even allowed to use EGD here? I’ve never seen it used outside online learning.

# Code

The main functions are `minimize_optim` and `minimize_expgrad`.

``````
import Random
using ComponentArrays
import Optim, ForwardDiff

const AV = AbstractVector{T} where T

# ========== Loss function to minimize ==========
function get_params(params::AV{<:Real})
    K = length(params) ÷ 3

    ax = Axis((p=1:K, mu=(K+1):2K, sigma=(2K+1):3K))
    ComponentVector(params, ax)
end

normal_pdf(x::Real, mu::Real, var::Real) =
    exp(-(x - mu)^2 / (2var)) / sqrt(2π * var)

function loss(p::AV{<:Real}, mu::AV{<:Real}, sigma::AV{<:Real}, b::Real, data::AV{<:Real})
    N = length(data)

    penalty = sum(
        p[k] * p[h] * normal_pdf(mu[k], mu[h], 2b + sigma[k]^2 + sigma[h]^2)
        for k in eachindex(p), h in eachindex(p)
    )

    -2/N * sum(
        sum(
            p_k * normal_pdf(x_n, mu_k, 2b + sigma_k^2)
            for (p_k, mu_k, sigma_k) in zip(p, mu, sigma)
        )
        for x_n in data
    ) + penalty
end

function loss(params::AV{<:Real}, b::Real, data::AV{<:Real})
    par = get_params(params)
    loss(par.p, par.mu, par.sigma, b, data)
end

# ========== Minimization algorithms ==========
function constraint_probability!(c::AV{<:Real}, params::AV{<:Real})
    K = length(params) ÷ 3
    p = @view params[1:K]

    # Write into `c` in place; `c = 1 - sum(p)` would only rebind the local variable
    c[1] = 1 - sum(p)
    c
end

# Minimize using interior point method from Optim.jl
function minimize_optim(params0::AV{<:Real}, b::Real, data::AV{<:Real}; tol=1e-5)
    K = length(params0) ÷ 3

    # [weights; means; stds]
    # stds can be in R because they're squared later
    lo_params = [zeros(K); fill(-Inf, K); fill(-Inf, K)]
    hi_params = [ones(K); fill(Inf, K); fill(Inf, K)]
    lo_constr = hi_constr = [0.0]

    dfc = Optim.TwiceDifferentiableConstraints(
        constraint_probability!,
        lo_params, hi_params, lo_constr, hi_constr, :forward
    )

    res = Optim.optimize(
        params -> loss(params, b, data), dfc,
        params0, Optim.IPNewton(),
        Optim.Options(
            successive_f_tol=20,
            f_tol=tol, f_reltol=tol
        ),
        autodiff=:forward
    )
    Optim.minimizer(res) |> get_params
end

# Minimize using exponentiated gradient descent
function minimize_expgrad(params0::AV{<:Real}, b::Real, data::AV{<:Real}; lr=1e-2, tol=1e-5, max_iter=1_000_000)
    K = length(params0) ÷ 3
    objective = params -> loss(params, b, data)

    params = copy(params0)
    obj_values = fill(Inf, 50)
    for it in 1:max_iter
        # 1. Compute gradient
        gradient = ForwardDiff.gradient(objective, params)

        # 2. Exponentiated gradient descent w.r.t. weights
        @. params[1:K] *= exp(-lr * gradient[1:K])
        params[1:K] ./= sum(params[1:K])

        # 3. Regular gradient descent w.r.t. other params
        @. params[K+1:end] -= lr * gradient[K+1:end]

        # 4. Track convergence: largest change across the last 50 objective values
        obj_values[1:end-1] .= @view obj_values[2:end]
        obj_values[end] = objective(params)

        metric = if all(isfinite, obj_values)
            @views maximum(
                abs(old - new)
                for (old, new) in zip(obj_values[1:end-1], obj_values[2:end])
            )
        else
            1e6
        end

        if metric < tol
            @info "Converged" it
            break
        end
    end

    get_params(params)
end

function random_params(rng, K::Integer)
    # Use the passed RNG so runs are reproducible
    p = rand(rng, K)
    p ./= sum(p)

    mu = randn(rng, K)
    sigma = rand(rng, K)

    [p; mu; sigma]
end

# ========== Run minimization ==========
rng = Random.MersenneTwister(42)
data = [randn(rng, 200) .- 3; 0.3 * randn(rng, 400) .+ 2]
params0 = random_params(rng, 2) #[0.5, 0.5, 0,0, 1e-3, 5e-3]
b = 0.01

@info "Minimizing w/ Optim..."
par_optim = minimize_optim(params0, b, data, tol=1e-6)
@show par_optim.p
@show par_optim.mu
@show par_optim.sigma
@show loss(par_optim, b, data)

println()

@info "Minimizing w/ exponentiated gradient descent..."
``````

The data is a mixture of 200 draws from `N(-3, 1)` and 400 draws from `N(2, 0.3^2)`, so the correct answer is one of these (the same mixture in either component order):

``````
(p=[0.33333, 0.66666], mu=[-3, 2], sigma=[1, 0.3])
(p=[0.66666, 0.33333], mu=[2, -3], sigma=[0.3, 1])
``````
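(As a quick sanity check, with the script above loaded, one can evaluate the loss directly at this ground truth and compare it to whatever either optimizer returns:)

``````
# Loss at the known ground truth; uses `loss`, `b` and `data` from the script above
truth = [1/3, 2/3, -3.0, 2.0, 1.0, 0.3]
loss(truth, b, data)
``````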

# Results

• `IPNewton` produces correct results given any random initial parameters. Moreover, it does so fast, literally in the blink of an eye.
• EGD often converges to terrible points, and it takes a couple of seconds to do so. Given random initial points, it mostly converges to correct points, but not as consistently or precisely as `IPNewton`.

### Some examples of bad results by EGD

``````
julia> params0 = random_params(rng, 2)
6-element Vector{Float64}:
0.13901488648354637
0.8609851135164537
-1.5498971658168026
-1.9026076357002404
0.27366742133875555
0.9996700977843878

┌ Info: Converged
└   it = 19336
ComponentVector{Float64}(p = [8.126461586000774e-14, 0.9999999999999187], mu = [-1.5702887008520934, 0.5103155146607207], sigma = [0.42607377649097655, 3.0958258461104533])
``````

This is terrible because `p[1] ≈ 0`, which it shouldn’t be. Given the same initial point, Optim produces results that are correct to the 3rd decimal digit, which is fantastic:

``````
julia> minimize_optim(params0, b, data, tol=1e-6)
ComponentVector{Float64}(p = [0.3330468090222882, 0.6669531909777119], mu = [-3.088148801196878, 1.9922352674701609], sigma = [0.9469738468169181, 0.3000181581983476])
``````

Another one:

``````
julia> params0 = random_params(rng, 2)
6-element Vector{Float64}:
0.34854447603600636
0.6514555239639936
-0.8641166349035045
1.136448930313033
0.45127735273921554
0.1342420503488576

┌ Info: Converged
└   it = 12080
ComponentVector{Float64}(p = [0.3460074950904417, 0.6539925049095583], mu = [-2.520558492806493, 1.9930655538825621], sigma = [1.4743322284971923, 0.2953434943385471])

julia> minimize_optim(params0, b, data, tol=1e-6)
ComponentVector{Float64}(p = [0.3330468076989589, 0.666953192301041], mu = [-3.0881488012483316, 1.99223526745685], sigma = [0.9469738440566653, 0.30001815864792564])
``````

Here, EGD overestimated the first component’s standard deviation (`sigma = 1.47` instead of 1), and its `mu` is also not quite correct. `IPNewton` is spot on, as always.

Questions repeated: is the interior point method just this good? Is my implementation of EGD wrong? Is gradient descent in general this bad for constrained optimization? Am I even allowed to use EGD here? I’ve never seen it used outside online learning.


You’ve got a lot of code going on, so the comparison isn’t trivial, but it looks like you’re doing constant-step-size gradient descent. That algorithm, despite showing up a lot in ML, is very bad for actually optimizing functions. If you were to use the gradient descent code in Optim, you’d get a line search, which your code seems to lack.
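For illustration, plain gradient descent in Optim already comes with a line search (Hager-Zhang by default), so every iteration picks its own step size. A minimal sketch on a toy objective (my example, not code from this thread):

``````
import Optim

# Hypothetical toy objective, just to show the interface
rosenbrock(x) = (1 - x[1])^2 + 100 * (x[2] - x[1]^2)^2

# GradientDescent() uses a line search to choose the step size
# at every iteration, unlike a fixed learning rate.
res = Optim.optimize(rosenbrock, zeros(2), Optim.GradientDescent(),
                     Optim.Options(iterations=100_000); autodiff=:forward)
Optim.minimizer(res)
``````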


I’m honestly not familiar with EGD. However, here are a few key points:

• Gradient descent methods work with first-order derivatives. Therefore, they can be attracted to non-minimum stationary points (local maxima, saddle points);
• Newton-based methods use second-order derivatives, that is, they capture local curvature. The local model is therefore more precise. If you’re “close enough” to the solution, you converge very fast (Newton’s method has a quadratic rate of convergence); see the sketch after this list.
• Instead of just moving in the direction opposite the gradient, the descent direction in interior-point methods is obtained by solving a (strictly convex) equality-constrained Quadratic Problem. In practice, this is equivalent to solving a linear system of equations (= the first-order optimality conditions). Therefore, you can guarantee that the direction is a descent direction for your merit function.
• A robust code implements globalization strategies (global convergence = converging to a local minimum whatever the initial point), e.g. a line search + sufficient decrease condition, as pointed out by @johnmyleswhite.
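A minimal sketch of the Newton direction (my illustration on a hypothetical smooth objective, using ForwardDiff for the derivatives):

``````
import ForwardDiff
using LinearAlgebra

# Hypothetical smooth objective, just for illustration
f(x) = (x[1] - 1)^2 + 4 * (x[2] + 0.5)^2 + x[1] * x[2]

x = [3.0, 2.0]
g = ForwardDiff.gradient(f, x)
H = ForwardDiff.hessian(f, x)

# Newton direction: solve H * d = -g instead of taking d = -g;
# the Hessian rescales the step using local curvature.
d = -(H \ g)
dot(g, d) < 0  # true: d is a descent direction (H is positive definite here)
``````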

Why do you have a penalty term in your code? Have you tried writing explicit constraints instead?


I implemented a simple backtracking line search like this:

``````
# Exponentiated gradient update
function project_exp!(x::AV{<:Real}, direction::AV{<:Real}, lr::Real)::AV{<:Real}
    K = length(x) ÷ 3

    @. x[1:K] *= exp(-lr * direction[1:K])
    x[1:K] ./= sum(x[1:K])

    @. x[K+1:end] -= lr * direction[K+1:end]

    x
end

project_exp(x::AV{<:Real}, direction::AV{<:Real}, lr::Real) =
    project_exp!(copy(x), direction, lr)

function linesearch(f::Function, x::AV{<:Real}, grad::AV{<:Real}, direction::AV{<:Real}, lr0::Real, rho::Real, c::Real)
    @assert lr0 > 0
    @assert 0 < rho < 1
    @assert 0 < c < 1

    # Directional derivative of f along the (negated) step;
    # negative whenever `direction` is a descent direction
    local_slope = -sum(grad .* direction)

    f_x = f(x)
    a = lr0
    # Backtrack until the Armijo sufficient decrease condition holds
    while f(project_exp(x, direction, a)) > f_x + a * c * local_slope
        a *= rho
    end

    a
end
``````

(I’m not entirely sure I should be evaluating `f(project_exp(x, direction, a))` in the loop, since the standard line search uses `f(x + a*direction)`, but I’m trying to satisfy the constraints at every trial point. Anyway, I have a paper that supposedly explains how to do this properly, so I should read it first.)

Then I used the line search inside the optimization loop like this:

``````
# 1. Compute gradient
gradient = ForwardDiff.gradient(objective, params)

# 2. Find optimal step size via backtracking
# (starting step, shrink factor and Armijo constant are example values)
lr = linesearch(objective, params, gradient, gradient, 1.0, 0.5, 1e-4)

# 3. Take the step, staying on the simplex
project_exp!(params, gradient, lr)
``````

I tested it with random starting points `params0` and random `data`, and it seems to work great! I’m getting consistent results, nice speed, nice everything! I’m not 100% sure whether this is a fluke (because it seems like such a huge improvement!), but it works much better than without the line search.

I ended up spending this entire week reading Boyd’s “Convex Optimization” and also Nocedal & Wright, and tried to implement my own barrier method, but it… didn’t really converge at all, so I abandoned that idea because Optim.jl exists and I really don’t need to implement my own interior point method. Originally, I wanted my code to be self-contained and to remove the dependency on Optim, but now this seems too hard to do properly (EGD is simple, but it doesn’t work as well as `IPNewton`).