How is Optim.jl's IPNewton so good compared to (exponentiated) gradient descent?

This question is about implementing an optimization algorithm in Julia and comparing it with IPNewton from Optim.jl. It’s kind of broad, so I’m not sure if it fits here. Let me know if it doesn’t.

I wrote some code to minimize a function where some parameters need to be on the probability simplex, so this is constrained minimization:

minimize f(p1, p2, other_stuff)
s.t. p1, p2 >= 0 and p1 + p2 == 1

I know that I can substitute p2 = 1 - p1 and use something like projected gradient descent to satisfy the non-negativity constraints, but that won’t work for higher-dimensional simplexes.
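
For the two-component case, what I mean is roughly the sketch below (illustrative only; projected_step_2d is a made-up name, not part of the code further down): substitute p2 = 1 - p1, take a plain gradient step in p1, then clamp it back into [0, 1].

# Sketch of the substitution + clamping idea for K = 2 (illustrative only)
function projected_step_2d(p1::Real, grad_p1::Real, lr::Real)
    p1_new = clamp(p1 - lr * grad_p1, 0.0, 1.0)   # gradient step, projected back onto [0, 1]
    (p1_new, 1.0 - p1_new)                        # (p1, p2) with p2 = 1 - p1
end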

I wanted to compare IPNewton (Optim.jl’s interior-point method for constrained optimization) with basic gradient descent. Using a serious interior-point optimizer for such a simple constraint felt like overkill, so I wanted to try something simpler. Since some of the parameters are constrained to the probability simplex, I’m using exponentiated gradient descent (EGD) for them.
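
Concretely, the update I apply to the weights is the standard multiplicative EGD step (this is just the textbook rule, and it matches the code below), with a plain gradient step for the unconstrained parameters:

p_k <- p_k * exp(-lr * dL/dp_k)   for each k, then renormalize so that p_1 + ... + p_K == 1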

The problem is that EGD is much less consistent and much slower than IPNewton, and it sometimes even produces completely wrong results. The “optimum” found by EGD also results in a higher loss than the one found by IPNewton.

Questions: is the interior point method just this good? Is my implementation of EGD wrong? Is gradient descent in general this bad for constrained optimization? Am I even allowed to use EGD here? I’ve never seen it used outside online learning.

Code

The main functions are minimize_optim and minimize_expgrad.

import Random
using ComponentArrays
import Optim, ForwardDiff

const AV = AbstractVector{T} where T

# ========== Loss function to minimize ==========
# Split the flat parameter vector [weights; means; stds] into a ComponentVector
# with fields p, mu and sigma, each of length K
function get_params(params::AV{<:Real})
    K = length(params) ÷ 3

    ax = Axis((p=1:K, mu=(K+1):2K, sigma=(2K+1):3K))
    ComponentVector(params, ax)
end

# Density of a normal distribution with mean `mu` and variance `var`
normal_pdf(x::Real, mu::Real, var::Real) =
    exp(-(x - mu)^2 / (2var)) / sqrt(2π * var)

function loss(p::AV{<:Real}, mu::AV{<:Real}, sigma::AV{<:Real}, b::Real, data::AV{<:Real})
    N = length(data)

    penalty = sum(
        p[k] * p[h] * normal_pdf(mu[k], mu[h], 2b + sigma[k]^2 + sigma[h]^2)
        for k in eachindex(p), h in eachindex(p)
    )

    -2/N * sum(
        sum(
            p_k * normal_pdf(x_n, mu_k, 2b + sigma_k^2)
            for (p_k, mu_k, sigma_k) in zip(p, mu, sigma)
        )
        for x_n in data
    ) + penalty
end

function loss(params::AV{<:Real}, b::Real, data::AV{<:Real})
    par = get_params(params)
    loss(par.p, par.mu, par.sigma, b, data)
end


# ========== Minimization algorithms ==========
function constraint_probability!(c::AV{<:Real}, params::AV{<:Real})
    K = length(params) ÷ 3
    p = @view params[1:K]

    c[1] = 1 - sum(p)
end

# Minimize using interior point method from Optim.jl
function minimize_optim(params0::AV{<:Real}, b::Real, data::AV{<:Real}; tol=1e-5)
    K = length(params0) ÷ 3

    # [weights; means; stds]
    # stds can be in R because they're squared later
    lo_params = [zeros(K); fill(-Inf, K); fill(-Inf, K)]
    hi_params = [ones(K); fill(Inf, K); fill(Inf, K)]
    lo_constr = hi_constr = [0.0]  # equal lower/upper bounds => equality constraint 1 - sum(p) == 0

    dfc = Optim.TwiceDifferentiableConstraints(
        constraint_probability!,
        lo_params, hi_params, lo_constr, hi_constr, :forward
    )

    res = Optim.optimize(
        params->loss(params, b, data), dfc,
        params0, Optim.IPNewton(),
        Optim.Options(
            successive_f_tol=20,
            f_tol=tol, f_reltol=tol
        ),
        autodiff=:forward
    )
    Optim.minimizer(res) |> get_params
end

# Minimize using exponentiated gradient descent
function minimize_expgrad(params0::AV{<:Real}, b::Real, data::AV{<:Real}; lr=1e-2, tol=1e-5, max_iter=1_000_000)
    K = length(params0) ÷ 3
    objective = params->loss(params, b, data)

    grad_cfg = ForwardDiff.GradientConfig(objective, params0)

    params = copy(params0)
    gradient = similar(params)
    obj_values = fill(Inf, 50)  # rolling window of the last 50 objective values
    for it in 1:max_iter
        # 1. Compute gradient
        ForwardDiff.gradient!(gradient, objective, params, grad_cfg)

        # 2. Exponentiated gradient descent w.r.t. weights
        @. params[1:K] *= exp(-lr * gradient[1:K])
        params[1:K] ./= sum(params[1:K])

        # 3. Regular gradient descent w.r.t. other params
        @. params[K+1:end] -= lr * gradient[K+1:end]

        # 4. Track convergence
        obj_values[1:end-1] .= @view obj_values[2:end]
        obj_values[end] = objective(params)

        metric = if all(isfinite, obj_values)
            @views maximum(
                abs(old - new)
                for (old, new) in zip(obj_values[1:end-1], obj_values[2:end])
            )
        else
            1e6
        end

        if (metric < tol)
            @info "Converged" it
            break
        end
    end

    get_params(params)
end

function random_params(rng, K::Integer)
    p = rand(rng, K)
    p ./= sum(p)

    mu = randn(rng, K)
    sigma = rand(rng, K)

    [p; mu; sigma]
end


# ========== Run minimization ==========
rng = Random.MersenneTwister(42)
data = [randn(rng, 200) .- 3; 0.3 * randn(rng, 400) .+ 2]
params0 = random_params(rng, 2) #[0.5, 0.5, 0,0, 1e-3, 5e-3]
b = 0.01

@info "Minimizing w/ Optim..."
par_optim = minimize_optim(params0, b, data, tol=1e-6)
@show par_optim.p
@show par_optim.mu
@show par_optim.sigma
@show loss(par_optim, b, data)

println()

@info "Minimizing w/ exponentiated gradient descent..."
par_expgrad = minimize_expgrad(params0, b, data, tol=1e-6, lr=1e-3)
@show par_expgrad.p
@show par_expgrad.mu
@show par_expgrad.sigma
@show loss(par_expgrad, b, data)

The correct answer is one of these (the data defined above is 200 draws from N(-3, 1) plus 400 draws from N(2, 0.3²), so the true weights are 200/600 ≈ 0.333 and 400/600 ≈ 0.667):

(p=[0.33333, 0.66666], mu=[-3, 2], sigma=[1, 0.3])
(p=[0.66666, 0.33333], mu=[2, -3], sigma=[0.3, 1])

Results

  • IPNewton produces correct results from any random initial parameters. Moreover, it does so fast, literally in the blink of an eye.
  • EGD is much less reliable: it often converges to terrible points, and it takes a couple of seconds to get there. From random initial points it mostly finds the correct solution, but not as consistently or as precisely as IPNewton.

Some examples of bad results by EGD

julia> params0 = random_params(rng, 2)
6-element Vector{Float64}:
  0.13901488648354637
  0.8609851135164537
 -1.5498971658168026
 -1.9026076357002404
  0.27366742133875555
  0.9996700977843878

julia> minimize_expgrad(params0, b, data, tol=1e-6)
┌ Info: Converged
└   it = 19336
ComponentVector{Float64}(p = [8.126461586000774e-14, 0.9999999999999187], mu = [-1.5702887008520934, 0.5103155146607207], sigma = [0.42607377649097655, 3.0958258461104533])

This is terrible because p[1]=0, but it shouldn’t be. Given the same initial point, Optim produces results that are correct to the 3rd decimal digit, which is fantastic:

julia> minimize_optim(params0, b, data, tol=1e-6)
ComponentVector{Float64}(p = [0.3330468090222882, 0.6669531909777119], mu = [-3.088148801196878, 1.9922352674701609], sigma = [0.9469738468169181, 0.3000181581983476])

Another one:

julia> params0 = random_params(rng, 2)
6-element Vector{Float64}:
  0.34854447603600636
  0.6514555239639936
 -0.8641166349035045
  1.136448930313033
  0.45127735273921554
  0.1342420503488576

julia> minimize_expgrad(params0, b, data, tol=1e-6)
┌ Info: Converged
└   it = 12080
ComponentVector{Float64}(p = [0.3460074950904417, 0.6539925049095583], mu = [-2.520558492806493, 1.9930655538825621], sigma = [1.4743322284971923, 0.2953434943385471])

julia> minimize_optim(params0, b, data, tol=1e-6)
ComponentVector{Float64}(p = [0.3330468076989589, 0.666953192301041], mu = [-3.0881488012483316, 1.99223526745685], sigma = [0.9469738440566653, 0.30001815864792564])

Here, EGD overestimated sigma[1]=1.47 and mu[1] is also not quite correct. IPNewton is spot on, as always.




You’ve got a lot of code going on, so the comparison isn’t trivial, but it looks like you’re doing constant-step-size gradient descent. That algorithm, despite showing up a lot in ML, is very bad for actually optimizing functions. If you used the gradient descent code in Optim instead, you’d get a line search, which your code doesn’t seem to do.
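
For comparison, Optim’s own gradient descent uses a line search by default, so you could run it on your objective with something like the snippet below (it ignores the simplex constraint entirely; the point is only to see what a line search buys you over a constant step):

res = Optim.optimize(
    params -> loss(params, b, data),
    params0, Optim.GradientDescent(),
    Optim.Options(iterations=10_000),
    autodiff=:forward
)
Optim.minimizer(res)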


I’m honestly not familiar with EGD. However, here are a few key points:

  • Gradient descent methods only use first-order derivatives. Therefore, they can be attracted to non-minimum stationary points (local maxima, saddle points);
  • Newton-based methods use second-order derivatives, that is, they capture local curvature. The local model is therefore more accurate: if you’re “close enough” to the solution, you converge very fast (Newton’s method has a quadratic rate of convergence).
  • Instead of just moving in the direction opposite to the gradient, the descent direction in interior-point methods is obtained by solving a (strictly convex) equality-constrained quadratic program. In practice, this amounts to solving a linear system of equations (the first-order optimality conditions); see the sketch after this list. Therefore, you can guarantee that the direction is a descent direction for your merit function.
  • A robust implementation includes globalization strategies (global convergence = converging to a local minimum whatever the initial point), e.g. a line search with a sufficient-decrease condition, as pointed out by @johnmyleswhite.
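
To make the third point concrete, here is a rough sketch of the linear system involved (my own illustration, not Optim.jl’s actual internals; newton_kkt_step is just an illustrative name). Given the Hessian H of the (barrier-augmented) objective, its gradient g, and the Jacobian A of the equality constraints with residual c, the step d and the Lagrange multipliers come from the KKT system of the quadratic model:

using LinearAlgebra

# Quadratic model: minimize g'd + d'*H*d/2  subject to  A*d = -c.
# Its first-order optimality (KKT) conditions form a single linear system.
function newton_kkt_step(H::AbstractMatrix, g::AbstractVector, A::AbstractMatrix, c::AbstractVector)
    n, m = length(g), length(c)
    KKT = [H A'; A zeros(m, m)]   # KKT matrix
    sol = KKT \ [-g; -c]
    sol[1:n], sol[n+1:end]        # (step d, Lagrange multipliers)
end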

Why do you have a penalty term in your code? Have you tried writing explicit constraints instead?


I implemented a simple backtracking line search like this:

# Exponentiated gradient update
function project_exp!(x::AV{<:Real}, direction::AV{<:Real}, lr::Real)::AV{<:Real}
    K = length(x) ÷ 3

    @. x[1:K] *= exp(-lr * direction[1:K])
    x[1:K] ./= sum(x[1:K])

    @. x[K+1:end] -= lr * direction[K+1:end]

    x
end

project_exp(x::AV{<:Real}, direction::AV{<:Real}, lr::Real) =
    project_exp!(copy(x), direction, lr)

function linesearch(f::Function, x::AV{<:Real}, grad::AV{<:Real}, direction::AV{<:Real}, lr0::Real, rho::Real, c::Real)
    @assert lr0 > 0
    @assert 0 < rho < 1
    @assert 0 < c < 1

    f_x = f(x)
    # The step moves (approximately) along -direction, so the slope used in the
    # sufficient-decrease (Armijo) condition is the negative of grad' * direction
    local_slope::Real = -(grad' * direction)
    a = lr0
    while f(project_exp(x, direction, a)) > f_x + a * c * local_slope
        a *= rho
    end

    a
end

(I’m not particularly sure if I should be doing f(project_exp(x, direction, a)) in the loop, since the regular line search uses x + a*direction, but I’m trying to satisfy the constraints. Anyway, I have some paper that supposedly explains how to do this properly, so I should read it first.)

Then used the line search inside the optimization loop like this:

# 1. Compute gradient
ForwardDiff.gradient!(gradient, objective, params, grad_cfg)

# 2. Find optimal step size
lr_opt = linesearch(objective, params, gradient, gradient, 1.0, 0.9, 1e-3)

# 3. Exponentiated gradient descent w.r.t. weights
# and regular gradient descent w.r.t. other params
project_exp!(params, gradient, lr_opt)

I tested it with random starting points params0, random data, and it seems to work great! I’m getting consistent results, nice speed, nice everything! I’m not 100% sure whether this is a fluke or not (because it seems to be such a huge improvement!), but it seems to work much better than without the line search.

TBH, I didn’t think a line search could improve optimization this much…

I ended up spending this entire week reading Boyd’s “Convex Optimization” and Nocedal, and I tried to implement my own barrier method, but it didn’t really converge at all, so I abandoned the idea: Optim.jl exists, and I really don’t need to implement my own interior-point method. Originally I wanted my code to be self-contained and drop the dependency on Optim, but that now seems too hard to do properly (EGD is simple, but it doesn’t work as well as IPNewton).

The “penalty term” isn’t a penalty in the optimization sense. It’s just a part of the objective function that I happen to call “penalty”, for no particular reason.