This question is about implementing an optimization algorithm in Julia and comparing it with IPNewton from Optim.jl. It's kind of broad, so I'm not sure if it fits here; let me know if it doesn't.
I wrote some code to minimize a function where some parameters need to be on the probability simplex, so this is constrained minimization:
minimize f(p1, p2, other_stuff)
s.t. p1, p2 >= 0 and p1 + p2 == 1
I know that I can substitute p2 = 1 - p1 and use something like projected gradient descent to satisfy the non-negativity constraints, but that won't work for higher-dimensional simplexes.
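For concreteness, this is the kind of thing I mean for K = 2 (just an illustrative sketch; projected_step_2d and lr are made-up names for this example, not code I use below):

# Illustrative only: eliminate p2 via p2 = 1 - p1 and keep p1 feasible
# by clamping it to [0, 1] after each gradient step.
function projected_step_2d(p1::Real, grad_p1::Real, lr::Real)
    p1_new = clamp(p1 - lr * grad_p1, 0.0, 1.0)  # gradient step, then project onto [0, 1]
    p1_new, 1 - p1_new                           # p2 is implied by the simplex constraint
end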
I wanted to compare IPNewton (an interior-point method for constrained optimization) to basic gradient descent. Using a serious interior-point optimizer for such a simple constraint felt like overkill, so I wanted to try something simpler. Since some of the parameters are constrained to the probability simplex, I'm using exponentiated gradient descent (EGD) for them and plain gradient descent for the rest.
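Concretely, the simplex update I have in mind is the usual multiplicative-weights / exponentiated-gradient step (sketch only; eg_step, g and eta are illustrative names):

# One exponentiated-gradient step on the probability simplex (sketch).
# p: current weights, g: gradient of the loss w.r.t. p, eta: step size.
function eg_step(p::AbstractVector{<:Real}, g::AbstractVector{<:Real}, eta::Real)
    q = p .* exp.(-eta .* g)  # multiplicative update keeps the weights positive
    q ./ sum(q)               # renormalize so they sum to 1 again
end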
The problem is that EGD is much less consistent and much slower than IPNewton, and it sometimes even produces completely wrong results. The "optimum" found by EGD also has a higher loss than the one found by IPNewton.
Questions: is the interior point method just this good? Is my implementation of EGD wrong? Is gradient descent in general this bad for constrained optimization? Am I even allowed to use EGD here? I’ve never seen it used outside online learning.
Code
The main functions are minimize_optim and minimize_expgrad.
import Random
using ComponentArrays
import Optim, ForwardDiff
const AV = AbstractVector{T} where T
# ========== Loss function to minimize ==========
function get_params(params::AV{<:Real})
    K = length(params) ÷ 3
    # Interpret the flat parameter vector as K weights, K means and K stds
    ax = Axis((p=1:K, mu=(K+1):2K, sigma=(2K+1):3K))
    ComponentVector(params, ax)
end
# Normal pdf with mean mu and variance var (note: variance, not std)
normal_pdf(x::Real, mu::Real, var::Real) =
    exp(-(x - mu)^2 / (2var)) / sqrt(2π * var)
function loss(p::AV{<:Real}, mu::AV{<:Real}, sigma::AV{<:Real}, b::Real, data::AV{<:Real})
    N = length(data)
    penalty = sum(
        p[k] * p[h] * normal_pdf(mu[k], mu[h], 2b + sigma[k]^2 + sigma[h]^2)
        for k in eachindex(p), h in eachindex(p)
    )
    -2/N * sum(
        sum(
            p_k * normal_pdf(x_n, mu_k, 2b + sigma_k^2)
            for (p_k, mu_k, sigma_k) in zip(p, mu, sigma)
        )
        for x_n in data
    ) + penalty
end
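# Written out, the objective above is
#   L(p, mu, sigma) = -(2/N) * sum_n sum_k p[k] * N(x_n; mu[k], 2b + sigma[k]^2)
#                     + sum_{k,h} p[k] * p[h] * N(mu[k]; mu[h], 2b + sigma[k]^2 + sigma[h]^2)
# where N(x; m, v) denotes the normal pdf with mean m and variance v.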
function loss(params::AV{<:Real}, b::Real, data::AV{<:Real})
    par = get_params(params)
    loss(par.p, par.mu, par.sigma, b, data)
end
# ========== Minimization algorithms ==========
# Equality constraint for IPNewton: sum(p) == 1, written as c[1] = 1 - sum(p) == 0
function constraint_probability!(c::AV{<:Real}, params::AV{<:Real})
    K = length(params) ÷ 3
    p = @view params[1:K]
    c[1] = 1 - sum(p)
end
# Minimize using interior point method from Optim.jl
function minimize_optim(params0::AV{<:Real}, b::Real, data::AV{<:Real}; tol=1e-5)
    K = length(params0) ÷ 3
    # Box constraints on [weights; means; stds]:
    # stds can be anywhere in R because they're squared later
    lo_params = [zeros(K); fill(-Inf, K); fill(-Inf, K)]
    hi_params = [ones(K); fill(Inf, K); fill(Inf, K)]
    # Single equality constraint: 1 - sum(p) == 0
    lo_constr = hi_constr = [0.0]
    dfc = Optim.TwiceDifferentiableConstraints(
        constraint_probability!,
        lo_params, hi_params, lo_constr, hi_constr, :forward
    )
    res = Optim.optimize(
        params->loss(params, b, data), dfc,
        params0, Optim.IPNewton(),
        Optim.Options(
            successive_f_tol=20,
            f_tol=tol, f_reltol=tol
        ),
        autodiff=:forward
    )
    Optim.minimizer(res) |> get_params
end
# Minimize using exponentiated gradient descent
function minimize_expgrad(params0::AV{<:Real}, b::Real, data::AV{<:Real}; lr=1e-2, tol=1e-5, max_iter=1_000_000)
    K = length(params0) ÷ 3
    objective = params->loss(params, b, data)
    grad_cfg = ForwardDiff.GradientConfig(objective, params0)
    params = copy(params0)
    gradient = similar(params)
    # Sliding window of the last 50 objective values, used to detect convergence
    obj_values = fill(Inf, 50)
    for it in 1:max_iter
        # 1. Compute gradient (reusing the preallocated config)
        ForwardDiff.gradient!(gradient, objective, params, grad_cfg)
        # 2. Exponentiated gradient descent w.r.t. weights:
        #    multiplicative update, then renormalize back onto the simplex
        @. params[1:K] *= exp(-lr * gradient[1:K])
        params[1:K] ./= sum(params[1:K])
        # 3. Regular gradient descent w.r.t. other params
        @. params[K+1:end] -= lr * gradient[K+1:end]
        # 4. Track convergence: stop when the largest successive change in the window is below tol
        obj_values[1:end-1] .= @view obj_values[2:end]
        obj_values[end] = objective(params)
        metric = if all(isfinite, obj_values)
            @views maximum(
                abs(old - new)
                for (old, new) in zip(obj_values[1:end-1], obj_values[2:end])
            )
        else
            1e6
        end
        if metric < tol
            @info "Converged" it
            break
        end
    end
    get_params(params)
end
function random_params(rng, K::Integer)
    # Random weights on the simplex plus random means and stds
    p = rand(rng, K)
    p ./= sum(p)
    mu = randn(rng, K)
    sigma = rand(rng, K)
    [p; mu; sigma]
end
# ========== Run minimization ==========
rng = Random.MersenneTwister(42)
# 200 points from N(-3, 1) and 400 points from N(2, 0.3^2)
data = [randn(rng, 200) .- 3; 0.3 * randn(rng, 400) .+ 2]
params0 = random_params(rng, 2) #[0.5, 0.5, 0,0, 1e-3, 5e-3]
b = 0.01
@info "Minimizing w/ Optim..."
par_optim = minimize_optim(params0, b, data, tol=1e-6)
@show par_optim.p
@show par_optim.mu
@show par_optim.sigma
@show loss(par_optim, b, data)
println()
@info "Minimizing w/ exponentiated gradient descent..."
par_expgrad = minimize_expgrad(params0, b, data, tol=1e-6, lr=1e-3)
@show par_expgrad.p
@show par_expgrad.mu
@show par_expgrad.sigma
@show loss(par_expgrad, b, data)
The correct answer (the parameters of the mixture that generated the data, up to swapping the two components) is one of these:
(p=[0.33333, 0.66666], mu=[-3, 2], sigma=[1, 0.3])
(p=[0.66666, 0.33333], mu=[2, -3], sigma=[0.3, 1])
Results
- IPNewton produces correct results given any random initial parameters. Moreover, it does so fast, literally in the blink of an eye.
- EGD often converges to terrible points. It also takes a couple of seconds to converge. Given random initial points, it mostly converges to the correct points, but not as consistently or precisely as IPNewton.
Some examples of bad results from EGD:
julia> params0 = random_params(rng, 2)
6-element Vector{Float64}:
0.13901488648354637
0.8609851135164537
-1.5498971658168026
-1.9026076357002404
0.27366742133875555
0.9996700977843878
julia> minimize_expgrad(params0, b, data, tol=1e-6)
┌ Info: Converged
└ it = 19336
ComponentVector{Float64}(p = [8.126461586000774e-14, 0.9999999999999187], mu = [-1.5702887008520934, 0.5103155146607207], sigma = [0.42607377649097655, 3.0958258461104533])
This is terrible because p[1] is essentially 0, but it shouldn't be. Given the same initial point, Optim produces results that are correct to the 3rd decimal digit, which is fantastic:
julia> minimize_optim(params0, b, data, tol=1e-6)
ComponentVector{Float64}(p = [0.3330468090222882, 0.6669531909777119], mu = [-3.088148801196878, 1.9922352674701609], sigma = [0.9469738468169181, 0.3000181581983476])
Another one:
julia> params0 = random_params(rng, 2)
6-element Vector{Float64}:
0.34854447603600636
0.6514555239639936
-0.8641166349035045
1.136448930313033
0.45127735273921554
0.1342420503488576
julia> minimize_expgrad(params0, b, data, tol=1e-6)
┌ Info: Converged
└ it = 12080
ComponentVector{Float64}(p = [0.3460074950904417, 0.6539925049095583], mu = [-2.520558492806493, 1.9930655538825621], sigma = [1.4743322284971923, 0.2953434943385471])
julia> minimize_optim(params0, b, data, tol=1e-6)
ComponentVector{Float64}(p = [0.3330468076989589, 0.666953192301041], mu = [-3.0881488012483316, 1.99223526745685], sigma = [0.9469738440566653, 0.30001815864792564])
Here, EGD overestimated sigma[1] (1.47 instead of roughly 1), and mu[1] is also not quite correct. IPNewton is spot on, as always.
Questions repeated: is the interior point method just this good? Is my implementation of EGD wrong? Is gradient descent in general this bad for constrained optimization? Am I even allowed to use EGD here? I’ve never seen it used outside online learning.