'Optim' fails when supplying gradient and Hessian

Fairly new to Julia, but I am loving it. This forum is a GREAT support, but I can’t get my head around a problem I am experiencing right now. I am trying to run a Monte Carlo experiment for a variation of the standard multinomial logit model, using Optim for the non-linear optimization. The optimization works with automatic differentiation, but fails when I supply a user-written gradient and Hessian, despite many attempts. Here is an MWE.

julia> using LinearAlgebra, Optim, StatsFuns

# Write the data
Y = [[-20.0, -30.0, 9.0, -20.0, 19.0, 22.0],
    [7.0, -62.0, -16.0, -13.0, 27.0, -8.0],
    [58.0, 32.0, 6.0, -2.0, 26.0, 48.0],
    [53.0, 13.0, 41.0, 59.0, 6.0, 29.0],
    [-13.0, 8.0, -20.0, 2.0, -6.0, -7.0]]

# Express the log-likelihood function
function ℓ(β::Vector)
    return sum(logsumexp.(β .* Y) - β .* map(y -> y[1], Y))
end
function ∇ℓ!(β::Vector, s)
    s[1] = sum(dot.(softmax.(β .* Y), Y) - map(y -> y[1], Y))
end
function ∇²ℓ!(β::Vector, h)
    h[1] = sum(dot.(softmax.(β .* Y), map(y -> y .^ 2, Y)) - dot.(softmax.(β .* Y), Y) .^ 2)
end

This sets up the data, log-likelihood function, gradient and Hessian. Plotting these over an appropriate interval around zero reveals that they are well behaved and that together, they can help identify a minimum. Codewise, I am trying to do everything by the book. However, if I try to perform optimization I get this.

julia> O = optimize(ℓ, [1.0], method = LBFGS(); autodiff=:forward, show_trace=true)
Iter     Function value   Gradient norm 
     0     8.905359e+01     8.882755e+01
 * time: 0.0
     1     1.316398e+01     7.149078e+01
 * time: 0.0010001659393310547
     2     8.891946e+00     3.215174e+01
 * time: 0.002000093460083008
     3     8.492614e+00     8.647194e+00
 * time: 0.002000093460083008
     4     8.467783e+00     9.217495e-02
 * time: 0.003000020980834961
     5     8.467780e+00     1.786547e-06
 * time: 0.003000020980834961
     6     8.467780e+00     8.881784e-15
 * time: 0.003999948501586914
 * Status: success

 * Candidate solution
    Final objective value:     8.467780e+00

 * Found with
    Algorithm:     L-BFGS

 * Convergence measures
    |x - x'|               = 1.12e-09 ≰ 0.0e+00
    |x - x'|/|x'|          = 4.76e-08 ≰ 0.0e+00
    |f(x) - f(x')|         = 3.55e-15 ≰ 0.0e+00
    |f(x) - f(x')|/|f(x')| = 4.20e-16 ≰ 0.0e+00
    |g(x)|                 = 8.88e-15 ≤ 1.0e-08

 * Work counters
    Seconds run:   0  (vs limit Inf)
    Iterations:    6
    f(x) calls:    18
    ∇f(x) calls:   18


julia> Q = optimize(ℓ, ∇ℓ!, ∇²ℓ!, [1.0], method=LBFGS(), show_trace=true)
Iter     Function value   Gradient norm 
     0              NaN              NaN
 * time: 0.0
 * Status: failure

 * Candidate solution
    Final objective value:     NaN

 * Found with
    Algorithm:     L-BFGS

 * Convergence measures
    |x - x'|               = NaN ≰ 0.0e+00
    |x - x'|/|x'|          = NaN ≰ 0.0e+00
    |f(x) - f(x')|         = NaN ≰ 0.0e+00
    |f(x) - f(x')|/|f(x')| = NaN ≰ 0.0e+00
    |g(x)|                 = NaN ≰ 1.0e-08

 * Work counters
    Seconds run:   0  (vs limit Inf)
    Iterations:    0
    f(x) calls:    1
    ∇f(x) calls:   1

I have no idea re: what I am doing wrong and any help would be greatly appreciated.

I think I know what’s going on. In Julia, the convention for functions whose names end in a bang ! (which indicates mutation) is that the mutated arguments come first.
So instead of writing

function ∇ℓ!(β::Vector, s)
    s[1] = sum(dot.(softmax.(β .* Y), Y) - map(y -> y[1], Y))
end
function ∇²ℓ!(β::Vector, h)
    h[1] = sum(dot.(softmax.(β .* Y), map(y -> y .^ 2, Y)) - dot.(softmax.(β .* Y), Y) .^ 2)
end

you only need to switch the order of the arguments:

function ∇ℓ!(s, β::Vector)
    s[1] = sum(dot.(softmax.(β .* Y), Y) - map(y -> y[1], Y))
end
function ∇²ℓ!(h, β::Vector)
    h[1] = sum(dot.(softmax.(β .* Y), map(y -> y .^ 2, Y)) - dot.(softmax.(β .* Y), Y) .^ 2)
end

Does that work?


It works! Thank you so much! I don’t know how I could get confused like this. Though the time difference with the supplied gradient and Hessian is negligible… I don’t know whether this speaks better of Julia as a numerical solver or worse of my functions and approach! :sweat_smile:

If your input is low-dimensional, it’s gonna be hard to beat ForwardDiff. If your input is high-dimensional, you can probably do better with handcoded derivatives.
But regardless of whether you need manual gradients and Hessians, you can improve your objective function (and the others) by profiling. My best guess is that your code allocates a lot of new memory instead of reusing it. I think you can hunt down the allocations and make it much more efficient. Storing Y as a matrix instead of a vector of vectors might also help, and it should not be a global variable. See Performance Tips · The Julia Language
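For example, a minimal sketch of the “no globals” part (make_loglik and y1 are illustrative names, not from your code): close over the data and hoist the repeated map out of the objective.

# Close over the data instead of referencing the global Y, and compute
# the first elements map(y -> y[1], Y) once rather than on every call
function make_loglik(Y)
    y1 = map(y -> y[1], Y)
    return β -> sum(logsumexp.(β .* Y) - β .* y1)
end

ℓY = make_loglik(Y)
O = optimize(ℓY, [1.0], method = LBFGS(); autodiff = :forward)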

I think your computations for the loss, gradient and Hessian overlap each other quite a bit. I suspect it will be faster to write a single function that computes all three together; Optim allows that if I recall correctly (it should be described in Optim’s docs).

My best bet would be:
1° write a single function that computes all three (see the sketch below);
2° expand the loops to make them explicit and merge them across the loss, gradient and Hessian to avoid repeating computations;
3° try LoopVectorization.jl to parallelize the final loops.

If the Hessian is very costly, try to do the same with only the loss and gradient; it might be faster to let BFGS approximate the Hessian than to actually compute it (this depends on the problem…).
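Concretely, a minimal sketch of 1° for the MWE above, using the only_fgh! interface described in Optim’s docs (fgh! is an illustrative name; note that LBFGS ignores the Hessian, so you would pair this with Newton()):

# One function computing objective, gradient and Hessian together,
# sharing the softmax term; follows Optim's fgh!(F, G, H, x) convention
function fgh!(F, G, H, β)
    P = softmax.(β .* Y)    # shared between all three quantities
    m = dot.(P, Y)
    if G !== nothing
        G[1] = sum(m - map(y -> y[1], Y))
    end
    if H !== nothing
        H[1, 1] = sum(dot.(P, map(y -> y .^ 2, Y)) - m .^ 2)
    end
    if F !== nothing
        return sum(logsumexp.(β .* Y) - β .* map(y -> y[1], Y))
    end
    return nothing
end

Q = optimize(Optim.only_fgh!(fgh!), [1.0], Newton())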


Thanks for the tips. I’ve been reading the guide linked by @gdalle yesterday before going to sleep and have tried to implement the main hints to the best of my ability. Here is the current state of my Monte Carlo:

# Load packages
using Optim, StatsFuns

# Pre-allocate a vector to store the estimates (whose length equals the number of replications)
b = zeros(2)

# Some global variables and function definitions here
[...]

# Express the negative log-likelihood function
function ℓ(X::Vector{Vector{Float64}}, β)
    return sum(logsumexp.(β.*X) - β.*map(x -> x[1], X))
end

# # Express the gradient and Hessian
# function ∇ℓ!(s::Vector{Float64}, β::Vector{Float64})
#     s[1] = sum(dot.(softmax.(β .* X), X) - map(x -> x[1], X))
# end
# function ∇²ℓ!(h::Matrix{Float64}, β::Vector{Float64})
#     h[1, 1] = sum(dot.(softmax.(β .* X), map(x -> x .^ 2, X)) - dot.(softmax.(β .* X), X) .^ 2)
# end

# Write a large loop that performs the whole Monte Carlo
for r ∈ eachindex(b)
    # Set up the shared technological parameters
    Iᵣ = technology(λ₀, 𝑆, f)
    # Create the vectors of interest
    Xᵣ = repeat(network(β₀, σ₀, θ₀, Iᵣ, f), 𝑇)
    # Provide the estimate of interest
    b[r] = optimize(β -> ℓ(Xᵣ,β), [β₀], method=BFGS(); autodiff=:forward).minimizer[1]
end

I can expand on what the functions technology() and network() or the global variables in my code mean, but basically they generate the data supplied to the loop in the form of vectors-of-vectors (yes, I know matrices would be preferable, but the lengths of these vectors may vary). I have two problems.

  1. First, the code set up this way works with automatic differentiation so long as I do not declare a type for β in the main likelihood function. For example, if I declare β::Vector{Float64}, as I do in the currently commented-out gradient and Hessian, I get a type/method error at the optimize() stage in the loop. This seems to go against one of the principles from the guide: always declare input types if possible.

  2. I am trying to follow @lrnv’s advice about avoiding duplicate calculations, following the hints here. This should be somewhat manageable. The problem is that, as far as I understand, I need to provide a “double closure”: one over the data inputs X, and one producing the f, g! and h! functions that supply the objective, gradient and Hessian to optimize(). This seems rather delicate, so any guide or example on this would be great.
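For concreteness, here is my current attempt at such a double closure, combining the fused-fgh! pattern above with a closure over the data, in case it helps pinpoint what I am doing wrong (make_fgh is a made-up name, and I am not sure this is idiomatic):

using LinearAlgebra    # for dot

# Outer function closes over the data X; the inner fgh! follows
# Optim's fgh!(F, G, H, β) convention and fills G/H only when asked
function make_fgh(X)
    x1 = map(x -> x[1], X)    # first elements, computed once per dataset
    function fgh!(F, G, H, β)
        P = softmax.(β .* X)    # shared between gradient and Hessian
        m = dot.(P, X)
        if G !== nothing
            G[1] = sum(m - x1)
        end
        if H !== nothing
            H[1, 1] = sum(dot.(P, map(x -> x .^ 2, X)) - m .^ 2)
        end
        if F !== nothing
            return sum(logsumexp.(β .* X) - β .* x1)
        end
        return nothing
    end
    return Optim.only_fgh!(fgh!)
end

# Then, inside the Monte Carlo loop:
# b[r] = optimize(make_fgh(Xᵣ), [β₀], Newton()).minimizer[1]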