# Using gradient function with Optim

I am taking my first steps with Julia, trying to use Optim.jl to estimate a probit model. I have adapted some code I found:

``````
using DataFrames
using Distributions
using Optim

# -------- Normal distribution functions; prevent exact 0.0 and 1.0 --------

function normcdf(x::Vector{Float64})
    out = Distributions.cdf(Distributions.Normal(), x)
    out + (out .== 0.0) * eps(1.0) - (out .== 1.0) * eps(1.0)
end

function normpdf(x::Vector{Float64})
    Distributions.pdf(Distributions.Normal(), x)
end

# -------------------- Likelihood and gradient --------------------

function λ(β::Vector{Float64})
    q = 2d - 1
    q .* normpdf(q .* X * β) ./ normcdf(q .* X * β)
end

function LL(β::Vector{Float64})
    -sum(log.(normcdf((2d - 1) .* X * β)))
end

function g!(storage, β)
    storage[:] = -sum(λ(β) .* X, 1)
end

# ------------------------ Probit DGP ------------------------

N     = 10^5
k     = 4
u     = rand(Normal(), N)
X     = [ones(N) reshape(rand(Uniform(-1, 1), N * (k - 1)), (N, k - 1))]
βTrue = [2.0, 3, 5, 10]
y     = X * βTrue + u
d     = 1.0 * (y .>= 0)
``````

I then run this, first without supplying the gradient function:

``````
res1 = Optim.optimize(LL, zeros(k), method=BFGS())
``````

which works fine, but when I supply the gradient function:

``````
res2 = Optim.optimize(LL, g!, zeros(k), method=BFGS())
``````

it reports convergence despite not iterating, and the results are clearly incorrect. I presume I have made a simple mistake, but I can’t see it.

Have you tried checking your gradient? You can either do this manually (check that f(x + εh) ≈ f(x) + ε⟨g, h⟩ for small ε), or use Calculus.jl or one of the automatic differentiation packages. It looks like the bottleneck of your function is a matmul, so ReverseDiff.jl should be pretty efficient.
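The manual check described above can be sketched on a toy objective like this (the names `ftoy`, `gtoy!`, `x0`, and `h` are illustrative, not from the code in the question):

``````
# Toy objective with known gradient 2x, used to illustrate the check
ftoy(x) = sum(x .^ 2)

# Analytic gradient in Optim's mutating style (gradient buffer first)
function gtoy!(storage, x)
    storage .= 2 .* x
end

x0 = [1.0, -2.0, 3.0]
h  = randn(length(x0)); h ./= sqrt(sum(h .^ 2))  # random unit direction
g  = similar(x0); gtoy!(g, x0)

ε   = 1e-6
lhs = ftoy(x0 .+ ε .* h)           # f(x + εh)
rhs = ftoy(x0) + ε * sum(g .* h)   # f(x) + ε⟨g, h⟩

# If the gradient is correct, lhs and rhs agree up to O(ε²)
@assert abs(lhs - rhs) < 1e-8
``````

If the analytic gradient were wrong, the difference would shrink like O(ε) rather than O(ε²), which this check would catch.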

Thanks for the tip. For res1 (the successful run) I ran the following:

``````
q = Vector{Float64}(4)
g!(q, res1.minimizer)
``````

which showed that the gradient calculated by `g!` looks right:

``````
1×4 Array{Float64,2}:
-1.25987e-7  5.45146e-8  5.98943e-8  1.43703e-7
``````

I should say, I’m attempting this as a learning exercise. I just want to understand how to supply the gradient function so I can apply this understanding in a different (more complicated) model. Since I will apply that to a bigger dataset, I want the speed that analytical derivatives allow.

Incidentally, how do you see that the bottleneck is a matmul (and what is a matmul)? Apologies if these are basic questions; I’m not from a CS background.

You’re probably on Optim version 0.7.8 or lower. Back then you needed

``````
function g!(β, storage)
    storage[:] = -sum(λ(β) .* X, 1)
end
``````

The new versions of Optim (0.9.x) follow the usual Julia style, where you mutate the first argument.
So either switch the order of arguments as above, or upgrade Optim to the latest version.
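For reference, the mutate-first style on Optim 0.9+ looks like this; a minimal sketch on a toy quadratic objective (the names `ftoy` and `gtoy!` are illustrative, not from the question):

``````
using Optim

ftoy(x) = sum(x .^ 2)

# In Optim 0.9+ the gradient buffer is the FIRST argument (mutate-first style)
function gtoy!(storage, x)
    storage .= 2 .* x
end

res = optimize(ftoy, gtoy!, ones(2), BFGS())
``````

With the arguments in this order, `optimize` uses the analytic gradient instead of falling back to finite differences.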

As a warning: if you downloaded Julia via JuliaPro, odds are you’ll have a hard time upgrading.

Switching the arguments solved it - thank you! I was indeed using JuliaPro.