# Improving performance of a number of functions for large scale gradient based optimization

Hello, I am trying to conduct a large-scale Maximum Likelihood Estimation. Luckily, I have analytical gradients and Hessians and therefore I can speed up my optimization routine by using derivative based methods.

However, my gradient and my Hessian are very slow to compute. I would appreciate very much if you could provide tips on how to improve things on this front. See MWE below

The `llike` function has been improved here Improving Performance of a Loop . However, the gradient and the Hessian are still annoyingly slow.

EDIT: I was able to improve the performance of the gradient and updated the MWE. The `hess` function is still very slow. My guess is that this is because of the outer product.

``````using Random, Optim
using LinearAlgebra
using BenchmarkTools

###############################################################
# Loglikelihood
###############################################################

const J = 75
const Ι = 200
const N = 10000
const Mat1 = ones(Int8,Ι,N)
const Mat2 = ones(Int8,J,N)
const Mat3 =  rand(Ι,J,N) .>0.9
NumberParams = Ι
const τ = rand(Ι,J)

#NumberParams = 2

# Independent Variables Matrix

XMat = zeros(Ι+1,Ι,J)

for j = 1 : J
XMat[:,:,j]  = vcat(τ[:,j]',Matrix(I,Ι,Ι))
end

XMat = XMat[1:end-1,:,:]

@views function Log_Likelihood(N,Ι,J,τ,Mat1,Mat2,Mat3,params)
llike = 0.0
θ = params
γ = params[2:end]
γ = vcat(γ,1.0)

log_vals = -θ.*τ .+ γ
vals = exp.(log_vals)

cond = sum(Mat1,dims=1)'

@fastmath @inbounds  for n = 1 : N
if cond[n]>1
for j = 1 : J
denom_index = 0.0
if Mat2[j,n] == 1
denom_index = Mat1[:,n]'*vals[:,j]
for i = 1 : Ι
llike += Mat3[i,j,n]*Mat2[j,n]*Mat1[i,n]*(log_vals[i,j]  - log(denom_index))
#llike_test[i,j,n] = Mat3[i,j,n]*Mat2[j,n]*Mat1[i,n]*(log_vals[i,j]  - log(denom_index))
end
end
end
end
end

llike = -llike

return  llike

end

θ = params
γ = params[2:end]
γ = vcat(γ,1.0)
vec = 0 .*ones(size(params))

log_vals = θ.*τ .+ γ
vals = exp.(log_vals)

cond = sum(Mat1,dims=1)'

@fastmath @inbounds  for n = 1 : N
if cond[n]>1
for j = 1 : J
denom_index = 0.0
if Mat2[j,n] == 1
denom_index = Mat1[:,n]'*vals[:,j]
for i = 1 : Ι
vec += (Mat2[j,n]*Mat1[i,n]*(Mat3[i,j,n].-vals[i,j]./denom_index).*xvec[:,i,j])
end
end
end
end
end

storage .= -vec

return  storage

end

@views function hess(N,Ι,J,τ,Mat1,Mat2,Mat3,xvec,storage,params)
θ = params
γ = params[2:end]
γ = vcat(γ,1.0)
mat = 0 .*ones(size(params),size(params))

log_vals = θ.*τ .+ γ
vals = exp.(log_vals)

cond = sum(Mat1,dims=1)'

pvals = zeros(Ι,J,N)

@fastmath @inbounds  for n = 1 : N
if cond[n]>1
for j = 1 : J
denom_index = 0.0
if Mat2[j,n] == 1
denom_index = Mat1[:,n]'*vals[:,j]
for i = 1 : Ι
pvals[i,j,n] = Mat2[j,n]*Mat1[i,n]*vals[i,j]/denom_index
end
end
end
end
end

@fastmath @inbounds  for n = 1 : N
if cond[n]>1
for j = 1 : J
if Mat2[j,n] == 1
xbar = xvec[:,:,j]*pvals[:,j,n]
for i = 1 : Ι
mat += (Mat2[j,n]*Mat1[i,n]*pvals[i,j,n].*(xvec[:,i,j]-xbar)*(xvec[:,i,j]-xbar)')
end
end
end
end
end

storage .= mat

return  storage

end

f(x) = Log_Likelihood(N,Ι,J,τ,Mat1,Mat2,Mat3,x)
h!(storage,x) = hess(N,Ι,J,τ,Mat1,Mat2,Mat3,XMat,storage,x)

XXX = @time Log_Likelihood(N,Ι,J,τ,Mat1,Mat2,Mat3,vcat(7.0,ones(NumberParams-1)))

XXX = @time hess(N,Ι,J,τ,Mat1,Mat2,Mat3,zeros(NumberParams,NumberParams),vcat(7.0,ones(NumberParams-1)))

func = TwiceDifferentiable(vars -> Log_Likelihood(N,Ι,J,τ,Mat1,Mat2,Mat3,vars[1:NumberParams]),
vcat(7.0,ones(NumberParams-1)); autodiff=:forward);

#opt1 = @time Optim.optimize(func, vcat(0.0,0.0),show_trace = true)

opt1 = @time Optim.optimize(func, vcat(7.0,ones(NumberParams-1)),show_trace = true)

#opt1 = @time Optim.optimize(func, vcat(0.0,0.0),show_trace = true)

opt1 = @time Optim.optimize(func, ones(NumberParams),show_trace = true)

#opt_explicit = @time Optim.optimize(func_explicit, ones(NumberParams), Optim.Newton())

opt_explicit = Optim.optimize(f, g!, h!, ones(NumberParams), Optim.Newton(),
Optim.Options(show_trace=true, iterations = 10))

``````

I think this computes the same log 200 times, probably that should be outside the `for i` loop. The `Mat1[:,n]'*vals[:,j]` is inside `@views` so is probably OK.

This allocates quite a few arrays. I think you can fix the inner loop with `@. mat += ...` and no other `.`s. Less importantly, you can replace the line above with `mul!`.

1 Like

Looks like `@fastmath` is pessimizing your code considerably. Juno profiler shows allocations at

``````    llike += Mat3[i,j,n]*Mat2[j,n]*Mat1[i,n]*(log_vals[i,j]  - log(denom_index))
``````

Here is a test:

``````using Random, Optim
using LinearAlgebra
using BenchmarkTools

@views function Log_Likelihood(N,Ι,J,τ,Mat1,Mat2,Mat3,params)
llike = 0.0
θ = params
γ = params[2:end]
γ = vcat(γ,1.0)

log_vals = -θ.*τ .+ γ
vals = exp.(log_vals)

cond = sum(Mat1,dims=1)'
for n = 1 : N
if cond[n]>1
for j = 1 : J
denom_index = 0.0
if Mat2[j,n] == 1
denom_index = Mat1[:,n]'*vals[:,j]
for i = 1 : Ι
llike += Mat3[i,j,n]*Mat2[j,n]*Mat1[i,n]*(log_vals[i,j] - log(denom_index))
end
end
end
end
end
return -llike
end

@views function fastmath_Log_Likelihood(N,Ι,J,τ,Mat1,Mat2,Mat3,params)
llike = 0.0
θ = params
γ = params[2:end]
γ = vcat(γ,1.0)

log_vals = -θ.*τ .+ γ
vals = exp.(log_vals)

cond = sum(Mat1,dims=1)'
@fastmath for n = 1 : N
if cond[n]>1
for j = 1 : J
denom_index = 0.0
if Mat2[j,n] == 1
denom_index = Mat1[:,n]'*vals[:,j]
for i = 1 : Ι
llike += Mat3[i,j,n]*Mat2[j,n]*Mat1[i,n]*(log_vals[i,j] - log(denom_index))
end
end
end
end
end
return -llike
end

function test(J,N,Ι,show_trace)
Mat1 = ones(Int8,Ι,N)
Mat2 = ones(Int8,J,N)
Mat3 =  rand(Ι,J,N) .>0.9
NumberParams = Ι
τ = rand(Ι,J)

# Independent Variables Matrix

XMat = zeros(Ι+1,Ι,J)
for j = 1 : J
XMat[:,:,j]  = vcat(τ[:,j]',Matrix(I,Ι,Ι))
end
XMat = XMat[1:end-1,:,:]

func = TwiceDifferentiable(vars -> Log_Likelihood(N,Ι,J,τ,Mat1,Mat2,Mat3,vars[1:NumberParams]),
vcat(7.0,ones(NumberParams-1)); autodiff=:forward);
Optim.optimize(func, vcat(7.0,ones(NumberParams-1)),show_trace=show_trace)
end

function fastmath_test(J,N,Ι,show_trace)
Mat1 = ones(Int8,Ι,N)
Mat2 = ones(Int8,J,N)
Mat3 =  rand(Ι,J,N) .>0.9
NumberParams = Ι
τ = rand(Ι,J)

# Independent Variables Matrix

XMat = zeros(Ι+1,Ι,J)
for j = 1 : J
XMat[:,:,j]  = vcat(τ[:,j]',Matrix(I,Ι,Ι))
end
XMat = XMat[1:end-1,:,:]

func = TwiceDifferentiable(vars -> fastmath_Log_Likelihood(N,Ι,J,τ,Mat1,Mat2,Mat3,vars[1:NumberParams]),
vcat(7.0,ones(NumberParams-1)); autodiff=:forward);
Optim.optimize(func, vcat(7.0,ones(NumberParams-1)),show_trace=show_trace)
end

@btime fastmath_test(75, 100, 2, false)
@btime fastmath_test(75, 1000, 20, false)
@btime test(75, 100, 2, false)
@btime test(75, 1000, 20, false)
``````

showing

``````  18.041 ms (495678 allocations: 44.25 MiB)
182.590 s (792001067 allocations: 413.94 GiB)
3.462 ms (678 allocations: 316.09 KiB)
5.310 s (1033 allocations: 75.21 MiB)
``````

But it seems this is still not enough to get an answer to the original problem in reasonable time.

Maybe this issue is related.

I have not profiled anything here. But it is common that computing the hessians is not worthy. Try a quasi newton method, instead.

2 Likes

That sounds like an understatement. The test

``````using Random, Optim
using LinearAlgebra
using BenchmarkTools

@views function Log_Likelihood(N,Ι,J,τ,Mat1,Mat2,Mat3,params)
llike = 0.0
θ = params
γ = params[2:end]
γ = vcat(γ,1.0)

log_vals = -θ.*τ .+ γ
vals = exp.(log_vals)

cond = sum(Mat1,dims=1)'
for n = 1 : N
if cond[n]>1
for j = 1 : J
denom_index = 0.0
if Mat2[j,n] == 1
denom_index = Mat1[:,n]'*vals[:,j]
for i = 1 : Ι
llike += Mat3[i,j,n]*Mat2[j,n]*Mat1[i,n]*(log_vals[i,j] - log(denom_index))
end
end
end
end
end
return -llike
end

function test(J,N,Ι,show_trace)
Mat1 = ones(Int8,Ι,N)
Mat2 = ones(Int8,J,N)
Mat3 =  rand(Ι,J,N) .>0.9
NumberParams = Ι
τ = rand(Ι,J)

func = OnceDifferentiable(vars -> Log_Likelihood(N,Ι,J,τ,Mat1,Mat2,Mat3,vars[1:NumberParams]),
vcat(7.0,ones(NumberParams-1)); autodiff=:forward)
Optim.optimize(func, vcat(7.0,ones(NumberParams-1)),show_trace=show_trace)
end

@btime test(75, 100, 2, false)
@btime test(75, 1000, 20, false)
@time test(75, 10000, 200, true)
``````

shows

``````  4.082 ms (313 allocations: 266.62 KiB)
1.710 s (922 allocations: 30.22 MiB)
0     1.025838e+08     5.333363e+06
* time: 0.020000219345092773
1     8.005156e+07     1.214639e+06
* time: 391.87900018692017
2     7.944264e+07     4.315082e+03
* time: 479.1490001678467
3     7.944149e+07     3.477603e+04
* time: 652.2050001621246
4     7.944046e+07     9.562177e+01
* time: 780.7230000495911
5     7.944046e+07     3.279957e+02
* time: 911.2210001945496
6     7.944046e+07     4.497560e-01
* time: 1038.9860000610352
7     7.944046e+07     1.462263e+01
* time: 1256.3560001850128
8     7.944046e+07     1.006288e+01
* time: 1435.6440000534058
9     7.944046e+07     2.215416e-01
* time: 1607.3940000534058
10     7.944046e+07     3.465821e-02
* time: 1693.7860000133514
11     7.944046e+07     8.896165e-06
* time: 1779.3620002269745
12     7.944046e+07     8.919999e-06
* time: 1951.7210001945496
13     7.944046e+07     1.799785e-07
* time: 2079.3300001621246
14     7.944046e+07     7.276142e-08
* time: 2210.949000120163
15     7.944046e+07     1.857133e-08
* time: 2339.0380001068115
16     7.944046e+07     1.076338e-08
* time: 2424.7460000514984
17     7.944046e+07     3.543048e-09
* time: 2552.7780001163483
2596.754130 seconds (4.69 M allocations: 4.530 GiB, 0.02% gc time, 0.04% compilation time)
* Status: success

* Candidate solution
Final objective value:     7.944046e+07

* Found with
Algorithm:     L-BFGS

* Convergence measures
|x - x'|               = 1.32e-13 ≰ 0.0e+00
|x - x'|/|x'|          = 1.31e-13 ≰ 0.0e+00
|f(x) - f(x')|         = 0.00e+00 ≤ 0.0e+00
|f(x) - f(x')|/|f(x')| = 0.00e+00 ≤ 0.0e+00
|g(x)|                 = 3.54e-09 ≤ 1.0e-08

* Work counters
Seconds run:   2553  (vs limit Inf)
Iterations:    17
f(x) calls:    61
∇f(x) calls:   61
``````

One more question to OP: did you test your MWE? I see a couple of issues with it on Julia 1.6.3

Thanks so much for the feedback @goerch @mcabbott and @lmiq . Indeed not computing Hessians and removing the `fastmath` from the loop makes the optimization fast enough.

I guess that it would be still better to improve the `hess` function to be able to pass both functions explicitly to the program, but either using automatic differentiation for only the gradient or passing the explicit gradient only seem to be doing the work.

1 Like

Hm, I still see two questions for the specialists:

• What is happening here with `@fastmath`? I’d suspect some kind of complexity limit for supported operations…
• Which performance problems do we have to expect using Hessians (with or without autodiff)? Because I didn’t even see the first iteration with `TwiceDifferentiable` and I waited quite some time…

Also, do you see anyway to improve the explicit Hessian function that I am constructing.

Really, I didn’t look at the problem here at all. But Hessians require the computation of n^2 components (where n is the number of variables). Unless you know that it has some sparsity pattern, that has a cost of, well, O(n^2). The gradient is O(n).

For problems with many variables the computation of the hessian is prohibitive. Quasi-newton methods use approximate hessians that are computed in O(n) time and require O(n) storage, thus the iterations are faster. Of course the steps are less accurate, since the true hessian provides more information about the true objective function, but how much that makes a different is problem dependent, and if the functions are not roughly convex the true hessian is not that useful. In my experience there are not many problems in which the computation of the hessian is worth the effort (but there may be whole classes of problems in which it is).

1 Like

That is what I saw! Thank you very much for the explanation.