Improving performance of a number of functions for large-scale gradient-based optimization

Hello, I am trying to run a large-scale maximum likelihood estimation. Luckily, I have analytical gradients and Hessians, so I can speed up my optimization routine by using derivative-based methods.

However, my gradient and my Hessian are very slow to compute. I would very much appreciate any tips on how to improve things on this front. See the MWE below.

The llike function has already been improved in the thread Improving Performance of a Loop. However, the gradient and the Hessian are still annoyingly slow.

EDIT: I was able to improve the performance of the gradient and updated the MWE. The hess function is still very slow. My guess is that this is because of the outer product.

using Random, Optim
using LinearAlgebra
using BenchmarkTools


###############################################################
# Loglikelihood
###############################################################

const J = 75
const Ι = 200
const N = 10000
const Mat1 = ones(Int8,Ι,N)
const Mat2 = ones(Int8,J,N)
const Mat3 =  rand(Ι,J,N) .>0.9
NumberParams = Ι 
const τ = rand(Ι,J)

#NumberParams = 2

# Independent Variables Matrix

XMat = zeros(Ι+1,Ι,J)

for j = 1 : J
    XMat[:,:,j]  = vcat(τ[:,j]',Matrix(I,Ι,Ι))
end    


XMat = XMat[1:end-1,:,:]


@views function Log_Likelihood(N,Ι,J,τ,Mat1,Mat2,Mat3,params) 
    llike = 0.0
    θ = params[1]
    γ = params[2:end]
    γ = vcat(γ,1.0) 


    log_vals = -θ.*τ .+ γ
    vals = exp.(log_vals)


    cond = sum(Mat1,dims=1)'


    @fastmath @inbounds  for n = 1 : N   
                if cond[n]>1
                    for j = 1 : J
                        denom_index = 0.0 
                        if Mat2[j,n] == 1
                                denom_index = Mat1[:,n]'*vals[:,j]
                            for i = 1 : Ι
                                llike += Mat3[i,j,n]*Mat2[j,n]*Mat1[i,n]*(log_vals[i,j]  - log(denom_index))
                                #llike_test[i,j,n] = Mat3[i,j,n]*Mat2[j,n]*Mat1[i,n]*(log_vals[i,j]  - log(denom_index))
                            end    
                        end  
                    end    
                end      
    end    

    llike = -llike

    return  llike

end


@views function grad(N,Ι,J,τ,Mat1,Mat2,Mat3,xvec,storage,params) 
    θ = params[1]
    γ = params[2:end]
    γ = vcat(γ,1.0) 
    vec = 0 .*ones(size(params))


    log_vals = θ.*τ .+ γ
    vals = exp.(log_vals)

    cond = sum(Mat1,dims=1)'

    @fastmath @inbounds  for n = 1 : N   
                if cond[n]>1
                    for j = 1 : J
                        denom_index = 0.0 
                        if Mat2[j,n] == 1
                                denom_index = Mat1[:,n]'*vals[:,j]
                            for i = 1 : Ι
                                vec += (Mat2[j,n]*Mat1[i,n]*(Mat3[i,j,n].-vals[i,j]./denom_index).*xvec[:,i,j])
                            end    
                        end  
                    end    
                end      
    end    

    storage .= -vec

    return  storage

end


@views function hess(N,Ι,J,τ,Mat1,Mat2,Mat3,xvec,storage,params) 
    θ = params[1]
    γ = params[2:end]
    γ = vcat(γ,1.0) 
    mat = 0 .*ones(size(params)[1],size(params)[1])


    log_vals = θ.*τ .+ γ
    vals = exp.(log_vals)

    cond = sum(Mat1,dims=1)'

    pvals = zeros(Ι,J,N)

    @fastmath @inbounds  for n = 1 : N   
                if cond[n]>1
                    for j = 1 : J
                        denom_index = 0.0 
                        if Mat2[j,n] == 1
                                denom_index = Mat1[:,n]'*vals[:,j]
                            for i = 1 : Ι
                                pvals[i,j,n] = Mat2[j,n]*Mat1[i,n]*vals[i,j]/denom_index
                            end    
                        end  
                    end    
                end      
    end 

    @fastmath @inbounds  for n = 1 : N   
                if cond[n]>1
                    for j = 1 : J
                        if Mat2[j,n] == 1
                                xbar = xvec[:,:,j]*pvals[:,j,n]
                            for i = 1 : Ι
                                mat += (Mat2[j,n]*Mat1[i,n]*pvals[i,j,n].*(xvec[:,i,j]-xbar)*(xvec[:,i,j]-xbar)')
                            end    
                        end  
                    end    
                end      
    end   
    
    storage .= mat

    return  storage

end

f(x) = Log_Likelihood(N,Ι,J,τ,Mat1,Mat2,Mat3,x) 
g!(storage,x) = grad(N,Ι,J,τ,Mat1,Mat2,Mat3,XMat,storage,x) 
h!(storage,x) = hess(N,Ι,J,τ,Mat1,Mat2,Mat3,XMat,storage,x) 





XXX = @time Log_Likelihood(N,Ι,J,τ,Mat1,Mat2,Mat3,vcat(7.0,ones(NumberParams-1)))

XXX = @time grad(N,Ι,J,τ,Mat1,Mat2,Mat3,XMat,zeros(NumberParams),vcat(7.0,ones(NumberParams-1)))

XXX = @time hess(N,Ι,J,τ,Mat1,Mat2,Mat3,XMat,zeros(NumberParams,NumberParams),vcat(7.0,ones(NumberParams-1)))




func = TwiceDifferentiable(vars -> Log_Likelihood(N,Ι,J,τ,Mat1,Mat2,Mat3,vars[1:NumberParams]),
                           vcat(7.0,ones(NumberParams-1)); autodiff=:forward);


#opt1 = @time Optim.optimize(func, vcat(0.0,0.0),show_trace = true)

opt1 = @time Optim.optimize(func, vcat(7.0,ones(NumberParams-1)),show_trace = true)

#opt1 = @time Optim.optimize(func, vcat(0.0,0.0),show_trace = true)

opt1 = @time Optim.optimize(func, ones(NumberParams),show_trace = true)

#opt_explicit = @time Optim.optimize(func_explicit, ones(NumberParams), Optim.Newton())

opt_explicit = Optim.optimize(f, g!, h!, ones(NumberParams), Optim.Newton(),
                      Optim.Options(show_trace=true, iterations = 10))

I think this computes the same log 200 times; it should probably be computed outside the for i loop. The Mat1[:,n]'*vals[:,j] is inside @views, so it is probably OK.
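For concreteness, here is a sketch of that hoist inside Log_Likelihood (only the inner block is shown; the log_denom name is mine):

                if Mat2[j,n] == 1
                    denom_index = Mat1[:,n]'*vals[:,j]
                    log_denom = log(denom_index)   # computed once per (j, n) instead of once per i
                    for i = 1 : Ι
                        llike += Mat3[i,j,n]*Mat2[j,n]*Mat1[i,n]*(log_vals[i,j] - log_denom)
                    end
                end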

The hess function also allocates quite a few arrays. I think you can fix the inner loop with @. mat += ... and no other .s. Less importantly, you can replace the xbar = xvec[:,:,j]*pvals[:,j,n] line above it with mul!.
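A sketch of what that could look like for the second loop in hess, assuming xbar and xdiff are preallocated once before the n loop (both buffer names are mine, mul! comes from LinearAlgebra which the MWE already loads, and I have written the dots out explicitly rather than using @.; I also left out @fastmath):

    xbar  = zeros(Ι)
    xdiff = zeros(Ι)
    @inbounds for n = 1 : N
        if cond[n] > 1
            for j = 1 : J
                if Mat2[j,n] == 1
                    mul!(xbar, xvec[:,:,j], pvals[:,j,n])   # in-place matrix-vector product
                    for i = 1 : Ι
                        xdiff .= xvec[:,i,j] .- xbar        # fused into the buffer, no temporary vector
                        # fused in-place rank-1 update: no Ι×Ι temporaries are allocated
                        mat .+= (Mat2[j,n]*Mat1[i,n]*pvals[i,j,n]) .* xdiff .* xdiff'
                    end
                end
            end
        end
    end

Since hess is declared with @views, the slices above are views rather than copies.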


Looks like @fastmath is pessimizing your code considerably. Juno profiler shows allocations at

    llike += Mat3[i,j,n]*Mat2[j,n]*Mat1[i,n]*(log_vals[i,j]  - log(denom_index))

Here is a test:

using Random, Optim
using LinearAlgebra
using BenchmarkTools

@views function Log_Likelihood(N,Ι,J,τ,Mat1,Mat2,Mat3,params)
    llike = 0.0
    θ = params[1]
    γ = params[2:end]
    γ = vcat(γ,1.0)

    log_vals = -θ.*τ .+ γ
    vals = exp.(log_vals)

    cond = sum(Mat1,dims=1)'
    for n = 1 : N
        if cond[n]>1
            for j = 1 : J
                denom_index = 0.0
                if Mat2[j,n] == 1
                    denom_index = Mat1[:,n]'*vals[:,j]
                    for i = 1 : Ι
                        llike += Mat3[i,j,n]*Mat2[j,n]*Mat1[i,n]*(log_vals[i,j] - log(denom_index))
                    end
                end
            end
        end
    end
    return -llike
end

@views function fastmath_Log_Likelihood(N,Ι,J,τ,Mat1,Mat2,Mat3,params)
    llike = 0.0
    θ = params[1]
    γ = params[2:end]
    γ = vcat(γ,1.0)

    log_vals = -θ.*τ .+ γ
    vals = exp.(log_vals)

    cond = sum(Mat1,dims=1)'
    @fastmath for n = 1 : N
        if cond[n]>1
            for j = 1 : J
                denom_index = 0.0
                if Mat2[j,n] == 1
                    denom_index = Mat1[:,n]'*vals[:,j]
                    for i = 1 : Ι
                        llike += Mat3[i,j,n]*Mat2[j,n]*Mat1[i,n]*(log_vals[i,j] - log(denom_index))
                    end
                end
            end
        end
    end
    return -llike
end

function test(J,N,Ι,show_trace)
    Mat1 = ones(Int8,Ι,N)
    Mat2 = ones(Int8,J,N)
    Mat3 =  rand(Ι,J,N) .>0.9
    NumberParams = Ι
    τ = rand(Ι,J)

    # Independent Variables Matrix

    XMat = zeros(Ι+1,Ι,J)
    for j = 1 : J
        XMat[:,:,j]  = vcat(τ[:,j]',Matrix(I,Ι,Ι))
    end
    XMat = XMat[1:end-1,:,:]

    func = TwiceDifferentiable(vars -> Log_Likelihood(N,Ι,J,τ,Mat1,Mat2,Mat3,vars[1:NumberParams]),
                               vcat(7.0,ones(NumberParams-1)); autodiff=:forward);
    Optim.optimize(func, vcat(7.0,ones(NumberParams-1)),show_trace=show_trace)
end

function fastmath_test(J,N,Ι,show_trace)
    Mat1 = ones(Int8,Ι,N)
    Mat2 = ones(Int8,J,N)
    Mat3 =  rand(Ι,J,N) .>0.9
    NumberParams = Ι
    τ = rand(Ι,J)

    # Independent Variables Matrix

    XMat = zeros(Ι+1,Ι,J)
    for j = 1 : J
        XMat[:,:,j]  = vcat(τ[:,j]',Matrix(I,Ι,Ι))
    end
    XMat = XMat[1:end-1,:,:]

    func = TwiceDifferentiable(vars -> fastmath_Log_Likelihood(N,Ι,J,τ,Mat1,Mat2,Mat3,vars[1:NumberParams]),
                               vcat(7.0,ones(NumberParams-1)); autodiff=:forward);
    Optim.optimize(func, vcat(7.0,ones(NumberParams-1)),show_trace=show_trace)
end

@btime fastmath_test(75, 100, 2, false)
@btime fastmath_test(75, 1000, 20, false)
@btime test(75, 100, 2, false)
@btime test(75, 1000, 20, false)

showing

  18.041 ms (495678 allocations: 44.25 MiB)
  182.590 s (792001067 allocations: 413.94 GiB)
  3.462 ms (678 allocations: 316.09 KiB)
  5.310 s (1033 allocations: 75.21 MiB)

But it seems this is still not enough to get an answer to the original problem in reasonable time.

Maybe this issue is related.

I have not profiled anything here, but it is common that computing the Hessian is not worth the effort. Try a quasi-Newton method instead.


That sounds like an understatement. The test

using Random, Optim
using LinearAlgebra
using BenchmarkTools

@views function Log_Likelihood(N,Ι,J,τ,Mat1,Mat2,Mat3,params)
    llike = 0.0
    θ = params[1]
    γ = params[2:end]
    γ = vcat(γ,1.0)

    log_vals = -θ.*τ .+ γ
    vals = exp.(log_vals)

    cond = sum(Mat1,dims=1)'
    for n = 1 : N
        if cond[n]>1
            for j = 1 : J
                denom_index = 0.0
                if Mat2[j,n] == 1
                    denom_index = Mat1[:,n]'*vals[:,j]
                    for i = 1 : Ι
                        llike += Mat3[i,j,n]*Mat2[j,n]*Mat1[i,n]*(log_vals[i,j] - log(denom_index))
                    end
                end
            end
        end
    end
    return -llike
end

function test(J,N,Ι,show_trace)
    Mat1 = ones(Int8,Ι,N)
    Mat2 = ones(Int8,J,N)
    Mat3 =  rand(Ι,J,N) .>0.9
    NumberParams = Ι
    τ = rand(Ι,J)

    func = OnceDifferentiable(vars -> Log_Likelihood(N,Ι,J,τ,Mat1,Mat2,Mat3,vars[1:NumberParams]),
                              vcat(7.0,ones(NumberParams-1)); autodiff=:forward)
    Optim.optimize(func, vcat(7.0,ones(NumberParams-1)),show_trace=show_trace)
end

@btime test(75, 100, 2, false)
@btime test(75, 1000, 20, false)
@time test(75, 10000, 200, true)

shows

  4.082 ms (313 allocations: 266.62 KiB)
  1.710 s (922 allocations: 30.22 MiB)
Iter     Function value   Gradient norm
     0     1.025838e+08     5.333363e+06
 * time: 0.020000219345092773
     1     8.005156e+07     1.214639e+06
 * time: 391.87900018692017
     2     7.944264e+07     4.315082e+03
 * time: 479.1490001678467
     3     7.944149e+07     3.477603e+04
 * time: 652.2050001621246
     4     7.944046e+07     9.562177e+01
 * time: 780.7230000495911
     5     7.944046e+07     3.279957e+02
 * time: 911.2210001945496
     6     7.944046e+07     4.497560e-01
 * time: 1038.9860000610352
     7     7.944046e+07     1.462263e+01
 * time: 1256.3560001850128
     8     7.944046e+07     1.006288e+01
 * time: 1435.6440000534058
     9     7.944046e+07     2.215416e-01
 * time: 1607.3940000534058
    10     7.944046e+07     3.465821e-02
 * time: 1693.7860000133514
    11     7.944046e+07     8.896165e-06
 * time: 1779.3620002269745
    12     7.944046e+07     8.919999e-06
 * time: 1951.7210001945496
    13     7.944046e+07     1.799785e-07
 * time: 2079.3300001621246
    14     7.944046e+07     7.276142e-08
 * time: 2210.949000120163
    15     7.944046e+07     1.857133e-08
 * time: 2339.0380001068115
    16     7.944046e+07     1.076338e-08
 * time: 2424.7460000514984
    17     7.944046e+07     3.543048e-09
 * time: 2552.7780001163483
2596.754130 seconds (4.69 M allocations: 4.530 GiB, 0.02% gc time, 0.04% compilation time)
 * Status: success

 * Candidate solution
    Final objective value:     7.944046e+07

 * Found with
    Algorithm:     L-BFGS

 * Convergence measures
    |x - x'|               = 1.32e-13 ≰ 0.0e+00
    |x - x'|/|x'|          = 1.31e-13 ≰ 0.0e+00
    |f(x) - f(x')|         = 0.00e+00 ≤ 0.0e+00
    |f(x) - f(x')|/|f(x')| = 0.00e+00 ≤ 0.0e+00
    |g(x)|                 = 3.54e-09 ≤ 1.0e-08

 * Work counters
    Seconds run:   2553  (vs limit Inf)
    Iterations:    17
    f(x) calls:    61
    ∇f(x) calls:   61

One more question for the OP: did you test your MWE? I see a couple of issues with it on Julia 1.6.3.

Thanks so much for the feedback @goerch, @mcabbott and @lmiq. Indeed, not computing the Hessian and removing @fastmath from the loop makes the optimization fast enough.

I guess it would still be better to improve the hess function so that both the gradient and the Hessian can be passed explicitly to the optimizer, but either using automatic differentiation for the gradient only or passing the explicit gradient only seems to do the job.
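For reference, the two setups I am referring to look roughly like this, reusing the f and g! closures from the MWE above (just a sketch; the variable names and options are illustrative):

# Option 1: automatic differentiation for the gradient only, no Hessian
func_g = OnceDifferentiable(vars -> Log_Likelihood(N,Ι,J,τ,Mat1,Mat2,Mat3,vars[1:NumberParams]),
                            ones(NumberParams); autodiff = :forward)
opt_ad = Optim.optimize(func_g, ones(NumberParams), Optim.LBFGS(),
                        Optim.Options(show_trace = true))

# Option 2: pass the analytical gradient explicitly and let L-BFGS approximate the Hessian
opt_g = Optim.optimize(f, g!, ones(NumberParams), Optim.LBFGS(),
                       Optim.Options(show_trace = true))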


Hm, I still see two questions for the specialists:

  • What is happening here with @fastmath? I’d suspect some kind of complexity limit for supported operations…
  • What performance problems should we expect when using Hessians (with or without autodiff)? I didn't even see the first iteration with TwiceDifferentiable, and I waited quite some time…

Also, do you see any way to improve the explicit Hessian function that I am constructing?

Really, I didn’t look at the problem here at all. But Hessians require the computation of n^2 components (where n is the number of variables). Unless you know that it has some sparsity pattern, that has a cost of, well, O(n^2). The gradient is O(n).

For problems with many variables the computation of the Hessian is prohibitive. Quasi-Newton methods use approximate Hessians that are computed in O(n) time and require O(n) storage, thus the iterations are faster. Of course the steps are less accurate, since the true Hessian provides more information about the true objective function, but how much of a difference that makes is problem dependent, and if the functions are not roughly convex the true Hessian is not that useful. In my experience there are not many problems in which the computation of the Hessian is worth the effort (but there may be whole classes of problems in which it is).


That is what I saw! Thank you very much for the explanation.