Suggestions needed: speed up optimization

Agreed. After reading the documentation and discussions I realized this was an issue, but I haven’t figured out how to get it to work :frowning:

This will allocate an array on the heap each time it is called, which is bad for performance.

Could you just update negloglike inside the loop? That way you don’t need to worry about the prob vector at all.

Have you considered non-adaptive numerical integration? In this case I see two possible problems with adaptive quadrature. 1) There are a sum and log transforms after it, so the effect of rtol on the final result is not clear to me. 2) Since you are using somewhat low precision (rtol=0.001), choice_prob_naive might have spurious jumps if the adaptation changes the quadrature points in a discontinuous manner. This could be bad for optimization.

What do you mean by “update negloglike inside the loop”? Compute it element by element in a loop? Doesn’t that require pre-allocating and give me the trouble that Tamas_Papp pointed out?

I replaced rtol with f_abstol. Would this help with the issue?

I mean something like this

function loglike(θ, r, x, y, R, c, d, p0, B)
  negloglike = 0.0
  for i = 1:size(B,1)
    prob = choice_prob_naive(θ, r[i], x[i,:], y[i,:], R[i], c[i], d[i], p0[i])
    negloglike -= B[i]*log(prob) + (1 - B[i])*log(1 - prob)
  end
  return negloglike
end

I think the same issues apply to any kind of adaptive quadrature if it is necessary to use a largish tolerance.

I recommend checking
https://github.com/JuliaMath/QuadGK.jl#gaussian-quadrature-and-arbitrary-weight-functions
for non-adaptive Gaussian quadrature.
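
For example, here is a minimal sketch of a fixed-order rule using QuadGK’s gauss (the integrand f and the interval [a, b] below are just placeholders): gauss(N) returns N nodes and weights on [-1, 1], which only need to be computed once and can be rescaled to any finite interval.

using QuadGK

# N-point Gauss-Legendre nodes and weights on [-1, 1]; computed once.
nodes, weights = gauss(20)

# Rescale [-1, 1] to a generic interval [a, b] and apply the rule to f.
f(t) = exp(-t^2)
a, b = 0.0, 2.0
integral = (b - a)/2 * sum(weights .* f.((b - a)/2 .* nodes .+ (a + b)/2))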

Thanks, this is helpful! I didn’t realize that negloglike = 0.0 would not trigger the same issue for autodiff as the one triggered by pre-allocating using prob = zeros(size(B,1)).

My understanding of the tolerance problem from reading is that the adaptive integration subdivides the domain to do approximate integration until the result is considered “good enough”, and the number of subdivisions depends on my tolerance level. If I switch to Gaussian quadrature, I impose a fixed number of quadrature points, which can be chosen based on a tiny rtol, and then apply the same rule to each integration in the loop. Is this right?

One more general point is that you can pass arrays of parametric types if you want to, i.e. you can define a function like

function f(x::Array{T}) where {T}
    return Array{T}(repeat([0], outer = [10, 1]))
end

julia> f(fill(0.0, 2))
10×1 Array{Float64,2}:
 0.0
 0.0
 0.0
 ⋮
 0.0
 0.0
 0.0

julia> f(fill(0, 2))
10×1 Array{Int64,2}:
 0
 0
 0
 ⋮
 0
 0
 0

julia> f(Array{Real}(undef, 0, 0))
10×1 Array{Real,2}:
 0
 0
 0
 ⋮
 0
 0
 0

So, in your code, something like

function loglike(θ::Array{T}, r, x, y, R, c, d, p0, B) where {T}
    #prob = zeros(size(B,1))
    prob = Array{T}(repeat([0], outer = [size(B)[1], 1]))
    for i = 1:size(B,1)    
        prob[i] = choice_prob_naive(θ, r[i], x[i,:], y[i,:], R[i], c[i], d[i], p0[i])
    end
    negloglike = - (B'*log.(prob) + (1.0 .- B)'*log.(1.0 .- prob))  
    return negloglike
end

should be fine (but also check how you have defined choice_prob_naive()).

Computing Dual + Float64 results in Dual, which means that the type of negloglike changes. In general, changing the type of a variable is bad for performance, but it might not be a problem here, since the heavy computation happens inside choice_prob_naive. It would be better to initialize with zero(T), where T is the eltype of theta.
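
A sketch of what that looks like with the loop version above (same hypothetical call to choice_prob_naive):

function loglike(θ, r, x, y, R, c, d, p0, B)
  # zero(eltype(θ)) is a Float64 zero for a plain parameter vector and a Dual zero
  # when ForwardDiff passes in Duals, so the accumulator never changes type.
  negloglike = zero(eltype(θ))
  for i = 1:size(B,1)
    prob = choice_prob_naive(θ, r[i], x[i,:], y[i,:], R[i], c[i], d[i], p0[i])
    negloglike -= B[i]*log(prob) + (1 - B[i])*log(1 - prob)
  end
  return negloglike
end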

Yes; more specifically, the adaptation picks the intervals where the estimated approximation error is largest and subdivides only those. I think you can also pre-compute the integration points and weights and reuse them for each integration.
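
As a hypothetical sketch of that pre-computation idea (the helper fixed_quad and the integrand passed to it are placeholders for whatever choice_prob_naive actually integrates):

using QuadGK

# Compute the rule once, outside the likelihood loop.
const NODES, WEIGHTS = gauss(30)

# Apply the same fixed rule on any interval [a, b] by rescaling from [-1, 1];
# choice_prob_naive could call this instead of an adaptive quadgk.
fixed_quad(integrand, a, b) =
    (b - a)/2 * sum(WEIGHTS .* integrand.((b - a)/2 .* NODES .+ (a + b)/2))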

In the end you probably should experiment with adaptive and non-adaptive quadrature and autodiff and derivative free optimization, to see which combination works best.

I am still confused about the type issue related to autodiff. With the type annotations removed, the optimization runs fine, but when I try to get the Hessian from the results, Julia complains about the type issue…

td = TwiceDifferentiable(θ -> loglike(θ, r, x, y, R, c, d, p0, B), theta; autodiff = :forward);
@time theta_naive = optimize(td, theta, BFGS(), Optim.Options(f_abstol=1e-3, iterations=100_000, show_trace=true, show_every=1))

julia> H  = Optim.hessian!(td, theta_naive)
ERROR: MethodError: no method matching iterate(::Optim.MultivariateOptimizationResults{BFGS{LineSearches.InitialStatic{Float64},LineSearches.HagerZhang{Float64,Base.RefValue{Bool}},Nothing,Nothing,Flat},Float64,Array{Float64,1},Float64,Float64,Array{OptimizationState{Float64,BFGS{LineSearches.InitialStatic{Float64},LineSearches.HagerZhang{Float64,Base.RefValue{Bool}},Nothing,Nothing,Flat}},1},Bool})

The error implies that hessian! doesn’t expect an optimization results object. You are missing a call to Optim.minimizer.
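
That is, something like this (using td and theta_naive from the code above):

# Pass the minimizing point, not the results object, to hessian!.
H = Optim.hessian!(td, Optim.minimizer(theta_naive))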

Hmmm I see. How do you get this information from reading the error message? I often find these messages hard to understand.
Should we expect Optim.hessian! to run for a long time, say, 10% of the optimization?

ERROR: MethodError: no method matching iterate(::Optim.MultivariateOptimizationResults{...})

It says that there is “no method matching”, i.e. “no known way of doing” iterate(blah), where blah is the type of the thing it got as input. But you didn’t call iterate, so that may seem strange. The stacktrace will show that iterate is being called from something (that is being called from something that is being called from… etc.) that is eventually called from hessian!. Generally, with stacktraces, you look for the last frame in the stack that you “own” to figure out what you did wrong.
