Possible to use cached values in Optim.optimize?

Hello,

I am using L-BFGS to solve a maximum likelihood problem, and I provide analytical gradients to the solver. However, many intermediate steps in the objective calculation are repeated in the gradient calculation. In fact, it's possible to compute both within a single function call that returns two outputs (objective and gradient). Given that I can re-use intermediate values from the objective evaluation in the gradient evaluation, I was wondering how to provide the function calls to Optim efficiently. For example, I would like to write something like

Optim.optimize(x -> f(x), y -> f(y), beta, LBFGS())

but have f called only once. Note that f would be evaluated in parallel with a macro like @distributed. In the current implementation, it seems that LBFGS() with the HagerZhang() linesearch first calls the function for the gradient and then again for the objective, and so on; that's not ideal in my situation, since I don't need two separate function calls.

You just need to write a function that accepts the gradient cache. https://julianlsolvers.github.io/Optim.jl/stable/#user/tipsandtricks/#avoid-repeating-computations
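
Schematically, the pattern from that page looks like this (a minimal sketch with a toy quadratic objective, not your actual problem):

using Optim

function fg!(F, G, x)
    # do the work shared by objective and gradient exactly once
    common = 2.0 .* x           # stand-in for your reusable intermediate values
    if G !== nothing
        G .= common             # fill the gradient cache in place
    end
    if F !== nothing
        return sum(abs2, x)     # return the objective value
    end
end

Optim.optimize(Optim.only_fg!(fg!), zeros(2), LBFGS())

Optim calls fg! with F and/or G set to nothing depending on whether it needs the objective, the gradient, or both, so when it needs both at the same point the shared work is done only once.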

Thank you! This is cool, and although I've read the Optim docs many times, I definitely skipped over this part.

I will try to make it more obvious. There's a similar one for Hessians as well, only_fgh!, but it seems like you don't have the Hessian available.
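
For reference, its shape is analogous (a sketch with a toy quadratic, just to show the signature, assuming you did have an analytical Hessian):

function fgh!(F, G, H, x)
    if G !== nothing
        G .= 2.0 .* x           # gradient of sum(abs2, x)
    end
    if H !== nothing
        H .= 0.0
        for i in eachindex(x)
            H[i, i] = 2.0       # Hessian of sum(abs2, x) is 2I
        end
    end
    if F !== nothing
        return sum(abs2, x)
    end
end

Optim.optimize(Optim.only_fgh!(fgh!), zeros(2), Newton())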

I just use the outer product of gradients to calculate the Hessian (information equality). It would help to see in the docs how one would achieve this when functions and gradients are calculated in parallel.
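
For concreteness, a one-line sketch of what I mean: if scores is an n × k matrix whose i-th row is the score of observation i (scores is hypothetical here, standing in for whatever the parallel loop produces), then by the information equality the sum of outer products approximates the negative Hessian of the log-likelihood:

# BHHH / outer-product-of-gradients approximation:
# scores' * scores equals the sum over i of s_i * s_i'
opg_hessian(scores) = scores' * scores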

A small example would suffice and would help fix ideas.

I’m not sure I understand what you’re saying here. Do you want an example for only_fgh!?

An example of fg! for now would be great (this calculates objective and gradient together, right?). It may be helpful to have more detail in the docs for fg! than is currently present, as it appears a bit too generic right now. In general, I have found the Optim documentation slightly lacking when it comes to parallel functions. Ideally this should be covered, since in many practical circumstances one passes a parallel-calculated objective and parallel-calculated gradient to Optim.

I think Optim is a great package but could benefit from these slight clarifications!

So you mean a specific example?

Yes, that would be most helpful. Ideally, have the function and gradient calculated using something like @distributed, so we can see how to pass parallel-calculated functions and gradients to Optim and use it most efficiently.

Hm, I guess I could write up such a problem. However, there isn't much more to it than in the link. Maybe you can provide a simplified version of your f and your g that I can use for the example?

My particular use case may be too difficult to compress into a clean MWE. How about a simple sum of Rosenbrock functions with gradients, with both function and gradient computed in parallel (for the sake of example)?

Maybe you can show what you have tried with that example? Might be easier to see where the problem is then.

Sure, in that case let me get back here in a few days once I create a small MWE for my particular use case. Meanwhile, it would help if the Optim documentation were updated along the lines @pkofod suggested above: the page at the shared link is unclear and lacks an accompanying example.

@pkofod, since you wanted an example of f and g for the docs, here's something I cooked up quickly. It illustrates the point of being able to re-use values from the likelihood calculation in the gradient calculation:

using Distributed, SharedArrays, DataFrames, StatsBase, ForwardDiff
# (assumes worker processes were added with addprocs and the packages are loaded @everywhere)

# data
data = DataFrame(x = repeat([1, 2], 500), y = sample([0, 1], 1000))

# likelihood
function likelihood(beta::Array{T}, data::R) where {T, R}
    Prob = SharedArray{T}(1000, 1)
    z = @distributed (+) for i in 1:1000
        if data[!, :y][i] == 1
            Prob[i] = exp(beta[1] * data[!, :x][i]) / (1 + exp(beta[1] * data[!, :x][i]))
        else
            Prob[i] = 1 / (1 + exp(beta[1] * data[!, :x][i]))
        end
    end
    return Prob, -z  # z is the sum of the individual likelihood contributions
end

# manual gradients
function gradient_likelihood(Prob, data::R) where {R}
    grad = SharedArray{Float64}(1000, 1)
    z = @distributed (+) for i in 1:1000
        if data[!, :y][i] == 1
            grad[i] = Prob[i] * (1 - Prob[i]) * data[!, :x][i]
        else
            grad[i] = -Prob[i] * (1 - Prob[i]) * data[!, :x][i]
        end
    end
    return -z
end

# check results
julia> likelihood([1.0], data)
([0.2689414213699951; 0.8807970779778824; … ; 0.2689414213699951; 0.8807970779778824], -510.25180766298547)

julia> gradient_likelihood(likelihood([1.0], data)[1], data)
-6.898809531258901

julia> ForwardDiff.gradient(x -> likelihood(x, data)[2], [1.0])
1-element Array{Float64,1}:
 -6.898809531258822

Given the documentation at https://julianlsolvers.github.io/Optim.jl/stable/#user/tipsandtricks/#avoid-repeating-computations, I don't know how to pass these functions to Optim efficiently. What I have tried so far is a mess, though I'm happy to post it if helpful. Remember, I'd like to avoid two separate function calls, since the likelihood and gradient can be calculated simultaneously.

I think this is what you need

data = DataFrame(x = repeat([1, 2], 500), y = sample([0, 1], 1000))

function fg!(F, G, beta)
    # shared computation: fill Prob once and reuse it for objective and gradient
    Prob = SharedArray{Float64}(1000, 1)
    z = @distributed (+) for i in 1:1000
        if data[!, :y][i] == 1
            Prob[i] = exp(beta[1] * data[!, :x][i]) / (1 + exp(beta[1] * data[!, :x][i]))
        else
            Prob[i] = 1 / (1 + exp(beta[1] * data[!, :x][i]))
        end
    end

    if G !== nothing
        grad = SharedArray{Float64}(1000, 1)
        zg = @distributed (+) for i in 1:1000
            if data[!, :y][i] == 1
                grad[i] = Prob[i] * (1 - Prob[i]) * data[!, :x][i]
            else
                grad[i] = -Prob[i] * (1 - Prob[i]) * data[!, :x][i]
            end
        end
        G[1] = -zg  # same value your gradient_likelihood returns
    end
    if F !== nothing
        return -z
    end
end
optimize(only_fg!(fg!), x0, ...)

Thanks a ton. This worked with minor tweaks to your fg!. Question for you: do we need to pre-initialize G somewhere inside fg!?

Copying the full code and output for anyone who runs into something similar. I converted the likelihood function to a log-likelihood and adjusted the gradients accordingly, since that's more typical in statistics/economics.

using Distributed, SharedArrays, DataFrames, StatsBase, Distributions, ForwardDiff, Optim

# data
data = DataFrame(x = repeat([1, 2], 500), y = sample([0, 1], 1000))

# log-likelihood
function likelihood(beta::Array{T}, data::R) where {T, R}
    Prob = SharedArray{T}(1000, 1)
    @distributed (+) for i in 1:1000
        if data[!, :y][i] == 1
            Prob[i] = exp(beta[1] * data[!, :x][i]) / (1 + exp(beta[1] * data[!, :x][i]))
        else
            Prob[i] = 1 / (1 + exp(beta[1] * data[!, :x][i]))
        end
    end
    return Prob, -sum(log.(Prob)) / 1000
end

# manual gradients of the log-likelihood
function gradient_likelihood(Prob, data::R) where {R}
    grad = SharedArray{Float64}(1000, 1)
    @distributed (+) for i in 1:1000
        if data[!, :y][i] == 1
            grad[i] = (Prob[i] * (1 - Prob[i]) * data[!, :x][i]) / Prob[i]
        else
            grad[i] = -(Prob[i] * (1 - Prob[i]) * data[!, :x][i]) / Prob[i]
        end
    end
    return -sum(grad) / 1000
end

# check results
julia> likelihood([1.0], data)
([0.2689414213699951; 0.8807970779778824; … ; 0.2689414213699951; 0.8807970779778824], 0.9240948492805974)

julia> gradient_likelihood(likelihood([1.0], data)[1], data)
0.4503263672928852

julia> ForwardDiff.gradient(x -> likelihood(x, data)[2], [1.0])
1-element Array{Float64,1}:
 0.4503263672928852

# avoid repeating computations in Optim.optimize
function fg!(F, G, beta)
    # shared computation: fill Prob once
    Prob = SharedArray{Float64}(1000, 1)
    @distributed (+) for i in 1:1000
        if data[!, :y][i] == 1
            Prob[i] = exp(beta[1] * data[!, :x][i]) / (1 + exp(beta[1] * data[!, :x][i]))
        else
            Prob[i] = 1 / (1 + exp(beta[1] * data[!, :x][i]))
        end
    end

    if G !== nothing
        grad = SharedArray{Float64}(1000, 1)
        @distributed (+) for i in 1:1000
            if data[!, :y][i] == 1
                grad[i] = (Prob[i] * (1 - Prob[i]) * data[!, :x][i]) / Prob[i]
            else
                grad[i] = -(Prob[i] * (1 - Prob[i]) * data[!, :x][i]) / Prob[i]
            end
        end
        copyto!(G, -sum(grad) / 1000)
    end

    if F !== nothing
        return -sum(log.(Prob)) / 1000
    end
end
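
(A note on the copyto! line: since beta has a single element here, G[1] = -sum(grad)/1000 would fill the gradient cache equivalently; with more parameters you would fill every entry of G.)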

julia> Optim.optimize(Optim.only_fg!(fg!), rand(Uniform(), 1), Optim.LBFGS(), Optim.Options(show_trace = true))
Iter     Function value   Gradient norm 
     0     7.616226e-01     2.766918e-01
 * time: 0.0
     1     6.915962e-01     1.335341e-02
 * time: 0.031000137329101562
     2     6.914531e-01     1.507051e-06
 * time: 0.04999995231628418
     3     6.914531e-01     1.105449e-15
 * time: 0.07500004768371582
 * Status: success

 * Candidate solution
    Minimizer: [7.37e-02]
    Minimum:   6.914531e-01

 * Found with
    Algorithm:     L-BFGS
    Initial Point: [5.61e-01]

 * Convergence measures
    |x - x'|               = 2.42e-06 ≰ 0.0e+00
    |x - x'|/|x'|          = 3.29e-05 ≰ 0.0e+00
    |f(x) - f(x')|         = 1.83e-12 ≰ 0.0e+00
    |f(x) - f(x')|/|f(x')| = 2.64e-12 ≰ 0.0e+00
    |g(x)|                 = 1.11e-15 ≤ 1.0e-08

 * Work counters
    Seconds run:   0  (vs limit Inf)
    Iterations:    3
    f(x) calls:    9
    ∇f(x) calls:   9

Optim initializes G for you, so you don't need to pre-initialize it inside fg!; you just fill it in place.

Perfect, thanks.