Possible to use cached values in Optim.optimize?

Hello,

I am using L-BFGS to solve a maximum likelihood problem, and I provide analytical gradients to the solver. However, many intermediate steps in the objective calculation are repeated in the gradient calculation. In fact, it's possible to compute both within a single function call that returns two outputs (objective and gradient). Given that I can re-use intermediate values from the objective evaluation in the gradient evaluation, I was wondering how to provide the function calls to Optim efficiently. For example, I would like to write something like

Optim.optimize(x -> f(x), y -> f(y), beta, LBFGS())

but have f called only once. Note that f would be evaluated in parallel with a macro like @distributed. In the current implementation, it seems that LBFGS() with the HagerZhang() linesearch first calls the function for the gradient and then again for the objective, and so on; that's not ideal in my situation, since I don't need two separate function calls.

You just need to write a function that accepts the gradient cache. https://julianlsolvers.github.io/Optim.jl/stable/#user/tipsandtricks/#avoid-repeating-computations
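
Schematically, the pattern from that page looks like this (a minimal sketch with a toy quadratic objective, not your actual problem):

using Optim

function fg!(F, G, x)
    # do the work shared by objective and gradient exactly once
    common = 2.0 .* x           # stand-in for your reusable intermediate values
    if G !== nothing
        G .= common             # fill the gradient cache in place
    end
    if F !== nothing
        return sum(abs2, x)     # return the objective value
    end
end

Optim.optimize(Optim.only_fg!(fg!), zeros(2), LBFGS())

Optim calls fg! with F and/or G set to nothing depending on whether it needs the objective, the gradient, or both, so when it needs both at the same point the shared work is done only once.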

Thank you! This is cool, and although I've read the Optim docs many times, I definitely skipped over this part.

I will try to make it more obvious. There's a similar one for Hessians as well, only_fgh!, but it seems like you don't have the Hessian available.
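
For reference, its shape is analogous (a sketch with a toy quadratic, just to show the signature, assuming you did have an analytical Hessian):

function fgh!(F, G, H, x)
    if G !== nothing
        G .= 2.0 .* x           # gradient of sum(abs2, x)
    end
    if H !== nothing
        H .= 0.0
        for i in eachindex(x)
            H[i, i] = 2.0       # Hessian of sum(abs2, x) is 2I
        end
    end
    if F !== nothing
        return sum(abs2, x)
    end
end

Optim.optimize(Optim.only_fgh!(fgh!), zeros(2), Newton())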

I just use the outer product of gradients to calculate the Hessian (information equality). It would help to see in the docs how one would achieve this when functions and gradients are calculated in parallel.
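
For concreteness, a one-line sketch of what I mean: if scores is an n × k matrix whose i-th row is the score of observation i (scores is hypothetical here, standing in for whatever the parallel loop produces), then by the information equality the sum of outer products approximates the negative Hessian of the log-likelihood:

# BHHH / outer-product-of-gradients approximation:
# scores' * scores equals the sum over i of s_i * s_i'
opg_hessian(scores) = scores' * scores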

A small example would suffice and would help fix ideas.

I’m not sure I understand what you’re saying here. Do you want an example for only_fgh!?

An example of fg! for now would be great (this calculates objective and gradient together, right?). It may be helpful to have more detail in the docs for fg! than is currently present, as it appears a bit too generic right now. In general, I have found the Optim documentation slightly lacking when it comes to parallel functions. Ideally this should be covered, since in many practical circumstances one passes a parallel-calculated objective and parallel-calculated gradient to Optim.

I think Optim is a great package but could benefit from these slight clarifications!

So you mean a specific example?

Yes, that would be most helpful. Ideally, have the function and gradient calculated using something like @distributed, so we can see how to pass parallel-calculated functions and gradients to Optim and use it most efficiently.

Hm, I guess I could write up such a problem. However, there isn't much more to it than in the link. Maybe you can provide a simplified version of your f and your g that I can use for the example?

My particular use case may be too difficult to compress into a clean MWE. How about a simple sum of Rosenbrock functions with gradients, with both function and gradient computed in parallel (for the sake of example)?

Maybe you can show what you have tried with that example? Might be easier to see where the problem is then.

Sure, in that case let me get back here in a few days once I create a small MWE for my particular use case. Meanwhile, it would help if the Optim documentation were updated along the lines @pkofod suggested above: the page at the shared link is unclear and lacks an accompanying example.

@pkofod, since you wanted an example of f and g for the docs, here's something I cooked up quickly. It illustrates the point of being able to re-use values from the likelihood calculation in the gradient calculation:

using Distributed, SharedArrays, DataFrames, StatsBase, ForwardDiff
# (assumes worker processes were added with addprocs and the packages are loaded @everywhere)

# data
data = DataFrame(x = repeat([1, 2], 500), y = sample([0, 1], 1000))

# likelihood
function likelihood(beta::Array{T}, data::R) where {T, R}
    Prob = SharedArray{T}(1000, 1)
    z = @distributed (+) for i in 1:1000
        if data[!, :y][i] == 1
            Prob[i] = exp(beta[1] * data[!, :x][i]) / (1 + exp(beta[1] * data[!, :x][i]))
        else
            Prob[i] = 1 / (1 + exp(beta[1] * data[!, :x][i]))
        end
    end
    return Prob, -z  # z is the sum of the individual likelihood contributions
end

# manual gradients
function gradient_likelihood(Prob, data::R) where {R}
    grad = SharedArray{Float64}(1000, 1)
    z = @distributed (+) for i in 1:1000
        if data[!, :y][i] == 1
            grad[i] = Prob[i] * (1 - Prob[i]) * data[!, :x][i]
        else
            grad[i] = -Prob[i] * (1 - Prob[i]) * data[!, :x][i]
        end
    end
    return -z
end

# check results
julia> likelihood([1.0], data)
([0.2689414213699951; 0.8807970779778824; … ; 0.2689414213699951; 0.8807970779778824], -510.25180766298547)

julia> gradient_likelihood(likelihood([1.0], data)[1], data)
-6.898809531258901

julia> ForwardDiff.gradient(x -> likelihood(x, data)[2], [1.0])
1-element Array{Float64,1}:
 -6.898809531258822

Given the documentation at https://julianlsolvers.github.io/Optim.jl/stable/#user/tipsandtricks/#avoid-repeating-computations, I don't know how to pass these functions to Optim efficiently. What I have tried so far is a mess, though I'm happy to post it if helpful. Remember, I'd like to avoid two separate function calls, since the likelihood and gradient can be calculated simultaneously.

I think this is what you need

data = DataFrame(x = repeat([1, 2], 500), y = sample([0, 1], 1000))

function fg!(F, G, beta)
    # shared computation: fill Prob once and reuse it for objective and gradient
    Prob = SharedArray{Float64}(1000, 1)
    z = @distributed (+) for i in 1:1000
        if data[!, :y][i] == 1
            Prob[i] = exp(beta[1] * data[!, :x][i]) / (1 + exp(beta[1] * data[!, :x][i]))
        else
            Prob[i] = 1 / (1 + exp(beta[1] * data[!, :x][i]))
        end
    end

    if G !== nothing
        grad = SharedArray{Float64}(1000, 1)
        zg = @distributed (+) for i in 1:1000
            if data[!, :y][i] == 1
                grad[i] = Prob[i] * (1 - Prob[i]) * data[!, :x][i]
            else
                grad[i] = -Prob[i] * (1 - Prob[i]) * data[!, :x][i]
            end
        end
        G[1] = -zg  # same value your gradient_likelihood returns
    end
    if F !== nothing
        return -z
    end
end
optimize(only_fg!(fg!), x0, ...)

Thanks a ton. This worked with minor tweaks to your fg!. Question for you: do we need to pre-initialize G somewhere inside fg!?

Copying the full code and output for anyone who runs into something similar. I converted the likelihood function to a log-likelihood and adjusted the gradients accordingly, since that's more typical in statistics/economics.

using Distributed, SharedArrays, DataFrames, StatsBase, Distributions, ForwardDiff, Optim

# data
data = DataFrame(x = repeat([1, 2], 500), y = sample([0, 1], 1000))

# log-likelihood
function likelihood(beta::Array{T}, data::R) where {T, R}
    Prob = SharedArray{T}(1000, 1)
    @distributed (+) for i in 1:1000
        if data[!, :y][i] == 1
            Prob[i] = exp(beta[1] * data[!, :x][i]) / (1 + exp(beta[1] * data[!, :x][i]))
        else
            Prob[i] = 1 / (1 + exp(beta[1] * data[!, :x][i]))
        end
    end
    return Prob, -sum(log.(Prob)) / 1000
end

# manual gradients of the log-likelihood
function gradient_likelihood(Prob, data::R) where {R}
    grad = SharedArray{Float64}(1000, 1)
    @distributed (+) for i in 1:1000
        if data[!, :y][i] == 1
            grad[i] = (Prob[i] * (1 - Prob[i]) * data[!, :x][i]) / Prob[i]
        else
            grad[i] = -(Prob[i] * (1 - Prob[i]) * data[!, :x][i]) / Prob[i]
        end
    end
    return -sum(grad) / 1000
end

# check results
julia> likelihood([1.0], data)
([0.2689414213699951; 0.8807970779778824; … ; 0.2689414213699951; 0.8807970779778824], 0.9240948492805974)

julia> gradient_likelihood(likelihood([1.0], data)[1], data)
0.4503263672928852

julia> ForwardDiff.gradient(x -> likelihood(x, data)[2], [1.0])
1-element Array{Float64,1}:
 0.4503263672928852

# avoid repeating computations in Optim.optimize
function fg!(F, G, beta)
    # shared computation: fill Prob once
    Prob = SharedArray{Float64}(1000, 1)
    @distributed (+) for i in 1:1000
        if data[!, :y][i] == 1
            Prob[i] = exp(beta[1] * data[!, :x][i]) / (1 + exp(beta[1] * data[!, :x][i]))
        else
            Prob[i] = 1 / (1 + exp(beta[1] * data[!, :x][i]))
        end
    end

    if G !== nothing
        grad = SharedArray{Float64}(1000, 1)
        @distributed (+) for i in 1:1000
            if data[!, :y][i] == 1
                grad[i] = (Prob[i] * (1 - Prob[i]) * data[!, :x][i]) / Prob[i]
            else
                grad[i] = -(Prob[i] * (1 - Prob[i]) * data[!, :x][i]) / Prob[i]
            end
        end
        copyto!(G, -sum(grad) / 1000)
    end

    if F !== nothing
        return -sum(log.(Prob)) / 1000
    end
end
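
(A note on the copyto! line: since beta has a single element here, G[1] = -sum(grad)/1000 would fill the gradient cache equivalently; with more parameters you would fill every entry of G.)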

julia> Optim.optimize(Optim.only_fg!(fg!), rand(Uniform(), 1), Optim.LBFGS(), Optim.Options(show_trace = true))
Iter     Function value   Gradient norm 
     0     7.616226e-01     2.766918e-01
 * time: 0.0
     1     6.915962e-01     1.335341e-02
 * time: 0.031000137329101562
     2     6.914531e-01     1.507051e-06
 * time: 0.04999995231628418
     3     6.914531e-01     1.105449e-15
 * time: 0.07500004768371582
 * Status: success

 * Candidate solution
    Minimizer: [7.37e-02]
    Minimum:   6.914531e-01

 * Found with
    Algorithm:     L-BFGS
    Initial Point: [5.61e-01]

 * Convergence measures
    |x - x'|               = 2.42e-06 ≰ 0.0e+00
    |x - x'|/|x'|          = 3.29e-05 ≰ 0.0e+00
    |f(x) - f(x')|         = 1.83e-12 ≰ 0.0e+00
    |f(x) - f(x')|/|f(x')| = 2.64e-12 ≰ 0.0e+00
    |g(x)|                 = 1.11e-15 ≤ 1.0e-08

 * Work counters
    Seconds run:   0  (vs limit Inf)
    Iterations:    3
    f(x) calls:    9
    ∇f(x) calls:   9

Optim initializes G for you, so you don't need to pre-initialize it inside fg!; you just fill it in place.

Perfect, thanks.