Optim.jl vs scipy.optimize once again

I have been using Python’s scipy.optimize.minimize(method="L-BFGS-B") for my research, and have been looking to speed up the code because it doesn’t scale.

So I tried rewriting my code in Julia using Optim.jl, with Optim.LBFGS as the method. Unfortunately, my situation is the opposite of the one in Optimize performance comparison - Optim.jl vs Scipy.
Here is my benchmark output:

  • scipy.optimize: 0.448 ± 0.013 s
  • Optim.jl: 6.744 s (30180 allocations: 1.07 MiB) – using @btime

I.e. Optim.jl is ~15x slower.

Here is the source code: GitHub - rht/climate_stress_test_benchmark (~300 LOC of Julia / Python).
I have already made sure that my objective function uses as few allocations as possible (see climate_stress_test_benchmark/climate_stress_test_simplified.jl.274731.mem at main · rht/climate_stress_test_benchmark · GitHub).
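
For reference, this is roughly how the per-line allocation numbers were obtained (a sketch of the usual workflow; fn and xs0 are the objective and starting point from the benchmark script):

    # Sketch: run the script with allocation tracking enabled,
    #   julia --track-allocation=user climate_stress_test_simplified.jl
    # which writes a *.mem file with per-line allocation counts next to the source.
    # To exclude compilation, run the workload once, call Profile.clear_malloc_data()
    # (from the Profile stdlib), then run it again before exiting.

    # Quick single-call check from the REPL:
    fn(xs0)             # warm up (compile) first
    @allocated fn(xs0)  # bytes allocated by one objective evaluation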

I have received help optimizing my code on Zulip (e.g. specifying const for the globals). But at the time I didn’t have permission to publish a subset of the research code yet, so it was like shooting in the dark. Now I have the permission.

According to Patrick Kofod Mogensen (on Zulip), it could be that the parameters of the two LBFGS optimizations aren’t exactly comparable. But according to minimize(method=’L-BFGS-B’) — SciPy v1.9.1 Manual, the scipy version defaults to maxfun = 15000 and maxiter = 15000, while Optim.jl has no equivalent parameters, so I can’t compare them directly.
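
For what it’s worth, this is roughly how I understand the knobs to line up (a sketch; I’m not certain every Optim.Options keyword below maps one-to-one onto the scipy option):

    using Optim

    # Rough mapping of scipy's L-BFGS-B defaults onto Optim.Options:
    opts = Optim.Options(
        iterations    = 15000,   # ~ scipy's maxiter
        f_calls_limit = 15000,   # ~ scipy's maxfun
        f_tol         = 2.2e-9,  # ~ scipy's ftol
        g_tol         = 1e-5,    # ~ scipy's gtol / pgtol
    )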

Thank you for reading the post!

4 Likes

Is it due to compilation latency?

Here’s my post on this topic: Julia vs R vs Python: simple optimization | Codementor

2 Likes

I used @btime, which conveniently excludes compilation latency.

1 Like

Things to rule out:

  1. Is Optim.jl taking qualitatively different steps than your Python code?
  2. Is the minimizer you reach different than what Python produces?
  3. Is Optim pushing for higher accuracy than Python is?
  4. Are the gradients being used different?

In general, focusing on performance optimization in these comparisons is often a red herring: the speed issue in play usually isn’t the time it takes to complete one iteration, but the overall behavior of the optimizer, which depends on all of the things above and has nothing to do with the performance of the code. A sketch of how to check points 1-4 follows below.
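
One way to check 1-4 is to compare the two results’ bookkeeping directly (a sketch, assuming fn, lower, upper, xs0 are the objects from your benchmark script):

    using Optim

    opts   = Optim.Options(store_trace = true, extended_trace = true)
    result = Optim.optimize(fn, lower, upper, xs0,
                            Optim.Fminbox(Optim.LBFGS()), opts)

    Optim.iterations(result)  # vs scipy's res.nit
    Optim.f_calls(result)     # vs scipy's res.nfev
    Optim.g_calls(result)     # vs scipy's res.njev
    Optim.minimum(result)     # vs scipy's res.fun
    Optim.minimizer(result)   # vs scipy's res.x
    Optim.converged(result)   # vs scipy's res.success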

9 Likes

Looking in the opposite direction to John, have you benchmarked how long a single evaluation of your objective function takes in Python vs Julia?
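
For example (a sketch; fn and xs0 as in the benchmark script, with timeit doing the equivalent on the Python side):

    using BenchmarkTools

    # Interpolate the argument with $ so the global doesn't distort the timing.
    @btime fn($xs0)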

6 Likes
  1. Is Optim.jl taking qualitatively different steps than your Python code?

Optim.jl did 3 iterations, scipy.optimize did 4 iterations.
Optim.jl did 3833 function calls, scipy.optimize did 186!!
Optim.jl defaults to ftol = 0.0, scipy.optimize defaults to ftol = 2.220446049250313e-09.
Optim.jl defaults to gtol = 1e-8, scipy.optimize defaults to 1e-5.

  2. Is the minimizer you reach different than what Python produces?

The minimizers are virtually identical. I will post the output of each here.

Optim.jl:

 * Status: success

 * Candidate solution
    Final objective value:     -6.263329e+00

 * Found with
    Algorithm:     Fminbox with L-BFGS

 * Convergence measures
    |x - x'|               = 3.26e-07 ≰ 0.0e+00
    |x - x'|/|x'|          = 6.05e-08 ≰ 0.0e+00
    |f(x) - f(x')|         = 0.00e+00 ≤ 0.0e+00
    |f(x) - f(x')|/|f(x')| = 0.00e+00 ≤ 0.0e+00
    |g(x)|                 = 1.94e-10 ≤ 1.0e-08

 * Work counters
    Seconds run:   9  (vs limit Inf)
    Iterations:    3
    f(x) calls:    3833
    ∇f(x) calls:   3833

minimizer: [1.6265845677255805e-10, 0.9999999998063619, 0.999999999921528, 0.9999999999434931, 0.9999999999526344, 0.9999999999586152, 0.9999999999615918, 0.999999999963598, 0.9999999999650679, 0.999999999966145, 0.9999999999668981, 0.9999999999676399, 0.9999999999681027, 0.999999999968442, 0.9999999999688338, 0.9999999999689775, 0.9999999999690062, 0.9999999999691217, 0.9999999999690876, 0.9999999999689795, 0.9999999999687171, 0.9999999999682864, 0.9999999999677559, 0.9999999999668858, 0.9999999999658892, 0.9999999999643274, 0.9999999999623169, 0.9999999999596437, 0.9999999999558721, 0.999999999950147]

scipy.optimize:

      fun: -6.453688614845538
 hess_inv: <30x30 LbfgsInvHessProduct with dtype=float64>
      jac: array([ 0.42337325, -0.34515572, -0.84077109, -1.19605996, -1.45835991,
       -1.64439181, -1.78088317, -1.89369586, -1.9603929 , -2.01088434,
       -2.05932017, -2.10036094, -2.13964863, -2.17572324, -2.18934461,
       -2.20348316, -2.21532107, -2.2168737 , -2.21463664, -2.20394439,
       -2.18523198, -2.1545035 , -2.11170751, -2.06271514, -1.99429139,
       -1.91358342, -1.81256849, -1.69211453, -1.54430281, -1.37051987])
  message: 'CONVERGENCE: NORM_OF_PROJECTED_GRADIENT_<=_PGTOL'
     nfev: 186
      nit: 4
     njev: 6
   status: 0
  success: True
        x: array([0., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])

For completeness:

Single evaluation of the objective in Python (scipy.optimize): 0.00136 s
Single evaluation of the objective in Julia (Optim.jl): 27.658 μs (1 allocation: 16 bytes), i.e. ~49x faster.

So I tried relaxing Optim.jl’s f_tol and g_tol to scipy’s defaults by doing this:

    result = Optim.optimize(fn, lower, upper, xs0, Optim.Fminbox(inner_optimizer()), Optim.Options(g_tol = 1e-5, f_tol = 2.2e-9))

But it still took ~6 s and still did 3737 function calls.

3 Likes

There might be deeper issues here.

1 Like

It is possible that the number of line-search steps per iteration is different. Apparently scipy defaults to a maximum of 20. I could not quickly find what Optim defaults to. It is not clear to me whether either lets you change that.

That could explain taking one more iteration and declaring convergence much more quickly.

(What kind of gradient is being used in each case?)

1 Like

Have you tried using scipy.optimize through PyCall?

Rather than keeping both .py and .jl versions, you only keep the Julia code and use scipy when you need it.

In the models I have worked on, scipy through PyCall is actually faster than when I use Optim.jl. I don’t know why, but that’s the way it is.
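
For reference, a minimal sketch of what I mean (assuming scipy is available in the Python environment PyCall uses, and reusing fn, lower, upper, xs0 from the benchmark script):

    using PyCall

    so  = pyimport("scipy.optimize")
    # PyCall wraps the Julia objective as a Python callable automatically.
    res = so.minimize(fn, xs0;
                      method = "L-BFGS-B",
                      bounds = collect(zip(lower, upper)))

    res["x"], res["fun"], res["nit"], res["nfev"]  # same fields as in Python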

4 Likes

Regarding the line search, what algorithm does scipy.optimize use? https://github.com/scipy/scipy/blob/4ec6f29c0344ddce80845df0c2b8765fc8f27a72/scipy/optimize/lbfgsb_src/lbfgsb.f#L762-L829 contains no reference to any paper.

I will try this soon. Thank you for the recommendation!

Do you think it would be worth porting scipy.optimize’s line search to LineSearches.jl?

1 Like

Is this dictated by the L-BFGS-B algorithm, or are users supposed to have freedom in specifying their own gradient function?

1 Like

If you do not provide the gradient function, both are probably using some kind of finite-difference gradient (which, if your function is differentiable, is really not a good idea).

You may want to try automatic differentiation, using, for example:

julia> Optim.minimizer(optimize(f, initial_x, BFGS(); autodiff = :forward))
2-element Array{Float64,1}:
 1.0
 1.0

which is described here:

1 Like

I am not sure; the documentation is not very explicit. But both which line-search algorithm is used and how many steps of it are taken at each LBFGS iteration will affect the final result. Note that these options are very problem-specific (one may question why Optim has one set of defaults and scipy another; that probably has no deep reason, or may have been tuned on some specific test set, and you can get very different final performance depending on the problem you are dealing with).

Optimization algorithms are like that: after choosing one, you have to figure out which parameter set of the method best fits your problem, and by chance the defaults of one implementation or the other may happen to be better for your case.

1 Like

I tried this (scipy.optimize through PyCall). The optimization took 0.014093 s, which is 31.8x faster than the full Python version!

I remember being told to use autodiff when I asked in the JuliaLang Zulip. When I tried it, I got this:

ERROR: LoadError: MethodError: no method matching Float64(::ForwardDiff.Dual{ForwardDiff.Tag{var"#_obj_fn#5"{Array{Array{Float64,1},1},Array{Array{Float64,1},1},Firm,Array{Float64,1},Array{Float64,1}},Float64},Float64,10})
Closest candidates are:
  Float64(::Real, ::RoundingMode) where T<:AbstractFloat at rounding.jl:200
  Float64(::T) where T<:Number at boot.jl:732
  Float64(::UInt16) at float.jl:66

This is the explanation of why autodiff doesn’t work: https://julialang.zulipchat.com/#narrow/stream/274208-helpdesk-.28published.29/topic/why.20is.20Optim.2Ejl.20so.20slow.3F/near/227257823
Quoted:

2 Likes

This makes sense, since your function evaluation is faster in Julia. The difference you were seeing is then almost certainly related to some of the parameters of the methods. Both calls certainly use the same implementation of LBFGS under the hood, so unless there were a very odd bug in the interface, there is not much else that can differ.

Concerning the derivatives: if you really need that to be fast (and what you have is not enough), you may have to think carefully about the implementation of your function. If it is supposed to be differentiable, then you should pursue re-implementing it so that it becomes so. You may want to start a different thread about that if you hit some barrier.
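
In case it helps: the usual culprit behind that MethodError is a buffer or conversion hard-coded to Float64 inside the objective. A minimal, hypothetical illustration (not the actual model code):

    using ForwardDiff

    # Fails under ForwardDiff: the buffer only holds Float64, so Dual numbers
    # cannot be stored in it (MethodError: no method matching Float64(::Dual)).
    function obj_rigid(x)
        buf = zeros(Float64, length(x))
        for i in eachindex(x)
            buf[i] = x[i]^2 + sin(x[i])
        end
        return sum(buf)
    end

    # Works: the buffer's element type follows the input, so Duals flow through.
    function obj_generic(x)
        buf = zeros(eltype(x), length(x))
        for i in eachindex(x)
            buf[i] = x[i]^2 + sin(x[i])
        end
        return sum(buf)
    end

    ForwardDiff.gradient(obj_generic, ones(5))    # fine
    # ForwardDiff.gradient(obj_rigid, ones(5))    # throws the MethodError above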

PS: maybe you can file an issue on the Optim GitHub with your example; it is possible that some developer takes a closer look at it, finds exactly the reason for the difference and, who knows, decides to change something.

3 Likes

After digging into scipy.optimize’s code a bit further, I found the reference to their line-search algorithm: https://github.com/scipy/scipy/blob/4ec6f29c0344ddce80845df0c2b8765fc8f27a72/scipy/optimize/lbfgsb_src/lbfgsb.f#L2439-L2480.

c     This subroutine calls subroutine dcsrch from the Minpack2 library
c       to perform the line search.  Subroutine dscrch is safeguarded so
c       that all trial points lie within the feasible region.
c
c     Be mindful that the dcsrch subroutine being called is a copy in
c       this file (lbfgsb.f) and NOT in the Minpack2 copy distributed
c       by scipy.

This is the modified Minpack2 code: https://github.com/scipy/scipy/blob/4ec6f29c0344ddce80845df0c2b8765fc8f27a72/scipy/optimize/lbfgsb_src/lbfgsb.f#L3326

Do you think it would be worth asking Optim.jl to expose a maxls (max line-search) parameter like scipy.optimize does?

Regarding implementing my objective function to be differentiable, I don’t know if it will be possible. The full model has more details than the struct in the code I released in GitHub - rht/climate_stress_test_benchmark.

Will do!

1 Like

Ideally every parameter should be tunable by the user in the end, so yes, if it is not, why not ask.

What one has to think about is whether the model is differentiable or not, in principle. If it is, it can be done (even if by hand). How complicated that is is another story, but depending on how important it is, it may be worth the effort.

1 Like

I found the parameter linesearchmax in LineSearches.jl: LineSearches.jl/hagerzhang.jl at 62ffafe6f4dd68032e42b11c31b8e4e24c4c9ae4 · JuliaNLSolvers/LineSearches.jl · GitHub.

2 Likes

So, by setting linesearchmax to 20,

result = Optim.optimize(fn, lower, upper, xs0, Optim.Fminbox(inner_optimizer(linesearch = LineSearches.HagerZhang(linesearchmax = 20))), Optim.Options(g_tol = 1e-5, f_tol = 2.2e-9))

I got 700.513 ms (3365 allocations: 148.81 KiB), which is still somewhat slower than the full Python version’s 448 ms. Notice from the summary below that f_tol and g_tol have been set correctly:

 * Status: success

 * Candidate solution
    Final objective value:     -1.617721e+00

 * Found with
    Algorithm:     Fminbox with L-BFGS

 * Convergence measures
    |x - x'|               = 0.00e+00 ≤ 0.0e+00
    |x - x'|/|x'|          = 0.00e+00 ≤ 0.0e+00
    |f(x) - f(x')|         = 0.00e+00 ≤ 0.0e+00
    |f(x) - f(x')|/|f(x')| = 0.00e+00 ≤ 2.2e-09
    |g(x)|                 = 4.68e-01 ≰ 1.0e-05

 * Work counters
    Seconds run:   3  (vs limit Inf)
    Iterations:    5
    f(x) calls:    384
    ∇f(x) calls:   384

384 is still a bit more than 2x the full Python’s 186 nfev. If I rescale the time accordingly, 700.513 / 384 * 186, I still get ~339.3 ms. The remaining difference is the line-search algorithm: HagerZhang vs Minpack2.
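
One more thing I might try next: LineSearches.jl also has MoreThuente(), which, as far as I understand, follows the same Moré-Thuente (Minpack2 dcsrch) line search that the Fortran code above calls. A sketch (reusing inner_optimizer, fn, lower, upper, xs0 from above):

    using Optim, LineSearches

    result = Optim.optimize(fn, lower, upper, xs0,
                            Optim.Fminbox(inner_optimizer(linesearch = LineSearches.MoreThuente())),
                            Optim.Options(g_tol = 1e-5, f_tol = 2.2e-9))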

5 Likes

Nice. As you see, the “time” of optimization methods is very sensitive to these details. Be aware that these differences may or may not be reproducible in other instances of the same problem. Whether a convergence criterion happens to be met at a given iteration is partly a matter of numerical chance, which can make one set of parameters faster than another in one specific run; sometimes just changing the initial guess gives you something different.

(Probably one should aim for a robust set of parameters with reasonable average performance on a specific problem.)

4 Likes