Optimize performance comparison - Optim.jl vs Scipy


I am just starting to learn about optimization. I picked up Optim.jl (great documentation, btw) and tried to do the same thing in Python.

It seems that Rosenbrock function is what everyone uses as an example. Perhaps not too surprisingly, Julia is a lot faster than Python (appox. 60x) but then I am curious where the performance difference come from. Has anyone done similar exercise before? What do you think?

I’ve uploaded my notebooks for reference.

  1. Julia (63 μs) http://nbviewer.jupyter.org/github/tk3369/julia-notebooks/blob/master/Rosenbrock_Julia.ipynb

  2. Python (3.85 ms, with numba jit 3.55 ms) http://nbviewer.jupyter.org/github/tk3369/julia-notebooks/blob/master/Rosenbrock_Python.ipynb

Performance of Optim.jl vs. Matlab's fminunc

The runtime of rosenbrock is totally negligble in Julia:

julia> @btime rosenbrock($x2) # result of running optimization
  1.354 ns (0 allocations: 0 bytes)

but it may take close to a microseond or per call in Python?

Also, FWIW, I once tried to lean up Optim for use on cheap functions in simulations. Things aren’t in a pretty state at the moment, but, setup (note this “once differentiable” object uses ForwardDiff):

using BenchmarkTools, DifferentiableObjects # https://github.com/chriselrod/DifferentiableObjects.jl
using TriangularMatrices 
# This should really be StaticArrays, but in a rush job I didn't make things as generic as I should have.
# I'll get around to making it appropriately generic within a couple months
# https://github.com/chriselrod/TriangularMatrices.jl
rosenbrock(x) =  (1.0 - x[1])^2 + 100.0 * (x[2] - x[1]^2)^2
x = RecursiveVector{Float64,2}() .= 0; # basically, an MVector
od = OnceDifferentiable(rosenbrock, deepcopy(x));


julia> optimize_light(od, x, BFGS())
([1.0, 1.0], 7.645688055381766e-21)

julia> x
2-element RecursiveVector{Float64,2}:

julia> @benchmark optimize_light($od, $x, BFGS())
  memory estimate:  14.08 KiB
  allocs estimate:  313
  minimum time:     10.747 μs (0.00% GC)
  median time:      11.165 μs (0.00% GC)
  mean time:        18.872 μs (37.76% GC)
  maximum time:     51.231 ms (99.95% GC)
  samples:          10000
  evals/sample:     1

Also, you should definitely give BackTracking a try:

julia> using LineSearches

julia> @benchmark optimize_light($od, $x, BFGS(;linesearch = BackTracking()))
  memory estimate:  3.02 KiB
  allocs estimate:  83
  minimum time:     5.105 μs (0.00% GC)
  median time:      5.244 μs (0.00% GC)
  mean time:        6.884 μs (20.84% GC)
  maximum time:     9.548 ms (99.93% GC)
  samples:          10000
  evals/sample:     6

I’ve found it to be much faster than the default line search HagerZhang in my problems (but slightly less reliable when the parameter space is poorly behaved).

Anyway, for most problems, the actual function calls dominate runtime. This I think is overall a major advantage of Julia – you’re optimizing Julia functions, not Python functions.
And it also means the performance benefits of leaning out Optim, like you see above, are negligible for most problems.

Final aside: would be nice to take advantage of https://github.com/yuyichao/FunctionWrappers.jl
to get type stability, without having to recompile for different objectives.


This is similar to the ODE example. DifferentialEquations.jl (called from Python) beats Numba-compiled SciPy ODE calls by about 10x even though it’s just using Fortran+Numba.

The reason is because function calls in Python are expensive. This is true whether or not a function call is JIT-compiled since you still need a Python interpreter in the middle. This dynamicness in the middle also inhibits all sorts of optimizations. So Numba is really just a trick for small scripts and microbenchmarks but it’s not something that will work well for packages. Its worse case scenario are higher order functions (functions which take in functions), and these show up all of the time in differential equation solvers, optimization problems, etc. (anything that handles user-defined nonlinearity).

I’ll be putting a blog post out which goes into this in some detail.


Do you know why this is so much faster? I just tried it on an "Optim vs. fminunc" horse race that I am working on, and it sped things up tremendously.


Follow up question, I’ve also made some benchmark here of Optim vs. scipy, and python algorithms are much slower. The issue is that I’m calling python from Julia and that my functions are defined in Julia, so presumably python has to callback into Julia. Is there any overhead involved in that or python is just that slow.


Unfortunately, I have made the opposite experience. I want to analyse some experimental data and wrote my objective function once in Julia and in python (numpy + numba). Within the objective function I have to made some loops, calls of own defined functions and calculations of eigenvalues and eigenvectors of complex matrices. By using BenchmarkTools, Julia needs about 480ms (2118151 allocations: 259.96 MiB) whereas python (+ numba) need 510ms. Interestingly, by using Optim and the BFGS algorithm

optimize(calc_mse, start_parameter, BFGS(), Optim.Options(iterations = 2))

I need for 2 iterations about 10h whereas with scipy and scipy.optimize.minimize I need only 3,5h. So its a factor of about 3 faster than Julia.

  • The parameter which as to be optimized has a length of 6317
  • Since I have to calculate the eigenvalues and eigenvectors of a complex matrix in my objective I cannot use autodiff (if I understand some posts in this forum correctly)

If it will be interesting I can try post a short version of the code.


Recomputing the gradient by finite differences is very suboptimal. Is your objective really that complicated that analytic computation of the gradient is out of reach?


This doesn’t really make sense to me. You’re saying that one evaluation takes 480ms but two BFGS iterations takes 10 hours? If you only have 6300 parameters, this makes no sense, even with finite differences. That’s enough for 18000 evaluations unless there’s something I don’t understand?


If the gradient is reconstructed by central finite differences then one gradient step takes about 2h. There’s still a factor of 5 in there though, perhaps because the gradient is evaluated multiple times in the line search (switching to backtracking linesearch should decrease that)


Oh, yes. Finite differences, I missed that part.

Yes, it’s probably because the gradient is evaluated as part of HagerZhang. My best bet is that scipy.optimize is not really faster in any meaningful sense here. What I mean is, that they’re most likely using a one-way (forward would be standard :slight_smile: ) finite difference approximation, and a line search algorithm that doesn’t use the derivative information. You can do the same if you want (but really, you don’t, because that finite differencing of a R^6300->R function is killing you).

Can you post a description (draft paper or similar) of the problem?


In the experiment I measured several 4 times 4 matrixes. In order to fit them, I created an array (vector) which contains all the elements as well as the uncertainty. The physical properties which I have to fit are complex-valued. Thus, I created a paramter vector where the first elements corresponds to real and the lower one to the imaginary part.

The code for the objective function looks like this:

function calc_model(df, df2, phi0, d_surf)
model = Array{Float64}(LaengeExp)
index = 1
for i=1:LaengeData

   # This matrix does not depend on the fittng parameter so its derivative is zeros
    enerpos = Int(data_input[i,1]) + 1
    L1 = calc_L1( 1.0, winkel ) 

    L2 = calc_L2( df[enerpos,:], alpha, [winkel_array[enerpos]] , winkel[enerpos] ) # this function creates a complex 4x4 matrix. In doing that, the eigenvalues and eigenvectors of another complex valued 4x4 matrix is calculated

    L3 = calc_L3( df2[enerpos,:], alpha, [winkel_array[enerpos]] , winkel[enerpos], d, ener[enerpos] ) # similar to function calc_L3 a complex 4x4 matrix is created. For that, the eigenvalues and eigenvectors of another complex valued 4x4 matrix have to be calculated
    # some transformation in order to obtain the same matrix as in the experiment. In these functions 2x2 and 4x4 matrixes. These runtime of these functions are negligible compared to those of L2 and L3
    mat1 = calc_mat1( L1 * L2 * L3 )
    theo_mat = convert_mat1( jones )

    # occupy the model vector so that it can be compared with the experimental one
    nr_elements = Int(data_orient[i,7])
    model[index:index+nr_elements-2] = reshape(transpose(SMatrix{4,4}(theo_mat)),1,16)[2:nr_elements]
    index = index + nr_elements - 1
return model


function calc_mse(parameter)

df, dicke, phi0 = calc_phys_properties(parameter) #calculate the complex-valued quantities   

mse = sum( abs.((calc_model(df, df_surf, phi0, dicke ) - data_exp[:,1] ) ./ data_exp[:,2]) ) / LaengeExp
return mse


The main problem is, that I need the eigenvalue and eigenvector of a 4x4 complex matrix which is neither symmetric nor Hermitian. Thus I use dummy, eigenval, dummy, eigenvec = LAPACK.geevx!('B', 'N', 'V', 'N', copy(delta)) in order to calculate these properties. (calling the eig() function is much slower). As far as I know DiffForward will not work with complex matrixes.

I have also some global variables which I use. But I declare them as constants so I thought this should not be affect the speed so much.

If further information are wanted I can prepare a better description of the code.


ForwardDiff will not help you here, since it still requires work comparable with 6k times the cost of one evaluation. Instead you should compute the gradient through the adjoint method (see e.g. https://math.mit.edu/~stevenj/18.336/adjoint.pdf), which might be very annoying to derive by hand but is at least in principle doable, and reduces the cost to one comparable with one evaluation of the objective function. In particular, differentiating through eigenvector-taking is not harder than any other operations (you have to be careful in case of repeated or almost-repeated eigenvalues, as well as almost-non-diagonalizable matrices, but if your objective function is well-behaved that should take care of itself in the end). In theory, reverse-mode autodiff is a way of doing all this automatically, but in practice it’s very hard to get it working…


I recommend that people who use first-order methods always start with L-BFGS + BackTracking(order=3), and then start experimenting with line searches and other optimizers if the performance is not satisfactory.
(@pkofod if you agree then maybe we should state that somewhere in the documentation?)

There are two reasons why BackTracking often is faster than HagerZhang when it is used with (L)BFGS:

  1. The default line search stop tolerances are less stringent in BackTracking.
  2. HagerZhang evaluates both the gradient and objective at each line search iteration, but BT only evaluates the objective.

The implication of point 1 is that HZ is more “robust”; it helps the optimizer by finding “better” step lengths. This helps when the objective is poorly scaled or in other ways is not well-behaved, and is why HZ is the default even if it often is more costly.
Regarding point 2, the cost of the extra gradient call does not need to be very costly if the user provides gradients and possibly the fg! method. It seems to me that many users write expensive gradient methods or use finite differences, which adds much time to the wall-time in HZ.


@antoine-levitt, those notes are amazing.

@itzui, footnote 1 is probably the key to translating the results in the notes to complex matrices.


@antoine-levitt tanks for the pdf. I will try to implement this.
@anriseth Thanks for the explanation