Optim finite differences: use a vectorized version?

Hi, I’m in an optimization setting where I have to minimize a function f. I have both a “vector version” and a “matrix version” of f, such that calling the vector version N times is typically slower than calling the matrix version once with N rows. My question is whether I can use the matrix version to compute finite differences more efficiently when using Optim.

See the following MWE to get the main idea: running the vector version once costs about 6 ns, while running the matrix version with 100 rows costs about 160 ns, i.e. roughly 25x more expensive while computing 100x more values! It seems to me that a good finite-difference stencil (e.g. for the Hessian) could therefore be computed more cheaply with the matrix version than with repeated calls to the vector version. Is this correct, and if so, is there a way to make Optim use this? (A rough sketch of what I have in mind follows the output below.)

(I’ve looked into automatic differentiation for my application; it seems to be roughly on par with finite differences. I can probably squeeze out some improvement there, but that would probably take more man-hours than I can afford at this point.)

using BenchmarkTools, LinearAlgebra, Optim, Random

# Scalar ("vector") version: x is a single point, returns one value.
function f(x::Vector{T}, a::Vector{Float64}) where {T<:Real}
    f_out = zero(promote_type(T, Float64))
    for xi in x
        for (i, ai) in enumerate(a)
            if i % 2 == 0
                f_out *= ai
            else
                f_out += abs(xi)
            end
        end
    end
    return f_out
end

# Batched ("matrix") version: each row of x is one point, returns one value per row.
function f(x::Matrix{T}, a::Vector{Float64}) where {T<:Real}
    f_out = zeros(promote_type(T, Float64), size(x, 1))
    for xi in eachcol(x)
        for (i, ai) in enumerate(a)
            if i % 2 == 0
                f_out .*= ai
            else
                f_out .+= abs.(xi)
            end
        end
    end
    return f_out
end

function run_opts(d::Int64)
    Random.seed!(42)
    x_init = ones(Float64, d)
    a_vec = [1., 2., 3., 4., 5.]
    display(f(x_init, a_vec))                   # sanity check of the vector version
    x_rand_mat = rand(100, d)                   # 100 random points, one per row
    x_rand = x_rand_mat[1, :]
    display(@benchmark f($x_rand, $a_vec))      # vector version, one point
    display(@benchmark f($x_rand_mat, $a_vec))  # matrix version, 100 points at once
    opt = Optim.optimize(x -> f(x, a_vec), x_init, LBFGS())
    return opt
end
run_opts(2)

gives

117.0
BenchmarkTools.Trial: 10000 samples with 1000 evaluations per sample.
 Range (min … max):  5.916 ns … 33.667 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     6.000 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   6.085 ns ±  0.621 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▃█▇▅▃                                                      ▁
  █████▆▁▁▅█▆▆▅▃▅▇▆▆▅▅▁▅▄▄▄▄▁▃▃▁▁▁▃▁▄▄▃▃▁▄▃▁▃▅▁▃▁▃▄▁▃▅▆▄▅▄▅▅ █
  5.92 ns      Histogram: log(frequency) by time     8.46 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
BenchmarkTools.Trial: 10000 samples with 829 evaluations per sample.
 Range (min … max):  153.649 ns …  40.801 μs  ┊ GC (min … max): 0.00% … 99.46%
 Time  (median):     160.887 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   174.660 ns ± 408.535 ns  ┊ GC (mean ± σ):  4.90% ±  7.39%

  █▆▇▄▄▃▃▃▃▁▁                                                   ▂
  ████████████▇▇▆▆▆▅▅▄▃▄▃▃▁▁▃▁▃▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▃▁▁▁▄▃▅▅▆▆▅▆ █
  154 ns        Histogram: log(frequency) by time        478 ns <

 Memory estimate: 896 bytes, allocs estimate: 1.
 * Status: success

 * Candidate solution
    Final objective value:     2.271107e-20

 * Found with
    Algorithm:     L-BFGS

 * Convergence measures
    |x - x'|               = 3.40e-07 ≰ 0.0e+00
    |x - x'|/|x'|          = 1.71e+15 ≰ 0.0e+00
    |f(x) - f(x')|         = 6.94e-06 ≰ 0.0e+00
    |f(x) - f(x')|/|f(x')| = 3.06e+14 ≰ 0.0e+00
    |g(x)|                 = 0.00e+00 ≤ 1.0e-08

 * Work counters
    Seconds run:   0  (vs limit Inf)
    Iterations:    20
    f(x) calls:    127
    ∇f(x) calls:   127


If you have n parameters, computing the gradient with finite differences or forward-mode AD has a cost proportional to n times the cost of one objective-function evaluation.
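
For instance, a bare-bones forward-difference gradient already needs n + 1 objective evaluations (just an illustration of the scaling, with a throwaway name, not what Optim actually runs internally):

# Forward-difference gradient: 1 baseline evaluation + n perturbed evaluations.
function naive_fd_gradient(obj, x; h = sqrt(eps()))
    f0 = obj(x)               # 1 call
    g = similar(x)
    for i in eachindex(x)     # n more calls, one per parameter
        xp = copy(x)
        xp[i] += h
        g[i] = (obj(xp) - f0) / h
    end
    return g
end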

However, if you use reverse-mode AD (or manual reverse mode / backpropagation / adjoint methods), the gradient costs only about one additional function evaluation, independent of n. That’s why most optimization uses reverse-mode differentiation algorithms.
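
For example, you can already hand Optim a reverse-mode gradient yourself. A sketch using Zygote (one possible backend; this assumes your objective is something Zygote can differentiate, and reuses f, a_vec, and x_init from the MWE above):

using Zygote   # reverse-mode AD backend (one of several options)

# One reverse pass gives the whole gradient, regardless of length(x).
function g!(G, x)
    G .= Zygote.gradient(x -> f(x, a_vec), x)[1]
    return G
end

opt = Optim.optimize(x -> f(x, a_vec), g!, x_init, LBFGS())

The cost of g! is then a small constant multiple of one call to f, independent of the number of parameters.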


Note that reverse-mode autodiff with arbitrary backends is ready to be integrated into Optim.jl. All it takes is a final review on this PR:

Gentle ping to @pkofod: sorry to be such a bother, but I think this could really be a blessing for the entire ecosystem.
