Optim finite differences: use a vectorized version?

Hi, I’m in an optimization setting where I have to minimize a function f. I have both a “vector version” and a “matrix version” of f, such that calling the vector version N times is typically slower than calling the matrix version once with N rows. My question is whether I can use the matrix version to compute finite differences more efficiently when using Optim.

See the following MWE to get the main idea: running the vector version once costs about 6 ns, while running the matrix version with 100 rows costs about 160 ns, i.e. roughly 25x more expensive while computing 100x more values! It seems to me that a good finite-difference stencil (e.g. for the Hessian) could therefore be computed more cheaply with the matrix version than with repeated calls to the vector version. Is this correct, and if so, is there a way to make Optim use this? (A rough sketch of what I have in mind follows the output below.)

(I’ve looked into automatic differentiation for my application; it seems to be roughly on par with finite differences. I can probably squeeze out some improvement there, but that would probably take more man-hours than I can afford at this point.)

using BenchmarkTools, LinearAlgebra, Optim, Random

# Scalar ("vector") version: x is a single point, returns one value.
function f(x::Vector{T}, a::Vector{Float64}) where {T<:Real}
    f_out = zero(promote_type(T, Float64))
    for xi in x
        for (i, ai) in enumerate(a)
            if i % 2 == 0
                f_out *= ai
            else
                f_out += abs(xi)
            end
        end
    end
    return f_out
end

# Batched ("matrix") version: each row of x is one point, returns one value per row.
function f(x::Matrix{T}, a::Vector{Float64}) where {T<:Real}
    f_out = zeros(promote_type(T, Float64), size(x, 1))
    for xi in eachcol(x)
        for (i, ai) in enumerate(a)
            if i % 2 == 0
                f_out .*= ai
            else
                f_out .+= abs.(xi)
            end
        end
    end
    return f_out
end

function run_opts(d::Int64)
    Random.seed!(42)
    x_init = ones(Float64, d)
    a_vec = [1., 2., 3., 4., 5.]
    display(f(x_init, a_vec))                   # sanity check of the vector version
    x_rand_mat = rand(100, d)                   # 100 random points, one per row
    x_rand = x_rand_mat[1, :]
    display(@benchmark f($x_rand, $a_vec))      # vector version, one point
    display(@benchmark f($x_rand_mat, $a_vec))  # matrix version, 100 points at once
    opt = Optim.optimize(x -> f(x, a_vec), x_init, LBFGS())
    return opt
end
run_opts(2)

gives

117.0
BenchmarkTools.Trial: 10000 samples with 1000 evaluations per sample.
 Range (min … max):  5.916 ns … 33.667 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     6.000 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   6.085 ns ±  0.621 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▃█▇▅▃                                                      ▁
  █████▆▁▁▅█▆▆▅▃▅▇▆▆▅▅▁▅▄▄▄▄▁▃▃▁▁▁▃▁▄▄▃▃▁▄▃▁▃▅▁▃▁▃▄▁▃▅▆▄▅▄▅▅ █
  5.92 ns      Histogram: log(frequency) by time     8.46 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
BenchmarkTools.Trial: 10000 samples with 829 evaluations per sample.
 Range (min … max):  153.649 ns …  40.801 μs  ┊ GC (min … max): 0.00% … 99.46%
 Time  (median):     160.887 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   174.660 ns ± 408.535 ns  ┊ GC (mean ± σ):  4.90% ±  7.39%

  █▆▇▄▄▃▃▃▃▁▁                                                   ▂
  ████████████▇▇▆▆▆▅▅▄▃▄▃▃▁▁▃▁▃▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▃▁▁▁▄▃▅▅▆▆▅▆ █
  154 ns        Histogram: log(frequency) by time        478 ns <

 Memory estimate: 896 bytes, allocs estimate: 1.
 * Status: success

 * Candidate solution
    Final objective value:     2.271107e-20

 * Found with
    Algorithm:     L-BFGS

 * Convergence measures
    |x - x'|               = 3.40e-07 ≰ 0.0e+00
    |x - x'|/|x'|          = 1.71e+15 ≰ 0.0e+00
    |f(x) - f(x')|         = 6.94e-06 ≰ 0.0e+00
    |f(x) - f(x')|/|f(x')| = 3.06e+14 ≰ 0.0e+00
    |g(x)|                 = 0.00e+00 ≤ 1.0e-08

 * Work counters
    Seconds run:   0  (vs limit Inf)
    Iterations:    20
    f(x) calls:    127
    ∇f(x) calls:   127


If you have n parameters, computing the gradient with finite differences or forward-mode AD has a cost proportional to n times the cost of one objective-function evaluation.
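
For instance, a bare-bones forward-difference gradient already needs n + 1 objective evaluations (just an illustration of the scaling, with a throwaway name, not what Optim actually runs internally):

# Forward-difference gradient: 1 baseline evaluation + n perturbed evaluations.
function naive_fd_gradient(obj, x; h = sqrt(eps()))
    f0 = obj(x)               # 1 call
    g = similar(x)
    for i in eachindex(x)     # n more calls, one per parameter
        xp = copy(x)
        xp[i] += h
        g[i] = (obj(xp) - f0) / h
    end
    return g
end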

However, if you use reverse-mode AD (or manual reverse mode / backpropagation / adjoint methods), the gradient costs only about one additional function evaluation, independent of n. That’s why most optimization uses reverse-mode differentiation algorithms.
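
For example, you can already hand Optim a reverse-mode gradient yourself. A sketch using Zygote (one possible backend; this assumes your objective is something Zygote can differentiate, and reuses f, a_vec, and x_init from the MWE above):

using Zygote   # reverse-mode AD backend (one of several options)

# One reverse pass gives the whole gradient, regardless of length(x).
function g!(G, x)
    G .= Zygote.gradient(x -> f(x, a_vec), x)[1]
    return G
end

opt = Optim.optimize(x -> f(x, a_vec), g!, x_init, LBFGS())

The cost of g! is then a small constant multiple of one call to f, independent of the number of parameters.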


Note that reverse-mode autodiff with arbitrary backends is ready to be integrated into Optim.jl. All it takes is a final review on this PR:

Gentle ping to @pkofod: sorry to be such a bother, but I think this could really be a blessing for the entire ecosystem.
