Slow L-BFGS code

I’m customizing the Burer-Monteiro method for a specific problem, which involves solving many subproblems with L-BFGS. However, I noticed that my L-BFGS runs slowly: the critical operation of computing the descent direction is about 2x slower than I expected.

There are two main performance-critical operations in my descent-direction computation: one is dot(A, B), where A and B are matrices, and the other is @. A += alpha * B, i.e. a daxpy operation. I benchmarked them separately and found that computing the descent direction takes about twice as long as the sum of the running times of all performance-critical operations.

Here is an MWE:

using LinearAlgebra, SparseArrays, BenchmarkTools
using Random


# seed for reproducibility
Random.seed!(11235813)

"""
Vector of L-BFGS
"""
struct LBFGSVector{T <: AbstractFloat}
    # notice that we use matrix instead 
    # of vector to store s and y because our
    # decision variables are matrices
    # s = xβ‚–β‚Šβ‚ - xβ‚– 
    s::Matrix{T}
    # y = βˆ‡ f(xβ‚–β‚Šβ‚) - βˆ‡ f(xβ‚–)
    y::Matrix{T}
    # ρ = 1/(⟨y, s⟩)
    ρ::Base.RefValue{T}
    # temporary variable
    a::Base.RefValue{T}
end

"""
History of l-bfgs vectors
"""
struct LBFGSHistory{Ti <: Integer, Tv <: AbstractFloat}
    # number of l-bfgs vectors
    m::Ti
    vecs::Vector{LBFGSVector{Tv}}
    # the index of the latest l-bfgs vector
    # we use a cyclic array to store l-bfgs vectors
    latest::Base.RefValue{Ti}
end


Base.length(lbfgshis::LBFGSHistory) = lbfgshis.m


"""
Computing the descent direction, here I omit some details like 
negating the direction to highlight the main performance issue.
"""
function LBFGS_dir!(
    dir::Matrix{Tv},
    lbfgshis::LBFGSHistory{Ti, Tv},
) where {Ti <: Integer, Tv <: AbstractFloat}
    m = lbfgshis.m
    lst = lbfgshis.latest[]
    #here, dir, s and y are all matrices
    j = lst
    for i = 1:m 
        α = lbfgshis.vecs[j].ρ[] * dot(lbfgshis.vecs[j].s, dir)
        @. dir -= lbfgshis.vecs[j].y * α
        lbfgshis.vecs[j].a[] = α
        j -= 1
        if j == 0
            j = m
        end
    end

    j = mod(lst, m) + 1
    for i = 1:m 
        β = lbfgshis.vecs[j].ρ[] * dot(lbfgshis.vecs[j].y, dir)
        γ = lbfgshis.vecs[j].a[] - β
        @. dir += lbfgshis.vecs[j].s * γ
        j += 1
        if j == m + 1
            j = 1
        end
    end
end

# Benchmark code
numlbfgsvecs = 4 
n = 8000
r = 41
R = randn(n, r)
dir = randn(n, r)
lbfgshis = LBFGSHistory{Int64, Float64}(numlbfgsvecs, LBFGSVector{Float64}[], Ref(numlbfgsvecs))

for i = 1:numlbfgsvecs
    push!(lbfgshis.vecs, 
        LBFGSVector(similar(R), similar(R), Ref(randn(Float64)), Ref(randn(Float64))))
end

@benchmark LBFGS_dir!($dir, $lbfgshis)

My benchmark results look like:

BenchmarkTools.Trial: 864 samples with 1 evaluation.
 Range (min … max):  5.739 ms …  11.223 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     5.775 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   5.785 ms ± 187.210 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  5.74 ms         Histogram: frequency by time        5.84 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

I also benchmarked the critical operations separately; the results are below.

@benchmark dot($R, $dir)

BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  11.322 μs … 257.907 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     13.815 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   14.321 μs ±   3.010 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  11.3 μs         Histogram: frequency by time         18.5 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

function operator2!(
    C::Matrix{Tv},
    A::Matrix{Tv},
    alpha::Tv,
) where {Tv <: AbstractFloat}
    @. C += alpha * A
end


function operator3!(
    C::Matrix{Tv},
    A::Matrix{Tv},
    alpha::Tv,
) where {Tv <: AbstractFloat}
    @. C -= alpha * A
end

alpha = randn(Float64)

@benchmark operator2!($dir, $R, $alpha)

BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  307.719 μs … 881.443 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     308.976 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   309.498 μs ±   7.404 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  308 μs        Histogram: log(frequency) by time        315 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

@benchmark operator3!($dir, $R, $alpha)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  307.490 μs …  4.655 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     308.765 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   309.785 μs ± 43.610 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  307 μs        Histogram: log(frequency) by time       316 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Since my code does 8 dots and 8 daxpys, I would expect the total running time to be slightly more than 2.5 ms (8 × ~14 μs + 8 × ~310 μs ≈ 2.6 ms), maybe 3 ms. But it is 5.775 ms. I suspect it’s because of the way I declare and instantiate LBFGSHistory, but I’m not quite sure.

Any advice would be really helpful.

Did you profile the code to see which other functions you spend time in?
When I do, nearly all of the time is spent in these two lines:

@. dir -= lbfgshis.vecs[j].y * α
@. dir += lbfgshis.vecs[j].s * γ

Replacing these operations with the more optimized

LinearAlgebra.axpy!(-α, dir, lbfgshis.vecs[j].y)
LinearAlgebra.axpy!(γ, dir, lbfgshis.vecs[j].s)

yields a significant speedup (about 3x).


Is the problem here that this code is missing the fastmath flag needed to vectorize and use fma?

My main concern was that daxpy is the most time-consuming operation, and I do exactly 8 of them. I benchmarked the daxpy as I wrote it, which took about 370 μs on two 8000 × 41 matrices, but the whole procedure takes > 5 ms, which is far more than 8 × 370 μs, so I was confused.

Thanks for the tip! The last two arguments of axpy! need to be exchanged.
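
For anyone finding this later, the corrected calls would look something like this (a minimal sketch using my field names; axpy!(a, X, Y) updates Y in place as Y .+= a .* X, so the array being modified goes last):

LinearAlgebra.axpy!(-α, lbfgshis.vecs[j].y, dir)  # dir .-= α .* y
LinearAlgebra.axpy!(γ, lbfgshis.vecs[j].s, dir)   # dir .+= γ .* s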


No, it looks like it’s that axpy! is multithreaded. If I do:

using BenchmarkTools
using LinearAlgebra: axpy!

function foo1!(dir, α, γ, y, s)
    return @. dir += s * γ - y * α
end

@fastmath function foo2!(dir, α, γ, y, s)
    return @. dir += s * γ - y * α
end

function foo3!(dir, α, γ, y, s)
    return axpy!(γ, s, axpy!(-α, y, dir))
end

dir = zeros(8000*41); y = copy(dir); s = copy(dir);

it gives:

julia> @btime foo1!($dir, 0.1, 0.2, $y, $s);
  162.673 μs (0 allocations: 0 bytes)

julia> @btime foo2!($dir, 0.1, 0.2, $y, $s); # uses @fastmath
  158.207 μs (0 allocations: 0 bytes)

julia> @btime foo3!($dir, 0.1, 0.2, $y, $s); # uses axpy!
  85.101 μs (0 allocations: 0 bytes)

This is with the default LinearAlgebra.BLAS.get_num_threads() == 3 on my computer. However, if I turn off BLAS multi-threading, it is slower:

julia> LinearAlgebra.BLAS.set_num_threads(1);

julia> @btime foo3!($dir, 0.1, 0.2, $y, $s);
  221.990 μs (0 allocations: 0 bytes)

probably because foo1! and foo2! use a single fused loop, unlike axpy!, which requires two loops (and unlike the original code by @yhuang above).

If I use LoopVectorization.jl to multi-thread it:

using LoopVectorization

function foo4!(dir, α, γ, y, s)
    length(dir) == length(y) == length(s) || throw(DimensionMismatch())
    @tturbo for i = 1:length(dir)
        dir[i] += s[i] * γ - y[i] * α
    end
    return dir
end

then (with julia -t 3 to also use 3 threads), I get:

julia> @btime foo4!($dir, 0.1, 0.2, $y, $s);
  59.584 μs (0 allocations: 0 bytes)

which is the fastest yet.
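
The same pattern could presumably be dropped straight into the loops of LBFGS_dir!; here is a rough, untested sketch (the helper name lbfgs_axpy! is just for illustration, and the matrices are indexed linearly, as in foo4! above):

using LoopVectorization

# hypothetical helper: in-place dir += a * A, written as an explicit loop
# so @tturbo can vectorize and multi-thread it
function lbfgs_axpy!(dir, A, a)
    length(dir) == length(A) || throw(DimensionMismatch())
    @tturbo for i = 1:length(dir)
        dir[i] += a * A[i]
    end
    return dir
end

# inside the two-loop recursion, the updates would then become:
#   lbfgs_axpy!(dir, lbfgshis.vecs[j].y, -α)
#   lbfgs_axpy!(dir, lbfgshis.vecs[j].s, γ)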
