I'm customizing the Burer-Monteiro method for a specific problem, which involves solving many subproblems with L-BFGS. However, I noticed that my L-BFGS runs slowly: the critical operation of computing the descent direction is about 2x slower than I expected.
There are mainly two performance-critical operations in my descent-direction computation: one is `dot(A, B)`, where `A` and `B` are matrices, and the other is `@. A += alpha * B`, in other words a `daxpy` operation. I benchmarked them separately and found that the running time of computing the descent direction is about 2x the sum of the running times of all performance-critical operations.
Here is an MWE:

```julia
using LinearAlgebra, SparseArrays, BenchmarkTools
using Random

# for reproducibility
Random.seed!(11235813)

"""
Vector of L-BFGS
"""
struct LBFGSVector{T <: AbstractFloat}
    # note that we use matrices instead of
    # vectors to store s and y because our
    # decision variables are matrices
    # s = xₖ₊₁ - xₖ
    s::Matrix{T}
    # y = ∇f(xₖ₊₁) - ∇f(xₖ)
    y::Matrix{T}
    # ρ = 1 / ⟨y, s⟩
    ρ::Base.RefValue{T}
    # temporary variable
    a::Base.RefValue{T}
end

"""
History of L-BFGS vectors
"""
struct LBFGSHistory{Ti <: Integer, Tv <: AbstractFloat}
    # number of L-BFGS vectors
    m::Ti
    vecs::Vector{LBFGSVector{Tv}}
    # index of the latest L-BFGS vector;
    # we use a cyclic array to store the vectors
    latest::Base.RefValue{Ti}
end

Base.length(lbfgshis::LBFGSHistory) = lbfgshis.m

"""
Compute the descent direction. Some details, such as negating
the direction, are omitted to highlight the main performance issue.
"""
function LBFGS_dir!(
    dir::Matrix{Tv},
    lbfgshis::LBFGSHistory{Ti, Tv},
) where {Ti <: Integer, Tv <: AbstractFloat}
    m = lbfgshis.m
    lst = lbfgshis.latest[]
    # here, dir, s and y are all matrices
    j = lst
    for i = 1:m
        α = lbfgshis.vecs[j].ρ[] * dot(lbfgshis.vecs[j].s, dir)
        @. dir -= lbfgshis.vecs[j].y * α
        lbfgshis.vecs[j].a[] = α
        j -= 1
        if j == 0
            j = m
        end
    end
    j = mod(lst, m) + 1
    for i = 1:m
        β = lbfgshis.vecs[j].ρ[] * dot(lbfgshis.vecs[j].y, dir)
        γ = lbfgshis.vecs[j].a[] - β
        @. dir += lbfgshis.vecs[j].s * γ
        j += 1
        if j == m + 1
            j = 1
        end
    end
end

# benchmark setup
numlbfgsvecs = 4
n = 8000
r = 41
R = randn(n, r)
dir = randn(n, r)
lbfgshis = LBFGSHistory{Int64, Float64}(numlbfgsvecs, LBFGSVector{Float64}[], Ref(numlbfgsvecs))
for i = 1:numlbfgsvecs
    push!(lbfgshis.vecs,
        LBFGSVector(similar(R), similar(R), Ref(randn(Float64)), Ref(randn(Float64))))
end
@benchmark LBFGS_dir!($dir, $lbfgshis)
```
My benchmark results look like:

```
BenchmarkTools.Trial: 864 samples with 1 evaluation.
 Range (min … max):  5.739 ms … 11.223 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     5.775 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   5.785 ms ± 187.210 μs ┊ GC (mean ± σ):  0.00% ± 0.00%

 [histogram omitted]
 5.74 ms         Histogram: frequency by time         5.84 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.
```
I also benchmarked the critical operations separately; below are the results.

```julia
@benchmark dot($R, $dir)
```

```
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  11.322 μs … 257.907 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     13.815 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   14.321 μs ±   3.010 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

 [histogram omitted]
 11.3 μs         Histogram: frequency by time         18.5 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
```
```julia
function operator2!(
    C::Matrix{Tv},
    A::Matrix{Tv},
    alpha::Tv,
) where {Tv <: AbstractFloat}
    @. C += alpha * A
end

function operator3!(
    C::Matrix{Tv},
    A::Matrix{Tv},
    alpha::Tv,
) where {Tv <: AbstractFloat}
    @. C -= alpha * A
end

alpha = randn(Float64)
@benchmark operator2!($dir, $R, $alpha)
```

```
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  307.719 μs … 881.443 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     308.976 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   309.498 μs ±   7.404 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

 [histogram omitted]
 308 μs          Histogram: log(frequency) by time          315 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
```
```julia
@benchmark operator3!($dir, $R, $alpha)
```

```
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  307.490 μs …   4.655 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     308.765 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   309.785 μs ±  43.610 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

 [histogram omitted]
 307 μs          Histogram: log(frequency) by time          316 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
```
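For reference, here is a minimal sketch (with throwaway names `R2`, `dir_bc`, `dir_ax`, not from my solver) checking that the broadcast update in `operator2!` matches a direct `LinearAlgebra.axpy!` call, which dispatches to BLAS `daxpy` for dense `Float64` arrays:

```julia
using LinearAlgebra

# same sizes as the benchmark above
n, r = 8000, 41
R2 = randn(n, r)
dir_bc = randn(n, r)
dir_ax = copy(dir_bc)
a = randn()

@. dir_bc += a * R2    # broadcast version, as in operator2!
axpy!(a, R2, dir_ax)   # BLAS version; axpy! accepts dense matrices

dir_bc ≈ dir_ax        # the two in-place updates agree
```

Both versions are in-place and allocation-free, so I would expect them to be similarly memory-bound at this size.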
Since my code does 8 `dot` and 8 `daxpy` operations, I would expect its running time to be slightly more than 2.5 ms, maybe 3 ms. But it is actually 5.775 ms. I suspect it is because of the way I declare and instantiate `LBFGSHistory`, but I'm not quite sure.
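To spell out where the 2.5 ms expectation comes from (simple arithmetic over the medians reported above):

```julia
# per-operation medians from the benchmarks above, in ms
t_dot  = 13.815e-3    # dot(R, dir)
t_axpy = 308.976e-3   # operator2! / operator3!

# one call to LBFGS_dir! does 8 dots and 8 daxpy-style updates
expected = 8 * (t_dot + t_axpy)   # ≈ 2.58 ms
measured = 5.775                  # ms, median of LBFGS_dir!
measured / expected               # ≈ 2.2x
```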
Any advice would be really helpful.