I have a B \ A
calculation I need to solve repeatedly (10^4β10^5 times) with
- the same
B
, - ideally without allocation, that includes the result
-
A
is a mutable buffer I want to write to - the elements of
A
may beForwardDiff.Dual
I narrowed down a solution that I like (factorize, go static), but I would appreciate suggestions, so I am posting self-contained benchmarks. Julia 1.8, latest package versions.
Some very simple benchmarks that reflect the dimensions of the problem.
Setup
using StaticArrays, BenchmarkTools, LinearAlgebra
n = 49
m = 6
T = Float64
B = rand(T, n, n);
A = rand(T, n, m);
B_lu = lu(B);
A_s = MMatrix{n,m}(A); # recall, I will construct A again and again
B_s = SMatrix{n,n}(B);
B_s_lu = lu(B_s);
# lu with overwritten A which ends up static
function f(::Val{M}, ::Val{N}, A, B) where {M,N}
ldiv!(B, A)
SMatrix{M,N}(A)
end
The benchmarks with Float64
julia> @benchmark $B \ $A # the naive \
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min β¦ max): 22.032 ΞΌs β¦ 3.351 ms β GC (min β¦ max): 0.00% β¦ 90.74%
Time (median): 24.108 ΞΌs β GC (median): 0.00%
Time (mean Β± Ο): 26.990 ΞΌs Β± 47.428 ΞΌs β GC (mean Β± Ο): 2.24% Β± 1.28%
ββββββββ
β
βββββββββββββ βββββββββ ββ ββββββ β
βββββββββββββββββββββββββββββββββββββββββββββββββββ
β
βββ
β
β
ββ β
22 ΞΌs Histogram: log(frequency) by time 46.2 ΞΌs <
Memory estimate: 21.73 KiB, allocs estimate: 4.
julia> @benchmark $B_lu \ $A # factorize
BenchmarkTools.Trial: 10000 samples with 7 evaluations.
Range (min β¦ max): 4.294 ΞΌs β¦ 796.204 ΞΌs β GC (min β¦ max): 0.00% β¦ 92.55%
Time (median): 5.023 ΞΌs β GC (median): 0.00%
Time (mean Β± Ο): 5.380 ΞΌs Β± 7.989 ΞΌs β GC (mean Β± Ο): 1.37% Β± 0.93%
ββββββββ
ββββ
βββββββββββββ
ββββββββββββββββββββββββββββββββββββββββββ β
4.29 ΞΌs Histogram: frequency by time 8.34 ΞΌs <
Memory estimate: 2.44 KiB, allocs estimate: 1.
julia> @benchmark $B_s_lu \ $A_s # static lu
BenchmarkTools.Trial: 10000 samples with 3 evaluations.
Range (min β¦ max): 8.937 ΞΌs β¦ 90.601 ΞΌs β GC (min β¦ max): 0.00% β¦ 0.00%
Time (median): 9.157 ΞΌs β GC (median): 0.00%
Time (mean Β± Ο): 9.725 ΞΌs Β± 1.634 ΞΌs β GC (mean Β± Ο): 0.00% Β± 0.00%
ββββ
βββββββββββββββββ βββ βββ β ββββ β
βββββββββββββββββββββββββββββββββββ
βββββββββββββββ
ββββββ
ββ β
8.94 ΞΌs Histogram: log(frequency) by time 14.5 ΞΌs <
Memory estimate: 2.38 KiB, allocs estimate: 1.
julia> @benchmark f($Val(m), $Val(n), C, B_lu) setup = begin C = copy(A) end # f
BenchmarkTools.Trial: 10000 samples with 6 evaluations.
Range (min β¦ max): 5.372 ΞΌs β¦ 877.653 ΞΌs β GC (min β¦ max): 0.00% β¦ 92.07%
Time (median): 5.996 ΞΌs β GC (median): 0.00%
Time (mean Β± Ο): 6.576 ΞΌs Β± 15.013 ΞΌs β GC (mean Β± Ο): 3.68% Β± 1.60%
βββββββ
ββ
ββββββββββββββ
βββββββββββββββββββββββββββββββββββββββββββββ β
5.37 ΞΌs Histogram: frequency by time 9.89 ΞΌs <
Memory estimate: 2.38 KiB, allocs estimate: 1.
The benchmarks with ForwardDiff.Dual
A bit of setup, then rerun the same code (incl setup above):
import ForwardDiff
_dual(x) = ForwardDiff.Dual(x, Tuple(randn(5)))
A = _dual.(A);
julia> @benchmark $B \ $A # the naive \
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min β¦ max): 115.994 ΞΌs β¦ 3.557 ms β GC (min β¦ max): 0.00% β¦ 84.93%
Time (median): 128.175 ΞΌs β GC (median): 0.00%
Time (mean Β± Ο): 136.982 ΞΌs Β± 64.630 ΞΌs β GC (mean Β± Ο): 1.25% Β± 2.80%
βββββ
ββββββββββββββββββββββββ β
βββββββββββββββββββββββββββββββββββββββββ
β
β
ββ
βββ
β
β
β
βββ
ββ
ββ
ββ β
116 ΞΌs Histogram: log(frequency) by time 247 ΞΌs <
Memory estimate: 145.84 KiB, allocs estimate: 6.
julia> @benchmark $B_lu \ $A # factorize
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min β¦ max): 96.735 ΞΌs β¦ 7.046 ms β GC (min β¦ max): 0.00% β¦ 84.50%
Time (median): 111.906 ΞΌs β GC (median): 0.00%
Time (mean Β± Ο): 121.081 ΞΌs Β± 90.169 ΞΌs β GC (mean Β± Ο): 1.71% Β± 2.82%
β
ββββββββββββββββββββββββ β
βββββββββββββββββββββββββββββββββββ
β
β
βββ
β
ββββ
βββ
ββ
βββββ
β
ββ
ββ β
96.7 ΞΌs Histogram: log(frequency) by time 251 ΞΌs <
Memory estimate: 126.55 KiB, allocs estimate: 3.
julia> @benchmark $B_s_lu \ $A_s # static lu
BenchmarkTools.Trial: 10000 samples with 3 evaluations.
Range (min β¦ max): 8.938 ΞΌs β¦ 183.102 ΞΌs β GC (min β¦ max): 0.00% β¦ 0.00%
Time (median): 9.117 ΞΌs β GC (median): 0.00%
Time (mean Β± Ο): 9.752 ΞΌs Β± 2.930 ΞΌs β GC (mean Β± Ο): 0.00% Β± 0.00%
β
βββββββββ βββββββββββ ββ β β ββ β
βββββββββββββββββββββββββββββββββββββββββββββββββββββ
β
ββ
β
ββ
β
8.94 ΞΌs Histogram: log(frequency) by time 14.1 ΞΌs <
Memory estimate: 2.38 KiB, allocs estimate: 1.
julia> @benchmark f($Val(m), $Val(n), C, B_lu) setup = begin C = copy(A) end # f
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min β¦ max): 27.163 ΞΌs β¦ 2.311 ms β GC (min β¦ max): 0.00% β¦ 89.31%
Time (median): 28.529 ΞΌs β GC (median): 0.00%
Time (mean Β± Ο): 30.160 ΞΌs Β± 23.206 ΞΌs β GC (mean Β± Ο): 0.68% Β± 0.89%
βββββ
β
ββββββββββ ββ ββ β β βββ β
ββββββββββββββββββββββββββββββββββββββββββββββ
βββββββ
ββ
ββ
ββ
β
27.2 ΞΌs Histogram: log(frequency) by time 46 ΞΌs <
Memory estimate: 13.88 KiB, allocs estimate: 1.