I was surprised to find that the LAPACK and BLAS triangular solves do not return identical results. `UpperTriangular(A)\b` matches LAPACK bit for bit, but not BLAS:
julia> using LinearAlgebra
julia> A = randn(5,5); b = randn(5); all((UpperTriangular(A)\b) .=== BLAS.trsv('U', 'N', 'N', A, b))
false
julia> A = randn(5,5); b = randn(5); all((UpperTriangular(A)\b) .=== LAPACK.trtrs!('U', 'N', 'N', A, copy(b)))
true
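Since `.===` on floats is a bitwise comparison, `false` above only says the two results are not bit-identical; it does not say how far apart they are. Here is a rough sketch of how one might measure the gap. The diagonal shift `+ 10I` is my own assumption to keep the system well conditioned (the check above used plain `randn(5,5)`), and the variable names are just for illustration.

```julia
using LinearAlgebra

# Assumption: shift the diagonal so the triangular system is well conditioned;
# the original check used plain randn(5,5).
A = randn(5, 5) + 10I
b = randn(5)

x_backslash = UpperTriangular(A) \ b
x_blas      = BLAS.trsv('U', 'N', 'N', A, b)            # non-mutating BLAS solve
x_lapack    = LAPACK.trtrs!('U', 'N', 'N', A, copy(b))  # overwrites its last argument

# Bitwise comparison, as in the original check
println("BLAS   bitwise equal to \\: ", all(x_backslash .=== x_blas))
println("LAPACK bitwise equal to \\: ", all(x_backslash .=== x_lapack))

# Size of the disagreement relative to the solution
rel_diff = norm(x_blas - x_lapack) / norm(x_lapack)
println("relative difference BLAS vs LAPACK: ", rel_diff)
println("approximately equal: ", x_blas ≈ x_lapack)
```

If the relative difference is near machine epsilon, the two routines probably just order their floating-point operations differently, though this sketch alone does not confirm that.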
Does anyone know why `\` for triangular matrices uses the LAPACK routine rather than the BLAS one? In the benchmarks below, the BLAS version is roughly 3x faster (median about 103 μs vs. 366 μs for n = 1000):
julia> using BenchmarkTools
julia> n = 1000; A = randn(n,n); b = randn(n); @benchmark BLAS.trsv!('U', 'N', 'N', A, copy(b))
BenchmarkTools.Trial:
memory estimate: 7.94 KiB
allocs estimate: 1
--------------
minimum time: 97.687 μs (0.00% GC)
median time: 103.377 μs (0.00% GC)
mean time: 107.281 μs (0.00% GC)
maximum time: 380.694 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
julia> n = 1000; A = randn(n,n); b = randn(n); @benchmark LAPACK.trtrs!('U', 'N', 'N', A, copy(b))
BenchmarkTools.Trial:
memory estimate: 7.94 KiB
allocs estimate: 1
--------------
minimum time: 314.879 μs (0.00% GC)
median time: 366.067 μs (0.00% GC)
mean time: 375.021 μs (0.00% GC)
maximum time: 1.174 ms (0.00% GC)
--------------
samples: 10000
evals/sample: 1
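Both benchmarks above include the `copy(b)` allocation in the timed call. To double-check the ratio without it, something like the following standard-library-only sketch could be used. `min_time`, `blas_solve!` and `lapack_solve!` are hypothetical helper names, the `+ n * I` shift is an assumption to keep the system well conditioned, and this is a crude cross-check rather than a replacement for `@benchmark`.

```julia
using LinearAlgebra

# Hypothetical helper: smallest wall-clock time (in seconds) of f(A, x) over
# `reps` runs, where x is a fresh copy of b made outside the timed region.
function min_time(f, A, b; reps = 200)
    best = Inf
    for _ in 1:reps
        x = copy(b)            # the in-place solvers overwrite x
        t = @elapsed f(A, x)
        best = min(best, t)
    end
    return best
end

n = 1000
A = randn(n, n) + n * I        # assumption: diagonal shift for conditioning
b = randn(n)

blas_solve!(A, x)   = BLAS.trsv!('U', 'N', 'N', A, x)
lapack_solve!(A, x) = LAPACK.trtrs!('U', 'N', 'N', A, x)

t_blas   = min_time(blas_solve!, A, b)
t_lapack = min_time(lapack_solve!, A, b)

println("BLAS trsv!   min time (s): ", t_blas)
println("LAPACK trtrs! min time (s): ", t_lapack)
println("ratio LAPACK / BLAS: ", t_lapack / t_blas)
```

Taking the minimum over many runs discards the first call's compilation time and most scheduling noise; the ratio will still depend on the BLAS/LAPACK build and thread settings.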
julia> versioninfo()
Julia Version 1.0.0
Commit 5d4eaca0c9 (2018-08-08 20:58 UTC)
Platform Info:
OS: macOS (x86_64-apple-darwin14.5.0)
CPU: Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-6.0.0 (ORCJIT, skylake)