I was surprised to find out that LAPACK and BLAS have different triangular solves:
julia> A = randn(5,5); b = randn(5); all((UpperTriangular(A)\b) .=== BLAS.trsv('U', 'N', 'N', A, b))
false
julia> A = randn(5,5); b = randn(5); all((UpperTriangular(A)\b) .=== LAPACK.trtrs!('U', 'N', 'N', A, copy(b)))
true
I’m curious whether anyone knows why triangular solves use the LAPACK version rather than the BLAS version? The BLAS version appears to be about 3x faster:
julia> n = 1000; A = randn(n,n); b = randn(n); @benchmark BLAS.trsv!('U', 'N', 'N', A, copy(b))
BenchmarkTools.Trial:
  memory estimate:  7.94 KiB
  allocs estimate:  1
  --------------
  minimum time:     97.687 μs (0.00% GC)
  median time:      103.377 μs (0.00% GC)
  mean time:        107.281 μs (0.00% GC)
  maximum time:     380.694 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1
julia> n = 1000; A = randn(n,n); b = randn(n); @benchmark LAPACK.trtrs!('U', 'N', 'N', A, copy(b))
BenchmarkTools.Trial:
  memory estimate:  7.94 KiB
  allocs estimate:  1
  --------------
  minimum time:     314.879 μs (0.00% GC)
  median time:      366.067 μs (0.00% GC)
  mean time:        375.021 μs (0.00% GC)
  maximum time:     1.174 ms (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1
julia> versioninfo()
Julia Version 1.0.0
Commit 5d4eaca0c9 (2018-08-08 20:58 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin14.5.0)
  CPU: Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.0 (ORCJIT, skylake)
trtrs is trsv plus a check for singularity. BLAS doesn’t support reporting singularity (there is no info argument), so trsv will just go ahead and divide by zero. When I wrote the triangular code, I thought it was better to report the singularity. It makes sense that it is a bit slower, but that should matter mainly for small matrices, since the check only has to pass through the diagonal.
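For intuition, the relationship is roughly the following sketch (my own illustration, not the actual stdlib code; checked_trsv! is a made-up name, and it ignores the unit-diagonal case diag == 'U' for brevity):

using LinearAlgebra

function checked_trsv!(uplo::Char, trans::Char, diag::Char,
                       A::StridedMatrix{Float64}, b::StridedVector{Float64})
    # O(n) pass over the diagonal, analogous to the info check in LAPACK's trtrs
    for i in 1:size(A, 1)
        iszero(A[i, i]) && throw(SingularException(i))
    end
    BLAS.trsv!(uplo, trans, diag, A, b)  # O(n^2) BLAS solve, no zero check
end

So the check itself is O(n) against an O(n^2) solve, which is why it should be noticeable mainly for small matrices.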
I can see the logic.
Could we have a macro for a low-overhead mode?
Something in the sense of Debug vs. Release builds.
So if I feel my code is mature enough, I could raise a flag (hopefully one that becomes a Julia standard) and all functions would switch to a mode that skips those sanity checks. A rough sketch of the idea is below.
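Nothing like this exists in Julia today; the following is only a hypothetical sketch of what such a user-level switch could look like (the flag, the macro name @nochecks, and checked_solve! are all made up for illustration):

using LinearAlgebra

const SKIP_SANITY_CHECKS = Ref(false)  # hypothetical global toggle

macro nochecks(ex)
    # run an expression with the sanity checks disabled, restoring them afterwards
    quote
        SKIP_SANITY_CHECKS[] = true
        try
            $(esc(ex))
        finally
            SKIP_SANITY_CHECKS[] = false
        end
    end
end

function checked_solve!(A::StridedMatrix{Float64}, b::StridedVector{Float64})
    if !SKIP_SANITY_CHECKS[]  # "Debug" mode keeps the O(n) singularity check
        for i in 1:size(A, 1)
            iszero(A[i, i]) && throw(SingularException(i))
        end
    end
    BLAS.trsv!('U', 'N', 'N', A, b)
end

x = @nochecks checked_solve!(A, copy(b))  # "Release" mode: check skipped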
That argument isn’t valid for @inbounds, though: the cost of checking for singularity grows like n for n^2 entries, while the cost of bounds checking grows like n^2 for n^2 entries.
FWIW, the “use StaticArrays for small arrays” argument is valid but not always easy to set up: you often don’t know the size of your arrays in advance, StaticArrays has limited functionality, and writing code that is agnostic to the array type is tricky (especially for higher-dimensional arrays). It is always a good idea for Julia routines to behave as safely as possible by default, but sometimes you know that your code works fine and just want to speed it up, so you run with -O3 --check-bounds=no --math-mode=fast. It would be nice to call the faster routine under fast-math, but I don’t think fast-math is allowed to assume no division by zero?
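For reference, the StaticArrays route for a small fixed-size solve looks something like this (assuming the StaticArrays.jl package; the solve is unrolled at compile time and never calls BLAS or LAPACK):

using StaticArrays, LinearAlgebra

A = @SMatrix randn(4, 4)    # size known at compile time
b = @SVector randn(4)
x = UpperTriangular(A) \ b  # specialized static solve, no library call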
Also, there seems to be something fishy, because n = 1000 is pretty big, and the diagonal check alone should be far cheaper than the observed 3x gap:
using LinearAlgebra
using BenchmarkTools
n = 1000; A = randn(n, n); b = randn(n);
@btime BLAS.trsv!('U', 'N', 'N', $A, copy($b))     # plain BLAS solve
@btime LAPACK.trtrs!('U', 'N', 'N', $A, copy($b))  # solve plus singularity check
@btime all(diag($A) .!= 0.0)                       # rough cost of the check alone