Why `LAPACK.trtrs!`, not `BLAS.trsv!`?

I was surprised to find out LAPACK and BLAS have different triangular solves:

julia> A = randn(5,5); b = randn(5); all((UpperTriangular(A)\b) .=== BLAS.trsv('U', 'N', 'N', A, b))
false

julia> A = randn(5,5); b = randn(5); all((UpperTriangular(A)\b) .=== LAPACK.trtrs!('U', 'N', 'N', A, copy(b)))
true

I’m curious if anyone knows why triangular solves use the LAPACK version, not the BLAS version? The BLAS version looks to be 3x faster:

julia> n = 1000; A = randn(n,n); b = randn(n); @benchmark BLAS.trsv!('U', 'N', 'N', A, copy(b))
BenchmarkTools.Trial: 
  memory estimate:  7.94 KiB
  allocs estimate:  1
  --------------
  minimum time:     97.687 μs (0.00% GC)
  median time:      103.377 μs (0.00% GC)
  mean time:        107.281 μs (0.00% GC)
  maximum time:     380.694 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> n = 1000; A = randn(n,n); b = randn(n); @benchmark LAPACK.trtrs!('U', 'N', 'N', A, copy(b))
BenchmarkTools.Trial: 
  memory estimate:  7.94 KiB
  allocs estimate:  1
  --------------
  minimum time:     314.879 μs (0.00% GC)
  median time:      366.067 μs (0.00% GC)
  mean time:        375.021 μs (0.00% GC)
  maximum time:     1.174 ms (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> versioninfo()
Julia Version 1.0.0
Commit 5d4eaca0c9 (2018-08-08 20:58 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin14.5.0)
  CPU: Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.0 (ORCJIT, skylake)

trtrs is trsv plus a check for singularity. BLAS doesn’t support reporting singularity (there is no `info` argument), so trsv will just go ahead and divide by zero. When I wrote the triangular code, I thought it was better to report the singularity. It makes sense that it is a bit slower, but mainly for small matrices, since the check only has to pass through the diagonal.
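To make the difference concrete, here is a small sketch (my own example, not from the thread) of what each routine does when the triangular matrix has a zero pivot; if I read the Julia wrapper right, trtrs!’s nonzero `info` is turned into a `SingularException`:

```julia
using LinearAlgebra

A = [1.0 2.0;
     0.0 0.0]            # upper triangular with a zero on the diagonal
b = [1.0, 1.0]

# BLAS has no info argument, so it silently divides by zero:
x = BLAS.trsv('U', 'N', 'N', A, b)
# x contains Inf/NaN entries

# LAPACK reports the singular pivot instead (assuming the wrapper
# converts info > 0 into a SingularException):
try
    LAPACK.trtrs!('U', 'N', 'N', A, copy(b))
catch err
    err
end
```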


On a related topic, do you know why I get this warning:

julia> n = 10000; A = randn(n,n); b = randn(n); BLAS.trmv!('U', 'N', 'N', A, b);
WARNING unrolling of the trmv_U loop may give wrong results

I googled for that warning but nothing came up.

I haven’t seen it before but I’m able to reproduce and I was able to find the source. It’s from OpenBLAS. More specifically https://github.com/xianyi/OpenBLAS/blob/e11126b26ada8d97b4a522e461ca92311653bfc6/driver/level2/trmv_U.c#L66

Interesting. This seems like a bug, so maybe I’ll file an OpenBLAS issue? Though perhaps it should be checked with Apple BLAS or MKL first.

I can see the logic.
Could we have a macro for a low-overhead mode?

Something in the sense of Debug vs. Release builds.
If I feel my code is mature enough, I can raise a flag (hopefully one that becomes a Julia standard) and all functions switch to a mode that skips these “sanity checks”.

Just checked with Julia+MKL. No such warning. (I could reproduce as well for Julia+OpenBLAS)


I don’t think it is worth the trouble. If you work with matrices so small that the check matters, you’d be better off with StaticArrays.
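For example (my own sketch, assuming StaticArrays’ solve methods for wrapped static matrices), a small fixed-size solve avoids the BLAS/LAPACK call entirely:

```julia
using StaticArrays, LinearAlgebra

A = @SMatrix randn(4, 4)
b = @SVector randn(4)

# The triangular solve is specialized on the static size,
# so there is no library dispatch and no heap allocation.
x = UpperTriangular(A) \ b
```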


A better option than a macro would be a special type, say, NonsingularUpperTriangular, that calls the BLAS version.
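Something like this hypothetical sketch (the type name and the minimal `\` method are made up for illustration, not an existing API):

```julia
using LinearAlgebra

# Hypothetical wrapper: the caller promises nonsingularity, so `\`
# dispatches straight to the unchecked BLAS solve.
struct NonsingularUpperTriangular{T<:Union{Float32,Float64}}
    data::Matrix{T}
end

Base.:\(A::NonsingularUpperTriangular{T}, b::Vector{T}) where {T} =
    BLAS.trsv('U', 'N', 'N', A.data, b)
```

Then `NonsingularUpperTriangular(A) \ b` skips the diagonal check, at the cost of silently producing Inf/NaN if the promise is broken.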


Performance-wise, it is always worth doing that.
That’s one of the reasons people usually turn to low-level languages.

The nice thing about Julia is that it gives its users low-level control.

Your argument would be valid for @inbounds as well, yet we know how popular that is.

It’s not valid for @inbounds: the cost of checking for singularity grows like n for n^2 entries, while the cost of bounds checking grows like n^2 for n^2 entries.

FWIW, the “use StaticArrays for small arrays” argument is valid but not always easy to set up: you often don’t know the size of arrays in advance, StaticArrays has limited functionality, and writing code that is agnostic to the array type is tricky (especially for higher-dimensional arrays). It is always a good idea for Julia routines to behave as safely as possible by default, but sometimes you know your code works fine and just want to speed it up, so you call -O3 --check-bounds=no --math-mode=fast. It would be nice to call the faster routine under fast-math, but I don’t think fast-math is allowed to assume no division by zero?


Also, there seems to be something fishy here, because n = 1000 is pretty big, and:

using LinearAlgebra
using BenchmarkTools
n = 1000; A = randn(n,n); b = randn(n);
@btime BLAS.trsv!('U', 'N', 'N', A, copy(b))
@btime LAPACK.trtrs!('U', 'N', 'N', A, copy(b))
@btime all(diag(A) .!= 0.0)
  91.324 μs (1 allocation: 7.94 KiB)
  405.328 μs (1 allocation: 7.94 KiB)
  3.878 μs (6 allocations: 12.41 KiB)

so something more than just the singularity check must be going on?

julia> using LinearAlgebra

julia> BLAS.vendor()
:mkl

julia> n = 10000; A = randn(n,n); b = randn(n); BLAS.trmv!('U', 'N', 'N', A, b);

julia> 

Maybe this is related: lapack performance used to be bad, and maybe still is: