Why `LAPACK.trtrs!`, not `BLAS.trsv!`?

I was surprised to find out LAPACK and BLAS have different triangular solves:

julia> A = randn(5,5); b = randn(5); all((UpperTriangular(A)\b) .=== BLAS.trsv('U', 'N', 'N', A, b))
false

julia> A = randn(5,5); b = randn(5); all((UpperTriangular(A)\b) .=== LAPACK.trtrs!('U', 'N', 'N', A, copy(b)))
true

I’m curious if anyone knows why triangular solves use the LAPACK version, not the BLAS version? The BLAS version looks to be 3x faster:

julia> n = 1000; A = randn(n,n); b = randn(n); @benchmark BLAS.trsv!('U', 'N', 'N', A, copy(b))
BenchmarkTools.Trial: 
  memory estimate:  7.94 KiB
  allocs estimate:  1
  --------------
  minimum time:     97.687 μs (0.00% GC)
  median time:      103.377 μs (0.00% GC)
  mean time:        107.281 μs (0.00% GC)
  maximum time:     380.694 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> n = 1000; A = randn(n,n); b = randn(n); @benchmark LAPACK.trtrs!('U', 'N', 'N', A, copy(b))
BenchmarkTools.Trial: 
  memory estimate:  7.94 KiB
  allocs estimate:  1
  --------------
  minimum time:     314.879 μs (0.00% GC)
  median time:      366.067 μs (0.00% GC)
  mean time:        375.021 μs (0.00% GC)
  maximum time:     1.174 ms (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> versioninfo()
Julia Version 1.0.0
Commit 5d4eaca0c9 (2018-08-08 20:58 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin14.5.0)
  CPU: Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.0 (ORCJIT, skylake)

trtrs is trsv plus a check for singularity. BLAS doesn’t support reporting singularity (there is no `info` argument), so trsv will just go ahead and divide by zero. When I wrote the triangular code, I thought it was better to report the singularity. It makes sense that it is a bit slower, but mainly for small matrices, since the check only has to pass through the diagonal.
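To make the difference concrete, here is a small sketch (my own example, not from the thread) of what each routine does when the triangular matrix has a zero pivot; if I read the Julia wrapper right, trtrs!’s nonzero `info` is turned into a `SingularException`:

```julia
using LinearAlgebra

A = [1.0 2.0;
     0.0 0.0]            # upper triangular with a zero on the diagonal
b = [1.0, 1.0]

# BLAS has no info argument, so it silently divides by zero:
x = BLAS.trsv('U', 'N', 'N', A, b)
# x contains Inf/NaN entries

# LAPACK reports the singular pivot instead (assuming the wrapper
# converts info > 0 into a SingularException):
try
    LAPACK.trtrs!('U', 'N', 'N', A, copy(b))
catch err
    err
end
```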


On a related topic, do you know why I get this warning:

julia> n = 10000; A = randn(n,n); b = randn(n); BLAS.trmv!('U', 'N', 'N', A, b);
WARNING unrolling of the trmv_U loop may give wrong results

I googled for that warning but nothing came up.

I haven’t seen it before but I’m able to reproduce and I was able to find the source. It’s from OpenBLAS. More specifically https://github.com/xianyi/OpenBLAS/blob/e11126b26ada8d97b4a522e461ca92311653bfc6/driver/level2/trmv_U.c#L66

Interesting. This seems like a bug, so maybe I’ll file an OpenBLAS issue? Though perhaps it should be checked with Apple BLAS or MKL first.

I can see the logic.
Could we have a macro for a low-overhead mode?

Something in the sense of Debug vs. Release builds.
If I feel my code is mature enough, I can raise a flag (hopefully one that becomes a Julia standard) and all functions switch to a mode that skips these “sanity checks”.

Just checked with Julia+MKL. No such warning. (I could reproduce as well for Julia+OpenBLAS)


I don’t think it is worth the trouble. If you work with matrices so small that the check matters, you’d be better off with StaticArrays.
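For example (my own sketch, assuming StaticArrays’ solve methods for wrapped static matrices), a small fixed-size solve avoids the BLAS/LAPACK call entirely:

```julia
using StaticArrays, LinearAlgebra

A = @SMatrix randn(4, 4)
b = @SVector randn(4)

# The triangular solve is specialized on the static size,
# so there is no library dispatch and no heap allocation.
x = UpperTriangular(A) \ b
```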


A better option than a macro would be a special type, say, NonsingularUpperTriangular, that calls the BLAS version.
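Something like this hypothetical sketch (the type name and the minimal `\` method are made up for illustration, not an existing API):

```julia
using LinearAlgebra

# Hypothetical wrapper: the caller promises nonsingularity, so `\`
# dispatches straight to the unchecked BLAS solve.
struct NonsingularUpperTriangular{T<:Union{Float32,Float64}}
    data::Matrix{T}
end

Base.:\(A::NonsingularUpperTriangular{T}, b::Vector{T}) where {T} =
    BLAS.trsv('U', 'N', 'N', A.data, b)
```

Then `NonsingularUpperTriangular(A) \ b` skips the diagonal check, at the cost of silently producing Inf/NaN if the promise is broken.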


Performance-wise, it is always worth doing that.
That’s one of the reasons people usually turn to low-level languages.

The nice thing about Julia is that it gives its users low-level control.

Your argument would be valid for @inbounds as well, yet we know how popular that is.

It’s not valid for @inbounds: the cost of checking for singularity grows like n for n^2 entries, while the cost of bounds checking grows like n^2 for n^2 entries.

FWIW, the “use StaticArrays for small arrays” argument is valid but not always easy to set up: you often don’t know the size of arrays in advance, StaticArrays has limited functionality, and writing code that is agnostic to the array type is tricky (especially for higher-dimensional arrays). It is always a good idea for Julia routines to behave as safely as possible by default, but sometimes you know your code works fine and just want to speed it up, so you call -O3 --check-bounds=no --math-mode=fast. It would be nice to call the faster routine under fast-math, but I don’t think fast-math is allowed to assume no division by zero?


Also, there seems to be something fishy here, because n = 1000 is pretty big, and:

using LinearAlgebra
using BenchmarkTools
n = 1000; A = randn(n,n); b = randn(n);
@btime BLAS.trsv!('U', 'N', 'N', A, copy(b))
@btime LAPACK.trtrs!('U', 'N', 'N', A, copy(b))
@btime all(diag(A) .!= 0.0)
  91.324 μs (1 allocation: 7.94 KiB)
  405.328 μs (1 allocation: 7.94 KiB)
  3.878 μs (6 allocations: 12.41 KiB)

so something more than just the singularity check must be going on?

julia> using LinearAlgebra

julia> BLAS.vendor()
:mkl

julia> n = 10000; A = randn(n,n); b = randn(n); BLAS.trmv!('U', 'N', 'N', A, b);

julia> 

Maybe this is related: lapack performance used to be bad, and maybe still is: