AOCL (not MKL) acceleration on AMD Ryzen CPUs

aasdelat · April 12, 2024, 9:50pm

Hi:
I am interested in calling a LAPACK routine, dgesvd, which computes the singular value decomposition of a matrix.
LinearAlgebra.jl already does this, but it is a generic library that does not take full advantage of the specific CPU in use.

For Intel CPUs there is MKL, which can be used from Julia through the MKL.jl package. To use it, you simply load the packages in the following order:

using MKL
using LinearAlgebra

In this way, LinearAlgebra calls the routines in the MKL library instead of those in the default LAPACK library.
MKL also seems to boost performance on AMD CPUs, but it does not seem to take full advantage of them.
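To confirm which backend LinearAlgebra is actually forwarding to, before or after `using MKL`, the loaded libraries can be listed from Julia. This is a minimal sketch; `active_blas_backends` is a hypothetical helper name, and the output depends on the local installation.

```julia
using LinearAlgebra

# Return the file names of the BLAS/LAPACK libraries that LinearAlgebra
# currently forwards to, together with their integer interface.
function active_blas_backends()
    config = BLAS.get_config()
    return [(basename(lib.libname), lib.interface) for lib in config.loaded_libs]
end

for (name, interface) in active_blas_backends()
    println(name, "  (", interface, ")")
end
```

With a default Julia installation this normally lists an OpenBLAS library. After `using MKL` it should list an MKL library instead.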

On the other hand, AMD has developed the AOCL libraries, which replace LAPACK, BLAS, etc., taking full advantage of AMD’s CPUs.
In order to use them in Julia, the appropriate thing would be to have an AOCL.jl package that could be used in the same way as MKL.jl.
It would also be nice if I had the skills needed to develop such a Julia package, but I do not, so I have to wait for somebody else to write it.
Meanwhile, I want to tackle a much easier problem that may be enough for my current needs.

This easier problem is to call the dgesvd routine from the libflame.so library (the AOCL replacement for liblapack.so) by means of ccall.
For this purpose, I inspected the source file lapack.jl, which is the interface to the LAPACK library, and copied and adapted a piece of its code. To find the symbol in the shared library that corresponds to the dgesvd function, I used:

nm -D libflame.so | grep dgesvd

and got a bunch of possible symbols (23 symbols):

000000000032b2a0 T dgesvd
0000000000703460 T dgesvd_
000000000053b330 T dgesvd2x2
00000000007155e0 T dgesvd_check
...... a lot more .......
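The same kind of lookup can also be done from inside Julia with the Libdl standard library. This is only a sketch: `exported_symbols` is a hypothetical helper, and the library used here is a stand-in (whatever BLAS backend Julia has loaded) so that the example runs without AOCL. For the case above, `libpath` would be the full path to libflame.so.

```julia
using Libdl
using LinearAlgebra

# Hypothetical helper: report which of the candidate symbols a shared library exports.
function exported_symbols(libpath::AbstractString, candidates)
    handle = Libdl.dlopen(libpath)
    return [sym for sym in candidates
            if Libdl.dlsym(handle, sym; throw_error = false) !== nothing]
end

# Stand-in library: the first backend that Julia's LinearAlgebra has loaded.
libpath = BLAS.get_config().loaded_libs[1].libname
println(exported_symbols(libpath, [:dgesvd_, :dgesvd_64_, :no_such_symbol_]))
```

Note that finding a symbol this way only shows that it is exported. It does not show that the library's own undefined symbols can be resolved at call time.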

I have tested several of them, beginning with dgesvd. This one seems to be the appropriate one, but when I run dgesvd!(jobu, jobvt, A), I get the error:

~/.julia/juliaup/julia-1.10.2+0.x64.linux.gnu/bin/julia: symbol lookup error: /opt/AMD/aocl/aocl-linux-aocc-4.2.0/aocc/lib/libflame.so: undefined symbol: dnrm2_

If I use other symbols, the same or other errors appear.
The code I have written so far is:
test_lapack_svd_call.jl (5.7 KB)
You can test it by means of:

julia> include("test_lapack_svd_call.jl")
julia> dgesvd!(jobu, jobvt, A)

The matrix A is generated inside the script, but the script does not call the function, so you call it after loading the script.
Can anybody help me, please?

Thanks a lot in advance.

Elrod · April 12, 2024, 10:16pm

Did you try

nm -D libflame.so | grep dnrm2_

?

Given that it is provided by libblis (these are 64 bit integers, hence the extra 64 suffix):

> nm -D ~/.julia/artifacts/3d9c1a6f0bd9ed317a031702dc47068a05ac7a9c/lib/libblis.so | grep dnrm2
0000000000a798b0 T cblas_dnrm2
0000000000a6b630 T dnrm2_64_
0000000000a740a0 T dnrm2sub_64_

I’m guessing it isn’t provided by libflame, and instead that it expects you to (dynamically?) link libblis.
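One way this could look from Julia, as a sketch and not a tested AOCL recipe: open libblis with RTLD_GLOBAL first, so that its symbols are visible when libflame is opened afterwards. `dlopen_with_deps` is a hypothetical helper, and the directory is the install path quoted in the error message above; adjust it to the local setup.

```julia
using Libdl

# Hypothetical helper: open each dependency with RTLD_GLOBAL so that its symbols
# are visible to libraries opened afterwards, then open the main library.
function dlopen_with_deps(main::AbstractString, deps::Vector{String})
    for dep in deps
        Libdl.dlopen(dep, Libdl.RTLD_GLOBAL | Libdl.RTLD_LAZY)
    end
    return Libdl.dlopen(main, Libdl.RTLD_LAZY)
end

# Assumed install location (taken from the error message in the first post).
aocl_lib = "/opt/AMD/aocl/aocl-linux-aocc-4.2.0/aocc/lib"
libblis  = joinpath(aocl_lib, "libblis-mt.so")
libflame = joinpath(aocl_lib, "libflame.so")

if isfile(libblis) && isfile(libflame)
    handle = dlopen_with_deps(libflame, [libblis])
    found = Libdl.dlsym(handle, :dgesvd_; throw_error = false) !== nothing
    println("dgesvd_ found in libflame: ", found)
else
    println("AOCL libraries not found under ", aocl_lib)
end
```

Whether this is enough for libflame may depend on its other dependencies, such as libaoclutils.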
Related: Calling code from a library (.so): works on one system, fails on another - #3 by m-j-w

abulak · April 12, 2024, 10:22pm

I myself quickly looked into building AOCL jlls. You need to link all of them against each other as indicated in their build instructions. The above suggestion most probably pins the problem.

aasdelat · April 13, 2024, 2:10pm

Wow, great! Thank you in advance for your work. I am sure many people will be grateful for it.

aasdelat · April 13, 2024, 4:23pm

Thank you, @Elrod .
First of all, an internet search for dnrm2 leads to some pages where dnrm2 is described as:

*> DNRM2 returns the euclidean norm of a vector via the function
*> name, so that
*>
*>    DNRM2 := sqrt( x'*x )

It is distributed with LAPACK, but it belongs to the BLAS library.

Inspecting the libflame.so as suggested:

$ nm -D libflame.so | grep dnrm2
00000000002a73f0 T bl1_dnrm2
                 U dnrm2_

According to the nm documentation, U means “The symbol is undefined”, which in turn, I think, means that the code for that function is not in libflame.so. This is what I expected if dnrm2 is in BLAS. So I inspected the libblis.so (single-threaded) and libblis-mt.so (multithreaded) libraries, and got:

$ nm -D libblis.so | grep dnrm2_
0000000000930100 T dnrm2_
0000000000930090 T dnrm2_blis_impl

and

$ nm -D libblis-mt.so | grep dnrm2
000000000094a7d0 T cblas_dnrm2
00000000009a9fb0 T dnrm2
000000000092dd70 T dnrm2_
000000000092dd00 T dnrm2_blis_impl
00000000009ab780 T dnrm2sub
0000000000944aa0 T dnrm2sub_
0000000000944a90 T dnrm2sub_blis_impl

respectively. According to the nm documentation, “T” or “t” means: “The symbol is in the text (code) section.”

So it exists, not in libflame.so but in libblis.so and libblis-mt.so. So I have to link to one of these libraries (I am interested in the multithreaded one).

As suggested in the linked thread, I have checked my LD_LIBRARY_PATH, getting:

/opt/AMD/aocl/aocl-linux-aocc-4.2.0/aocc/lib:/opt/AMD/aocc-compiler-4.2.0/ompd:/opt/AMD/aocc-compiler-4.2.0/lib:/opt/AMD/aocc-compiler-4.2.0/lib32:/usr/lib/x86_64-linux-gnu:/usr/lib64:/usr/lib32:/usr/lib

Going to the first listed directory:

$ ls
cmake                     libblis-mt.so            libfftw3l_mpi.la         libfftw3q_omp.so.3
libalcp.a                 libblis-mt.so.4          libfftw3l_mpi.so         libfftw3q_omp.so.3.6.10
libalcp.so                libblis-mt.so.4.2.0      libfftw3l_mpi.so.3       libfftw3q.so
libalm.a                  libblis.so               libfftw3l_mpi.so.3.6.10  libfftw3q.so.3
libalmfast.a              libblis.so.4             libfftw3l_omp.a          libfftw3q.so.3.6.10
libalmfast.so             libblis.so.4.2.0         libfftw3l_omp.la         libfftw3.so
libalm.so                 libbz2.so                libfftw3l_omp.so         libfftw3.so.3
libamdlibm.a              libfftw3.a               libfftw3l_omp.so.3       libfftw3.so.3.6.10
libamdlibmfast.a          libfftw3f.a              libfftw3l_omp.so.3.6.10  libflame.a
libamdlibmfast.so         libfftw3f.la             libfftw3l.so             libflame.so
libamdlibm.so             libfftw3f_mpi.a          libfftw3l.so.3           libipp-compat.so
libamdsecrng.a            libfftw3f_mpi.la         libfftw3l.so.3.6.10      liblz4.so
libamdsecrng.so           libfftw3f_mpi.so         libfftw3_mpi.a           liblzma.so
libamdsecrng.so.4.2       libfftw3f_mpi.so.3       libfftw3_mpi.la          libopenssl-compat.so
libamdsecrng.so.4.2.0     libfftw3f_mpi.so.3.6.10  libfftw3_mpi.so          librng_amd.a
libaocl_compression.a     libfftw3f_omp.a          libfftw3_mpi.so.3        librng_amd.so
libaocl_compression.so    libfftw3f_omp.la         libfftw3_mpi.so.3.6.10   librng_amd.so.4.2
libaocl-libmem.a          libfftw3f_omp.so         libfftw3_omp.a           librng_amd.so.4.2.0
libaocl-libmem.so         libfftw3f_omp.so.3       libfftw3_omp.la          libscalapack.a
libaocl-libmem.so.4.2.0   libfftw3f_omp.so.3.6.10  libfftw3_omp.so          libscalapack.so
libaoclsparse.a           libfftw3f.so             libfftw3_omp.so.3        libsnappy.so
libaoclsparse.so          libfftw3f.so.3           libfftw3_omp.so.3.6.10   libz.so
libaoclsparse.so.4.2.0.0  libfftw3f.so.3.6.10      libfftw3q.a              libzstd.so
libaoclutils.a            libfftw3.la              libfftw3q.la             pkgconfig
libaoclutils.so           libfftw3l.a              libfftw3q_omp.a
libblis.a                 libfftw3l.la             libfftw3q_omp.la
libblis-mt.a              libfftw3l_mpi.a          libfftw3q_omp.so

We can see libblis.so and libblis-mt.so, so the LD_LIBRARY_PATH is correct.

In addition, let’s see if libflame.so knows where libblis is:

$ ldd libflame.so 
	linux-vdso.so.1 (0x00007ffd9d4eb000)
	libaoclutils.so => /opt/AMD/aocl/aocl-linux-aocc-4.2.0/aocc/lib/libaoclutils.so (0x0000737c0e67c000)
	libomp.so => /opt/AMD/aocc-compiler-4.2.0/lib/libomp.so (0x0000737c0d200000)
	libpthread.so.0 => /usr/lib/x86_64-linux-gnu/libpthread.so.0 (0x0000737c0e677000)
	libc.so.6 => /usr/lib/x86_64-linux-gnu/libc.so.6 (0x0000737c0ce00000)
	/lib64/ld-linux-x86-64.so.2 (0x0000737c0e688000)
	libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x0000737c0ca00000)
	libm.so.6 => /usr/lib/x86_64-linux-gnu/libm.so.6 (0x0000737c0d519000)
	libgcc_s.so.1 => /usr/lib/x86_64-linux-gnu/libgcc_s.so.1 (0x0000737c0e655000)
	librt.so.1 => /usr/lib/x86_64-linux-gnu/librt.so.1 (0x0000737c0e650000)
	libdl.so.2 => /usr/lib/x86_64-linux-gnu/libdl.so.2 (0x0000737c0e64b000)

There are no unresolved dependencies, but it does not show any dependency on libblis.so or libblis-mt.so (I am no expert on ldd, but I think this is the correct interpretation).
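ldd reports the dependencies recorded in a file. What is actually loaded in a running Julia process can be listed with `Libdl.dllist()`. A small sketch, where `loaded_libraries` is a hypothetical helper:

```julia
using Libdl

# Hypothetical helper: loaded shared libraries whose file name contains `pattern`.
loaded_libraries(pattern::AbstractString) =
    filter(path -> occursin(pattern, basename(path)), Libdl.dllist())

println(loaded_libraries("blis"))    # empty unless a library named *blis* is loaded
println(loaded_libraries("flame"))   # empty unless a library named *flame* is loaded
```

Running this right after the failing call would show whether libflame was loaded without libblis.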

So the last option, I think, is to link libblis-mt.so explicitly. According to the AOCL documentation:

To use AOCL-LAPACK in your application, link with AOCL-LAPACK, AOCL-BLAS, and AOCL-Utils libraries while building the application.

AOCL-Utils library has libstdc++ library dependency. As AOCL-LAPACK is dependent on AOCL-Utils, applications must link with libstdc++(-lstdc++) as well.

But how do I link to several shared libraries in Julia? Is there a way to do this with ccall, or do I have to do some unusual compilation or precompilation step?

ufechner7 · April 13, 2024, 5:20pm

I mean, what should always work is to write a tiny wrapper in C that imports the required libraries and exports just the function that you want to use from Julia…

But perhaps there is another, easier way…

aasdelat · April 13, 2024, 9:35pm

Wow! I have been struggling to get this to work and, finally, the following worked. I created the file dgesvd_wrapper.f90:

module dgesvd_wrapper

PUBLIC :: my_dgesvd

contains

function my_dgesvd( JOBU, JOBVT, M, N, A, LDA, S, U, LDU, VT, LDVT,  &
                   WORK, LWORK, INFO )

      ! .. Scalar Arguments ..
      CHARACTER          JOBU, JOBVT
      INTEGER            INFO, LDA, LDU, LDVT, LWORK, M, N

      !.. Array Arguments ..
      DOUBLE PRECISION   A( LDA, * ), S( * ), U( LDU, * ),          &
                         VT( LDVT, * ), WORK( * )
                         
    call dgesvd(JOBU, JOBVT, M, N, A, LDA, S, U, LDU, VT, LDVT,  &
           WORK, LWORK, INFO )
           
end function my_dgesvd

end module dgesvd_wrapper

And created the shared library by:

$ flang -g -shared /opt/AMD/aocl/aocl-linux-aocc-4.2.0/aocc/lib/libflame.a /opt/AMD/aocl/aocl-linux-aocc-4.2.0/aocc/lib/libblis-mt.a /opt/AMD/aocl/aocl-linux-aocc-4.2.0/aocc/lib/libaoclutils.a /usr/lib/x86_64-linux-gnu/libstdc++.so.6 -lm -fopenmp dgesvd_wrapper.f90 -o desvd_wrapper.so

It is important to note the -lm -fopenmp options, which are documented in table 6 of section “4.2.3 Linking Application with AOCL-BLAS” of the AOCL documentation.

$ nm -D desvd_wrapper.so | grep dgesvd
00000000000d4900 T dgesvd_
00000000000d4840 T __dgesvd_wrapper_
00000000008eed40 B _dgesvd_wrapper_0_
00000000000d4850 T dgesvd_wrapper_my_dgesvd_
000000000071fd70 T fla_dgesvd_nn_small10
0000000000726830 T fla_dgesvd_nn_small10_avx2
000000000071fdb0 T fla_dgesvd_nn_small1T
0000000000735b90 T fla_dgesvd_nn_small1T_avx2
000000000071fd90 T fla_dgesvd_small6
00000000007335d0 T fla_dgesvd_small6_avx2
000000000071fdd0 T fla_dgesvd_small6T
00000000007373c0 T fla_dgesvd_small6T_avx2
00000000008c1870 T lapack_dgesvd

Now it computes the SVD when I use any of the symbols dgesvd_, lapack_dgesvd or dgesvd_wrapper_my_dgesvd_.
In order to use more than one thread, I also have to set:

export OMP_NUM_THREADS=30

It uses all 30 threads for a while, then one thread and, lastly, all 30 threads again. I do not know the reason for this behavior and I have to run more tests.
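For readers who want to see the shape of such a ccall, here is a minimal sketch of a dgesvd call with the usual two-pass workspace query. It is not the attached script. `thin_svd_dgesvd!` is a hypothetical name, and the call goes to Julia's bundled LAPACK through `libblastrampoline` and `@blasfunc`, which are LinearAlgebra internals (the same ones lapack.jl uses) and may change between Julia versions. To target the wrapper library instead, the `(symbol, library)` pair would become something like `(:dgesvd_, "/path/to/desvd_wrapper.so")`, and the integer type would have to match the library build (32-bit integers for an LP64 build).

```julia
using LinearAlgebra
using LinearAlgebra: BlasInt, libblastrampoline
using LinearAlgebra.BLAS: @blasfunc

# Thin SVD (jobu = jobvt = 'S') via dgesvd. A is overwritten by the routine.
function thin_svd_dgesvd!(A::Matrix{Float64})
    m, n = size(A)
    k = min(m, n)
    S  = Vector{Float64}(undef, k)
    U  = Matrix{Float64}(undef, m, k)
    VT = Matrix{Float64}(undef, k, n)
    work  = Vector{Float64}(undef, 1)
    lwork = BlasInt(-1)              # -1 requests a workspace size query
    info  = Ref{BlasInt}(0)
    for pass in 1:2
        ccall((@blasfunc(dgesvd_), libblastrampoline), Cvoid,
              (Ref{UInt8}, Ref{UInt8}, Ref{BlasInt}, Ref{BlasInt},
               Ptr{Float64}, Ref{BlasInt}, Ptr{Float64},
               Ptr{Float64}, Ref{BlasInt}, Ptr{Float64}, Ref{BlasInt},
               Ptr{Float64}, Ref{BlasInt}, Ref{BlasInt}, Clong, Clong),
              'S', 'S', m, n, A, max(1, m), S, U, max(1, m), VT, max(1, k),
              work, lwork, info, 1, 1)   # trailing 1, 1: hidden Fortran string lengths
        info[] == 0 || error("dgesvd returned info = $(info[])")
        if pass == 1
            lwork = BlasInt(work[1])     # optimal workspace size from the query
            resize!(work, lwork)
        end
    end
    return U, S, VT
end

A  = rand(8, 5)
A0 = copy(A)                             # keep the original for the check below
U, S, VT = thin_svd_dgesvd!(A)
println("reconstruction error: ", norm(U * Diagonal(S) * VT - A0))
```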

abulak · April 13, 2024, 11:41pm

What does ldd on libflame show from within Julia (in shell mode)? Did you start Julia with LD_LIBRARY_PATH modified?

ufechner7 · April 14, 2024, 12:24pm

Any benchmark results yet?

aasdelat · April 15, 2024, 6:28pm

I have to say that there is a mistake in the Fortran code: it should say “subroutine” where it says “function”. Once corrected in my code and recompiled, the behavior does not change at all.

aasdelat · April 15, 2024, 9:12pm

ENVIRONMENT

First of all, I will tell more about my environment:

$ uname -a
Linux ryzen-casa 6.5.0-27-generic #28~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Fri Mar 15 10:51:06 UTC 2 x86_64 x86_64 x86_64 GNU/Linux

I have installed the AOCL libraries from the binaries: aocl-linux-aocc-4.2.0_1_amd64.deb.
I also have installed the AOCC compiler suite from the binaries: aocc-compiler-4.2.0_1_amd64.deb

I set the necessary environment variables for these packages by means of the scripts provided by each package, which I call from my ~/.profile.
So, the last two lines of my ~/.profile are:

source /opt/AMD/aocc-compiler-4.2.0/setenv_AOCC.sh
source /opt/AMD/aocl/aocl-linux-aocc-4.2.0/aocc/amd-libs.cfg

Regarding my LD_LIBRARY_PATH, in a shell just before starting Julia:

$ export | grep 
declare -x LD_LIBRARY_PATH="/opt/AMD/aocl/aocl-linux-aocc-4.2.0/aocc/lib:/opt/AMD/aocc-compiler-4.2.0/ompd:/opt/AMD/aocc-compiler-4.2.0/lib:/opt/AMD/aocc-compiler-4.2.0/lib32:/usr/lib/x86_64-linux-gnu:/usr/lib64:/usr/lib32:/usr/lib:"

In a shell from Julia:

shell> ldd /opt/AMD/aocl/aocl-linux-aocc-4.2.0/aocc/lib/libflame.so
	linux-vdso.so.1 (0x00007ffc9bf98000)
	libaoclutils.so => /opt/AMD/aocl/aocl-linux-aocc-4.2.0/aocc/lib/libaoclutils.so (0x0000718616c4f000)
	libomp.so => /opt/AMD/aocc-compiler-4.2.0/lib/libomp.so (0x0000718615800000)
	libpthread.so.0 => /usr/lib/x86_64-linux-gnu/libpthread.so.0 (0x0000718616c4a000)
	libc.so.6 => /usr/lib/x86_64-linux-gnu/libc.so.6 (0x0000718615400000)
	/lib64/ld-linux-x86-64.so.2 (0x0000718616c5b000)
	libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x0000718615000000)
	libm.so.6 => /usr/lib/x86_64-linux-gnu/libm.so.6 (0x0000718615b19000)
	libgcc_s.so.1 => /usr/lib/x86_64-linux-gnu/libgcc_s.so.1 (0x0000718616c28000)
	librt.so.1 => /usr/lib/x86_64-linux-gnu/librt.so.1 (0x0000718616c23000)
	libdl.so.2 => /usr/lib/x86_64-linux-gnu/libdl.so.2 (0x0000718616c1e000)

shell> ldd desvd_wrapper.so
	linux-vdso.so.1 (0x00007ffe41b6d000)
	libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007ef349c00000)
	libm.so.6 => /usr/lib/x86_64-linux-gnu/libm.so.6 (0x00007ef349f19000)
	libflang.so => /opt/AMD/aocc-compiler-4.2.0/lib/libflang.so (0x00007ef349600000)
	libflangrti.so => /opt/AMD/aocc-compiler-4.2.0/lib/libflangrti.so (0x00007ef34a928000)
	libpgmath.so => /opt/AMD/aocc-compiler-4.2.0/lib/libpgmath.so (0x00007ef349200000)
	libquadmath.so.0 => /usr/lib/x86_64-linux-gnu/libquadmath.so.0 (0x00007ef349ed1000)
	libomp.so => /opt/AMD/aocc-compiler-4.2.0/lib/libomp.so (0x00007ef348e00000)
	libgcc_s.so.1 => /usr/lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007ef34a906000)
	libc.so.6 => /usr/lib/x86_64-linux-gnu/libc.so.6 (0x00007ef348a00000)
	/lib64/ld-linux-x86-64.so.2 (0x00007ef34a937000)
	librt.so.1 => /usr/lib/x86_64-linux-gnu/librt.so.1 (0x00007ef34a901000)
	libpthread.so.0 => /usr/lib/x86_64-linux-gnu/libpthread.so.0 (0x00007ef34a8fa000)
	libdl.so.2 => /usr/lib/x86_64-linux-gnu/libdl.so.2 (0x00007ef349ecc000)

Where desvd_wrapper.so is the shared library I created containing libflame, blis and others. Recall that this was necessary because libflame calls routines from blis (and others), and when libflame is passed directly to Julia’s ccall, it complains that those routines are missing.
However, from a shell from Julia:

shell> echo $LD_LIBRARY_PATH
ERROR: UndefVarError: `LD_LIBRARY_PATH` not defined
Stacktrace:
 [1] top-level scope
   @ none:1

But, as I have stated above, if I test for LD_LIBRARY_PATH in the shell just before starting Julia, I can see that it is correctly set.

BENCHMARKING

The calculation has a strange behavior: it starts using all (30 assigned) threads, then uses only one thread for a long time, and, at last, uses all threads again until it finishes and returns the result.

I do not know how to measure these different periods of the run time separately, so I resorted to timing them with a handheld stopwatch (in fact, a smartphone) while keeping an eye on gnome-system-monitor, hitting the lap button each time I saw all threads rising or dropping.

The relevant lines of the script are:

A = rand(100_000, 5_000)
...
@benchmark U, S, VT = dgesvd!(jobu, jobvt, A)

The results of the (hand) timing are as follows. Due to the measurement method, the times are not accurate, and the actual times can be one or two seconds shorter than stated:

Start   00:00
        all threads working (at 100% or nearly) for 01:24
-       01:24
        Only one thread working (100%) for 05:10
-       06:35
        all threads working again (at 100% or nearly) for 02:48
End     09:23

And the output is:

julia> include("test_lapack_svd_call.jl")
Generating random matrix ...
Making the SVD ...
BenchmarkTools.Trial: 1 sample with 1 evaluation.
 Single result which took 80.796 s (0.01% GC) to evaluate,
 with a memory estimate of 4.10 GiB, over 60 allocations.

If I use @time U, S, VT = dgesvd!(jobu, jobvt, A) instead of @benchmark U, S, VT = dgesvd!(jobu, jobvt, A):

julia> include("test_lapack_svd_call.jl")
Generating random matrix ...
Making the SVD ...
409.325711 seconds (39.38 k allocations: 4.103 GiB, 0.01% gc time, 0.01% compilation time)

409.325711 seconds ~ 7 minutes
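One thing to keep in mind when timing an in-place routine: LAPACK's dgesvd overwrites its input matrix, so if a timing tool runs the call more than once, later runs no longer see the original random matrix. Whether this is related to the difference between the two timings above is not established here. The sketch below times each run on a fresh copy. `time_on_copies` is a hypothetical helper, and `svd!` from LinearAlgebra only stands in for the dgesvd! wrapper.

```julia
using LinearAlgebra

# Hypothetical helper: time an in-place routine on a fresh copy of A for each
# sample, so that no sample sees data overwritten by an earlier call.
function time_on_copies(f!, A::AbstractMatrix; samples::Integer = 3)
    times = Float64[]
    for _ in 1:samples
        B = copy(A)                   # the copy is made outside the timed region
        push!(times, @elapsed f!(B))
    end
    return times
end

A = rand(200, 50)
times = time_on_copies(svd!, A)       # the first sample includes compilation time
println("min = ", minimum(times), " s, max = ", maximum(times), " s")
```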

Let’s compare it with the svd provided by LinearAlgebra.jl. In this case, the script is simple and looks like this:

using LinearAlgebra
using BenchmarkTools

println("Generating random matrix ...")
A = rand(100_000, 5_000)

println("Making the SVD ...")
@benchmark F = svd(A)

There is no strange behavior, in the sense that all (30 assigned) threads are working from the beginning to the end.
And the result is:

julia> include("test_svd_julia.jl")
Generating random matrix ...
Making the SVD ...
BenchmarkTools.Trial: 1 sample with 1 evaluation.
 Single result which took 83.899 s (0.27% GC) to evaluate,
 with a memory estimate of 8.38 GiB, over 13 allocations.

For the timing, I do not have to measure separate periods, so I launch the modified script:

using LinearAlgebra
using BenchmarkTools

println("Generating random matrix ...")
A = rand(100_000, 5_000)

println("Making the SVD ...")
@time F = svd(A)

And I get:

julia> include("test_svd_julia.jl")
Generating random matrix ...
Making the SVD ...
 85.518271 seconds (110.90 k allocations: 8.390 GiB, 0.08% gc time, 0.05% compilation time)

Finally, let’s use a much smaller matrix in order to get benchmark statistics.
Now, I use for my dgesvd:

A = rand(100, 50)
...
@benchmark U, S, VT = dgesvd!(jobu, jobvt, A)

And the output is:

julia> include("test_lapack_svd_call.jl")
Generating random matrix ...
Making the SVD ...
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  195.509 μs …  1.472 ms  ┊ GC (min … max): 0.00% … 83.37%
 Time  (median):     199.216 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   200.562 μs ± 22.550 μs  ┊ GC (mean ± σ):  0.40% ±  2.74%

       ▂▅▇██▇▅▃                                                 
  ▂▂▃▅██████████▇▆▄▄▄▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▁▂▂▂▂▂▁▂▂▂▂▂ ▃
  196 μs          Histogram: frequency by time          218 μs <

 Memory estimate: 105.59 KiB, allocs estimate: 39.

And the output with LinearAlgebra’s svd:

julia> include("test_svd_julia.jl")
Generating random matrix ...
Making the SVD ...
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  428.609 μs …  2.338 ms  ┊ GC (min … max): 0.00% … 76.33%
 Time  (median):     441.087 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   450.616 μs ± 55.806 μs  ┊ GC (mean ± σ):  0.60% ±  3.42%

    ▁▃▅▇██▇▆▅▃▂▂▂▂▂▂▂▃▄▄▃▃▁▁                  ▂▃▃▂▁            ▃
  ▅█████████████████████████▇▇▆▆▆▅▃▄▁▄▃▃▄▄▁▆▇███████▇▆▆▅▆▃▃▅▆▇ █
  429 μs        Histogram: log(frequency) by time       527 μs <

 Memory estimate: 182.59 KiB, allocs estimate: 11.

CONCLUSION

It seems that, for small matrices, my dgesvd performs better than LinearAlgebra's svd. For large matrices, my dgesvd not only takes more time but also shows a strange behavior. Could this strange behavior be solved and, hence, the times improved?

Elrod · April 16, 2024, 12:21am

Try like this:

julia> ENV["LD_LIBRARY_PATH"]
"/usr/local/lib/x86_64-unknown-linux-gnu/:/usr/local/lib/:/usr/local/lib/x86_64-unknown-linux-gnu/:/usr/local/lib/"

Mind trying LinearAlgebra after using MKL?
That’s likely to do better at small sizes than the default OpenBLAS, even on AMD.

aasdelat · April 16, 2024, 8:44am

julia> ENV["LD_LIBRARY_PATH"]
"/opt/AMD/aocl/aocl-linux-aocc-4.2.0/aocc/lib:/opt/AMD/aocc-compiler-4.2.0/ompd:/opt/AMD/aocc-compiler-4.2.0/lib:/opt/AMD/aocc-compiler-4.2.0/lib32:/usr/lib/x86_64-linux-gnu:/usr/lib64:/usr/lib32:/usr/lib:"

This way shows it!

aasdelat · April 16, 2024, 12:40pm

When I use the MKL.jl library, it uses more CPU threads, but I set:

N_THREADS=30
export JULIA_NUM_THREADS=1
export OMP_NUM_THREADS=$N_THREADS
export OPENBLAS_NUM_THREADS=$N_THREADS
export MKL_NUM_THREADS=$N_THREADS
export VECLIB_MAXIMUM_THREADS=$N_THREADS
export NUMEXPR_NUM_THREADS=$N_THREADS
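Besides the environment variables, the thread count that the active BLAS backend reports can be checked and changed from within Julia. This is a minimal sketch, and the value 4 is only an example:

```julia
using LinearAlgebra

println("logical CPU threads:   ", Sys.CPU_THREADS)
println("Julia threads:         ", Threads.nthreads())
println("BLAS threads (before): ", BLAS.get_num_threads())

# Ask the active backend for a specific thread count; the backend may cap it.
BLAS.set_num_threads(4)
println("BLAS threads (after):  ", BLAS.get_num_threads())
```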

The script is now:

using MKL
using LinearAlgebra
using BenchmarkTools

println("Generating random matrix ...")
A = rand(100_000, 5_000) # Big matrix
#A = rand(100, 50)        # Small matrix

println("Making the SVD ...")
@benchmark F = svd(A)

and, watching gnome-system-monitor, it seems that only 12 CPU threads are at 100%. There are 4 more threads at >60%. This makes nearly 15 threads. Could this mean that CPU multi-threading is not very effective for this application and only the physical cores are worth using?
The result is (it begins at 12:53 and ends at 12:55, i.e., it lasts for 2 minutes):

julia> include("test_svd_julia.jl")
Generating random matrix ...
Making the SVD ...
BenchmarkTools.Trial: 1 sample with 1 evaluation.
 Single result which took 42.031 s (0.56% GC) to evaluate,
 with a memory estimate of 8.38 GiB, over 13 allocations.

Despite the fact that it does not use 30 threads, it takes only 2 minutes, which is much better than my dgesvd.

And now, with a small matrix, A = rand(100, 50):

julia> include("test_svd_julia.jl")
Generating random matrix ...
Making the SVD ...
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  433.429 μs …  1.910 ms  ┊ GC (min … max): 0.00% … 72.58%
 Time  (median):     448.015 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   453.471 μs ± 39.689 μs  ┊ GC (mean ± σ):  0.41% ±  2.91%

         ▂▅▇██▆▄▂                                               
  ▂▂▂▃▄▅▇█████████▆▄▄▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▂▂▂▂▂▂▂▂▂▃▃▂▃▃▃▃▂▂▂▂ ▃
  433 μs          Histogram: frequency by time          509 μs <

 Memory estimate: 182.59 KiB, allocs estimate: 11.

aasdelat · April 17, 2024, 9:35pm

About the strange behavior of the dgesvd function: in order to test whether it is a Julia or an AOCL problem, I have written a Fortran program that uses that function, and the result is that it behaves the same: it starts using all threads, then uses only one thread for a long time and, finally, all threads again. The overall time is far from the 2 minutes taken by the MKL library.
Here is the code:

program use_dgesvd_aocl

implicit none

double precision    ::  A(100000, 5000) ! Big matrix
!double precision    ::  A(100, 50)      ! Small matrix

! .. Scalar Arguments ..
CHARACTER        ::  jobu, jobvt
INTEGER          ::  info, lda, ldu, ldvt, lwork, m, n, minmn

               
double precision, allocatable, dimension(:,:)   :: U, VT
double precision, allocatable, dimension(:)     :: S, work

                 
call RANDOM_NUMBER(A)

jobu  = 'S' ! 'S':  the first min(m,n) columns of U (the left singular
            !       vectors) are returned in the array U
jobvt = 'S' ! 'S':  the first min(m,n) rows of V**T (the right singular
            !       vectors) are returned in the array VT
m = size(A,1) ! number of rows of submatrix A used in the computation.
n = size(A,2) ! number of columns of submatrix A used in the computation.

minmn  = min(m, n)
lda = max(m, 1)
ldu = m
ldvt = minmn

allocate(S(minmn))
allocate(work(1))
allocate(U(ldu, minmn))
allocate(VT(ldvt, n))


! First call: get the best length for array work, in the integer lwork
lwork = -1 ! Query
call dgesvd(JOBU, JOBVT, M, N, A, LDA, S, U, LDU, VT, LDVT,  &
       WORK, LWORK, INFO )
if (info == 0) then
    lwork = work(1)
    deallocate(work)
    allocate(work(lwork))
else
    write(*,*) "Error: ", info
end if

call dgesvd(JOBU, JOBVT, M, N, A, LDA, S, U, LDU, VT, LDVT,  &
       WORK, LWORK, INFO )
       
write(*,*) "info: ", info
write(*,*) "          = 0:  successful exit."
write(*,*) "          < 0:  if INFO = -i, the i-th argument had an illegal value."
write(*,*) "          > 0:  if DBDSQR did not converge, INFO specifies how many"
write(*,*) "                superdiagonals of an intermediate bidiagonal form B"
write(*,*) "                did not converge to zero. See the description of WORK"
write(*,*) "                above for details."

end program use_dgesvd_aocl

I compile it by means of:

flang -g /opt/AMD/aocl/aocl-linux-aocc-4.2.0/aocc/lib/libflame.so /opt/AMD/aocl/aocl-linux-aocc-4.2.0/aocc/lib/libblis-mt.so /opt/AMD/aocl/aocl-linux-aocc-4.2.0/aocc/lib/libaoclutils.so /usr/lib/x86_64-linux-gnu/libstdc++.so.6 -lm -fopenmp use_dgesvd_aocl.f90 -o use_dgesvd_aocl

MKL outperforms AOCL, but I do not think that the solution is to use the MKL libraries, because they do not take full advantage of AMD’s CPUs. So I think this is an issue for AOCL technical support, unless somebody has a better idea.

aasdelat · March 14, 2025, 4:34pm

There has been a misunderstanding: I have been using dgesvd when I should have been using dgesdd. As stated in the LAPACK documentation for the singular value decomposition:

There are two types of driver routines for the SVD. Originally LAPACK had just the simple driver described below, and the other one was added after an improved algorithm was discovered.

a simple driver xGESVD computes all the singular values and (optionally) left and/or right singular vectors.

a divide and conquer driver xGESDD solves the same problem as the simple driver. It is much faster than the simple driver for large matrices, but uses more workspace. The name divide-and-conquer refers to the underlying algorithm (see sections 2.4.4 and 3.4.3).

In fact, Julia’s LinearAlgebra svd() calls dgesdd, not dgesvd. So I have imported dgesdd and tested it, with the following times:

Julia LinearAlgebra   without MKL     without AOCL    87 seconds (calls dgesdd, not dgesvd)
Julia LinearAlgebra   with MKL        without AOCL    43 seconds
Julia calling AOCL dgesvd with ccall                  6 min 32 seconds (and strange behavior)
Fortran calling dgesvd from libflame                  6 min 42 seconds (and strange behavior)
Julia calling AOCL dgesdd with ccall                  84 seconds (NO strange behavior)

So it behaves the same as Julia’s LinearAlgebra svd().
I am glad there is no strange behavior, but Intel’s MKL still beats all the others. Is it taking advantage of the peculiarities of AMD’s CPUs?
Is AMD’s AOCL taking full advantage of their processors?
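The two LAPACK drivers discussed above can also be selected directly through LinearAlgebra's `svd` with the `alg` keyword, which makes it easy to compare them on whatever backend is loaded. A minimal sketch with a small example matrix:

```julia
using LinearAlgebra

A = rand(300, 40)

# Divide-and-conquer driver (xGESDD); this is the default for svd.
F_dc = svd(A; alg = LinearAlgebra.DivideAndConquer())
# Simple driver based on QR iteration (xGESVD).
F_qr = svd(A; alg = LinearAlgebra.QRIteration())

println("largest difference in singular values: ",
        maximum(abs.(F_dc.S .- F_qr.S)))
```

Timing both calls on a large matrix would show whether the gap between the drivers appears with the default backend too.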

CVish · May 12, 2025, 11:25am

Please contact toolchainsupport@amd.com to request AMD-optimized libraries.

Roberto_Bagyo · May 13, 2025, 8:24pm

I am new to Julia. I have tested AOCL, OpenBLAS and Intel MKL on my Ryzen 3900X home computer. Below are the benchmark results for the Fortran and Julia versions. As you can see, AOCL is actually quite good, but MKL still wins even on AMD. Julia with MKL is faster in dgetrf/dgetri than Fortran, but has the same performance in dgesv. I hope there is a chance of AOCL being supported in Julia.

C:\Temp>aocl
 Inside arraySub() allocating memory ...
 Memory allocation sufficient
  Benchmark running, hopefully as only ACTIVE task
 Number of OpenMP threads:                       24
 Using system-configured threads:                       24
 Final BLAS thread count:                       24
Test1: Gauss1OMP- 1000 (   250x   250) inverts in   0.675 seconds  Err=  0.6357568671408161E-14
Test2: CROUT1OMP- 1000 (   250x   250) inverts in   0.967 seconds  Err=  0.7115810242203004E-14
 Calling CroutMT- Bigsize - Inverse1
 Calling CroutMT- Bigsize - Inverse2
Test3: CROUTMT  -    2 (  2000x  2000) inverts in   2.811 seconds  Err=  0.4444341847960061E-09
Test4: DGETRF   -    2 (  2000x  2000) inverts in   0.238 seconds  Err=  0.1480195511849177E-11
Test5: DGETRF2  -    2 (  2000x  2000) inverts in   0.232 seconds  Err=  0.1480195511849177E-11
Test6: DGESV    -    2 (  2000x  2000) inverts in   0.245 seconds  Err=  0.2163458817503340E-11
                             Total =  5.2 sec


C:\Temp>openblas
 Inside arraySub() allocating memory ...
 Memory allocation sufficient
  Benchmark running, hopefully as only ACTIVE task
 Number of OpenMP threads:                       24
 Using system-configured threads:                       24
 Final BLAS thread count:                       24
Test1: Gauss1OMP- 1000 (   250x   250) inverts in   0.559 seconds  Err=  0.4285718182520538E-14
Test2: CROUT1OMP- 1000 (   250x   250) inverts in   0.665 seconds  Err=  0.1317388121697570E-13
 Calling CroutMT- Bigsize - Inverse1
 Calling CroutMT- Bigsize - Inverse2
Test3: CROUTMT  -    2 (  2000x  2000) inverts in   2.823 seconds  Err=  0.1646278156525469E-09
Test4: DGETRF   -    2 (  2000x  2000) inverts in   0.520 seconds  Err=  0.3912745006479850E-12
Test5: DGETRF2  -    2 (  2000x  2000) inverts in   0.357 seconds  Err=  0.3746751382358156E-12
Test6: DGESV    -    2 (  2000x  2000) inverts in   0.282 seconds  Err=  0.4674058187435756E-12
                             Total =  5.2 sec


C:\Temp>mkl
 Inside arraySub() allocating memory ...
 Memory allocation sufficient
  Benchmark running, hopefully as only ACTIVE task
 Number of OpenMP threads:                       24
 Using system-configured threads:                       24
 Final BLAS thread count:                       24
Test1: Gauss1OMP- 1000 (   250x   250) inverts in   0.668 seconds  Err=  0.5603592445228368E-14
Test2: CROUT1OMP- 1000 (   250x   250) inverts in   0.821 seconds  Err=  0.7494986656780616E-14
 Calling CroutMT- Bigsize - Inverse1
 Calling CroutMT- Bigsize - Inverse2
Test3: CROUTMT  -    2 (  2000x  2000) inverts in   2.818 seconds  Err=  0.7692472547307751E-09
Test4: DGETRF   -    2 (  2000x  2000) inverts in   0.282 seconds  Err=  0.1694665710140913E-11
Test5: DGETRF2  -    2 (  2000x  2000) inverts in   0.273 seconds  Err=  0.1791571555309279E-11
Test6: DGESV    -    2 (  2000x  2000) inverts in   0.161 seconds  Err=  0.1742011662433139E-11
                             Total =  5.0 sec


C:\Temp>julia -t 24 -O2 test_fpu7.jl.txt

Threading Summary:

Base.Threads Configuration:
Threads.nthreads(): 24

BLAS Configuration:
BLAS vendor: lbt
BLAS.get_num_threads(): 12
BLAS backend libraries:
LBTConfig([ILP64] mkl_rt.2.dll, [LP64] mkl_rt.2.dll)
Benchmark running, hopefully as only ACTIVE task
Current time: 2025-05-14T02:26:42.205
Test1: GaussST  - 1000 (250x250) inverts in 6.253 seconds  Err= 4.2673054395647724e-15
Test2: CroutST  - 1000 (250x250) inverts in 2.243 seconds  Err= 1.5946885695561353e-12
Test3: RecursiST- 1000 (250x250) inverts in 1.293 seconds  Err= 2.4250148167423610e-14
Test4: CroutMT  - 1000 (250x250) inverts in 4.399 seconds  Err= 1.5946885695561353e-12
Test5: CroutMT  - 2 (2000x2000)  inverts in 3.565 seconds  Err= 3.4721243365073723e-10
Test6: DGETRF/I - 2 (2000x2000)  inverts in 0.196 seconds  Err= 5.2270673897888225e-13
Test7: DGESV    - 2 (2000x2000)  inverts in 0.166 seconds  Err= 5.2866519987893978e-13
Test8: inv(A)   - 2 (2000x2000)  inverts in 0.196 seconds  Err= 5.2270673897888225e-13
Test9: RecursiMT- 2 (2000x2000)  inverts in 1.008 seconds  Err= 5.7157070595421282e-13
                             Total = 19.3 sec
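
The log above reports `BLAS.get_num_threads(): 12` even though the script requests `Sys.CPU_THREADS` threads. One possible reason (an assumption, not verified on this machine) is that `BLAS.set_num_threads` runs before `using MKL` switches the backend; MKL may also limit itself to the number of physical cores. A minimal sketch of setting the thread count only after the backend is chosen, then reading back what was actually applied:

```julia
# using MKL   # assumption: uncomment when MKL.jl is installed, so it is loaded first
using LinearAlgebra

# Request one BLAS thread per logical CPU only after the backend is in place
requested = Sys.CPU_THREADS
BLAS.set_num_threads(requested)

# The backend may apply fewer threads than requested, so read the value back
applied = BLAS.get_num_threads()
println("requested = ", requested, ", applied = ", applied)
println(BLAS.get_config())
```

If `applied` is still lower than `requested`, the limit most likely comes from the backend itself rather than from the load order.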

The Fortran version is on Pastebin: "TEST_CPU Multithread Version of Polyhedron Benchmark".

And the Julia version:

using Random, LinearAlgebra, Dates, Printf
using LoopVectorization
using .Threads
using LinearAlgebra.BLAS
BLAS.set_num_threads(Sys.CPU_THREADS)
using RecursiveFactorization
using MKL


println("\nThreading Summary:")

println("\nBase.Threads Configuration:")
println("Threads.nthreads(): ", Threads.nthreads())  # Julia's multithreading

# Check which BLAS library is being used
println("\nBLAS Configuration:")
println("BLAS vendor: ", BLAS.vendor())  # MKL, OpenBLAS, etc.

# Get BLAS thread count (works for MKL/OpenBLAS)
println("BLAS.get_num_threads(): ", BLAS.get_num_threads())
println("BLAS backend libraries:")
println(BLAS.get_config())


using LinearAlgebra, LoopVectorization

function croutmt_inverse_optimized(A::Matrix{Float64})
    n = size(A, 1)
    L = zeros(n, n)
    U = Matrix{Float64}(I, n, n)  # U starts as identity

    # Crout factorization (partially parallelized where possible)
    @inbounds for j in 1:n
        # Compute column j of L (can be partially vectorized)
        @turbo for i in j:n
            sum_val = zero(Float64)
            for k in 1:j-1
                sum_val += L[i,k] * U[k,j]
            end
            L[i,j] = A[i,j] - sum_val
        end

        # Check for singularity
        if abs(L[j,j]) < eps(Float64) * n * sqrt(n)
            error("Matrix is numerically singular during Crout factorization at step $j")
        end

        # Compute row j of U (right of diagonal)
        inv_Ljj = 1.0 / L[j,j]
        @turbo for i in j+1:n
            sum_val = zero(Float64)
            for k in 1:j-1
                sum_val += L[j,k] * U[k,i]
            end
            U[j,i] = (A[j,i] - sum_val) * inv_Ljj
        end
    end

    # Parallel inverse computation with better memory patterns
    Ainv = zeros(n, n)
    Threads.@threads for i in 1:n  # Column-wise parallelization
        e = zeros(n)
        e[i] = 1.0

        # Forward substitution (L * y = e)
        y = zeros(n)
        @inbounds for row in 1:n  # sequential: y[row] depends on earlier rows, so no @simd here
            sum_val = zero(Float64)
            @turbo for col in 1:row-1
                sum_val += L[row, col] * y[col]
            end
            y[row] = (e[row] - sum_val) / L[row, row]
        end

        # Backward substitution (U * x = y)
        x = zeros(n)
        @inbounds for row in n:-1:1
            sum_val = zero(Float64)
            @turbo for col in row+1:n
                sum_val += U[row, col] * x[col]
            end
            x[row] = (y[row] - sum_val) / U[row, row]
        end
        @turbo for k in 1:n  # Faster column assignment
            Ainv[k,i] = x[k]
        end
    end

    return Ainv
end

function check_threads()
    println("Number of threads being used: ", nthreads())
    println("(Configure with `julia -t N` or the JULIA_NUM_THREADS environment variable)")
end

function crout_inverse_single_threaded(A::Matrix{Float64})
    n = size(A, 1)
    L = zeros(n, n)
    U = Matrix{Float64}(I, n, n)
    block_size = 64  # Cache-friendly block size

    # Optimized Crout factorization
    @inbounds for j in 1:n
        # Compute column j of L (cache-blocked)
        for i_block in j:block_size:n
            i_end = min(i_block + block_size - 1, n)
            @turbo for i in i_block:i_end
                sum_val = 0.0
                for k in 1:j-1
                    sum_val += L[i,k] * U[k,j]
                end
                L[i,j] = A[i,j] - sum_val
            end
        end

        # Check singularity
        if abs(L[j,j]) < eps(Float64) * n * sqrt(n)
            error("Matrix is numerically singular")
        end

        # Compute row j of U (cache-blocked)
        inv_Ljj = 1.0 / L[j,j]
        for i_block in j+1:block_size:n
            i_end = min(i_block + block_size - 1, n)
            @turbo for i in i_block:i_end
                sum_val = 0.0
                for k in 1:j-1
                    sum_val += L[j,k] * U[k,i]
                end
                U[j,i] = (A[j,i] - sum_val) * inv_Ljj
            end
        end
    end

    # Single-threaded solve for inverse columns
    Ainv = zeros(n, n)
    for i in 1:n
        e = zeros(n)
        e[i] = 1.0
        y = zeros(n)
        x = zeros(n)

        # Forward substitution (optimized)
        @inbounds for row in 1:n
            sum_val = 0.0
            @turbo for col in 1:row-1
                sum_val += L[row,col] * y[col]
            end
            y[row] = (e[row] - sum_val) / L[row,row]
        end

        # Backward substitution (optimized)
        @inbounds for row in n:-1:1
            sum_val = 0.0
            @turbo for col in row+1:n
                sum_val += U[row,col] * x[col]
            end
            x[row] = (y[row] - sum_val) / U[row,row]
        end

        Ainv[:,i] .= x
    end

    return Ainv
end

# Gauss-Jordan inverse. It works on a copy of A and leaves A untouched,
# so concurrent calls on different matrices are thread-safe.
function gauss_safe!(A::Matrix{Float64})
    n = size(A, 1)
    B = copy(A)
    I_matrix = Matrix{Float64}(I, n, n)
    block_size = 64  # Cache-friendly block size

    @inbounds for k in 1:n
        # Pivoting (sequential)
        pivot_row = k
        for i in k+1:n
            if abs(B[i, k]) > abs(B[pivot_row, k])
                pivot_row = i
            end
        end

        if pivot_row != k
            B[k, :], B[pivot_row, :] = B[pivot_row, :], B[k, :]
            I_matrix[k, :], I_matrix[pivot_row, :] = I_matrix[pivot_row, :], I_matrix[k, :]
        end

        pivot = B[k, k]
        if abs(pivot) < eps(Float64) * n * sqrt(n)
            error("Matrix is numerically singular")
        end

        # Scale pivot row (parallel safe)
        scale = 1.0 / pivot
        @inbounds @simd for j in 1:n
            B[k, j] *= scale
            I_matrix[k, j] *= scale
        end

        # Elimination (parallel safe)
        @inbounds for i in 1:n
            if i != k
                factor = B[i, k]
                @simd for j in 1:n
                    B[i, j] -= factor * B[k, j]
                    I_matrix[i, j] -= factor * I_matrix[k, j]
                end
            end
        end
    end
    return I_matrix
end

# === Selected Implementations ===

# LAPACK-based inverses (LoopVectorization is already loaded above)

# Function to compute the inverse using LAPACK's dgetrf! and dgetri!
function lapack_inverse(A::Matrix{Float64})
    n = size(A, 1)
    n != size(A, 2) && error("Matrix must be square")

    # Create a copy since we modify in-place
    A_copy = copy(A)

    # Step 1: LU factorization (getrf!)
    result = LinearAlgebra.LAPACK.getrf!(A_copy)
    A_lu = result[1]
    ipiv = result[2]
    info_getrf = result[3]

    if info_getrf > 0
        error("Matrix is singular in getrf! at position $info_getrf")
    elseif info_getrf < 0
        error("Invalid argument in getrf! at position $(-info_getrf)")
    end

    # Step 2: Matrix inversion (getri!) from the LU factors and pivots
    return LinearAlgebra.LAPACK.getri!(A_lu, ipiv)
end

# Function to compute the inverse using LAPACK's dgesv: gesv!(A, B) -> (B, A, ipiv)
function lapack_inverse2(A::Matrix{Float64})
    n = size(A, 1)
    # Copy A (gesv! overwrites it with its LU factors) and build the identity as right-hand side
    A_copy = copy(A)
    B = Matrix{Float64}(I, n, n)  # Identity matrix of same size

    # Solve A_copy * X = B, which gives X = A⁻¹
    # gesv! returns the solution first (stored in B), then the LU factors and the pivots
    X, A_lu, ipiv = LinearAlgebra.LAPACK.gesv!(A_copy, B)

    return X  # The inverse (same array as B)
end



using RecursiveFactorization, LinearAlgebra

function recursivefactor_inverse(A::Matrix{Float64})
    n = size(A, 1)
    # LU factorization with recursive blocking
    F = RecursiveFactorization.lu!(copy(A))
    # Allocate output matrix
    Ainv = similar(A)
    # Solve A * Ainv[:,i] = I[:,i] for each column
    for i in 1:n
        e = zeros(n)
        e[i] = 1.0
        Ainv[:, i] = F \ e  # Forward/backward substitution
    end
    return Ainv
end

function recursivefactormt_inverse(A::Matrix{Float64})
    n = size(A, 1)
    # LU factorization (recursive, single-threaded)
    F = RecursiveFactorization.lu!(copy(A))
    Ainv = similar(A)

    # Multithreaded column-wise back-substitution
    Threads.@threads for i in 1:n
        e = zeros(n)
        e[i] = 1.0
        Ainv[:, i] = F \ e  # Each thread solves one column
    end

    return Ainv
end


using StaticArrays, LinearAlgebra

function staticarray_inverse(A::Matrix{Float64})
    # Convert to a StaticArray. The dimensions are runtime values here, so this
    # call is not type-stable; StaticArrays is intended for small matrices only.
    SA = SMatrix{size(A,1),size(A,2)}(A)
    # Compute the inverse with StaticArrays' specialized method
    SA_inv = inv(SA)
    # Convert back to a regular Matrix
    return Matrix(SA_inv)
end



function test_fpu()
    # Constants
    smallsize = 250
    smallits = 1000
    bigsize = 2000

    println("Benchmark running, hopefully as only ACTIVE task")
    println("Current time: ", now())

    # Warm-up runs
    warmup = rand(10,10)
    gauss_safe!(copy(warmup))
    crout_inverse_single_threaded(copy(warmup))

    # Initialize random number pools
    Random.seed!(1234)  # Fixed seed for reproducibility
    pool = rand(smallsize, smallsize, smallits)
    pool3 = rand(bigsize, bigsize)
    
    # Working matrices
    a = zeros(smallsize, smallsize)
    a3 = zeros(bigsize, bigsize)
    c  = zeros(smallsize, smallsize)

    # Timing results
    dt = zeros(10)
    
    for n in 1:10
        clock1 = time_ns()
        
        if n == 1
            # Test 1: Optimized Gauss inversion
            # Parallelize across matrices (thread-safe)
            Threads.@threads for i in 1:smallits
                local_a = copy(@view pool[:, :, i])
                local_a = gauss_safe!(local_a)  # invert
                local_a = gauss_safe!(local_a)  # invert back

                if i == smallits
                    a .= local_a  # Only last iteration for error check
                end
            end
            
            # Calculate error (compare to original)
            avg_err = sum(abs.(a .- pool[:, :, smallits]))/(smallsize*smallsize)
            clock2 = time_ns()
            dt[n] = (clock2 - clock1)/1e9
            @printf("Test1: GaussST  - %d (%dx%d) inverts in %.3f seconds  Err= %.16e\n",
                    smallits, smallsize, smallsize, dt[n], avg_err)

        elseif n == 2
            # Test 2: Crout inversion crout_inverse_single_threaded
            Threads.@threads for i in 1:smallits
                local_a = copy(@view pool[:, :, i])
                local_a = crout_inverse_single_threaded(local_a)  # invert
                local_a = crout_inverse_single_threaded(local_a)  # invert back

                if i == smallits
                    a .= local_a  # Only last iteration for error check
                end
            end
            
            # Calculate error (compare to original)
            avg_err = sum(abs.(a .- pool[:, :, smallits]))/(smallsize*smallsize)
            clock2 = time_ns()
            dt[n] = (clock2 - clock1)/1e9
            @printf("Test2: CroutST  - %d (%dx%d) inverts in %.3f seconds  Err= %.16e\n",
                    smallits, smallsize, smallsize, dt[n], avg_err)



        elseif n == 3
            # Test 3:  recursivefactor_inverse
            Threads.@threads for i in 1:smallits
                local_a = copy(@view pool[:, :, i])
                local_a = recursivefactor_inverse(local_a)  # invert
                local_a = recursivefactor_inverse(local_a)  # invert back

                if i == smallits
                    a .= local_a  # Only last iteration for error check
                end
            end
            
            # Calculate error (compare to original)
            avg_err = sum(abs.(a .- pool[:, :, smallits]))/(smallsize*smallsize)
            clock2 = time_ns()
            dt[n] = (clock2 - clock1)/1e9
            @printf("Test3: RecursiST- %d (%dx%d) inverts in %.3f seconds  Err= %.16e\n",
                    smallits, smallsize, smallsize, dt[n], avg_err)



        elseif n == 4
            # Test 4: multithreaded Crout inversion (croutmt_inverse_optimized).
            # The outer loop is serial because the function already threads internally.
            for i in 1:smallits
                local_a = copy(@view pool[:, :, i])
                local_a = croutmt_inverse_optimized(local_a)  # invert
                local_a = croutmt_inverse_optimized(local_a)  # invert back

                if i == smallits
                    a .= local_a  # Only last iteration for error check
                end
            end
            
            # Calculate error (compare to original)
            avg_err = sum(abs.(a .- pool[:, :, smallits]))/(smallsize*smallsize)
            clock2 = time_ns()
            dt[n] = (clock2 - clock1)/1e9
            @printf("Test4: CroutMT  - %d (%dx%d) inverts in %.3f seconds  Err= %.16e\n",
                    smallits, smallsize, smallsize, dt[n], avg_err)

        elseif n == 5
            # Test 5 Crout inversion Largesize
            a3 = copy(pool3)
            a3 = croutmt_inverse_optimized(a3)  # invert a
            a3 = croutmt_inverse_optimized(a3)  # invert back
            
            # Calculate error (compare to original)
            avg_err = sum(abs.(a3 .- pool3))/(bigsize*bigsize)
            clock2 = time_ns()
            dt[n] = (clock2 - clock1)/1e9
            @printf("Test5: CroutMT  - %d (%dx%d)  inverts in %.3f seconds  Err= %.16e\n",
                    2, bigsize, bigsize, dt[n], avg_err)


        elseif n == 6
            # Test 6: Lapack inversion Largesize
            a3 = copy(pool3)
            a3 = lapack_inverse(a3)  # invert a
            a3 = lapack_inverse(a3)  # invert back
            
            # Calculate error (compare to original)
            avg_err = sum(abs.(a3 .- pool3))/(bigsize*bigsize)
            clock2 = time_ns()
            dt[n] = (clock2 - clock1)/1e9
            @printf("Test6: DGETRF/I - %d (%dx%d)  inverts in %.3f seconds  Err= %.16e\n",
                    2, bigsize, bigsize, dt[n], avg_err)

        elseif n == 7
            # Test 7: Lapack2 inversion Largesize
            a3 = copy(pool3)
            a3 = lapack_inverse2(a3)  # invert a
            a3 = lapack_inverse2(a3)  # invert back
            
            # Calculate error (compare to original)
            avg_err = sum(abs.(a3 .- pool3))/(bigsize*bigsize)
            clock2 = time_ns()
            dt[n] = (clock2 - clock1)/1e9
            @printf("Test7: DGESV    - %d (%dx%d)  inverts in %.3f seconds  Err= %.16e\n",
                    2, bigsize, bigsize, dt[n], avg_err)

        elseif n == 8
            # Test 8: native
            a3 = copy(pool3)
            a3 = inv(a3)  # invert a
            a3 = inv(a3)  # invert back
            
            # Calculate error (compare to original)
            avg_err = sum(abs.(a3 .- pool3))/(bigsize*bigsize)
            clock2 = time_ns()
            dt[n] = (clock2 - clock1)/1e9
            @printf("Test8: inv(A)   - %d (%dx%d)  inverts in %.3f seconds  Err= %.16e\n",
                    2, bigsize, bigsize, dt[n], avg_err)

        elseif n == 9
            # Test 9: recursive
            a3 = copy(pool3)
            a3 = recursivefactormt_inverse(a3)  # invert a
            a3 = recursivefactormt_inverse(a3)  # invert back
            
            # Calculate error (compare to original)
            avg_err = sum(abs.(a3 .- pool3))/(bigsize*bigsize)
            clock2 = time_ns()
            dt[n] = (clock2 - clock1)/1e9
            @printf("Test9: RecursiMT- %d (%dx%d)  inverts in %.3f seconds  Err= %.16e\n",
                    2, bigsize, bigsize, dt[n], avg_err)



        end
    end
    
    @printf("                             Total = %.1f sec\n", sum(dt))
    println()
end



# Run the benchmark
test_fpu()
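
In `test_fpu`, the warm-up covers only `gauss_safe!` and `crout_inverse_single_threaded`, so the first timed call of each remaining function may also include its compilation time. A minimal sketch of a timing helper that runs the function once on a tiny matrix before timing it. The helper name `timed_after_warmup` is hypothetical and not part of the script above:

```julia
using LinearAlgebra

# Hypothetical helper (not in the original script): call f once on a tiny
# matrix so compilation happens outside the timed region, then time f(A).
function timed_after_warmup(f, A::Matrix{Float64})
    f(rand(8, 8) + 8I)                # warm-up call; result discarded
    t0 = time_ns()
    result = f(A)
    elapsed = (time_ns() - t0) / 1e9  # seconds, same clock as the benchmark script
    return result, elapsed
end

A = rand(200, 200) + 200I             # diagonally dominant, safely invertible
Ainv, secs = timed_after_warmup(inv, A)
println("inv(A) took ", secs, " seconds after warm-up")
```

The same helper could wrap `lapack_inverse` or `recursivefactor_inverse`, since both take a `Matrix{Float64}` and return the inverse.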

When I checked the double inverse of a singular matrix, the error was still reasonable:

=== Testing Singular/Ill-Conditioned Matrices (250×250) ===

Matrix: Singular
CroutMT : Error = 1.138e-02 (inverted → inverted-back)
LAPACK : Error = 2.081e-02 (inverted → inverted-back)
RecursiveMT : Error = 6.575e-02 (inverted → inverted-back)
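
The round-trip figures above use the same kind of metric as the benchmark script: the mean absolute difference between a matrix and its twice-inverted copy. A minimal sketch of that check, using a Hilbert matrix as a stand-in ill-conditioned input. The matrix choice and the helper names `roundtrip_error` and `hilbert` are illustrative assumptions, not the matrices behind the numbers above:

```julia
using LinearAlgebra

# Mean absolute difference between A and inv(inv(A)), as in the benchmark script
roundtrip_error(A) = sum(abs.(inv(inv(A)) .- A)) / length(A)

# Hilbert matrix: a classic ill-conditioned test case
hilbert(n) = [1.0 / (i + j - 1) for i in 1:n, j in 1:n]

well = rand(50, 50) + 50I   # diagonally dominant, well conditioned
ill  = hilbert(10)          # condition number grows rapidly with n

for (name, M) in (("well-conditioned", well), ("Hilbert(10)", ill))
    println(name, ": cond = ", cond(M), ", round-trip error = ", roundtrip_error(M))
end
```

Printing `cond(M)` next to the error helps judge whether a given round-trip error is large for that matrix or simply what its conditioning allows.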

aasdelat · May 20, 2025, 6:10pm

I contacted toolchainsupport@amd.com one week ago, but I still haven’t received any answer.

aasdelat · May 20, 2025, 6:11pm

Sorry for my delay, I will return to this as soon as possible.
