AOCL (not MKL) acceleration on AMD Ryzen CPUs

Hi:
I am interested in calling a LAPACK routine, dgesvd, which computes the singular value decomposition of a matrix.
The library LinearAlgebra.jl already does this, but it is a generic library that does not take full advantage of the specific CPU you use.

For Intel CPUs there is MKL, which can be used in Julia by means of the MKL.jl library. In order to use it, you simply import these libraries in the following order:

using MKL
using LinearAlgebra

In this way, the library LinearAlgebra will call the routines in the MKL library instead of the default LAPACK library.
This library (MKL) seems to also boost the performance of AMD CPUs, but it does not seem to take full advantage of them.

On the other hand, AMD has developed the AOCL libraries, which replace LAPACK, BLAS, etc., taking full advantage of AMD's CPUs.
In order to use them in Julia, the appropriate thing would be to have an AOCL.jl package that you could use in the same manner as MKL.jl is being used.
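Something like this (purely hypothetical, since no such package exists):

using AOCL          # hypothetical package name, mirroring MKL.jl
using LinearAlgebra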
It would also be nice if I had the skills needed to develop such a Julia package, but I do not have them, so I have to wait for somebody to write it.
Meanwhile, I want to tackle a much easier problem that may be enough for my current needs.

This easier problem is to call the dgesvd routine from the libflame.so library (the AOCL replacement for liblapack.so) by means of ccall.
For this purpose, I have inspected the source file lapack.jl, which is the interface to the LAPACK library, and have copied a piece of code and adapted it. In order to find the correct symbol in the shared library corresponding to the dgesvd function, I have used:

nm -D libflame.so | grep dgesvd

and got a bunch of possible symbols (23 symbols):

000000000032b2a0 T dgesvd
0000000000703460 T dgesvd_
000000000053b330 T dgesvd2x2
00000000007155e0 T dgesvd_check
...... a lot more .......

I have tested several of them, beginning with dgesvd. This one seems to be the appropriate one, but when I run dgesvd!(jobu, jobvt, A), I get the error:

~/.julia/juliaup/julia-1.10.2+0.x64.linux.gnu/bin/julia: symbol lookup error: /opt/AMD/aocl/aocl-linux-aocc-4.2.0/aocc/lib/libflame.so: undefined symbol: dnrm2_

If I use other symbols, the same or other errors appear.
The code I have written so far is:
test_lapack_svd_call.jl (5.7 KB)
You can test it by means of:

julia> include("test_lapack_svd_call.jl")
julia> dgesvd!(jobu, jobvt, A)

The matrix A is generated inside the script, but the script does not call the function, so you have to call it yourself after loading the script.
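For reference, the core of the adapted code is essentially the following. This is a condensed sketch rather than the attached file verbatim: the library path is my local one, I am assuming this libflame build uses 32-bit (LP64) integers (hence Cint), and the two trailing Clong arguments are the hidden Fortran character lengths, copied from the pattern in lapack.jl:

const libflame = "/opt/AMD/aocl/aocl-linux-aocc-4.2.0/aocc/lib/libflame.so"

function dgesvd!(jobu::Char, jobvt::Char, A::Matrix{Float64})
    m, n  = size(A)
    minmn = min(m, n)
    U     = Matrix{Float64}(undef, m, jobu == 'A' ? m : minmn)
    S     = Vector{Float64}(undef, minmn)
    VT    = Matrix{Float64}(undef, jobvt == 'A' ? n : minmn, n)
    work  = Vector{Float64}(undef, 1)
    lwork = Cint(-1)                 # -1 asks dgesvd for the optimal work size
    info  = Ref{Cint}(0)
    for pass in 1:2                  # pass 1: workspace query; pass 2: compute
        ccall((:dgesvd_, libflame), Cvoid,
              (Ref{UInt8}, Ref{UInt8}, Ref{Cint}, Ref{Cint},
               Ptr{Float64}, Ref{Cint}, Ptr{Float64}, Ptr{Float64},
               Ref{Cint}, Ptr{Float64}, Ref{Cint}, Ptr{Float64},
               Ref{Cint}, Ref{Cint}, Clong, Clong),
              jobu, jobvt, m, n, A, max(1, m), S, U, max(1, m),
              VT, max(1, minmn), work, lwork, info, 1, 1)
        info[] == 0 || error("dgesvd_ returned info = $(info[])")
        if pass == 1
            lwork = Cint(work[1])
            resize!(work, lwork)
        end
    end
    return U, S, VT
end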
Can anybody help me, please?

Thanks a lot in advance.


Did you try

nm -D libflame.so | grep dnrm2_

?

Given that it is provided by libblis (these are 64-bit integers, hence the extra 64 suffix):

> nm -D ~/.julia/artifacts/3d9c1a6f0bd9ed317a031702dc47068a05ac7a9c/lib/libblis.so | grep dnrm2
0000000000a798b0 T cblas_dnrm2
0000000000a6b630 T dnrm2_64_
0000000000a740a0 T dnrm2sub_64_

I'm guessing it isn't provided by libflame, and instead that it expects you to (dynamically?) link libblis.
Related: Calling code from a library (.so): works on one system, fails on another - #3 by m-j-w


I myself quickly looked into building AOCL jlls. You need to link them all against each other, as indicated in their build instructions. The above suggestion most probably pinpoints the problem.


Wow, great! Thank you in advance for your work. I am sure many people will be grateful for it.

Thank you, @Elrod.
First of all, an internet search on dnrm2 leads to some pages where dnrm2 is described as:

 *> DNRM2 returns the euclidean norm of a vector via the function
 *> name, so that
 *>
 *>    DNRM2 := sqrt( x'*x )

And it is part of LAPACK, but it lives in the BLAS library.

Inspecting the libflame.so as suggested:

$ nm -D libflame.so | grep dnrm2
00000000002a73f0 T bl1_dnrm2
                 U dnrm2_

According to the nm documentation, U means "The symbol is undefined", which, I think, means that the code for that function is not in libflame.so. This is what I expected, if dnrm2 is in BLAS. So I inspected the libblis.so (single-threaded) and libblis-mt.so (multi-threaded) libraries, and got:

$ nm -D libblis.so | grep dnrm2_
0000000000930100 T dnrm2_
0000000000930090 T dnrm2_blis_impl

and

$ nm -D libblis-mt.so | grep dnrm2
000000000094a7d0 T cblas_dnrm2
00000000009a9fb0 T dnrm2
000000000092dd70 T dnrm2_
000000000092dd00 T dnrm2_blis_impl
00000000009ab780 T dnrm2sub
0000000000944aa0 T dnrm2sub_
0000000000944a90 T dnrm2sub_blis_impl

respectively. According to the nm documentation, "T" or "t" means: "The symbol is in the text (code) section."

So the symbol exists, not in libflame.so but in libblis.so and libblis-mt.so. So I have to link against one of these libraries (well, I am interested in the multi-threaded one).

As suggested in the linked thread, I have checked my LD_LIBRARY_PATH, getting:

/opt/AMD/aocl/aocl-linux-aocc-4.2.0/aocc/lib:/opt/AMD/aocc-compiler-4.2.0/ompd:/opt/AMD/aocc-compiler-4.2.0/lib:/opt/AMD/aocc-compiler-4.2.0/lib32:/usr/lib/x86_64-linux-gnu:/usr/lib64:/usr/lib32:/usr/lib

Going to the first listed directory:

$ ls
cmake                     libblis-mt.so            libfftw3l_mpi.la         libfftw3q_omp.so.3
libalcp.a                 libblis-mt.so.4          libfftw3l_mpi.so         libfftw3q_omp.so.3.6.10
libalcp.so                libblis-mt.so.4.2.0      libfftw3l_mpi.so.3       libfftw3q.so
libalm.a                  libblis.so               libfftw3l_mpi.so.3.6.10  libfftw3q.so.3
libalmfast.a              libblis.so.4             libfftw3l_omp.a          libfftw3q.so.3.6.10
libalmfast.so             libblis.so.4.2.0         libfftw3l_omp.la         libfftw3.so
libalm.so                 libbz2.so                libfftw3l_omp.so         libfftw3.so.3
libamdlibm.a              libfftw3.a               libfftw3l_omp.so.3       libfftw3.so.3.6.10
libamdlibmfast.a          libfftw3f.a              libfftw3l_omp.so.3.6.10  libflame.a
libamdlibmfast.so         libfftw3f.la             libfftw3l.so             libflame.so
libamdlibm.so             libfftw3f_mpi.a          libfftw3l.so.3           libipp-compat.so
libamdsecrng.a            libfftw3f_mpi.la         libfftw3l.so.3.6.10      liblz4.so
libamdsecrng.so           libfftw3f_mpi.so         libfftw3_mpi.a           liblzma.so
libamdsecrng.so.4.2       libfftw3f_mpi.so.3       libfftw3_mpi.la          libopenssl-compat.so
libamdsecrng.so.4.2.0     libfftw3f_mpi.so.3.6.10  libfftw3_mpi.so          librng_amd.a
libaocl_compression.a     libfftw3f_omp.a          libfftw3_mpi.so.3        librng_amd.so
libaocl_compression.so    libfftw3f_omp.la         libfftw3_mpi.so.3.6.10   librng_amd.so.4.2
libaocl-libmem.a          libfftw3f_omp.so         libfftw3_omp.a           librng_amd.so.4.2.0
libaocl-libmem.so         libfftw3f_omp.so.3       libfftw3_omp.la          libscalapack.a
libaocl-libmem.so.4.2.0   libfftw3f_omp.so.3.6.10  libfftw3_omp.so          libscalapack.so
libaoclsparse.a           libfftw3f.so             libfftw3_omp.so.3        libsnappy.so
libaoclsparse.so          libfftw3f.so.3           libfftw3_omp.so.3.6.10   libz.so
libaoclsparse.so.4.2.0.0  libfftw3f.so.3.6.10      libfftw3q.a              libzstd.so
libaoclutils.a            libfftw3.la              libfftw3q.la             pkgconfig
libaoclutils.so           libfftw3l.a              libfftw3q_omp.a
libblis.a                 libfftw3l.la             libfftw3q_omp.la
libblis-mt.a              libfftw3l_mpi.a          libfftw3q_omp.so

We can see libblis.so and libblis-mt.so, so the LD_LIBRARY_PATH is correct.

In addition, let's see if libflame.so knows where libblis is:

$ ldd libflame.so 
	linux-vdso.so.1 (0x00007ffd9d4eb000)
	libaoclutils.so => /opt/AMD/aocl/aocl-linux-aocc-4.2.0/aocc/lib/libaoclutils.so (0x0000737c0e67c000)
	libomp.so => /opt/AMD/aocc-compiler-4.2.0/lib/libomp.so (0x0000737c0d200000)
	libpthread.so.0 => /usr/lib/x86_64-linux-gnu/libpthread.so.0 (0x0000737c0e677000)
	libc.so.6 => /usr/lib/x86_64-linux-gnu/libc.so.6 (0x0000737c0ce00000)
	/lib64/ld-linux-x86-64.so.2 (0x0000737c0e688000)
	libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x0000737c0ca00000)
	libm.so.6 => /usr/lib/x86_64-linux-gnu/libm.so.6 (0x0000737c0d519000)
	libgcc_s.so.1 => /usr/lib/x86_64-linux-gnu/libgcc_s.so.1 (0x0000737c0e655000)
	librt.so.1 => /usr/lib/x86_64-linux-gnu/librt.so.1 (0x0000737c0e650000)
	libdl.so.2 => /usr/lib/x86_64-linux-gnu/libdl.so.2 (0x0000737c0e64b000)

There are no unresolved addresses, but it does not show any dependency on libblis.so or libblis-mt.so (well, my understanding of ldd is not exactly expert, but I think this is the correct interpretation).

So, the last option, I think, is to explicitly link the libblis-mt.so. According to the AOCL documentation:

To use AOCL-LAPACK in your application, link with AOCL-LAPACK, AOCL-BLAS, and AOCL-Utils libraries while building the application.

AOCL-Utils library has libstdc++ library dependency. As AOCL-LAPACK is dependent on AOCL-Utils, applications must link with libstdc++(-lstdc++) as well.

But, how do I link against several shared libraries in Julia? Is there a way to do this with ccall, or do I have to do weird compilation or pre-compilation things?
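One possibility I have found in the Libdl documentation, although I do not know whether it is the right approach here: pre-loading the dependency with RTLD_GLOBAL, so that its symbols are already visible when libflame.so is loaded afterwards. A sketch of what I mean, with my local paths:

using Libdl

# Make the BLAS symbols (dnrm2_ among them) globally visible before
# libflame.so is loaded, so the dynamic loader can resolve them.
const aocl_lib = "/opt/AMD/aocl/aocl-linux-aocc-4.2.0/aocc/lib"
dlopen(joinpath(aocl_lib, "libblis-mt.so"), RTLD_LAZY | RTLD_GLOBAL)
const flame_handle = dlopen(joinpath(aocl_lib, "libflame.so"))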


I mean, what should always work is to write a tiny wrapper in C that imports the required libraries and exports just the function that you want to use from Julia…

But perhaps there is another, easier way…

Wow! I have been struggling with how to do this, and finally the following worked. I have created the file dgesvd_wrapper.f90:

module dgesvd_wrapper

PUBLIC :: my_dgesvd

contains

function my_dgesvd( JOBU, JOBVT, M, N, A, LDA, S, U, LDU, VT, LDVT,  &
                   WORK, LWORK, INFO )

      ! .. Scalar Arguments ..
      CHARACTER          JOBU, JOBVT
      INTEGER            INFO, LDA, LDU, LDVT, LWORK, M, N

      !.. Array Arguments ..
      DOUBLE PRECISION   A( LDA, * ), S( * ), U( LDU, * ),          &
                         VT( LDVT, * ), WORK( * )
                         
    call dgesvd(JOBU, JOBVT, M, N, A, LDA, S, U, LDU, VT, LDVT,  &
           WORK, LWORK, INFO )
           
end function my_dgesvd

end module dgesvd_wrapper

And created the shared library by:

$ flang -g -shared /opt/AMD/aocl/aocl-linux-aocc-4.2.0/aocc/lib/libflame.a /opt/AMD/aocl/aocl-linux-aocc-4.2.0/aocc/lib/libblis-mt.a /opt/AMD/aocl/aocl-linux-aocc-4.2.0/aocc/lib/libaoclutils.a /usr/lib/x86_64-linux-gnu/libstdc++.so.6 -lm -fopenmp dgesvd_wrapper.f90 -o desvd_wrapper.so

It is important to note the -lm -fopenmp options, which are documented in table 6 of section "4.2.3 Linking Application with AOCL-BLAS" of the AOCL documentation.

$ nm -D desvd_wrapper.so | grep dgesvd
00000000000d4900 T dgesvd_
00000000000d4840 T __dgesvd_wrapper_
00000000008eed40 B _dgesvd_wrapper_0_
00000000000d4850 T dgesvd_wrapper_my_dgesvd_
000000000071fd70 T fla_dgesvd_nn_small10
0000000000726830 T fla_dgesvd_nn_small10_avx2
000000000071fdb0 T fla_dgesvd_nn_small1T
0000000000735b90 T fla_dgesvd_nn_small1T_avx2
000000000071fd90 T fla_dgesvd_small6
00000000007335d0 T fla_dgesvd_small6_avx2
000000000071fdd0 T fla_dgesvd_small6T
00000000007373c0 T fla_dgesvd_small6T_avx2
00000000008c1870 T lapack_dgesvd

Now it computes the SVD when I use any of the symbols dgesvd_, lapack_dgesvd, or dgesvd_wrapper_my_dgesvd_.
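As a quick sanity check from Julia, the standard Libdl module can be used to confirm that the combined library resolves:

using Libdl

# The wrapper library built above has the blis code baked in, so the
# symbol lookup should succeed where plain libflame.so failed.
h = dlopen("./desvd_wrapper.so")
dlsym(h, :dgesvd_)   # returns a valid Ptr instead of raising a lookup error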
In order to use more than one thread, I also have to:

export OMP_NUM_THREADS=30

It uses all 30 threads for a while, then one thread, and, lastly, all 30 threads again. I do not know the reason for this behavior and I have to run more tests.


What does ldd on libflame show from within Julia (in shell mode)? Did you start Julia with LD_LIBRARY_PATH modified?


Any benchmark results already?

I have to say that there is a mistake in the Fortran code: it should say "subroutine" where it says "function". Once corrected in my code and recompiled, the behavior does not change at all.

ENVIRONMENT

First of all, I will tell more about my environment:
$ uname -a
Linux ryzen-casa 6.5.0-27-generic #28~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Fri Mar 15 10:51:06 UTC 2 x86_64 x86_64 x86_64 GNU/Linux

I have installed the AOCL libraries from the binaries: aocl-linux-aocc-4.2.0_1_amd64.deb.
I also have installed the AOCC compiler suite from the binaries: aocc-compiler-4.2.0_1_amd64.deb

I set the necessary environment variables for these packages by means of the respective scripts provided by each package, and that I call from my ~/.profile
So, the last two lines of my ~/.profile are:

source /opt/AMD/aocc-compiler-4.2.0/setenv_AOCC.sh
source /opt/AMD/aocl/aocl-linux-aocc-4.2.0/aocc/amd-libs.cfg

Regarding my LD_LIBRARY_PATH, in a shell just before starting Julia:

$ export | grep LD_LIBRARY_PATH
declare -x LD_LIBRARY_PATH="/opt/AMD/aocl/aocl-linux-aocc-4.2.0/aocc/lib:/opt/AMD/aocc-compiler-4.2.0/ompd:/opt/AMD/aocc-compiler-4.2.0/lib:/opt/AMD/aocc-compiler-4.2.0/lib32:/usr/lib/x86_64-linux-gnu:/usr/lib64:/usr/lib32:/usr/lib:"

In a shell from Julia:

shell> ldd /opt/AMD/aocl/aocl-linux-aocc-4.2.0/aocc/lib/libflame.so
	linux-vdso.so.1 (0x00007ffc9bf98000)
	libaoclutils.so => /opt/AMD/aocl/aocl-linux-aocc-4.2.0/aocc/lib/libaoclutils.so (0x0000718616c4f000)
	libomp.so => /opt/AMD/aocc-compiler-4.2.0/lib/libomp.so (0x0000718615800000)
	libpthread.so.0 => /usr/lib/x86_64-linux-gnu/libpthread.so.0 (0x0000718616c4a000)
	libc.so.6 => /usr/lib/x86_64-linux-gnu/libc.so.6 (0x0000718615400000)
	/lib64/ld-linux-x86-64.so.2 (0x0000718616c5b000)
	libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x0000718615000000)
	libm.so.6 => /usr/lib/x86_64-linux-gnu/libm.so.6 (0x0000718615b19000)
	libgcc_s.so.1 => /usr/lib/x86_64-linux-gnu/libgcc_s.so.1 (0x0000718616c28000)
	librt.so.1 => /usr/lib/x86_64-linux-gnu/librt.so.1 (0x0000718616c23000)
	libdl.so.2 => /usr/lib/x86_64-linux-gnu/libdl.so.2 (0x0000718616c1e000)

shell> ldd desvd_wrapper.so
	linux-vdso.so.1 (0x00007ffe41b6d000)
	libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007ef349c00000)
	libm.so.6 => /usr/lib/x86_64-linux-gnu/libm.so.6 (0x00007ef349f19000)
	libflang.so => /opt/AMD/aocc-compiler-4.2.0/lib/libflang.so (0x00007ef349600000)
	libflangrti.so => /opt/AMD/aocc-compiler-4.2.0/lib/libflangrti.so (0x00007ef34a928000)
	libpgmath.so => /opt/AMD/aocc-compiler-4.2.0/lib/libpgmath.so (0x00007ef349200000)
	libquadmath.so.0 => /usr/lib/x86_64-linux-gnu/libquadmath.so.0 (0x00007ef349ed1000)
	libomp.so => /opt/AMD/aocc-compiler-4.2.0/lib/libomp.so (0x00007ef348e00000)
	libgcc_s.so.1 => /usr/lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007ef34a906000)
	libc.so.6 => /usr/lib/x86_64-linux-gnu/libc.so.6 (0x00007ef348a00000)
	/lib64/ld-linux-x86-64.so.2 (0x00007ef34a937000)
	librt.so.1 => /usr/lib/x86_64-linux-gnu/librt.so.1 (0x00007ef34a901000)
	libpthread.so.0 => /usr/lib/x86_64-linux-gnu/libpthread.so.0 (0x00007ef34a8fa000)
	libdl.so.2 => /usr/lib/x86_64-linux-gnu/libdl.so.2 (0x00007ef349ecc000)

Here desvd_wrapper.so is the shared library I created containing libflame, blis and others. Recall that this has been necessary because libflame calls routines from blis (and others), and passing libflame alone to Julia's ccall complains about those routines being missing.
However, from a shell from Julia:

shell> echo $LD_LIBRARY_PATH
ERROR: UndefVarError: `LD_LIBRARY_PATH` not defined
Stacktrace:
 [1] top-level scope
   @ none:1

But, as I have stated above, if I test for LD_LIBRARY_PATH in the shell just before starting Julia, I can see that it is correctly set. (I suppose Julia's shell mode interpolates $LD_LIBRARY_PATH as a Julia variable, hence the error.)

BENCHMARKING

The calculation has a strange behavior: it starts using all (30 assigned) threads, then uses only one thread for a long time, and, at last, uses all threads again until it finishes and returns the result.

I do not know how to separately measure these different periods of the run time, so I resolved to time them with a handheld stopwatch (well, in fact a smartphone 🙂) while keeping an eye on gnome-system-monitor, hitting the lap button of the chronometer each time I see all threads ramping up or dropping down.

The relevant lines of the script are:

A = rand(100_000, 5_000)
...
@benchmark U, S, VT = dgesvd!(jobu, jobvt, A)

The results of the hand timing are the following. Due to the measurement method, the times are not accurate and can be one or two seconds shorter than stated:

Start   00:00
        all threads working (at 100% or nearly) for 01:24
-       01:24
        Only one thread working (100%) for 05:10
-       06:35
        all threads working again (at 100% or nearly) for 02:48
End     09:23

And the output is:

julia> include("test_lapack_svd_call.jl")
Generating random matrix ...
Making the SVD ...
BenchmarkTools.Trial: 1 sample with 1 evaluation.
 Single result which took 80.796 s (0.01% GC) to evaluate,
 with a memory estimate of 4.10 GiB, over 60 allocations.

If I use @time U, S, VT = dgesvd!(jobu, jobvt, A) instead of @benchmark U, S, VT = dgesvd!(jobu, jobvt, A):

julia> include("test_lapack_svd_call.jl")
Generating random matrix ...
Making the SVD ...
409.325711 seconds (39.38 k allocations: 4.103 GiB, 0.01% gc time, 0.01% compilation time)

409.325711 seconds ~ 7 minutes

Let's compare it with the svd provided by LinearAlgebra.jl. In this case, the script is simple and looks like:

using LinearAlgebra
using BenchmarkTools

println("Generating random matrix ...")
A = rand(100_000, 5_000)

println("Making the SVD ...")
@benchmark F = svd(A)

There is no strange behavior, in the sense that all (30 assigned) threads are working from the beginning to the end.
And the result is:

julia> include("test_svd_julia.jl")
Generating random matrix ...
Making the SVD ...
BenchmarkTools.Trial: 1 sample with 1 evaluation.
 Single result which took 83.899 s (0.27% GC) to evaluate,
 with a memory estimate of 8.38 GiB, over 13 allocations.

For the timing, I do not have to measure separate periods, so I launch the modified script:

using LinearAlgebra
using BenchmarkTools

println("Generating random matrix ...")
A = rand(100_000, 5_000)

println("Making the SVD ...")
@time F = svd(A)

And I get:

julia> include("test_svd_julia.jl")
Generating random matrix ...
Making the SVD ...
 85.518271 seconds (110.90 k allocations: 8.390 GiB, 0.08% gc time, 0.05% compilation time)

Finally, let's use a much smaller matrix in order to get benchmark statistics.
Now, for my dgesvd, I use:

A = rand(100, 50)
...
@benchmark U, S, VT = dgesvd!(jobu, jobvt, A)

And the output is:

julia> include("test_lapack_svd_call.jl")
Generating random matrix ...
Making the SVD ...
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  195.509 μs …  1.472 ms  ┊ GC (min … max): 0.00% … 83.37%
 Time  (median):     199.216 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   200.562 μs ± 22.550 μs  ┊ GC (mean ± σ):  0.40% ±  2.74%

       ▂▅▇██▇▅▃
  ▂▂▃▅██████████▇▆▄▄▄▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▁▂▂▂▂▂▁▂▂▂▂▂ ▃
  196 μs          Histogram: frequency by time          218 μs <

 Memory estimate: 105.59 KiB, allocs estimate: 39.

And the output with LinearAlgebra's svd:

julia> include("test_svd_julia.jl")
Generating random matrix ...
Making the SVD ...
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  428.609 μs …  2.338 ms  ┊ GC (min … max): 0.00% … 76.33%
 Time  (median):     441.087 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   450.616 μs ± 55.806 μs  ┊ GC (mean ± σ):  0.60% ±  3.42%

    ▁▃▅▇██▇▆▅▃▂▂▂▂▂▂▂▃▄▄▃▃▁▁                  ▂▃▃▂▁            ▃
  ▅██████████████████████████▇▇▆▆▆▅▃▄▁▄▃▃▄▄▁▆▇███████▇▆▆▅▆▃▃▅▆▇ █
  429 μs        Histogram: log(frequency) by time       527 μs <

 Memory estimate: 182.59 KiB, allocs estimate: 11.

CONCLUSION

It seems that, for small matrices, my dgesvd behaves better than LinearAlgebra's svd. For large matrices, my dgesvd not only takes more time, but also shows that strange behavior. Could this strange behavior be fixed and, hence, the times improved?

Try like this:

julia> ENV["LD_LIBRARY_PATH"]
"/usr/local/lib/x86_64-unknown-linux-gnu/:/usr/local/lib/:/usr/local/lib/x86_64-unknown-linux-gnu/:/usr/local/lib/"

Mind trying LinearAlgebra after using MKL?
That's likely to do better at small sizes than the default OpenBLAS, even on AMD.
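(On Julia 1.7 and later you can confirm which backend libblastrampoline has loaded:)

using MKL
using LinearAlgebra

# Should report MKL rather than OpenBLAS after `using MKL`:
BLAS.get_config()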

julia> ENV["LD_LIBRARY_PATH"]
"/opt/AMD/aocl/aocl-linux-aocc-4.2.0/aocc/lib:/opt/AMD/aocc-compiler-4.2.0/ompd:/opt/AMD/aocc-compiler-4.2.0/lib:/opt/AMD/aocc-compiler-4.2.0/lib32:/usr/lib/x86_64-linux-gnu:/usr/lib64:/usr/lib32:/usr/lib:"

This way it shows up!

When I use the MKL.jl library, it uses more CPU threads, but I set:

N_THREADS=30
export JULIA_NUM_THREADS=1
export OMP_NUM_THREADS=$N_THREADS
export OPENBLAS_NUM_THREADS=$N_THREADS
export MKL_NUM_THREADS=$N_THREADS
export VECLIB_MAXIMUM_THREADS=$N_THREADS
export NUMEXPR_NUM_THREADS=$N_THREADS
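As far as I know, the BLAS thread count can also be set from inside Julia:

using LinearAlgebra

BLAS.set_num_threads(30)   # applies to the active BLAS/LAPACK backend
BLAS.get_num_threads()     # check what was actually applied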

The script is now:

using MKL
using LinearAlgebra
using BenchmarkTools

println("Generating random matrix ...")
A = rand(100_000, 5_000) # Big matrix
#A = rand(100, 50)        # Small matrix

println("Making the SVD ...")
@benchmark F = svd(A)

and watching gnome-system-monitor it seems that only 12 CPU threads are at 100%. There are 4 threads at >60%. This makes nearly 15 threads. Could this mean that CPU multi-threading (SMT) is not very effective for this application, and only the physical cores are worth using?
The result is (it begins at 12:53 and ends at 12:55, i.e., it lasts about 2 minutes):

julia> include("test_svd_julia.jl")
Generating random matrix ...
Making the SVD ...
BenchmarkTools.Trial: 1 sample with 1 evaluation.
 Single result which took 42.031 s (0.56% GC) to evaluate,
 with a memory estimate of 8.38 GiB, over 13 allocations.

Despite the fact that it does not use 30 threads, it takes only about 2 minutes, which is much better than my dgesvd.

And now, with a small matrix, A = rand(100, 50):

julia> include("test_svd_julia.jl")
Generating random matrix ...
Making the SVD ...
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  433.429 μs …  1.910 ms  ┊ GC (min … max): 0.00% … 72.58%
 Time  (median):     448.015 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   453.471 μs ± 39.689 μs  ┊ GC (mean ± σ):  0.41% ±  2.91%

          ▂▅▇██▆▄▂
  ▂▂▂▃▄▅▇█████████▆▄▄▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▂▂▂▂▂▂▂▂▂▃▃▂▃▃▃▃▂▂▂▂ ▃
  433 μs          Histogram: frequency by time          509 μs <

 Memory estimate: 182.59 KiB, allocs estimate: 11.

About the strange behavior of the dgesvd function: in order to test whether it is a Julia or an AOCL problem, I have written a Fortran program that uses that function, and the result is that it behaves the same: it starts using all threads, then uses only one thread for a long time, and, finally, all threads again. The overall time is far from the 2 minutes taken by the MKL library.
Here is the code:

program use_dgesvd_aocl

implicit none

double precision    ::  A(100000, 5000) ! Big matrix
!double precision    ::  A(100, 50)      ! Small matrix

! .. Scalar Arguments ..
CHARACTER        ::  jobu, jobvt
INTEGER          ::  info, lda, ldu, ldvt, lwork, m, n, minmn

               
double precision, allocatable, dimension(:,:)   :: U, VT
double precision, allocatable, dimension(:)     :: S, work

                 
call RANDOM_NUMBER(A)

jobu  = 'S' ! 'S':  the first min(m,n) columns of U (the left singular
            !       vectors) are returned in the array U
jobvt = 'S' ! 'S':  the first min(m,n) rows of V**T (the right singular
            !       vectors) are returned in the array VT
m = size(A,1) ! number of rows of submatrix A used in the computation.
n = size(A,2) ! number of columns of submatrix A used in the computation.

minmn  = min(m, n)
lda = max(m, 1)
ldu = m
ldvt = minmn

allocate(S(minmn))
allocate(work(1))
allocate(U(ldu, minmn))
allocate(VT(ldvt, n))


! First call: get the best length for array work, in the integer lwork
lwork = -1 ! Query
call dgesvd(JOBU, JOBVT, M, N, A, LDA, S, U, LDU, VT, LDVT,  &
       WORK, LWORK, INFO )
if (info == 0) then
    lwork = work(1)
    deallocate(work)
    allocate(work(lwork))
else
    write(*,*) "Error: ", info
end if

call dgesvd(JOBU, JOBVT, M, N, A, LDA, S, U, LDU, VT, LDVT,  &
       WORK, LWORK, INFO )
       
write(*,*) "info: ", info
write(*,*) "          = 0:  successful exit."
write(*,*) "          < 0:  if INFO = -i, the i-th argument had an illegal value."
write(*,*) "          > 0:  if DBDSQR did not converge, INFO specifies how many"
write(*,*) "                superdiagonals of an intermediate bidiagonal form B"
write(*,*) "                did not converge to zero. See the description of WORK"
write(*,*) "                above for details."

end program use_dgesvd_aocl

I compile it by means of:

flang -g /opt/AMD/aocl/aocl-linux-aocc-4.2.0/aocc/lib/libflame.so /opt/AMD/aocl/aocl-linux-aocc-4.2.0/aocc/lib/libblis-mt.so /opt/AMD/aocl/aocl-linux-aocc-4.2.0/aocc/lib/libaoclutils.so /usr/lib/x86_64-linux-gnu/libstdc++.so.6 -lm -fopenmp use_dgesvd_aocl.f90 -o use_dgesvd_aocl

MKL outperforms AOCL, but I do not think that the solution is to use the MKL libraries, because they do not take full advantage of AMD's CPUs. So I think this is an issue for AOCL technical support, unless somebody has a better idea.