ENVIRONMENT
First of all, some details about my environment:
$ uname -a
Linux ryzen-casa 6.5.0-27-generic #28~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Fri Mar 15 10:51:06 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
I have installed the AOCL libraries from the binaries: aocl-linux-aocc-4.2.0_1_amd64.deb.
I also have installed the AOCC compiler suite from the binaries: aocc-compiler-4.2.0_1_amd64.deb.
I set the necessary environment variables for these packages by means of the respective scripts provided by each package, which I source from my ~/.profile. So, the last two lines of my ~/.profile are:
source /opt/AMD/aocc-compiler-4.2.0/setenv_AOCC.sh
source /opt/AMD/aocl/aocl-linux-aocc-4.2.0/aocc/amd-libs.cfg
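As a quick sanity check after sourcing both scripts, each colon-separated entry of LD_LIBRARY_PATH can be listed on its own line to confirm that both AMD directories are present (a minimal sketch; the path list here is abbreviated to the entries from this post):

```shell
# Split LD_LIBRARY_PATH on ':' and keep the AMD entries; both the AOCL
# and the AOCC lib directories should appear in the output.
LD_LIBRARY_PATH="/opt/AMD/aocl/aocl-linux-aocc-4.2.0/aocc/lib:/opt/AMD/aocc-compiler-4.2.0/lib:/usr/lib/x86_64-linux-gnu"
echo "$LD_LIBRARY_PATH" | tr ':' '\n' | grep '^/opt/AMD'
```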
Regarding my LD_LIBRARY_PATH, in a shell just before starting Julia:
$ export | grep LD_LIBRARY_PATH
declare -x LD_LIBRARY_PATH="/opt/AMD/aocl/aocl-linux-aocc-4.2.0/aocc/lib:/opt/AMD/aocc-compiler-4.2.0/ompd:/opt/AMD/aocc-compiler-4.2.0/lib:/opt/AMD/aocc-compiler-4.2.0/lib32:/usr/lib/x86_64-linux-gnu:/usr/lib64:/usr/lib32:/usr/lib:"
From Julia's shell mode:
shell> ldd /opt/AMD/aocl/aocl-linux-aocc-4.2.0/aocc/lib/libflame.so
linux-vdso.so.1 (0x00007ffc9bf98000)
libaoclutils.so => /opt/AMD/aocl/aocl-linux-aocc-4.2.0/aocc/lib/libaoclutils.so (0x0000718616c4f000)
libomp.so => /opt/AMD/aocc-compiler-4.2.0/lib/libomp.so (0x0000718615800000)
libpthread.so.0 => /usr/lib/x86_64-linux-gnu/libpthread.so.0 (0x0000718616c4a000)
libc.so.6 => /usr/lib/x86_64-linux-gnu/libc.so.6 (0x0000718615400000)
/lib64/ld-linux-x86-64.so.2 (0x0000718616c5b000)
libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x0000718615000000)
libm.so.6 => /usr/lib/x86_64-linux-gnu/libm.so.6 (0x0000718615b19000)
libgcc_s.so.1 => /usr/lib/x86_64-linux-gnu/libgcc_s.so.1 (0x0000718616c28000)
librt.so.1 => /usr/lib/x86_64-linux-gnu/librt.so.1 (0x0000718616c23000)
libdl.so.2 => /usr/lib/x86_64-linux-gnu/libdl.so.2 (0x0000718616c1e000)
shell> ldd desvd_wrapper.so
linux-vdso.so.1 (0x00007ffe41b6d000)
libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007ef349c00000)
libm.so.6 => /usr/lib/x86_64-linux-gnu/libm.so.6 (0x00007ef349f19000)
libflang.so => /opt/AMD/aocc-compiler-4.2.0/lib/libflang.so (0x00007ef349600000)
libflangrti.so => /opt/AMD/aocc-compiler-4.2.0/lib/libflangrti.so (0x00007ef34a928000)
libpgmath.so => /opt/AMD/aocc-compiler-4.2.0/lib/libpgmath.so (0x00007ef349200000)
libquadmath.so.0 => /usr/lib/x86_64-linux-gnu/libquadmath.so.0 (0x00007ef349ed1000)
libomp.so => /opt/AMD/aocc-compiler-4.2.0/lib/libomp.so (0x00007ef348e00000)
libgcc_s.so.1 => /usr/lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007ef34a906000)
libc.so.6 => /usr/lib/x86_64-linux-gnu/libc.so.6 (0x00007ef348a00000)
/lib64/ld-linux-x86-64.so.2 (0x00007ef34a937000)
librt.so.1 => /usr/lib/x86_64-linux-gnu/librt.so.1 (0x00007ef34a901000)
libpthread.so.0 => /usr/lib/x86_64-linux-gnu/libpthread.so.0 (0x00007ef34a8fa000)
libdl.so.2 => /usr/lib/x86_64-linux-gnu/libdl.so.2 (0x00007ef349ecc000)
Here desvd_wrapper.so is the shared library I created containing libflame, BLIS and others. Recall that this was necessary because libflame calls routines from BLIS (and others), and passing libflame alone to Julia's ccall fails with those routines reported as missing.
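For reference, a hypothetical link line for building such a combined wrapper (the object name desvd_wrapper.o and the exact library set are assumptions; AOCL ships the multithreaded BLIS as libblis-mt):

```shell
# Sketch only: bundle libflame and multithreaded BLIS into one shared
# object so every symbol resolves at ccall time, and bake in an rpath
# so the loader finds the AOCL libraries without LD_LIBRARY_PATH.
clang -shared -o desvd_wrapper.so desvd_wrapper.o \
    -L/opt/AMD/aocl/aocl-linux-aocc-4.2.0/aocc/lib -lflame -lblis-mt \
    -Wl,-rpath,/opt/AMD/aocl/aocl-linux-aocc-4.2.0/aocc/lib
```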
However, from Julia's shell mode:
shell> echo $LD_LIBRARY_PATH
ERROR: UndefVarError: `LD_LIBRARY_PATH` not defined
Stacktrace:
[1] top-level scope
@ none:1
But, as I have stated above, if I test for LD_LIBRARY_PATH in the shell just before starting Julia, I can see that it is correctly set.
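This error is actually expected: in Julia's shell mode, a $-prefixed name is interpolated as a Julia variable before the command runs, and no Julia variable named LD_LIBRARY_PATH exists. The process environment is still intact; a minimal way to check it from within Julia:

```julia
# Environment variables are exposed through the ENV dictionary, which
# is unaffected by shell-mode interpolation rules.
path = get(ENV, "LD_LIBRARY_PATH", "(not set)")
println(path)
```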
BENCHMARKING
The calculation behaves strangely: it starts using all (30 assigned) threads, then uses only one thread for a long time, and finally uses all threads again until it finishes and returns the result.
I do not know how to measure these different periods of the run time separately, so I resolved to time them with a handheld stopwatch (well, in fact it is a smartphone) while keeping an eye on gnome-system-monitor, hitting the lap button each time I see all threads ramping up or dropping down.
The relevant lines of the script are:
A = rand(100_000, 5_000)
...
@benchmark U, S, VT = dgesvd!(jobu, jobvt, A)
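Here dgesvd! is my own ccall wrapper (not shown). For comparison, the same LAPACK driver is reachable through Julia's bundled wrappers; a minimal sketch, assuming thin factors (jobu = jobvt = 'S'):

```julia
using LinearAlgebra

# LAPACK.gesvd! calls the dgesvd driver through Julia's default
# BLAS/LAPACK; it overwrites its matrix argument, so pass a copy.
A = rand(100, 50)
U, S, VT = LAPACK.gesvd!('S', 'S', copy(A))
@assert size(U) == (100, 50) && length(S) == 50 && size(VT) == (50, 50)
```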
The results of the (hand) timing are the following ones. Due to the measurement method, the times are not accurate and can be one or two seconds shorter than specified:
00:00 - start; all threads working (at or near 100%) for 01:24
01:24 - only one thread working (at 100%) for 05:10
06:35 - all threads working again (at or near 100%) for 02:48
09:23 - end
And the output is:
julia> include("test_lapack_svd_call.jl")
Generating random matrix ...
Making the SVD ...
BenchmarkTools.Trial: 1 sample with 1 evaluation.
Single result which took 80.796 s (0.01% GC) to evaluate,
with a memory estimate of 4.10 GiB, over 60 allocations.
If I use @time U, S, VT = dgesvd!(jobu, jobvt, A) instead of @benchmark U, S, VT = dgesvd!(jobu, jobvt, A):
julia> include("test_lapack_svd_call.jl")
Generating random matrix ...
Making the SVD ...
409.325711 seconds (39.38 k allocations: 4.103 GiB, 0.01% gc time, 0.01% compilation time)
409.325711 seconds ~ 7 minutes
Let's compare it with the svd provided by LinearAlgebra.jl. In this case, the script is simple and looks like:
using LinearAlgebra
using BenchmarkTools
println("Generating random matrix ...")
A = rand(100_000, 5_000)
println("Making the SVD ...")
@benchmark F = svd(A)
There is no strange behavior, in the sense that all (30 assigned) threads are working from the beginning to the end.
And the result is:
julia> include("test_svd_julia.jl")
Generating random matrix ...
Making the SVD ...
BenchmarkTools.Trial: 1 sample with 1 evaluation.
Single result which took 83.899 s (0.27% GC) to evaluate,
with a memory estimate of 8.38 GiB, over 13 allocations.
For the time, I do not have to measure separate periods, so I launch the modified script:
using LinearAlgebra
using BenchmarkTools
println("Generating random matrix ...")
A = rand(100_000, 5_000)
println("Making the SVD ...")
@time F = svd(A)
And I get:
julia> include("test_svd_julia.jl")
Generating random matrix ...
Making the SVD ...
85.518271 seconds (110.90 k allocations: 8.390 GiB, 0.08% gc time, 0.05% compilation time)
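A side note on methodology: when benchmarking an expression that reads a global variable, BenchmarkTools recommends interpolating it with $ so the measurement is not inflated by untyped-global overhead; a sketch with a small matrix:

```julia
using BenchmarkTools, LinearAlgebra

A = rand(100, 50)
# $A makes BenchmarkTools treat A as a local inside the benchmark
# expression instead of an untyped global.
@btime svd($A);
```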
Finally, let's use a much smaller matrix in order to get benchmark statistics.
Now, I use for my dgesvd:
A = rand(100, 50)
...
@benchmark U, S, VT = dgesvd!(jobu, jobvt, A)
And the output is:
julia> include("test_lapack_svd_call.jl")
Generating random matrix ...
Making the SVD ...
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  195.509 μs … 1.472 ms   ┊ GC (min … max): 0.00% … 83.37%
 Time  (median):     199.216 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   200.562 μs ± 22.550 μs  ┊ GC (mean ± σ):  0.40% ± 2.74%
 [unicode histogram omitted]
  196 μs          Histogram: frequency by time          218 μs <
 Memory estimate: 105.59 KiB, allocs estimate: 39.
And the output with LinearAlgebra's svd:
julia> include("test_svd_julia.jl")
Generating random matrix ...
Making the SVD ...
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  428.609 μs … 2.338 ms   ┊ GC (min … max): 0.00% … 76.33%
 Time  (median):     441.087 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   450.616 μs ± 55.806 μs  ┊ GC (mean ± σ):  0.60% ± 3.42%
 [unicode histogram omitted]
  429 μs        Histogram: log(frequency) by time        527 μs <
 Memory estimate: 182.59 KiB, allocs estimate: 11.
CONCLUSION
It seems that, for small matrices, my dgesvd behaves better than LinearAlgebra's svd. For large matrices, my dgesvd not only takes more time, but also shows that strange single-threaded phase. Could this strange behavior be solved and, hence, the times improved?
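One hypothesis worth testing: the long single-threaded middle phase is consistent with the QR-iteration stage of gesvd, which is largely sequential, whereas LinearAlgebra's svd defaults to the divide-and-conquer driver gesdd, which tends to parallelize better on large matrices. A sketch of calling that driver directly through Julia's bundled LAPACK:

```julia
using LinearAlgebra

# gesdd! is the divide-and-conquer SVD driver (the one svd uses by
# default); 'S' requests thin U and VT. It overwrites its input.
A = rand(100, 50)
U, S, VT = LAPACK.gesdd!('S', copy(A))
@assert issorted(S; rev = true)
```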