Acceleration of Intel MKL on AMD Ryzen CPUs

The svd() function of LinearAlgebra works on variables that are StaticArrays or StaticVectors. To what degree this is speed-optimized I don’t know; try it yourself…
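
For example, a quick REPL sketch (the 3×3 size is arbitrary, and whether this hits a method specialized by StaticArrays depends on the package version):

julia> using StaticArrays, LinearAlgebra

julia> S = @SMatrix rand(3, 3);

julia> F = svd(S);   # svd accepts the static matrix directly

julia> F.S           # singular values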

So I can use StaticArrays but not LoopVectorization?

Yes, you can use them independently.

Please, correct me if I am mistaken:
Well, it seems that the answer is that svd from LinearAlgebra converts the input matrices into static matrices and then passes them to the svd routine of LAPACK. So it is taking advantage of static matrices.
Regarding LoopVectorization, svd from LinearAlgebra does not use it. But LinearAlgebra uses ScaLAPACK, which is multi-threaded, and hence has the same advantage as LoopVectorization.
On the other hand, none of this changes if I put

using StaticArrays   # or: using LoopVectorization
using LinearAlgebra

that is, the behavior of LinearAlgebra will not change if I import either StaticArrays or LoopVectorization before importing LinearAlgebra.
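
One way to check this (a small REPL sketch, using only the standard @which tool) is to ask which method svd dispatches to for an ordinary Matrix before and after loading the other packages:

julia> using LinearAlgebra

julia> @which svd(rand(4, 4))    # shows the LinearAlgebra method

julia> using StaticArrays, LoopVectorization

julia> @which svd(rand(4, 4))    # same LinearAlgebra method as before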

This is a misunderstanding. The main advantage of LoopVectorization is that it uses SIMD instructions (single instruction, multiple data) very effectively, which is orthogonal to multithreading.
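
For instance, here is a minimal single-threaded sketch (the name dot_turbo is just for illustration): @turbo vectorizes the loop with SIMD but spawns no threads.

using LoopVectorization

# SIMD-vectorized dot product, single thread
function dot_turbo(x, y)
    s = zero(eltype(x))
    @turbo for i in eachindex(x)
        s += x[i] * y[i]
    end
    return s
end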

Ok, but does ScaLAPACK also use SIMD? Does LoopVectorization also use multi-threading?

It can also use multi-threading, if that is what you want:

https://juliasimd.github.io/LoopVectorization.jl/stable/examples/multithreading/#Multithreading
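
If you want both SIMD and threads, the same sketch only needs @turbo swapped for its threaded variant @tturbo:

using LoopVectorization

# Same loop, but @tturbo also splits the iteration space across threads
function dot_tturbo(x, y)
    s = zero(eltype(x))
    @tturbo for i in eachindex(x)
        s += x[i] * y[i]
    end
    return s
end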

Ok, I understand now.
But the help on LoopVectorization is not clear about the purpose of this library. I think this is a general issue with the help for any Julia package. I think the first things the help on LoopVectorization should say are that its purpose is to provide matrix operations that use SIMD and multi-threading, and also which libraries (ScaLAPACK?) it builds on to accomplish those purposes.
My question about the use of SIMD by ScaLAPACK still remains.

I’m not familiar with ScaLAPACK, but any half-decent normal LAPACK library uses SIMD. The reference Netlib BLAS + LAPACK might not.
I would assume the same for ScaLAPACK.
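
If you want to see which BLAS/LAPACK backend your own Julia session is calling into (OpenBLAS by default, or e.g. MKL if you load MKL.jl), you can ask on Julia 1.7 or later:

julia> using LinearAlgebra

julia> BLAS.get_config()   # lists the loaded BLAS/LAPACK libraries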

No, they don’t convert to static matrices.

So svd from LinearAlgebra does not take advantage of static matrices, but it seems to take advantage of StridedMatrix. It converts the input matrix into a StridedMatrix before passing it to the LAPACK.gesvd routine, which computes the SVD. And I think the goals of StridedMatrix are similar to those of static matrices.
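
You can also call the LAPACK wrapper directly; a rough sketch (gesvd! overwrites its input, so pass a copy):

julia> using LinearAlgebra

julia> A = rand(5, 5);

julia> U, S, Vt = LinearAlgebra.LAPACK.gesvd!('A', 'A', copy(A));  # 'A' = compute full U and Vt

julia> S ≈ svd(A).S   # same singular values up to rounding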

I want to use an AOCL routine to accelerate svd on Ryzen, but that seems to be off-topic for this thread, so I have opened another one. For those interested:

It converts to strided matrices, but does not use the StridedMatrix library. The Julia Base library implements this type of matrix in julia/base/reinterpretarray.jl.
This type is based on DenseMatrix and FastContiguousSubArray. I know what a DenseMatrix is, but not what a FastContiguousSubArray is.

It doesn’t convert to StridedArray; it works on things that already are.

StridedArray is more general than FastContiguousSubArray. FastContiguousSubArray is about supporting linear indexing:

julia> A = rand(5,5);

julia> view(A, :, 1:4) isa Base.FastContiguousSubArray
true

julia> view(A, 1:5, 1:4) isa Base.FastContiguousSubArray
false

julia> Base.IndexStyle(typeof(view(A, :, 1:4)))
IndexLinear()

julia> Base.IndexStyle(typeof(view(A, 1:4, 1:4)))
IndexCartesian()

That is, there is no gap between columns of a FastContiguousSubArray, so you can efficiently iterate over all elements without needing nested loops or cartesian indices; eachindex, for example, returns just a OneTo range over the view’s length:

julia> eachindex(view(A, :, 1:4))
Base.OneTo(20)

julia> eachindex(view(A, 1:4, 1:4))
CartesianIndices((4, 4))

Many BLAS and LAPACK operations are more general than that. If they need to view an array as a two-dimensional object instead of a one-dimensional one anyway, the cost of also supporting stride(A,2) != size(A,1) is low.
In fact, supporting this can be important as an implementation detail: algorithms often work blockwise, meaning that implementing them involves taking views of the full matrix anyway; whether that full matrix was itself a submatrix to begin with isn’t so important (or, alternatively, you may want to pad columns to align them).
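
You can see this with strides: for a strided but non-contiguous view, the distance between columns is still that of the parent matrix, not the view’s own column length:

julia> strides(A)
(1, 5)

julia> strides(view(A, 1:4, 1:4))   # stride along dim 2 is 5, even though size(·, 1) is 4
(1, 5)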

So, StridedArray is more general, as a subarray can still be strided, even if not contiguous:

julia> view(A, 1:4, 1:4) isa Base.FastContiguousSubArray
false

julia> view(A, 1:4, 1:4) isa Base.StridedArray
true

julia> view(A, 1:2:4, 1:4) isa Base.StridedArray
true

julia> view(A, [1,3], 1:2:4) isa Base.StridedArray
false

Very clear explanation!