Acceleration of Intel MKL on AMD Ryzen CPUs

The svd() function of LinearAlgebra works on variables that are StaticArrays or StaticVectors. To what degree this is speed-optimized I don’t know; try it yourself…
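
A quick sanity check that it at least works (a minimal sketch; I haven’t benchmarked it):

using StaticArrays, LinearAlgebra

A = @SMatrix rand(3, 3)          # small fixed-size static matrix
F = svd(A)                       # LinearAlgebra.svd accepts it directly
F.U * Diagonal(F.S) * F.Vt ≈ A   # reconstructs A, so the factorization is valid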

So I can use StaticArrays but not LoopVectorization?

Yes, you can use them independently.

Please, correct me if I am mistaken:
Well, it seems that the answer is that svd from LinearAlgebra converts the input matrices into static matrices and then passes them to the svd routine of LAPACK. So it is taking advantage of static matrices.
Regarding LoopVectorization, svd from LinearAlgebra does not use it. But LinearAlgebra uses ScaLAPACK, which is multi-threaded and hence has the same advantage as LoopVectorization.
On the other hand, none of this can be changed if I put

using StaticArrays or using LoopVectorization
using LinearAlgebra

that is, the behavior of LinearAlgebra will not change if I import either StaticArrays or LoopVectorization before importing LinearAlgebra.

This is a misunderstanding. The main advantage of LoopVectorization is that it uses SIMD instructions (single instruction, multiple data) very effectively, which is orthogonal to multithreading.
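
As a minimal sketch of plain SIMD without threads (the function name dot_turbo is just for illustration):

using LoopVectorization

function dot_turbo(x, y)
    s = zero(eltype(x))
    @turbo for i in eachindex(x, y)  # vectorized with SIMD, single thread
        s += x[i] * y[i]
    end
    return s
end

x = rand(10_000); y = rand(10_000);
dot_turbo(x, y) ≈ sum(x .* y)  # same result as the naive computation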

Ok, but does ScaLAPACK also use SIMD? And does LoopVectorization also use multi-threading?

It can also use multi-threading, if that is what you want:

https://juliasimd.github.io/LoopVectorization.jl/stable/examples/multithreading/#Multithreading
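
A sketch of the threaded variant, which only differs in the macro (it requires starting Julia with several threads, e.g. julia -t auto; the function name is made up for the example):

using LoopVectorization

function axpy_tturbo!(y, a, x)
    @tturbo for i in eachindex(x, y)  # SIMD plus multi-threading
        y[i] += a * x[i]
    end
    return y
end

y = zeros(10^6); x = rand(10^6);
axpy_tturbo!(y, 2.0, x)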

Ok, I understand now.
But the help on LoopVectorization is not clear about the purpose of this library. I think this is a general issue with the help for any Julia package. I think that the first things the help on LoopVectorization should say are that its purpose is to provide matrix operations that use SIMD and multi-threading, and also which libraries (ScaLAPACK?) it is based on in order to accomplish those purposes.
My question about the use of SIMD by ScaLAPACK still remains.

I’m not familiar with ScaLAPACK, but any half-decent normal LAPACK library uses SIMD. The reference BLAS + Netlib LAPACK might not.
I would assume the same for ScaLAPACK.

No, they don’t convert to static matrices.

So, svd from LinearAlgebra does not take advantage of static matrices, but it seems to take advantage of StridedMatrix. It converts the input matrix into a StridedMatrix before passing it to the LAPACK.gesvd routine, which computes the SVD. And I think the goals of StridedMatrix are similar to those of static matrices.
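
For reference, you can call the LAPACK wrapper directly from LinearAlgebra (just a sketch; note that svd may actually dispatch to the divide-and-conquer routine gesdd! rather than gesvd!, and both overwrite their input, hence the copy):

using LinearAlgebra

A = rand(4, 3)
U, S, Vt = LinearAlgebra.LAPACK.gesvd!('A', 'A', copy(A))
U[:, 1:3] * Diagonal(S) * Vt ≈ A  # reconstruct A from the thin part of U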

I want to use an AOCL routine to accelerate svd on Ryzen, but this seems not to be the topic of this thread, so I have opened another one. For those interested:

It converts to strided matrices, but StridedMatrix is not a separate library. Julia’s Base library implements this type in julia/base/reinterpretarray.jl.
This type is based on DenseMatrix and FastContiguousSubArray. I know what a DenseMatrix is, but not what a FastContiguousSubArray is.

It doesn’t convert to StridedArray; it works on things that already are.
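
For instance, an ordinary Matrix and many views already are StridedMatrix, so there is nothing to convert:

julia> A = rand(5, 5);

julia> A isa StridedMatrix
true

julia> view(A, :, 2:4) isa StridedMatrix
true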

StridedArray is more general than FastContiguousSubArray. FastContiguousSubArray is about supporting linear indexing:

julia> A = rand(5,5);

julia> view(A, :, 1:4) isa Base.FastContiguousSubArray
true

julia> view(A, 1:5, 1:4) isa Base.FastContiguousSubArray
false

julia> Base.IndexStyle(typeof(view(A, :, 1:4)))
IndexLinear()

julia> Base.IndexStyle(typeof(view(A, 1:4, 1:4)))
IndexCartesian()

That is, there is no gap between columns of a FastContiguousSubArray, so you can efficiently iterate over all elements without needing nested loops or cartesian indices; eachindex, for example, returns just OneTo(length(A)):

julia> eachindex(view(A, :, 1:4))
Base.OneTo(20)

julia> eachindex(view(A, 1:4, 1:4))
CartesianIndices((4, 4))

Many BLAS and LAPACK operations are more general than that. If they need to view an array as a two-dimensional object instead of one-dimensional anyway, the cost is low for them to support stride(A,2) != size(A,1).
In fact, this can be important to support as an implementation detail, as algorithms often work blockwise, meaning that to implement them, they’ll be taking views of the full matrix anyway; whether that full matrix was itself a submatrix to begin with isn’t so important (or, alternatively, you may want to pad columns to align them).
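
As an illustration of that point (a minimal sketch), a non-contiguous strided view can be passed to the BLAS matrix multiply without copying:

julia> A = rand(6, 6);

julia> B = view(A, 1:4, 1:4);  # strided but not contiguous

julia> stride(B, 2), size(B, 1)
(6, 4)

julia> B * B ≈ Matrix(B) * Matrix(B)  # dispatches to BLAS gemm on the view
true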

So, StridedArray is more general, as a subarray can still be strided, even if not contiguous:

julia> view(A, 1:4, 1:4) isa Base.FastContiguousSubArray
false

julia> view(A, 1:4, 1:4) isa Base.StridedArray
true

julia> view(A, 1:2:4, 1:4) isa Base.StridedArray
true

julia> view(A, [1,3], 1:2:4) isa Base.StridedArray
false

Very clear explanation!

Thank you very much!
Very clear!
I now understand a lot more. I think this should be included somewhere in the documentation.