MKLSparse with AMD cpu

Bruno_Amorim · November 22, 2020, 12:01am

As discussed in other threads, Intel MKL discriminates against AMD CPUs (although this might be changing).

Does this affect the performance of MKLSparse.jl on AMD CPUs? Does anyone have experience with this?

Ideally, one would have an AOCLSparse.jl to use AMD’s AOCL on AMD CPUs. But that doesn’t seem to exist yet.

pablosanjose · November 23, 2020, 12:41pm

I’m also very interested in this. In particular, can AMD chips (I’m thinking Zen3 in particular) multithread sparse mat-vec multiplication efficiently using MKL?

jamblejoe · October 14, 2021, 8:23pm

I am also very interested in that. For the exact same reason as @pablosanjose . Any updates?

Bruno_Amorim · October 15, 2021, 10:25am

I have an update on this. I got a desktop with a Ryzen 7 4750G CPU (Zen 2, 8 cores, 3.6-4.4GHz, 8MB L3 cache). Apparently MKLSparse now works with AMD CPUs without any hacking.

By simply doing
pkg> add MKLSparse

and then testing:

using SparseArrays, BenchmarkTools, LinearAlgebra
A = sprand(10_000, 10_000, 0.001)
v = rand(10_000)

Without MKLSparse I got the timings:

@btime A*v;
  105.259 μs (2 allocations: 78.20 KiB)

@btime A'v;
  97.819 μs (3 allocations: 78.25 KiB)

With MKLSparse:

using MKLSparse
@btime A*v;
  62.430 μs (2 allocations: 78.20 KiB)
@btime A'v;
  13.730 μs (3 allocations: 78.25 KiB)

MKLSparse is using the 8 cores of the cpu.

I also tested against ThreadedSparseArrays.jl

using ThreadedSparseArrays
tA = ThreadedSparseMatrixCSC(A);
@btime tA*v;
  126.399 μs (20 allocations: 79.25 KiB)
@btime tA'v;
  22.820 μs (63 allocations: 83.20 KiB)

Here the mat-vec multiplication does not appear to be using threads, while the adjoint(mat)-vec did use the 8 cores (but MKLSparse still beats it).
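
One plausible reason for the asymmetry (an assumption on my part, not something I checked in the ThreadedSparseArrays.jl source): a SparseMatrixCSC stores its entries column by column, so in A'v each output entry is a dot product of one stored column with v and can be computed independently. In plain A*v, each column instead scatters into many output entries, so threads would need per-thread buffers or synchronization. A minimal sketch of the easy case, where threaded_adjoint_mul is a hypothetical helper written only for illustration and covers real Float64 matrices only:

```julia
using SparseArrays

# Hypothetical helper (not part of ThreadedSparseArrays.jl or MKLSparse.jl):
# computes y = A' * v for a real CSC matrix, one column of A per output entry.
function threaded_adjoint_mul(A::SparseMatrixCSC{Float64}, v::Vector{Float64})
    rows = rowvals(A)      # row index of each stored entry
    vals = nonzeros(A)     # value of each stored entry
    y = zeros(size(A, 2))
    Threads.@threads for j in 1:size(A, 2)
        acc = 0.0
        for k in nzrange(A, j)       # stored entries of column j
            acc += vals[k] * v[rows[k]]
        end
        y[j] = acc                   # each iteration writes only its own y[j]
    end
    return y
end

A = sprand(1_000, 1_000, 0.01)
v = rand(1_000)
@assert threaded_adjoint_mul(A, v) ≈ A' * v
```

Start Julia with several threads (for example julia -t 8) for the loop to actually run in parallel; with one thread it still gives the same result.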

Bruno_Amorim · October 15, 2021, 10:36am

Interesting… the time difference between ThreadedSparseArrays.jl and MKLSparse.jl for the adjoint(mat)-vec goes away if we consider larger matrices:

using SparseArrays, BenchmarkTools

A = sprand(100_000, 100_000, 0.001)
v = rand(100_000)

using ThreadedSparseArrays
tA = ThreadedSparseMatrixCSC(A);
@btime tA'v;
  5.495 ms (64 allocations: 786.33 KiB)

using MKLSparse
@btime A'v;
  5.449 ms (3 allocations: 781.38 KiB)

So probably the slowness of ThreadedSparseArrays in the previous example is due to the overhead of Julia threads?

I wonder if Polyester.jl could improve things.

freestatelabs · December 11, 2024, 8:25pm

I’ve been looking into this recently (I wanted to experiment with AOCL vs MKL benchmarks on my AMD PC) - has there been any change since you originally posted this thread in 2020?
