MKLSparse with AMD CPUs

As discussed in other threads, Intel MKL discriminates against AMD CPUs (although this might be changing).

Does this affect the performance of MKLSparse.jl on AMD CPUs? Does anyone have experience with this?

Ideally, one would have an AOCLSparse.jl to use AMD’s AOCL on AMD CPUs. But that doesn’t seem to exist yet.

I’m also very interested in this. In particular, can AMD chips (I’m thinking of Zen 3) multithread sparse mat-vec multiplication efficiently using MKL?

I am also very interested in this, for exactly the same reason as @pablosanjose. Any updates?

I have an update on this. I got a desktop with a Ryzen 7 4750G CPU (Zen 2, 8 cores, 3.6-4.4 GHz, 8 MB L3 cache). Apparently MKLSparse now works with AMD CPUs without any hacking.

By simply doing
pkg> add MKLSparse

and then testing:

using SparseArrays, BenchmarkTools, LinearAlgebra
A = sprand(10_000, 10_000, 0.001)
v = rand(10_000)

Without MKLSparse I got the timings:

@btime A*v;
  105.259 μs (2 allocations: 78.20 KiB)

@btime A'v;
  97.819 μs (3 allocations: 78.25 KiB)

With MKLSparse:

using MKLSparse
@btime A*v;
  62.430 μs (2 allocations: 78.20 KiB)
@btime A'v;
  13.730 μs (3 allocations: 78.25 KiB)

MKLSparse is using all 8 cores of the CPU.
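For anyone reproducing these numbers, it may help to record the thread settings first. This is only a rough sketch using Base functions. It assumes that MKL reads the `MKL_NUM_THREADS` environment variable (the snippet only reports whether it is set), and it relies on ThreadedSparseArrays.jl using Julia's own threads, so Julia has to be started with `-t` or `JULIA_NUM_THREADS` for that comparison to be fair.

```julia
# Report the thread-related settings that can influence the benchmarks.
julia_threads = Threads.nthreads()      # threads available to Julia tasks
cpu_threads   = Sys.CPU_THREADS         # logical CPU threads seen by Julia
mkl_setting   = get(ENV, "MKL_NUM_THREADS", "unset")  # assumption: MKL honours this variable

println("Julia threads:       ", julia_threads)
println("Logical CPU threads: ", cpu_threads)
println("MKL_NUM_THREADS:     ", mkl_setting)
```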

I also tested against ThreadedSparseArrays.jl

using ThreadedSparseArrays
tA = ThreadedSparseMatrixCSC(A);
@btime tA*v;
  126.399 μs (20 allocations: 79.25 KiB)
@btime tA'v;
  22.820 μs (63 allocations: 83.20 KiB)

Here the mat-vec multiplication does not appear to be using threads, while the adjoint(mat)-vec did use the 8 cores (but MKLSparse still beats it).
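One plausible reason for the asymmetry is the CSC storage format: for `A'v`, each output entry depends on a single stored column of `A`, so columns can be handed to different threads without write conflicts, whereas for `A*v` every column scatters into arbitrary output entries. The sketch below illustrates the easy case only. `threaded_adjoint_mul` is a hypothetical helper written for this post, not the code ThreadedSparseArrays.jl or MKL actually runs.

```julia
using SparseArrays, LinearAlgebra

# Hypothetical helper: computes y = A' * x for a CSC matrix.
# y[j] is the dot product of column j of A with x, so each thread
# writes to its own entries of y and no locking is needed.
function threaded_adjoint_mul(A::SparseMatrixCSC{Float64}, x::Vector{Float64})
    size(A, 1) == length(x) || throw(DimensionMismatch("size(A, 1) must equal length(x)"))
    rows = rowvals(A)     # row index of each stored entry
    vals = nonzeros(A)    # value of each stored entry
    y = zeros(size(A, 2))
    Threads.@threads for j in 1:size(A, 2)
        acc = 0.0
        for k in nzrange(A, j)        # stored entries of column j
            acc += vals[k] * x[rows[k]]
        end
        y[j] = acc
    end
    return y
end

A = sprand(1_000, 800, 0.01)
x = rand(1_000)
y = threaded_adjoint_mul(A, x)
println(y ≈ A' * x)   # compare against the built-in product
```

Doing the same for `A*x` would need per-thread output buffers or atomic updates, because two columns can contribute to the same `y[i]`.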

Interesting… the time difference between ThreadedSparseArrays.jl and MKLSparse.jl for the adjoint(mat)-vec goes away if we consider larger matrices:

using SparseArrays, BenchmarkTools

A = sprand(100_000, 100_000, 0.001)
v = rand(100_000)

using ThreadedSparseArrays
tA = ThreadedSparseMatrixCSC(A);
@btime tA'v;
  5.495 ms (64 allocations: 786.33 KiB)

using MKLSparse
@btime A'v;
  5.449 ms (3 allocations: 781.38 KiB)

So probably the slowness of ThreadedSparseArrays in the previous example is due to the overhead of Julia threads?
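A quick bit of arithmetic on the `A'v` timings reported above is at least consistent with that guess. The snippet is only a sketch: the variable names are made up for illustration, and two data points cannot separate a fixed overhead from ordinary benchmark noise.

```julia
# A'v timings reported above, in microseconds.
small_threaded, small_mkl = 22.820, 13.730    # 10_000 x 10_000
large_threaded, large_mkl = 5495.0, 5449.0    # 100_000 x 100_000

# Expected number of stored entries of sprand(n, n, p) is n^2 * p.
expected_nnz(n, p) = n^2 * p
work_ratio = expected_nnz(100_000, 0.001) / expected_nnz(10_000, 0.001)

gap_small = small_threaded - small_mkl   # absolute gap, small case
gap_large = large_threaded - large_mkl   # absolute gap, large case
rel_small = gap_small / small_mkl        # gap relative to the MKL time
rel_large = gap_large / large_mkl

println("work ratio (large / small): ", work_ratio)
println("absolute gaps (μs): ", gap_small, " and ", gap_large)
println("relative gaps: ", rel_small, " and ", rel_large)
```

The large case has about 100 times more nonzeros, yet the absolute gap only grows from roughly 9 μs to about 46 μs, so relative to the MKL time it drops from about 66% to under 1%.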

I wonder if Polyester.jl could improve things.

I’ve been looking into this recently (I wanted to experiment with AOCL vs MKL benchmarks on my AMD PC) - has there been any change since you originally posted this thread in 2020?