How to utilize "MKLSparse.jl"?

I have some sparse matrices and I thought using “MKLSparse.jl” could speed up the execution, but it did not. However, the documentation contains this note:

The integer type that should be used in order for MKL to be called is the same as used by the Julia BLAS library, see Base.USE_BLAS64.
I am not sure whether this is the cause. If so, how do I set it, given that I am using VS Code?

julia> BLAS.lbt_get_config()
LinearAlgebra.BLAS.LBTConfig
Libraries:
└ [ILP64] mkl_rt.2.dll
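
One way to check the integer-type condition from the note is to compare the index type of the sparse matrix with the BLAS integer type. This is only a minimal sketch: the matrix `S` is a small stand-in, and it assumes `LinearAlgebra.BlasInt` reflects the integer width of the loaded BLAS (the `[ILP64]` tag above indicates a 64-bit integer interface).

```julia
using LinearAlgebra, SparseArrays

# Small stand-in matrix; sprand uses the default Int index type.
S = sprand(100, 100, 0.1)

# Index type of the sparse matrix (the Ti in SparseMatrixCSC{Tv,Ti}).
index_type = eltype(rowvals(S))

# Integer type Julia uses when calling BLAS.
blas_int = LinearAlgebra.BlasInt

# If the two differ, the note suggests MKL would not be called for this matrix.
matches = index_type == blas_int
println("index type: ", index_type, ", BlasInt: ", blas_int, ", match: ", matches)
```

If the types match, the integer type is probably not the reason for a missing speedup.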

Execution of what exactly?

Works fine for a sparse-dense matrix multiplication for me:

julia> using BenchmarkTools

julia> using SparseArrays

julia> S = sprand(10_000, 10_000, 0.01);

julia> D = rand(10_000, 10_000);

julia> @btime $S * $D;
  10.275 s (2 allocations: 762.94 MiB)

julia> using MKLSparse

julia> @btime $S * $D;
  485.737 ms (2 allocations: 762.94 MiB)
  • My results without MKLSparse are similar to yours, but your result with it is faster. Any idea why?
  • Do my results below mean that MKLSparse works correctly in my case?
julia> @btime $S * $D;
  10.912 s (2 allocations: 762.94 MiB)

julia> using MKLSparse

julia> @btime $S * $D;
  2.834 s (2 allocations: 762.94 MiB)

I’d assume yes. But it’s hard to say given that you haven’t provided much information. For example, it would be relevant to know which CPU you’re running on. (My CPU probably has more threads, etc., than yours.)

Thanks for your reply. Please find my CPU details below.

Intel(R) Core™ i7-10750H CPU @ 2.60GHz 2.59 GHz

julia> Sys.cpu_info()
12-element Vector{Base.Sys.CPUinfo}:
 Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz: 
        speed         user         nice          sys         idle          irq
     2592 MHz    6221562            0      9454234    218662796      3367234  ticks
 Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz: 
        speed         user         nice          sys         idle          irq
     2592 MHz    6820937            0      5096546    222420734       148750  ticks
 Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz:
        speed         user         nice          sys         idle          irq
     2592 MHz   10452000            0      5743921    218142296       112703  ticks
 Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz:
        speed         user         nice          sys         idle          irq
     2592 MHz    9861718            0      4013812    220462687        82734  ticks
 Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz:
        speed         user         nice          sys         idle          irq
     2592 MHz   12091000            0      4026343    218220875        37468  ticks
 Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz:
        speed         user         nice          sys         idle          irq
     2592 MHz   13246093            0      4133781    216958343        38250  ticks
 Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz:
        speed         user         nice          sys         idle          irq
     2592 MHz   15318750            0      4460359    214559125        34562  ticks
 Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz:
        speed         user         nice          sys         idle          irq
     2592 MHz   15330390            0      4278515    214729312        39500  ticks
 Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz:
        speed         user         nice          sys         idle          irq
     2592 MHz    7359000            0      7165921    219813296       102062  ticks
 Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz:
        speed         user         nice          sys         idle          irq
     2592 MHz    6830562            0      4275578    223232078        99515  ticks
 Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz:
        speed         user         nice          sys         idle          irq
     2592 MHz    7758718            0      4911546    221667953        89359  ticks
 Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz:
        speed         user         nice          sys         idle          irq
     2592 MHz    7871406            0      4968968    221497843       172296  ticks
julia> Threads.nthreads()
6

I changed the number of threads to 12, but I still get the same performance.

julia> Threads.nthreads()
12
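
Note that `Threads.nthreads()` reports Julia's own threads, while BLAS libraries keep a separate thread count. The sketch below prints both. It assumes that the MKL sparse kernels follow MKL's threading rather than Julia's, which would explain why changing `JULIA_NUM_THREADS` made no difference; I have not verified this for MKLSparse.

```julia
using LinearAlgebra

# Threads available to Julia code (set via JULIA_NUM_THREADS or --threads).
julia_threads = Threads.nthreads()

# Threads used by the currently loaded BLAS library (OpenBLAS or MKL).
blas_threads = BLAS.get_num_threads()

println("Julia threads: ", julia_threads)
println("BLAS threads:  ", blas_threads)

# The BLAS thread count can be changed independently, e.g.:
# BLAS.set_num_threads(6)
```

The i7-10750H has 6 physical cores and 12 hardware threads, so going from 6 to 12 threads would not necessarily help a memory-bound sparse product anyway.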

Is there anything else I can check to find the reason for the slow performance?

@carstenbauer Is there anything else I can check to find the reason for the slow performance?
Is there a special function in MKLSparse that I should use for multiplication instead of *, or is * overridden?
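
Judging from the benchmarks in this thread, plain `*` gets faster right after `using MKLSparse`, so no special function seems to be needed. To see which module handles the product on a given setup, one can ask Julia which method it would dispatch to. This is a sketch under assumptions: it assumes `S * D` ends up in the 5-argument `mul!` with `Bool` scaling factors, and which methods MKLSparse actually defines may differ between versions.

```julia
using LinearAlgebra, SparseArrays

S = sprand(100, 100, 0.1)
D = rand(100, 100)
C = zeros(100, 100)

# Signature of the 5-argument mul! that S * D is assumed to reach internally.
sig = Tuple{typeof(C), typeof(S), typeof(D), Bool, Bool}

# Module that owns the method currently selected for this signature.
owner_before = parentmodule(which(mul!, sig))
println("mul! handled by: ", owner_before)

# After `using MKLSparse`, rerun the two lines above and compare the module.
```

If the reported module changes after loading MKLSparse, the package's method is the one being called.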

@stevengj
@lmiq
@DNF
@jling
I am sorry if mentioning you is not appropriate. I have been stuck on this for a long time, and I do not know why I am not getting the same speed as my colleague after using “MKLSparse”. I really appreciate any help from you. Thank you!

julia> versioninfo()
Julia Version 1.8.0-beta1
Commit 7b711ce699 (2022-02-23 15:09 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 12 × Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, skylake)
  Threads: 12 on 12 virtual cores
Environment:
  JULIA_EDITOR = code
  JULIA_NUM_THREADS = 12

julia> BLAS.lbt_get_config()
LinearAlgebra.BLAS.LBTConfig
Libraries:
└ [ILP64] libopenblas64_.dll

julia> using MKL

julia> BLAS.lbt_get_config()
LinearAlgebra.BLAS.LBTConfig
Libraries: 
└ [ILP64] mkl_rt.2.dll

Based on just the benchmark itself without MKL it is likely the case that @carstenbauer has a better CPU, and consequently a CPU that better utilizes MKL functionality over your CPU. We won’t know for sure though unless he posts his CPU info.

From this thread, it seems that he gets roughly a 20x speedup and you get roughly a 4x speedup.
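
For reference, the speedup figures can be recomputed from the `@btime` timings posted above. The variable names are only for illustration.

```julia
# Timings in seconds, copied from the @btime outputs earlier in the thread.
t_plain_a, t_mkl_a = 10.275, 0.485737   # carstenbauer's machine
t_plain_b, t_mkl_b = 10.912, 2.834      # original poster's machine

# Speedup = time without MKLSparse / time with MKLSparse.
speedup_a = t_plain_a / t_mkl_a
speedup_b = t_plain_b / t_mkl_b

println(round(speedup_a, digits = 1), "x vs ", round(speedup_b, digits = 1), "x")
```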

The complicated answer I could give you is that you should check whether you get similar performance gains when calling MKL directly, without Julia’s MKL package. This would at least rule out Julia’s MKL wrapper as the cause of a potential slowdown, if any.

The easy answer I can give you, though, is to get a better CPU.


Thank you very much for your reply.

Excuse me, I did not get it. From the post above, my results are similar to carstenbauer’s when not using MKLSparse and slower than his when using it.

It might be interesting to try

if your sparse matrix does not change for several iterations. The CSB data structure and code has been observed to be faster than MKL for many sparse matrix families.

Your source code will require a minor modification or abstraction to process the dense matrices in column batches. I think we compiled it with up to 32 dense columns per call.
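
Processing the dense matrix in column batches could look roughly like the sketch below. It uses Julia's standard `mul!` as a stand-in for the batched kernel, so it only illustrates the loop structure; `batched_mul!` and the `batch` keyword are hypothetical names, not part of any package mentioned here.

```julia
using LinearAlgebra, SparseArrays

# Multiply S by D in blocks of at most `batch` columns, writing into C.
function batched_mul!(C, S, D; batch = 32)
    n = size(D, 2)
    for start in 1:batch:n
        cols = start:min(start + batch - 1, n)
        # Views avoid copying the column blocks.
        mul!(view(C, :, cols), S, view(D, :, cols))
    end
    return C
end

S = sprand(1_000, 1_000, 0.01)
D = rand(1_000, 100)
C = zeros(size(S, 1), size(D, 2))

batched_mul!(C, S, D)
println(C ≈ S * D)
```

The last batch may hold fewer than 32 columns, which the `min` in the range handles.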


To be technical, you have slower results. If your time with MKLSparse were the same as or higher than without it, the results would be alarming, but that is not the case. You should not compare with @carstenbauer's results until you have his CPU information to compare with.

There is nothing you can do besides getting a better CPU or trying to find a bug in MKLSparse.jl by comparing your MKLSparse.jl results with the Intel-provided MKL Sparse routines.


I see. Thank you very much for your reply!


Thank you, I will check it.

FWIW I get a result similar to yours (~10 s and ~3 s), with:

julia> versioninfo()
Julia Version 1.8.0
Commit 5544a0fab76 (2022-08-17 13:38 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 12 × 11th Gen Intel(R) Core(TM) i5-11500H @ 2.90GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, tigerlake)
  Threads: 12 on 12 virtual cores



Thanks for your feedback! :slight_smile: