Multithreaded LAPACK function in a Threads.@threads loop

Hi everyone,

Problem

I am trying to understand how to optimally set the number of BLAS threads when I am using a multithreaded LAPACK function in a Threads.@threads loop.

More specifically I want to compute the eigendecomposition of a number of independent matrices in parallel as quickly as possible.

System and Setup

My system:

Julia Version 1.3.0
Commit 46ce4d7933 (2019-11-26 06:09 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin19.0.0)
  CPU: Intel(R) Core(TM) i5-8259U CPU @ 2.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, skylake)
Environment:
  JULIA_NUM_THREADS = 4

and I compiled Julia with MKL.

The function I am calling to compute the eigenvalues and eigenvectors in this MWE is syevr!, which is multithreaded in MKL:

function compute_ev_threads(A::Array{Array{Float64, 2}, 1})

	# Each Julia thread handles a subset of the matrices.
	# syevr! uses A[k] as workspace (its contents are destroyed) and returns the
	# eigenvalues W and eigenvectors Z, which are simply discarded in this MWE.
	Threads.@threads for k = 1:length(A)
		W, Z = LinearAlgebra.LAPACK.syevr!('V', 'A', 'U', A[k], 0., 0., 0, 0, -1.0)
	end
	return nothing
end

Here A is a vector of matrices whose dimensions typically range from 2 to 1000; for a specific problem, A could hold matrices of very different sizes. The length of A can be up to 1000.
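
As a side note, since syevr! with 'U' only reads the upper triangle, the call should match eigen of Symmetric(A[k], :U). A quick sanity check (M is just a throwaway example matrix; syevr! destroys its input, hence the copy):

using LinearAlgebra
M = rand(5, 5)
W, Z = LinearAlgebra.LAPACK.syevr!('V', 'A', 'U', copy(M), 0., 0., 0, 0, -1.0)
vals, vecs = eigen(Symmetric(M, :U))
W ≈ vals    # eigenvalues agree; eigenvectors may differ by a sign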

Tuning knobs

  • It seems to me that MKL dynamically chooses the number of threads it uses for the decomposition, i.e. 1 thread for smaller matrices and more threads for large matrices. That behaviour can be disabled by setting the environment variable MKL_DYNAMIC to false, i.e. ENV["MKL_DYNAMIC"] = false.
  • I can set the maximum number of threads for BLAS with BLAS.set_num_threads(nthreads). Again, if ENV["MKL_DYNAMIC"] = true, MKL might decide to use fewer threads than that.
  • JULIA_NUM_THREADS tells Julia how many threads I want to use in the Threads.@threads loop (see the sketch after this list).
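
Putting these together, this is roughly how I set the three knobs (a sketch; JULIA_NUM_THREADS has to be exported in the shell before Julia starts):

using LinearAlgebra

# JULIA_NUM_THREADS=4 was set in the shell before launching Julia
Threads.nthreads()            # number of Julia threads available to @threads

ENV["MKL_DYNAMIC"] = false    # ask MKL not to reduce the thread count on its own
BLAS.set_num_threads(1)       # upper limit for the BLAS/LAPACK threads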

Benchmarks

Setup 1: 50 matrices of dimension 200x200

I am setting JULIA_NUM_THREADS = 4 (the number of physical cores).

using LinearAlgebra, BenchmarkTools, Random

ENV["MKL_DYNAMIC"] = false
for nthreads in [1; 2; 4]
	BLAS.set_num_threads(nthreads);
	rng = Random.MersenneTwister(1);
	N = 200;
	A = [rand(rng, N, N) for k = 1:50];
	t = @belapsed compute_ev_threads($A)
	println("Number of BLAS threads: $(nthreads), Time: $(t)")
end
Number of BLAS threads: 1, Time: 0.052786899
Number of BLAS threads: 2, Time: 0.058757388
Number of BLAS threads: 4, Time: 0.115723637

Here it seems like 1 BLAS thread works best. Intuitively this makes sense to me: each Julia thread can drive 1 BLAS thread and they all work in parallel. If each Julia thread tried to use 4 BLAS threads, the 4 cores would be oversubscribed (up to 16 BLAS threads) and performance would drop.

Question 1: Does this reasoning make sense? Does BLAS.set_num_threads() set the global number of BLAS threads that can be used at any one time, or the number per Julia thread? The discussion in Julia Threads vs BLAS Threads seems to say that the setting controls a global thread pool for BLAS. However, this seems to contradict the results above.

Setting ENV["MKL_DYNAMIC"] = true doesn’t seem to have any impact. I get:

Number of BLAS threads: 1, Time: 0.05300575
Number of BLAS threads: 2, Time: 0.059146294
Number of BLAS threads: 4, Time: 0.116660464

Setup 2: 100 matrices of dimension 8x8 and one 500x500

ENV["MKL_DYNAMIC"] = false
for nthreads in [1; 2; 4]
	BLAS.set_num_threads(nthreads);
	rng = Random.MersenneTwister(1);
	N = 8;
	A = [rand(rng, N, N) for k = 1:100]
	A = pushfirst!(A, rand(rng, 500, 500))
	t = @belapsed compute_ev_threads($A)
	println("Number of BLAS threads: $(nthreads), Time: $(t)")
end
Number of BLAS threads: 1, Time: 0.047233419
Number of BLAS threads: 2, Time: 0.041688711
Number of BLAS threads: 4, Time: 0.041307225

and with ENV["MKL_DYNAMIC"] = true, which again doesn’t seem to have any impact:

Number of BLAS threads: 1, Time: 0.048276108
Number of BLAS threads: 2, Time: 0.041320485
Number of BLAS threads: 4, Time: 0.040560082

Question 2: I am surprised that the results are so similar. I would have assumed that 4 threads with dynamic scheduling would perform much better: first attack the 500x500 matrix with 4 BLAS threads, and then handle all the small matrices with 1 thread each (see the sketch below).
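
To make that idea concrete, a manual version of the schedule I have in mind would look roughly like this (just a sketch; the size cutoff of 256 and the function name are arbitrary, and compute_ev_threads is the function from above):

function compute_ev_split(A::Array{Array{Float64, 2}, 1}; cutoff = 256)
	# large matrices: serial outer loop, let MKL use several threads internally
	big = [M for M in A if size(M, 1) >= cutoff]
	BLAS.set_num_threads(4)
	for M in big
		LinearAlgebra.LAPACK.syevr!('V', 'A', 'U', M, 0., 0., 0, 0, -1.0)
	end

	# small matrices: Julia threads on the outside, 1 BLAS thread on the inside
	small = [M for M in A if size(M, 1) < cutoff]
	BLAS.set_num_threads(1)
	compute_ev_threads(small)
	return nothing
end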

Setup 3: 12 matrices of dimension 500x500

ENV["MKL_DYNAMIC"] = false
for nthreads in [1; 2; 4]
	BLAS.set_num_threads(nthreads);
	rng = Random.MersenneTwister(1);
	N = 500;
	A = [rand(rng, N, N) for k = 1:12]
	t = @belapsed compute_ev_threads($A)
	println("Number of BLAS threads: $(nthreads), Time: $(t)")
end

Number of BLAS threads: 1, Time: 0.110499758
Number of BLAS threads: 2, Time: 0.126500489
Number of BLAS threads: 4, Time: 0.168587943

Again, ENV["MKL_DYNAMIC"] = true seems to have no impact:

Number of BLAS threads: 1, Time: 0.110146218
Number of BLAS threads: 2, Time: 0.124481478
Number of BLAS threads: 4, Time: 0.171753869

Question 3: I guess this post is mostly to double-check that my assumptions are correct and that I am using the available settings correctly. It also seems strange that ENV["MKL_DYNAMIC"] = true doesn’t really have any impact, and I assumed the overall differences between the results would be bigger.

Hi @migarstka,

I was wondering if you had made some progress on this. I’m also getting started with multithreading in Julia, and trying to understand how to scale embarrassingly parallel parts of a code.

The MWE I came up with looks very similar to yours: do a number (npb) of element-wise multiplications of vectors of size pb in place, as seen below.

using Base.Threads
using LinearAlgebra
using BenchmarkTools

BLAS.set_num_threads(1)

const T = Float64

# size of each job
pb = 1024

# number of jobs
npb = 128

# one tuple per job: a preallocated output buffer and two random input vectors
iter = [(Vector{T}(undef, pb), rand(T, pb), rand(T, pb)) for i in 1:npb]

function foo(iter)
    @threads for (x, y, z) in iter
        x .= y .* z    # element-wise multiply in place, into the preallocated buffer
    end
end

@btime foo($iter)

When I run this with various numbers of threads

for n in 1 2 4 6; do JULIA_NUM_THREADS=$n julia --check-bounds=no minimal.jl; done

on a 6-Core Intel Core i9 with Julia 1.8.1, I get the following timings

Threads    Timing        Allocations
1          1.122 ms      6 (512 bytes)
2          888.008 μs    12 (992 bytes)
4          1.110 ms      25 (1.95 KiB)
6          974.436 μs    37 (2.91 KiB)

This is not what I was expecting, given that the CPU usage looks consistent with the number of threads used. I also tried a few other things (ThreadsX.foreach and FLoops.@floop, e.g. the sketch below), and the results are similar.
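
For reference, the ThreadsX variant was roughly the following (a sketch, not the exact code I ran; foo_threadsx is just a placeholder name, and the loop body is the same as in foo above):

using ThreadsX

function foo_threadsx(iter)
    # same in-place element-wise multiply, scheduled by ThreadsX.foreach
    ThreadsX.foreach(iter) do (x, y, z)
        x .= y .* z
    end
end

@btime foo_threadsx($iter)

Hence my questions: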

  1. Am I missing something with Threads?
  2. Should I be looking at something else (e.g. DistributedArrays)?

Any tips will be very much appreciated! Thanks,

V.

Didn’t look at the thread, but note that BLAS != BLAS: OpenBLAS behaves very differently from MKL. See e.g. Julia Threads + BLAS Threads · ThreadPinning.jl