Trying to make multithreading work when applying a function a number of times

I have a function F that does some computations on its input arrays. First I'm trying to see if I can make this function as fast as possible, since it will be computed 50 times in each of possibly many iterations, so making it fast is essential. With the matrix dimensions I have (~10^4), it takes roughly 60 seconds to compute F. I'm thinking I could use multithreading to run F 50 times, as these runs are independent of each other. Here is an MWE for my problem:

using LinearAlgebra, BenchmarkTools

# arrays used in the calculation
phi0 = rand(ComplexF64, 1000, 10, 50)
P = rand(ComplexF64, 10, 10, 50)
K = rand(ComplexF64, 1000, 1000, 50)

# preallocated arrays for the operation of F
F_out = zeros(ComplexF64, 1000, 1000)
F_mul_tmp = zeros(ComplexF64, 10, 1000)
F_mul_tmp2 = zeros(ComplexF64, 1000, 1000)

# the function F
function F(k, phi0, P, K, F_mul_tmp, F_mul_tmp2, F_out)
    fill!(F_out, 0.0 + 0.0im)
    for q in 1:50
        @views mul!(F_mul_tmp, P[:, :, q], adjoint(phi0[:, :, q]))  # P[:, :, q] * phi0[:, :, q]'
        @views mul!(F_mul_tmp2, phi0[:, :, q], F_mul_tmp)           # phi0[:, :, q] * F_mul_tmp
        @views F_out .+= broadcast!(*, F_mul_tmp2, K[:, :, k], F_mul_tmp2)  # F_out .+= K[:, :, k] .* F_mul_tmp2, in place
    end
    return F_out
end

@btime F(2, $phi0, $P, $K, $F_mul_tmp, $F_mul_tmp2, $F_out)  # benchmark F

# compute F many times in parallel
function F_threading(phi0, P, K, F_mul_tmp, F_mul_tmp2, F_out)
    output = zeros(ComplexF64, 1000, 1000, 50)
    Threads.@threads for i in 1:50
        @views output[:, :, i] = F(i, phi0, P, K, F_mul_tmp, F_mul_tmp2, F_out)
    end
    return output
end

There is already a problem with the multithreading part of my code, as the answer seems to change with every run. I guess this is a case of a race condition, but each run of F should be independent of the others. Shouldn't it be straightforward to do this kind of computation in parallel? Any suggestions on how to optimize F and how to make multithreading work would be appreciated.

A major issue is that matrix multiplication (both * and mul!) is already using multithreading through BLAS, and those threads and Julia threads do not cooperate well. So you are probably already using all available cores.

You could try setting the number of BLAS threads to 1, though, and see whether parallelizing at the Julia level is more efficient.

Sorry, I’m not sure I understand what you mean. I should have mentioned I’m relatively new to Julia. Are you saying I could make * and mul! not use threads, so as to avoid the clash between them and the multithreading in the loop? How do I set the number of BLAS threads to 1?

Yes, that is what I’m saying. Matrix multiplication is (mostly) handled by an external BLAS library (written in C/Fortran/Assembly). The implementation is multi-threaded, but you can control the number of threads you allow it access to.

It’s LinearAlgebra.BLAS.set_num_threads(1), or something like that. Going from memory here.
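
For the record, that call does exist; a minimal check using the standard LinearAlgebra.BLAS API:

using LinearAlgebra

BLAS.get_num_threads()   # how many threads BLAS is currently using (typically the number of cores)
BLAS.set_num_threads(1)  # restrict BLAS to one thread before trying Julia-level threading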

The problem is likely that you use only one set of cache arrays F_mul_tmp, F_mul_tmp2, F_out, ... and pass it to Threads.nthreads() threads at once. This turns your program into a data race, because one thread’s results will constantly be overwritten by another’s.

If you want to run on N threads, you will need to make N copies of the caches and make sure you only ever give one set of caches to one thread.

Did you mean I have to have as many sets of cache arrays as there are threads (which is not necessarily 50, since it equals the number of cores I have)? Or did you mean that there should be a set of cache arrays for each iteration of the loop, and that’s why it’s 50?

Oops, sorry, I somehow assumed you were trying to run with 50 threads. I have updated my initial answer.

Did you mean I have to have as many sets of cache arrays as there are threads (which is not necessarily 50, since it equals the number of cores I have)?

Yes. One set of arrays for each thread.

@fatteneder I have tried to implement your idea; is this the right way to do it? (The output of F_threading does seem to stay the same across many runs, which is a sign that there are no race conditions.)

function F_threading(phi0, P, K, F_mul_tmp, F_mul_tmp2, F_out)
    set1 = [zeros(ComplexF64, 10, 1000) for i in 1:Threads.nthreads()]
    set2 = [zeros(ComplexF64, 1000, 1000) for i in 1:Threads.nthreads()]
    set3 = [zeros(ComplexF64, 1000, 1000) for i in 1:Threads.nthreads()]
    output = zeros(ComplexF64, 1000, 1000, 50)
    Threads.@threads for i in 1:50
        F_mul_tmp = set1[Threads.threadid()]
        F_mul_tmp2 = set2[Threads.threadid()]
        F_out = set3[Threads.threadid()]
        @views output[:, :, i] = F(i, phi0, P, K, F_mul_tmp, F_mul_tmp2, F_out)
    end
    return output
end

You are almost there.

Unfortunately, the use of Threads.threadid() is no longer recommended, due to the dynamic scheduling mechanisms that Threads.@threads offers.

You can read more about it here: PSA: Thread-local state is no longer recommended; Common misconceptions about threadid() and nthreads().


To get the version you have working, you have to update the for loop to read

Threads.@threads :static for i in 1:50

However, as the blog post explains, this too is no longer recommended, and one should follow the advice given there.
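
To make that concrete, here is a minimal sketch of the pattern the PSA recommends, adapted to your example (the name F_threading_chunked and the chunking via Iterators.partition are my own choices, not something prescribed by the post): each task allocates its own buffers and processes its own chunk of indices, so threadid() is not needed at all.

function F_threading_chunked(phi0, P, K)
    n  = size(phi0, 1)   # 1000 in the MWE
    m  = size(P, 1)      # 10
    nq = size(K, 3)      # 50
    output = zeros(ComplexF64, n, n, nq)
    # split the 50 indices into one chunk per thread
    chunks = Iterators.partition(1:nq, cld(nq, Threads.nthreads()))
    tasks = map(chunks) do chunk
        Threads.@spawn begin
            # each task gets its own buffers, so nothing is shared between tasks
            tmp1 = zeros(ComplexF64, m, n)
            tmp2 = zeros(ComplexF64, n, n)
            out  = zeros(ComplexF64, n, n)
            for i in chunk
                output[:, :, i] .= F(i, phi0, P, K, tmp1, tmp2, out)
            end
        end
    end
    foreach(wait, tasks)
    return output
end

Different tasks write to disjoint slices of output, which is safe, and you would still combine this with BLAS.set_num_threads(1) when comparing it against the BLAS-threaded version.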

I’m a bit skeptical that this will be faster than just using multithreaded BLAS, but at least one thing you should not do is to hard-code the sizes of the caches and the loop:

You should express them in terms of the sizes of the input arrays, using size() and axes(). The hard-coded version is hard to understand, will only work for one problem size, and is very bug-prone.
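
For example, something along these lines (just a sketch of the idea, with placeholder names):

n, m, nq = size(phi0)            # 1000, 10, 50 in the MWE above
output = zeros(ComplexF64, n, n, nq)
tmp1   = zeros(ComplexF64, m, n)
tmp2   = zeros(ComplexF64, n, n)

for q in axes(K, 3)              # instead of hard-coding `for q in 1:50`
    # ... same body as before ...
end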

@DNF Are you saying that I could potentially increase the performance of mul! and * and hence my computation if I used more cores? Is the default that BLAS uses all available cores?

one thing you should not do is to hard-code the sizes of the caches and the loop

I’m not doing this in my actual code; I only did it this way to provide a quick MWE. Thanks for your feedback regardless!

If you want to know for sure, benchmark your code.
Take a look at BenchmarkTools.jl and use it to time your code with different Julia-thread and BLAS-thread configurations.
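
For instance, something along these lines (a rough sketch; start Julia with multiple threads, e.g. julia --threads=auto, for the second variant to actually parallelize):

using LinearAlgebra, BenchmarkTools

# variant 1: multithreaded BLAS, serial loop over k
BLAS.set_num_threads(Sys.CPU_THREADS)
@btime for k in 1:50
    F(k, $phi0, $P, $K, $F_mul_tmp, $F_mul_tmp2, $F_out)
end

# variant 2: single-threaded BLAS, Julia-level threading
BLAS.set_num_threads(1)
@btime F_threading($phi0, $P, $K, $F_mul_tmp, $F_mul_tmp2, $F_out)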