Consider a classic loop parallelization scheme, where the outer loop is multithreaded and the inner loop is vectorized:
    tmp = [Vector{T}(undef, length(B)) for _ in 1:Threads.nthreads()]
    @inbounds Threads.@threads for ii in eachindex(A)
        @inbounds @simd for jj in eachindex(B)
            tmp[Threads.threadid()][jj] = fun(A[ii], B[jj])
        end
        C[ii] = sum(tmp[Threads.threadid()])
    end
I have the following questions:
- What is the scope of `@inbounds` and similar macros? That is, is the second `@inbounds` here redundant? Does it extend into the called functions `fun` and `sum`?
- Is there a way to conditionally use (or not use) multithreading at runtime (essentially the `IF` clause in OpenMP)? When `A` is small, the overhead may not be worth it.
- According to the manual, `Threads.threadid()` may change even within a single iteration, so the above code is actually not thread safe. What is the proper way to create private scratch spaces like `tmp`?
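
To make the last two questions concrete, here is a minimal sketch of the kind of pattern I have in mind. It splits `A` into chunks, spawns one task per chunk, and allocates the scratch buffer inside each task so that `threadid()` is never used. The names `rowsums!` and `min_parallel_len`, and the body of `fun`, are placeholders I made up for illustration, and I am assuming `A`, `B`, and `C` are plain 1-based vectors. I am not sure this is the recommended approach.

```julia
using Base.Threads: @spawn, nthreads

# Stand-in for the unspecified `fun` in the question (assumption).
fun(a, b) = a * b

# Compute C[ii] = sum(fun(A[ii], B[jj]) for jj in eachindex(B)).
# `min_parallel_len` is a hypothetical threshold, to be tuned by benchmarking.
function rowsums!(C, A, B; min_parallel_len = 1_000)
    T = eltype(C)
    # Runtime switch: a single chunk means a single task, so effectively serial.
    nchunks = length(A) < min_parallel_len ? 1 : nthreads()
    chunklen = max(1, cld(length(A), nchunks))
    @sync for chunk in Iterators.partition(eachindex(A), chunklen)
        @spawn begin
            # Scratch buffer owned by this task, independent of the thread it runs on.
            tmp = Vector{T}(undef, length(B))
            for ii in chunk
                @inbounds @simd for jj in eachindex(B, tmp)
                    tmp[jj] = fun(A[ii], B[jj])
                end
                C[ii] = sum(tmp)
            end
        end
    end
    return C
end
```

With this layout each buffer is allocated once per chunk rather than once per iteration, and inputs shorter than the threshold run in a single task. Whether the one remaining spawned task is cheap enough for small `A` is something I have not measured.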
Thanks