Consider a classic loop parallelization scheme in which the outer loop is multithreaded and the inner loop is vectorized:
tmp = [Vector{T}(undef, length(B)) for _ in 1:Threads.nthreads()]
@inbounds Threads.@threads for ii in eachindex(A)
    @inbounds @simd for jj in eachindex(B)
        tmp[Threads.threadid()][jj] = fun(A[ii], B[jj])
    end
    C[ii] = sum(tmp[Threads.threadid()])
end
I have the following questions:
- What is the scope of `@inbounds` and similar macros? That is, is the second `@inbounds` here redundant? Does it extend into the called functions `fun` and `sum`?
- Is there a way to conditionally use (or not use) multithreading at runtime (essentially the `IF` directive in OpenMP)? When `A` is small, the overhead may not be worth it.
- According to the manual, `Threads.threadid()` may change even within a single iteration, so the above code is actually not thread safe. What would be the proper way to create private scratch spaces like `tmp`?
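
To make the last two points concrete, here is a minimal sketch of one possible alternative, not necessarily the recommended one. It splits `A` into chunks, spawns one task per chunk, and allocates the scratch buffer inside each task so that it never depends on `Threads.threadid()`. The function name `rowsums!`, the `threaded` keyword, and the size threshold of 1000 are all hypothetical, and `fun` is only a placeholder kernel. It assumes `A`, `B`, and `C` are ordinary 1-based vectors and that `C` has the same length as `A`.

```julia
using Base.Threads: @spawn, nthreads

# Placeholder kernel (assumption); substitute the real `fun`.
fun(a, b) = a * b

# Hypothetical helper: C[ii] = sum over jj of fun(A[ii], B[jj]).
# `threaded` decides at runtime whether the work is split across tasks.
function rowsums!(C, A, B; threaded = length(A) >= 1_000)
    # One chunk means no parallelism; otherwise one chunk per available thread.
    nchunks = threaded ? nthreads() : 1
    chunksize = max(cld(length(A), nchunks), 1)
    @sync for chunk in Iterators.partition(eachindex(A), chunksize)
        @spawn begin
            # Scratch buffer owned by this task, independent of threadid().
            tmp = Vector{eltype(C)}(undef, length(B))
            for ii in chunk
                @inbounds @simd for jj in eachindex(B)
                    tmp[jj] = fun(A[ii], B[jj])
                end
                C[ii] = sum(tmp)
            end
        end
    end
    return C
end
```

For example, `rowsums!(similar(A), A, B; threaded = false)` processes everything in a single chunk, while the default switches to one chunk per thread once `A` is large enough. Whether this is the idiomatic way to get per-task scratch space is exactly what I am unsure about.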
Thanks