Hello everyone,
I am a bit new to multithreading and I would like to know what the best way to multithread a for loop is. The unthreaded for loop is the following, where I’ve included in-place matrix modification and the calculation of a scalar quantity as a simplified version of my intended use case:
Unthreaded code:
ans = 0.0
@inbounds for i in 1:M
    @simd for j in 1:N
        A[i, j] = f(i, j)
        ans += g(A[i, j])
    end
end
Here’s an approach with a buffer managed using threadid(). This is kind of what ChatGPT suggested.
using Base.Threads

# one accumulator slot per thread, indexed by threadid()
buffer = zeros(nthreads())
@threads for i in 1:M
    tid = threadid()
    local_accumulator = 0.0
    @inbounds begin
        @simd for j in 1:N
            A[i, j] = f(i, j)
            local_accumulator += g(A[i, j])
        end
        buffer[tid] += local_accumulator
    end
end
ans = sum(buffer)
And here’s an approach based on reading the Julia documentation.
using Base.Threads

nt = nthreads()
# cld avoids a zero chunk size when M < nt (div would error there)
chunks = Iterators.partition(1:M, cld(M, nt))
tasks = map(chunks) do chunk
    @spawn begin
        chunk_sum = 0.0
        @inbounds for i in chunk
            @simd for j in 1:N
                A[i, j] = f(i, j)
                chunk_sum += g(A[i, j])
            end
        end
        chunk_sum
    end
end
chunk_sums = fetch.(tasks)
ans = sum(chunk_sums)
I guess my question is, which approach for multithreading is better? Which is always guaranteed to work? Thank you!
The first one may be bad: when you have a buffer like that, you need to do @threads :static for ... to make sure it always works (it pins each iteration to a thread, so threadid() always refers to the same thread on your machine). It’s in the docs somewhere: Multi-Threading · The Julia Language. The second one I think is fine, but there may be better ways. Also, since most of the work you want to parallelize is in the inner loop, you could remove the buffer and do an atomic operation at the end instead; in that case there’s no need for :static.
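In case it’s useful, here’s a minimal sketch of that atomic variant (assuming the same A, f, g, M, N as in your snippets):

using Base.Threads

total = Atomic{Float64}(0.0)
@threads for i in 1:M
    # accumulate locally, then add once per i so atomic contention stays low
    local_acc = 0.0
    @inbounds @simd for j in 1:N
        A[i, j] = f(i, j)
        local_acc += g(A[i, j])
    end
    atomic_add!(total, local_acc)
end
ans = total[]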
As always, benchmarking is your friend.
Thank you! From what I understand, it’s just not recommended to use the @threads macro?
It’s not recommended to rely on threadid, that’s all.
Obviously it’s usually better to use OhMyThreads.jl or ThreadsX.jl, but that doesn’t mean the manual approach is discouraged.
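For instance, with OhMyThreads the whole thing collapses to a single reduction; a sketch, assuming its tmapreduce(f, op, itr) signature and the A, f, g, M, N from above:

using OhMyThreads: tmapreduce

# tmapreduce chunks the range and keeps a per-task accumulator internally,
# so there is no manual buffer and no reliance on threadid
ans = tmapreduce(+, 1:M) do i
    s = 0.0
    @inbounds @simd for j in 1:N
        A[i, j] = f(i, j)
        s += g(A[i, j])
    end
    s
end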
Assuming that f and g are noticeably more expensive than the summation, and that M * N is not so large that you can’t afford another array of the same size as A, I would sidestep the threaded reduction and try something like
B = similar(A)
@threads for i in 1:M
    @inbounds begin
        @simd for j in 1:N
            A[i, j] = f(i, j)
            B[i, j] = g(A[i, j])
        end
    end
end
ans = sum(B)
Thank you. In my actual use case, M and N will be quite large and f and g will be pretty inexpensive, but I will keep this in mind for the future.
You can also do a partial reduction inside the threaded loop, which might fit your problem better.
B = similar(A, M)
@threads for i in 1:M
    @inbounds begin
        s = 0.0
        @simd for j in 1:N
            A[i, j] = f(i, j)
            s += g(A[i, j])
        end
        B[i] = s
    end
end
ans = sum(B)
The context for this is that tasks can get rescheduled to different threads (which changes their threadid). At the moment, rescheduling can only happen if the task yields. However, whether f(...) yields is an implementation detail, and a pretty bad abstraction: you don’t want your code to break because you added a debug log statement in a rare edge case that takes a lock for IO that happens to be contested enough that the task yields. But if you must, it’s still feasible to reason through whether f can yield (GC doesn’t yield, phew!).
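A hypothetical illustration of that failure mode (sleep stands in for any yielding call inside f):

using Base.Threads

# under the default :dynamic schedule, a task that yields may resume on a
# different thread, so threadid() is not stable across the yield point
@threads for i in 1:(2 * nthreads())
    before = threadid()
    sleep(0.01)             # yield point; the task may migrate
    after = threadid()
    before == after || println("iteration $i moved: thread $before -> $after")
end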
Just in case: you’re putting all these code snippets inside a function, right? Right?
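In case the point is unclear, a sketch (hypothetical function name) of wrapping the serial version in a function, so that A, f, g, M, N become arguments instead of untyped globals, which Julia needs to generate fast, type-stable code:

function fill_and_sum!(A, f, g, M, N)
    ans = 0.0
    @inbounds for i in 1:M
        @simd for j in 1:N
            A[i, j] = f(i, j)
            ans += g(A[i, j])
        end
    end
    return ans
end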
The hardest part about multithreading is that the “optimal” structure is highly task (and system) dependent. So, yes, there are lots of possible idioms. Which one is best for you will depend upon how big your “inner” tasks are, how much coordination the threads require, and how much your system is already limited by things like memory bandwidth.
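For example, one might time each variant (wrapped in its own function) at realistic sizes; a sketch using BenchmarkTools, with placeholder kernels and the hypothetical fill_and_sum! from the earlier sketch:

using BenchmarkTools

M, N = 2_000, 2_000
A = zeros(M, N)
f(i, j) = i + 0.5j   # placeholder kernels; substitute your real f and g
g(x) = x * x
@btime fill_and_sum!($A, $f, $g, $M, $N)
# repeat with each threaded variant before committing to one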
Haha yes of course, thanks for checking.