Hello everyone,
I am a bit new to multithreading and I would like to know what the best way to multithread a for loop is. The unthreaded for loop is the following, where I’ve included in-place matrix modification and the calculation of a scalar quantity as a simplified version of my intended use case:
Unthreaded code:
ans = 0.0
@inbounds for i in 1:M
    @simd for j in 1:N
        A[i, j] = f(i, j)
        ans += g(A[i, j])
    end
end
Here’s an approach with a buffer managed using threadid(). This is kind of what ChatGPT suggested.
using Base.Threads

# one accumulator slot per thread, indexed by threadid()
buffer = zeros(nthreads())
@threads for i in 1:M
    tid = threadid()
    local_accumulator = 0.0
    @inbounds begin
        @simd for j in 1:N
            A[i, j] = f(i, j)
            local_accumulator += g(A[i, j])
        end
        buffer[tid] += local_accumulator
    end
end
ans = sum(buffer)
And here’s an approach based on reading the Julia documentation.
using Base.Threads

nt = nthreads()
# cld avoids a zero chunk size when M < nt (div would error there)
chunks = Iterators.partition(1:M, cld(M, nt))
tasks = map(chunks) do chunk
    @spawn begin
        chunk_sum = 0.0
        @inbounds for i in chunk
            @simd for j in 1:N
                A[i, j] = f(i, j)
                chunk_sum += g(A[i, j])
            end
        end
        chunk_sum
    end
end
chunk_sums = fetch.(tasks)
ans = sum(chunk_sums)
I guess my question is, which approach for multithreading is better? Which is always guaranteed to work? Thank you!
The first one may be bad: when you have a buffer like that, you need to do @threads :static for ... to make sure it always works (it pins each iteration to a thread, so threadid() always refers to the same thread on your machine). It’s in the docs somewhere: Multi-Threading · The Julia Language. The second one I think is fine, but there may be better ways. Also, since most of the work you want to parallelize is in the inner loop, you could remove the buffer and do an atomic operation at the end instead; in that case there’s no need for :static.
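In case it’s useful, here’s a minimal sketch of that atomic variant (assuming the same A, f, g, M, N as in your snippets):

using Base.Threads

total = Atomic{Float64}(0.0)
@threads for i in 1:M
    # accumulate locally, then add once per i so atomic contention stays low
    local_acc = 0.0
    @inbounds @simd for j in 1:N
        A[i, j] = f(i, j)
        local_acc += g(A[i, j])
    end
    atomic_add!(total, local_acc)
end
ans = total[]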
As always, benchmarking is your friend.
Thank you! From what I understand, it’s just not recommended to use the @threads macro?
It’s not recommended to rely on threadid, that’s all.
Obviously it’s usually better to use OhMyThreads.jl or ThreadsX.jl, but that doesn’t mean the manual approach is discouraged.
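For instance, with OhMyThreads the whole thing collapses to a single reduction; a sketch, assuming its tmapreduce(f, op, itr) signature and the A, f, g, M, N from above:

using OhMyThreads: tmapreduce

# tmapreduce chunks the range and keeps a per-task accumulator internally,
# so there is no manual buffer and no reliance on threadid
ans = tmapreduce(+, 1:M) do i
    s = 0.0
    @inbounds @simd for j in 1:N
        A[i, j] = f(i, j)
        s += g(A[i, j])
    end
    s
end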
Assuming that f and g are noticeably more expensive than the summation, and that M * N is not so large that you can’t afford another array of the same size as A, I would sidestep the threaded reduction and try something like
B = similar(A)
@threads for i in 1:M
    @inbounds begin
        @simd for j in 1:N
            A[i, j] = f(i, j)
            B[i, j] = g(A[i, j])
        end
    end
end
ans = sum(B)
Thank you. In my actual use case, M and N will be quite large and f and g will be pretty inexpensive, but I will keep this in mind for the future.
You can also do a partial reduction inside the threaded loop, which might fit your problem better.
B = similar(A, M)
@threads for i in 1:M
    @inbounds begin
        s = 0.0
        @simd for j in 1:N
            A[i, j] = f(i, j)
            s += g(A[i, j])
        end
        B[i] = s
    end
end
ans = sum(B)
The context for this is that tasks can get rescheduled to different threads (which changes their threadid). At the moment, rescheduling can only happen if the task yields. However, whether f(...) yields is an implementation detail, and a pretty bad abstraction: you don’t want your code to break because you added a debug log statement in a rare edge case that takes a lock for IO that happens to be contested enough that the task yields. But if you must, it’s still feasible to reason through whether f can yield (GC doesn’t yield, phew!).
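A hypothetical illustration of that failure mode (sleep stands in for any yielding call inside f):

using Base.Threads

# under the default :dynamic schedule, a task that yields may resume on a
# different thread, so threadid() is not stable across the yield point
@threads for i in 1:(2 * nthreads())
    before = threadid()
    sleep(0.01)             # yield point; the task may migrate
    after = threadid()
    before == after || println("iteration $i moved: thread $before -> $after")
end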
Just in case: you’re putting all these code snippets inside a function, right? Right?
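In case the point is unclear, a sketch (hypothetical function name) of wrapping the serial version in a function, so that A, f, g, M, N become arguments instead of untyped globals, which Julia needs to generate fast, type-stable code:

function fill_and_sum!(A, f, g, M, N)
    ans = 0.0
    @inbounds for i in 1:M
        @simd for j in 1:N
            A[i, j] = f(i, j)
            ans += g(A[i, j])
        end
    end
    return ans
end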
The hardest part about multithreading is that the “optimal” structure is highly task (and system) dependent. So, yes, there are lots of possible idioms. Which one is best for you will depend upon how big your “inner” tasks are, how much coordination the threads require, and how much your system is already limited by things like memory bandwidth.
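For example, one might time each variant (wrapped in its own function) at realistic sizes; a sketch using BenchmarkTools, with placeholder kernels and the hypothetical fill_and_sum! from the earlier sketch:

using BenchmarkTools

M, N = 2_000, 2_000
A = zeros(M, N)
f(i, j) = i + 0.5j   # placeholder kernels; substitute your real f and g
g(x) = x * x
@btime fill_and_sum!($A, $f, $g, $M, $N)
# repeat with each threaded variant before committing to one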
Haha yes of course, thanks for checking.