This is my attempt at clarifying the behavior of the new default scheduling (aka `:dynamic`):
What if there was a new mechanism like this, with a made-up `BufferCache`:

```julia
buffers = Base.BufferCache(makebuf, nthreads())  # hypothetical type, does not exist in Base
Threads.@threads for x in xs
    Base.acquire(buffers) do buf
        # do something with x and buf
    end
end
```
It is very inefficient to (re)acquire and release the lock for each iteration. If you don't mind O(nthreads()) allocations, I recommend `FLoops.@init`; see Efficient and safe approaches to mutation in data parallelism (which also works on Distributed and GPU loops). If you want a more package-free approach, have a look at the worker pool pattern in Concurrency patterns for controlled parallelisms, where something like `BufferCache` can be used, but with only one acquire/release per task.
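For the package-free route, here is a minimal sketch of such a buffer pool built on a plain `Channel` (the `makebuf` and `work!` functions are placeholders for whatever the real loop body needs); each task acquires a buffer once, reuses it for its whole chunk, and then releases it:

```julia
using Base.Threads

makebuf() = zeros(100)                       # placeholder buffer constructor
work!(buf, x) = (fill!(buf, x); sum(buf))    # placeholder per-element work using the buffer

xs = rand(1_000)
pool = Channel{Vector{Float64}}(nthreads())
for _ in 1:nthreads()
    put!(pool, makebuf())                    # pre-fill: one buffer per concurrent task
end

len = cld(length(xs), nthreads())
tasks = map(Iterators.partition(xs, len)) do chunk
    Threads.@spawn begin
        buf = take!(pool)                    # acquire once per task, not once per iteration
        try
            sum(x -> work!(buf, x), chunk)
        finally
            put!(pool, buf)                  # release the buffer back to the pool
        end
    end
end
total = sum(fetch, tasks)
```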
@tkf was kind enough to suggest a solution to the pre-allocation problem in https://github.com/JuliaMolSim/DFTK.jl/issues/588. Quoting from his post:
First of all, here's a trick you can use almost always. If you have this pattern

```julia
Threads.@threads for x in xs
    i = Threads.threadid()
    f(x, i)
end
```
you can mechanically convert this to

```julia
n = cld(length(xs), Threads.nthreads())
@sync for (i, chunk) in enumerate(Iterators.partition(xs, n))
    Threads.@spawn for x in chunk
        f(x, i)
    end
end
```
This is very likely correct if the loop body `f` only uses `threadid()` with arrays allocated only for this parallel loop (e.g., the pre-1.3 reduction pattern).
The idea is to handle the range splitting into chunks yourself, and use per-chunk (rather than per-thread) buffers.
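A minimal sketch of that per-chunk-buffer idea (the `bufs[i] .= x` line is a placeholder for the real mutating work):

```julia
using Base.Threads

xs = rand(1_000)
n = cld(length(xs), nthreads())
chunks = collect(Iterators.partition(xs, n))
bufs = [zeros(100) for _ in chunks]       # one buffer per chunk, not per thread
partials = zeros(length(chunks))
@sync for (i, chunk) in enumerate(chunks)
    Threads.@spawn for x in chunk
        bufs[i] .= x                      # each task mutates only its own buffer ...
        partials[i] += sum(bufs[i])       # ... and only its own output slot
    end
end
total = sum(partials)
```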
I have been using this pattern in some code:

```julia
nchunks = Threads.nthreads()  # not necessarily, but most commonly
@sync for ichunk in 1:nchunks
    Threads.@spawn for i in ichunk:nchunks:length(x)
        f(x[i], ichunk)
    end
end
```
The difference is that this pattern makes the access to the elements of `x` non-contiguous (one iterates jumping in steps of size `nchunks`). This is simple, but it is probably worse if access to the elements of `x` is a bottleneck.
What I did observe, and which may be a useful addition here, is that when using these patterns one can get a performance advantage by setting `nchunks > nthreads()`, because sometimes one chunk gets overloaded, or stalled because of hardware stuff, and having more chunks improves load balancing.
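For example, something along these lines (a sketch; the oversubscription factor of 4 and the `sin` call are arbitrary placeholders):

```julia
using Base.Threads

x = rand(10_000)
nchunks = 4 * nthreads()               # more chunks than threads, for load balancing
partials = zeros(nchunks)
@sync for ichunk in 1:nchunks
    Threads.@spawn for i in ichunk:nchunks:length(x)
        partials[ichunk] += sin(x[i])  # placeholder per-element work
    end
end
total = sum(partials)
```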
one problem is I don't want my users to NEED to know all this in order to have multi-threading (they are physicists who barely know what a macro is, and Julia would be a "sell" from me…)
one possible workaround I know is to use Polyester.jl, or, well, to write a new macro within the package, but I wish `@threads` would just work, since the underlying logic is really as stupid as possible…
I agree. Couldn't there be an even simpler syntax for that, one that handled this properly by converting the loop into the pattern suggested above? Something even more "natural", such as:

```julia
@parallel nchunks = nthreads() for i in 1:length(x)
    result[chunk_index()] = ...
end
```

where the macro just reinterprets that into something safe?
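(`@parallel` and `chunk_index()` are hypothetical names here; today one would have to write the expansion by hand, roughly like this sketch:)

```julia
using Base.Threads

x = rand(10_000)
nchunks = nthreads()
result = zeros(nchunks)                       # one slot per chunk
len = cld(length(x), nchunks)
@sync for (ichunk, inds) in enumerate(Iterators.partition(eachindex(x), len))
    Threads.@spawn for i in inds
        result[ichunk] += x[i]^2              # ichunk plays the role of chunk_index()
    end
end
total = sum(result)
```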
it's more subtle than that; basically the current behavior (regardless of `:static` or `:dynamic`) agrees with `nchunks = nthreads()`, as tkf has said: it's just that `:dynamic` would allow a task, which is handling a contiguous chunk, to be run on a different OS thread at some point. `nchunks = nthreads()` doesn't seem to be an extra constraint.
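A tiny illustration of that point (a sketch): both schedules split the range into `nthreads()` contiguous chunks, one task per chunk; the only difference is whether the task handling a chunk is pinned to one OS thread.

```julia
using Base.Threads

ids = zeros(Int, 16)
# Both :static and :dynamic split 1:16 into nthreads() contiguous chunks, one task each.
Threads.@threads :dynamic for i in 1:16   # :dynamic is the default since Julia 1.8
    # With :static the recorded id is guaranteed constant within each chunk;
    # with :dynamic the task could in principle migrate to another OS thread.
    ids[i] = threadid()
end
```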
Yes, I didn't mean exactly the possibility of that option. The suggestion was because there it is clear that `chunk_index()` is something not necessarily bound to the threads, but to some arbitrary counter. A manual entry about that would be quite explicit: "with `nchunks = N` one sets into how many chunks one wants to split the work, and a buffer split into `N` chunks can be updated in a thread-safe manner using `chunk_index()`".
Why not just provide a `foreach`-like API?
how does that solve the problem? I assume `foreach()` will still yield a lazy collection
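For concreteness, such a `foreach`-like helper could just wrap the chunking pattern from above; a hypothetical sketch (the name `tforeach` and the `ntasks` keyword are made up):

```julia
using Base.Threads

# Hypothetical helper: apply f to every element of xs using ntasks chunked tasks.
function tforeach(f, xs; ntasks = nthreads())
    len = cld(length(xs), ntasks)
    @sync for chunk in Iterators.partition(xs, len)
        Threads.@spawn foreach(f, chunk)
    end
    return nothing
end

out = zeros(8)
tforeach(i -> out[i] = i^2, eachindex(out); ntasks = 2)  # each task writes disjoint slots
```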