This is my attempt at clarifying the behavior of the new default scheduling (aka `:dynamic`):
What if there was a new mechanism like this, with a made-up `BufferCache`:

```julia
buffers = Base.BufferCache(makebuf, nthreads())  # hypothetical type, does not exist in Base
Threads.@threads for x in xs
    Base.acquire(buffers) do buf
        # do something with x and buf
    end
end
```
It is very inefficient to (re)acquire and release the lock for each iteration. If you don't mind O(nthreads()) allocations, I recommend `FLoops.@init`; see Efficient and safe approaches to mutation in data parallelism (which also works on Distributed and GPU loops). If you want a more package-free approach, have a look at the worker pool pattern in Concurrency patterns for controlled parallelisms, where something like `BufferCache` can be used, but with only one acquire/release per task.
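For the package-free route, here is a minimal sketch of such a buffer pool built on a plain `Channel` (the `makebuf` and `work!` functions are placeholders for whatever the real loop body needs); each task acquires a buffer once, reuses it for its whole chunk, and then releases it:

```julia
using Base.Threads

makebuf() = zeros(100)                       # placeholder buffer constructor
work!(buf, x) = (fill!(buf, x); sum(buf))    # placeholder per-element work using the buffer

xs = rand(1_000)
pool = Channel{Vector{Float64}}(nthreads())
for _ in 1:nthreads()
    put!(pool, makebuf())                    # pre-fill: one buffer per concurrent task
end

len = cld(length(xs), nthreads())
tasks = map(Iterators.partition(xs, len)) do chunk
    Threads.@spawn begin
        buf = take!(pool)                    # acquire once per task, not once per iteration
        try
            sum(x -> work!(buf, x), chunk)
        finally
            put!(pool, buf)                  # release the buffer back to the pool
        end
    end
end
total = sum(fetch, tasks)
```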
@tkf was kind enough to suggest a solution to the pre-allocation problem in https://github.com/JuliaMolSim/DFTK.jl/issues/588. Quoting from his post:
First of all, here's a trick you can use almost always. If you have this pattern

```julia
Threads.@threads for x in xs
    i = Threads.threadid()
    f(x, i)
end
```
you can mechanically convert this to

```julia
n = cld(length(xs), Threads.nthreads())
@sync for (i, chunk) in enumerate(Iterators.partition(xs, n))
    Threads.@spawn for x in chunk
        f(x, i)
    end
end
```
This is very likely correct if the loop body `f` only uses `threadid()` with arrays allocated only for this parallel loop (e.g., the pre-1.3 reduction pattern).
The idea is to handle the range splitting into chunks yourself, and use per-chunk (rather than per-thread) buffers.
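A minimal sketch of that per-chunk-buffer idea (the `bufs[i] .= x` line is a placeholder for the real mutating work):

```julia
using Base.Threads

xs = rand(1_000)
n = cld(length(xs), nthreads())
chunks = collect(Iterators.partition(xs, n))
bufs = [zeros(100) for _ in chunks]       # one buffer per chunk, not per thread
partials = zeros(length(chunks))
@sync for (i, chunk) in enumerate(chunks)
    Threads.@spawn for x in chunk
        bufs[i] .= x                      # each task mutates only its own buffer ...
        partials[i] += sum(bufs[i])       # ... and only its own output slot
    end
end
total = sum(partials)
```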
I have been using this pattern in some code:

```julia
nchunks = Threads.nthreads()  # not necessarily, but most commonly
@sync for ichunk in 1:nchunks
    Threads.@spawn for i in ichunk:nchunks:length(x)
        f(x[i], ichunk)
    end
end
```
The difference is that this pattern makes the access to the elements of `x` non-contiguous (one iterates jumping in steps of size `nchunks`). This is simple, but it is probably worse if access to the elements of `x` is a bottleneck.
What I did observe, and which may be a useful addition here, is that when using these patterns one can get a performance advantage by setting `nchunks > nthreads()`, because sometimes one chunk gets overloaded, or stalled because of hardware stuff, and having more chunks improves load balancing.
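For example, something along these lines (a sketch; the oversubscription factor of 4 and the `sin` call are arbitrary placeholders):

```julia
using Base.Threads

x = rand(10_000)
nchunks = 4 * nthreads()               # more chunks than threads, for load balancing
partials = zeros(nchunks)
@sync for ichunk in 1:nchunks
    Threads.@spawn for i in ichunk:nchunks:length(x)
        partials[ichunk] += sin(x[i])  # placeholder per-element work
    end
end
total = sum(partials)
```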
one problem is I don't want my users to NEED to know all this in order to have multi-threading (they are physicists who barely know what a macro is, and Julia would be a "sell" from me…)
one possible workaround I know is to use Polyester.jl, or, well, to write a new macro within the package, but I wish `@threads` would just work, since the underlying logic is really as stupid as possible…
I agree. Couldn't there be an even simpler syntax for that, one that handled this properly by converting the loop into the pattern suggested above? Something even more "natural", such as:

```julia
@parallel nchunks = nthreads() for i in 1:length(x)
    result[chunk_index()] = ...
end
```

where the macro just reinterprets that into something safe?
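(`@parallel` and `chunk_index()` are hypothetical names here; today one would have to write the expansion by hand, roughly like this sketch:)

```julia
using Base.Threads

x = rand(10_000)
nchunks = nthreads()
result = zeros(nchunks)                       # one slot per chunk
len = cld(length(x), nchunks)
@sync for (ichunk, inds) in enumerate(Iterators.partition(eachindex(x), len))
    Threads.@spawn for i in inds
        result[ichunk] += x[i]^2              # ichunk plays the role of chunk_index()
    end
end
total = sum(result)
```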
it's more subtle than that; basically the current behavior (regardless of `:static` or `:dynamic`) agrees with `nchunks = nthreads()`, as tkf has said: it's just that `:dynamic` would allow a task, which is handling a contiguous chunk, to be run on a different OS thread at some point. `nchunks = nthreads()` doesn't seem to be an extra constraint.
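A tiny illustration of that point (a sketch): both schedules split the range into `nthreads()` contiguous chunks, one task per chunk; the only difference is whether the task handling a chunk is pinned to one OS thread.

```julia
using Base.Threads

ids = zeros(Int, 16)
# Both :static and :dynamic split 1:16 into nthreads() contiguous chunks, one task each.
Threads.@threads :dynamic for i in 1:16   # :dynamic is the default since Julia 1.8
    # With :static the recorded id is guaranteed constant within each chunk;
    # with :dynamic the task could in principle migrate to another OS thread.
    ids[i] = threadid()
end
```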
Yes, I didn't mean exactly the possibility of that option. The suggestion was because there it is clear that `chunk_index()` is something not necessarily bound to the threads, but to some arbitrary counter. A manual entry about that would be quite explicit: "with `nchunks = N` one sets into how many chunks one wants to split the work, and a buffer split into `N` chunks can be updated in a thread-safe manner using `chunk_index()`".
Why not just provide a `foreach`-like API?
how does that solve the problem? I assume `foreach()` will still yield a lazy collection
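For concreteness, such a `foreach`-like helper could just wrap the chunking pattern from above; a hypothetical sketch (the name `tforeach` and the `ntasks` keyword are made up):

```julia
using Base.Threads

# Hypothetical helper: apply f to every element of xs using ntasks chunked tasks.
function tforeach(f, xs; ntasks = nthreads())
    len = cld(length(xs), ntasks)
    @sync for chunk in Iterators.partition(xs, len)
        Threads.@spawn foreach(f, chunk)
    end
    return nothing
end

out = zeros(8)
tforeach(i -> out[i] = i^2, eachindex(out); ntasks = 2)  # each task writes disjoint slots
```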