In simulations, one usually has an outer loop that runs over time steps, and inner loops that compute, for example, interactions between particles. These inner loops are the ones that get parallelized. Therefore, one has something like:
julia> function f()
           s = zeros(Threads.nthreads())
           for step in 1:10000
               Threads.@threads for j in 1:10
                   s[Threads.threadid()] += 1
               end
           end
           sum(s)
       end
f (generic function with 1 method)
The @threads macro allocates a few things at every step, so by the end one has many allocations, proportional to the number of steps:
julia> @btime f()
60.542 ms (411191 allocations: 35.59 MiB)
100000.0
The outer loop (on steps) is not parallelizable, of course.
Unless the inner loop is really fast (something we make the greatest effort to achieve, but even then the time for launching the threads is most likely not important), this is probably not a problem. But, anyway, is that the usual approach? Is there some recommended way to deal with those allocations?
My main concern is not related to performance, but to the fact that it becomes a little more cumbersome to be sure that the code is allocation-free where it matters, when the threaded loop sits inside some inner function (which is the case here).
Have you tried Polyester.jl's threads? I've found it to require much less overhead (and often no allocations) compared to Base.@threads when used in inner loops, although it's more prone to segfaulting if you violate its assumptions regarding code structure.
julia> using Polyester

julia> function g()
           s = zeros(10)
           for step in 1:10000
               @batch per=thread for j in 1:10
                   s[j] += 1
               end
           end
           sum(s)
       end
It's not really an overhead problem, although I should try @batch. The computations of the inner loop are expensive enough that I'm not expecting any significant performance gain there. It is more about those allocations: whether I should simply ignore them, or whether there is some "preallocation" setup for the threaded auxiliary vectors, something like that. In everything that concerns my own code I have eliminated all allocations that would occur in the loops, so I wonder if I can get rid of these as well.
The allocations introduced by Threads.@threads were also one of the reasons why I started an older discussion on the overhead of this macro. Eventually, we switched to @batch from Polyester.jl (which wasn't available back then) in Trixi.jl, our package for hyperbolic PDEs. The macro @batch checks whether only a single thread is used; if so, it runs the plain serial code, allowing you to see the allocations in your inner functions, which can be really helpful when developing code. If multiple threads are used, it usually creates far fewer allocations than Threads.@threads.
However, let me warn you that @batch isn't as general as Threads.@threads. Additionally, we ran into some interesting bugs, but Chris Elrod was really helpful and fixed all the problems very fast (as always!).
I tried Polyester now, but it was actually slightly slower than @threads for me (I was not expecting it to be much faster either, as my inner loop is expensive). I had some issues with the progress meter as well. But I will keep that possibility in mind.
You could potentially pre-allocate the worker tasks, have them listen on a channel, and then in the inner loop just push messages onto the channel to tell the workers what to do (see the sketch after the benchmark below):
foo = Channel(2)
julia> @benchmark begin put!(foo,"bar"); take!(foo) end
BenchmarkTools.Trial: 10000 samples with 287 evaluations.
Range (min … max):  278.780 ns … 359.282 ns  ┊ GC (min … max): 0.00% … 0.00%
Time  (median):     293.268 ns               ┊ GC (median):    0.00%
Time  (mean ± σ):   293.533 ns ±   5.484 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

[histogram omitted]
279 ns                Histogram: frequency by time                 312 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
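A rough sketch of that worker/channel pattern (my own illustration, not code from this thread; the names f_channel, work, and results are made up, and the inner computation is just a placeholder):

# Spawn the worker tasks once, outside the time loop, and feed them work items
# through a Channel; each step then only costs a few put!/take! round trips.
function f_channel(nsteps)
    nt = Threads.nthreads()
    work    = Channel{Int}(nt)        # work items for the current step
    results = Channel{Float64}(nt)    # per-chunk contributions back to the main task
    workers = map(1:nt) do _
        Threads.@spawn for j in work          # each worker blocks until a work item arrives
            put!(results, 1.0)                # stand-in for the expensive inner computation on chunk j
        end
    end
    s = 0.0
    for step in 1:nsteps
        for j in 1:nt
            put!(work, j)             # tell the workers what to do this step
        end
        for _ in 1:nt
            s += take!(results)       # collect this step's contributions
        end
    end
    close(work)                       # lets the worker loops terminate
    foreach(wait, workers)
    return s
end

Since the workers are created only once, the per-step cost is just the put!/take! round trips benchmarked above rather than a fresh set of task allocations every step.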
Alternatively, you could use a custom macro such as @threaded from Overhead of `Threads.@threads` - #23 by ranocha. Then, when starting Julia with a single thread, you basically run the serial code without further allocations, which can be helpful for development. We did that before switching to @batch in Trixi.jl.
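For reference, a macro along those lines (my reconstruction of the idea, not copied verbatim from that thread or from Trixi.jl) could look like:

# Run serially when Julia has a single thread, so allocation checks on inner
# functions stay clean; otherwise fall back to Threads.@threads.
macro threaded(expr)
    return esc(quote
        if Threads.nthreads() == 1
            $(expr)
        else
            Threads.@threads $(expr)
        end
    end)
end

Usage is then `@threaded for j in 1:10 ... end` in place of `Threads.@threads for ...`.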
One possibility is that your code benefits from using more threads than the number of physical cores, in which case you can try @batch per=thread for .... The default is @batch per=core.
But here is another example where @threads was several times faster than @batch. I haven't really looked into the issues/causes there; I'm just pointing out that some such examples are known.
If you have one that's more minimal to investigate, that'd probably increase my motivation to take a look.
But as a guess, I think it's because of the unbalanced loads per thread in that example, and the solution would be to add a more passive waiting option than the active waiting Polyester currently uses, which is better suited for balanced workloads.
For many problems, the overhead should be a small enough fraction of the total time that it doesn't make much difference.
If your code has a structure such that there is a parallel loop inside a sequential loop, barrier-based scheduling may help improve performance. Here's my implementation of various synchronization barriers: https://github.com/tkf/SyncBarriers.jl. When there are many cores (say >= 32), I can observe that clever barriers like the dissemination barrier and the static tree barrier help performance.
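To illustrate that structure using only Base primitives (a naive counting barrier, not the optimized barriers from SyncBarriers.jl; SimpleBarrier and run_steps are names I made up for this sketch):

# Naive counting barrier: tasks are spawned once and meet here after every step,
# so the sequential outer loop never re-creates them.
mutable struct SimpleBarrier
    n::Int                      # number of participating tasks
    count::Int                  # how many have arrived in the current round
    generation::Int             # incremented each time everyone has arrived
    cond::Threads.Condition
end
SimpleBarrier(n) = SimpleBarrier(n, 0, 0, Threads.Condition())

function wait_barrier!(b::SimpleBarrier)
    lock(b.cond) do
        gen = b.generation
        b.count += 1
        if b.count == b.n
            b.count = 0
            b.generation += 1
            notify(b.cond)      # wake every task waiting on this round
        else
            while b.generation == gen
                wait(b.cond)
            end
        end
    end
end

function run_steps(nsteps)
    nt = Threads.nthreads()
    b = SimpleBarrier(nt)
    s = zeros(nt)
    @sync for t in 1:nt
        Threads.@spawn for step in 1:nsteps
            s[t] += 1           # this task's share of the inner-loop work
            wait_barrier!(b)    # everyone finishes the step before the next begins
        end
    end
    return sum(s)
end

SyncBarriers.jl provides more scalable replacements for this naive barrier on machines with many cores.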