In simulations, one usually has an external loop that runs over time steps, and inner loops that compute, for example, the interactions between particles. These inner loops are the ones that get parallelized. Therefore, one has something like:
julia> function f()
           s = zeros(Threads.nthreads())
           for step in 1:10000
               Threads.@threads for j in 1:10
                   s[Threads.threadid()] += 1
               end
           end
       end
f (generic function with 1 method)
The @threads macro allocates a few things at every step, so that at the end one ends up with many allocations, proportional to the number of steps:
julia> @btime f()
60.542 ms (411191 allocations: 35.59 MiB)
The outer loop (on steps) is not parallelizable, of course.
Unless the inner loop is really fast (something we make the greatest efforts to achieve, but even then the time for launching the threads is most likely unimportant), this is probably not a problem. But, anyway, is that the regular approach? Is there some recommended way to deal with those allocations?
The main reason for my concern is not performance, but the fact that it becomes a little more cumbersome to be sure that the code is allocation-free where it matters, when the threaded loop is inside some inner function (which is the case here).
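One way to keep that check tractable, regardless of which threading macro is used, is to factor the loop body into its own small function and measure its allocations directly. A minimal sketch (the function names here are hypothetical, just for illustration):

```julia
# Hypothetical inner kernel: the body of the threaded loop,
# factored out so it can be checked in isolation.
function inner_work!(s, j)
    s[j] += 1
    return nothing
end

# Measure inside a function so the call is statically typed
# and the measurement is not polluted by global-scope effects.
check_allocs(s) = @allocated inner_work!(s, 1)

s = zeros(10)
check_allocs(s)              # warm up / compile first
@assert check_allocs(s) == 0 # the kernel itself allocates nothing
```

This separates the allocations of your own code from the bookkeeping allocations of the threading macro.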
Have you tried Polyester.jl’s threads? I’ve found it to require much less overhead (and often no allocations) compared to Base.@threads when used in inner loops, although it’s more prone to segfaulting if you violate its assumptions regarding code structure.
julia> using Polyester

julia> function g()
           s = zeros(10)
           for step in 1:10000
               @batch per=thread for j in 1:10
                   s[j] += 1
               end
           end
       end
g (generic function with 1 method)
It's not really an overhead problem, although I should try @batch. The computations of the inner loop are expensive enough that I'm not expecting any significant performance gain there. It is more about those allocations: should I simply ignore them, or is there some "preallocation" setup for the threaded auxiliary vectors, something like that? In everything that concerns my own code I have eliminated all allocations that would occur in the loops, so I wonder if I can get rid of those as well.
The allocations introduced by Threads.@threads were also one of the reasons why I started an older discussion on the overhead of this macro. Eventually, we switched to @batch from Polyester.jl (which wasn't available back then) in Trixi.jl, our package for hyperbolic PDEs. The macro @batch includes a check whether only a single thread is used; if so, it runs the plain serial code, allowing you to see allocations in your inner functions, which can be really helpful when developing code. If multiple threads are used, it usually creates far fewer allocations than Threads.@threads.
However, let me warn you that @batch isn’t as general as Threads.@threads. Additionally, we ran into some interesting bugs, but Chris Elrod was really helpful and fixed all the problems very fast (as always!).
I tried Polyester now, but it was actually slightly slower than @threads for me (I was not expecting it to be much faster either, as my inner loop is expensive). I had some issues with the progress meter as well. But I will keep that possibility in mind.
Alternatively, you could also use a custom macro such as @threaded from Overhead of `Threads.@threads` - #23 by ranocha for development. Then, you will basically run the serial code without further allocations when starting Julia with a single thread, which can be helpful for development. We did that before switching to @batch in Trixi.jl.
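A sketch of such a macro (simplified; the version in the linked discussion may differ in details): with a single thread it emits the plain serial loop, so allocation checks see only your own code, while with multiple threads it falls back to Threads.@threads.

```julia
# Simplified sketch of a development-time @threaded macro:
# serial code on one thread, Threads.@threads otherwise.
macro threaded(expr)
    return esc(quote
        if Threads.nthreads() == 1
            $(expr)
        else
            Threads.@threads $(expr)
        end
    end)
end

s = zeros(10)
for step in 1:3
    @threaded for j in 1:10
        s[j] += 1
    end
end
```

Starting Julia with `-t 1` then makes the serial branch active, so tools like @allocated report only the allocations of the loop body.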
One possibility is that your code benefits from using more threads than the number of physical cores, in which case you can try @batch per=thread for .... The default is @batch per=core.
But here is another example where @threads was several times faster than @batch. I haven’t really looked into the issues/causes there, just pointing out that some examples are known.
If you have one that’s more minimal to investigate, that’d probably increase my motivation to take a look.
But as a guess, I think it's because of the unbalanced loads per thread in that example, and that the solution would be to add a more passive waiting option than the active waiting Polyester currently uses, which is better suited to balanced workloads.
For many problems, overhead should be low enough as a percent of total time that it doesn’t make much difference.
If your code has a structure such that there is a parallel loop inside a sequential loop, barrier-based scheduling may help improve the performance. Here's my implementation of various synchronization barriers: https://github.com/tkf/SyncBarriers.jl. When there are many cores (say >= 32), I can observe that clever barriers like the dissemination barrier and the static tree barrier help the performance.
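The idea is to spawn the worker tasks once, outside the sequential step loop, and synchronize them at each step, so the per-step task-creation allocations of Threads.@threads disappear entirely. A minimal sketch of that pattern, using a naive Channel-based barrier rather than SyncBarriers.jl's API (all names hypothetical):

```julia
# Long-lived workers synchronized once per step: tasks are spawned
# a single time, so no per-step task allocations occur.
function g_workers(nsteps, n)
    s = zeros(n)
    start = [Channel{Nothing}(1) for _ in 1:n]  # per-worker "go" signal
    done  = Channel{Nothing}(n)                 # completion signals
    tasks = [Threads.@spawn begin
        for _ in 1:nsteps
            take!(start[i])       # wait for the step to begin
            s[i] += 1             # stand-in for the expensive inner work
            put!(done, nothing)   # signal completion of this step
        end
    end for i in 1:n]
    for _ in 1:nsteps
        foreach(c -> put!(c, nothing), start)  # release all workers
        foreach(_ -> take!(done), 1:n)         # wait for all to finish
    end
    foreach(wait, tasks)
    return s
end
```

A Channel barrier like this is simple but relatively slow; the point of SyncBarriers.jl is to replace it with more scalable barrier implementations when core counts grow.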