Allocations of @threads

In simulations, one usually has a external loop which runs over time steps, and inner loops that compute, for example, interactions between particles. These last loops are parallelized. Therefore, one has something as:

julia> function f()
          s = zeros(Threads.nthreads())
          for step in 1:10000
               Threads.@threads for j in 1:10
                  s[Threads.threadid()] += 1
f (generic function with 1 method)

The @threads macro allocates a few things at every step, such that at the end one has many allocations, which are proportional to the number of steps:

julia> @btime f()
  60.542 ms (411191 allocations: 35.59 MiB)

The outer loop (on steps) is not parallelizable, of course.

Unless if the inner loop is really fast (something that we make the greatest efforts to achieve, but most likely the time for launching the threads is not important), probably that is not a problem. But, anyway, it that the regular approach? Is there some recommended way to deal with those allocations?

The main reason of my concern is not related to performance, but to the fact that it become a little bit more cumbersome to be sure that the code is allocation free where it maters, if the threaded loop is in some inner function (which is the case here).


1 Like

Have you tried Polyester.jl’s threads? I’ve found it to require much less overhead (and often no allocations) compared to Base.@threads when used in inner loops, although it’s more prone to segfaulting if you violate its assumptions regarding code structure.

julia> function g()
       s = zeros(10)
       for step in 1:10000
           @batch per=thread for j in 1:10
               s[j] += 1
julia> @btime f()
  3.351 s (420750 allocations: 35.88 MiB) # seconds!!

julia> @btime g()
  4.102 ms (1 allocation: 160 bytes)

It not really an overhead problem, although I should try @batch. But the computations of the inner loop are expensive enough so that I’m not expecting any significant performance gain there. It is more about those allocations, if I should simply ignore them or if there is some β€œpreallocation” setup for the threaded auxiliary vectors, something like that. In everything that concerns my code I have elimited all allocations that would occur in the loops, so I wander if I can get rid of those as well.

The allocations introduced by Threads.@threads were also one of the reasons why I started an older discussion on the overhead of this macro. Finally, we switched to @batch from Polyester.jl (which wasn’t available back then) in Trixi.jl, our package for hyperbolic PDEs. The macro @batch includes a check whether only a single thread is used; if so, it will run the plain serial code, allowing you to see allocations in your inner functions. If multiple threads are used, which can be really helpful when developing code. If multiple threads are used, it usually creates much less allocations than Threads.@threads.
However, let me warn you that @batch isn’t as general as Threads.@threads. Additionally, we ran into some interesting bugs, but Chris Elrod was really helpful and fixed all the problems very fast (as always!).


I tried Polyester now, but it was actually slightly slower than @threads for me (I was not expecting it to be much faster either, as my inner loop is expensive). I had some issues with the progessmeter as well. But I will keep that possibility in mind.

Potentially pre-allocate the threads, have them listen on a channel, and then in the inner loop you just push some messages in the channel to tell the threads what to do:

foo = Channel(2)
julia> @benchmark begin put!(foo,"bar"); take!(foo) end
BenchmarkTools.Trial: 10000 samples with 287 evaluations.
 Range (min … max):  278.780 ns … 359.282 ns  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     293.268 ns               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   293.533 ns Β±   5.484 ns  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

                   β–β–β–‚β–†β–…β–ˆβ–†β–ƒβ–†β–ˆβ–‚β–„β–…β–ƒβ–‡β–…β–†β–†β– ▁                         
  β–‚β–‚β–‚β–‚β–ƒβ–ƒβ–„β–„β–…β–ƒβ–…β–†β–†β–„β–†β–…β–…β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–†β–ˆβ–†β–„β–„β–…β–ƒβ–ƒβ–ƒβ–ƒβ–‚β–„β–‚β–‚β–‚β–ƒβ–‚β–‚β–‚β–‚β–‚β–β–‚β–‚β–‚ β–…
  279 ns           Histogram: frequency by time          312 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

1 Like

Alternatively, you could also use a custom macro such as @threaded from
Overhead of `Threads.@threads` - #23 by ranocha for development. Then, you will basically run the serial code without further allocations when starting Julia with a single thread, which can be helpful for development. We did that before switching to @batch in Trixi.jl.

1 Like

One possiblity is that your code benefits from using more threads than the number of physical cores, in which case you can try @batch per=thread for .... The default @batch per=core.
But here is another example where @threads was several times faster than @batch. I haven’t really looked into the issues/causes there, just pointing out that some examples are known.
If you have one that’s more minimal to investigate, that’d probably increase my motivation to take a look.
But as a guess, I think it’s because of the unbalanced loads per thread in that example, and that the solution would be to add some more passive waiting option than the active waiting Polyester currently uses and is better for balanced workloads.

For many problems, overhead should be low enough as a percent of total time that it doesn’t make much difference.


Indeed, with per=thread there is no performance difference.

1 Like

If your code has a structure such that there is a parallel loop inside a sequential loop, using barrier-based scheduling may help improve the performance. Here’s my implementation of various synchronization barriers When there are many cores (say >= 32), I can observe that clever barriers like the dissemination barrier and the static tree barrier help the performance.


The parallel part is the mapping some user defined function through cell lists (GitHub - m3g/CellListMap.jl: Flexible implementation of cell lists to map the calculations of part).

Because the function to be mapped and how it has to be reduced is on the hands of the user, I have for now implemented just a simple threaded loop.

It is working quite ok, to be truth, but I aim to have GPU version and perhaps a distributed one, then the parallelization will be more challenging.