Multithreading broadcast/map calls

I’m working on improving the performance of my code, and one simple step is making broadcast calls multithreaded.

After seeing that the Threads.@threads macro doesn’t work on broadcast expressions, I started converting code like

b = foo.(a)

to

b = similar(a, TypeB)
Threads.@threads for i in eachindex(a)
    b[i] = foo(a[i])
end

This is pretty verbose, but it works well for simple cases and I see a nice performance improvement. I’m running into trouble when TypeB depends on eltype(a) in some non-trivial way. I then started writing a bunch of little helper functions, such as

fooType(::Type{TypeA1}) = TypeB1
fooType(::Type{TypeA2}) = TypeB1
fooType(::Type{TypeA3}) = TypeB2

Then the above code becomes

b = similar(a, fooType(eltype(a)))
Threads.@threads for i in eachindex(a)
    b[i] = foo(a[i])
end

This seems like a poor solution, especially when foo() takes multiple arguments, where the number of helper methods begins to scale poorly.

I’m aware of Base.return_types, but have been warned this isn’t the intended application. I’m also aware of external packages with multithreaded map calls, such as KissThreading.jl, Folds.jl, and ThreadsX.jl.
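
For reference, here is a sketch of the kind of workaround I mean, using Base.promote_op, which, as I understand it, uses the same inference machinery that broadcast relies on internally to compute output eltypes (so it presumably carries the same caveat as return_types; foo and a below are stand-ins for the real function and array):

foo(x) = x + 1                          # stand-in for the real function
a = rand(100)

T = Base.promote_op(foo, eltype(a))     # infer the output eltype via the compiler
b = similar(a, T)
Threads.@threads for i in eachindex(a)
    b[i] = foo(a[i])
end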

This seems like a relatively simple and hopefully common problem. Is there a generally accepted approach within Base? Or are there advantages of one dependency over another for this use case?


Not really, just use ThreadsX.map. Although, yeah, arguably since we have Distributed.pmap in the stdlib, maybe we ought to have a Threads.map too.
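
For example, a minimal sketch (foo and a standing in for the original poster’s function and array):

using ThreadsX

foo(x) = x + 1              # stand-in
a = rand(100)
b = ThreadsX.map(foo, a)    # infers the output eltype for you, like Base.map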


There is also FastBroadcast.jl.

julia> using FastBroadcast, ThreadsX, BenchmarkTools

julia> @inline myexp(x) = @inline exp(x)
myexp (generic function with 1 method)

julia> function athreads!(f::F, y, x) where {F}
           Threads.@threads for i = eachindex(y,x)
               @inbounds y[i] = f(x[i])
           end
       end
athreads! (generic function with 1 method)

julia> x = rand(10^6); y = similar(x);

julia> @benchmark athreads!(myexp, $y, $x)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  73.365 μs …   7.195 ms  ┊ GC (min … max): 0.00% … 95.32%
 Time  (median):     84.651 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   88.692 μs ± 157.719 μs  ┊ GC (mean ± σ):  3.84% ±  2.13%

               ▁▁▁▃▄▅▄▆▇▇▆█▇██▇▇▅▇▇▅▅▅▃▃▃▁                      
  ▁▁▁▁▁▂▂▂▃▄▅▅▆██████████████████████████████▇▇▆▆▅▅▄▄▃▄▂▂▂▂▂▂▂ ▅
  73.4 μs         Histogram: frequency by time         97.2 μs <

 Memory estimate: 20.19 KiB, allocs estimate: 213.

julia> @benchmark ThreadsX.map!(myexp, $y, $x)
BenchmarkTools.Trial: 2265 samples with 1 evaluation.
 Range (min … max):  373.886 μs …   7.384 ms  ┊ GC (min … max): 0.00% … 90.95%
 Time  (median):     410.214 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   437.825 μs ± 432.173 μs  ┊ GC (mean ± σ):  6.01% ±  5.72%

                 ▁▁▁▁▁▆▅▃▁▇▅▅▇▇▆█▆▅▇▆▅▂▁▃  ▁▁                    
  ▁▁▁▃▃▃▃▂▅▄▅▆▇▇▅█████████████████████████▇██▇▇▅▇▅▄▅▄▃▃▂▃▂▂▂▂▁▂ ▅
  374 μs           Histogram: frequency by time          449 μs <

 Memory estimate: 170.89 KiB, allocs estimate: 1919.

julia> @benchmark @.. thread=true $y = myexp($x)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  48.467 μs … 218.581 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     49.127 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   49.292 μs ±   2.389 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

         ▁▅▆████▅▄▂                                             
  ▂▂▃▃▄▆▇██████████▇▆▅▄▃▃▂▂▂▂▂▂▂▂▁▁▁▁▁▁▂▁▂▁▂▂▂▂▂▂▂▂▂▃▂▂▃▃▃▂▂▂▂ ▄
  48.5 μs         Histogram: frequency by time         51.7 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> using LoopVectorization

julia> @benchmark @tturbo @. $y = myexp($x)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  26.095 μs … 57.156 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     26.793 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   26.910 μs ±  1.181 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

           ▂▅█▇▇▄▁                                             
  ▂▂▂▂▃▃▄▆▇████████▅▄▃▃▂▂▂▂▂▂▂▂▁▁▁▂▁▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
  26.1 μs         Histogram: frequency by time        29.4 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark @tturbo @. $y = exp($x)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  23.003 μs …  32.287 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     23.596 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   23.653 μs ± 436.241 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

          ▁▃▅▇███▄▃▁                                            
  ▂▂▂▂▃▄▅▇██████████▆▅▄▃▃▂▂▂▂▂▁▂▂▂▁▂▂▂▁▁▁▁▁▂▂▁▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
  23 μs           Histogram: frequency by time         25.7 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Thank you! This is awesome.

It looks like there are some real overhead differences between the different methods - at least when broadcasting a simple function.

Are there other metrics you consider when comparing these options? Or would you expect the benchmark comparison to stack up differently in other circumstances?

The Polyester-based versions with the lowest overhead won’t compose as well, i.e. they’ll result in worse performance when nested inside other threaded code.

It looks like that includes both FastBroadcast.jl and LoopVectorization.jl, correct?

Yes.
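
To illustrate the nesting pattern in question, here is a hedged sketch (the arrays and sizes are made up): the inner @.. call uses Polyester’s threads while the outer loop already occupies Julia’s task scheduler, which is where the poor composition shows up.

using FastBroadcast

X = rand(10^4, 8); Y = similar(X)

Threads.@threads for j in axes(X, 2)    # outer threading via Base tasks
    Xj = view(X, :, j)
    Yj = view(Y, :, j)
    @.. thread=true Yj = exp(Xj)        # inner threading via Polyester
end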

Just wanted to follow up with how this worked out in my application.

  • ThreadsX worked fine, except that it seemed to scramble the profiler and wasn’t as quick as Threads.@threads
  • FastBroadcast seems to require the output array to be preallocated. My original motivation was to avoid doing this, since I don’t know the output type up front. Perhaps I’m misunderstanding how to use it; I couldn’t find any docs other than the README for this package. (See the sketch after this list for a possible workaround.)
  • LoopVectorization seems to have the same problem as FastBroadcast
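
For what it’s worth, a possible workaround for the preallocation issue, sketched under the same caveats as before (foo and a are stand-ins, and the eltype inference relies on the compiler, so it is not guaranteed API):

using FastBroadcast

foo(x) = x + 1                          # stand-in for the real function
a = rand(100)

T = Base.promote_op(foo, eltype(a))     # infer the output eltype
b = similar(a, T)                       # preallocate, then broadcast in place
@.. thread=true b = foo(a)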