Multithreading broadcast/map calls

I’m working on improving the performance of my code, and one simple step is making broadcast calls multithreaded.

After seeing that the Threads.@threads macro doesn’t work on broadcast expressions, I started converting code like

b = foo.(a)

to

b = similar(a, TypeB)
Threads.@threads for i in eachindex(a)
    b[i] = foo(a[i])
end

This is pretty verbose, but it works well for simple cases and I see a nice performance improvement. I’m running into trouble when TypeB depends on eltype(a) in some non-trivial way. I then started writing a bunch of little helper functions, such as

fooType(::Type{TypeA1}) = TypeB1
fooType(::Type{TypeA2}) = TypeB1
fooType(::Type{TypeA3}) = TypeB2

Then the above code becomes

b = similar(a, fooType(eltype(a)))
Threads.@threads for i in eachindex(a)
    b[i] = foo(a[i])
end

This seems like a poor solution, especially when foo() takes multiple arguments, where the number of helper methods begins to scale poorly.

I’m aware of Base.return_types, but have been warned this isn’t the intended application. I’m also aware of external packages with multithreaded map calls, such as KissThreading.jl, Folds.jl, and ThreadsX.jl.
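
For reference, here is a sketch of the kind of workaround I mean, using Base.promote_op, which, as I understand it, uses the same inference machinery that broadcast relies on internally to compute output eltypes (so it presumably carries the same caveat as return_types; foo and a below are stand-ins for the real function and array):

foo(x) = x + 1                          # stand-in for the real function
a = rand(100)

T = Base.promote_op(foo, eltype(a))     # infer the output eltype via the compiler
b = similar(a, T)
Threads.@threads for i in eachindex(a)
    b[i] = foo(a[i])
end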

This seems like a relatively simple and hopefully common problem. Is there a generally accepted approach within Base? Or are there advantages of one dependency over another for this use case?


Not really, just use ThreadsX.map. Although, yeah, arguably since we have Distributed.pmap in the stdlib, maybe we ought to have a Threads.map too.
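
For example, a minimal sketch (foo and a standing in for the original poster’s function and array):

using ThreadsX

foo(x) = x + 1              # stand-in
a = rand(100)
b = ThreadsX.map(foo, a)    # infers the output eltype for you, like Base.map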


There is also FastBroadcast.jl.

julia> using FastBroadcast, ThreadsX, BenchmarkTools

julia> @inline myexp(x) = @inline exp(x)
myexp (generic function with 1 method)

julia> function athreads!(f::F, y, x) where {F}
           Threads.@threads for i = eachindex(y,x)
               @inbounds y[i] = f(x[i])
           end
       end
athreads! (generic function with 1 method)

julia> x = rand(10^6); y = similar(x);

julia> @benchmark athreads!(myexp, $y, $x)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  73.365 μs …   7.195 ms  ┊ GC (min … max): 0.00% … 95.32%
 Time  (median):     84.651 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   88.692 μs ± 157.719 μs  ┊ GC (mean ± σ):  3.84% ±  2.13%

               ▁▁▁▃▄▅▄▆▇▇▆█▇██▇▇▅▇▇▅▅▅▃▃▃▁                      
  ▁▁▁▁▁▂▂▂▃▄▅▅▆██████████████████████████████▇▇▆▆▅▅▄▄▃▄▂▂▂▂▂▂▂ ▅
  73.4 μs         Histogram: frequency by time         97.2 μs <

 Memory estimate: 20.19 KiB, allocs estimate: 213.

julia> @benchmark ThreadsX.map!(myexp, $y, $x)
BenchmarkTools.Trial: 2265 samples with 1 evaluation.
 Range (min … max):  373.886 μs …   7.384 ms  ┊ GC (min … max): 0.00% … 90.95%
 Time  (median):     410.214 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   437.825 μs ± 432.173 μs  ┊ GC (mean ± σ):  6.01% ±  5.72%

                 ▁▁▁▁▁▆▅▃▁▇▅▅▇▇▆█▆▅▇▆▅▂▁▃  ▁▁                    
  ▁▁▁▃▃▃▃▂▅▄▅▆▇▇▅█████████████████████████▇██▇▇▅▇▅▄▅▄▃▃▂▃▂▂▂▂▁▂ ▅
  374 μs           Histogram: frequency by time          449 μs <

 Memory estimate: 170.89 KiB, allocs estimate: 1919.

julia> @benchmark @.. thread=true $y = myexp($x)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  48.467 μs … 218.581 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     49.127 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   49.292 μs ±   2.389 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

         ▁▅▆████▅▄▂                                             
  ▂▂▃▃▄▆▇██████████▇▆▅▄▃▃▂▂▂▂▂▂▂▂▁▁▁▁▁▁▂▁▂▁▂▂▂▂▂▂▂▂▂▃▂▂▃▃▃▂▂▂▂ ▄
  48.5 μs         Histogram: frequency by time         51.7 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> using LoopVectorization

julia> @benchmark @tturbo @. $y = myexp($x)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  26.095 μs … 57.156 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     26.793 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   26.910 μs ±  1.181 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

           ▂▅█▇▇▄▁                                             
  ▂▂▂▂▃▃▄▆▇████████▅▄▃▃▂▂▂▂▂▂▂▂▁▁▁▂▁▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
  26.1 μs         Histogram: frequency by time        29.4 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark @tturbo @. $y = exp($x)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  23.003 μs …  32.287 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     23.596 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   23.653 μs ± 436.241 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

          ▁▃▅▇███▄▃▁                                            
  ▂▂▂▂▃▄▅▇██████████▆▅▄▃▃▂▂▂▂▂▁▂▂▂▁▂▂▂▁▁▁▁▁▂▂▁▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
  23 μs           Histogram: frequency by time         25.7 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Thank you! This is awesome.

It looks like there are some real overhead differences between the different methods - at least when broadcasting a simple function.

Are there other metrics you consider when comparing these options? Or would you expect the benchmark comparison to stack up differently in other circumstances?

The Polyester-based versions with the lowest overhead won’t compose as well, i.e. they’ll result in worse performance when nested inside other threaded code.

It looks like that includes both FastBroadcast.jl and LoopVectorization.jl, correct?

Yes.
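
To illustrate the nesting pattern in question, here is a hedged sketch (the arrays and sizes are made up): the inner @.. call uses Polyester’s threads while the outer loop already occupies Julia’s task scheduler, which is where the poor composition shows up.

using FastBroadcast

X = rand(10^4, 8); Y = similar(X)

Threads.@threads for j in axes(X, 2)    # outer threading via Base tasks
    Xj = view(X, :, j)
    Yj = view(Y, :, j)
    @.. thread=true Yj = exp(Xj)        # inner threading via Polyester
end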

Just wanted to follow up with how this worked out in my application.

  • ThreadsX worked fine, except that it seemed to scramble the profiler and wasn’t as quick as Threads.@threads
  • FastBroadcast seems to require the output array to be preallocated. My original motivation was to avoid doing this, since I don’t know the output type up front. Perhaps I’m misunderstanding how to use it; I couldn’t find any docs other than the README for this package. (See the sketch after this list for a possible workaround.)
  • LoopVectorization seems to have the same problem as FastBroadcast
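
For what it’s worth, a possible workaround for the preallocation issue, sketched under the same caveats as before (foo and a are stand-ins, and the eltype inference relies on the compiler, so it is not guaranteed API):

using FastBroadcast

foo(x) = x + 1                          # stand-in for the real function
a = rand(100)

T = Base.promote_op(foo, eltype(a))     # infer the output eltype
b = similar(a, T)                       # preallocate, then broadcast in place
@.. thread=true b = foo(a)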