I’m working on improving the performance of my code, and one simple step is making broadcast calls multithreaded.
After seeing that the Threads.@threads macro doesn’t work on broadcast expressions, I started converting code like
b = foo.(a)
to
b = similar(a, TypeB)
Threads.@threads for i in eachindex(a)
    b[i] = foo(a[i])
end
This is pretty verbose, but works well for simple cases and I see a nice performance improvement. I’m running into trouble when TypeB is dependent on eltype(a) in some non-trivial way. I then started writing a bunch of little helper functions such as
b = similar(a, fooType(eltype(a)))
Threads.@threads for i in eachindex(a)
    b[i] = foo(a[i])
end
This seems like a poor solution: it scales badly, especially when foo() takes multiple arguments.
I’m aware of Base.return_types, but have been warned that this isn’t its intended use. I’m also aware of external packages with multithreaded map calls, such as KissThreading.jl, Folds.jl, and ThreadsX.jl.
This seems like a relatively simple and hopefully common problem. Is there a generally accepted approach within Base? Or are there advantages of one dependency over the others for this use case?
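For reference, one Base-only pattern that avoids naming the output type up front is to split the indices into chunks, run a plain map on each chunk in its own task, and concatenate the results, so that map picks the element type from the values it actually produces. This is only a minimal sketch, not an established Base API: tmap and its ntasks keyword are names made up for illustration, foo is a stand-in function, and it assumes the input is a vector and that foo is safe to call from several threads.

```julia
# Hypothetical helper (not part of Base): threaded map over a vector.
# Each chunk is mapped in its own task, so `map` infers the element type
# from the values produced and no output array has to be preallocated.
function tmap(f, xs::AbstractVector; ntasks::Integer = Threads.nthreads())
    isempty(xs) && return map(f, xs)
    # Split the indices into at most `ntasks` contiguous chunks.
    chunk_len = cld(length(xs), max(ntasks, 1))
    chunks = Iterators.partition(eachindex(xs), chunk_len)
    tasks = map(chunks) do idxs
        Threads.@spawn map(i -> f(xs[i]), idxs)
    end
    # Wait for every task and join the per-chunk results in order.
    return reduce(vcat, fetch.(tasks))
end

foo(x) = x / 2          # stand-in for the real function
a = [1, 2, 3, 4, 5]
b = tmap(foo, a)        # element type comes from foo's results
```

If different chunks happen to produce different element types, vcat promotes them, which can differ from what foo.(a) would return, so this is best suited to functions whose return type is consistent across the input.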
It looks like there are some real overhead differences between the different methods - at least when broadcasting a simple function.
Are there other metrics you consider when comparing these options? Or would you expect the benchmark comparison to stack up differently in other circumstances?
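For anyone who wants to reproduce this kind of comparison without extra packages, here is a rough sketch using only @elapsed from Base. The names threaded_apply and best_time are hypothetical helpers, foo is a stand-in for a cheap function, and the array size is arbitrary; a dedicated benchmarking package would give more reliable numbers.

```julia
foo(x) = sqrt(abs(x)) + 1   # stand-in for a cheap function

# Hypothetical helper: the explicit threaded loop with a known output type T.
function threaded_apply(f, a, ::Type{T}) where {T}
    b = similar(a, T)
    Threads.@threads for i in eachindex(a)
        b[i] = f(a[i])
    end
    return b
end

# Hypothetical helper: run once to compile, then keep the fastest of `reps` runs.
function best_time(f; reps = 5)
    f()
    return minimum(@elapsed(f()) for _ in 1:reps)
end

a = rand(10^5)
t_serial   = best_time(() -> foo.(a))
t_threaded = best_time(() -> threaded_apply(foo, a, Float64))
println("serial: ", t_serial, " s, threaded: ", t_threaded, " s")
```

For a function this cheap, the fixed cost of starting threads can dominate, so the ratio is likely to change as foo gets more expensive or the array gets larger.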
The Polyester-based versions with the lowest overhead won’t compose as well, i.e. they’ll perform worse when nested inside other threaded code.
Just wanted to follow up with how this worked out in my application.
ThreadsX worked fine, except that it seemed to scramble the profiler output and wasn’t as quick as Threads.@threads.
FastBroadcast seems to require the output array to be preallocated. My original motivation was to avoid doing this, since I don’t know the output type. Perhaps I’m misunderstanding how to use it - I couldn’t find any docs for this package other than the README.
LoopVectorization seems to have the same problem as FastBroadcast.