If you’re running on a single machine, I typically use reduce from ThreadsX.jl. For multi-node parallel reductions, take a look at the @distributed macro in the Distributed standard library, or Dagger.jl.
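For concreteness, here’s a minimal sketch of what I mean, assuming ThreadsX.jl is installed and Julia was started with more than one thread (e.g. `julia --threads=auto`); the data and the sum-of-squares reduction are just placeholders:

```julia
using ThreadsX

xs = rand(10^7)

# Thread-parallel map-reduce on a single machine; ThreadsX mirrors the
# Base mapreduce signature, so this is a drop-in swap for the serial call.
s = ThreadsX.mapreduce(x -> x^2, +, xs)

# Multi-process version via the Distributed standard library, shown
# commented out because it needs worker processes to be added first:
# using Distributed; addprocs(4)
# s = @distributed (+) for i in eachindex(xs)
#     xs[i]^2
# end
```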
My incomplete view of the thread-parallel ecosystem is that there’s the JuliaSIMD universe (lots of @Elrod’s work) and the JuliaFolds universe (lots of @tkf’s work). LoopVectorization is very tough to beat for raw speed, but Transducers is great for composability.
ThreadsX is based on Transducers. For a nice high-level API for LoopVectorization, check out @mcabbott’s Tullio.jl.
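To make the contrast concrete, here’s a rough sketch of the two styles, assuming Tullio.jl, LoopVectorization.jl, and Transducers.jl are installed; the arrays and operations are only illustrative:

```julia
using Tullio, LoopVectorization   # Tullio picks up @turbo kernels when LoopVectorization is loaded
using Transducers

A = rand(200, 300); B = rand(300, 100)

# Tullio: high-level einsum-style notation, lowered to fast (optionally threaded) loops
@tullio C[i, j] := A[i, k] * B[k, j]

# Transducers: composable fold pipelines; ThreadsX builds its parallel
# reductions on top of these. foldxt is the thread-parallel fold.
s = foldxt(+, Map(x -> x^2) |> Filter(>(0.5)), vec(A); init = 0.0)
```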
@stillyslalom @cscherrer that’s really interesting, thanks for sharing.
This, alas, is nothing but a shameless plug, but having happened upon this thread the other day, I feel like it’s slightly less shameless.