What is the “standard” way to use multi-threading with dot-calls (vectorized/broadcast)?

All the discussions I’ve found on this topic are years old.

What is currently the most straightforward way to get multi-threading with operations like these:

a = rand(4000, 4000)
b = rand(4000, 4000)
c = zeros(4000, 4000)
c .= a .+ exp.(b)

Does everyone just use Strided.jl? (I think LoopVectorization.jl also had some kind of threading capability, but that package is now deprecated and, either way, it is less general and much more invasive than simply parallelizing independent operations across threads, OpenMP-style.)
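
For reference, here is a minimal sketch of the plain "OpenMP-style" approach using only `Threads.@threads` from Base, with no packages. The helper name `threaded_add_exp!` is made up for illustration, the arrays are smaller than the 4000×4000 ones above to keep it quick, and it only runs on more than one thread if Julia was started with several (e.g. `julia -t auto`). It is an explicit loop, so it sidesteps the dot syntax and does not parallelize the broadcast itself.

```julia
# Hypothetical helper (not from any package): computes c[i] = a[i] + exp(b[i]),
# with the index range split across the available threads.
function threaded_add_exp!(c, a, b)
    # eachindex(c, a, b) throws if the arrays do not share the same indices.
    Threads.@threads for i in eachindex(c, a, b)
        @inbounds c[i] = a[i] + exp(b[i])
    end
    return c
end

a = rand(400, 400)
b = rand(400, 400)
c = zeros(400, 400)

threaded_add_exp!(c, a, b)

# Sanity check against the single-threaded broadcast.
@assert c ≈ a .+ exp.(b)
```

The loop body is specific to this one expression, which is the generality that a threaded broadcast would keep.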

I use the multithreaded map functions from OhMyThreads.jl or ThreadsX.jl.

Neither of those works with broadcasts, though, as far as I can tell?

Right, map is an alternative to broadcast.
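
To make the map-versus-broadcast point concrete: when all arrays have the same shape, `c .= a .+ exp.(b)` can be written as a map of `(x, y) -> x + exp(y)` over `a` and `b`, and a threaded map can then take the place of `map`. The sketch below uses only Base. `chunked_tmap!` is a hypothetical stand-in written for this example and is not the API of OhMyThreads.jl or ThreadsX.jl; their own map functions should be looked up in the respective docs.

```julia
# Hypothetical helper, not a package API: a chunked, task-based two-argument map!.
function chunked_tmap!(f, out, xs, ys; ntasks = Threads.nthreads())
    idx = eachindex(out, xs, ys)                 # errors if the indices differ
    chunksize = max(1, cld(length(idx), ntasks))
    # One task per chunk of indices; @sync waits for all spawned tasks to finish.
    @sync for chunk in Iterators.partition(idx, chunksize)
        Threads.@spawn for i in chunk
            @inbounds out[i] = f(xs[i], ys[i])
        end
    end
    return out
end

f(x, y) = x + exp(y)

a = rand(400, 400)
b = rand(400, 400)

c_bcast = a .+ exp.(b)                           # broadcast form
c_map   = map(f, a, b)                           # serial map form
c_tmap  = chunked_tmap!(f, similar(a), a, b)     # threaded map form

@assert c_map ≈ c_bcast
@assert c_tmap ≈ c_bcast
```

Unlike broadcast, `map` does not expand scalars or singleton dimensions, so this only substitutes for dot-calls when every argument is an array of the same shape.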

I hadn’t heard of this amazing package (Strided.jl) until you mentioned it. For fun, and to practice my almost-forgotten macro-fu, I made an attempt to write a short macro to save some typing when combining @strided and @.:

module StridedDot

export @sd

using Strided: Strided, _strided, maybestrided, sreshape, sview, maybeunstrided
using MacroTools: @capture, postwalk

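# Replace every occurrence of the symbol `Strided` in the expression with
# `StridedDot`, so the generated code refers to the names imported into this module.
function xform(ex)
    postwalk(ex) do x
        @capture(x, Strided) || return x
        return :StridedDot
    end
end

# Expand the dots first (as `@.` does), then apply Strided's internal
# `_strided` expression transform to the result.
macro sd(ex1)
    ex = Strided._strided(Base.Broadcast.__dot__(ex1))
    esc(xform(ex))
end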

end # module

Then the timing comparison on my 8-core Core i7-9700 is:

using BenchmarkTools

a = rand(4000, 4000)
b = rand(4000, 4000)
c = zeros(4000, 4000)

f1!(c, a, b) = @. c = a + exp(b)

@btime f1!($c, $a, $b) # 63.204 ms (0 allocations: 0 bytes)

using .StridedDot
f2!(c, a, b) = @sd c = a + exp(b)
c2 = similar(c)
@btime f2!($c2, $a, $b) # 15.843 ms (125 allocations: 11.42 KiB)

c == c2 # true

Unfortunately, I just noticed the discussion in “Correct way to parallelize this code?” (Issue #9 in Jutho/Strided.jl on GitHub).

And again, the discussion ends around the same time frame, 2021. Strided.jl is still a good option in those cases where it works, but maybe not exactly the ideal candidate for a “standard” solution.
