What is currently the most straightforward way to get multi-threading with operations like these:
a = rand(4000, 4000)
b = rand(4000, 4000)
c = zeros(4000, 4000)
c .= a .+ exp.(b)
Does everyone just use Strided.jl? (I think LoopVectorization.jl also had some kind of threading capability, but that package is now deprecated, and in any case its approach is less general and much more invasive than simply parallelizing independent operations across threads, OpenMP-style.)
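For reference, by "OpenMP-style" I mean something like the following hand-rolled chunked loop, which is the kind of boilerplate I'd rather not write every time (threaded_kernel! is just a placeholder name):

# Split the columns across Julia's threads and apply the kernel column by column.
function threaded_kernel!(c, a, b)
    Threads.@threads for j in axes(c, 2)
        @inbounds for i in axes(c, 1)
            c[i, j] = a[i, j] + exp(b[i, j])
        end
    end
    return c
end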
I hadn’t heard of this amazing package until you mentioned it. For fun, and to practice my almost-forgotten macro-fu, I made an attempt to write a short macro to save some typing when combining @strided and @.:
module StridedDot
export @sd

# The expression produced by Strided._strided refers to these helpers qualified
# with the module name; importing them here lets the rewritten expression
# resolve them through StridedDot instead.
using Strided: Strided, _strided, maybestrided, sreshape, sview, maybeunstrided
using MacroTools: @capture, postwalk

# Replace every bare `Strided` symbol in the expression with `StridedDot`, so the
# escaped result only needs this module (not Strided) in scope at the call site.
function xform(ex)
    postwalk(ex) do x
        @capture(x, Strided) || return x
        return :StridedDot
    end
end

# @sd ex: first apply the @. transformation (Base.Broadcast.__dot__),
# then Strided's own macro transformation (_strided).
macro sd(ex1)
    ex = Strided._strided(Base.Broadcast.__dot__(ex1))
    esc(xform(ex))
end

end # module
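To spell out what @sd is abbreviating: it is just @strided applied to the explicitly dotted expression, i.e. something like the following (f3! is only an illustrative name, and it needs using Strided at the call site):

using Strided

f3!(c, a, b) = @strided c .= a .+ exp.(b)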
Then the timing comparison on my 8-core Core i7-9700 is:
using BenchmarkTools
a = rand(4000, 4000)
b = rand(4000, 4000)
c = zeros(4000, 4000)
f1!(c, a, b) = @. c = a + exp(b)
@btime f1!($c, $a, $b) # 63.204 ms (0 allocations: 0 bytes)
using .StridedDot
f2!(c, a, b) = @sd c = a + exp(b)
c2 = similar(c)
@btime f2!($c2, $a, $b) # 15.843 ms (125 allocations: 11.42 KiB)
c == c2 # true
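One caveat in case anyone tries to reproduce this: as far as I know Strided.jl distributes its work over Julia's own threads, so the speedup assumes Julia was started with multiple threads (e.g. julia -t 8 or JULIA_NUM_THREADS=8). A quick sanity check:

Threads.nthreads()  # must be > 1, otherwise Strided effectively runs single-threaded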
And again, the discussion there seems to end around the same time frame, 2021. Not that it isn't a good option in the cases where it works, but it's maybe not exactly the ideal candidate for a “standard” solution.