What about
```julia
julia> using ThreadsX, BenchmarkTools

julia> const impl = ThreadsX # Base
ThreadsX

julia> selfdot(x) = impl.mapreduce(abs2, +, x)
selfdot (generic function with 1 method)

julia> x = rand(10^4);

julia> @code_warntype selfdot(x)
Variables
  #self#::Core.Compiler.Const(selfdot, false)
  x::Array{Float64,1}

Body::Any
1 ─ %1 = ThreadsX.mapreduce::Core.Compiler.Const(ThreadsX.mapreduce, false)
│   %2 = (%1)(Main.abs2, Main.:+, x)::Any
└──      return %2

julia> @btime $x' * $x
  1.204 μs (0 allocations: 0 bytes)
3313.78577447527

julia> @btime selfdot($x)
  94.007 μs (12187 allocations: 630.94 KiB)
3313.78577447527
```
The reference to `ThreadsX.mapreduce` was inferred as a constant, but the return type of the call itself was only inferred as `Any`. Hopefully fixing that will make it competitive with the Base dot product, although I may have to specify
```julia
function mydot(x, y)
    init = zero(promote_type(eltype(x), eltype(y)))
    ThreadsX.mapreduce(Base.FastMath.mul_fast, Base.FastMath.add_fast, x, y; init = init)
end
```
if that’s required for SIMD reductions.
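For the single-vector case above, the same `init` trick may be enough on its own. As a hedged sketch (the name `selfdot_stable` is mine, not from the original code), supplying a concretely typed starting value is one common way to give inference a concrete return type, though whether it fully fixes the `Body::Any` here depends on the ThreadsX version:

```julia
using ThreadsX

# Hypothetical variant of selfdot: a typed `init` gives the reduction
# a known starting type, which often helps inference avoid Any.
selfdot_stable(x) = ThreadsX.mapreduce(abs2, +, x; init = zero(eltype(x)))

# Inspect inference the same way as above:
# julia> @code_warntype selfdot_stable(rand(10^4))
```

The result should match the BLAS dot product `x' * x` up to floating-point reassociation.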