[ANN] ThreadsX.jl: Parallelized Base functions

What about this:

julia> using ThreadsX, BenchmarkTools

julia> const impl = ThreadsX # Base
ThreadsX

julia> selfdot(x) = impl.mapreduce(abs2, +, x)
selfdot (generic function with 1 method)

julia> x = rand(10^4);

julia> @code_warntype selfdot(x)
Variables
  #self#::Core.Compiler.Const(selfdot, false)
  x::Array{Float64,1}

Body::Any
1 ─ %1 = ThreadsX.mapreduce::Core.Compiler.Const(ThreadsX.mapreduce, false)
│   %2 = (%1)(Main.abs2, Main.:+, x)::Any
└──      return %2
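For contrast, a quick check (plain Base only, no ThreadsX; `selfdot_base` is my name for illustration) that the serial version of the same reduction infers a concrete return type:

```julia
using Test  # for @inferred

# Serial analogue of selfdot above, using Base.mapreduce instead of ThreadsX.
selfdot_base(x) = mapreduce(abs2, +, x)

x = rand(10^4)
@inferred selfdot_base(x)  # passes: inferred return type is Float64, not Any
```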

julia> @btime $x' * $x
  1.204 μs (0 allocations: 0 bytes)
3313.78577447527

julia> @btime selfdot($x)
  94.007 μs (12187 allocations: 630.94 KiB)
3313.78577447527

The `ThreadsX.mapreduce` binding itself was inferred (as a constant), but the call to it was not: it returns `::Any`. Hopefully fixing that will make it competitive with the Base dot product, although I may have to specify

function mydot(x, y)
    init = zero(promote_type(eltype(x), eltype(y)))
    ThreadsX.mapreduce(Base.FastMath.mul_fast, Base.FastMath.add_fast, x, y; init = init)
end

if that’s required for SIMD reductions.
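As a sanity check on that sketch, here is a serial Base-only analogue (the `ThreadsX.mapreduce` keyword signature above is from the post; this version drops the ThreadsX dependency, so it checks the fast-math operators and `init`, not the threading):

```julia
# Serial analogue of mydot above: same fast-math operators and explicit init,
# but using Base.mapreduce rather than ThreadsX.mapreduce.
function mydot_serial(x, y)
    init = zero(promote_type(eltype(x), eltype(y)))
    mapreduce(Base.FastMath.mul_fast, Base.FastMath.add_fast, x, y; init = init)
end

x = rand(10^4); y = rand(10^4);
mydot_serial(x, y) ≈ x' * y  # true, up to floating-point reassociation
```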