There seem to be a few independent factors contributing to the performance difference.
Factor 1: `<` vs `isless`

First of all, `abs(a[i])>result && (result=abs(a[i]))` used in `absmax` of the OP is faster than `max`. This is because `<` does not define an ordering with `NaN`, and `max`/`isless` do extra work to do the right thing:
```julia
julia> isless(NaN, 0.0)
false

julia> isless(0.0, NaN)
true

julia> NaN < 0.0
false

julia> 0.0 < NaN
false
```
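In particular, Base's `max` is built on `isless` and therefore carefully propagates `NaN` no matter which argument it appears in (a quick REPL check):

```julia
julia> max(NaN, 0.0)
NaN

julia> max(0.0, NaN)
NaN
```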
A fair comparison would be to use

```julia
fmax(a, b) = ifelse(b > a, b, a)
```

both in `absmax` and the `mapreduce` examples. So, `absmax` would look like
```julia
function absmax(a)
    result = typemin(eltype(a))
    @inbounds for i in 1:length(a)
        result = fmax(result, abs(a[i]))
    end
    result
end
```
(Nitpick: I fixed the initial value of `result` to `typemin(eltype(a))`. The original definition using `zero` is not correct unless `a` is known to contain non-negative numbers. Alternatively, `result = -0.5` is a good initial value here, since it is a known lower bound of `a`.)
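Note that the speed of `fmax` comes at a semantic cost: it is not `NaN`-aware. A quick REPL check, assuming the `fmax` definition above:

```julia
julia> fmax(1.0, NaN)   # NaN > 1.0 is false, so the NaN is dropped
1.0

julia> fmax(NaN, 1.0)   # 1.0 > NaN is also false, so here the NaN wins
NaN
```

In the reduction loop above, a `NaN` element always appears as the second argument of `fmax`, so it is silently skipped.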
To use `fmax` with ThreadsX.jl, I added the following implementation:

```julia
absmax_mapreduce_tx(x) = ThreadsX.mapreduce(abs, fmax, x; init = typemin(eltype(x)))
```
Unfortunately, `ThreadsX.mapreduce(abs, fmax, x)` cannot be used at the moment, since the initial value (identity) of `fmax` is not known to ThreadsX.jl. Transducers.jl already has the facility to support this, but I have to think about the impact of it.
With these changes, running the script (see below) with `julia -t4 script.jl` prints:
```
Threads.nthreads() = 4
n = 524288
4-element Vector{Pair{Symbol, BenchmarkTools.Trial}}:
 :sequential => 313.067 μs
 :threaded => 46.089 μs
 :mapreduce => 163.239 μs
 :mapreduce_tx => 85.089 μs
```
Better than a 3x speedup over `sequential`, but ThreadsX.jl still does not perform well compared to `threaded`.
Factor 2: task scheduler
A bit more trial and error revealed that the main difference may be due to how `@threads` and `@spawn` (used in ThreadsX.jl) interact with Julia's and the OS's task schedulers. In general, `@spawn` is more dynamic and has more freedom in scheduling. This is great for composing "nested" parallel programs, but for microbenchmarks of "shallow" parallel programs like the ones tested here, the overhead can manifest as in the result above. We can at least constrain the dynamism on the OS side by using `JULIA_EXCLUSIVE=1 julia -t4 script.jl` to start the script:
```
Threads.nthreads() = 4
n = 524288
4-element Vector{Pair{Symbol, BenchmarkTools.Trial}}:
 :sequential => 156.608 μs
 :threaded => 41.730 μs
 :mapreduce => 163.019 μs
 :mapreduce_tx => 51.740 μs
```
`mapreduce_tx` is now pretty close to `threaded`.
(Note: the above explanation is a bit hand-wavy, since I mixed up the dynamism in Julia's and the OS's task schedulers.)
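If setting the environment variable is inconvenient, a similar effect can be had from inside the session by pinning threads explicitly. A sketch using the third-party ThreadPinning.jl package (my suggestion, not something the script below requires):

```julia
using ThreadPinning

# Pin each Julia thread to its own CPU core so the OS scheduler cannot
# migrate them mid-benchmark (similar in spirit to JULIA_EXCLUSIVE=1).
pinthreads(:cores)
```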
Factor 3: constant overhead in Julia task
Currently, the scheduler requires a few μs up to ten μs of overhead for each task. So a computation that takes about 50 μs will contain a non-negligible contribution from the scheduler. Bumping the problem size to `n = 2^22` and running `JULIA_EXCLUSIVE=1 julia -t4 script.jl` shows me:
```
Threads.nthreads() = 4
n = 4194304
4-element Vector{Pair{Symbol, BenchmarkTools.Trial}}:
 :sequential => 2.533 ms
 :threaded => 885.732 μs
 :mapreduce => 2.544 ms
 :mapreduce_tx => 821.713 μs
```
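For a rough feel of the per-task overhead itself, one can time an empty `@spawn`/`fetch` round trip (a micro-benchmark sketch; the numbers are machine-dependent):

```julia
using BenchmarkTools

# Spawn a task that does nothing and wait for its result; the measured
# time is almost entirely scheduler overhead.
spawn_roundtrip() = fetch(Threads.@spawn nothing)

@btime spawn_roundtrip()  # typically a few μs per task
```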
tl;dr
So, it looks like the performance difference was coming from the algorithmic difference in the base case (Factor 1) and from the task system (Factors 2 & 3).
Here's the script I used:
```julia
using ThreadsX
using BenchmarkTools
using Random

# NaN-oblivious max; faster than `max`, which has to handle NaN correctly.
fmax(a, b) = ifelse(b > a, b, a)

function absmax(a)
    result = typemin(eltype(a))
    @inbounds for i in 1:length(a)
        result = fmax(result, abs(a[i]))
    end
    result
end

# Sequential base case over the index range b:e.
function absmax_range(a, b, e)
    result = typemin(eltype(a))
    @inbounds for i in b:e
        result = fmax(result, abs(a[i]))
    end
    result
end

# Split a length-l range into n (begin, end) chunks of roughly equal size.
function splitrange(l, n)
    chunksize = div(l + n - 1, n)
    bes = Array{Tuple{Int,Int}}(undef, n)
    b = 1
    for p in 1:n
        e = min(b + chunksize - 1, l)
        bes[p] = (b, e)
        b = e + 1
    end
    bes
end

# One chunk per thread, combined with a final sequential maximum.
function absmax_threaded(a)
    n = Threads.nthreads()
    presults = zeros(eltype(a), n)
    bes = splitrange(length(a), n)
    Threads.@threads for p in 1:n
        @inbounds presults[p] = absmax_range(a, bes[p]...)
    end
    maximum(presults)
end

absmax_mapreduce(x) = mapreduce(abs, fmax, x)
absmax_mapreduce_tx(x) = ThreadsX.mapreduce(abs, fmax, x; init = typemin(eltype(x)))
absmax_mapreduce_tx_simd(x) =
    ThreadsX.mapreduce(abs, fmax, x; init = typemin(eltype(x)), simd = Val(true))

function main()
    @show Threads.nthreads()
    n = 2^19
    # n = 2^22
    @show n
    Random.seed!(1234)
    x = rand(n) .- 0.5
    @assert absmax(x) == absmax_threaded(x)
    @assert absmax(x) == absmax_mapreduce(x)
    @assert absmax(x) == absmax_mapreduce_tx(x)
    # @assert absmax(x) == absmax_mapreduce_tx_simd(x)
    results = [
        :sequential => @benchmark absmax($x)
        :threaded => @benchmark absmax_threaded($x)
        :mapreduce => @benchmark absmax_mapreduce($x)
        :mapreduce_tx => @benchmark absmax_mapreduce_tx($x)
        # :mapreduce_tx_simd => @benchmark absmax_mapreduce_tx_simd($x)
    ]
    display(results)
    return results
end

results = main()
```
Hopefully, we can minimize the latter factors by improving Julia itself, for example via the "light-weight", more structured task system we are proposing here: https://github.com/JuliaLang/julia/pull/39773.