ThreadsX mapreduce performance

There seem to be a couple of independent factors contributing to the performance difference.

Factor 1: < vs isless

First of all, abs(a[i])>result && (result=abs(a[i])) used in absmax of the OP is faster than max. This is because < does not define a total ordering in the presence of NaN, and max/isless do extra work to do the right thing:

julia> isless(NaN, 0.0)
false

julia> isless(0.0, NaN)
true

julia> NaN < 0.0
false

julia> 0.0 < NaN
false
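
Accordingly, max (which builds on this careful ordering) propagates NaN, which costs extra checks per comparison:

julia> max(NaN, 0.0)
NaN

julia> max(0.0, NaN)
NaN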

A fair comparison would be to use

fmax(a, b) = ifelse(b > a, b, a)

both in absmax and the mapreduce examples. So, absmax would look like

function absmax(a)
    result = typemin(eltype(a))
    @inbounds for i in 1:length(a)
        result = fmax(result, abs(a[i]))
    end
    result
end
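
Note that fmax buys its speed by giving up the NaN handling shown above; it silently drops a NaN in its second argument:

julia> fmax(0.0, NaN)
0.0

That is fine here since the input never contains NaN, but it is also part of why comparing the OP's loop against max was not apples-to-apples.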

(Nitpick: I fixed the initial value of result to typemin(eltype(a)). The original definition using zero is not correct unless a is known to contain non-negative numbers. Alternatively, result = -0.5 is a good initial value since it is the known lower bound of a.)

To use fmax with ThreadsX.jl, I added the following implementation:

absmax_mapreduce_tx(x) = ThreadsX.mapreduce(abs,fmax,x; init = typemin(eltype(x)))

Unfortunately, ThreadsX.mapreduce(abs,fmax,x) cannot be used ATM since the initial value (identity) of fmax is not known to ThreadsX.jl. Transducers.jl already has the facility to support this, but I have to think about the impact of using it here.
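
For the curious, the facility is presumably InitialValues.jl, which Transducers.jl builds on. A minimal sketch of how an identity for fmax could be declared (ThreadsX.mapreduce does not pick this up automatically yet, hence the explicit init above):

using InitialValues

fmax(a, b) = ifelse(b > a, b, a)

InitialValues.@def fmax   # declare that fmax(Init(fmax), x) === x

fmax(Init(fmax), 1.0)     # => 1.0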

With these changes, running the script (see below) with julia -t4 script.jl prints

Threads.nthreads() = 4
n = 524288
4-element Vector{Pair{Symbol, BenchmarkTools.Trial}}:
   :sequential => 313.067 μs
     :threaded => 46.089 μs
    :mapreduce => 163.239 μs
 :mapreduce_tx => 85.089 μs

Better than a 3x speedup, but ThreadsX.jl still does not perform as well as the manual @threads version.

Factor 2: task scheduler

A bit more trial and error revealed that the main difference may be due to how @threads and @spawn (used in ThreadsX.jl) interact with Julia’s and the OS’s task schedulers. In general, @spawn is more dynamic and has more freedom in scheduling. This is great for composing “nested” parallel programs, but for micro-benchmarks of “shallow” parallel programs like the ones tested here, the overhead shows up in the results above (a sketch contrasting the two styles follows the results below). We can at least constrain the dynamism on the OS side by starting the script with JULIA_EXCLUSIVE=1 julia -t4 script.jl:

Threads.nthreads() = 4
n = 524288
4-element Vector{Pair{Symbol, BenchmarkTools.Trial}}:
   :sequential => 156.608 μs
     :threaded => 41.730 μs
    :mapreduce => 163.019 μs
 :mapreduce_tx => 51.740 μs

mapreduce_tx is now pretty close to threaded.

(Note: the above explanation is a bit hand-wavy since I conflated the dynamism in Julia’s and the OS’s task schedulers.)
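
To make the contrast concrete, here is a rough sketch (not ThreadsX.jl’s actual implementation) of the @spawn-based divide-and-conquer style it uses internally, written in terms of absmax_range and fmax from the script below:

function absmax_spawn(a, lo=1, hi=length(a); basesize=2^15)
    # Base case: small enough to scan sequentially.
    hi - lo + 1 <= basesize && return absmax_range(a, lo, hi)
    mid = (lo + hi) >> 1
    # The left half becomes a task the scheduler may run on any thread;
    # the right half runs in the current task.
    left = Threads.@spawn absmax_spawn(a, lo, mid; basesize = basesize)
    right = absmax_spawn(a, mid + 1, hi; basesize = basesize)
    return fmax(fetch(left), right)
end

Each @spawn is a point where the scheduler can migrate work between threads; that freedom is what makes this style compose, and also what shows up as overhead in a shallow benchmark like this one.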

Factor 3: constant overhead in Julia task

Currently, the scheduler incurs a few μs to ten μs of overhead per task, so a computation that takes about 50 μs sees a non-negligible effect from the scheduler (a quick way to observe this overhead directly follows the results below). Bumping the problem size to n = 2^22 and running JULIA_EXCLUSIVE=1 julia -t4 script.jl shows me

Threads.nthreads() = 4
n = 4194304
4-element Vector{Pair{Symbol, BenchmarkTools.Trial}}:
   :sequential => 2.533 ms
     :threaded => 885.732 μs
    :mapreduce => 2.544 ms
 :mapreduce_tx => 821.713 μs
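
As a quick way to observe the per-task overhead itself (numbers vary by machine, but a few μs is typical, consistent with the estimate above):

julia> using BenchmarkTools

julia> @btime fetch(Threads.@spawn nothing);

At n = 2^19 the per-task work is only tens of μs, so this overhead is a noticeable fraction of the total; at n = 2^22 each task runs roughly 8x longer and the overhead amortizes away.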

tl;dr

So, it looks like the performance difference was coming from the algorithmic difference in the base case (Factor 1) and from the task system (Factors 2 & 3).

Here’s the script I used:

script
using ThreadsX
using BenchmarkTools
using Random

fmax(a, b) = ifelse(b > a, b, a)

function absmax(a)
    result = typemin(eltype(a))
    @inbounds for i in 1:length(a)
        result = fmax(result, abs(a[i]))
    end
    result
end

function absmax_range(a,b,e)
    result = typemin(eltype(a))
    @inbounds for i in b:e
        result = fmax(result, abs(a[i]))
    end
    result
end

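# Split 1:l into n contiguous (begin, end) index pairs of roughly equal size.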
function splitrange(l,n)
    chunksize=div(l+n-1,n)
    bes=Array{Tuple{Int,Int}}(undef,n)
    b=1
    for p=1:n
        e=min(b+chunksize-1,l)
        bes[p]=(b,e)
        b=e+1
    end
    bes
end

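# One chunk per thread, statically scheduled with @threads;
# per-chunk results are combined with maximum at the end.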
function absmax_threaded(a)
    n=Threads.nthreads()
    presults=zeros(eltype(a),n)
    bes=splitrange(length(a),n)
    Threads.@threads for p in 1:n
        @inbounds presults[p]=absmax_range(a,bes[p]...)
    end
    maximum(presults)
end

absmax_mapreduce(x) = mapreduce(abs,fmax,x)
absmax_mapreduce_tx(x) = ThreadsX.mapreduce(abs,fmax,x; init = typemin(eltype(x)))
absmax_mapreduce_tx_simd(x) =
    ThreadsX.mapreduce(abs,fmax,x; init = typemin(eltype(x)), simd = Val(true))

function main()
    @show Threads.nthreads()
    n=2^19
    # n=2^22
    @show n
    Random.seed!(1234)
    x = rand(n) .- 0.5
    @assert absmax(x)==absmax_threaded(x)
    @assert absmax(x)==absmax_mapreduce(x)
    @assert absmax(x)==absmax_mapreduce_tx(x)
    # @assert absmax(x)==absmax_mapreduce_tx_simd(x)

    results = [
        :sequential => @benchmark absmax($x)
        :threaded => @benchmark absmax_threaded($x)
        :mapreduce => @benchmark absmax_mapreduce($x)
        :mapreduce_tx => @benchmark absmax_mapreduce_tx($x)
        # :mapreduce_tx_simd => @benchmark absmax_mapreduce_tx_simd($x)
    ]
    display(results)
    return results
end

results = main()

Hopefully, we can minimize the latter factors by improving Julia, for example via the “light-weight”, more structured task system we are proposing here: https://github.com/JuliaLang/julia/pull/39773.
