Innefficient paralellization? Need some help optimizing a simple dot product

Sorry for the confusion: BLAS does not benefit from a threaded inner product, so 464 μs holds for 1, 2, 4 and 8 threads; whereas Julia gets that speed only at 8 threads.

I’ve updated the code a bit, and benched it on a MacBook Air with 2 cores / 4 threads.

using BenchmarkTools

"""
    split(N, P, i) -> from:to
    
Find the ith range when 1:N is split into P consecutive parts of roughly equal size
"""
function split(N, P, i)
    base, rem = divrem(N, P)
    from = (i - 1) * base + min(rem, i - 1) + 1
    to = from + base - 1 + (i ≤ rem ? 1 : 0)
    from : to
end

function sdot(a::AbstractVector{T}, b::AbstractVector{T}) where {T}
    v = zero(T)
    @inbounds @simd for i = 1 : length(a)
        v += a[i] * b[i]
    end

    v
end

function pdot(a::AbstractVector{T}, b::AbstractVector{T}) where {T}
    # Local sum
    parts = zeros(Threads.nthreads())

    ccall(:jl_threading_run, Void, (Any,), function()
        P = Threads.threadid()
        N = Threads.nthreads()
        range = split(length(a), N, P)
        v = zero(T)
        @inbounds @simd for i = range
            v += a[i] * b[i]
        end
        parts[P] = v
    end)

    # Total sum
    return sum(parts)
end

function bench(n = 100_000)
    a = rand(n)
    b = rand(n)
    threads = parse(Int, ENV["JULIA_NUM_THREADS"])
    @show threads
    BLAS.set_num_threads(threads)

    # Sanity check
    @assert dot(a,b) ≈ pdot(a,b) ≈ sdot(a, b)

    print("BLAS (", threads, " threads):\t")
    println(@benchmark dot($a, $b))
    print("Julia (no threads):\t")
    println(@benchmark sdot($a, $b))
    print("Julia (", threads, " threads):\t")
    println(@benchmark pdot($a, $b))
end

bench()

Run it like:

$ JULIA_NUM_THREADS=4 julia -O3 pdot.jl

Results:

Julia (no threads):     Trial(55.923 μs)
BLAS (1 threads):       Trial(53.483 μs)
Julia (1 threads):      Trial(56.551 μs)
BLAS (2 threads):       Trial(53.450 μs)
Julia (2 threads):      Trial(30.773 μs)
BLAS (4 threads):       Trial(53.689 μs)
Julia (4 threads):      Trial(27.655 μs)

Surprisingly Julia beats BLAS with 2 and 4 threads! Can someone reproduce this? This is on Julia 0.6.2 with OpenBLAS.

2 Likes