Innefficient paralellization? Need some help optimizing a simple dot product

stabbles · March 15, 2018, 7:09pm

Sorry for the confusion: BLAS does not benefit from a threaded inner product, so 464 μs holds for 1, 2, 4 and 8 threads; whereas Julia gets that speed only at 8 threads.

I’ve updated the code a bit, and benched it on a MacBook Air with 2 cores / 4 threads.

using BenchmarkTools

"""
    split(N, P, i) -> from:to
    
Find the ith range when 1:N is split into P consecutive parts of roughly equal size
"""
function split(N, P, i)
    base, rem = divrem(N, P)
    from = (i - 1) * base + min(rem, i - 1) + 1
    to = from + base - 1 + (i ≤ rem ? 1 : 0)
    from : to
end

function sdot(a::AbstractVector{T}, b::AbstractVector{T}) where {T}
    v = zero(T)
    @inbounds @simd for i = 1 : length(a)
        v += a[i] * b[i]
    end

    v
end

function pdot(a::AbstractVector{T}, b::AbstractVector{T}) where {T}
    # Local sum
    parts = zeros(Threads.nthreads())

    ccall(:jl_threading_run, Void, (Any,), function()
        P = Threads.threadid()
        N = Threads.nthreads()
        range = split(length(a), N, P)
        v = zero(T)
        @inbounds @simd for i = range
            v += a[i] * b[i]
        end
        parts[P] = v
    end)

    # Total sum
    return sum(parts)
end

function bench(n = 100_000)
    a = rand(n)
    b = rand(n)
    threads = parse(Int, ENV["JULIA_NUM_THREADS"])
    @show threads
    BLAS.set_num_threads(threads)

    # Sanity check
    @assert dot(a,b) ≈ pdot(a,b) ≈ sdot(a, b)

    print("BLAS (", threads, " threads):\t")
    println(@benchmark dot($a, $b))
    print("Julia (no threads):\t")
    println(@benchmark sdot($a, $b))
    print("Julia (", threads, " threads):\t")
    println(@benchmark pdot($a, $b))
end

bench()

Run it like:

$ JULIA_NUM_THREADS=4 julia -O3 pdot.jl

Results:

Julia (no threads):     Trial(55.923 μs)
BLAS (1 threads):       Trial(53.483 μs)
Julia (1 threads):      Trial(56.551 μs)
BLAS (2 threads):       Trial(53.450 μs)
Julia (2 threads):      Trial(30.773 μs)
BLAS (4 threads):       Trial(53.689 μs)
Julia (4 threads):      Trial(27.655 μs)

Surprisingly Julia beats BLAS with 2 and 4 threads! Can someone reproduce this? This is on Julia 0.6.2 with OpenBLAS.

Topic		Replies	Views
What's the problem with this simple multi-thread code? General Usage question	17	1228	March 11, 2022
Sum operations between arrays Performance	21	5766	April 7, 2020
Slower @threads than serial for array computations Julia at Scale	26	2802	May 7, 2020
How to perform parallel vector addition? New to Julia question	24	2032	December 27, 2022
Speeding up the multiplying, adding, subtracting of 3D matrices Numerics question	16	799	June 24, 2023

Innefficient paralellization? Need some help optimizing a simple dot product

Related topics