Inefficient parallelization? Need some help optimizing a simple dot product

I have a very simple piece of code I’d like to optimize, and I’m not getting the results I expected. For the sake of simplicity, I am optimizing a loop which computes the dot product. I have implemented two functions: a serial dot product, sdot, and a parallel one, pdot, which attempts to use an idiomatic parallel reduction:

function sdot(n, x, y)
    a = 0 # This should be a=0. for type stability
    @inbounds @fastmath @simd for i=1:n
        a += x[i]*y[i]
    end
    a
end
function pdot(n, x, y)
    @fastmath @parallel (+) for i=1:n
        x[i]*y[i]
    end
end

I have timed running these functions 3 times for x and y vectors of size 10000000 (full code here). I also compare them to the built-in dot product. These are my results:

Function             @time
sdot                 0.878811 seconds (90.00 M allocations: 1.341 GiB, 4.44% gc time)
sdot (type stable)   0.034074 seconds (3 allocations: 48 bytes)
pdot                 1.068616 seconds (1.63 k allocations: 139.344 KiB)
dot                  0.036030 seconds (3 allocations: 48 bytes)

dot is clearly the fastest, which is alright given that it relies on ddot from BLAS. What puzzles me is that my parallel implementation is always slower than my serial implementation. I understand that there is some overhead in @parallel, but I wouldn’t have imagined it would be so high. I have changed the values of n through several different orders of magnitude, but pdot always loses. Interestingly, I have coded similar versions in Fortran, and the OpenMP REDUCE version is always fastest. Those codes can be found here.

A gist to reproduce my results can be found here. Any help is appreciated!

EDIT: I have added a type-stable sdot as per comments

For sdot, read the performance tips in the manual, especially the one about type stability. The way you initialize a is not good.
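
For example, a type-stable way to write the initialization (just a sketch of the fix being hinted at; a = 0.0 works equally well for Float64 inputs):

function sdot(n, x, y)
    a = zero(eltype(x))   # accumulator starts with the element type of x, not Int
    @inbounds @fastmath @simd for i = 1:n
        a += x[i]*y[i]
    end
    a
end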

It’s not clear that you started Julia with any workers; if you didn’t, @parallel is just adding overhead.
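
A quick way to check is the worker count, e.g. from the REPL or at the top of your script:

nworkers()   # should report more than 1, e.g. 4 when started with `julia -p 4`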

The performance tip for sdot is critically important. I changed 0 to 0. in your gist and now sdot and dot have identical performance. What’s happening is that a starts out as an Int and is then assigned a Float64 inside the loop, so the variable is not type stable.
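
If you want to see the instability directly, @code_warntype will flag the accumulator as a non-concrete type before the fix (a diagnostic only, not a change to the code):

@code_warntype sdot(n, x, y)   # before the fix, the accumulator is inferred as a Union / non-concrete type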

Thanks @aaowens and @kristoffer.carlsson. Indeed, by changing a=0 to a=0., I get roughly the same performance in sdot as in dot (if anything, slightly faster).

Just to clarify, @Michael_Eastwood, I am running this script with julia -p 4 bench_dot.jl, so that should not be an issue. If I got roughly the same speed, I could work with that, but over 30x slower seems excessive.

I’d be keen to hear any other ways in which I can improve the parallel code!

You’re missing @inbounds in pdot

However, I don’t get it much faster either: the parallel version with @inbounds is still 10x as slow for large vectors. Maybe it’s not generating SIMD instructions.

Note that you can barely parallelize the dot product, since it is memory-bound.

Where do I put the @inbounds? I’ve not been able to figure it out. Could you provide a snippet?

I have been able to get better performance parallelizing in Fortran, so I was hopeful I would get at least some reduction in execution time.

If the computation takes 0.03 seconds, it is unlikely that distributing it onto multiple processes will give a speedup. Note that @parallel is distributed-memory parallelization.

Julia’s @parallel uses multi-processing, i.e. distributed-memory parallelism. On the other hand, Fortran’s OpenMP uses multi-threading, i.e. shared-memory parallelism. That makes a huge difference in your example, because Julia’s multiple processes must communicate a lot, which is a lot of work compared to your actual computation.

The dot product doesn’t benefit much from multiple cores anyway, because its computation is memory-bound. At most you can expect a speedup according to how many memory channels you have. I think that’s what you’re seeing in your Fortran/OpenMP benchmark, where the parallel version gives a speedup of less than two. (So I guess you have two memory channels?) Edit: The memory-channel argument only applies if the data in question doesn’t fit into the CPU cache. For Float64 vectors of 1,000,000 elements (i.e. 8 MB of data) a modern “big” CPU can have enough L3 cache to hold two of those. Depending on how the cores are connected to the cache, one can then expect to get a much better speedup. (Even a superlinear speedup if the cores have separate caches.)

Julia’s @parallel will only give a benefit if the work done for each iteration is substantially larger than the communication work.
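
A sketch of the kind of loop where @parallel does pay off (Julia 0.6 syntax, with a deliberately expensive per-iteration kernel made up for illustration):

s = @parallel (+) for i = 1:10^5
    # each iteration does on the order of a thousand floating-point operations,
    # so the cost of shipping the work to the workers is amortized
    sum(sin(i + k) for k = 1:1000)
end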

Exactly. I’d just add that for multithreading in Julia, use Threads.@threads.

But @threads is not a drop-in replacement for a parallel reduce. Doing something like

values = zeros(Threads.nthreads())
Threads.@threads for i = 1 : n
  values[Threads.threadid()] += a[i] * b[i]
end

is not performant: every iteration writes to the shared values array, so the accumulator never stays in a register and the threads keep contending for the same cache lines. Something like this is fast instead (except it probably cannot do SIMD because of the cyclic distribution):

function threads_dot(a::AbstractVector{T}, b::AbstractVector{T}) where {T}
    if length(a) != length(b)
        throw(DimensionMismatch())
    end

    # Local sum
    parts = zeros(Threads.nthreads())

    ccall(:jl_threading_run, Ref{Void}, (Any,), function()
        id = Threads.threadid()
        step = Threads.nthreads()
        v = zero(T) # this makes things fast
        @inbounds for i = id:step:length(a)
            v += a[i] * b[i]
        end
        parts[id] = v
    end)

    # Total sum
    return sum(parts)
end

You forgot an important thing: the arrays must be declared as SharedArrays so that they are available to every worker. With that change, even for this simple calculation the parallel version is slightly better. Here are my results:

function sdot(n, x, y)
    a = 0.0
    @inbounds @fastmath for i=1:n
        a += x[i]*y[i]   
    end
    a
end
@everywhere @inbounds @fastmath function pdot(n, x, y)
    a = @parallel (+) for i=1:n
        x[i]*y[i]
    end
end

addprocs(7)
n = 10^7
x = SharedArray{Float64,1}( ones(n) ) 
y = SharedArray{Float64,1}( ones(n) ) 

println("Naive Julia")
println(@sprintf("  %.2f", sdot(n, x, y)))
@time for i=1:3
    sdot(n, x, y)
end

println("\nNative Julia")
println(@sprintf("  %.2f", dot(x, y)))
@time for i=1:3
    dot(x, y)
end

println("\nIdiomatic parallel Julia")
println(@sprintf("  %.2f",  pdot(n, x, y)))
@time for i=1:3
    pdot(n, x, y)
end

The timings:

Naive Julia
  10000000.00
  0.027176 seconds (3 allocations: 48 bytes)

Native Julia
  10000000.00
  0.027156 seconds (3 allocations: 48 bytes)

Idiomatic parallel Julia
  10000000.00
  0.025830 seconds (4.49 k allocations: 356.578 KiB)
[Finished in 4.5s]

The difference becomes clearer for larger inputs (say n = 10^9):

Naive Julia
  1000000000.00
  4.235259 seconds (3 allocations: 48 bytes)

Native Julia
  1000000000.00
  3.088961 seconds (3 allocations: 48 bytes)

Idiomatic parallel Julia
  1000000000.00
  2.282985 seconds (4.47 k allocations: 355.719 KiB)
[Finished in 29.5s]

Yeah, but your solution is still using Julia’s threading. I agree that there’s a tooling issue, specifically that a helpful method is missing, but the answer is still to use threading in some sense.

That seems fishy. Sharing more data around should make things slower. I think in your example it just moves the data sharing outside of the loop to the SharedArray construction, whereas @parallel has to send the data each time it does the mapping operation?

So, if you go for threading and a block distribution, then you’d get something like this

using BenchmarkTools

function threads_dot(a::AbstractVector{T}, b::AbstractVector{T}) where {T}
    # Local sum
    parts = zeros(Threads.nthreads())

    ccall(:jl_threading_run, Ref{Void}, (Any,), function()
        P = Threads.threadid()
        N = Threads.nthreads()
        # I'm too lazy to compute the local range here
        range = Distributed.splitrange(length(a), N)[P]
        v = zero(T)
        @inbounds @simd for i = range
            v += a[i] * b[i]
        end
        parts[P] = v
    end)

    # Total sum
    return sum(parts)
end

function bench()
    @benchmark threads_dot(a, b) setup = (a = rand(1_000_000); b = rand(1_000_000))
end

Results:

  • 1 thread: 2126 μs
  • 2 threads: 1066 μs
  • 4 threads: 574.941 μs
  • 8 threads: 447.798 μs

For comparison, dot with BLAS threading takes 464.846 μs. Any idea why this has so much overhead?

Yeah, with @threads there always seems to be some reason to manually split up the workload over the threads.

Maybe @threads in its current form is just not the right abstraction? At the very least, a mapreduce-style version of @threads similar to @parallel would be pretty useful I think.
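
Something like this hypothetical helper is what I mean; it is not in Base, just a sketch of a mapreduce-style threaded reduction:

function tmapreduce(f, op, v0, a::AbstractVector)
    parts = fill(v0, Threads.nthreads())
    Threads.@threads for t = 1:Threads.nthreads()
        acc = v0
        # cyclic distribution, same SIMD caveat as above
        @inbounds for i = t:Threads.nthreads():length(a)
            acc = op(acc, f(a[i]))
        end
        parts[t] = acc
    end
    reduce(op, v0, parts)   # 0.6-style reduce signature
end

# e.g. tmapreduce(abs2, +, 0.0, a) would compute a threaded sum of squares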

Just to get this straight, are you asking about the overhead of BLAS dot? How many threads is BLAS using?

By the way, you could use a Threads.Atomic instead of parts, although it probably doesn’t make much of a difference.
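
A sketch of what that might look like (assuming Threads.atomic_add! for the element type, e.g. Float64; there is only one atomic update per thread, so contention is negligible):

function atomic_dot(a::AbstractVector{T}, b::AbstractVector{T}) where {T}
    total = Threads.Atomic{T}(zero(T))
    Threads.@threads for t = 1:Threads.nthreads()
        v = zero(T)
        # cyclic distribution over the threads, as in the earlier examples
        @inbounds for i = t:Threads.nthreads():length(a)
            v += a[i] * b[i]
        end
        Threads.atomic_add!(total, v)   # single atomic update per thread
    end
    total[]
end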

I meant the overhead in the Julia code: only at 8 threads do I match BLAS. What’s going on in the 1, 2 or 4 threads case?

Edit: now testing on my Macbook and I don’t see the overhead for some reason.

Well how many threads is BLAS using? If it’s 8, then Julia would actually be doing better than BLAS. Try experimenting with BLAS.set_num_threads.

Sorry for the confusion: BLAS does not benefit from a threaded inner product, so 464 μs holds for 1, 2, 4, and 8 threads, whereas Julia only reaches that speed at 8 threads.

I’ve updated the code a bit, and benched it on a MacBook Air with 2 cores / 4 threads.

using BenchmarkTools

"""
    split(N, P, i) -> from:to
    
Find the ith range when 1:N is split into P consecutive parts of roughly equal size
"""
function split(N, P, i)
    base, rem = divrem(N, P)
    from = (i - 1) * base + min(rem, i - 1) + 1
    to = from + base - 1 + (i ≤ rem ? 1 : 0)
    from : to
end

function sdot(a::AbstractVector{T}, b::AbstractVector{T}) where {T}
    v = zero(T)
    @inbounds @simd for i = 1 : length(a)
        v += a[i] * b[i]
    end

    v
end

function pdot(a::AbstractVector{T}, b::AbstractVector{T}) where {T}
    # Local sum
    parts = zeros(Threads.nthreads())

    ccall(:jl_threading_run, Void, (Any,), function()
        P = Threads.threadid()
        N = Threads.nthreads()
        range = split(length(a), N, P)
        v = zero(T)
        @inbounds @simd for i = range
            v += a[i] * b[i]
        end
        parts[P] = v
    end)

    # Total sum
    return sum(parts)
end

function bench(n = 100_000)
    a = rand(n)
    b = rand(n)
    threads = parse(Int, ENV["JULIA_NUM_THREADS"])
    @show threads
    BLAS.set_num_threads(threads)

    # Sanity check
    @assert dot(a,b) ≈ pdot(a,b) ≈ sdot(a, b)

    print("BLAS (", threads, " threads):\t")
    println(@benchmark dot($a, $b))
    print("Julia (no threads):\t")
    println(@benchmark sdot($a, $b))
    print("Julia (", threads, " threads):\t")
    println(@benchmark pdot($a, $b))
end

bench()

Run it like:

$ JULIA_NUM_THREADS=4 julia -O3 pdot.jl
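
For reference, the split helper above partitions 1:N into contiguous, non-overlapping chunks of roughly equal size; these values follow directly from its definition:

split(10, 3, 1) == 1:4
split(10, 3, 2) == 5:7
split(10, 3, 3) == 8:10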

Results:

Julia (no threads):     Trial(55.923 μs)
BLAS (1 threads):       Trial(53.483 μs)
Julia (1 threads):      Trial(56.551 μs)
BLAS (2 threads):       Trial(53.450 μs)
Julia (2 threads):      Trial(30.773 μs)
BLAS (4 threads):       Trial(53.689 μs)
Julia (4 threads):      Trial(27.655 μs)

Surprisingly Julia beats BLAS with 2 and 4 threads! Can someone reproduce this? This is on Julia 0.6.2 with OpenBLAS.

Oh wow, that’s surprising to me. Some results on my machine (0.6.2, Linux):

Julia (no threads):     Trial(35.786 μs)

BLAS (1 threads):       Trial(35.793 μs)
Julia (1 threads):      Trial(36.234 μs)

BLAS (2 threads):       Trial(35.731 μs)
Julia (2 threads):      Trial(19.571 μs)

BLAS (4 threads):       Trial(35.775 μs)
Julia (4 threads):      Trial(10.698 μs)

BLAS (8 threads):       Trial(35.772 μs)
Julia (8 threads):      Trial(5.150 μs)

BLAS (16 threads):      Trial(35.782 μs)
Julia (16 threads):     Trial(4.514 μs)

BLAS (32 threads):      Trial(35.760 μs)
Julia (32 threads):     Trial(4.050 μs)

BLAS (64 threads):      Trial(35.819 μs)
Julia (64 threads):     Trial(4.122 μs)

Edit: added more results.
