# Innefficient paralellization? Need some help optimizing a simple dot product

I have a very simple code I’d like to optimize, and I’m not sure I am getting the expected results. For the sake of simplicity I am optimizing a loop which computes the dot product. I have implemented two functions, a serial dot product, `sdot`, and a parallel one, `pdot`, which attempts to use idiomatic reduction:

``````function sdot(n, x, y)
a = 0 # This should be a=0. for type stability
@inbounds @fastmath @simd for i=1:n
a += x[i]*y[i]
end
a
end
function pdot(n, x, y)
@fastmath @parallel (+) for i=1:n
x[i]*y[i]
end
end
``````

I have timed running these functions 3 times for `x` and `y` vectors of size `10000000` (full code here). I also compare them to the built-in dot product. These are my results:

Function `@time`
`sdot ` 0.878811 seconds (90.00 M allocations: 1.341 GiB, 4.44% gc time)
`sdot ` (type stable) 0.034074 seconds (3 allocations: 48 bytes)
`pdot ` 1.068616 seconds (1.63 k allocations: 139.344 KiB)
`dot` 0.036030 seconds (3 allocations: 48 bytes)

`dot` is clearly the fastest, which is alright given that it relies on `ddot` from BLAS. What puzzles me is that my parallel implementation is always slower that my serial implementation. I understand that there is some overhead in `@parallel`, but I wouldn’t imagine it would be so high. I have changed the values of `n` through several different orders of magnitude, but `pdot` always loses. Interestingly, I have coded similar versions in Fortran, and the `OMP REDUCE` version is always fastest. Those codes can be found here.

A gist to reproduce my results can be found here. Any help is appreciated!

EDIT: I have added a type-stable `sdot` as per comments

1 Like

For `sdot`, read the performance tips in the manual, especially the one about type stability. The way you initialize a is not good.

1 Like

It’s not clear that you started julia with any workers, in which case `@parallel` is just adding overhead.

The performance tip for `sdot` is critically important. I changed `0` to `0.` in your gist and now `sdot` and `dot` have identical performance. What’s happening here is you are changing an Int to a Float64, so it’s not type stable.

Thanks @aaowens and @kristoffer.carlsson. Indeed by changing `a=0` to `a=0.`, I get the roughly the same performance in `sdot` as `dot` (if slightly faster).

Just to clarify, @Michael_Eastwood, I am running this script with `julia -p 4 bench_dot.jl`, so that should not be an issue. If I got roughly the same speed, I could work with that, but over 30x slower seems excessive.

Would be keen to hear any other ways which I can improve the parallel code!

You’re missing `@inbounds` in `pdot`

However, I don’t get it much faster either, the parallel version with `@inbounds` is still 10x as slow for large vectors. Maybe it’s not generating SIMD operations.

Note that you can barely parallellize the dot product since it is memory bound.

Where do I put the `@inbounds`? I’ve not been able to figure it out. Could you provide a snippet?

I have been able to get better performance parallelizing in Fortran, so I was hopeful I would get at least some reduction in execution time.

If the computation takes 0.03 seconds it is unlikely that distributing it onto multiple processes will give a speed up. Note that `@parallell` is distributed memory parallellization.

Julia’s `@parallel` uses multi-processing, i.e. distributed-memory parallelism. On the other hand, Fortran’s OpenMP uses multi-threading, i.e. shared-memory parallelism. That makes a huge difference in your example, because Julia’s multiple processes must communicate a lot, which is a lot of work compared to your actual computation.

The dot product doesn’t benefit much from multiple cores anyway, because its computation is memory-bound. At most you can expect a speedup according to how many memory channels you have. I think that’s what you’re seeing in your Fortran/OpenMP benchmark, where the parallel version gives a speedup of less than two. (So I guess you have two memory channels?) Edit: The memory-channel argument only applies if the data in question doesn’t fit into the CPU cache. For Float64 vectors of 1,000,000 elements (i.e. 8 MB of data) a modern “big” CPU can have enough L3 cache to hold two of those. Depending on how the cores are connected to the cache, one can then except to get a much better speedup. (Even superlinear speedup if the cores have separate caches.)

Julia’s `@parallel` will only give a benefit if the work done for each iteration is substantially larger than the communcation work.

2 Likes

Exactly. I’d just add that for multithreading in Julia, use `Threads.@threads`.

But `@threads` is not a drop-in replacement for a parallel reduce. Doing something like

``````values = zeros(Threads.nthreads())
end
``````

is not performant. Something like this is fast (except it probably cannot do SIMD cause of the cyclic distribution):

``````function threads_dot(a::AbstractVector{T}, b::AbstractVector{T}) where {T}
if length(a) != length(b)
throw(DimensionMismatch())
end

# Local sum

v = zero(T) # this makes things fast
@inbounds for i = id:step:length(a)
v += a[i] * b[i]
end
parts[id] = v
end)

# Total sum
return sum(parts)
end
``````
3 Likes

You forgot an important thing; the arrays must be declared as `SharedArray`s so that the arrays can be available to every worker, even for this simple calculation the parallel version is slightly better. Here are my results:

``````function sdot(n, x, y)
a = 0.0
@inbounds @fastmath for i=1:n
a += x[i]*y[i]
end
a
end
@everywhere @inbounds @fastmath function pdot(n, x, y)
a = @parallel (+) for i=1:n
x[i]*y[i]
end
end

n = 10^7
x = SharedArray{Float64,1}( ones(n) )
y = SharedArray{Float64,1}( ones(n) )

println("Naive Julia")
println(@sprintf("  %.2f", sdot(n, x, y)))
@time for i=1:3
sdot(n, x, y)
end

println("\nNative Julia")
println(@sprintf("  %.2f", dot(x, y)))
@time for i=1:3
dot(x, y)
end

println("\nIdiomatic parallel Julia")
println(@sprintf("  %.2f",  pdot(n, x, y)))
@time for i=1:3
pdot(n, x, y)
end
``````

The timings:

``````Naive Julia
10000000.00
0.027176 seconds (3 allocations: 48 bytes)

Native Julia
10000000.00
0.027156 seconds (3 allocations: 48 bytes)

Idiomatic parallel Julia
10000000.00
0.025830 seconds (4.49 k allocations: 356.578 KiB)
[Finished in 4.5s]
``````

The difference becomes clearer for larger inputs (say n = 10^9):

``````Naive Julia
1000000000.00
4.235259 seconds (3 allocations: 48 bytes)

Native Julia
1000000000.00
3.088961 seconds (3 allocations: 48 bytes)

Idiomatic parallel Julia
1000000000.00
2.282985 seconds (4.47 k allocations: 355.719 KiB)
[Finished in 29.5s]
``````

Yeah, but your solution is still using Julia’s threading. I agree that there’s a tooling issue, specifically that an helpful method is missing, but the answer is still to use threading in some sense.

That seems fishy. Sharing more data around should make things slower. I think in your example it just moves the data sharing outside of the loop to the SharedArray construction whereas `@parallel` has to send it to do the mapping operation?

So, if you go for threading and a block distribution, then you’d get something like this

``````using BenchmarkTools

# Local sum

# I'm too lazy to compute the local range here
range = Distributed.splitrange(length(a), N)[P]
v = zero(T)
@inbounds @simd for i = range
v += a[i] * b[i]
end
parts[P] = v
end)

# Total sum
return sum(parts)
end

function bench()
@benchmark threads_dot(a, b) setup = (a = rand(1_000_000); b = rand(1_000_000))
end
``````

Results:

While `dot` with BLAS threading takes `464.846 μs`. Any idea why this has so much overhead?

1 Like

Yeah, with `@threads` there seems to always be some reason to have to manually split up the workload over the threads.

Maybe `@threads` in its current form is just not the right abstraction? At the very least, a `mapreduce`-style version of `@threads` similar to `@parallel` would be pretty useful I think.

2 Likes

Just to get this straight, are you asking about the overhead of BLAS `dot`? How many threads is BLAS using?

By the way, you could use a `Threads.Atomic` instead of `parts`, although it probably doesn’t make much of a difference.

I meant the overhead in the Julia code: only at 8 threads do I match BLAS. What’s going on in the 1, 2 or 4 threads case?

Edit: now testing on my Macbook and I don’t see the overhead for some reason.

Well how many threads is BLAS using? If it’s 8, then Julia would actually be doing better than BLAS. Try experimenting with `BLAS.set_num_threads`.

Sorry for the confusion: BLAS does not benefit from a threaded inner product, so 464 μs holds for 1, 2, 4 and 8 threads; whereas Julia gets that speed only at 8 threads.

I’ve updated the code a bit, and benched it on a MacBook Air with 2 cores / 4 threads.

``````using BenchmarkTools

"""
split(N, P, i) -> from:to

Find the ith range when 1:N is split into P consecutive parts of roughly equal size
"""
function split(N, P, i)
base, rem = divrem(N, P)
from = (i - 1) * base + min(rem, i - 1) + 1
to = from + base - 1 + (i ≤ rem ? 1 : 0)
from : to
end

function sdot(a::AbstractVector{T}, b::AbstractVector{T}) where {T}
v = zero(T)
@inbounds @simd for i = 1 : length(a)
v += a[i] * b[i]
end

v
end

function pdot(a::AbstractVector{T}, b::AbstractVector{T}) where {T}
# Local sum

range = split(length(a), N, P)
v = zero(T)
@inbounds @simd for i = range
v += a[i] * b[i]
end
parts[P] = v
end)

# Total sum
return sum(parts)
end

function bench(n = 100_000)
a = rand(n)
b = rand(n)

# Sanity check
@assert dot(a,b) ≈ pdot(a,b) ≈ sdot(a, b)

println(@benchmark dot(\$a, \$b))
println(@benchmark sdot(\$a, \$b))
println(@benchmark pdot(\$a, \$b))
end

bench()
``````

Run it like:

``````\$ JULIA_NUM_THREADS=4 julia -O3 pdot.jl
``````

Results:

``````Julia (no threads):     Trial(55.923 μs)
``````

Surprisingly Julia beats BLAS with 2 and 4 threads! Can someone reproduce this? This is on Julia 0.6.2 with OpenBLAS.

2 Likes

Oh wow, that’s surprising to me. Some results on my machine (0.6.2, Linux):

``````Julia (no threads):     Trial(35.786 μs)