Sorry for the confusion: BLAS does not benefit from a threaded inner product, so 464 μs holds for 1, 2, 4 and 8 threads; whereas Julia gets that speed only at 8 threads.
I’ve updated the code a bit, and benched it on a MacBook Air with 2 cores / 4 threads.
using BenchmarkTools
"""
split(N, P, i) -> from:to
Find the ith range when 1:N is split into P consecutive parts of roughly equal size
"""
function split(N, P, i)
base, rem = divrem(N, P)
from = (i - 1) * base + min(rem, i - 1) + 1
to = from + base - 1 + (i ≤ rem ? 1 : 0)
from : to
end
function sdot(a::AbstractVector{T}, b::AbstractVector{T}) where {T}
v = zero(T)
@inbounds @simd for i = 1 : length(a)
v += a[i] * b[i]
end
v
end
function pdot(a::AbstractVector{T}, b::AbstractVector{T}) where {T}
# Local sum
parts = zeros(Threads.nthreads())
ccall(:jl_threading_run, Void, (Any,), function()
P = Threads.threadid()
N = Threads.nthreads()
range = split(length(a), N, P)
v = zero(T)
@inbounds @simd for i = range
v += a[i] * b[i]
end
parts[P] = v
end)
# Total sum
return sum(parts)
end
function bench(n = 100_000)
a = rand(n)
b = rand(n)
threads = parse(Int, ENV["JULIA_NUM_THREADS"])
@show threads
BLAS.set_num_threads(threads)
# Sanity check
@assert dot(a,b) ≈ pdot(a,b) ≈ sdot(a, b)
print("BLAS (", threads, " threads):\t")
println(@benchmark dot($a, $b))
print("Julia (no threads):\t")
println(@benchmark sdot($a, $b))
print("Julia (", threads, " threads):\t")
println(@benchmark pdot($a, $b))
end
bench()
Run it like:
$ JULIA_NUM_THREADS=4 julia -O3 pdot.jl
Results:
Julia (no threads): Trial(55.923 μs)
BLAS (1 threads): Trial(53.483 μs)
Julia (1 threads): Trial(56.551 μs)
BLAS (2 threads): Trial(53.450 μs)
Julia (2 threads): Trial(30.773 μs)
BLAS (4 threads): Trial(53.689 μs)
Julia (4 threads): Trial(27.655 μs)
Surprisingly Julia beats BLAS with 2 and 4 threads! Can someone reproduce this? This is on Julia 0.6.2 with OpenBLAS.