~~I think you are running into this problem: https://discourse.julialang.org/t/innefficient-paralellization-need-some-help-optimizing-a-simple-dot-product/9723/19~~
For whatever reason, OpenBLAS does not multi-thread its dot product. Luckily, you can write your own dot product that outperforms OpenBLAS. (This is also why `dot` is so much faster with MKL for such a basic operation: it is multi-threaded by default.)
EDIT: This no longer seems to be the case; OpenBLAS is threaded. Make sure you are using `BLAS.set_num_threads(Sys.CPU_THREADS)`, or whatever is appropriate for your architecture.
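To check what you currently have before changing anything, a minimal sketch (assuming Julia ≥ 1.6, which added `BLAS.get_num_threads`):

```julia
using LinearAlgebra

# Query the current OpenBLAS thread count, then set it to the machine's
# logical core count (adjust if you prefer physical cores only).
@show BLAS.get_num_threads()
BLAS.set_num_threads(Sys.CPU_THREADS)
```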
```julia
using LinearAlgebra
using Distributed  # for the (internal) Distributed.splitrange

function pdot(a::AbstractVector{T}, b::AbstractVector{T}) where {T}
    N = Threads.nthreads()
    v = zeros(T, N)  # one partial sum per chunk
    ranges = Distributed.splitrange(length(a), N)
    Threads.@threads for i in 1:N
        x = zero(T)
        @inbounds @simd for j in ranges[i]  # SIMD-accumulate this chunk
            x += a[j] * b[j]
        end
        v[i] = x  # index by chunk, not threadid(), so task migration is safe
    end
    sum(v)  # combine the partial sums
end
```
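Note that `Distributed.splitrange` is an internal, unexported function whose signature has changed across Julia versions. A sketch of the same idea that computes the chunk boundaries by hand instead (`pdot2` is a hypothetical name, not part of any library):

```julia
using Base.Threads

# Chunked parallel dot product without Distributed.splitrange: each task
# accumulates its own partial sum over a hand-computed index range.
function pdot2(a::AbstractVector{T}, b::AbstractVector{T}) where {T}
    N = nthreads()
    v = zeros(T, N)
    n = length(a)
    @threads for t in 1:N
        lo = div((t - 1) * n, N) + 1  # first index of chunk t
        hi = div(t * n, N)            # last index of chunk t
        x = zero(T)
        @inbounds @simd for j in lo:hi
            x += a[j] * b[j]
        end
        v[t] = x
    end
    sum(v)
end
```

The integer-division chunking covers `1:n` exactly once for any thread count, including when `n` is not divisible by `N`.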
Benchmarks for 2 threads (start Julia with the environment variable `JULIA_NUM_THREADS=2` set):
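For reference, a sketch of two equivalent ways to launch Julia with 2 threads (the `--threads` flag assumes Julia ≥ 1.5):

```shell
# Via the environment variable:
JULIA_NUM_THREADS=2 julia

# Or via the command-line flag:
julia --threads 2
```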
```julia
using BenchmarkTools
a = randn(10000); b = randn(10000)
BLAS.set_num_threads(2)

@btime dot($a, $b)
# 1.570 μs (0 allocations: 0 bytes)
@btime pdot($a, $b)
# 1.270 μs (3 allocations: 272 bytes)
```
There is some performance left on the table, but this is an alternative way to get the speed you need. I only have 2 threads on this machine, but I imagine you may see a better speed-up on yours.
EDIT: With some more testing, it looks like OpenBLAS is threading the dot product:
```julia
a = randn(100000); b = randn(100000)

@btime dot($a, $b)
# 12.857 μs (0 allocations: 0 bytes)
@btime pdot($a, $b)
# 12.651 μs (3 allocations: 272 bytes)

BLAS.set_num_threads(1)
@btime dot($a, $b)
# 24.300 μs (0 allocations: 0 bytes)

BLAS.set_num_threads(2)
@btime dot($a, $b)
# 12.842 μs (0 allocations: 0 bytes)

BLAS.set_num_threads(4)
@btime dot($a, $b)
# 12.157 μs (0 allocations: 0 bytes)
```
`pdot` remains within a few percent of the BLAS call, so I think BLAS is using a threaded version now. Make sure you call `BLAS.set_num_threads`.