How to implant Julia 1.0 code:Matrix{Vector{T}}(undef, M, N) in Julia 0.6?

~~I think you are running into this problem: https://discourse.julialang.org/t/innefficient-paralellization-need-some-help-optimizing-a-simple-dot-product/9723/19~~

For whatever reason, OpenBLAS does not multi-thread its dot product. Luckily, you can make your own dot product which outperforms OpenBLAS. (This is also why dot is so much faster with MKL for such a basic operation - it’s multi-threaded by default)

EDIT: This no longer seems to be the case, OpenBLAS is threaded. Make sure you are using BLAS.set_num_threads(Sys.CPU_THREADS) or whatever is appropriate given your architecture.

using LinearAlgebra
using Distributed

function pdot(a::AbstractVector{T}, b::AbstractVector{T}) where {T}
    N = Threads.nthreads()
    v = zeros(T, N)
    ranges = Distributed.splitrange(length(a), N)
    Threads.@threads for i in 1:N
        x = zero(T)
        P = Threads.threadid()
        subrange = ranges[P]
        @inbounds @simd for i in subrange
            x += a[i] * b[i]
        end
        v[P] = x
    end
    sum(v)
end

Benchmarks for 2 threads (start Julia with environment variable set JULIA_NUM_THREADS=2)

using BenchmarkTools
a = randn(10000); b = randn(10000)

BLAS.set_num_threads(2)
@btime dot($a, $b)
#  1.570 μs (0 allocations: 0 bytes)

@btime pdot($a, $b)
#  1.270 μs (3 allocations: 272 bytes)

There is some performance left on the table, but this is an alternate way to get the speed you need. I only have 2 threads on this machine but I imagine with yours you may get a better speed-up.

EDIT: With some more testing, it looks like OpenBLAS is threading the dot product:

a = randn(100000); b = randn(100000)
@btime dot($a, $b)
#  12.857 μs (0 allocations: 0 bytes)
@btime pdot($a, $b)
#  12.651 μs (3 allocations: 272 bytes)

BLAS.set_num_threads(1)
@btime dot($a, $b)
#  24.300 μs (0 allocations: 0 bytes)

BLAS.set_num_threads(2)
@btime dot($a, $b)
#  12.842 μs (0 allocations: 0 bytes)

BLAS.set_num_threads(4)
@btime dot($a, $b)
#  12.157 μs (0 allocations: 0 bytes)

pdot remains within a few percent of the BLAS call. So I think BLAS is using the threaded version now. Make sure you call BLAS.set_num_threads