Innefficient paralellization? Need some help optimizing a simple dot product

Now that’s a great result! :smiley:

What are the hardware specs? What happens when n is an order larger?

1 Like

Anyone want to PR this threaded mapreduce for v0.7?

CPU: Intel i7-6950X CPU, 3GHz base frequency, 3.5GHz max, 10 cores, 20 threads, 25 MB cache, Q2 2016 (we’ve got some pretty crazy hardware in our lab).

1 Like

See also https://github.com/xianyi/OpenBLAS/issues/530#issuecomment-89394351 and following comments.

Yes. As you note, using @threads you would write this code as:

function pdot(a::AbstractVector{T}, b::AbstractVector{T}) where {T}
    N = Threads.nthreads()
    v = zeros(T, N)
    Threads.@threads for i in 1:N
        x = zero(T)
        P = Threads.threadid()
        range = split(length(a), N, P)
        @inbounds @simd for i in range
            x += a[i] * b[i]
        end
        v[P] = x
    end
    sum(v)
end

so that we do not have to dig into ccall. I prefer using @threads rather than ccall as when pdot would be called inside another part of code that is run using @threads the macro detects such a situation and handles it and ccall does not.

As for general map-reduce using threads for expensive operations (dot product is not in this class) I often find it faster to do load balancing between threads like in tmap! function here https://github.com/bkamins/KissThreading.jl/blob/master/src/KissThreading.jl (the example implements map but it could rewritten to perform map-reduce).

10 Likes

That is definitely better!

Many thanks everyone. I realize now that I wanted to do multi-threading, so thanks @stabbles and @bkamins for the codes.

I have compared them to my own Fortran codes, as well as BLAS. Julia seems to perform equally well. These are my benchmarks:

Native Fortran (no threads): 48.440 μs
Native Fortran (4 threads): 30.308 μs
Fortran BLAS (no threads): 39.551 μs
Julia Fortran (no threads): Trial(47.135 μs)
Julia Fortran (4 threads): Trial(22.483 μs)
Julia BLAS (4 threads): Trial(37.368 μs)
Julia (no threads): Trial(48.717 μs)
Julia (4 threads): Trial(34.340 μs)
Julia (4 procs): Trial(784.996 μs)

Here are the Fortran codes, the Julia codes and the shell script to run them.

1 Like

After following this discussion with a lot of interest, I have encountered a strange performance discrepancy between two seemingly equivalent versions of a threaded dot product. The following module provides a serial dot product, and two threaded dot product functions similar to the ones above.

module DotProducts

export serdot, ptdot_1, ptdot_2

function serdot(a::AbstractVector{T}, b::AbstractVector{T}) where {T<:Number}
    @assert length(a) == length(b) "Dimension mismatch"
    return serdot_(a, b)
end

function serdot_(a::AbstractVector{T}, b::AbstractVector{T}) where {T<:Number}
    N = length(a)
    Σ = zero(T)
    @inbounds @simd for k = 1:N
        Σ += a[k] * b[k]
    end
    return Σ
end

function ptdot_(a::AbstractVector{T}, b::AbstractVector{T}) where {T<:Number}
    N = length(a)
    nt = Threads.nthreads()
    v = zeros(T, nt)
    subrng = Distributed.splitrange(N, nt)
    Threads.@threads for i = 1:nt
        tid = Threads.threadid()
        v[tid] = serdot_(view(a, subrng[tid]), view(b, subrng[tid]))
    end
    return sum(v)
end

function ptdot_1(a::AbstractVector{T}, b::AbstractVector{T}) where {T<:Number}
    @assert length(a) == length(b) "Dimension mismatch"
    return ptdot_(a, b)
end

function ptdot_2(a::AbstractVector{T}, b::AbstractVector{T}) where {T<:Number}
    @assert length(a) == length(b) "Dimension mismatch"
    N = length(a)
    nt = Threads.nthreads()
    v = zeros(T, nt)
    subrng = Distributed.splitrange(N, nt)
    Threads.@threads for i = 1:nt
        tid = Threads.threadid()
        v[tid] = serdot_(view(a, subrng[tid]), view(b, subrng[tid]))
    end
    return sum(v)
end

end

Here, serdot checks the lengths of the two arrays before calling serdot_ to perform the actual serial dot product. Similarly, ptdot_1 checks the lengths of the two arrays before calling ptdot_, each of whose threads calls serdot_ for a subrange of indices. The second version, ptdot_2 differs only in that instead of calling ptdot_ I have copied the source code from its function body. I then benchmarked the functions using the following script.

using BenchmarkTools
using DotProducts

N = 100_000
a = rand(N)
b = rand(N)
nt = parse(Int, ENV["JULIA_NUM_THREADS"])
@show nt

@assert dot(a, b) ≈ serdot(a, b)
@assert dot(a, b) ≈ ptdot_1(a, b) ≈ ptdot_2(a, b)

println("Unthreaded:")
print("\tBLAS\t")
println(@benchmark dot($a, $b))
print("\tJulia\t")
println(@benchmark serdot($a, $b))

println("\nJulia with $nt threads: ")
print("\tptdot_1\t")
println(@benchmark ptdot_1($a, $b))
print("\tptdot_2\t")
println(@benchmark ptdot_2($a, $b))

Here is the weird thing: ptdot_2 is significantly slower than ptdot_1. The command

JULIA_NUM_THREADS=4 julia -O3 -LDotProducts.jl benchmark_dot.jl 

produces

nt = 4
Unthreaded:
        BLAS    Trial(32.441 μs)
        Julia   Trial(28.888 μs)

Julia with 4 threads: 
        ptdot_1 Trial(8.456 μs)
        ptdot_2 Trial(16.774 μs)

The difference in performance is less marked when using fewer threads.

Julia with 2 threads: 
        ptdot_1 Trial(15.276 μs)
        ptdot_2 Trial(19.546 μs)

Julia with 1 threads: 
        ptdot_1 Trial(29.325 μs)
        ptdot_2 Trial(32.734 μs)

Here is my versioninfo.

Julia Version 0.6.2
Commit d386e40c17 (2017-12-13 18:08 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, haswell)

Interesting. I can reproduce:

nt = 4
Unthreaded:
	BLAS	Trial(24.096 μs)
	Julia	Trial(30.413 μs)

Julia with 4 threads: 
	ptdot_1	Trial(9.086 μs)
	ptdot_2	Trial(18.543 μs)

Julia with 2 threads: 
	ptdot_1	Trial(16.489 μs)
	ptdot_2	Trial(21.167 μs)

Julia with 1 threads: 
	ptdot_1	Trial(30.857 μs)
	ptdot_2	Trial(35.507 μs)

Versioninfo:

Julia Version 0.6.1
Commit 0d7248e2ff (2017-10-24 22:15 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i5-6500 CPU @ 3.20GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, skylake)

Just a guess, but maybe this is performance of captured variables in closures · Issue #15276 · JuliaLang/julia · GitHub once again (the @threads macro creates a closure). See also see Parallelizing for loop in the computation of a gradient - #7 by tkoolen. Check the code_warntype.

Yes, I get the following.

julia> @code_warntype ptdot_2(a, b)
Variables:
  #self#::DotProducts.#ptdot_2
  a::Array{Float64,1}
  b::Array{Float64,1}
  N::Int64
  nt::Int64
  v::Core.Box
  subrng::Core.Box
  range::Core.Box
  threadsfor_fun::DotProducts.##18#threadsfor_fun#2{Array{Float64,1},Array{Float64,1}}
  #temp#::Bool
julia> @code_warntype ptdot_(a, b)
Variables:
  #self#::#ptdot_
  a::Array{Float64,1}
  b::Array{Float64,1}
  N::Int64
  nt::Int64
  v::Array{Float64,1}
  subrng::Array{UnitRange{Int64},1}
  range::UnitRange{Int64}
  threadsfor_fun::##38#threadsfor_fun#5{Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{UnitRange{Int64},1},UnitRange{Int64}}
  #temp#::Bool

The only difference between ptdot_2 and ptdot_ is that the former includes an @assert macro, and it seems odd that this would make any difference.
Here are some benchmarks from my other desktop.

$ JULIA_NUM_THREADS=8 julia -O3 -LDotProducts.jl benchmark_dot.jl 
nt = 8
Unthreaded:
	BLAS	Trial(16.872 μs)
	Julia	Trial(20.609 μs)

Julia with 8 threads: 
	ptdot_1	Trial(5.313 μs)
	ptdot_2	Trial(31.530 μs)

The versioninfo is

Julia Version 0.6.0
Commit 903644385b* (2017-06-19 13:05 UTC)
Platform Info:
  OS: Linux (x86_64-redhat-linux)
  CPU: AMD Ryzen 7 1700 Eight-Core Processor
  WORD_SIZE: 64
  BLAS: libopenblas (DYNAMIC_ARCH Zen)
  LAPACK: libopenblasp.so.0
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, generic)

Yeah, it seems odd, but see performance of captured variables in closures · Issue #15276 · JuliaLang/julia · GitHub. Also, range should be correctly inferred on master, see wrap range variable in a let block for at_threads by vchuravy · Pull Request #24688 · JuliaLang/julia · GitHub. You might want to try on Julia master as well, and try the let block workaround: performance of captured variables in closures · Issue #15276 · JuliaLang/julia · GitHub.

I think performance of captured variables in closures · Issue #15276 · JuliaLang/julia · GitHub is the biggest performance gotcha in the language right now.

1 Like

OpenBLAS has merged the threaded dot product in the meanwhile!

https://github.com/xianyi/OpenBLAS/pull/1491

1 Like