Innefficient paralellization? Need some help optimizing a simple dot product

stabbles · March 15, 2018, 7:21pm

Now that’s a great result!

What are the hardware specs? What happens when n is an order larger?

ChrisRackauckas · March 15, 2018, 7:24pm

Anyone want to PR this threaded mapreduce for v0.7?

tkoolen · March 15, 2018, 7:42pm

CPU: Intel i7-6950X CPU, 3GHz base frequency, 3.5GHz max, 10 cores, 20 threads, 25 MB cache, Q2 2016 (we’ve got some pretty crazy hardware in our lab).

tkoolen · March 15, 2018, 7:49pm

See also https://github.com/xianyi/OpenBLAS/issues/530#issuecomment-89394351 and following comments.

bkamins · March 15, 2018, 11:08pm

Yes. As you note, using @threads you would write this code as:

function pdot(a::AbstractVector{T}, b::AbstractVector{T}) where {T}
    N = Threads.nthreads()
    v = zeros(T, N)
    Threads.@threads for i in 1:N
        x = zero(T)
        P = Threads.threadid()
        range = split(length(a), N, P)
        @inbounds @simd for i in range
            x += a[i] * b[i]
        end
        v[P] = x
    end
    sum(v)
end

so that we do not have to dig into ccall. I prefer using @threads rather than ccall as when pdot would be called inside another part of code that is run using @threads the macro detects such a situation and handles it and ccall does not.

As for general map-reduce using threads for expensive operations (dot product is not in this class) I often find it faster to do load balancing between threads like in tmap! function here https://github.com/bkamins/KissThreading.jl/blob/master/src/KissThreading.jl (the example implements map but it could rewritten to perform map-reduce).

stabbles · March 16, 2018, 9:18am

That is definitely better!

cdacosta · March 16, 2018, 10:22pm

Many thanks everyone. I realize now that I wanted to do multi-threading, so thanks @stabbles and @bkamins for the codes.

I have compared them to my own Fortran codes, as well as BLAS. Julia seems to perform equally well. These are my benchmarks:

Native Fortran (no threads): 48.440 μs
Native Fortran (4 threads): 30.308 μs
Fortran BLAS (no threads): 39.551 μs
Julia Fortran (no threads): Trial(47.135 μs)
Julia Fortran (4 threads): Trial(22.483 μs)
Julia BLAS (4 threads): Trial(37.368 μs)
Julia (no threads): Trial(48.717 μs)
Julia (4 threads): Trial(34.340 μs)
Julia (4 procs): Trial(784.996 μs)

Here are the Fortran codes, the Julia codes and the shell script to run them.

billmclean · March 23, 2018, 6:07am

After following this discussion with a lot of interest, I have encountered a strange performance discrepancy between two seemingly equivalent versions of a threaded dot product. The following module provides a serial dot product, and two threaded dot product functions similar to the ones above.

module DotProducts

export serdot, ptdot_1, ptdot_2

function serdot(a::AbstractVector{T}, b::AbstractVector{T}) where {T<:Number}
    @assert length(a) == length(b) "Dimension mismatch"
    return serdot_(a, b)
end

function serdot_(a::AbstractVector{T}, b::AbstractVector{T}) where {T<:Number}
    N = length(a)
    Σ = zero(T)
    @inbounds @simd for k = 1:N
        Σ += a[k] * b[k]
    end
    return Σ
end

function ptdot_(a::AbstractVector{T}, b::AbstractVector{T}) where {T<:Number}
    N = length(a)
    nt = Threads.nthreads()
    v = zeros(T, nt)
    subrng = Distributed.splitrange(N, nt)
    Threads.@threads for i = 1:nt
        tid = Threads.threadid()
        v[tid] = serdot_(view(a, subrng[tid]), view(b, subrng[tid]))
    end
    return sum(v)
end

function ptdot_1(a::AbstractVector{T}, b::AbstractVector{T}) where {T<:Number}
    @assert length(a) == length(b) "Dimension mismatch"
    return ptdot_(a, b)
end

function ptdot_2(a::AbstractVector{T}, b::AbstractVector{T}) where {T<:Number}
    @assert length(a) == length(b) "Dimension mismatch"
    N = length(a)
    nt = Threads.nthreads()
    v = zeros(T, nt)
    subrng = Distributed.splitrange(N, nt)
    Threads.@threads for i = 1:nt
        tid = Threads.threadid()
        v[tid] = serdot_(view(a, subrng[tid]), view(b, subrng[tid]))
    end
    return sum(v)
end

end

Here, serdot checks the lengths of the two arrays before calling serdot_ to perform the actual serial dot product. Similarly, ptdot_1 checks the lengths of the two arrays before calling ptdot_, each of whose threads calls serdot_ for a subrange of indices. The second version, ptdot_2 differs only in that instead of calling ptdot_ I have copied the source code from its function body. I then benchmarked the functions using the following script.

using BenchmarkTools
using DotProducts

N = 100_000
a = rand(N)
b = rand(N)
nt = parse(Int, ENV["JULIA_NUM_THREADS"])
@show nt

@assert dot(a, b) ≈ serdot(a, b)
@assert dot(a, b) ≈ ptdot_1(a, b) ≈ ptdot_2(a, b)

println("Unthreaded:")
print("\tBLAS\t")
println(@benchmark dot($a, $b))
print("\tJulia\t")
println(@benchmark serdot($a, $b))

println("\nJulia with $nt threads: ")
print("\tptdot_1\t")
println(@benchmark ptdot_1($a, $b))
print("\tptdot_2\t")
println(@benchmark ptdot_2($a, $b))

Here is the weird thing: ptdot_2 is significantly slower than ptdot_1. The command

JULIA_NUM_THREADS=4 julia -O3 -LDotProducts.jl benchmark_dot.jl

produces

nt = 4
Unthreaded:
        BLAS    Trial(32.441 μs)
        Julia   Trial(28.888 μs)

Julia with 4 threads: 
        ptdot_1 Trial(8.456 μs)
        ptdot_2 Trial(16.774 μs)

The difference in performance is less marked when using fewer threads.

Julia with 2 threads: 
        ptdot_1 Trial(15.276 μs)
        ptdot_2 Trial(19.546 μs)

Julia with 1 threads: 
        ptdot_1 Trial(29.325 μs)
        ptdot_2 Trial(32.734 μs)

Here is my versioninfo.

Julia Version 0.6.2
Commit d386e40c17 (2017-12-13 18:08 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, haswell)

cdacosta · March 23, 2018, 4:15pm

Interesting. I can reproduce:

nt = 4
Unthreaded:
	BLAS	Trial(24.096 μs)
	Julia	Trial(30.413 μs)

Julia with 4 threads: 
	ptdot_1	Trial(9.086 μs)
	ptdot_2	Trial(18.543 μs)

Julia with 2 threads: 
	ptdot_1	Trial(16.489 μs)
	ptdot_2	Trial(21.167 μs)

Julia with 1 threads: 
	ptdot_1	Trial(30.857 μs)
	ptdot_2	Trial(35.507 μs)

Versioninfo:

Julia Version 0.6.1
Commit 0d7248e2ff (2017-10-24 22:15 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i5-6500 CPU @ 3.20GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, skylake)

tkoolen · March 23, 2018, 5:26pm

Just a guess, but maybe this is performance of captured variables in closures · Issue #15276 · JuliaLang/julia · GitHub once again (the @threads macro creates a closure). See also see Parallelizing for loop in the computation of a gradient - #7 by tkoolen. Check the code_warntype.

billmclean · March 24, 2018, 1:16am

Yes, I get the following.

julia> @code_warntype ptdot_2(a, b)
Variables:
  #self#::DotProducts.#ptdot_2
  a::Array{Float64,1}
  b::Array{Float64,1}
  N::Int64
  nt::Int64
  v::Core.Box
  subrng::Core.Box
  range::Core.Box
  threadsfor_fun::DotProducts.##18#threadsfor_fun#2{Array{Float64,1},Array{Float64,1}}
  #temp#::Bool
julia> @code_warntype ptdot_(a, b)
Variables:
  #self#::#ptdot_
  a::Array{Float64,1}
  b::Array{Float64,1}
  N::Int64
  nt::Int64
  v::Array{Float64,1}
  subrng::Array{UnitRange{Int64},1}
  range::UnitRange{Int64}
  threadsfor_fun::##38#threadsfor_fun#5{Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{UnitRange{Int64},1},UnitRange{Int64}}
  #temp#::Bool

The only difference between ptdot_2 and ptdot_ is that the former includes an @assert macro, and it seems odd that this would make any difference.
Here are some benchmarks from my other desktop.

$ JULIA_NUM_THREADS=8 julia -O3 -LDotProducts.jl benchmark_dot.jl 
nt = 8
Unthreaded:
	BLAS	Trial(16.872 μs)
	Julia	Trial(20.609 μs)

Julia with 8 threads: 
	ptdot_1	Trial(5.313 μs)
	ptdot_2	Trial(31.530 μs)

The versioninfo is

Julia Version 0.6.0
Commit 903644385b* (2017-06-19 13:05 UTC)
Platform Info:
  OS: Linux (x86_64-redhat-linux)
  CPU: AMD Ryzen 7 1700 Eight-Core Processor
  WORD_SIZE: 64
  BLAS: libopenblas (DYNAMIC_ARCH Zen)
  LAPACK: libopenblasp.so.0
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, generic)

tkoolen · March 24, 2018, 1:41am

Yeah, it seems odd, but see performance of captured variables in closures · Issue #15276 · JuliaLang/julia · GitHub. Also, range should be correctly inferred on master, see wrap range variable in a let block for at_threads by vchuravy · Pull Request #24688 · JuliaLang/julia · GitHub. You might want to try on Julia master as well, and try the let block workaround: performance of captured variables in closures · Issue #15276 · JuliaLang/julia · GitHub.

I think performance of captured variables in closures · Issue #15276 · JuliaLang/julia · GitHub is the biggest performance gotcha in the language right now.

stabbles · March 28, 2018, 7:23am

OpenBLAS has merged the threaded dot product in the meanwhile!

https://github.com/xianyi/OpenBLAS/pull/1491

Topic		Replies	Views
How to make Julia slow? Performance question	11	2019	September 16, 2019
Interesting post about SIMD dot product (and cosine similarity) Offtopic performance	17	854	December 2, 2024
Naive dot product faster in Fortran than in Juila Performance	12	1403	July 24, 2021
Dot product not parallelized on cluster Performance linearalgebra	4	260	January 4, 2023
Optimize dot(::Vector{Float64}, ::Vector{ForwardDiff.Dual}) General Usage question	3	332	January 27, 2020

Innefficient paralellization? Need some help optimizing a simple dot product

Related topics