Now that’s a great result!

What are the hardware specs? What happens when n is an order of magnitude larger?
Anyone want to PR this threaded mapreduce for v0.7?
CPU: Intel i7-6950X, 3.0 GHz base frequency, 3.5 GHz max, 10 cores, 20 threads, 25 MB cache, launched Q2 2016 (we’ve got some pretty crazy hardware in our lab).
See also https://github.com/xianyi/OpenBLAS/issues/530#issuecomment-89394351 and following comments.
Yes. As you note, using @threads you would write this code as:
function pdot(a::AbstractVector{T}, b::AbstractVector{T}) where {T}
    N = Threads.nthreads()
    v = zeros(T, N)                      # one partial sum per thread
    Threads.@threads for i in 1:N
        x = zero(T)
        P = Threads.threadid()
        range = split(length(a), N, P)   # this thread's chunk of indices (split is sketched below)
        @inbounds @simd for j in range
            x += a[j] * b[j]
        end
        v[P] = x
    end
    sum(v)
end
so that we do not have to dig into ccall. I prefer @threads over ccall because when pdot is called from code that is itself running under @threads, the macro detects the situation and handles it, whereas ccall does not.
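The split helper used in pdot is not shown in this excerpt. A minimal sketch, assuming it takes the vector length, the number of chunks, and a chunk index, and returns the P-th of N near-equal contiguous subranges of 1:n:

function split(n::Integer, N::Integer, P::Integer)
    # Split 1:n into N near-equal contiguous chunks (the first r chunks get
    # one extra element) and return the P-th chunk as a range.
    q, r = divrem(n, N)
    lo = (P - 1) * q + min(P - 1, r) + 1
    hi = P * q + min(P, r)
    return lo:hi
end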
As for a general threaded map-reduce over expensive operations (the dot product is not in this class), I often find it faster to do load balancing between the threads, as in the tmap! function here: https://github.com/bkamins/KissThreading.jl/blob/master/src/KissThreading.jl (the example implements map, but it could be rewritten to perform map-reduce; see the sketch below).
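A minimal sketch of that idea (not the KissThreading implementation; the name tmapreduce and the batch_size parameter are mine): threads claim small batches of indices from a shared atomic counter, so uneven per-element cost stays balanced across threads.

function tmapreduce(f, op, v0, a::AbstractVector; batch_size::Int = 64)
    nt = Threads.nthreads()
    partial = fill(v0, nt)               # one accumulator per thread
    counter = Threads.Atomic{Int}(1)     # next unclaimed index
    n = length(a)
    Threads.@threads for t in 1:nt
        tid = Threads.threadid()
        acc = partial[tid]
        while true
            # atomic_add! returns the old value, i.e. the start of our batch
            lo = Threads.atomic_add!(counter, batch_size)
            lo > n && break
            hi = min(lo + batch_size - 1, n)
            @inbounds for i in lo:hi
                acc = op(acc, f(a[i]))
            end
        end
        partial[tid] = acc
    end
    return reduce(op, partial)           # combine the per-thread results
end

For example, tmapreduce(x -> exp(-x), +, 0.0, rand(10^6)) computes a threaded sum of an elementwise-transformed vector.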
That is definitely better!
Many thanks, everyone. I realize now that what I wanted was multi-threading, so thanks @stabbles and @bkamins for the code. I have compared it to my own Fortran code, as well as BLAS, and Julia seems to perform equally well. These are my benchmarks:
Native Fortran (no threads): 48.440 μs
Native Fortran (4 threads): 30.308 μs
Fortran BLAS (no threads): 39.551 μs
Julia Fortran (no threads): Trial(47.135 μs)
Julia Fortran (4 threads): Trial(22.483 μs)
Julia BLAS (4 threads): Trial(37.368 μs)
Julia (no threads): Trial(48.717 μs)
Julia (4 threads): Trial(34.340 μs)
Julia (4 procs): Trial(784.996 μs)
Here are the Fortran codes, the Julia codes and the shell script to run them.
After following this discussion with a lot of interest, I have encountered a strange performance discrepancy between two seemingly equivalent versions of a threaded dot product. The following module provides a serial dot product, and two threaded dot product functions similar to the ones above.
module DotProducts

export serdot, ptdot_1, ptdot_2

function serdot(a::AbstractVector{T}, b::AbstractVector{T}) where {T<:Number}
    @assert length(a) == length(b) "Dimension mismatch"
    return serdot_(a, b)
end

# Serial kernel: plain SIMD loop without bounds checks.
function serdot_(a::AbstractVector{T}, b::AbstractVector{T}) where {T<:Number}
    N = length(a)
    Σ = zero(T)
    @inbounds @simd for k = 1:N
        Σ += a[k] * b[k]
    end
    return Σ
end

# Threaded kernel: each thread applies the serial kernel to its own subrange.
# (On Julia 0.6, Distributed.splitrange is reachable without an explicit
# import; later versions need `using Distributed`.)
function ptdot_(a::AbstractVector{T}, b::AbstractVector{T}) where {T<:Number}
    N = length(a)
    nt = Threads.nthreads()
    v = zeros(T, nt)
    subrng = Distributed.splitrange(N, nt)
    Threads.@threads for i = 1:nt
        tid = Threads.threadid()
        v[tid] = serdot_(view(a, subrng[tid]), view(b, subrng[tid]))
    end
    return sum(v)
end

# Version 1: check the lengths, then call the threaded kernel.
function ptdot_1(a::AbstractVector{T}, b::AbstractVector{T}) where {T<:Number}
    @assert length(a) == length(b) "Dimension mismatch"
    return ptdot_(a, b)
end

# Version 2: identical, except the kernel's body is inlined by hand.
function ptdot_2(a::AbstractVector{T}, b::AbstractVector{T}) where {T<:Number}
    @assert length(a) == length(b) "Dimension mismatch"
    N = length(a)
    nt = Threads.nthreads()
    v = zeros(T, nt)
    subrng = Distributed.splitrange(N, nt)
    Threads.@threads for i = 1:nt
        tid = Threads.threadid()
        v[tid] = serdot_(view(a, subrng[tid]), view(b, subrng[tid]))
    end
    return sum(v)
end

end # module
Here, serdot checks the lengths of the two arrays before calling serdot_ to perform the actual serial dot product. Similarly, ptdot_1 checks the lengths of the two arrays before calling ptdot_, each of whose threads calls serdot_ on a subrange of indices. The second version, ptdot_2, differs only in that, instead of calling ptdot_, I have copied the source code from its function body. I then benchmarked the functions using the following script.
using BenchmarkTools
using DotProducts
N = 100_000
a = rand(N)
b = rand(N)
nt = parse(Int, ENV["JULIA_NUM_THREADS"])
@show nt
@assert dot(a, b) ≈ serdot(a, b)
@assert dot(a, b) ≈ ptdot_1(a, b) ≈ ptdot_2(a, b)
println("Unthreaded:")
print("\tBLAS\t")
println(@benchmark dot($a, $b))
print("\tJulia\t")
println(@benchmark serdot($a, $b))
println("\nJulia with $nt threads: ")
print("\tptdot_1\t")
println(@benchmark ptdot_1($a, $b))
print("\tptdot_2\t")
println(@benchmark ptdot_2($a, $b))
Here is the weird thing: ptdot_2 is significantly slower than ptdot_1. The command
JULIA_NUM_THREADS=4 julia -O3 -LDotProducts.jl benchmark_dot.jl
produces
nt = 4
Unthreaded:
BLAS Trial(32.441 μs)
Julia Trial(28.888 μs)
Julia with 4 threads:
ptdot_1 Trial(8.456 μs)
ptdot_2 Trial(16.774 μs)
The difference in performance is less marked when using fewer threads.
Julia with 2 threads:
ptdot_1 Trial(15.276 μs)
ptdot_2 Trial(19.546 μs)
Julia with 1 threads:
ptdot_1 Trial(29.325 μs)
ptdot_2 Trial(32.734 μs)
Here is my versioninfo.
Julia Version 0.6.2
Commit d386e40c17 (2017-12-13 18:08 UTC)
Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
CPU: Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
WORD_SIZE: 64
BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
LAPACK: libopenblas64_
LIBM: libopenlibm
LLVM: libLLVM-3.9.1 (ORCJIT, haswell)
Interesting. I can reproduce:
nt = 4
Unthreaded:
BLAS Trial(24.096 μs)
Julia Trial(30.413 μs)
Julia with 4 threads:
ptdot_1 Trial(9.086 μs)
ptdot_2 Trial(18.543 μs)
Julia with 2 threads:
ptdot_1 Trial(16.489 μs)
ptdot_2 Trial(21.167 μs)
Julia with 1 threads:
ptdot_1 Trial(30.857 μs)
ptdot_2 Trial(35.507 μs)
Versioninfo:
Julia Version 0.6.1
Commit 0d7248e2ff (2017-10-24 22:15 UTC)
Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
CPU: Intel(R) Core(TM) i5-6500 CPU @ 3.20GHz
WORD_SIZE: 64
BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
LAPACK: libopenblas64_
LIBM: libopenlibm
LLVM: libLLVM-3.9.1 (ORCJIT, skylake)
Just a guess, but maybe this is performance of captured variables in closures · Issue #15276 · JuliaLang/julia · GitHub once again (the @threads macro creates a closure). See also Parallelizing for loop in the computation of a gradient - #7 by tkoolen. Check the code_warntype output.
Yes, I get the following.
julia> @code_warntype ptdot_2(a, b)
Variables:
  #self#::DotProducts.#ptdot_2
  a::Array{Float64,1}
  b::Array{Float64,1}
  N::Int64
  nt::Int64
  v::Core.Box
  subrng::Core.Box
  range::Core.Box
  threadsfor_fun::DotProducts.##18#threadsfor_fun#2{Array{Float64,1},Array{Float64,1}}
  #temp#::Bool

julia> @code_warntype ptdot_(a, b)
Variables:
  #self#::#ptdot_
  a::Array{Float64,1}
  b::Array{Float64,1}
  N::Int64
  nt::Int64
  v::Array{Float64,1}
  subrng::Array{UnitRange{Int64},1}
  range::UnitRange{Int64}
  threadsfor_fun::##38#threadsfor_fun#5{Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{UnitRange{Int64},1},UnitRange{Int64}}
  #temp#::Bool
The only difference between ptdot_2 and ptdot_ is that the former includes an @assert check, and it seems odd that this would make any difference.
Here are some benchmarks from my other desktop.
$ JULIA_NUM_THREADS=8 julia -O3 -LDotProducts.jl benchmark_dot.jl
nt = 8
Unthreaded:
BLAS Trial(16.872 μs)
Julia Trial(20.609 μs)
Julia with 8 threads:
ptdot_1 Trial(5.313 μs)
ptdot_2 Trial(31.530 μs)
The versioninfo is
Julia Version 0.6.0
Commit 903644385b* (2017-06-19 13:05 UTC)
Platform Info:
OS: Linux (x86_64-redhat-linux)
CPU: AMD Ryzen 7 1700 Eight-Core Processor
WORD_SIZE: 64
BLAS: libopenblas (DYNAMIC_ARCH Zen)
LAPACK: libopenblasp.so.0
LIBM: libopenlibm
LLVM: libLLVM-3.9.1 (ORCJIT, generic)
Yeah, it seems odd, but see performance of captured variables in closures · Issue #15276 · JuliaLang/julia · GitHub. Also, range should be correctly inferred on master; see wrap range variable in a let block for at_threads by vchuravy · Pull Request #24688 · JuliaLang/julia · GitHub. You might want to try Julia master as well, and try the let block workaround from performance of captured variables in closures · Issue #15276 · JuliaLang/julia · GitHub, sketched below.
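A minimal sketch of that workaround applied to ptdot_2 (the name ptdot_2_let is mine): rebinding the captured variables in a let block gives the closure created by Threads.@threads fresh, type-stable locals to capture instead of the boxed ones visible in the code_warntype output above.

function ptdot_2_let(a::AbstractVector{T}, b::AbstractVector{T}) where {T<:Number}
    @assert length(a) == length(b) "Dimension mismatch"
    N = length(a)
    nt = Threads.nthreads()
    v = zeros(T, nt)
    subrng = Distributed.splitrange(N, nt)
    # Rebind the variables the @threads closure captures so they are not boxed.
    let v = v, subrng = subrng, a = a, b = b
        Threads.@threads for i = 1:nt
            tid = Threads.threadid()
            v[tid] = serdot_(view(a, subrng[tid]), view(b, subrng[tid]))
        end
    end
    return sum(v)
end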
I think performance of captured variables in closures · Issue #15276 · JuliaLang/julia · GitHub is the biggest performance gotcha in the language right now.
OpenBLAS has merged the threaded dot product in the meantime!