I wanted to illustrate the beauty of Julia for my class by showing the speed of a simple vector dot product, but to my surprise the Julia loop version was about 5x slower than the built-in dot(), which (I think) calls BLAS's cdotc_: http://www.netlib.org/lapack/explore-html/dd/db2/cdotc_8f_source.html
That Fortran code uses a loop that looks essentially identical to the Julia version.
Why is Julia so much slower? Neither @inbounds nor @simd helped BTW.
Is Julia calling a fancier version of BLAS with hand-written assembly or something like that?
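(As a sanity check that dot() really hits BLAS for these arguments, you can call the wrapped routine directly; a minimal sketch, where LinearAlgebra.BLAS.dotc conjugates its first vector and so should match dot(y, x):)
using LinearAlgebra
N = 2^16; x = rand(ComplexF32, N); y = rand(ComplexF32, N)
@assert LinearAlgebra.BLAS.dotc(N, y, 1, x, 1) ≈ dot(y, x) # cdotc conjugates the first vector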
Here is my test code:
using BenchmarkTools: @btime
using LinearAlgebra: dot
N = 2^16; x = rand(ComplexF32, N); y = rand(ComplexF32, N)
f2(x,y) = dot(y,x)
function f5(x,y) # basic loop
    accum = zero(eltype(x))
    for i in eachindex(x)
        accum += x[i] * conj(y[i])
    end
    return accum
end
@assert f2(x,y) ≈ f5(x,y) # check
# times below are Julia 1.5 on 2017 iMac Pro with 8 threads
@btime f2($x,$y); # dot(y,x) 13.2 us
@btime f5($x,$y); # loop 60.2 us (@inbounds and @simd did not help)
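For reference, this is the kind of variant I mean by "did not help" (a sketch continuing the session above; f5_simd is just a name I am using here):
function f5_simd(x,y) # same loop, with the usual SIMD hints
    accum = zero(eltype(x))
    @inbounds @simd for i in eachindex(x)
        accum += x[i] * conj(y[i])
    end
    return accum
end
@btime f5_simd($x,$y); # timed essentially the same as the plain loop on this machine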
I also tried LinearAlgebra.BLAS.set_num_threads(1), which gave exactly the same timing for dot, and so did setting it to 4 threads, so I think it must be single-threaded. To query the count, I found this helper (copied, I believe, from the Julia sources):
const get_num_threads = function() # anonymous so it will be serialized when called
    blas = LinearAlgebra.BLAS.vendor()
    # Wrap in a try to catch unsupported BLAS versions
    try
        if blas == :openblas
            return ccall((:openblas_get_num_threads, Base.libblas_name), Cint, ())
        elseif blas == :openblas64
            return ccall((:openblas_get_num_threads64_, Base.libblas_name), Cint, ())
        elseif blas == :mkl
            return ccall((:MKL_Get_Max_Num_Threads, Base.libblas_name), Cint, ())
        end
        # OSX BLAS looks at an environment variable
        if Sys.isapple()
            return tryparse(Cint, get(ENV, "VECLIB_MAXIMUM_THREADS", "1"))
        end
    catch
    end
    return nothing
end
julia> using LinearAlgebra
julia> get_num_threads()
4
Odd, though, that you can set the number of BLAS threads but not query it. That is exactly the opposite of native Julia threads, which you can query but not set after startup. I wonder why.
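To make the asymmetry concrete, a minimal sketch (on Julia 1.5; I believe a public BLAS.get_num_threads only appeared in later releases):
using LinearAlgebra
LinearAlgebra.BLAS.set_num_threads(2) # setting the BLAS thread count works...
# ...but there is no public getter here, hence the helper above
Threads.nthreads() # querying native Julia threads works...
# ...but their count is fixed at startup (julia -t N or JULIA_NUM_THREADS)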
Thank you @yuyichao for the answer: BLAS is using architecture-specific assembly code for the dot product.
I retried my timing on a Linux server, and there I found that @simd with @inbounds was nearly as fast as the dot() that calls BLAS. Not sure why my iMac "Pro" didn't benefit from those macros.
Anyway, hopefully someday the Julia/LLVM compiler will do this for us.
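One way to check whether the plain loop actually vectorized on a given machine (a sketch, assuming the f5, x, and y defined earlier) is to inspect the generated IR:
using InteractiveUtils: @code_llvm
@code_llvm debuginfo=:none f5(x, y) # vectorized code shows <N x float> style operations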
I see performance similar to BLAS with @fastmath and @inbounds; presumably @fastmath lets LLVM reassociate the accumulation so the loop can vectorize.
using BenchmarkTools: @btime
using LinearAlgebra: dot
N = 2^16
x = rand(ComplexF32, N)
y = rand(ComplexF32, N)
f2(x,y) = dot(y,x)
@fastmath function f5(x,y) # basic loop
    accum = zero(eltype(x))
    @inbounds for i in eachindex(x)
        accum += x[i] * conj(y[i])
    end
    return accum
end
@assert f2(x,y) ≈ f5(x,y)
@btime f2($x,$y) # dot(y,x) 19.699 us (BLAS)
@btime f5($x,$y) # loop 20.799 us (@fastmath and @inbounds)