# Why is BLAS dot product so much faster than Julia loop?

I wanted to illustrate the beauty of Julia for my class by showing the speed of a simple vector dot product, but to my surprise the Julia loop version was about 5x slower than the built-in `dot()` that (I think) calls `BLAS.cdotc_`:
http://www.netlib.org/lapack/explore-html/dd/db2/cdotc_8f_source.html
That FORTRAN code uses a loop that looks essentially identical to the Julia code.
Why is Julia so much slower? Neither `@inbounds` nor `@simd` helped BTW.
Is Julia calling a fancier version of BLAS with assembly code or such?
Here is my test code:

``````
using BenchmarkTools: @btime
using LinearAlgebra: dot

N = 2^16; x = rand(ComplexF32, N); y = rand(ComplexF32, N)

f2(x,y) = dot(y,x)

function f5(x,y) # basic loop
    accum = zero(eltype(x))
    for i in eachindex(x)
        accum += x[i] * conj(y[i])
    end
    return accum
end

@assert f2(x,y) ≈ f5(x,y) # check

# times below are Julia 1.5 on 2017 iMac Pro with 8 threads
@btime f2($x,$y); # dot(y,x)  13.2 us
@btime f5($x,$y); # loop      60.2 us (@inbounds and @simd did not help)
``````

This question is in a sense the opposite of "How to make Julia slow?"

Edit: answered below (BLAS uses cpu-specific assembly code).

I find a significant improvement with `@simd` + `@inbounds`

``````
julia> function f5(x,y) # basic loop
           accum = zero(eltype(x))
           for i in eachindex(x)
               accum += x[i] * conj(y[i])
           end
           return accum
       end
f5 (generic function with 1 method)

julia> function f6(x,y) # loop with @simd and @inbounds
           accum = zero(eltype(x))
           @simd for i in eachindex(x)
               @inbounds accum += x[i] * conj(y[i])
           end
           return accum
       end
f6 (generic function with 1 method)

julia> @btime f5($x,$y);
  138.478 μs (0 allocations: 0 bytes)

julia> @btime f6($x,$y);
  56.938 μs (0 allocations: 0 bytes)

julia> @assert f5(x,y) ≈ f6(x,y)
``````

That is exactly what I had tried.
Could you please show the time for `f2` (`dot`) for comparison?

I had a very similar question, with a lot of nice answers:

Simple Mat-Vec multiply (understanding performance, without the bugs)

My favorite by far was to use `@tullio` to avoid writing loops at all; just use Einstein tensor notation:

``````
return @tullio s := x[i] * y[i]
``````
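Since the vectors in this thread are complex, the conjugate matters: `dot(y, x)` conjugates its first argument. A minimal sketch matching that behavior (assuming Tullio.jl is installed; the name `f_tullio` is just illustrative):

```julia
using Tullio
using LinearAlgebra: dot

x = rand(ComplexF32, 2^16); y = rand(ComplexF32, 2^16)

# dot(y, x) = Σᵢ conj(y[i]) * x[i], so conjugate y in the Tullio expression too
f_tullio(x, y) = @tullio s := conj(y[i]) * x[i]

f_tullio(x, y) ≈ dot(y, x)  # should hold
```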

Ah, sorry, I had missed that; here it is:

``````
julia> @btime f2($x,$y);
  44.532 μs (0 allocations: 0 bytes)
``````

LoopVectorization.jl also has an example for the scalar product here: https://github.com/chriselrod/LoopVectorization.jl#dot-product. Unfortunately, you’ll have to use StructArrays.jl for this to work with complex numbers.
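For what it's worth, a rough sketch of how that combination might look; the name `f7` is mine, the component loops rely on StructArrays exposing the `.re`/`.im` field arrays, and newer LoopVectorization versions spell the macro `@turbo` rather than `@avx`:

```julia
using StructArrays, LoopVectorization
using LinearAlgebra: dot

N = 2^16
x = StructArray(rand(ComplexF32, N))  # stores .re and .im as separate Float32 arrays
y = StructArray(rand(ComplexF32, N))

# dot(y, x) = Σᵢ conj(y[i]) * x[i], expanded into real arithmetic so @avx can
# vectorize over the plain component arrays
function f7(x, y)
    sre = zero(eltype(x.re)); sim = zero(eltype(x.im))
    @avx for i in eachindex(x)
        sre += x.re[i] * y.re[i] + x.im[i] * y.im[i]
        sim += x.im[i] * y.re[i] - x.re[i] * y.im[i]
    end
    return Complex(sre, sim)
end

f7(x, y) ≈ dot(y, x)  # should hold
```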


Thanks for the tip. I will look into Tullio.jl, but I am still curious why BLAS is so much faster.

I ran `top` while running my timing test and it showed 100% CPU, not 400%, so I am guessing it is single-threaded, but I am not sure.

You can check with `LinearAlgebra.BLAS.numthreads()` (taking the function name from memory).

The closest I found to that was:

``````
julia> LinearAlgebra.BLAS.vendor()
:openblas64

julia> LinearAlgebra.BLAS.openblas_get_config()
"OpenBLAS 0.3.9  USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell MAX_THREADS=32"
``````

I also found `LinearAlgebra.BLAS.set_num_threads(1)`, which led to exactly the same timing results for `dot`, and the same when setting it to 4 threads, so I think it must be single-threaded.


Because it does, in essence, all the stuff Tullio is doing under the hood: `@avx` and so forth.

If so, I’d like to see the BLAS source code for that. The code I am seeing doesn’t have any such features.

It’s probably the Fortran compiler that knows people use Fortran to write matrix ops and optimizes the heck out of them.

Cut and paste worked here:

``````
const get_num_threads = function() # anonymous so it will be serialized when called
    blas = LinearAlgebra.BLAS.vendor()
    # Wrap in a try to catch unsupported blas versions
    try
        if blas == :openblas
            return ccall((:openblas_get_num_threads, Base.libblas_name), Cint, ())
        elseif blas == :openblas64
            return ccall((:openblas_get_num_threads64_, Base.libblas_name), Cint, ())
        elseif blas == :mkl
            return ccall((:MKL_Get_Max_Num_Threads, Base.libblas_name), Cint, ())
        end

        # OSX BLAS looks at an environment variable
        if Sys.isapple()
            return ENV["VECLIB_MAXIMUM_THREADS"]
        end
    catch
    end

    return nothing
end

julia> using LinearAlgebra

julia> get_num_threads()
4
``````

Faulty memory there, sorry.

Odd, though, that you can set the number of threads, but not query it. Exactly the opposite of native Julia threads, which you can query but not set. I wonder why.
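A quick sketch of that asymmetry as of Julia 1.5 (later Julia versions added `LinearAlgebra.BLAS.get_num_threads()`, which closes the gap):

```julia
using LinearAlgebra

# BLAS threads: settable at runtime, but no public getter on Julia 1.5
# (Julia 1.6 added LinearAlgebra.BLAS.get_num_threads())
LinearAlgebra.BLAS.set_num_threads(1)

# Native Julia threads: queryable at runtime, but the count is fixed at
# startup via the JULIA_NUM_THREADS environment variable (or `julia -t`)
println(Threads.nthreads())
```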

Thank you @yuyichao for the answer: BLAS is using architecture-specific assembly code for the dot product.

I retried my timing results on a Linux server, and there I found that `@simd` with `@inbounds` was nearly as fast as the `dot()` that calls BLAS. Not sure why my iMac “Pro” didn’t benefit from those macros.

Anyway, hopefully someday the Julia/LLVM compiler will do it for us. I see similar performance to BLAS with `@fastmath` and `@inbounds`.

``````
using BenchmarkTools: @btime
using LinearAlgebra: dot

N = 2^16
x = rand(ComplexF32, N)
y = rand(ComplexF32, N)

f2(x,y) = dot(y,x)

@fastmath function f5(x,y) # basic loop
    accum = zero(eltype(x))
    @inbounds for i in eachindex(x)
        accum += x[i] * conj(y[i])
    end
    return accum
end

@assert f2(x,y) ≈ f5(x,y)

@btime f2($x,$y)   # dot(y,x)  19.699 us (BLAS)
@btime f5($x,$y)   # loop      20.799 us (@fastmath and @inbounds)
``````