Why is BLAS dot product so much faster than Julia loop?

I wanted to illustrate the beauty of Julia for my class by showing the speed of a simple vector dot product, but to my surprise the Julia loop version was about 5x slower than the built-in dot() that (I think) calls BLAS.cdotc_:
http://www.netlib.org/lapack/explore-html/dd/db2/cdotc_8f_source.html
That Fortran code uses a loop that looks essentially identical to the Julia code.
Why is Julia so much slower? Neither @inbounds nor @simd helped BTW.
Is Julia calling a fancier version of BLAS with assembly code or such?
Here is my test code:

using BenchmarkTools: @btime
using LinearAlgebra: dot
N = 2^16; x = rand(ComplexF32, N); y = rand(ComplexF32, N)
f2(x,y) = dot(y,x)
function f5(x,y) # basic loop
	accum = zero(eltype(x))
	for i in eachindex(x)
		accum += x[i] * conj(y[i])
	end
	return accum
end
@assert f2(x,y) ≈ f5(x,y) # check
# times below are Julia 1.5 on 2017 iMac Pro with 8 threads
@btime f2($x,$y); # dot(y,x)  13.2 μs
@btime f5($x,$y); # loop      60.2 μs (@inbounds and @simd did not help)

This question is kind of the opposite of "How to make Julia slow?"

Edit: answered below (BLAS uses CPU-specific assembly code).

I find a significant improvement with @simd + @inbounds

julia> function f5(x,y) # basic loop
               accum = zero(eltype(x))
               for i in eachindex(x)
                       accum += x[i] * conj(y[i])
               end
               return accum
       end
f5 (generic function with 1 method)

julia> function f6(x,y) # basic loop
               accum = zero(eltype(x))
               @simd for i in eachindex(x)
                       @inbounds accum += x[i] * conj(y[i])
               end
               return accum
       end
f6 (generic function with 1 method)

julia> @btime f5($x,$y);
  138.478 μs (0 allocations: 0 bytes)

julia> @btime f6($x,$y);
  56.938 μs (0 allocations: 0 bytes)

julia> @assert f5(x,y) ≈ f6(x,y)

That is exactly what I had tried.
Could you please show the time for f2 (dot) for comparison?

I had a very similar question, with a lot of nice answers:

Simple Mat-Vec multiply (understanding performance, without the bugs)

My favorite by far was to use @tullio to avoid coding loops at all, just using Einstein tensor notation:

return @tullio accum := x[i] * y[i]
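For the conjugated dot product in this thread, a minimal runnable sketch (assuming Tullio.jl is installed; the name f_tullio is mine, not from the package):

using Tullio: @tullio
using LinearAlgebra: dot

# sum over i of x[i] * conj(y[i]); the := tells Tullio to create a scalar reduction
f_tullio(x, y) = @tullio accum := x[i] * conj(y[i])

@assert f_tullio(x, y) ≈ dot(y, x)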

Ah sorry, I had missed that; here it is:

julia> @btime f2($x,$y);
  44.532 μs (0 allocations: 0 bytes)

LoopVectorization.jl also has an example for the scalar product here: https://github.com/chriselrod/LoopVectorization.jl#dot-product. Unfortunately, you'll have to use StructArrays.jl for this to work with complex numbers.
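If I read that README example right, a rough sketch for the complex case splits the real and imaginary parts with StructArrays (the names f_avx, xs, ys are mine; @avx is LoopVectorization's macro in the versions current at the time):

using StructArrays: StructArray
using LoopVectorization: @avx
using LinearAlgebra: dot

# dot(y, x) on the split re/im component arrays, using
# (a + b*i) * conj(c + d*i) = (a*c + b*d) + (b*c - a*d)*i
function f_avx(x::StructArray, y::StructArray)
    acc_re = 0.0f0
    acc_im = 0.0f0
    @avx for i in eachindex(x)
        acc_re += x.re[i] * y.re[i] + x.im[i] * y.im[i]
        acc_im += x.im[i] * y.re[i] - x.re[i] * y.im[i]
    end
    return ComplexF32(acc_re, acc_im)
end

xs = StructArray(x); ys = StructArray(y)
@assert f_avx(xs, ys) ≈ dot(y, x)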


Is BLAS running single-threaded?

Thanks for the tip. I will look into Tullio.jl, but I am still curious why BLAS is so much faster.

I ran "top" while running my timing test and it showed 100% CPU, not 400% CPU, so I am guessing it is single-threaded, but I am not sure.

You can check with LinearAlgebra.BLAS.numthreads() (taking the function name from memory).

The closest I found to that was:

julia> LinearAlgebra.BLAS.vendor()
:openblas64
julia> LinearAlgebra.BLAS.openblas_get_config()
"OpenBLAS 0.3.9  USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell MAX_THREADS=32"

I also found LinearAlgebra.BLAS.set_num_threads(1), and that led to exactly the same timing results for dot, as did setting it to 4 threads, so I think it must be single-threaded.
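For reproducibility, the check was essentially this (a sketch, reusing f2, x, and y from the top of the thread):

using LinearAlgebra: BLAS
BLAS.set_num_threads(1)
@btime f2($x,$y)  # same time as the default run
BLAS.set_num_threads(4)
@btime f2($x,$y)  # also unchanged at this problem size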


Because it does in essence all the stuff Tullio is doing under the hood: @avx and so forth.

If so, I'd like to see the BLAS source code for that. The code I am seeing doesn't have any such features:

It's probably the Fortran compiler that knows people use Fortran to write matrix ops and optimizes the heck out of them.

Cut and paste worked here…

const get_num_threads = function() # anonymous so it will be serialized when called
    blas = LinearAlgebra.BLAS.vendor()
    # Wrap in a try to catch unsupported blas versions
    try
        if blas == :openblas
            return ccall((:openblas_get_num_threads, Base.libblas_name), Cint, ())
        elseif blas == :openblas64
            return ccall((:openblas_get_num_threads64_, Base.libblas_name), Cint, ())
        elseif blas == :mkl
            return ccall((:MKL_Get_Max_Num_Threads, Base.libblas_name), Cint, ())
        end

        # OSX BLAS looks at an environment variable
        if Sys.isapple()
            return tryparse(Cint, get(ENV, "VECLIB_MAXIMUM_THREADS", "1"))
        end
    catch
    end

    return nothing
end

julia> using LinearAlgebra

julia> get_num_threads()
4


https://github.com/xianyi/OpenBLAS/blob/d5e6940253b2ee638509de283b8b1d7695fefbbf/kernel/x86_64/cdot_microk_haswell-2.c


Faulty memory there, sorry.

Odd, though, that you can set the number of threads, but not query it. Exactly the opposite of native Julia threads, which you can query but not set. I wonder why.
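For the native side, a quick illustration (the output value here is illustrative; it is whatever JULIA_NUM_THREADS was at launch):

julia> Threads.nthreads()  # query the native Julia thread count
4

# there is no runtime setter; set it before launch, e.g. JULIA_NUM_THREADS=4 julia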

Thank you @yuyichao for the answer: BLAS is using architecture-specific assembly code for the dot product.

I retried my timing results on a Linux server, and there I found that @simd with @inbounds was nearly as fast as the dot() that calls BLAS. Not sure why my iMac "Pro" didn't benefit from those macros.

Anyway, hopefully someday the Julia/LLVM compiler will do it for us 🙂

I see similar performance to BLAS with @fastmath and @inbounds; presumably @fastmath licenses the floating-point reassociation needed to vectorize the complex accumulation.

using BenchmarkTools: @btime
using LinearAlgebra: dot

N = 2^16
x = rand(ComplexF32, N)
y = rand(ComplexF32, N)
f2(x,y) = dot(y,x)
@fastmath function f5(x,y) # basic loop
    accum = zero(eltype(x))
    @inbounds for i in eachindex(x)
        accum += x[i] * conj(y[i])
    end
    return accum
end
@assert f2(x,y) ≈ f5(x,y)
@btime f2($x,$y)   # dot(y,x)  19.699 μs (BLAS)
@btime f5($x,$y)   # loop      20.799 μs (@fastmath and @inbounds)