Why is BLAS dot product so much faster than Julia loop?

I wanted to illustrate the beauty of Julia for my class by showing the speed of a simple vector dot product, but to my surprise the Julia loop version was about 5x slower than the built-in dot() that (I think) calls BLAS.cdotc_:
http://www.netlib.org/lapack/explore-html/dd/db2/cdotc_8f_source.html
That Fortran code uses a loop that looks essentially identical to the Julia code.
Why is Julia so much slower? Neither @inbounds nor @simd helped BTW.
Is Julia calling a fancier version of BLAS with assembly code or such?
Here is my test code:

using BenchmarkTools: @btime
using LinearAlgebra: dot
N = 2^16; x = rand(ComplexF32, N); y = rand(ComplexF32, N)
f2(x,y) = dot(y,x)
function f5(x,y) # basic loop
	accum = zero(eltype(x))
	for i in eachindex(x)
		accum += x[i] * conj(y[i])
	end
	return accum
end
@assert f2(x,y) ≈ f5(x,y) # check
# times below are Julia 1.5 on 2017 iMac Pro with 8 threads
@btime f2($x,$y); # dot(y,x)  13.2 μs
@btime f5($x,$y); # loop      60.2 μs (@inbounds and @simd did not help)

This question is kind of the opposite of "How to make Julia slow?"

Edit: answered below (BLAS uses CPU-specific assembly code).

I find a significant improvement with @simd + @inbounds

julia> function f5(x,y) # basic loop
               accum = zero(eltype(x))
               for i in eachindex(x)
                       accum += x[i] * conj(y[i])
               end
               return accum
       end
f5 (generic function with 1 method)

julia> function f6(x,y) # basic loop
               accum = zero(eltype(x))
               @simd for i in eachindex(x)
                       @inbounds accum += x[i] * conj(y[i])
               end
               return accum
       end
f6 (generic function with 1 method)

julia> @btime f5($x,$y);
  138.478 μs (0 allocations: 0 bytes)

julia> @btime f6($x,$y);
  56.938 μs (0 allocations: 0 bytes)

julia> @assert f5(x,y) ≈ f6(x,y)

That is exactly what I had tried.
Could you please show the time for f2 (dot) for comparison?

I had a very similar question, with a lot of nice answers:

Simple Mat-Vec multiply (understanding performance, without the bugs)

My favorite by far was to use @tullio to avoid coding loops at all, just using Einstein tensor notation:

return @tullio accum := x[i] * y[i]
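For the conjugated dot product in this thread, a minimal runnable sketch (assuming Tullio.jl is installed; the name f_tullio is mine, not from the package):

using Tullio: @tullio
using LinearAlgebra: dot

# sum over i of x[i] * conj(y[i]); the := tells Tullio to create a scalar reduction
f_tullio(x, y) = @tullio accum := x[i] * conj(y[i])

@assert f_tullio(x, y) ≈ dot(y, x)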

Ah sorry, I had missed that; here it is:

julia> @btime f2($x,$y);
  44.532 μs (0 allocations: 0 bytes)

LoopVectorization.jl also has an example for the scalar product here: https://github.com/chriselrod/LoopVectorization.jl#dot-product. Unfortunately, you'll have to use StructArrays.jl for this to work with complex numbers.
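If I read that README example right, a rough sketch for the complex case splits the real and imaginary parts with StructArrays (the names f_avx, xs, ys are mine; @avx is LoopVectorization's macro in the versions current at the time):

using StructArrays: StructArray
using LoopVectorization: @avx
using LinearAlgebra: dot

# dot(y, x) on the split re/im component arrays, using
# (a + b*i) * conj(c + d*i) = (a*c + b*d) + (b*c - a*d)*i
function f_avx(x::StructArray, y::StructArray)
    acc_re = 0.0f0
    acc_im = 0.0f0
    @avx for i in eachindex(x)
        acc_re += x.re[i] * y.re[i] + x.im[i] * y.im[i]
        acc_im += x.im[i] * y.re[i] - x.re[i] * y.im[i]
    end
    return ComplexF32(acc_re, acc_im)
end

xs = StructArray(x); ys = StructArray(y)
@assert f_avx(xs, ys) ≈ dot(y, x)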


Is BLAS running single-threaded?

Thanks for the tip. I will look into Tullio.jl, but I am still curious why BLAS is so much faster.

I ran "top" while running my timing test and it showed 100% CPU, not 400% CPU, so I am guessing it is single-threaded, but I am not sure.

You can check with LinearAlgebra.BLAS.numthreads() (taking the function name from memory).

The closest I found to that was:

julia> LinearAlgebra.BLAS.vendor()
:openblas64
julia> LinearAlgebra.BLAS.openblas_get_config()
"OpenBLAS 0.3.9  USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell MAX_THREADS=32"

I also found LinearAlgebra.BLAS.set_num_threads(1), and that led to exactly the same timing results for dot, as did setting it to 4 threads, so I think it must be single-threaded.
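For reproducibility, the check was essentially this (a sketch, reusing f2, x, and y from the top of the thread):

using LinearAlgebra: BLAS
BLAS.set_num_threads(1)
@btime f2($x,$y)  # same time as the default run
BLAS.set_num_threads(4)
@btime f2($x,$y)  # also unchanged at this problem size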


Because it does in essence all the stuff Tullio is doing under the hood: @avx and so forth.

If so, I'd like to see the BLAS source code for that. The code I am seeing doesn't have any such features:

It's probably the Fortran compiler that knows people use Fortran to write matrix ops and optimizes the heck out of them.

Cut and paste worked here…

const get_num_threads = function() # anonymous so it will be serialized when called
    blas = LinearAlgebra.BLAS.vendor()
    # Wrap in a try to catch unsupported blas versions
    try
        if blas == :openblas
            return ccall((:openblas_get_num_threads, Base.libblas_name), Cint, ())
        elseif blas == :openblas64
            return ccall((:openblas_get_num_threads64_, Base.libblas_name), Cint, ())
        elseif blas == :mkl
            return ccall((:MKL_Get_Max_Num_Threads, Base.libblas_name), Cint, ())
        end

        # OSX BLAS looks at an environment variable
        if Sys.isapple()
            return tryparse(Cint, get(ENV, "VECLIB_MAXIMUM_THREADS", "1"))
        end
    catch
    end

    return nothing
end

julia> using LinearAlgebra

julia> get_num_threads()
4


https://github.com/xianyi/OpenBLAS/blob/d5e6940253b2ee638509de283b8b1d7695fefbbf/kernel/x86_64/cdot_microk_haswell-2.c


Faulty memory there, sorry.

Odd, though, that you can set the number of threads, but not query it. Exactly the opposite of native Julia threads, which you can query but not set. I wonder why.
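For the native side, a quick illustration (the output value here is illustrative; it is whatever JULIA_NUM_THREADS was at launch):

julia> Threads.nthreads()  # query the native Julia thread count
4

# there is no runtime setter; set it before launch, e.g. JULIA_NUM_THREADS=4 julia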

Thank you @yuyichao for the answer: BLAS is using architecture-specific assembly code for the dot product.

I retried my timing results on a Linux server, and there I found that @simd with @inbounds was nearly as fast as the dot() that calls BLAS. Not sure why my iMac "Pro" didn't benefit from those macros.

Anyway, hopefully someday the Julia/LLVM compiler will do it for us 🙂

I see similar performance to BLAS with @fastmath and @inbounds; presumably @fastmath licenses the floating-point reassociation needed to vectorize the complex accumulation.

using BenchmarkTools: @btime
using LinearAlgebra: dot

N = 2^16
x = rand(ComplexF32, N)
y = rand(ComplexF32, N)
f2(x,y) = dot(y,x)
@fastmath function f5(x,y) # basic loop
    accum = zero(eltype(x))
    @inbounds for i in eachindex(x)
        accum += x[i] * conj(y[i])
    end
    return accum
end
@assert f2(x,y) ≈ f5(x,y)
@btime f2($x,$y)   # dot(y,x)  19.699 μs (BLAS)
@btime f5($x,$y)   # loop      20.799 μs (@fastmath and @inbounds)