Another BLAS and Julia comparison

Just for fun I did a direct translation of the BLAS function saspy, that is written in Fortran (You can found it here) into Julia. With two small optimizations (@inbounds and @threads) I can get almost the same speed of BLAS.aspy!.

function saxpy!(N,SA,SX,INCX::Int,SY,INCY::Int)
   if (N ≤ 0) return nothing end
   if (SA==0.0) return nothing end
   if (INCX==1) && (INCY==1)
   #
   #         code for both increments equal to 1
   #
   #
   #         clean-up loop
   #
      M = N%4
      if (M≠0)
         @inbounds Threads.@threads for I = 1:M
            SY[I] = SY[I] + SA*SX[I]
         end
      end
      if (N < 4) return nothing end
      MP1 = M + 1
      @inbounds Threads.@threads for I = MP1:4:N
         SY[I] = SY[I] + SA*SX[I]
         SY[I+1] = SY[I+1] + SA*SX[I+1]
         SY[I+2] = SY[I+2] + SA*SX[I+2]
         SY[I+3] = SY[I+3] + SA*SX[I+3]
      end
   else
   #
   #        code for unequal increments or equal increments
   #          not equal to 1
   #
      IX = 1
      IY = 1
      if (INCX<0)
         IX = (-N+1)*INCX + 1
      end
      if (INCY<0)
         IY = (-N+1)*INCY + 1
      end
      @inbounds Threads.@threads for I = 1:N
       SY[IY] = SY[IY] + SA*SX[IX]
       IX = IX + INCX
       IY = IY + INCY
      end
   end
   return nothing
end
using LinearAlgebra
using BenchmarkTools
N = 10^7
const a = 0.2
x = convert(Array{Float64},collect(1:N))
y = convert(Array{Float64},collect(1:N))
@btime BLAS.axpy!($a,$x,$y)

x = convert(Array{Float64},collect(1:N))
y = convert(Array{Float64},collect(1:N))
@btime saxpy!($N,$a,$x,1,$y,1)

x = convert(Array{Float64},collect(1:N))
y = convert(Array{Float64},collect(1:N))
@btime y .= a.*x + y;

Results in

  17.958 ms (0 allocations: 0 bytes)
  20.543 ms (3 allocations: 96 bytes)
  214.485 ms (8 allocations: 152.59 MiB)

I must to say congratulations to Julia community for allow this level of performance and still maintain the simplicity of the code. But, I’m here to ask : Does the code can still be optimized ? :thinking:

2 Likes

You’re missing a . in .+ y in the last line, and missing interpolating the variables:

julia> @btime $y .= $a.*$x .+ $y
  12.048 ms (0 allocations: 0 bytes)

I don’t think it’s a fair comparison to use threads in Julia. What are the timings without using threads?

4 Likes

OpenBLAS uses threads. Julia’s default BLAS is OpenBLAS.
Really it should be configurable, like OpenBLAS, on number of threads
(but this is nontrivial, except if you are willuing to use JULIA_NUM_THREADS as the number of threads, in which case it is intrinically already done)

Could you please remind us how to check the number of OpenBlas threads?

Mostly correct:

ENV["OPENBLAS_NUM_THREADS"]

More correct:

get(() -> ENV["OMP_NUM_THREADS"], ENV, "OPENBLAS_NUM_THREADS")

The above works regardless of if the environment variables are defined outside julia or not.

On startup, if neither is defined OPENBLAS_NUM_THREADS
is set to scale up to be equal to your number of cores, capping at 8.

See:

1 Like

If I recall correctly, there is no BLAS.get_num_threads corresponding to the existing BLAS.set_num_threads. I think there is a BLAS function (per implementation) to be wrapped, just has not been done.

EDIT: I forgot that I wrote one. Here is a gist. The part of the gist setting the BLAS vendor is out-of-date. Jeff made some change rendering it unecessary.

I use the this — which is probably the wrong way to go about it…

using LinearAlgebra
function get_num_threads()
    bconf = LinearAlgebra.BLAS.openblas_get_config()
    m = match(r"MAX_THREADS=", bconf)
    parse(Int, bconf[m.offset+12:end])
end

I do not think that corresponds to the number of threads that BLAS uses. I get MAX_THREADS=16 on my laptop with 4 cores.

2 Likes

You can do

ccall((BLAS.@blasfunc(openblas_get_num_threads), Base.libblas_name), LinearAlgebra.BlasInt, ())
5 Likes

To be clear, what you are translating is the reference BLAS routine. This is often much slower than an optimized BLAS like OpenBLAS.

7 Likes

Without threads and using interpolations variables we have the same results

18.022 ms (0 allocations: 0 bytes) # OpenBLAS
31.980 ms (0 allocations: 0 bytes) # my sapy!
34.621 ms (0 allocations: 0 bytes) # $y .= $a.*$x .+ $y

To be clear, what you are translating is the reference BLAS routine. This is often much slower than an optimized BLAS like OpenBLAS.

AXPY and all BLAS level 1 routines have no specific optimization. It’s just a for loop.

BLAS1 operations certainly benefit much less from optimization than BLAS2 and BLAS3 operations (due to the lack of memory re-use), but it’s not true in general that they cannot be optimized compared to a naive loop that does one operation per loop iteration (as in the reference BLAS). For example, they can use hand-coded SIMD instructions, can be strength-reduced/unrolled to expose greater instruction-level parallelism to the compiler, and can even be multi-threaded for large enough vectors.

For example, OpenBLAS uses a hand-coded assembly AXPY kernel to maximize utilization of SSE2 SIMD instructions, and even its higher-level AXPY implementation is more complicated than the reference BLAS. OpenBLAS also has lots of other optimized BLAS1 functions like ddot. They wouldn’t have gone to this trouble if it was just as efficient to compile the reference BLAS with appropriate optimization flags.

7 Likes

For a simple loop, any optimizing compiler can do the same if you allow it by passing -msse2, -mavx or -mavx512 (or the corresponding Neon flags on ARM).

The only reason they do it this way is that they do runtime CPU detection so that packagers can just do a simple make and the binary can be used with full performance whether the final user only supports sse2 or AVX512

Unfortunately, not always. As a simple example, two of the arguments for daxpy are strides of the input arrays, but optimizing compilers don’t know that unit strides (contiguous data) are a common case and that they should specialize for it.

Even if you specialize for the unit-stride case manually (which is also done in the reference BLAS), however, I just tried on my laptop (AVX2 64-bit Intel cpu) to compile a naive loop implementation (written in C: for (size_t i=0; i<n; ++i) y[i]+=a*x[i];) with gcc-9 -O3 -mavx2, and the result was about 20% slower for length-1000 arrays than the ILP64 OpenBLAS daxpy.

You can do runtime CPU detection without hand-coding assembly. Just write a single daxpy_kernel.c file with a function DAXPY_KERNEL and then compile it multiple times with different flags and link all of the resulting object code into the library. (For example, compile it with -O3 -mavx2 -DDAXPY_KERNEL=daxpy_kernel_avx2 -o daxpy_kernel_avx2.o.) Then select among the kernels at runtime by checking cpuid or similar.

Compilers have gotten very good at vectorizing simple loops, and some of the OpenBLAS source may be legacy code from the days of more primitive compilers. And definitely the benefits of hand-optimizing BLAS1 kernels are slim compared to BLAS3. But there are still gains to be eked out in many cases for critical code.

2 Likes

A simple loop like this should completely saturate your memory bandwidth even with non ideal instructions as it is completely predictable.

I created a test program and on my machine there seem to be a serious GCC bug on simple axpy as Clang is 3.5x faster and within 1% of OpenBLAS

Test program: https://gist.github.com/mratsim/cec39df71606eccab93255217141b274
Godbolt: https://gcc.godbolt.org/z/atCLeM

For reference this is OpenBLAS Assembly

When i say non-ideal, that would be for example using SSE on an AVX machine.

4 Likes

Unlike you, I was looking at a smaller array that fit in L1 cache and I benchmarked running daxpy repeatedly on the same array, taking the minimum time over many iterations. I think this is more representative of typical usage than allocating a large array and performing daxpy only once.

However, I was forgetting the -mfma flag. When I added this, some of the time difference went away, and the peak slowdown is now only about 10–15%. I couldn’t reproduce your gcc problem, though.

(For very small arrays, of course, OpenBLAS begins to pay a slight performance price for things like the runtime kernel-switching. Even for BLAS3 there is a crossover where a simple nested loop is faster, but I’m guessing it happens at a much smaller size. Benchmarking smaller sizes is also much noisier, as can be seen above, and trying to collect good data is a bit frustrating.)

Also, consider the context of this thread. The original poster was doing a benchmark of code that would not have approached the speed of OpenBLAS because they didn’t turn on SIMD optimization (@simd in Julia code, or corresponding flags in the Fortran code). Even with good compilers, most people won’t approach the performance of an optimized BLAS library even for BLAS1 code for “trivial” reasons like not knowing about the best flags to use (or about things like restrict, for that matter). Not to mention snafus like the gcc issue on your machine.

My initial point above was that the original poster thought they were comparing their Julia code to optimized BLAS performance, but in fact this was not the case. (Comparing straightforward loops in Julia to the corresponding code in Fortran, without playing with compilers too much, is still worthwhile though!)

8 Likes