Another BLAS and Julia comparison

Just for fun I did a direct translation of the BLAS function `saxpy`, which is written in Fortran (you can find it here), into Julia. With two small optimizations (`@inbounds` and `@threads`) I can get almost the same speed as `BLAS.axpy!`.

``````
function saxpy!(N, SA, SX, INCX::Int, SY, INCY::Int)
    if (N ≤ 0) return nothing end
    if (SA == 0.0) return nothing end
    if (INCX == 1) && (INCY == 1)
        #
        #         code for both increments equal to 1
        #
        #
        #         clean-up loop
        #
        M = N % 4
        if (M ≠ 0)
            for I = 1:M
                @inbounds SY[I] = SY[I] + SA*SX[I]
            end
        end
        if (N < 4) return nothing end
        MP1 = M + 1
        Threads.@threads for I = MP1:4:N
            @inbounds begin
                SY[I]   = SY[I]   + SA*SX[I]
                SY[I+1] = SY[I+1] + SA*SX[I+1]
                SY[I+2] = SY[I+2] + SA*SX[I+2]
                SY[I+3] = SY[I+3] + SA*SX[I+3]
            end
        end
    else
        #
        #        code for unequal increments or equal increments
        #          not equal to 1
        #
        IX = 1
        IY = 1
        if (INCX < 0)
            IX = (-N + 1)*INCX + 1
        end
        if (INCY < 0)
            IY = (-N + 1)*INCY + 1
        end
        for I = 1:N
            @inbounds SY[IY] = SY[IY] + SA*SX[IX]
            IX = IX + INCX
            IY = IY + INCY
        end
    end
    return nothing
end
``````
``````
using LinearAlgebra
using BenchmarkTools

N = 10^7
const a = 0.2
x = convert(Array{Float64}, collect(1:N))
y = convert(Array{Float64}, collect(1:N))
@btime BLAS.axpy!($a, $x, $y)

x = convert(Array{Float64}, collect(1:N))
y = convert(Array{Float64}, collect(1:N))
@btime saxpy!($N, $a, $x, 1, $y, 1)

x = convert(Array{Float64}, collect(1:N))
y = convert(Array{Float64}, collect(1:N))
@btime y .= a.*x + y;
``````

Results in

``````
 17.958 ms (0 allocations: 0 bytes)
 20.543 ms (3 allocations: 96 bytes)
214.485 ms (8 allocations: 152.59 MiB)
``````

I must say congratulations to the Julia community for allowing this level of performance while keeping the code so simple. But I'm here to ask: can the code still be optimized?

2 Likes

You’re missing a `.` before the `+ y` in the last line, and you’re not interpolating the variables:

``````
julia> @btime $y .= $a.*$x .+ $y
  12.048 ms (0 allocations: 0 bytes)
``````
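To see the difference concretely: with a bare `+`, the right-hand side materializes a temporary array before the assignment, while the fully dotted version fuses into a single allocation-free in-place loop. A minimal sketch (the helper names are just for illustration):

``````
# Non-fused: `a .* x + y` allocates a temporary array for the sum,
# which is then copied into y.
axpy_unfused!(y, a, x) = (y .= a .* x + y)

# Fully fused: one in-place elementwise loop, essentially no allocations.
axpy_fused!(y, a, x) = (y .= a .* x .+ y)

x = rand(10^6)
y = rand(10^6)
axpy_unfused!(y, 0.2, x)  # warm up so compilation is not measured
axpy_fused!(y, 0.2, x)

@show @allocated axpy_unfused!(y, 0.2, x)  # megabytes of temporaries
@show @allocated axpy_fused!(y, 0.2, x)    # ~0 bytes
``````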

I don’t think it’s a fair comparison to use threads in Julia. What are the timings without using threads?

4 Likes

OpenBLAS uses threads, and Julia’s default BLAS is OpenBLAS.
Really the number of threads should be configurable, as it is in OpenBLAS
(but this is nontrivial, unless you are willing to use `JULIA_NUM_THREADS` as the number of threads, in which case it is intrinsically already done).

Could you please remind us how to check the number of OpenBLAS threads?

Mostly correct:

``````
ENV["OPENBLAS_NUM_THREADS"]
``````

More correct:

``````
get(() -> ENV["OMP_NUM_THREADS"], ENV, "OPENBLAS_NUM_THREADS")
``````

The above works regardless of whether the environment variables are defined outside Julia or not.

On startup, if neither is defined, `OPENBLAS_NUM_THREADS` is set to the number of cores on your machine, capped at 8.

See:

1 Like

If I recall correctly, there is no `BLAS.get_num_threads` corresponding to the existing `BLAS.set_num_threads`. I think there is a BLAS function (per implementation) that could be wrapped; it just hasn’t been done.

EDIT: I forgot that I wrote one. Here is a gist. The part of the gist setting the BLAS vendor is out of date; Jeff made a change rendering it unnecessary.

I use this — which is probably the wrong way to go about it…

``````
using LinearAlgebra

# Parse MAX_THREADS out of the OpenBLAS build configuration string.
function openblas_max_threads()
    bconf = LinearAlgebra.BLAS.openblas_get_config()
    m = match(r"MAX_THREADS", bconf)    # locate "MAX_THREADS=…"
    parse(Int, bconf[m.offset+12:end])  # skip past "MAX_THREADS="
end
``````

I do not think that corresponds to the number of threads that BLAS uses. I get `MAX_THREADS=16` on my laptop with 4 cores.

2 Likes

You can do

``````
ccall((BLAS.@blasfunc(openblas_get_num_threads), Base.libblas_name), LinearAlgebra.BlasInt, ())
``````
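For what it’s worth, newer Julia versions (1.6 and later) ship this as `BLAS.get_num_threads` in the standard library, so the `ccall` is only needed on older releases:

``````
using LinearAlgebra: BLAS

# Round-trips with BLAS.set_num_threads on Julia 1.6+.
BLAS.set_num_threads(2)
@show BLAS.get_num_threads()  # should now report 2
``````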
5 Likes

To be clear, what you are translating is the reference BLAS routine. This is often much slower than an optimized BLAS like OpenBLAS.

7 Likes

Without threads and with interpolated variables, I get the following results:

``````
18.022 ms (0 allocations: 0 bytes) # OpenBLAS
31.980 ms (0 allocations: 0 bytes) # my saxpy!
34.621 ms (0 allocations: 0 bytes) # $y .= $a.*$x .+ $y
``````

To be clear, what you are translating is the reference BLAS routine. This is often much slower than an optimized BLAS like OpenBLAS.

AXPY and all BLAS level 1 routines have no specific optimization. It’s just a for loop.

BLAS1 operations certainly benefit much less from optimization than BLAS2 and BLAS3 operations (due to the lack of memory re-use), but it’s not true in general that they cannot be optimized compared to a naive loop that does one operation per loop iteration (as in the reference BLAS). For example, they can use hand-coded SIMD instructions, can be strength-reduced/unrolled to expose greater instruction-level parallelism to the compiler, and can even be multi-threaded for large enough vectors.
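For the unit-stride case, exposing SIMD and FMA to the Julia compiler is mostly a matter of annotations; a minimal sketch (the function name is illustrative):

``````
function axpy_simd!(y, a, x)
    # @inbounds drops bounds checks; @simd lets the compiler vectorize
    # and reassociate the loop; muladd can lower to an FMA instruction.
    @inbounds @simd for i in eachindex(x, y)
        y[i] = muladd(a, x[i], y[i])
    end
    return y
end
``````

For contiguous arrays this typically gets within a small factor of an optimized BLAS1 kernel; the strided case would still need separate handling.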

For example, OpenBLAS uses a hand-coded assembly AXPY kernel to maximize utilization of SSE2 SIMD instructions, and even its higher-level AXPY implementation is more complicated than the reference BLAS. OpenBLAS also has lots of other optimized BLAS1 functions like `ddot`. They wouldn’t have gone to this trouble if it was just as efficient to compile the reference BLAS with appropriate optimization flags.

7 Likes

For a simple loop, any optimizing compiler can do the same if you let it by passing `-msse2`, `-mavx`, or `-mavx512f` (or the corresponding NEON flags on ARM).

The only reason they do it this way is runtime CPU detection: packagers can do a plain `make`, and the resulting binary runs at full performance whether the end user’s CPU supports only SSE2 or all of AVX-512.

Unfortunately, not always. As a simple example, two of the arguments for `daxpy` are strides of the input arrays, but optimizing compilers don’t know that unit strides (contiguous data) are a common case and that they should specialize for it.

Even if you specialize for the unit-stride case manually (which is also done in the reference BLAS), however, I just tried on my laptop (AVX2 64-bit Intel cpu) to compile a naive loop implementation (written in C: `for (size_t i=0; i<n; ++i) y[i]+=a*x[i];`) with `gcc-9 -O3 -mavx2`, and the result was about 20% slower for length-1000 arrays than the ILP64 OpenBLAS `daxpy`.

You can do runtime CPU detection without hand-coding assembly. Just write a single `daxpy_kernel.c` file with a function `DAXPY_KERNEL` and then compile it multiple times with different flags and link all of the resulting object code into the library. (For example, compile it with `-O3 -mavx2 -DDAXPY_KERNEL=daxpy_kernel_avx2 -o daxpy_kernel_avx2.o`.) Then select among the kernels at runtime by checking `cpuid` or similar.

Compilers have gotten very good at vectorizing simple loops, and some of the OpenBLAS source may be legacy code from the days of more primitive compilers. And definitely the benefits of hand-optimizing BLAS1 kernels are slim compared to BLAS3. But there are still gains to be eked out in many cases for critical code.

2 Likes

A simple loop like this should completely saturate your memory bandwidth even with non ideal instructions as it is completely predictable.

I created a test program, and on my machine there seems to be a serious GCC bug on simple axpy, as Clang is 3.5× faster and within 1% of OpenBLAS.

For reference this is OpenBLAS Assembly

When I say non-ideal, that would be, for example, using SSE on an AVX machine.

4 Likes

Unlike you, I was looking at a smaller array that fit in L1 cache and I benchmarked running `daxpy` repeatedly on the same array, taking the minimum time over many iterations. I think this is more representative of typical usage than allocating a large array and performing `daxpy` only once.

However, I was forgetting the `-mfma` flag. When I added this, some of the time difference went away, and the peak slowdown is now only about 10–15%. I couldn’t reproduce your gcc problem, though.

(For very small arrays, of course, OpenBLAS begins to pay a slight performance price for things like the runtime kernel-switching. Even for BLAS3 there is a crossover where a simple nested loop is faster, but I’m guessing it happens at a much smaller size. Benchmarking smaller sizes is also much noisier, as can be seen above, and trying to collect good data is a bit frustrating.)

Also, consider the context of this thread. The original poster was doing a benchmark of code that would not have approached the speed of OpenBLAS because they didn’t turn on SIMD optimization (`@simd` in Julia code, or corresponding flags in the Fortran code). Even with good compilers, most people won’t approach the performance of an optimized BLAS library even for BLAS1 code for “trivial” reasons like not knowing about the best flags to use (or about things like `restrict`, for that matter). Not to mention snafus like the gcc issue on your machine.

My initial point above was that the original poster thought they were comparing their Julia code to optimized BLAS performance, but in fact this was not the case. (Comparing straightforward loops in Julia to the corresponding code in Fortran, without playing with compilers too much, is still worthwhile though!)

8 Likes