Just for fun I did a direct translation of the BLAS function saxpy, which is written in Fortran (you can find it here), into Julia. With two small optimizations (@inbounds and @threads) I can get almost the same speed as BLAS.axpy!.
function saxpy!(N, SA, SX, INCX::Int, SY, INCY::Int)
    if N ≤ 0
        return nothing
    end
    if SA == 0.0
        return nothing
    end
    if (INCX == 1) && (INCY == 1)
        #
        # code for both increments equal to 1
        #
        #
        # clean-up loop
        #
        M = N % 4
        if M ≠ 0
            @inbounds Threads.@threads for I = 1:M
                SY[I] = SY[I] + SA*SX[I]
            end
        end
        if N < 4
            return nothing
        end
        MP1 = M + 1
        @inbounds Threads.@threads for I = MP1:4:N
            SY[I]   = SY[I]   + SA*SX[I]
            SY[I+1] = SY[I+1] + SA*SX[I+1]
            SY[I+2] = SY[I+2] + SA*SX[I+2]
            SY[I+3] = SY[I+3] + SA*SX[I+3]
        end
    else
        #
        # code for unequal increments or equal increments
        # not equal to 1
        #
        IX = 1
        IY = 1
        if INCX < 0
            IX = (-N + 1)*INCX + 1
        end
        if INCY < 0
            IY = (-N + 1)*INCY + 1
        end
        # Compute each strided index from the loop counter so the iterations
        # stay independent under Threads.@threads (incrementing IX/IY inside
        # the threaded loop would give wrong results with more than one thread).
        @inbounds Threads.@threads for I = 1:N
            SY[IY + (I-1)*INCY] = SY[IY + (I-1)*INCY] + SA*SX[IX + (I-1)*INCX]
        end
    end
    return nothing
end
using LinearAlgebra
using BenchmarkTools
N = 10^7
const a = 0.2
x = convert(Array{Float64},collect(1:N))
y = convert(Array{Float64},collect(1:N))
@btime BLAS.axpy!($a,$x,$y)
x = convert(Array{Float64},collect(1:N))
y = convert(Array{Float64},collect(1:N))
@btime saxpy!($N,$a,$x,1,$y,1)
x = convert(Array{Float64},collect(1:N))
y = convert(Array{Float64},collect(1:N))
@btime y .= a.*x + y;
This results in:
17.958 ms (0 allocations: 0 bytes)        # BLAS.axpy!
20.543 ms (3 allocations: 96 bytes)       # saxpy!
214.485 ms (8 allocations: 152.59 MiB)    # y .= a.*x + y
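An aside on the last line: a.*x + y materializes a temporary array before the in-place assignment, which is where the 152.59 MiB of allocations comes from. Fusing the addition into the same broadcast should remove the temporary; a minimal variant, not timed here:

    y .= a .* x .+ y    # fully fused: one in-place loop, no intermediate array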
I must say congratulations to the Julia community for allowing this level of performance while keeping the code this simple. But I'm here to ask: can the code still be optimized?
OpenBLAS uses threads. Julia’s default BLAS is OpenBLAS.
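If you want the comparison on an equal thread-count footing, one option is to pin OpenBLAS to the same number of threads the @threads loops use. A small sketch (choosing Threads.nthreads() as the common value is just a convenient convention, not something the original benchmark did):

    using LinearAlgebra
    # Give OpenBLAS the same thread budget as the Julia-side loops in saxpy!
    BLAS.set_num_threads(Threads.nthreads())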
Really, the number of threads should be configurable, as it is in OpenBLAS
(but this is nontrivial, unless you are willing to use JULIA_NUM_THREADS as the number of threads, in which case it is intrinsically already done).
If I recall correctly, there is no BLAS.get_num_threads corresponding to the existing BLAS.set_num_threads. I think there is a BLAS function (per implementation) to be wrapped; it just has not been done.
EDIT: I forgot that I wrote one. Here is a gist. The part of the gist that sets the BLAS vendor is out of date; Jeff made a change rendering it unnecessary.
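For OpenBLAS in particular, the function that would need wrapping is openblas_get_num_threads. A rough sketch, assuming Julia's bundled ILP64 OpenBLAS (whose exported symbols carry a 64_ suffix) and Base.libblas_name as the library handle; other BLAS builds would need different names:

    # Hypothetical wrapper; the symbol and library names are assumptions tied to
    # the ILP64 OpenBLAS shipped with Julia and may differ elsewhere.
    openblas_num_threads() =
        ccall((:openblas_get_num_threads64_, Base.libblas_name), Cint, ())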
I use this, which is probably the wrong way to go about it…
using LinearAlgebra

function get_num_threads()
    # Pull the MAX_THREADS entry out of the OpenBLAS build-configuration string.
    # Note this is the compile-time maximum, not necessarily the count in use.
    bconf = LinearAlgebra.BLAS.openblas_get_config()
    m = match(r"MAX_THREADS=(\d+)", bconf)
    parse(Int, m.captures[1])
end
BLAS1 operations certainly benefit much less from optimization than BLAS2 and BLAS3 operations (due to the lack of memory re-use), but it’s not true in general that they cannot be optimized compared to a naive loop that does one operation per loop iteration (as in the reference BLAS). For example, they can use hand-coded SIMD instructions, can be strength-reduced/unrolled to expose greater instruction-level parallelism to the compiler, and can even be multi-threaded for large enough vectors.
For example, OpenBLAS uses a hand-coded assembly AXPY kernel to maximize utilization of SSE2 SIMD instructions, and even its higher-level AXPY implementation is more complicated than the reference BLAS. OpenBLAS also has lots of other optimized BLAS1 functions like ddot. They wouldn’t have gone to this trouble if it was just as efficient to compile the reference BLAS with appropriate optimization flags.
For a simple loop, any optimizing compiler can do the same if you allow it to, by passing -msse2, -mavx, or -mavx512f (or the corresponding NEON flags on ARM).
The only reason they do it this way is runtime CPU detection: packagers can just do a simple make, and the binary runs at full performance whether the end user's CPU supports only SSE2 or full AVX-512.
Unfortunately, not always. As a simple example, two of the arguments for daxpy are strides of the input arrays, but optimizing compilers don’t know that unit strides (contiguous data) are a common case and that they should specialize for it.
Even if you specialize for the unit-stride case manually (which is also done in the reference BLAS), however, I just tried on my laptop (a 64-bit Intel CPU with AVX2) to compile a naive loop implementation (written in C: for (size_t i=0; i<n; ++i) y[i]+=a*x[i];) with gcc-9 -O3 -mavx2, and the result was about 20% slower for length-1000 arrays than the ILP64 OpenBLAS daxpy.
You can do runtime CPU detection without hand-coding assembly. Just write a single daxpy_kernel.c file with a function DAXPY_KERNEL and then compile it multiple times with different flags and link all of the resulting object code into the library. (For example, compile it with -O3 -mavx2 -DDAXPY_KERNEL=daxpy_kernel_avx2 -o daxpy_kernel_avx2.o.) Then select among the kernels at runtime by checking cpuid or similar.
Compilers have gotten very good at vectorizing simple loops, and some of the OpenBLAS source may be legacy code from the days of more primitive compilers. And definitely the benefits of hand-optimizing BLAS1 kernels are slim compared to BLAS3. But there are still gains to be eked out in many cases for critical code.
Unlike you, I was looking at a smaller array that fit in L1 cache and I benchmarked running daxpy repeatedly on the same array, taking the minimum time over many iterations. I think this is more representative of typical usage than allocating a large array and performing daxpy only once.
However, I was forgetting the -mfma flag. When I added this, some of the time difference went away, and the peak slowdown is now only about 10–15%. I couldn’t reproduce your gcc problem, though.
(For very small arrays, of course, OpenBLAS begins to pay a slight performance price for things like the runtime kernel-switching. Even for BLAS3 there is a crossover where a simple nested loop is faster, but I’m guessing it happens at a much smaller size. Benchmarking smaller sizes is also much noisier, as can be seen above, and trying to collect good data is a bit frustrating.)
Also, consider the context of this thread. The original poster was doing a benchmark of code that would not have approached the speed of OpenBLAS because they didn’t turn on SIMD optimization (@simd in Julia code, or corresponding flags in the Fortran code). Even with good compilers, most people won’t approach the performance of an optimized BLAS library even for BLAS1 code for “trivial” reasons like not knowing about the best flags to use (or about things like restrict, for that matter). Not to mention snafus like the gcc issue on your machine.
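To make the "@simd in Julia code" point concrete: for the unit-stride case it is mostly a matter of annotating the loop and letting the compiler do the vectorizing and unrolling itself. A sketch (the function name is made up for illustration):

    # @inbounds drops the bounds checks and @simd promises the iterations are
    # independent, so the compiler is free to vectorize and unroll this loop.
    function axpy_kernel!(a, x, y)
        @inbounds @simd for i in eachindex(x, y)
            y[i] += a * x[i]
        end
        return y
    end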
My initial point above was that the original poster thought they were comparing their Julia code to optimized BLAS performance, but in fact this was not the case. (Comparing straightforward loops in Julia to the corresponding code in Fortran, without playing with compilers too much, is still worthwhile though!)