# OpenBLAS vs. "for loops" demo

I am preparing a presentation/demo for an upcoming meetup and want to showcase that "for loops" in Julia are fast. How fast? Fast enough that simple hand-written code is competitive with BLAS routines.
So I implemented a custom `custom_axpy!` function and compared it with the OpenBLAS version.
Here is the code:

``````
using BenchmarkTools
using LinearAlgebra

""" Example 1 - implementing BLAS axpy! with a for loop and Threads """
a = 2.0
b = Float64.(collect(1:100_000_000)) # create a big array of Floats
c = Float64.(collect(1:100_000_000)) # create a big array of Floats

function threaded_axpy!(a,b,c)
    Threads.@threads for i = 1:length(c)
        c[i] = a*b[i]+c[i]
    end
    return nothing
end

function custom_axpy!(a,b,c)
    for i = 1:length(c)
        c[i] = a*b[i]+c[i]
    end
    return nothing
end

# now let's test!
@btime LinearAlgebra.BLAS.axpy!(a,b,c);
@btime custom_axpy!(a,b,c);
@btime threaded_axpy!(a,b,c);
``````

Depending on the machine and how CPU resources are managed, I don't always see an improvement when using multiple threads.
The good news is that the custom function has performance comparable to the BLAS routine (just a tiny bit slower).
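One practical tip for demos like this: OpenBLAS's thread count can be pinned explicitly, so the single-threaded comparison is apples-to-apples. A minimal sketch using the standard `BLAS.set_num_threads` call (small vectors here, just to verify correctness rather than benchmark):

```julia
using LinearAlgebra

# Pin OpenBLAS to one thread so it competes with a single-threaded Julia loop.
BLAS.set_num_threads(1)

a  = 2.0
b  = rand(10_000)
c  = rand(10_000)
c0 = copy(c)

BLAS.axpy!(a, b, c)  # in-place c = a*b + c, via OpenBLAS

# Sanity check against the straightforward broadcast definition.
err = maximum(abs.(c .- (a .* b .+ c0)))
println(err)  # expected to be 0.0 or on the order of machine epsilon
```

Calling `BLAS.set_num_threads(n)` again before the multi-threaded run keeps the two scenarios clearly separated.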

*Tested with Julia 1.1.1*


It’ll go just a little faster if you add `@inbounds`. Or use broadcasting, which does that for you:

``````
julia> b_axpy!(a,b,c) = (c .= a .* b .+ c; nothing)
b_axpy! (generic function with 1 method)

julia> @btime b_axpy!(a,b,c);
92.255 ms (0 allocations: 0 bytes)

julia> function inbounds_axpy!(a,b,c)
for i = 1:length(c)
@inbounds c[i] = a*b[i]+c[i]
end
return nothing
end
inbounds_axpy! (generic function with 1 method)

julia> @btime inbounds_axpy!(a,b,c);
92.229 ms (0 allocations: 0 bytes)

julia> @btime LinearAlgebra.BLAS.axpy!(a,b,c);
92.531 ms (0 allocations: 0 bytes)

``````

PS: Actually, it should be `daxpy!` in my code.


Are you sure your own code is running with multiple threads? (Did you set `JULIA_NUM_THREADS`?)

Many thanks for the suggestions and the tutorial! Indeed, `@inbounds` helps.
I'm using Juno, so I set the number of threads in Settings equal to the number of physical cores (2 on the current machine). I also double-checked with `Threads.nthreads()` and experimented with the `JULIA_NUM_THREADS` environment variable.
The results are similar regardless of the number of threads.
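One more way to confirm that `Threads.@threads` is really fanning out is to record which thread id touches each index and count the distinct ids. A small sketch (the helper name `thread_ids_used` is made up for illustration):

```julia
using Base.Threads

function thread_ids_used(n)
    ids = zeros(Int, n)
    @threads for i in 1:n
        ids[i] = threadid()  # record which thread handled index i
    end
    return unique(ids)
end

# With JULIA_NUM_THREADS=2 this should report two distinct ids;
# with the default of 1 it reports only [1].
println(nthreads(), " => ", thread_ids_used(1_000_000))
```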

``````
a = 2.0
b = Float64.(collect(1:100_000_000)) # create a big array of Floats
c = Float64.(collect(1:100_000_000)) # create a big array of Floats

function threaded_axpy!(a,b,c)
    Threads.@threads for i = 1:length(c)
        c[i] = a*b[i]+c[i]
    end
    return nothing
end

function custom_axpy!(a,b,c)
    for i = 1:length(c)
        c[i] = a*b[i]+c[i]
    end
    return nothing
end

function threaded_inbounds_axpy!(a,b,c)
    Threads.@threads for i = 1:length(c)
        @inbounds c[i] = a*b[i]+c[i]
    end
    return nothing
end

function inbounds_axpy!(a,b,c)
    for i = 1:length(c)
        @inbounds c[i] = a*b[i]+c[i]
    end
    return nothing
end

@btime LinearAlgebra.BLAS.axpy!(a,b,c); # BLAS runs single threaded
  214.561 ms (0 allocations: 0 bytes)
@btime custom_axpy!(a,b,c);             # plain loop, single thread
  224.731 ms (0 allocations: 0 bytes)
@btime inbounds_axpy!(a,b,c);           # function runs on a single thread
  213.999 ms (0 allocations: 0 bytes)

@btime LinearAlgebra.BLAS.axpy!(a,b,c); # BLAS runs multi threaded
  210.972 ms (0 allocations: 0 bytes)
@btime threaded_axpy!(a,b,c);           # two threads
  214.327 ms (1 allocation: 48 bytes)
@btime threaded_inbounds_axpy!(a,b,c);  # two threads
  220.438 ms (1 allocation: 48 bytes) # a second run was 213ms
_____________________________________
@btime LinearAlgebra.BLAS.axpy!(a,b,c); # BLAS runs single threaded
  216.450 ms (0 allocations: 0 bytes)
@btime custom_axpy!(a,b,c);             # plain loop, single thread
  226.447 ms (0 allocations: 0 bytes)
@btime inbounds_axpy!(a,b,c);           # function runs on a single thread
  216.408 ms (0 allocations: 0 bytes)

@btime LinearAlgebra.BLAS.axpy!(a,b,c); # BLAS runs multi threaded
  212.855 ms (0 allocations: 0 bytes)
``````
Note that the axpy operation is memory-bound (at least for vectors as large as those used here). In your results, even the BLAS routine doesn't speed up with more threads, which indicates that your machine has only one memory channel (laptop?).

When I run `BLAS.axpy!` and your `threaded_inbounds_axpy!` on my machine, I do see a speedup with more threads (roughly 2x for 2 threads), and both routines deliver the same timing. Tested with Julia 1.1.1 and 1.2.0, which give the same timings, though 1.2.0 seems to have a regression: it allocates more.
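A quick back-of-the-envelope check supports the memory-bound reading: `axpy!` streams roughly 3 × 8 bytes per element (read `b`, read `c`, write `c`), so the ~214 ms timings above correspond to about 11 GB/s, in the ballpark of a single memory channel. A sketch of the arithmetic:

```julia
N = 100_000_000          # vector length from the benchmarks above
t = 0.214                # ≈ 214 ms, a typical timing reported above
bytes = 3 * 8 * N        # read b, read c, write back c (Float64 = 8 bytes each)
gb_per_s = bytes / t / 1e9
println(gb_per_s)        # ≈ 11.2 GB/s
```

At that point the cores are waiting on DRAM, so extra threads only add overhead.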