I am preparing a presentation/demo for an upcoming meetup and want to showcase that “for loops” in Julia are fast. How fast? Fast enough to write simple code that compares to BLAS routines.

So I decided to implement a “custom_axpy!” function and compare it with the OpenBLAS version.

Here goes the code:

```
using BenchmarkTools
using LinearAlgebra
""" Example 1 - implementing BLAS axpy! with for loop and Threads """
a = 2.0
b = Float64.(collect(1:100_000_000)) # create a big array of Floats
c = Float64.(collect(1:100_000_000)) # create a big array of Floats
# our custom axpy!, multithreaded / single threaded
function custom_threaded_axpy!(a,b,c)
Threads.@threads for i = 1:length(c)
c[i] = a*b[i]+c[i]
end
return nothing
end
function custom_non_threaded_axpy!(a,b,c)
for i = 1:length(c)
c[i] = a*b[i]+c[i]
end
return nothing
end
# now let's test!
# Single threaded
LinearAlgebra.BLAS.set_num_threads(1)
@btime LinearAlgebra.BLAS.axpy!(a,b,c);
@btime custom_non_threaded_axpy!(a,b,c);
# Multi threaded
LinearAlgebra.BLAS.set_num_threads(4)
@btime LinearAlgebra.BLAS.axpy!(a,b,c);
@btime custom_threaded_axpy!(a,b,c);
```

Depending on machine and CPU resource management, I don’t always see an improvement when using multiple threads.

The good thing is that the custom function has comparable performance to the BLAS routine (just a tiny bit slower).

Please share your feedback about the code, how to improve performance or additional examples.

Thanks in advance for your help!

*tested with Julia 1.1.1

1 Like

It’ll go just a little faster if you add `@inbounds`

. Or use broadcasting, which does that for you:

```
julia> b_axpy!(a,b,c) = (c .= a .* b .+ c; nothing)
b_axpy! (generic function with 1 method)
julia> @btime b_axpy!(a,b,c);
92.255 ms (0 allocations: 0 bytes)
julia> function inbounds_axpy!(a,b,c)
for i = 1:length(c)
@inbounds c[i] = a*b[i]+c[i]
end
return nothing
end
inbounds_axpy! (generic function with 1 method)
julia> @btime inbounds_axpy!(a,b,c);
92.229 ms (0 allocations: 0 bytes)
julia> @btime LinearAlgebra.BLAS.axpy!(a,b,c);
92.531 ms (0 allocations: 0 bytes)
julia> @btime custom_non_threaded_axpy!(a,b,c);
99.350 ms (0 allocations: 0 bytes)
```

3 Likes

Are you sure your own code is running with multiple threads? (Did you set `JULIA_NUM_THREADS`

?)

Many thanks for the suggestions and the tutorial! Indeed, @inbounds helps.

I’m using Juno, so I set the number of threads in Settings to be equal to the number of physical cores (2 on the current machine). I also double checked with Threads.nthreads() and played with the JULIA_NUM_THREADS.

The results are similar, regardless of the number of threads.

```
a = 2.0
b = Float64.(collect(1:100_000_000)) # create a big array of Floats
c = Float64.(collect(1:100_000_000)) # create a big array of Floats
# our custom axpy!, multithreaded / single threaded / @inbounds
function custom_threaded_axpy!(a,b,c)
Threads.@threads for i = 1:length(c)
c[i] = a*b[i]+c[i]
end
return nothing
end
function custom_non_threaded_axpy!(a,b,c)
for i = 1:length(c)
c[i] = a*b[i]+c[i]
end
return nothing
end
function threaded_inbounds_axpy!(a,b,c)
Threads.@threads for i = 1:length(c)
@inbounds c[i] = a*b[i]+c[i]
end
return nothing
end
function inbounds_axpy!(a,b,c)
for i = 1:length(c)
@inbounds c[i] = a*b[i]+c[i]
end
return nothing
end
Julia Threads set to 2:
LinearAlgebra.BLAS.set_num_threads(1)
@btime LinearAlgebra.BLAS.axpy!(a,b,c); # BLAS runs single threaded
214.561 ms (0 allocations: 0 bytes)
@btime custom_non_threaded_axpy!(a,b,c); # function runs on a single thread
224.731 ms (0 allocations: 0 bytes)
@btime inbounds_axpy!(a,b,c); # function runs on a single thread
213.999 ms (0 allocations: 0 bytes)
LinearAlgebra.BLAS.set_num_threads(2)
@btime LinearAlgebra.BLAS.axpy!(a,b,c); # BLAS runs multi threaded
210.972 ms (0 allocations: 0 bytes)
@btime custom_threaded_axpy!(a,b,c); # function runs on two threads
214.327 ms (1 allocation: 48 bytes)
@btime threaded_inbounds_axpy!(a,b,c); # function runs on two threads
220.438 ms (1 allocation: 48 bytes) # a second run was 213ms
_____________________________________
Julia Threads set to 1:
LinearAlgebra.BLAS.set_num_threads(1)
@btime LinearAlgebra.BLAS.axpy!(a,b,c); # BLAS runs single threaded
216.450 ms (0 allocations: 0 bytes)
@btime custom_non_threaded_axpy!(a,b,c); # function runs on a single thread
226.447 ms (0 allocations: 0 bytes)
@btime inbounds_axpy!(a,b,c); # function runs on a single thread
216.408 ms (0 allocations: 0 bytes)
LinearAlgebra.BLAS.set_num_threads(2)
@btime LinearAlgebra.BLAS.axpy!(a,b,c); # BLAS runs multi threaded
212.855 ms (0 allocations: 0 bytes)
@btime custom_threaded_axpy!(a,b,c); # function runs on two threads
215.051 ms (1 allocation: 48 bytes)
@btime threaded_inbounds_axpy!(a,b,c); # function runs on two threads
212.970 ms (1 allocation: 48 bytes)
```

1 Like

Note that the axpy operation is memory-bound (at least for vectors as large as used here). In your results, even the BLAS routine doesn’t speed up with more threads. This indicates that your machine has only one memory channel (laptop?). When I run `BLAS.axpy!`

and your `threaded_inbounds_axpy!`

on my machine, I do see a speedup with more threads (roughly 2x for 2 threads) and moreover, both routines deliver the same timing. Tested with Julia 1.1.1 and 1.2.0, which give the same timings, though 1.2.0 seems to have a regression as it allocates more.

2 Likes

You are right. The timings above were obtained on a laptop with only one memory channel. Thank you for the clarification!

1 Like