# OpenBLAS vs. "for loops" demo

I am preparing a presentation/demo for an upcoming meetup and want to showcase that "for loops" in Julia are fast. How fast? Fast enough that simple hand-written code is competitive with BLAS routines.
So I implemented a custom `custom_axpy!` function and compared it with the OpenBLAS version.
Here is the code:

``````
using BenchmarkTools
using LinearAlgebra

""" Example 1 - implementing BLAS axpy! with a for loop and Threads """
a = 2.0
b = Float64.(collect(1:100_000_000)) # create a big array of Floats
c = Float64.(collect(1:100_000_000)) # create a big array of Floats

function threaded_axpy!(a,b,c)
    Threads.@threads for i = 1:length(c)
        c[i] = a*b[i]+c[i]
    end
    return nothing
end

function custom_axpy!(a,b,c)
    for i = 1:length(c)
        c[i] = a*b[i]+c[i]
    end
    return nothing
end

# now let's test!
@btime LinearAlgebra.BLAS.axpy!(a,b,c);
@btime custom_axpy!(a,b,c);
@btime threaded_axpy!(a,b,c);
``````

Depending on the machine and how CPU resources are managed, I don't always see an improvement when using multiple threads.
The good news is that the custom function has performance comparable to the BLAS routine (just a tiny bit slower).
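One practical tip for demos like this: OpenBLAS's thread count can be pinned explicitly, so the single-threaded comparison is apples-to-apples. A minimal sketch using the standard `BLAS.set_num_threads` call (small vectors here, just to verify correctness rather than benchmark):

```julia
using LinearAlgebra

# Pin OpenBLAS to one thread so it competes with a single-threaded Julia loop.
BLAS.set_num_threads(1)

a  = 2.0
b  = rand(10_000)
c  = rand(10_000)
c0 = copy(c)

BLAS.axpy!(a, b, c)  # in-place c = a*b + c, via OpenBLAS

# Sanity check against the straightforward broadcast definition.
err = maximum(abs.(c .- (a .* b .+ c0)))
println(err)  # expected to be 0.0 or on the order of machine epsilon
```

Calling `BLAS.set_num_threads(n)` again before the multi-threaded run keeps the two scenarios clearly separated.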

*Tested with Julia 1.1.1*


It’ll go just a little faster if you add `@inbounds`. Or use broadcasting, which does that for you:

``````
julia> b_axpy!(a,b,c) = (c .= a .* b .+ c; nothing)
b_axpy! (generic function with 1 method)

julia> @btime b_axpy!(a,b,c);
92.255 ms (0 allocations: 0 bytes)

julia> function inbounds_axpy!(a,b,c)
for i = 1:length(c)
@inbounds c[i] = a*b[i]+c[i]
end
return nothing
end
inbounds_axpy! (generic function with 1 method)

julia> @btime inbounds_axpy!(a,b,c);
92.229 ms (0 allocations: 0 bytes)

julia> @btime LinearAlgebra.BLAS.axpy!(a,b,c);
92.531 ms (0 allocations: 0 bytes)

``````

PS: Actually, it should be `daxpy!` in my code.


Are you sure your own code is running with multiple threads? (Did you set `JULIA_NUM_THREADS`?)

Many thanks for the suggestions and the tutorial! Indeed, `@inbounds` helps.
I'm using Juno, so I set the number of threads in Settings equal to the number of physical cores (2 on the current machine). I also double-checked with `Threads.nthreads()` and experimented with the `JULIA_NUM_THREADS` environment variable.
The results are similar regardless of the number of threads.
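One more way to confirm that `Threads.@threads` is really fanning out is to record which thread id touches each index and count the distinct ids. A small sketch (the helper name `thread_ids_used` is made up for illustration):

```julia
using Base.Threads

function thread_ids_used(n)
    ids = zeros(Int, n)
    @threads for i in 1:n
        ids[i] = threadid()  # record which thread handled index i
    end
    return unique(ids)
end

# With JULIA_NUM_THREADS=2 this should report two distinct ids;
# with the default of 1 it reports only [1].
println(nthreads(), " => ", thread_ids_used(1_000_000))
```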

``````
a = 2.0
b = Float64.(collect(1:100_000_000)) # create a big array of Floats
c = Float64.(collect(1:100_000_000)) # create a big array of Floats

function threaded_axpy!(a,b,c)
    Threads.@threads for i = 1:length(c)
        c[i] = a*b[i]+c[i]
    end
    return nothing
end

function custom_axpy!(a,b,c)
    for i = 1:length(c)
        c[i] = a*b[i]+c[i]
    end
    return nothing
end

function threaded_inbounds_axpy!(a,b,c)
    Threads.@threads for i = 1:length(c)
        @inbounds c[i] = a*b[i]+c[i]
    end
    return nothing
end

function inbounds_axpy!(a,b,c)
    for i = 1:length(c)
        @inbounds c[i] = a*b[i]+c[i]
    end
    return nothing
end

@btime LinearAlgebra.BLAS.axpy!(a,b,c); # BLAS runs single threaded
  214.561 ms (0 allocations: 0 bytes)
@btime custom_axpy!(a,b,c);             # plain loop, single thread
  224.731 ms (0 allocations: 0 bytes)
@btime inbounds_axpy!(a,b,c);           # function runs on a single thread
  213.999 ms (0 allocations: 0 bytes)

@btime LinearAlgebra.BLAS.axpy!(a,b,c); # BLAS runs multi threaded
  210.972 ms (0 allocations: 0 bytes)
@btime threaded_axpy!(a,b,c);           # two threads
  214.327 ms (1 allocation: 48 bytes)
@btime threaded_inbounds_axpy!(a,b,c);  # two threads
  220.438 ms (1 allocation: 48 bytes) # a second run was 213ms
_____________________________________
@btime LinearAlgebra.BLAS.axpy!(a,b,c); # BLAS runs single threaded
  216.450 ms (0 allocations: 0 bytes)
@btime custom_axpy!(a,b,c);             # plain loop, single thread
  226.447 ms (0 allocations: 0 bytes)
@btime inbounds_axpy!(a,b,c);           # function runs on a single thread
  216.408 ms (0 allocations: 0 bytes)

@btime LinearAlgebra.BLAS.axpy!(a,b,c); # BLAS runs multi threaded
  212.855 ms (0 allocations: 0 bytes)
``````
Note that the axpy operation is memory-bound (at least for vectors as large as those used here). In your results, even the BLAS routine doesn't speed up with more threads, which indicates that your machine has only one memory channel (laptop?).

When I run `BLAS.axpy!` and your `threaded_inbounds_axpy!` on my machine, I do see a speedup with more threads (roughly 2x for 2 threads), and both routines deliver the same timing. Tested with Julia 1.1.1 and 1.2.0, which give the same timings, though 1.2.0 seems to have a regression: it allocates more.
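A quick back-of-the-envelope check supports the memory-bound reading: `axpy!` streams roughly 3 × 8 bytes per element (read `b`, read `c`, write `c`), so the ~214 ms timings above correspond to about 11 GB/s, in the ballpark of a single memory channel. A sketch of the arithmetic:

```julia
N = 100_000_000          # vector length from the benchmarks above
t = 0.214                # ≈ 214 ms, a typical timing reported above
bytes = 3 * 8 * N        # read b, read c, write back c (Float64 = 8 bytes each)
gb_per_s = bytes / t / 1e9
println(gb_per_s)        # ≈ 11.2 GB/s
```

At that point the cores are waiting on DRAM, so extra threads only add overhead.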