Hi everyone,
I was quite surprised by the following benchmark. The goal is to initialize a big vector x and add 1 to all of its coordinates as fast as possible, taking advantage of multi-threading. Here is the code:
# Be sure to launch julia with multiple threads
# julia --threads 8
using BenchmarkTools
using Base.Threads
using LoopVectorization
using Polyester
using LinearAlgebra  # needed for BLAS.set_num_threads below
t = nthreads();
n = 40_000_000;
x = zeros(t * n);
slices = [@view x[((i-1)*n+1):(i*n)] for i = 1:t];
println("\nBe sure to launch Julia with multiple threads.")
println("Also, check htop during this test. \n")
println("First test : julia's built-in broadcasted addition (single-threaded)...")
@btime @. x += 1;
println("\nSecond test : julia's built-in multithreading...")
@btime @threads for i = 1:t
    @. slices[i] += 1
end;
println("\nThird test : Polyester's multithreading...")
@btime @batch for i = 1:t
    @. slices[i] += 1
end;
println("\nLast test : LoopVectorization's multithreading...")
@btime @tturbo @. x += 1;
println("\nTesting BLAS multithreading performance")
A = randn(3000,3000);
B = randn(3000,3000);
@btime C = A*B;
BLAS.set_num_threads(1)
println("\nTesting BLAS single-threaded performance")
@btime C = A*B;
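As a side check that the slices really partition x, so that the threaded loops do the same work as the plain broadcast, here is a minimal sketch. The small n is an arbitrary value chosen only to keep the check quick; it is not the size used in the benchmark.

```julia
using Base.Threads

t = nthreads()
n = 1_000                 # arbitrary small size, only for this check
x = zeros(t * n)
slices = [@view x[((i-1)*n+1):(i*n)] for i = 1:t]

# Same threaded loop as in the benchmark, run once.
@threads for i = 1:t
    @. slices[i] += 1
end

# The slices should cover x exactly, and every entry should now be 1.0.
covered  = sum(length, slices) == length(x)
all_ones = all(==(1.0), x)
println("threads = ", t, ", covered = ", covered, ", all ones = ", all_ones)
```

If both flags print true, each element was incremented exactly once, so the threaded tests and the single-threaded test touch the same amount of data.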
Here are my results:
First test : julia’s built-in broadcasted addition (single-threaded)…
404.168 ms (2 allocations: 64 bytes)
Second test : julia’s built-in multithreading…
412.907 ms (85 allocations: 6.31 KiB)
Third test : Polyester’s multithreading…
429.058 ms (3 allocations: 64 bytes)
Last test : LoopVectorization’s multithreading…
424.350 ms (2 allocations: 64 bytes)
Testing BLAS multithreading performance
251.361 ms (2 allocations: 68.66 MiB)
Testing BLAS single-threaded performance
946.403 ms (2 allocations: 68.66 MiB)
952.055 ms (2 allocations: 68.66 MiB)
This test was done on two Linux machines, one with an Intel i9 and one with an AMD Ryzen 7 (both chips have 8 physical cores), after launching Julia as follows:
julia --threads 8
As you can see, the scaling we expect for x .+= 1 is not happening. Yet when I keep an eye on htop, I see that only one core is working during the first test, and all the cores are working during the other tests. It does not seem to be a machine or kernel problem either, since the BLAS routines do scale as expected.
However, my coworker ran the exact same code on his machine, an iMac with an M1 chip, and there x .+= 1 scales as expected: the first test is the slowest, and the other tests are roughly t times faster, where t is the number of threads.
Do you see where this could be coming from?
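For reference, the timing of the first test on the Linux machines can be converted into a rough data rate. This is only a back-of-the-envelope sketch: it assumes t = 8 threads (as in the launch command), Float64 elements, and exactly one read and one write per element, ignoring any cache effects.

```julia
# Rough data-rate estimate from the timing of the first test.
t = 8                                 # threads used for the run (julia --threads 8)
n = 40_000_000                        # elements per slice, as in the benchmark
bytes_per_elem = sizeof(Float64)      # 8 bytes per element

total_bytes = t * n * bytes_per_elem  # size of the vector x in bytes
traffic     = 2 * total_bytes         # assumption: one read + one write per element
time_s      = 0.404168                # first test: 404.168 ms

rate_gb_s = traffic / time_s / 1e9
println("x is ", total_bytes / 1e9, " GB; implied rate ≈ ",
        round(rate_gb_s, digits = 1), " GB/s")
```

Since the multi-threaded tests take about the same time, they imply about the same data rate under these assumptions.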
Thank you in advance