Same multi-threaded code, scaling observed only on some machines

Hi everyone,

I was quite surprised by the following benchmark. The goal is the following: initialize a big vector x, and add 1 to all of its coordinates as fast as possible, taking advantage of multi-threading. Here is the code:

# Be sure to launch Julia with multiple threads, e.g.
# julia --threads 8

using BenchmarkTools
using Base.Threads
using LoopVectorization
using Polyester
using LinearAlgebra   # for BLAS.set_num_threads

t = nthreads();
n = 40_000_000;
x = zeros(t * n);
# one contiguous slice of x per thread
slices = [@view x[((i-1)*n+1):(i*n)] for i = 1:t];

println("\nBe sure to launch Julia with multiple threads.")
println("Also, check htop during this test. \n")

println("First test : julia's built-in broadcasted addition (single-threaded)...")
@btime @. x += 1;

println("\nSecond test : julia's built-in multithreading...")
@btime @threads for i = 1:t
    @. slices[i] += 1
end;

println("\nThird test : Polyester's multithreading...")
@btime @batch for i = 1:t
    @. slices[i] += 1
end;

println("\nLast test : LoopVectorization's multithreading...")
@btime @tturbo @. x += 1;

println("\nTesting BLAS multithreading performance")
A = randn(3000, 3000);
B = randn(3000, 3000);
@btime C = A * B;

BLAS.set_num_threads(1)
println("\nTesting BLAS single-threaded performance")
@btime C = A * B;

Here are my results:

First test : julia’s built-in broadcasted addition (single-threaded)…
404.168 ms (2 allocations: 64 bytes)

Second test : julia’s built-in multithreading…
412.907 ms (85 allocations: 6.31 KiB)

Third test : Polyester’s multithreading…
429.058 ms (3 allocations: 64 bytes)

Last test : LoopVectorization’s multithreading…
424.350 ms (2 allocations: 64 bytes)

Testing BLAS multithreading performance
251.361 ms (2 allocations: 68.66 MiB)

Testing BLAS single-threaded performance
946.403 ms (2 allocations: 68.66 MiB)
952.055 ms (2 allocations: 68.66 MiB)

This test was done on two Linux machines, one with an Intel i9 and one with an AMD Ryzen 7 (both chips have 8 physical cores), after launching Julia as follows:

julia --threads 8

As you can see, the scaling we expect for x .+= 1 is not happening, even though htop shows only one core working during the first test and all the cores working during the other tests. It does not seem to be a machine/kernel problem either, since the BLAS routines do scale as expected.

However: my coworker ran the exact same code on his machine, an iMac with an M1 chip, and there x .+= 1 scales as expected: the first test is the slowest, and the other tests are roughly t times faster, where t is the number of threads.

Do you see where this could be coming from?
Thank you in advance.

The difference comes from the ratio of CPU execution units to memory bandwidth. Apple’s M1 chips have a lot more memory bandwidth than standard dual-channel memory provides, while most non-server x86 CPUs have low enough bandwidth that a single core can use most of it. x .+= 1 does almost no arithmetic per byte moved, so once the memory bus is saturated, extra threads have nothing left to speed up.
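
A rough back-of-the-envelope check: with t = 8 and n = 40_000_000, x holds 320 million Float64 values, about 2.5 GB, and every x .+= 1 pass has to read and write all of it, so the ~400 ms timings above already correspond to moving on the order of 13 GB/s. To see the threads actually help, give each element more arithmetic to do. The sketch below is only an illustration (the names m, y, parts and the heavy kernel are arbitrary, not taken from your code), but once there is enough work per byte, the threaded loop should scale roughly the way the BLAS test does:

using BenchmarkTools, Base.Threads

t = nthreads();
m = 10_000_000;
y = zeros(t * m);
parts = [@view y[((i-1)*m+1):(i*m)] for i = 1:t];

heavy(v) = sqrt(abs(sin(v) + cos(v)))   # arbitrary kernel: plenty of arithmetic per 8 bytes

println("memory-bound, single-threaded:")
@btime @. y += 1;

println("compute-bound, single-threaded:")
@btime @. y = heavy(y);

println("compute-bound, multi-threaded (this one should scale with the thread count):")
@btime @threads for i = 1:t
    @. parts[i] = heavy(parts[i])
end;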


Thank you for your quick answer! So if I understand correctly, I should observe the scaling on smaller data; my chunks of x are too heavy. And… indeed, I do: if I replace n by 4_000 (instead of 40_000_000), or if I increase the number of slices (and thus decrease their size), then I observe the scaling.
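
Concretely, the change is just the slice size; something like this (a sketch, with 4_000 chosen arbitrarily so that each slice is about 32 KB and stays in cache):

using BenchmarkTools, Base.Threads, Polyester

t = nthreads();
n = 4_000;
x = zeros(t * n);
slices = [@view x[((i-1)*n+1):(i*n)] for i = 1:t];

println("single-threaded broadcast:")
@btime @. x += 1;

println("Polyester's @batch (its low per-call overhead matters at this size):")
@btime @batch for i = 1:t
    @. slices[i] += 1
end;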

So I have to pay attention to the balance between the number of threaded calls and the size of the data each of them operates on.