Same multi-threaded code, scaling observed only on some machines

Hi everyone,

I was quite surprised by the following benchmark. The goal is the following: initialize a big vector x, and add 1 to all of its coordinates as fast as possible, taking advantage of multi-threading. Here is the code:

# Be sure to launch Julia with multiple threads, e.g.
# julia --threads 8

using BenchmarkTools
using Base.Threads
using LoopVectorization
using Polyester
using LinearAlgebra   # for BLAS.set_num_threads

t = nthreads();
n = 40_000_000;
x = zeros(t * n);
# one contiguous slice of x per thread
slices = [@view x[((i-1)*n+1):(i*n)] for i = 1:t];

println("\nBe sure to launch Julia with multiple threads.")
println("Also, check htop during this test. \n")

println("First test : julia's built-in broadcasted addition (single-threaded)...")
@btime @. x += 1;

println("\nSecond test : julia's built-in multithreading...")
@btime @threads for i = 1:t
    @. slices[i] += 1
end;

println("\nThird test : Polyester's multithreading...")
@btime @batch for i = 1:t
    @. slices[i] += 1
end;

println("\nLast test : LoopVectorization's multithreading...")
@btime @tturbo @. x += 1;

println("\nTesting BLAS multithreading performance")
A = randn(3000, 3000);
B = randn(3000, 3000);
@btime C = A * B;

BLAS.set_num_threads(1)
println("\nTesting BLAS single-threaded performance")
@btime C = A * B;

Here are my results:

First test : julia’s built-in broadcasted addition (single-threaded)…
404.168 ms (2 allocations: 64 bytes)

Second test : julia’s built-in multithreading…
412.907 ms (85 allocations: 6.31 KiB)

Third test : Polyester’s multithreading…
429.058 ms (3 allocations: 64 bytes)

Last test : LoopVectorization’s multithreading…
424.350 ms (2 allocations: 64 bytes)

Testing BLAS multithreading performance
251.361 ms (2 allocations: 68.66 MiB)

Testing BLAS single-threaded performance
946.403 ms (2 allocations: 68.66 MiB)
952.055 ms (2 allocations: 68.66 MiB)

This test was done on two Linux machines, one with an Intel i9 and one with an AMD Ryzen 7 (both chips have 8 physical cores), after launching Julia as follows:

julia --threads 8

As you can see, the scaling we expect for x .+= 1 is not happening, even though htop shows only one core working during the first test and all the cores working during the other tests. It does not seem to be a machine/kernel problem either, since the BLAS routines do scale as expected.

However: my coworker ran the exact same code on his machine, an iMac with an M1 chip, and there x .+= 1 scales as expected: the first test is the slowest, and the other tests are roughly t times faster, where t is the number of threads.

Do you see where this could be coming from?
Thank you in advance.

The difference comes from the ratio of CPU execution units to memory bandwidth. Apple’s M1 chips have a lot more memory bandwidth than standard dual-channel memory provides, while most non-server x86 CPUs have low enough bandwidth that a single core can use most of it. x .+= 1 does almost no arithmetic per byte moved, so once the memory bus is saturated, extra threads have nothing left to speed up.
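
A rough back-of-the-envelope check: with t = 8 and n = 40_000_000, x holds 320 million Float64 values, about 2.5 GB, and every x .+= 1 pass has to read and write all of it, so the ~400 ms timings above already correspond to moving on the order of 13 GB/s. To see the threads actually help, give each element more arithmetic to do. The sketch below is only an illustration (the names m, y, parts and the heavy kernel are arbitrary, not taken from your code), but once there is enough work per byte, the threaded loop should scale roughly the way the BLAS test does:

using BenchmarkTools, Base.Threads

t = nthreads();
m = 10_000_000;
y = zeros(t * m);
parts = [@view y[((i-1)*m+1):(i*m)] for i = 1:t];

heavy(v) = sqrt(abs(sin(v) + cos(v)))   # arbitrary kernel: plenty of arithmetic per 8 bytes

println("memory-bound, single-threaded:")
@btime @. y += 1;

println("compute-bound, single-threaded:")
@btime @. y = heavy(y);

println("compute-bound, multi-threaded (this one should scale with the thread count):")
@btime @threads for i = 1:t
    @. parts[i] = heavy(parts[i])
end;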


Thank you for your quick answer! So if I understand correctly, I should observe the scaling on smaller data; my chunks of x are too heavy. And… indeed, I do: if I replace n by 4_000 (instead of 40_000_000), or if I increase the number of slices (and thus decrease their size), then I observe the scaling.
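
Concretely, the change is just the slice size; something like this (a sketch, with 4_000 chosen arbitrarily so that each slice is about 32 KB and stays in cache):

using BenchmarkTools, Base.Threads, Polyester

t = nthreads();
n = 4_000;
x = zeros(t * n);
slices = [@view x[((i-1)*n+1):(i*n)] for i = 1:t];

println("single-threaded broadcast:")
@btime @. x += 1;

println("Polyester's @batch (its low per-call overhead matters at this size):")
@btime @batch for i = 1:t
    @. slices[i] += 1
end;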

So I have to pay attention to the balance between the number of threaded calls and the size of the data each of them operates on.