When I increase the number of threads in my Julia project, performance does not scale linearly: the multi-threaded version is currently only about twice as fast as the single-threaded one. The single-threaded code allocates large arrays heavily (I have already optimized this as far as possible through profiling and pre-allocation). The project is not limited by memory capacity; I suspect it is limited by memory bandwidth, but I am unsure how to analyze this quantitatively. The codebase is too large to reduce to a minimal example.
I used the STREAMBenchmark package's memory_bandwidth to measure the server's memory bandwidth, but I am unsure how to measure or analyze the impact of different thread counts on memory bandwidth with a small demo.
using BenchmarkTools

a = rand(Float32, 1_000_000);
b = rand(Float32, 1_000_000);
@btime a .= b;
# 313.215 µs (2 allocations: 32 bytes)
a = rand(Float16, 1_000_000);
b = rand(Float16, 1_000_000);
@btime a .= b;
# 147.202 µs (2 allocations: 32 bytes)
a = rand(Float16, 2_000_000);
b = rand(Float16, 2_000_000);
@btime a .= b;
# 313.202 µs (2 allocations: 32 bytes)
a = rand(Float16, 3_000_000);
b = rand(Float16, 3_000_000);
@btime a .= b;
# 477.880 µs (2 allocations: 32 bytes)
a = rand(Float16, 4_000_000);
b = rand(Float16, 4_000_000);
@btime a .= b;
# 707.652 µs (2 allocations: 32 bytes)
a = rand(Float16, 5_000_000);
b = rand(Float16, 5_000_000);
@btime a .= b;
# 944.194 µs (2 allocations: 32 bytes)
a = rand(Float16, 6_000_000);
b = rand(Float16, 6_000_000);
@btime a .= b;
# 1.279 ms (2 allocations: 32 bytes)
a = rand(Float16, 10_000_000);
b = rand(Float16, 10_000_000);
@btime a .= b;
# 4.005 ms (2 allocations: 32 bytes)
a = rand(Float16, 20_000_000);
b = rand(Float16, 20_000_000);
@btime a .= b;
# 15.698 ms (2 allocations: 32 bytes)
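The timings above can also be converted into an effective-bandwidth figure, which makes the cache-to-RAM transition easier to spot. Below is a minimal sketch (the helper name `copy_bandwidth` is mine, not a package API) that counts one read plus one write per copied byte; it ignores any extra write-allocate traffic, so treat the result as an estimate.

```julia
using BenchmarkTools

# Effective copy bandwidth in GB/s for n elements of type T.
# A copy reads n*sizeof(T) bytes and writes the same amount.
function copy_bandwidth(::Type{T}, n) where {T}
    a = Vector{T}(undef, n)
    b = rand(T, n)
    t = @belapsed copyto!($a, $b)   # seconds per copy
    bytes = 2 * n * sizeof(T)       # read + write traffic
    return bytes / t / 1e9          # GB/s
end

copy_bandwidth(Float16, 1_000_000)   # small enough to be cache-resident on many CPUs
copy_bandwidth(Float16, 20_000_000)  # likely main-memory bound
```

If the small size reports a much higher figure than the large one, the small arrays are being served from cache rather than DRAM.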
using ThreadPinning
pinthreads(:compact)
using STREAMBenchmark
# memory_bandwidth(;verbose=true)
scaling_benchmark()
# Threads: 1 Max. memory bandwidth: 26204.7
# Threads: 2 Max. memory bandwidth: 48202.7
# Threads: 3 Max. memory bandwidth: 70129.2
# 3-element Vector{Float64}:
#  26204.7
#  48202.7
#  70129.2
function foo_sequential()
    N = 10
    a = [rand(Float16, 500_000) for i in 1:N]
    b = [rand(Float16, 500_000) for i in 1:N]
    for i in eachindex(a)
        a[i] .= b[i]
    end
    return nothing
end

function foo_parallel()
    N = 10
    a = [rand(Float16, 500_000) for i in 1:N]
    b = [rand(Float16, 500_000) for i in 1:N]
    Threads.@threads for i in eachindex(a)
        a[i] .= b[i]
    end
    return nothing
end
foo_sequential()
@time foo_sequential()
foo_parallel()
@time foo_parallel()
# threads = 1
# 0.029979 seconds (42 allocations: 19.075 MiB)
# 0.028899 seconds (49 allocations: 19.075 MiB)
# threads = 2
# 0.027747 seconds (42 allocations: 19.075 MiB, 4.16% gc time)
# 0.030938 seconds (55 allocations: 19.076 MiB, 3.74% gc time)
# threads = 3
# 0.032039 seconds (42 allocations: 19.075 MiB)
# 0.031098 seconds (61 allocations: 19.076 MiB)
Do you know Amdahl's law? Did you try to profile your code to see where the bottleneck is and whether it is a serial process? Are you aware that the garbage collector stops the world when it runs, and that you need to reduce memory allocations in hot loops as much as possible?
The threading scheduler has some overhead (roughly on the order of a microsecond, though it varies between systems). If each iteration runs a super-fast operation, like copying a single element from one array to another (which takes on the order of nanoseconds), the cost of spawning a task completely dwarfs the benefit of parallelising the work. See also the section "Multi-threading: is it always worth it?" of this jupyter notebook, which has pretty much the same example.
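The effect is easy to see in a toy benchmark. In the sketch below (function names are made up for illustration), each iteration does only a single `sqrt`, so task-spawning overhead typically swamps the work itself; exact numbers vary by machine.

```julia
using BenchmarkTools

function tiny_serial!(x)
    for i in eachindex(x)
        x[i] = sqrt(i)   # ~nanoseconds of work per iteration
    end
end

function tiny_threaded!(x)
    Threads.@threads for i in eachindex(x)
        x[i] = sqrt(i)   # same work, plus task-spawning overhead
    end
end

x = zeros(1000)
@btime tiny_serial!($x)
@btime tiny_threaded!($x)  # often slower than the serial version
```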
You want to do much more substantial work within each iteration to make the parallelism beneficial. You may also want to look at packages like OhMyThreads.jl and ChunkSplitters.jl.
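The chunking idea can be sketched with only Base.Threads (the helper `chunked_sum` below is hypothetical, not a package API): each task gets a contiguous block of indices, so scheduler overhead is paid once per chunk instead of once per element.

```julia
# Split eachindex(x) into one contiguous chunk per thread and
# spawn one task per chunk; overhead is per chunk, not per element.
function chunked_sum(x)
    nchunks = max(Threads.nthreads(), 1)
    len = max(cld(length(x), nchunks), 1)
    ranges = Iterators.partition(eachindex(x), len)
    tasks = [Threads.@spawn sum(@view x[r]) for r in ranges]
    return sum(fetch.(tasks))
end

chunked_sum(rand(1_000_000))
```

Packages like ChunkSplitters.jl automate this splitting and handle uneven lengths for you.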
I am sure that the threaded part accounts for over 90% of the execution time, but I cannot share the real example because it is too complex. For now I want to draw some conclusions from the Julia code given above.
OK, thank you. Maybe I need to modify my example; it is not appropriate.
OP is threading over entire arrays, i.e. each operation is a 1 MB memcpy. That's slow enough that threading overhead should not play a significant role.
On the other hand,

julia> using BenchmarkTools

julia> a = rand(Float16, 500_000); b = copy(a);

julia> @btime rand(Float16, 500_000);
  85.059 µs (3 allocations: 976.63 KiB)

julia> @btime copyto!(a, b);
  25.109 µs (0 allocations: 0 bytes)

So:
- Scheduler overhead should not matter, but multi-threading won't help a lot either, because most of your work (the rand calls) is single-threaded either way.
- You're probably measuring L3 bandwidth to some extent, not main memory bandwidth (your working set is 2 * 10 * 2 * 0.5 MB = 20 MB).
- Using BenchmarkTools instead of @time, I cannot reproduce your observation (i.e. multithreading doesn't hurt on my machine).
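One way to check the first point is to move the single-threaded rand allocations out of the timed region, so only the copies are measured. A possible sketch (the helper names are mine):

```julia
using BenchmarkTools

N = 10
a = [Vector{Float16}(undef, 500_000) for _ in 1:N]
b = [rand(Float16, 500_000) for _ in 1:N]  # allocated once, outside the benchmark

function copy_serial!(a, b)
    for i in eachindex(a)
        copyto!(a[i], b[i])
    end
end

function copy_threaded!(a, b)
    Threads.@threads for i in eachindex(a)
        copyto!(a[i], b[i])
    end
end

@btime copy_serial!($a, $b)
@btime copy_threaded!($a, $b)  # scaling now reflects the memory system, not rand
```

Any remaining gap between the two timings is then attributable to copy bandwidth and threading, rather than to the serial rand calls.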