When I increase the number of threads in my Julia project, performance does not scale linearly: the multi-threaded version is currently only about twice as fast as the single-threaded one. The single-threaded code allocates large arrays heavily (I have already optimized this as far as possible through profiling and pre-allocation). The project is not limited by memory capacity; I suspect it is limited by memory bandwidth, but I am unsure how to analyze this quantitatively. The codebase is too large to reduce to a minimal example.
I used the STREAMBenchmark package's memory_bandwidth to measure the server's memory bandwidth, but I am unsure how to measure or analyze the impact of different thread counts on memory bandwidth with a small demo.
using BenchmarkTools

a = rand(Float32, 1_000_000);
b = rand(Float32, 1_000_000);
@btime a .= b;
# 313.215 µs (2 allocations: 32 bytes)
a = rand(Float16, 1_000_000);
b = rand(Float16, 1_000_000);
@btime a .= b;
# 147.202 µs (2 allocations: 32 bytes)
a = rand(Float16, 2_000_000);
b = rand(Float16, 2_000_000);
@btime a .= b;
# 313.202 µs (2 allocations: 32 bytes)
a = rand(Float16, 3_000_000);
b = rand(Float16, 3_000_000);
@btime a .= b;
# 477.880 µs (2 allocations: 32 bytes)
a = rand(Float16, 4_000_000);
b = rand(Float16, 4_000_000);
@btime a .= b;
# 707.652 µs (2 allocations: 32 bytes)
a = rand(Float16, 5_000_000);
b = rand(Float16, 5_000_000);
@btime a .= b;
# 944.194 µs (2 allocations: 32 bytes)
a = rand(Float16, 6_000_000);
b = rand(Float16, 6_000_000);
@btime a .= b;
# 1.279 ms (2 allocations: 32 bytes)
a = rand(Float16, 10_000_000);
b = rand(Float16, 10_000_000);
@btime a .= b;
# 4.005 ms (2 allocations: 32 bytes)
a = rand(Float16, 20_000_000);
b = rand(Float16, 20_000_000);
@btime a .= b;
# 15.698 ms (2 allocations: 32 bytes)
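The timings above can also be converted into an effective-bandwidth figure, which makes the cache-to-RAM transition easier to spot. Below is a minimal sketch (the helper name `copy_bandwidth` is mine, not a package API) that counts one read plus one write per copied byte; it ignores any extra write-allocate traffic, so treat the result as an estimate.

```julia
using BenchmarkTools

# Effective copy bandwidth in GB/s for n elements of type T.
# A copy reads n*sizeof(T) bytes and writes the same amount.
function copy_bandwidth(::Type{T}, n) where {T}
    a = Vector{T}(undef, n)
    b = rand(T, n)
    t = @belapsed copyto!($a, $b)   # seconds per copy
    bytes = 2 * n * sizeof(T)       # read + write traffic
    return bytes / t / 1e9          # GB/s
end

copy_bandwidth(Float16, 1_000_000)   # small enough to be cache-resident on many CPUs
copy_bandwidth(Float16, 20_000_000)  # likely main-memory bound
```

If the small size reports a much higher figure than the large one, the small arrays are being served from cache rather than DRAM.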
using ThreadPinning
pinthreads(:compact)
using STREAMBenchmark
# memory_bandwidth(;verbose=true)
scaling_benchmark()
# Threads: 1 Max. memory bandwidth: 26204.7
# Threads: 2 Max. memory bandwidth: 48202.7
# Threads: 3 Max. memory bandwidth: 70129.2
# 3-element Vector{Float64}:
#  26204.7
#  48202.7
#  70129.2
function foo_sequential()
    N = 10
    a = [rand(Float16, 500_000) for i in 1:N]
    b = [rand(Float16, 500_000) for i in 1:N]
    for i in eachindex(a)
        a[i] .= b[i]
    end
    return nothing
end

function foo_parallel()
    N = 10
    a = [rand(Float16, 500_000) for i in 1:N]
    b = [rand(Float16, 500_000) for i in 1:N]
    Threads.@threads for i in eachindex(a)
        a[i] .= b[i]
    end
    return nothing
end
foo_sequential()
@time foo_sequential()
foo_parallel()
@time foo_parallel()
# threads = 1
# 0.029979 seconds (42 allocations: 19.075 MiB)
# 0.028899 seconds (49 allocations: 19.075 MiB)
# threads = 2
# 0.027747 seconds (42 allocations: 19.075 MiB, 4.16% gc time)
# 0.030938 seconds (55 allocations: 19.076 MiB, 3.74% gc time)
# threads = 3
# 0.032039 seconds (42 allocations: 19.075 MiB)
# 0.031098 seconds (61 allocations: 19.076 MiB)
Do you know Amdahl's law? Did you try to profile your code to see where the bottleneck is and whether it is a serial process? Are you aware that the garbage collector stops the world when it runs, and that you need to reduce memory allocations in hot loops as much as possible?
The threading scheduler has some overhead (roughly on the order of a microsecond, though it varies between systems). If each iteration runs a super-fast operation, like copying a single element from one array to another (which takes on the order of nanoseconds), the cost of spawning a task completely dwarfs the benefit of parallelising the work. See also the section "Multi-threading: is it always worth it?" of this jupyter notebook, which has pretty much the same example.
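The effect is easy to see in a toy benchmark. In the sketch below (function names are made up for illustration), each iteration does only a single `sqrt`, so task-spawning overhead typically swamps the work itself; exact numbers vary by machine.

```julia
using BenchmarkTools

function tiny_serial!(x)
    for i in eachindex(x)
        x[i] = sqrt(i)   # ~nanoseconds of work per iteration
    end
end

function tiny_threaded!(x)
    Threads.@threads for i in eachindex(x)
        x[i] = sqrt(i)   # same work, plus task-spawning overhead
    end
end

x = zeros(1000)
@btime tiny_serial!($x)
@btime tiny_threaded!($x)  # often slower than the serial version
```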
You want to do much more substantial work within each iteration to make the parallelism beneficial. You may also want to look at packages like OhMyThreads.jl and ChunkSplitters.jl.
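The chunking idea can be sketched with only Base.Threads (the helper `chunked_sum` below is hypothetical, not a package API): each task gets a contiguous block of indices, so scheduler overhead is paid once per chunk instead of once per element.

```julia
# Split eachindex(x) into one contiguous chunk per thread and
# spawn one task per chunk; overhead is per chunk, not per element.
function chunked_sum(x)
    nchunks = max(Threads.nthreads(), 1)
    len = max(cld(length(x), nchunks), 1)
    ranges = Iterators.partition(eachindex(x), len)
    tasks = [Threads.@spawn sum(@view x[r]) for r in ranges]
    return sum(fetch.(tasks))
end

chunked_sum(rand(1_000_000))
```

Packages like ChunkSplitters.jl automate this splitting and handle uneven lengths for you.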
I am sure that the threaded part accounts for over 90% of the execution time, but I cannot share the real example because it is too complex. For now I want to draw some conclusions from the Julia code given above.
OK, thank you. Maybe I need to modify my example; it is not appropriate.
OP is threading over entire arrays, i.e. each operation is a 1 MB memcpy. That's slow enough that threading overhead should not play a significant role.
On the other hand,

julia> using BenchmarkTools

julia> a = rand(Float16, 500_000); b = copy(a);

julia> @btime rand(Float16, 500_000);
  85.059 µs (3 allocations: 976.63 KiB)

julia> @btime copyto!(a, b);
  25.109 µs (0 allocations: 0 bytes)

So:
- Scheduler overhead should not matter, but multi-threading won't help a lot either, because most of your work (the rand calls) is single-threaded either way.
- You're probably measuring L3 bandwidth to some extent, not main memory bandwidth (your working set is 2 * 10 * 2 * 0.5 MB = 20 MB).
- Using BenchmarkTools instead of @time, I cannot reproduce your observation (i.e. multithreading doesn't hurt on my machine).
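One way to check the first point is to move the single-threaded rand allocations out of the timed region, so only the copies are measured. A possible sketch (the helper names are mine):

```julia
using BenchmarkTools

N = 10
a = [Vector{Float16}(undef, 500_000) for _ in 1:N]
b = [rand(Float16, 500_000) for _ in 1:N]  # allocated once, outside the benchmark

function copy_serial!(a, b)
    for i in eachindex(a)
        copyto!(a[i], b[i])
    end
end

function copy_threaded!(a, b)
    Threads.@threads for i in eachindex(a)
        copyto!(a[i], b[i])
    end
end

@btime copy_serial!($a, $b)
@btime copy_threaded!($a, $b)  # scaling now reflects the memory system, not rand
```

Any remaining gap between the two timings is then attributable to copy bandwidth and threading, rather than to the serial rand calls.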