Expected 72X speedup, observed 1.1X

PhilipDLoewen · April 27, 2021, 12:58am

I dreamed up a simple-looking task that I thought multithreading could accelerate. Running the file below (I named it threadbench.jl) proved me wrong: with 72 cores and 72 threads, the timings are not appreciably better than with a single thread. Understanding why I don’t get 72X speedup may help me in my long journey to improving some code I actually care about. Any ideas?

using Base.Threads

# This function does the same simple job either with or without threading.
function tt(invec; threadit=false)
	outvec = zeros(size(invec))
	if threadit
		@threads for i in 1:size(invec,1)
			outvec[i] = threadid()+invec[i]
		end
	else
		for i in 1:size(invec,1)
			outvec[i] = threadid()+invec[i]
		end
	end
	return outvec
end

# Run the function once to make sure it's compiled
z = rand(1024)
y = tt(z)
y = tt(z,threadit=true) # Exercise all branches to be extra sure

x = round.(rand(2^27),digits=2)
	
println("Timing with one thread:")
@time y1 = tt(x,threadit=false)
	
println("Timing with $(Threads.nthreads()) threads:")
@time y72 = tt(x,threadit=true)

On my machine, the results are as follows:

Timing with one thread:
  1.443072 seconds (2 allocations: 1.000 GiB, 3.54% gc time)
Timing with 72 threads:
  1.263002 seconds (367 allocations: 1.000 GiB)

Thanks!

affans · April 27, 2021, 1:19am

julia> @time y1 = tt(x,threadit=false);
  0.252025 seconds (2 allocations: 1.000 GiB, 0.79% gc time)

julia> @time y72 = tt(x,threadit=true);
  0.227822 seconds (45 allocations: 1.000 GiB, 1.30% gc time)

with 8 threads on my i7 laptop. I have these speeds because I declared x to be const. What I am noticing is that the times jump wildly between 0.25 - 0.45 seconds for me in both scenarios.

stevengj · April 27, 2021, 1:24am

This code basically spends all its time reading/writing memory (the actual addition should be negligible), and your machine probably doesn’t have the memory bandwidth for all your cores to access that much RAM simultaneously (the 1GiB invec and outvec arrays are too big for the caches).

Try something compute-bound, not memory-bound.

stevengj · April 27, 2021, 1:26am

Use @btime tt($x,threadit=true); from the BenchmarkTools.jl package instead of @time. It will take several measurements, so it will be less noisy, and the $x means that you don’t need to declare x as const to avoid a global-variable penalty.

PhilipDLoewen · April 28, 2021, 4:36am

Thanks a lot, both of you, for taking the time to reply. I will continue on my journey of learning and exploration, hoping that the next issue I encounter will be worthy of starting a new thread (not continuing here).

I took Steven’s advice and changed the task to be split into @threads: the new job was to compute a large number of random N-by-N determinants (with N=5 or N=3). At first I used the LinearAlgebra.jl package for this, and managed to crash BLAS: this was surprising / gratifying / educational! (Apparently OPENBLAS does its own threading, and the collision between that and my explicit threading requests was too much to handle.) Specifying OPENBLAS_NUM_THREADS=1 before starting Julia overcame this issue, but nonetheless the version of my task using @threads was significantly slower than the single-threaded version. Finally I wrote my own super-simple 3-by-3 determinant evaluator function, and then the threading approach allowed a modest speedup – around 6X. Perhaps memory access is still the bottleneck, since 10 Julia threads are just as fast as 72 for this operation.

There is no new question here, just a little story to confirm that I took your advice seriously and learned from it. Perhaps some future reader will, too. Thanks again!

Oscar_Smith · April 28, 2021, 4:50am

Note that if you’re working with small arrays of fixed sizes, (smaller than 10x10 or so), StaticArrays will often be much faster since it stores the size of the array in the type, and therefore can generate specialized code.

Topic		Replies	Views
Same multi-threaded code, scaling observed only on some machines Performance	2	67	August 14, 2024
How to achieve perfect scaling with Threads (Julia 1.7.1) Performance multithreading	33	2404	January 13, 2023
Threads maxing out all cores, but no performance increase General Usage performance , threads	16	1789	April 6, 2021
Using and understanding multi-threading Performance multithreading	1	285	January 11, 2024
Speed up for-loop with multithreading Performance question , multithreading	10	1819	April 22, 2022

Expected 72X speedup, observed 1.1X

Related topics