Expected 72X speedup, observed 1.1X

I dreamed up a simple-looking task that I thought multithreading could accelerate. Running the file below (I named it threadbench.jl) proved me wrong: with 72 cores and 72 threads, the timings are not appreciably better than with a single thread. Understanding why I don’t get 72X speedup may help me in my long journey to improving some code I actually care about. Any ideas?

using Base.Threads

# This function does the same simple job either with or without threading.
function tt(invec; threadit=false)
	outvec = zeros(size(invec))
	if threadit
		@threads for i in 1:size(invec,1)
			outvec[i] = threadid()+invec[i]
		end
	else
		for i in 1:size(invec,1)
			outvec[i] = threadid()+invec[i]
		end
	end
	return outvec
end

# Run the function once to make sure it's compiled
z = rand(1024)
y = tt(z)
y = tt(z,threadit=true) # Exercise all branches to be extra sure

x = round.(rand(2^27),digits=2)
println("Timing with one thread:")
@time y1 = tt(x,threadit=false)
println("Timing with $(Threads.nthreads()) threads:")
@time y72 = tt(x,threadit=true)

On my machine, the results are as follows:

Timing with one thread:
  1.443072 seconds (2 allocations: 1.000 GiB, 3.54% gc time)
Timing with 72 threads:
  1.263002 seconds (367 allocations: 1.000 GiB)


julia> @time y1 = tt(x,threadit=false);
  0.252025 seconds (2 allocations: 1.000 GiB, 0.79% gc time)

julia> @time y72 = tt(x,threadit=true);
  0.227822 seconds (45 allocations: 1.000 GiB, 1.30% gc time)

with 8 threads on my i7 laptop. I get these speeds because I declared x as const. What I notice is that the times jump wildly between 0.25 and 0.45 seconds for me in both scenarios.

This code spends essentially all its time reading and writing memory (the addition itself is negligible), and your machine probably doesn’t have the memory bandwidth for all your cores to stream that much RAM simultaneously (the 1 GiB invec and outvec arrays are far too big for the caches).

Try something compute-bound, not memory-bound.
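One illustrative way to make a task compute-bound is to do many floating-point operations per element read, so arithmetic rather than bandwidth dominates. A minimal sketch (the function name `heavy` and the 1000-iteration inner loop are arbitrary choices for illustration, not from the thread):

```julia
using Base.Threads

# Each element gets ~1000 floating-point operations per memory access,
# so the per-element work, not memory bandwidth, dominates the runtime.
function heavy(invec)
	outvec = zeros(size(invec))
	@threads for i in eachindex(invec)
		s = invec[i]
		for _ in 1:1000      # lots of arithmetic per element
			s = sin(s) + 1.0
		end
		outvec[i] = s
	end
	return outvec
end
```

A loop like this should scale much closer to the thread count than the original element-wise addition.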


Use @btime tt($x,threadit=true); from the BenchmarkTools.jl package instead of @time. It will take several measurements, so it will be less noisy, and the $x means that you don’t need to declare x as const to avoid a global-variable penalty.
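A minimal sketch of that workflow, assuming BenchmarkTools.jl is installed (the function `f` here is just a stand-in for whatever you want to time):

```julia
using BenchmarkTools

f(v) = sum(abs2, v)   # stand-in for the function under test

x = rand(2^20)
@btime f($x)          # $x interpolates the variable, so no const is needed
```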


Thanks a lot, both of you, for taking the time to reply. I will continue on my journey of learning and exploration, hoping that the next issue I encounter will be worthy of starting a new thread (not continuing here).

I took Steven’s advice and changed the task so the work split across @threads is compute-bound: the new job was to compute a large number of random N-by-N determinants (with N=5 or N=3). At first I used the LinearAlgebra standard library for this, and managed to crash BLAS: this was surprising / gratifying / educational! (Apparently OpenBLAS does its own threading, and the collision between that and my explicit threading requests was too much to handle.) Setting OPENBLAS_NUM_THREADS=1 before starting Julia overcame this issue, but the @threads version of the task was still significantly slower than the single-threaded one. Finally I wrote my own super-simple 3-by-3 determinant evaluator, and then threading gave a modest speedup of around 6X. Perhaps memory access is still the bottleneck, since 10 Julia threads are just as fast as 72 for this operation.
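The original evaluator wasn’t shown, but a hand-written 3-by-3 determinant via cofactor expansion along the first row might look something like this (`det3` and `dets` are illustrative names, not from the thread):

```julia
using Base.Threads

# Determinant of a 3x3 matrix by cofactor expansion along the first row.
det3(m) =
	m[1,1]*(m[2,2]*m[3,3] - m[2,3]*m[3,2]) -
	m[1,2]*(m[2,1]*m[3,3] - m[2,3]*m[3,1]) +
	m[1,3]*(m[2,1]*m[3,2] - m[2,2]*m[3,1])

# Threaded batch evaluation over a vector of 3x3 matrices.
function dets(ms)
	out = zeros(length(ms))
	@threads for i in eachindex(ms)
		out[i] = det3(ms[i])
	end
	return out
end
```

Since each determinant is only a handful of multiplies and adds, the loop is still fairly light on arithmetic per byte of matrix data read, which would be consistent with memory access remaining the bottleneck.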

There is no new question here, just a little story to confirm that I took your advice seriously and learned from it. Perhaps some future reader will, too. Thanks again!


Note that if you’re working with small arrays of fixed size (smaller than about 10x10), StaticArrays.jl will often be much faster, since it stores the size of the array in the type and can therefore generate specialized code.
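A minimal sketch, assuming StaticArrays.jl is installed:

```julia
using StaticArrays, LinearAlgebra

# An SMatrix encodes its 3x3 size in the type, so det compiles down to
# straight-line code with no allocations or runtime size checks.
m = @SMatrix [2.0 0.0 0.0; 0.0 3.0 0.0; 0.0 0.0 4.0]
det(m)   # 24.0
```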