# Expected 72X speedup, observed 1.1X

I dreamed up a simple-looking task that I thought multithreading could accelerate. Running the file below (I named it `threadbench.jl`) proved me wrong: with 72 cores and 72 threads, the timings are not appreciably better than with a single thread. Understanding why I don’t get 72X speedup may help me in my long journey to improving some code I actually care about. Any ideas?

``````julia
using Base.Threads

# This function does the same simple job either with or without threading.
function tt(invec; threadit=false)
    outvec = zeros(size(invec))
    if threadit
        @threads for i in 1:size(invec, 1)
            outvec[i] = threadid() + invec[i]
        end
    else
        for i in 1:size(invec, 1)
            outvec[i] = threadid() + invec[i]
        end
    end
    return outvec
end

# Run the function once to make sure it's compiled
z = rand(1024)
y = tt(z)
y = tt(z,threadit=true) # Exercise all branches to be extra sure

x = round.(rand(2^27),digits=2)

println("Timing with one thread:")
@time y1 = tt(x,threadit=false)

println("Timing with $(Threads.nthreads()) threads:")
@time y72 = tt(x,threadit=true)
``````

On my machine, the results are as follows:

``````Timing with one thread:
1.443072 seconds (2 allocations: 1.000 GiB, 3.54% gc time)
Timing with 72 threads:
1.263002 seconds (367 allocations: 1.000 GiB)
``````

Thanks!

``````julia> @time y1 = tt(x,threadit=false);
0.252025 seconds (2 allocations: 1.000 GiB, 0.79% gc time)

julia> @time y72 = tt(x,threadit=true);
0.227822 seconds (45 allocations: 1.000 GiB, 1.30% gc time)
``````

with 8 threads on my i7 laptop. I get these speeds because I declared `x` as `const`. What I am noticing is that the times jump wildly between 0.25 and 0.45 seconds for me in both scenarios.

This code spends essentially all its time reading and writing memory (the addition itself is negligible), and your machine probably doesn't have the memory bandwidth for all of your cores to stream that much RAM simultaneously: the 1 GiB `invec` and `outvec` arrays are far too big to fit in the caches.

Try something compute-bound, not memory-bound.
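For instance, a loop where each iteration does a substantial amount of arithmetic on a few local scalars, instead of streaming through a gigabyte of RAM, should scale much better with thread count. A minimal sketch (the function name and iteration counts are illustrative, not from the original post):

```julia
using Base.Threads

# Each iteration does thousands of floating-point operations on local
# scalars, so the work is compute-bound rather than memory-bound.
function busywork(n; threadit=false)
    out = zeros(n)
    if threadit
        @threads for i in 1:n
            s = 0.0
            for k in 1:10_000
                s += sin(i * k / 1000)
            end
            out[i] = s
        end
    else
        for i in 1:n
            s = 0.0
            for k in 1:10_000
                s += sin(i * k / 1000)
            end
            out[i] = s
        end
    end
    return out
end

busywork(10)                            # warm up compilation
@time busywork(10_000)                  # single-threaded
@time busywork(10_000, threadit=true)   # should scale with core count
```

Because each `out[i]` depends only on `i`, the threaded and serial versions compute identical results, and the timing difference reflects raw compute throughput rather than memory bandwidth.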

17 Likes

Use `@btime tt($x, threadit=true);` from the BenchmarkTools.jl package instead of `@time`. It will take several measurements, so it will be less noisy, and the `$x` means that you don't need to declare `x` as `const` to avoid a global-variable penalty.

4 Likes

Thanks a lot, both of you, for taking the time to reply. I will continue on my journey of learning and exploration, hoping that the next issue I encounter will be worthy of starting a new thread (not continuing here).

I took Steven’s advice and changed the task being split across `@threads`: the new job was to compute a large number of random N-by-N determinants (with N=5 or N=3). At first I used the `LinearAlgebra` standard library for this, and managed to crash BLAS, which was surprising, gratifying, and educational! (Apparently OpenBLAS does its own threading, and the collision between that and my explicit threading requests was too much to handle.) Setting `OPENBLAS_NUM_THREADS=1` before starting Julia overcame this issue, but the `@threads` version of my task was still significantly slower than the single-threaded version. Finally I wrote my own super-simple 3-by-3 determinant evaluator, and with that the threading approach gave a modest speedup of around 6X. Perhaps memory access is still the bottleneck, since 10 Julia threads are just as fast as 72 for this operation.
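A hand-written 3-by-3 determinant like the one described, via direct cofactor expansion, might look like this (my reconstruction, not the original poster's code; the function and variable names are made up):

```julia
using Base.Threads

# Determinant of a 3x3 matrix by cofactor expansion along the first row;
# no BLAS calls, so it composes safely with @threads.
det3(m) = m[1,1] * (m[2,2]*m[3,3] - m[2,3]*m[3,2]) -
          m[1,2] * (m[2,1]*m[3,3] - m[2,3]*m[3,1]) +
          m[1,3] * (m[2,1]*m[3,2] - m[2,2]*m[3,1])

# Fill `out` with the determinant of each matrix, one per thread task.
function dets!(out, mats)
    @threads for i in eachindex(mats)
        out[i] = det3(mats[i])
    end
    return out
end

mats = [rand(3, 3) for _ in 1:10_000]
out = zeros(length(mats))
dets!(out, mats)
```

Each iteration touches only 9 input values and writes 1 output, so per-element compute is much higher than in the original element-wise addition, though the arrays of matrices can still stress memory bandwidth at large counts.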

There is no new question here, just a little story to confirm that I took your advice seriously and learned from it. Perhaps some future reader will, too. Thanks again!

3 Likes

Note that if you’re working with small arrays of fixed sizes (smaller than `10x10` or so), `StaticArrays.jl` will often be much faster, since it stores the size of the array in the type and can therefore generate specialized code.
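A minimal sketch of that idea, assuming the third-party StaticArrays.jl package is installed:

```julia
using StaticArrays, LinearAlgebra

# An SMatrix carries its 3x3 size in its type, so `det` specializes to a
# few inlined multiply-adds with no heap allocation or BLAS call.
m = @SMatrix rand(3, 3)
d = det(m)
```

The same `det` call on a plain `Matrix{Float64}` would dispatch to a generic (LU-based) routine, so for millions of tiny determinants the static version can be dramatically faster.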