First if you have a 6 core CPU then the 12 threads are using hyper-threading which means 2 threads try to share resources in a single core. If both threads are doing the same computations then there is likely conflict over those resources. In extreme situations this might mean the threads run at half speed. (If the threads are doing wildly different operations then there are less conflicts inside the core and the threads run near max speed and everyone is happy.) But that would only get you down to a x6 speed increase.

Next have you run some sort of process monitor i.e. top or htop? Are all cores fully utilized? If you are not seeing 12 cores staying at 100% utilization then you have some sort of contention that is causing the threads to wait.

This could be the amount of time a thread has to spend processing a “chunk”. If that is short then you could be hit by the synchronization of the thread “asking” for a new chunk to process. The longer each thread can run independently the less synchronization that has to happen, and the less time the threads wait to synchronize.

Also are you printing or writing information to disk during the process? This is another area of contention that could be slowing everything down.

Lastly I there could be contention with the garbage collection. I really haven’t played with threading and the garbage collector but the collector might be “stopping the world” to often. Not sure how to debug that other than trying to reduce memory allocation.

Are you perhaps threading a loop in global scope with Threads.@threads ? I am asking because you write “essentially independent”.

I did that in the beginning and the code ran slower with more than one thread than the serial version. Putting the threaded loop in a function gave the expected speed-up. I guess I introduced a hidden closure by using Threads.@threads.

I think the best way is to use an example. Say we want to calculate the matrix dot product (Gram) x'x for very large (long) matrices where the calculation may not fit into memory, we could carry out the calculating in blocks, we could also parallelize the calculation using threads, the code is below. For some reason I’m now getting around x5.5 faster with the same data load as before:

using Base.Threads
using LinearAlgebra.BLAS: set_num_threads
nthreads()
# 12
# Generate the block matrices
function makeNestedMatrix(n::Integer = 1000, p::Integer = 100, chunk = 1000)
x = Array{Array{Float64, 2}}(undef, n)
for i in 1:n
x[i] = rand(Float64, chunk, p)
end
return x
end
function threadedGram(x::Array{Array{Float64, 2}})
p = size(x[1])[2]
n = length(x)
z = zeros(Float64, p, p)
@threads for i in 1:n
z .+= x[i]' * x[i]
end
return z
end
function serialGram(x::Array{Array{Float64, 2}})
p = size(x[1])[2]
n = length(x)
z = zeros(Float64, p, p)
for i in 1:n
z .+= x[i]' * x[i]
end
return z
end
x = makeNestedMatrix(10_000, 100, 1000);
set_num_threads(1)
@time serialGram(x);
# 4.238534 seconds (20.01 k allocations: 763.779 MiB, 21.82% gc time)
@time threadedGram(x);
# 0.765528 seconds (20.10 k allocations: 763.788 MiB, 0.58% gc time)
# Check with native matrix multiplication
# Generate the dense matrix for comparison
y = zeros(size(x[1])[1]*length(x), size(x[1])[2]);
chunk = size(x[1])[1];
for i in 1:length(x)
y[((i - 1)*chunk + 1):(i*chunk),:] .= x[i]
end
set_num_threads(12)
@time y' * y;
# 3.229251 seconds (7 allocations: 78.375 KiB)

Sure. Everything within that loop was a single function doing the actual computation which got called with changing parameters (parameter variation problem) over and over again. So from a performance perspective it was fine for serial code. With threading it was not.

If you’re using matrix multiplication (or other functionality from BLAS) — then your code is already multithreaded within the library call. BLAS’ multithreading doesn’t (yet) compose well with Julia’s threads, so this doesn’t play nicely at the moment, but that’s slated to change for 1.4.

You might have a point there. I have only 32GB on my computer and generated 30,000 rather the original 10,000 chunks. The CPU utilization for single core and multi-threaded is given below. We hit 100% CPU utilization on threaded briefly. The times are 10.666989 s and 2.904637 s for single and multithreaded respectively.

I guess I’ll gave to rent some cloud compute briefly to test this properly.

What I’m doing should be find should it not? I switch off BLAS threading using set_num_threads(1) to directly compare multi-threaded and single threaded response. The system monitor indicates this switch is working since I only see the selected number of threads active when I do the runs.

I thought that the the * function for 2D arrays is just dispatching to BLAS and we control the number of threads C uses with set_num_threads() function? The only thing I know about partr is that it has something to do with Julia concurrency - majorly naive about that I’m afraid.

Yes, you’re correct. I must confess that I didn’t scroll down on your code sample far enough to actually see how you were benchmarking — I just saw the matmul thought that might be your issue. Sorry about that.

That said, you’ll almost never see perfect linear scaling. If you have a 6 core machine, then getting a 4-5.5x scaling with 12 threads is quite good. You might actually get better results with fewer threads (closer to the number of physical cores), but if you’re data-bandwidth-limited it may not have that big of an effect.

What I need is one of those new Threadripper 3 chips to play with, they should have much more cache, even the Ryzen 3950x with 16 cores and 32 threads has a massive 64MB of cache compared to my measly 12MB, twice the amount of cache memory per thread, hmmm nice.

Apart from perf issues, this looks like a big data race to me. I think you meant to write

function threadedGram2(x::Array{Array{Float64, 2}})
p = size(x[1])[2]
n = length(x)
z = [zeros(Float64, p, p) for i in 1:Threads.nthreads()]
@threads for i in 1:n
LinearAlgebra.mul!(z[Threads.threadid()], x[i]', x[i], 1, 1)
end
r = pop!(z)
for zz in z
r .+= zz
end
return r
end

You have a good point however, I have now noticed that the answer from threadedGram() is slightly different from serialGram() and the native multiplication (the latter two agree). I’ve modified your code and it now runs as fast as threadedGram():

function threadedGram2(x::Array{Array{Float64, 2}})
p = size(x[1])[2]
n = length(x)
z = [zeros(Float64, p, p) for i in 1:nthreads()]
@threads for i in 1:n
# mul!(z[threadid()], x[i]', x[i], 1, 1)
z[threadid()] .+= x[i]' * x[i]
end
r = pop!(z)
for zz in z
r .+= zz
end
return r
end
@time threadedGram2(x);
# 0.784108 seconds (20.11 k allocations: 764.628 MiB, 0.45% gc time)

Sharing code is good. I may never have picked up on that bug!

It appears that I misused the new shiny 5-arg mul!. Oops!

Try LinearAlgebra.mul!(z[Threads.threadid()], x[i]', x[i], true, true) instead. I somehow thought the API would translate integer arguments into the proper type for the underlying BLAS call, but it apparently doesn’t (can go straight to slack/#gripes).

The main reason for the 5-arg mul dance is not so much speed of each iteration, but rather to keep allocations / gc out of the loop (gc in multi-threaded contexts is complex to reason about and a perf trap).

If you have a parallel mapreduce(f, op, x) (e.g., reduce(op, Map(f), x) from Transducers.jl), a neat way to minimize allocation and call mul! would be to use LazyArrays. Something like this (untested):

using LazyArrays: @~
using Transducers: Map
z = reduce(Map(x -> @~ x'x), x; init=nothing) do a, b
a === nothing ? copy(b) : a .+= b
end

Slightly off topic but I think it would probably be useful to have this in Base:

import Base.zeros
"""
x = rand(Float64, (4,4))
zeros(x)
"""
function zeros(x::AbstractArray)
T = eltype(x)
dim = size(x)
return zeros(T, dim)
end
import Base.ones
"""
x = rand(Float64, (4,4))
ones(x)
"""
function ones(x::AbstractArray)
T = eltype(x)
dim = size(x)
return ones(T, dim)
end

zero(x)
" Get the additive identity element for the type of `x`"

although, one(x) would give you identity matrix, not sure if you want that or not (because it’s the multiplicative identity, if not, you can always fill(one(eltype(x)), size(x)) or what you’re already doing.