Data structures for threaded computing

I have noticed that I am not getting the correct multiple speedups I should be getting when using threads in Julia v1.3 rc4 for what are essentially independent calculations. When I try breaking up the matrices into arrays of arrays I get a < x4 speedup over single threaded workloads with 12 threads.

I am guessing the reason this is happening is because arrays can only be accessed by one thread at a time. My question is are there Julia data structures that allow multiple threads to access the data structure at the same time? For instance do dictionaries and/or tuples allow this to occur, and if so do I need to worry about locks when using these? The data is read only.

1 Like

The speedup you see in practice will be influenced by a lot of things. The different threads do for instance share cache memory. If computations on one thread fill the cache, the other threads won’t speed things up very much.

1 Like

That guess is not correct (but see https://en.wikipedia.org/wiki/False_sharing).

I’ve tried reducing the block size so that for each block 12 threads will easily fit into cache but this doesn’t make a difference.

First if you have a 6 core CPU then the 12 threads are using hyper-threading which means 2 threads try to share resources in a single core. If both threads are doing the same computations then there is likely conflict over those resources. In extreme situations this might mean the threads run at half speed. (If the threads are doing wildly different operations then there are less conflicts inside the core and the threads run near max speed and everyone is happy.) But that would only get you down to a x6 speed increase.

Next have you run some sort of process monitor i.e. top or htop? Are all cores fully utilized? If you are not seeing 12 cores staying at 100% utilization then you have some sort of contention that is causing the threads to wait.

This could be the amount of time a thread has to spend processing a “chunk”. If that is short then you could be hit by the synchronization of the thread “asking” for a new chunk to process. The longer each thread can run independently the less synchronization that has to happen, and the less time the threads wait to synchronize.

Also are you printing or writing information to disk during the process? This is another area of contention that could be slowing everything down.

Lastly I there could be contention with the garbage collection. I really haven’t played with threading and the garbage collector but the collector might be “stopping the world” to often. Not sure how to debug that other than trying to reduce memory allocation.

1 Like

Are you perhaps threading a loop in global scope with Threads.@threads ? I am asking because you write “essentially independent”.

I did that in the beginning and the code ran slower with more than one thread than the serial version. Putting the threaded loop in a function gave the expected speed-up. I guess I introduced a hidden closure by using Threads.@threads.

I think the best way is to use an example. Say we want to calculate the matrix dot product (Gram) x'x for very large (long) matrices where the calculation may not fit into memory, we could carry out the calculating in blocks, we could also parallelize the calculation using threads, the code is below. For some reason I’m now getting around x5.5 faster with the same data load as before:

using Base.Threads
using LinearAlgebra.BLAS: set_num_threads

nthreads()
# 12

# Generate the block matrices
function makeNestedMatrix(n::Integer = 1000, p::Integer = 100, chunk = 1000)
  x = Array{Array{Float64, 2}}(undef, n)
  for i in 1:n
    x[i] = rand(Float64, chunk, p)
  end
  return x
end

function threadedGram(x::Array{Array{Float64, 2}})
  p = size(x[1])[2]
  n = length(x)
  z = zeros(Float64, p, p)
  @threads for i in 1:n
    z .+= x[i]' * x[i]
  end
  return z
end

function serialGram(x::Array{Array{Float64, 2}})
  p = size(x[1])[2]
  n = length(x)
  z = zeros(Float64, p, p)
  for i in 1:n
    z .+= x[i]' * x[i]
  end
  return z
end

x = makeNestedMatrix(10_000, 100, 1000);

set_num_threads(1)
@time serialGram(x);
# 4.238534 seconds (20.01 k allocations: 763.779 MiB, 21.82% gc time)

@time threadedGram(x);
# 0.765528 seconds (20.10 k allocations: 763.788 MiB, 0.58% gc time)

# Check with native matrix multiplication

# Generate the dense matrix for comparison
y = zeros(size(x[1])[1]*length(x), size(x[1])[2]);

chunk = size(x[1])[1];
for i in 1:length(x)
  y[((i - 1)*chunk + 1):(i*chunk),:] .= x[i]
end

set_num_threads(12)
@time y' * y;
# 3.229251 seconds (7 allocations: 78.375 KiB)

I now always use functions when doing these kind of comparisons since they are compiled and optimised for specific arguments.

Sure. Everything within that loop was a single function doing the actual computation which got called with changing parameters (parameter variation problem) over and over again. So from a performance perspective it was fine for serial code. With threading it was not.

If you’re using matrix multiplication (or other functionality from BLAS) — then your code is already multithreaded within the library call. BLAS’ multithreading doesn’t (yet) compose well with Julia’s threads, so this doesn’t play nicely at the moment, but that’s slated to change for 1.4.

5 Likes

You might have a point there. I have only 32GB on my computer and generated 30,000 rather the original 10,000 chunks. The CPU utilization for single core and multi-threaded is given below. We hit 100% CPU utilization on threaded briefly. The times are 10.666989 s and 2.904637 s for single and multithreaded respectively.

I guess I’ll gave to rent some cloud compute briefly to test this properly.

With the original 10,000 chunks we don’t even his 100% for multi-threaded applications. The green spike in the middle is for single threaded run.

That looks like you’re data-bandwidth-limited. Having a gazillion cores isn’t going to do much if your data can’t get to them fast enough.

What I’m doing should be find should it not? I switch off BLAS threading using set_num_threads(1) to directly compare multi-threaded and single threaded response. The system monitor indicates this switch is working since I only see the selected number of threads active when I do the runs.

I thought that the the * function for 2D arrays is just dispatching to BLAS and we control the number of threads C uses with set_num_threads() function? The only thing I know about partr is that it has something to do with Julia concurrency - majorly naive about that I’m afraid.

Yes, you’re correct. I must confess that I didn’t scroll down on your code sample far enough to actually see how you were benchmarking — I just saw the matmul thought that might be your issue. Sorry about that.

That said, you’ll almost never see perfect linear scaling. If you have a 6 core machine, then getting a 4-5.5x scaling with 12 threads is quite good. You might actually get better results with fewer threads (closer to the number of physical cores), but if you’re data-bandwidth-limited it may not have that big of an effect.

1 Like

What I need is one of those new Threadripper 3 chips to play with, they should have much more cache, even the Ryzen 3950x with 16 cores and 32 threads has a massive 64MB of cache compared to my measly 12MB, twice the amount of cache memory per thread, hmmm nice. :smile:

Apart from perf issues, this looks like a big data race to me. I think you meant to write

function threadedGram2(x::Array{Array{Float64, 2}})
  p = size(x[1])[2]
  n = length(x)
  z = [zeros(Float64, p, p) for i in 1:Threads.nthreads()]
  @threads for i in 1:n
    LinearAlgebra.mul!(z[Threads.threadid()], x[i]', x[i], 1, 1)
  end
  r = pop!(z)
  for zz in z
      r .+= zz
  end
  return r
end
1 Like

Here is the benchmark to compare them all:

set_num_threads(1)

@time serialGram(x);
#  3.487206 seconds (20.01 k allocations: 763.779 MiB, 1.84% gc time)

@time threadedGram(x);
#  0.762595 seconds (20.09 k allocations: 763.788 MiB)

@time threadedGram2(x);
#  9.677635 seconds (123 allocations: 947.859 KiB)

You have a good point however, I have now noticed that the answer from threadedGram() is slightly different from serialGram() and the native multiplication (the latter two agree). I’ve modified your code and it now runs as fast as threadedGram():

function threadedGram2(x::Array{Array{Float64, 2}})
  p = size(x[1])[2]
  n = length(x)
  z = [zeros(Float64, p, p) for i in 1:nthreads()]
  @threads for i in 1:n
    # mul!(z[threadid()], x[i]', x[i], 1, 1)
    z[threadid()] .+= x[i]' * x[i]
  end
  r = pop!(z)
  for zz in z
      r .+= zz
  end
  return r
end

@time threadedGram2(x);
# 0.784108 seconds (20.11 k allocations: 764.628 MiB, 0.45% gc time)

Sharing code is good. I may never have picked up on that bug!

It appears that I misused the new shiny 5-arg mul!. Oops!

Try LinearAlgebra.mul!(z[Threads.threadid()], x[i]', x[i], true, true) instead. I somehow thought the API would translate integer arguments into the proper type for the underlying BLAS call, but it apparently doesn’t (can go straight to slack/#gripes).

The main reason for the 5-arg mul dance is not so much speed of each iteration, but rather to keep allocations / gc out of the loop (gc in multi-threaded contexts is complex to reason about and a perf trap).

1 Like

I think you just need something like https://github.com/JuliaLang/julia/pull/33229 for ::Adjoint * ::Matrix. Julia issue or PR might be a better place for this.

1 Like

If you have a parallel mapreduce(f, op, x) (e.g., reduce(op, Map(f), x) from Transducers.jl), a neat way to minimize allocation and call mul! would be to use LazyArrays. Something like this (untested):

using LazyArrays: @~
using Transducers: Map

z = reduce(Map(x -> @~ x'x), x; init=nothing) do a, b
    a === nothing ? copy(b) : a .+= b
end

See also: Parallel reductions

2 Likes