Data structures for threaded computing

dataDiver · October 21, 2019, 12:07pm

I have noticed that I am not getting the correct multiple speedups I should be getting when using threads in Julia v1.3 rc4 for what are essentially independent calculations. When I try breaking up the matrices into arrays of arrays I get a < x4 speedup over single threaded workloads with 12 threads.

I am guessing the reason this is happening is because arrays can only be accessed by one thread at a time. My question is are there Julia data structures that allow multiple threads to access the data structure at the same time? For instance do dictionaries and/or tuples allow this to occur, and if so do I need to worry about locks when using these? The data is read only.

baggepinnen · October 21, 2019, 12:45pm

The speedup you see in practice will be influenced by a lot of things. The different threads do for instance share cache memory. If computations on one thread fill the cache, the other threads won’t speed things up very much.

kristoffer.carlsson · October 21, 2019, 1:05pm

That guess is not correct (but see False sharing - Wikipedia).

dataDiver · October 21, 2019, 1:42pm

I’ve tried reducing the block size so that for each block 12 threads will easily fit into cache but this doesn’t make a difference.

pixel27 · October 21, 2019, 2:39pm

First if you have a 6 core CPU then the 12 threads are using hyper-threading which means 2 threads try to share resources in a single core. If both threads are doing the same computations then there is likely conflict over those resources. In extreme situations this might mean the threads run at half speed. (If the threads are doing wildly different operations then there are less conflicts inside the core and the threads run near max speed and everyone is happy.) But that would only get you down to a x6 speed increase.

Next have you run some sort of process monitor i.e. top or htop? Are all cores fully utilized? If you are not seeing 12 cores staying at 100% utilization then you have some sort of contention that is causing the threads to wait.

This could be the amount of time a thread has to spend processing a “chunk”. If that is short then you could be hit by the synchronization of the thread “asking” for a new chunk to process. The longer each thread can run independently the less synchronization that has to happen, and the less time the threads wait to synchronize.

Also are you printing or writing information to disk during the process? This is another area of contention that could be slowing everything down.

Lastly I there could be contention with the garbage collection. I really haven’t played with threading and the garbage collector but the collector might be “stopping the world” to often. Not sure how to debug that other than trying to reduce memory allocation.

ckoe-bccms · October 21, 2019, 3:46pm

Are you perhaps threading a loop in global scope with Threads.@threads ? I am asking because you write “essentially independent”.

I did that in the beginning and the code ran slower with more than one thread than the serial version. Putting the threaded loop in a function gave the expected speed-up. I guess I introduced a hidden closure by using Threads.@threads.

dataDiver · October 21, 2019, 4:08pm

I think the best way is to use an example. Say we want to calculate the matrix dot product (Gram) x'x for very large (long) matrices where the calculation may not fit into memory, we could carry out the calculating in blocks, we could also parallelize the calculation using threads, the code is below. For some reason I’m now getting around x5.5 faster with the same data load as before:

using Base.Threads
using LinearAlgebra.BLAS: set_num_threads

nthreads()
# 12

# Generate the block matrices
function makeNestedMatrix(n::Integer = 1000, p::Integer = 100, chunk = 1000)
  x = Array{Array{Float64, 2}}(undef, n)
  for i in 1:n
    x[i] = rand(Float64, chunk, p)
  end
  return x
end

function threadedGram(x::Array{Array{Float64, 2}})
  p = size(x[1])[2]
  n = length(x)
  z = zeros(Float64, p, p)
  @threads for i in 1:n
    z .+= x[i]' * x[i]
  end
  return z
end

function serialGram(x::Array{Array{Float64, 2}})
  p = size(x[1])[2]
  n = length(x)
  z = zeros(Float64, p, p)
  for i in 1:n
    z .+= x[i]' * x[i]
  end
  return z
end

x = makeNestedMatrix(10_000, 100, 1000);

set_num_threads(1)
@time serialGram(x);
# 4.238534 seconds (20.01 k allocations: 763.779 MiB, 21.82% gc time)

@time threadedGram(x);
# 0.765528 seconds (20.10 k allocations: 763.788 MiB, 0.58% gc time)

# Check with native matrix multiplication

# Generate the dense matrix for comparison
y = zeros(size(x[1])[1]*length(x), size(x[1])[2]);

chunk = size(x[1])[1];
for i in 1:length(x)
  y[((i - 1)*chunk + 1):(i*chunk),:] .= x[i]
end

set_num_threads(12)
@time y' * y;
# 3.229251 seconds (7 allocations: 78.375 KiB)

dataDiver · October 21, 2019, 4:09pm

I now always use functions when doing these kind of comparisons since they are compiled and optimised for specific arguments.

ckoe-bccms · October 21, 2019, 4:19pm

Sure. Everything within that loop was a single function doing the actual computation which got called with changing parameters (parameter variation problem) over and over again. So from a performance perspective it was fine for serial code. With threading it was not.

mbauman · October 21, 2019, 4:32pm

If you’re using matrix multiplication (or other functionality from BLAS) — then your code is already multithreaded within the library call. BLAS’ multithreading doesn’t (yet) compose well with Julia’s threads, so this doesn’t play nicely at the moment, but that’s slated to change for 1.4.

https://github.com/JuliaLang/julia/issues/32786

dataDiver · October 21, 2019, 5:10pm

You might have a point there. I have only 32GB on my computer and generated 30,000 rather the original 10,000 chunks. The CPU utilization for single core and multi-threaded is given below. We hit 100% CPU utilization on threaded briefly. The times are 10.666989 s and 2.904637 s for single and multithreaded respectively.

I guess I’ll gave to rent some cloud compute briefly to test this properly.

With the original 10,000 chunks we don’t even his 100% for multi-threaded applications. The green spike in the middle is for single threaded run.

mbauman · October 21, 2019, 5:13pm

That looks like you’re data-bandwidth-limited. Having a gazillion cores isn’t going to do much if your data can’t get to them fast enough.

dataDiver · October 21, 2019, 5:17pm

What I’m doing should be find should it not? I switch off BLAS threading using set_num_threads(1) to directly compare multi-threaded and single threaded response. The system monitor indicates this switch is working since I only see the selected number of threads active when I do the runs.

I thought that the the * function for 2D arrays is just dispatching to BLAS and we control the number of threads C uses with set_num_threads() function? The only thing I know about partr is that it has something to do with Julia concurrency - majorly naive about that I’m afraid.

mbauman · October 21, 2019, 5:27pm

Yes, you’re correct. I must confess that I didn’t scroll down on your code sample far enough to actually see how you were benchmarking — I just saw the matmul thought that might be your issue. Sorry about that.

That said, you’ll almost never see perfect linear scaling. If you have a 6 core machine, then getting a 4-5.5x scaling with 12 threads is quite good. You might actually get better results with fewer threads (closer to the number of physical cores), but if you’re data-bandwidth-limited it may not have that big of an effect.

dataDiver · October 21, 2019, 6:04pm

What I need is one of those new Threadripper 3 chips to play with, they should have much more cache, even the Ryzen 3950x with 16 cores and 32 threads has a massive 64MB of cache compared to my measly 12MB, twice the amount of cache memory per thread, hmmm nice.

foobar_lv2 · October 22, 2019, 7:48am

dataDiver:

function threadedGram(x::Array{Array{Float64, 2}})
  p = size(x[1])[2]
  n = length(x)
  z = zeros(Float64, p, p)
  @threads for i in 1:n
    z .+= x[i]' * x[i]
  end
  return z
end

Apart from perf issues, this looks like a big data race to me. I think you meant to write

function threadedGram2(x::Array{Array{Float64, 2}})
  p = size(x[1])[2]
  n = length(x)
  z = [zeros(Float64, p, p) for i in 1:Threads.nthreads()]
  @threads for i in 1:n
    LinearAlgebra.mul!(z[Threads.threadid()], x[i]', x[i], 1, 1)
  end
  r = pop!(z)
  for zz in z
      r .+= zz
  end
  return r
end

dataDiver · October 22, 2019, 8:21am

foobar_lv2:

 function threadedGram(x::Array{Array{Float64, 2}})
  p = size(x[1])[2]
  n = length(x)
  z = zeros(Float64, p, p)
  @threads for i in 1:n
    z .+= x[i]' * x[i]
  end
  return z
end

Here is the benchmark to compare them all:

set_num_threads(1)

@time serialGram(x);
#  3.487206 seconds (20.01 k allocations: 763.779 MiB, 1.84% gc time)

@time threadedGram(x);
#  0.762595 seconds (20.09 k allocations: 763.788 MiB)

@time threadedGram2(x);
#  9.677635 seconds (123 allocations: 947.859 KiB)

You have a good point however, I have now noticed that the answer from threadedGram() is slightly different from serialGram() and the native multiplication (the latter two agree). I’ve modified your code and it now runs as fast as threadedGram():

function threadedGram2(x::Array{Array{Float64, 2}})
  p = size(x[1])[2]
  n = length(x)
  z = [zeros(Float64, p, p) for i in 1:nthreads()]
  @threads for i in 1:n
    # mul!(z[threadid()], x[i]', x[i], 1, 1)
    z[threadid()] .+= x[i]' * x[i]
  end
  r = pop!(z)
  for zz in z
      r .+= zz
  end
  return r
end

@time threadedGram2(x);
# 0.784108 seconds (20.11 k allocations: 764.628 MiB, 0.45% gc time)

Sharing code is good. I may never have picked up on that bug!

foobar_lv2 · October 22, 2019, 8:42pm

It appears that I misused the new shiny 5-arg mul!. Oops!

Try LinearAlgebra.mul!(z[Threads.threadid()], x[i]', x[i], true, true) instead. I somehow thought the API would translate integer arguments into the proper type for the underlying BLAS call, but it apparently doesn’t (can go straight to slack/#gripes).

The main reason for the 5-arg mul dance is not so much speed of each iteration, but rather to keep allocations / gc out of the loop (gc in multi-threaded contexts is complex to reason about and a perf trap).

tkf · October 22, 2019, 9:07pm

I think you just need something like https://github.com/JuliaLang/julia/pull/33229 for ::Adjoint * ::Matrix. Julia issue or PR might be a better place for this.

tkf · October 22, 2019, 9:20pm

If you have a parallel mapreduce(f, op, x) (e.g., reduce(op, Map(f), x) from Transducers.jl), a neat way to minimize allocation and call mul! would be to use LazyArrays. Something like this (untested):

using LazyArrays: @~
using Transducers: Map

z = reduce(Map(x -> @~ x'x), x; init=nothing) do a, b
    a === nothing ? copy(b) : a .+= b
end

Topic		Replies	Views
Slower @threads than serial for array computations Julia at Scale	26	2716	May 7, 2020
Poor performance while multithreading (Julia 1.0) Performance multithreading	28	4013	February 11, 2019
Julia Threads vs BLAS threads Internals & Design	16	11034	July 26, 2018
Why doesn't multithreading help here? Performance	12	1436	August 22, 2020
Poor performance multiplying many (large) matrices multithreaded Performance question , linearalgebra	11	2501	July 13, 2020

Data structures for threaded computing

Related topics