Questions on parallel programming: Threads.@threads vs @parallel with SharedArray


#1

Hi!

I have a few questions on parallel programming and multi-threading. If I understand correctly, the Threads.@threads feature is modeled after the multi-threading schemes of OpenMP and Cilk: exactly one thread runs on each logical processor, the heap memory is shared between all threads, and each thread gets its own stack. This seems consistent with the following code, which allocates a constant amount of memory regardless of the input vector size, even though the input vector is a normal Array, not a SharedArray.

using BenchmarkTools

function g(a, n)
   @inbounds Threads.@threads for i in 1:n
       a[i] = i*rand() + rand()^2 * (i - 1)^2 - 2;
   end
   return
end

n = 100000; a = zeros(Float64, n); @btime g(a, n)
#  1.573 ms (1 allocation: 32 bytes)

The above code is run with Threads.nthreads() equal 8.

  1. Can someone please verify the above understanding? If it is correct, then every stack-allocated variable read inside a threaded block has to be copied over to each thread’s stack when the threads are created, which seems useful to keep in mind.

In contrast, @parallel requires an array to be a SharedArray to be visible to all workers; otherwise the data has to be sent over, so neither the heap nor the stack is shared by default. The following compares the running times and allocations of Threads.@threads and @parallel, and it shows a clear advantage for @parallel, even though @parallel uses only 7 workers out of 8 processes while Threads.@threads uses all 8 threads, AFAICT (correct me if I am wrong). On the other hand, @parallel has a larger overhead in both time and allocations, which reverses the ranking when testing with simpler functions.

  2. So can someone explain the difference below, when it would be favorable to use Threads.@threads over @parallel, and whether the following benchmarks are expected to change in v1.0? I currently don’t have access to nightly or Linux, so this is on v0.6.1 on a Windows machine.

function f(a, n)
   @inbounds @parallel for i in 1:n
       a[i] = i*rand() + rand()^2 * (i - 1)^2 - 2;
   end
   return
end

function g(a, n)
   @inbounds Threads.@threads for i in 1:n
       a[i] = i*rand() + rand()^2 * (i - 1)^2 - 2;
   end
   return
end

n = 100000; a = SharedArray{Float64}(n); @btime f(a, n)
#  399.680 μs (942 allocations: 38.25 KiB)

n = 100000; a = zeros(Float64, n); @btime g(a, n)
#  1.661 ms (1 allocation: 32 bytes)
  3. Does the Julia backend of GPUArrays.jl use the Threads.@threads or @parallel constructs?

  4. Is it possible to use @simd with multi-threading? It gives an error anywhere I try it.

  5. Don’t you think it would be nice to have an in-place pmap!, like map!?

Thanks a lot in advance!


#2

It seems to be using coroutines/tasks/cooperative threading (@async routines), which differ from the multi-threading above in that cooperative tasks are not executed in parallel: they all run on a single logical processor, interleaved over time.
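For illustration, here is a minimal sketch of my own (v0.6-era syntax) showing that @async tasks interleave on one thread rather than running in parallel:

```julia
# Sketch (my own): @async tasks are cooperative. They take turns on a
# single thread, switching only at yield points, so they interleave
# instead of running simultaneously.
function cooperative_demo()
    order = Int[]
    @sync for i in 1:3
        @async begin
            push!(order, i)  # first visit by task i
            yield()          # hand control back to the scheduler
            push!(order, i)  # second visit, after the other tasks ran
        end
    end
    return order
end
```

Since everything runs on one thread, the push! calls never race; each task simply records two visits, interleaved with the others according to the scheduler.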


#3

Yes, the heap is shared by threads and the stacks are separate. But threads can still see each other’s stack memory, so I don’t think the extra copying you mention is needed.

Your example uses the default RNG, which is not thread-safe, so the threads are competing for access to it. The simplest fix is to use a vector of RNGs, indexed by thread id. This should not only be more correct (i.e. actually produce reliably random numbers) but also avoid cache collisions.
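A minimal sketch of that fix, assuming v0.6-era APIs (the sequential seeding here is naive; randjump would give properly independent streams, and on later Julia versions you would also need `using Random`):

```julia
# Per-thread RNG fix: one MersenneTwister per thread, indexed by
# threadid(), so threads never contend for the shared global RNG.
# Seeding with 1:nthreads() is naive; prefer randjump for real work.
function g_rng!(a, n)
    rngs = [MersenneTwister(i) for i in 1:Threads.nthreads()]
    @inbounds Threads.@threads for i in 1:n
        r = rngs[Threads.threadid()]
        a[i] = i*rand(r) + rand(r)^2 * (i - 1)^2 - 2
    end
    return a
end
```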

I think you need to use @sync @parallel to get true execution times for the multiprocess version. On my Linux system that makes @parallel look less efficient than @threads. (A factor of 2 here, with the above-mentioned vector of RNGs for threads.)

So @threads is likely to be a bit more efficient and gives immediate access to all existing objects, at the cost of needing to watch out for resource contention and avoid thread-unsafe functions. Unfortunately since multithreading is still “experimental”, some causes of contention and lack of safety are not documented. (But Yuyichao has been doing great work to shrink their number!)

@simd should be fine in a threaded context, but you may need to build a separate inner SIMD loop to avoid confusing the threading macro.
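Something like the following sketch (the by-hand chunking scheme is my own): the outer @threads loop runs over thread-sized chunks, and only the plain inner loop carries the @simd annotation, so the two macros never annotate the same loop:

```julia
# Sketch: give @threads the chunks and @simd the inner loop.
# Chunk boundaries are computed by hand so each thread gets a
# contiguous, disjoint slice of the array.
function axpy_threaded!(a::Vector{Float64})
    n  = length(a)
    nt = Threads.nthreads()
    Threads.@threads for t in 1:nt
        lo = 1 + div(n * (t - 1), nt)
        hi = div(n * t, nt)
        @inbounds @simd for i in lo:hi
            a[i] = 2.0 * a[i] + 1.0
        end
    end
    return a
end
```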


#4

Yes, using @simd together with @threads is possible. See http://slides.com/valentinchuravy/julia-parallelism#/5/6 for an example.

I would recommend looking into DistributedArrays.jl, which does offer that.


#5

Hmm, this bit is kind of confusing; is there a reference I can take a look at? I got rid of rand() for easier benchmarking and did the following instead, and the results are very much in favor of @threads. Thanks!

using BenchmarkTools
using StaticArrays

function g(a, t, n)
   @inbounds Threads.@threads for i in 1:n
       a[i] = i + sum(t)/length(t) - log(sum(t))/2;
   end
   return
end

function f(a, t, n)
   @inbounds @sync @parallel for i in 1:n
       a[i] = i + sum(t)/length(t) - log(sum(t))/2;
   end
   return
end

n = 10000; a = zeros(Float64, n); t = SVector(Tuple(i for i in 1:1000)); @btime g($a, $t, $n)
#  432.866 μs (1 allocation: 7.94 KiB)

n = 10000; a = SharedArray{Float64}(n); t = SVector(Tuple(i for i in 1:1000)); @btime f($a, $t, $n)
#  2.385 ms (8326 allocations: 318.86 KiB)

n = 10000; a = zeros(Float64, n); t = SVector(Tuple(i for i in 1:100)); @btime g($a, $t, $n)
#  77.310 μs (1 allocation: 896 bytes)

n = 10000; a = SharedArray{Float64}(n); t = SVector(Tuple(i for i in 1:100)); @btime f($a, $t, $n)
#  2.009 ms (2018 allocations: 84.13 KiB)

n = 10000; a = zeros(Float64, n); t = SVector(Tuple(i for i in 1:10)); @btime g($a, $t, $n)
#  30.268 μs (1 allocation: 112 bytes)

n = 10000; a = SharedArray{Float64}(n); t = SVector(Tuple(i for i in 1:10)); @btime f($a, $t, $n)
#  1.918 ms (1547 allocations: 62.64 KiB)

Side note: when I used a Tuple instead of an SVector, I was getting an inefficiency when summing tuples of size 15 and more; apparently the tuple was moved to the heap before summing, as the allocations were proportional to the tuple size. This is not related to parallel programming per se, but I think it is relevant here.
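For reference, that effect can be checked directly with @allocated (the size-15 cutoff is what I observed on v0.6 and may well differ on other versions):

```julia
# Measure allocations of sum() over tuples of different sizes.
# On v0.6 the large tuple appeared to be moved to the heap first.
small = ntuple(i -> Float64(i), 10)
big   = ntuple(i -> Float64(i), 20)
sum(small); sum(big)              # warm up: compile both methods first
bytes_small = @allocated sum(small)
bytes_big   = @allocated sum(big)
```

On v0.6 bytes_small was 0 while bytes_big grew with the tuple length; newer versions may optimize both cases.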


#6

That presentation is awesome, thanks a lot!

Also DistributedArrays.jl seems great, but I think it only supports in-place map! with DArrays, not SharedArrays, so I think the latter would still be interesting to have.

Edit: actually this seems trivial to do with @parallel and a for loop, so unless an in-place pmap! also activates SIMD by default, it’s probably not much value added.
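For the record, the trivial version I had in mind looks roughly like this (v0.6 syntax; the name shared_map! is my own, and on later versions @parallel became @distributed in the Distributed stdlib):

```julia
# In-place "pmap!"-style loop over a SharedArray via @sync @parallel
# (v0.6 syntax). @sync is needed so the call blocks until all workers
# have finished writing their chunks.
function shared_map!(f, a::SharedArray)
    @sync @parallel for i in 1:length(a)
        @inbounds a[i] = f(a[i])
    end
    return a
end
```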


#7

The current implementation of threading uses the libuv interface; for Linux this is a thin wrapper around native pthreads and I presume the logic is made similar for other platforms. So manual pages for pthreads (easy to find online) and related discussions may be helpful.

With regard to stack variables, my understanding is this: @threads builds a function which will be called in the context of each thread. Arguments are put onto the appropriate stack for the thread where the function will execute (and local variables may be added there). Variables in earlier stack frames can be referenced by pointer and need not be in the local stack. Apparently large immutables like your SVector are copied (once!) to new memory on the heap, and the address of this is one such reference.

The strange handling of large tuples seems to be a trick to avoid overly expensive inference.


#8

Great, thanks for all the comments and pointers!