Question about Multi-threading Performance

randn isn’t thread-safe: multiple threads drawing from the same shared RNG race on its state, so each thread needs its own generator.

using Compat, Compat.Random   # Compat so the same code runs on 0.6 and 0.7

const twisters = [MersenneTwister() for i ∈ 1:Threads.nthreads()];   # one RNG per thread

# Multithreaded: each thread draws from its own twister and accumulates into
# its own 100×100 matrix.
function tf!(x::Vector{Matrix{Float64}},N::Int64)
    Threads.@threads for ii=1:N
        id = Threads.threadid()
        twister = twisters[id]
        @inbounds x_thd = x[id]
        for nn=1:100
            for mm=1:100
                @inbounds x_thd[mm,nn] += randn(twister)
            end
        end
    end
    return nothing
end
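
The benchmark below also calls tf_nt!, the single-threaded baseline, which wasn't shown above. A minimal sketch of what it presumably looks like (the body here is my guess, not the original code; vec_of_mats is the vector of 100×100 zero matrices defined further down):

# Assumed single-threaded baseline (not necessarily the original tf_nt!):
# the same update loop, one twister, no @threads.
function tf_nt!(x::Vector{Matrix{Float64}},N::Int64)
    twister = twisters[1]
    for ii=1:N
        @inbounds x_thd = x[1]
        for nn=1:100
            for mm=1:100
                @inbounds x_thd[mm,nn] += randn(twister)
            end
        end
    end
    return nothing
end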

Benchmarking the threaded version against the single-threaded baseline yields:

julia> @btime tf!($vec_of_mats,10_000)
  32.722 ms (1 allocation: 32 bytes)

julia> @btime tf_nt!($vec_of_mats,10_000)
  451.402 ms (0 allocations: 0 bytes)

This is on a computer with 16 cores and 32 threads. Not quite 16x, but pretty good.
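
For anyone reproducing these timings: on 0.6 and 0.7 the thread count has to be set with the JULIA_NUM_THREADS environment variable before Julia starts (there is no command-line flag for it in those versions), and can be checked from within the session:

# in the shell, before launching Julia:
#   export JULIA_NUM_THREADS=32
# then inside Julia:
Threads.nthreads()   # reports 32 for the runs above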

EDIT:
DO NOT FOLLOW THE ABOVE EXAMPLE.
See these comments for an explanation; the entire thread they come from is on a similar topic to this one.

The key issues are:

  1. The twisters I generate at the top are seeded independently, so parts of their sequences could overlap, meaning the combined results aren’t guaranteed to behave like independent random draws (i.e., stretches of two of the streams could be identical!). One way to rule this out is sketched just after this list.
  2. False sharing is hurting performance: the per-thread twisters are allocated next to each other on the heap, so their frequently updated state ends up on shared cache lines, and the cores keep invalidating each other’s caches even though no data actually needs to be exchanged between threads.
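
For reference, here is a minimal sketch of how issue 1 could be fixed by hand on 0.7/1.0, using Future.randjump to place each per-thread stream 10^20 draws past the previous one so they cannot overlap. This is my own illustration, not the original code and not what KissThreading does internally, and it does nothing about issue 2 (the jumped copies can still end up close together in memory):

using Random, Future

# Hand-rolled non-overlapping streams: each twister starts 10^20 steps after
# the previous one, far more draws than any realistic run will consume.
const jumped_twisters = let n = Threads.nthreads()
    rngs = Vector{MersenneTwister}(undef, n)
    rngs[1] = MersenneTwister(1234)
    for i in 2:n
        rngs[i] = Future.randjump(rngs[i-1], big(10)^20)
    end
    rngs
end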

The TRNG object from KissThreading solves both of these problems. Using tmap!, the threaded map function from that library:

# on 0.6:
# Pkg.clone("https://github.com/bkamins/KissThreading.jl")
# on 0.7:
# ] add https://github.com/bkamins/KissThreading.jl
using KissThreading, BenchmarkTools

vec_of_mats = [zeros(100,100) for tt=1:Threads.nthreads()];

function tf!(x::Vector{Matrix{Float64}},N::Int64)
    tmap!(x, x) do x   # note: the do-block argument shadows the outer x; inside, x is one matrix
        id = Threads.threadid()
        n = Threads.nthreads()
        # each thread handles a contiguous slice of the N outer iterations
        @inbounds for i ∈ 1+(id-1)*N÷n:id*N÷n, j ∈ eachindex(x)
            x[j] += randn(TRNG[Threads.threadid()])
        end
        x
    end
    nothing
end

The way I split up the iteration range there is a little awkward. Maybe I should have used cld instead of div (÷), so that when N doesn’t divide evenly the earlier matrices consistently pick up the extra updates, but I figured the point was just to exercise threading; a sketch of that split follows.
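
For concreteness, a sketch of the cld-based split (chunk_range is a hypothetical helper, not something from the post or from KissThreading): every chunk except possibly the last gets cld(N, n) iterations.

# Split 1:N into n chunks; with cld the earlier chunks take the extra work
# and only the last chunk comes up short.
function chunk_range(id, n, N)
    len = cld(N, n)                 # chunk length, rounded up
    lo  = (id - 1) * len + 1
    hi  = min(id * len, N)          # clamp the final chunk
    return lo:hi
end

With N = 10_000 and n = 32 this hands 313 iterations to each of the first 31 threads and the remaining 297 to the last.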

This version is correct in that none of the twisters should overlap, and the TRNG object is also constructed so that the resulting twisters are not located next to each other in memory, which prevents false sharing:

julia> @btime tf!($vec_of_mats,10_000)
  21.976 ms (3 allocations: 96 bytes)
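
As a rough sanity check of that memory-layout claim (my own addition, assuming the elements of TRNG are ordinary heap-allocated MersenneTwister objects), one can look at how far apart the RNG objects sit on the heap; gaps of at least 64 bytes mean their frequently mutated headers cannot share a cache line:

# distances in bytes between consecutive per-thread RNG objects
addrs = sort([UInt(pointer_from_objref(r)) for r in TRNG])
diff(addrs)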

Using the original function, but simply subbing in TRNG in place of twisters:

function tfo!(x::Vector{Matrix{Float64}},N::Int64)
    Threads.@threads for ii=1:N
        id = Threads.threadid()
        twister = TRNG[id]
        @inbounds x_thd = x[id]
        for nn=1:100
            for mm=1:100
                @inbounds x_thd[mm,nn] += randn(twister)
            end
        end
    end
    return nothing
end

we also see that it is noticeably faster (about 1.4×) than the earlier version, which was not only incorrect but also suffered from false sharing:

julia> @btime tfo!($vec_of_mats,10_000);
  23.607 ms (1 allocation: 32 bytes)

Also, about twenty times faster than not using threading!
