Question about Multi-threading Performance


#1

I was testing out some simple multithreaded code and noticed that in some instances the multithreaded version performs much more poorly than the single-threaded version. I’ve posted a minimum working example below. Does anyone see why the multithreaded version performs so poorly? The threads are all accessing different matrices, so they shouldn’t ever be trying to overwrite the same piece of memory. Thank you in advance for any help.

#Multithreaded.
function tf!(x::Vector{Matrix{Float64}},N::Int64)
    Threads.@threads for ii=1:N
        @inbounds x_thd = x[Threads.threadid()]
        for nn=1:100
            for mm=1:100
                @inbounds x_thd[mm,nn] += randn()
            end
        end
    end
    return nothing
end

#Single Threaded
function tf_nt!(x::Vector{Matrix{Float64}},N::Int64)
    for ii=1:N
        z = rand(1:Threads.nthreads())
		x_z = x[z]
        for nn=1:100
            for mm=1:100
                @inbounds x_z[mm,nn] += randn()
            end
        end
    end
    return nothing
end

vec_of_mats = [zeros(100,100) for tt=1:Threads.nthreads()]

@btime tf!($vec_of_mats,10_000)
2.457 s (1 allocation: 32 bytes)

@btime tf_nt!(vec_of_mats,10_000)
501.252 (0 allocation: 0 bytes)


#2

Not at a computer, but resource contention due to randn in the loop?


#3

randn isn’t threadsafe.

using Compat, Compat.Random

const twisters = [MersenneTwister() for i ∈ 1:Threads.nthreads()];

#Multithreaded.
function tf!(x::Vector{Matrix{Float64}},N::Int64)
    Threads.@threads for ii=1:N
        id = Threads.threadid()
        twister = twisters[id]
        @inbounds x_thd = x[id]
        for nn=1:100
            for mm=1:100
                @inbounds x_thd[mm,nn] += randn(twister)
            end
        end
    end
    return nothing
end

yields:

julia> @btime tf!($vec_of_mats,10_000)
  32.722 ms (1 allocation: 32 bytes)

julia> @btime tf_nt!($vec_of_mats,10_000)
  451.402 ms (0 allocations: 0 bytes)

This is on a computer with 16 cores and 32 threads. Not quite 16x, but pretty good.

EDIT:
DO NOT FOLLOW THE ABOVE EXAMPLE.
See these comments for an explanation:



Although that entire thread is on a similar topic to this one.

The key issues are:

  1. The twisters I generate at the top could overlap, meaning the results aren’t necessarily random (ie, parts of the sequences of two of them could be identical!).
  2. False sharing is hurting performance – unnecessary communication is going on between threads, as they update one another about what’s going on with the twisters.

Using the TRNG object from KissThreading solves both of these problems. Using the threaded map function from that library,

# on 0.6:
# Pkg.clone("https://github.com/bkamins/KissThreading.jl")
# on 0.7:
# ] add https://github.com/bkamins/KissThreading.jl 
using KissThreading, BenchmarkTools

vec_of_mats = [zeros(100,100) for tt=1:Threads.nthreads()];

function tf!(x::Vector{Matrix{Float64}},N::Int64)
    tmap!(x, x) do x
        id = Threads.threadid()
        n = Threads.nthreads()
        @inbounds for i ∈ 1+(id-1)*N÷n:id*N÷n, j ∈ eachindex(x)
            x[j] += randn(TRNG[Threads.threadid()])
        end
        x
    end
    nothing
end

The way I split up the iteration range there is a little awkward. Maybe I should use cld instead of div or ÷, so that the earlier matrices will see more updates in both, but I figured the point was just to run threading.

This is correct in that none of the twisters should overlap, and the TRNG object is also created in a way so that the resulting twisters are not located next to each other in memory either, preventing false sharing:

julia> @btime tf!($vec_of_mats,10_000)
  21.976 ms (3 allocations: 96 bytes)

Using the original function, but simply subbing in TRNG in place of twisters:

function tfo!(x::Vector{Matrix{Float64}},N::Int64)
    Threads.@threads for ii=1:N
        id = Threads.threadid()
        twister = TRNG[id]
        @inbounds x_thd = x[id]
        for nn=1:100
            for mm=1:100
                @inbounds x_thd[mm,nn] += randn(twister)
            end
        end
    end
    return nothing
end

we also see that it is dramatically faster than the (also incorrect) version that suffered from false sharing:

julia> @btime tfo!($vec_of_mats,10_000);
  23.607 ms (1 allocation: 32 bytes)

Also, about twenty times faster than not using threading!


Poor performance on cluster multithreading
#4

@tkoolen and @Elrod, thanks! Mystery solved.


Poor performance on cluster multithreading