Question about Multi-threading Performance

Elrod · June 30, 2018, 6:30am

randn isn’t threadsafe when every thread draws from the shared global RNG, so give each thread its own generator:

using Compat, Compat.Random

const twisters = [MersenneTwister() for i ∈ 1:Threads.nthreads()];

# Multithreaded: each thread updates its own matrix using its own twister.
function tf!(x::Vector{Matrix{Float64}},N::Int64)
    Threads.@threads for ii=1:N
        id = Threads.threadid()
        twister = twisters[id]
        @inbounds x_thd = x[id]
        for nn=1:100
            for mm=1:100
                @inbounds x_thd[mm,nn] += randn(twister)
            end
        end
    end
    return nothing
end

yields the following, where tf_nt! is the non-threaded version of the same loop (definition omitted):

julia> @btime tf!($vec_of_mats,10_000)
  32.722 ms (1 allocation: 32 bytes)

julia> @btime tf_nt!($vec_of_mats,10_000)
  451.402 ms (0 allocations: 0 bytes)

This is on a computer with 16 cores and 32 threads. The speedup is about 13.8x (451.4 ms vs. 32.7 ms): not quite 16x, but pretty good.

EDIT:
DO NOT FOLLOW THE ABOVE EXAMPLE.
See these comments for an explanation:

Although that entire thread is on a similar topic to this one.

The key issues are:

The twisters I generate at the top could overlap, meaning the results aren’t necessarily random (i.e., parts of the sequences of two of them could be identical!).
False sharing is hurting performance: the twisters can end up next to each other in memory, so threads generate unnecessary cache traffic keeping one another up to date on each twister’s state.
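
To make the first issue concrete, here is a minimal sketch of one way to build per-thread generators whose streams cannot overlap, using Future.randjump from the standard library (Julia 0.7 and later). The helper name make_thread_rngs and the seed are assumptions made for illustration; this is not necessarily how KissThreading builds TRNG, and it does nothing about false sharing.

```julia
using Random, Future

# Hypothetical helper (not part of KissThreading): build n generators where
# each one is the previous one jumped ahead by 10^20 steps, so the streams
# stay disjoint for any realistic number of draws.
function make_thread_rngs(n::Integer; seed::Integer = 1234)
    rngs = Vector{MersenneTwister}(undef, n)
    rngs[1] = MersenneTwister(seed)
    for i in 2:n
        # randjump returns a new generator and leaves rngs[i-1] untouched.
        rngs[i] = Future.randjump(rngs[i-1], big(10)^20)
    end
    return rngs
end

rngs = make_thread_rngs(Threads.nthreads())
```

The jump size big(10)^20 is the one the Future.randjump documentation describes as precomputed, which keeps construction cheap.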

Using the TRNG object from KissThreading solves both of these problems. With the threaded map function (tmap!) from that library:

# on 0.6:
# Pkg.clone("https://github.com/bkamins/KissThreading.jl")
# on 0.7:
# ] add https://github.com/bkamins/KissThreading.jl 
using KissThreading, BenchmarkTools

vec_of_mats = [zeros(100,100) for tt=1:Threads.nthreads()];

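function tf!(x::Vector{Matrix{Float64}},N::Int64)
    # tmap! applies the closure to every matrix in x (source and destination are both x).
    tmap!(x, x) do m
        id = Threads.threadid()
        n = Threads.nthreads()
        # Only the length of the i range matters (about N/n passes per matrix); i itself is unused.
        @inbounds for i ∈ 1+(id-1)*N÷n:id*N÷n, j ∈ eachindex(m)
            m[j] += randn(TRNG[id])
        end
        m
    end
    nothing
end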

The way I split up the iteration range there is a little awkward. Maybe I should use cld instead of div (÷), so that the earlier matrices see the extra updates in both versions when N isn’t divisible by the number of threads, but I figured the point was just to exercise threading.
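
As a small worked example of that split, the sketch below evaluates the same range expression for N = 10 and 4 chunks, next to a cld-based variant. The helper names chunk_div and chunk_cld are hypothetical and only serve the illustration.

```julia
# Range handed to chunk `id` of `n` when splitting 1:N, as in the loop above.
chunk_div(id, n, N) = (1 + (id - 1) * N ÷ n):(id * N ÷ n)

# Hypothetical variant that rounds the boundaries up instead of down.
chunk_cld(id, n, N) = (1 + cld((id - 1) * N, n)):cld(id * N, n)

N, n = 10, 4
div_ranges = [chunk_div(id, n, N) for id in 1:n]  # 1:2, 3:5, 6:7, 8:10
cld_ranges = [chunk_cld(id, n, N) for id in 1:n]  # 1:3, 4:5, 6:8, 9:10

println(length.(div_ranges))  # [2, 3, 2, 3]
println(length.(cld_ranges))  # [3, 2, 3, 2]
```

Both variants cover 1:N exactly once; they differ only in which chunks receive the leftover iterations.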

This is correct in that none of the twisters should overlap. The TRNG object is also created so that the resulting twisters are not located next to each other in memory, which prevents false sharing:

julia> @btime tf!($vec_of_mats,10_000)
  21.976 ms (3 allocations: 96 bytes)
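
The memory-layout point can be illustrated without KissThreading. The sketch below keeps one counter per slot, either packed side by side or padded so that each counter sits 64 bytes from its neighbours. The 64-byte cache-line size and the function name count_up! are assumptions, and whether a timing difference actually shows up depends on the hardware and on how far the compiler optimizes the inner loop.

```julia
using Base.Threads

# stride = 1 packs the Int64 counters next to each other (neighbours can share
# a cache line); stride = 8 spaces them 64 bytes apart (assumed line size),
# so no two counters can fall on the same line.
function count_up!(counters::Vector{Int}, stride::Int, iters::Int)
    nslots = length(counters) ÷ stride
    @threads for t in 1:nslots
        idx = 1 + (t - 1) * stride   # slot owned by loop iteration t
        for _ in 1:iters
            @inbounds counters[idx] += 1
        end
    end
    return counters
end

nslots = nthreads()
packed = count_up!(zeros(Int, nslots), 1, 1_000_000)
padded = count_up!(zeros(Int, 8 * nslots), 8, 1_000_000)
```

Both layouts produce the same counts; only the spacing between the slots that different threads write to changes.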

Using the original function, but simply substituting TRNG for twisters:

function tfo!(x::Vector{Matrix{Float64}},N::Int64)
    Threads.@threads for ii=1:N
        id = Threads.threadid()
        twister = TRNG[id]
        @inbounds x_thd = x[id]
        for nn=1:100
            for mm=1:100
                @inbounds x_thd[mm,nn] += randn(twister)
            end
        end
    end
    return nothing
end

we also see that it is noticeably faster (23.6 ms vs. 32.7 ms, about 1.4x) than the original version, which was also incorrect and suffered from false sharing:

julia> @btime tfo!($vec_of_mats,10_000);
  23.607 ms (1 allocation: 32 bytes)

It is also about twenty times faster than not using threading (451.4 ms)!
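
The non-threaded tf_nt! used for the 451 ms baseline is not listed in this post. A minimal sketch of what such a serial baseline could look like, assuming it does the same N passes with a single generator (the name tf_serial!, the explicit rng argument, and the way passes cycle over the matrices are all assumptions), on Julia 1.x:

```julia
using Random

# Assumed serial baseline (hypothetical; the original tf_nt! is not shown).
# Runs N passes in total, cycling over the matrices so each one receives
# roughly N / length(x) passes, all drawn from a single generator.
function tf_serial!(x::Vector{Matrix{Float64}}, N::Int, rng::AbstractRNG)
    for ii in 1:N
        x_cur = x[mod1(ii, length(x))]
        @inbounds for j in eachindex(x_cur)
            x_cur[j] += randn(rng)
        end
    end
    return nothing
end

mats = [zeros(100, 100) for _ in 1:4]
tf_serial!(mats, 8, MersenneTwister(1))
```

With a single generator and a single thread there is nothing to overlap or share, which is what makes it a useful reference point for the threaded timings.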
