Context
I’d like to use threads to parallelise a physics simulation code. The amount of work each thread has to do in a single loop iteration might be quite small, so I want to keep all the threads running all the time, as a lightweight Single-Program-Multiple-Data (SPMD) pattern but with shared memory (whether this is a good idea is another question I have, for a separate thread).
As part of this, I wanted a barrier() (analogous to MPI.Barrier()). I tried to use Event to implement one, but ran into problems because I couldn’t tell the difference between set=false meaning the threads haven’t reached the barrier yet and set=false meaning it was reset on leaving the barrier. I didn’t fully understand what the problem was, but it was fixed by making a copy of Event that uses a UInt8 instead of a Bool for its state, so that there are more than two states and each thread can check that it has reached the same barrier as the others.
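For concreteness, the Event-based version had roughly this shape (a reconstructed sketch, not my exact code; reset_event! is a made-up helper that pokes the Event's internal set field, since 1.6 has no public way to reset a Base.Event):
# Reconstructed sketch of the Event-based barrier I mean (not my exact code).
using Base.Threads: threadid, nthreads
const worker_arrived = [Base.Event() for _ in 1:nthreads()-1]
const release = Base.Event()
# Made-up helper: flip the internal `set` flag back to false under the lock.
function reset_event!(ev::Base.Event)
    lock(ev.notify)
    try
        ev.set = false
    finally
        unlock(ev.notify)
    end
    return nothing
end
function naive_barrier()
    if threadid() == 1
        foreach(wait, worker_arrived)          # wait until every worker has arrived
        foreach(reset_event!, worker_arrived)  # now set=false could mean "reset" or "not arrived yet"
        notify(release)                        # let the workers go
        reset_event!(release)                  # ...and a worker that looks late sees set=false again
    else
        notify(worker_arrived[threadid()-1])
        wait(release)                          # may block until the next round if the reset above already ran
    end
end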
It turns out that my current implementation of this barrier() is much slower than using MPI.Barrier(); I’m guessing this is related to the allocations happening within barrier() - see below.
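For reference, the MPI.Barrier() number comes from roughly this kind of script, run under mpiexec with the same number of ranks as I use threads (a sketch, not my exact benchmark):
# Rough MPI.jl counterpart of the barrier loop, run as e.g. `mpiexec -n 2 julia mpi_barrier.jl`.
using MPI
function time_mpi_barrier(n)
    comm = MPI.COMM_WORLD
    MPI.Barrier(comm)          # warm up / compile
    @time for _ in 1:n
        MPI.Barrier(comm)
    end
end
MPI.Init()
time_mpi_barrier(1_000_000)    # same count as main(1000000) below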
The problem
When I profile the following code, I see lots of allocations (when running on multiple threads; on a single thread barrier() returns early, both for speed and to avoid using a zero-element Array).
module ThreadedTest
using Base.Threads: @threads, threadid, nthreads
import Base: wait, notify
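# Like Base.Event, but with a wrapping UInt8 counter as state instead of a Bool,
# so successive uses of the barrier can be told apart.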
mutable struct EventUInt
notify::Base.Threads.Condition
value::UInt8
EventUInt() = new(Base.Threads.Condition(), 0)
end
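# Block until the event's counter reaches `value` (unlocked fast path first).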
function wait(e::EventUInt, value::UInt8)
e.value == value && return
lock(e.notify)
try
while e.value != value
wait(e.notify)
end
finally
unlock(e.notify)
end
nothing
end
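# Increment the counter under the lock, wake all waiters, and return the new value.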
function notify(e::EventUInt)
lock(e.notify)
retval = UInt8(0)
try
e.value += UInt8(1)
retval = e.value
notify(e.notify)
finally
unlock(e.notify)
end
return retval
end
function get_value(e::EventUInt)
retval = UInt8(0)
lock(e.notify)
try
retval = e.value
finally
unlock(e.notify)
end
return retval
end
function reset(e::EventUInt)
lock(e.notify)
try
e.value = 0
finally
unlock(e.notify)
end
return nothing
end
function reset(events::Vector{EventUInt})
for e ∈ events
reset(e)
end
end
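# One event per worker thread, plus one event that thread 1 uses to release the workers.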
const _worker_events = [EventUInt() for _ ∈ 1:nthreads()-1]
const _master_event = EventUInt()
function barrier()
myid = threadid()
if nthreads() == 1
return
elseif myid == 1
value = get_value(_master_event) + UInt8(1)
for e in _worker_events
wait(e, value)
end
notify(_master_event)
else
value = notify(_worker_events[myid-1])
wait(_master_event, value)
end
return nothing
end
function driver(n)
for i = 1:n
barrier()
end
return nothing
end
function main(n)
reset(_worker_events)
reset(_master_event)
@threads for i = 1:nthreads()
driver(n)
end
end
end # ThreadedTest
using .ThreadedTest: main
using Profile
main(1)
Profile.clear_malloc_data()
main(1000000)
The worst offender is the line wait(e.notify) in wait(e::EventUInt, value::UInt8), which allocates ~47MB, but even the previous line, while e.value != value, allocates 6.8MB. Can anyone explain why this is happening? Apart from changing Bool → UInt8 and passing in a value, EventUInt and wait(e::EventUInt, value::UInt8) are copies of Event and wait(e::Event) from Base: https://github.com/JuliaLang/julia/blob/v1.6.2/base/lock.jl.
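A smaller check I could run with the same --track-allocation flags, to see whether a bare Threads.Condition wait/notify pair allocates on its own, independent of EventUInt (only a sketch; pingpong and its names are made up for illustration):
# Producer/consumer ping-pong on a bare Threads.Condition, to isolate the cost of
# wait(::Threads.Condition) from the EventUInt machinery above.
using Base.Threads: @spawn
function pingpong(n)
    cond = Threads.Condition()
    counter = Ref(0)
    consumer = @spawn for i in 1:n
        lock(cond)
        try
            while counter[] < i
                wait(cond)      # analogue of the wait(e.notify) hotspot
            end
        finally
            unlock(cond)
        end
    end
    for i in 1:n
        lock(cond)
        try
            counter[] += 1
            notify(cond)
        finally
            unlock(cond)
        end
    end
    wait(consumer)
end
pingpong(10)                    # compile
# Profile.clear_malloc_data(); pingpong(1_000_000)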
I tested mainly on julia-1.6.2, but noticed there have been some recent improvements to profiling multi-threaded code (https://github.com/JuliaLang/julia/pull/41742), so I repeated the test on the latest master (at 5fb27b2cd) and got the same results.
PS: I did the memory profiling by running the code snippet above with julia -t 2 -O3 --check-bounds=no --track-allocation=user threads_mwe.jl and reading the values saved in the threads_mwe.jl.*.mem files.
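(If it helps anyone reproduce: the .mem files can also be summarised with Coverage.jl rather than read by hand - a sketch, assuming Coverage.jl's analyze_malloc:)
# Run in the directory containing the *.mem files (assumes the Coverage.jl package).
using Coverage
mallocs = analyze_malloc(".")   # per-line allocation totals collected from the .mem files
foreach(println, mallocs)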
Sorry for the long post - thanks for reading!