Multithreaded synchronization using events allocates a lot

Context

I’d like to use threads to parallelise a physics simulation code. The amount of work each thread has to do in a single loop iteration might be quite small, so I want to keep all the threads running all the time, as a lightweight Single-Program-Multiple-Data (SPMD) pattern but with shared memory (whether this is a good idea is another question I have, for a separate thread).

As part of this, I wanted a barrier() (analogous to MPI.Barrier()). I tried to use Event to implement one, but ran into problems because I couldn’t tell the difference between set == false because threads hadn’t reached the barrier yet and set == false because the event had been reset on leaving the barrier. I didn’t fully understand what the problem was, but it was fixed by making a copy of Event that uses a UInt8 counter instead of a Bool for its state, so that there are more than two states and each thread can check that all threads have reached the same barrier.
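To make the ambiguity concrete, here is a sketch (simplified, not my actual code) of the sequence that goes wrong when there are only two states:

thread 1: notify(event)   # set = true: barrier k released
thread 1: reset(event)    # set = false: ready for barrier k+1
thread 2: wait(event)     # sees set == false and blocks, even though
                          # barrier k is already over - thread 2 is now
                          # waiting for barrier k+1's notify

With a counter that increments on every notify, each barrier generation has its own value, so wait(e, value) knows exactly which release it is waiting for.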

It turns out that my current implementation of this barrier() is much slower than using MPI.Barrier(); I’m guessing this is related to the allocations happening within barrier() - see below.

The problem

When I profile the following code, there are lots of allocations when running on multiple threads (on a single thread the code short-circuits, both for speed and to avoid using a zero-element Array):

module ThreadedTest

using Base.Threads: @threads, threadid, nthreads
import Base: wait, notify

# A copy of Base's Event, but with a UInt8 counter instead of a Bool flag,
# so that successive barrier generations can be distinguished.
mutable struct EventUInt
    notify::Base.Threads.Condition
    value::UInt8
    EventUInt() = new(Base.Threads.Condition(), 0)
end

function wait(e::EventUInt, value::UInt8)
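    # Fast path (unsynchronized read, as in Base's Event): skip the lock if
    # the counter already has the expected value.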
    e.value == value && return
    lock(e.notify)
    try
        while e.value != value
            wait(e.notify)
        end
    finally
        unlock(e.notify)
    end
    nothing
end

function notify(e::EventUInt)
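    # Increment the counter and wake all waiting tasks; return the new value
    # so the caller knows which barrier generation it just signalled.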
    lock(e.notify)
    retval = UInt8(0)
    try
        e.value += UInt8(1)
        retval = e.value
        notify(e.notify)
    finally
        unlock(e.notify)
    end

    return retval
end

function get_value(e::EventUInt)
    retval = UInt8(0)
    lock(e.notify)
    try
        retval = e.value
    finally
        unlock(e.notify)
    end
    return retval
end

function reset(e::EventUInt)
    lock(e.notify)
    try
        e.value = 0
    finally
        unlock(e.notify)
    end
    return nothing
end

function reset(events::Vector{EventUInt})
    for e ∈ events
        reset(e)
    end
end

const _worker_events = [EventUInt() for _ ∈ 1:nthreads()-1]
const _master_event = EventUInt()
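# Barrier protocol: each worker thread increments its own event and then
# waits for the master's counter to reach the same generation; the master
# (thread 1) waits until every worker's counter has reached the current
# generation, then increments _master_event to release the workers. The
# UInt8 counters wrap around, which is harmless here because only equality
# is tested and no thread can be more than one generation behind.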
function barrier()
    myid = threadid()
    if nthreads() == 1
        return
    elseif myid == 1
        value = get_value(_master_event) + UInt8(1)
        for e in _worker_events
            wait(e, value)
        end
        notify(_master_event)
    else
        value = notify(_worker_events[myid-1])
        wait(_master_event, value)
    end
    return nothing
end

function driver(n)
    for i = 1:n
        barrier()
    end

    return nothing
end

function main(n)
    reset(_worker_events)
    reset(_master_event)
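    # One iteration per thread: @threads over 1:nthreads() gives each
    # thread exactly one iteration, so every thread runs driver(n) once.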
    @threads for i = 1:nthreads()
        driver(n)
    end
end

end # ThreadedTest

using .ThreadedTest: main
using Profile

main(1)
Profile.clear_malloc_data()
main(1000000)

The worst offender is the line wait(e.notify) in wait(e::EventUInt, value::UInt8), which allocates ~47MB, but even the previous line while e.value != value allocates 6.8MB. Can anyone explain why this is happening? Apart from changing the Bool to a UInt8 and passing in a value, EventUInt and wait(e::EventUInt, value::UInt8) are copies of Event and wait(e::Event) from Base: julia/lock.jl at v1.6.2 · JuliaLang/julia · GitHub.
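For reference, the Base definitions I started from look like this (paraphrased - see the linked lock.jl for the exact v1.6.2 source):

mutable struct Event
    notify::Threads.Condition
    set::Bool
    Event() = new(Threads.Condition(), false)
end

function wait(e::Event)
    e.set && return
    lock(e.notify)
    try
        while !e.set
            wait(e.notify)
        end
    finally
        unlock(e.notify)
    end
    nothing
end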

I tested mainly on julia-1.6.2, but noticed there were some recent improvements to profiling multi-threaded code (Profile: Thread and task-specific profiling by IanButterworth · Pull Request #41742 · JuliaLang/julia · GitHub), so I repeated the test with the latest master (at 5fb27b2cd) and got the same results.

PS I did the memory profiling by running the code snippet above with julia -t 2 -O3 --check-bounds=no --track-allocation=user threads_mwe.jl, and reading the per-line allocation counts (in bytes) saved in the threads_mwe.jl.*.mem files.

Sorry for the long post - thanks for reading!