Context
I’d like to use threads to parallelise a physics simulation code. The amount of work each thread has to do in a single loop iteration might be quite small, so I want to keep all the threads running all the time, as a lightweight Single-Program-Multiple-Data (SPMD) pattern but with shared memory (whether this is a good idea is another question I have, for a separate thread).
As part of this, I wanted a barrier() (analogous to MPI.Barrier()). I tried to use Event to implement one, but ran into problems because I couldn’t tell the difference between set=false meaning the threads haven’t reached the barrier yet and set=false meaning it was reset on leaving the barrier. I didn’t fully understand what the problem was, but it was fixed by making a copy of Event that uses a UInt8 instead of a Bool for its state, so that there are more than two states and each thread can check that it has reached the same barrier as the others.
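For concreteness, the Event-based version had roughly this shape (a reconstructed sketch, not my exact code; reset_event! is a made-up helper that pokes the Event's internal set field, since 1.6 has no public way to reset a Base.Event):
# Reconstructed sketch of the Event-based barrier I mean (not my exact code).
using Base.Threads: threadid, nthreads
const worker_arrived = [Base.Event() for _ in 1:nthreads()-1]
const release = Base.Event()
# Made-up helper: flip the internal `set` flag back to false under the lock.
function reset_event!(ev::Base.Event)
    lock(ev.notify)
    try
        ev.set = false
    finally
        unlock(ev.notify)
    end
    return nothing
end
function naive_barrier()
    if threadid() == 1
        foreach(wait, worker_arrived)          # wait until every worker has arrived
        foreach(reset_event!, worker_arrived)  # now set=false could mean "reset" or "not arrived yet"
        notify(release)                        # let the workers go
        reset_event!(release)                  # ...and a worker that looks late sees set=false again
    else
        notify(worker_arrived[threadid()-1])
        wait(release)                          # may block until the next round if the reset above already ran
    end
end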
It turns out that my current implementation of this barrier() is much slower than using MPI.Barrier(); I’m guessing this is related to the allocations happening within barrier() - see below.
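For reference, the MPI.Barrier() number comes from roughly this kind of script, run under mpiexec with the same number of ranks as I use threads (a sketch, not my exact benchmark):
# Rough MPI.jl counterpart of the barrier loop, run as e.g. `mpiexec -n 2 julia mpi_barrier.jl`.
using MPI
function time_mpi_barrier(n)
    comm = MPI.COMM_WORLD
    MPI.Barrier(comm)          # warm up / compile
    @time for _ in 1:n
        MPI.Barrier(comm)
    end
end
MPI.Init()
time_mpi_barrier(1_000_000)    # same count as main(1000000) below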
The problem
When I profile the following code, I see lots of allocations (when running on multiple threads; on a single thread barrier() returns early, both for speed and to avoid using a zero-element Array).
module ThreadedTest
using Base.Threads: @threads, threadid, nthreads
import Base: wait, notify
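# Like Base.Event, but with a wrapping UInt8 counter as state instead of a Bool,
# so successive uses of the barrier can be told apart.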
mutable struct EventUInt
notify::Base.Threads.Condition
value::UInt8
EventUInt() = new(Base.Threads.Condition(), 0)
end
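# Block until the event's counter reaches `value` (unlocked fast path first).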
function wait(e::EventUInt, value::UInt8)
e.value == value && return
lock(e.notify)
try
while e.value != value
wait(e.notify)
end
finally
unlock(e.notify)
end
nothing
end
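# Increment the counter under the lock, wake all waiters, and return the new value.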
function notify(e::EventUInt)
lock(e.notify)
retval = UInt8(0)
try
e.value += UInt8(1)
retval = e.value
notify(e.notify)
finally
unlock(e.notify)
end
return retval
end
function get_value(e::EventUInt)
retval = UInt8(0)
lock(e.notify)
try
retval = e.value
finally
unlock(e.notify)
end
return retval
end
function reset(e::EventUInt)
lock(e.notify)
try
e.value = 0
finally
unlock(e.notify)
end
return nothing
end
function reset(events::Vector{EventUInt})
for e ∈ events
reset(e)
end
end
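# One event per worker thread, plus one event that thread 1 uses to release the workers.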
const _worker_events = [EventUInt() for _ ∈ 1:nthreads()-1]
const _master_event = EventUInt()
function barrier()
myid = threadid()
if nthreads() == 1
return
elseif myid == 1
value = get_value(_master_event) + UInt8(1)
for e in _worker_events
wait(e, value)
end
notify(_master_event)
else
value = notify(_worker_events[myid-1])
wait(_master_event, value)
end
return nothing
end
function driver(n)
for i = 1:n
barrier()
end
return nothing
end
function main(n)
reset(_worker_events)
reset(_master_event)
@threads for i = 1:nthreads()
driver(n)
end
end
end # ThreadedTest
using .ThreadedTest: main
using Profile
main(1)
Profile.clear_malloc_data()
main(1000000)
The worst offender is the line wait(e.notify) in wait(e::EventUInt, value::UInt8), which allocates ~47MB, but even the previous line, while e.value != value, allocates 6.8MB. Can anyone explain why this is happening? Apart from changing Bool → UInt8 and passing in a value, EventUInt and wait(e::EventUInt, value::UInt8) are copies of Event and wait(e::Event) from Base: https://github.com/JuliaLang/julia/blob/v1.6.2/base/lock.jl.
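A smaller check I could run with the same --track-allocation flags, to see whether a bare Threads.Condition wait/notify pair allocates on its own, independent of EventUInt (only a sketch; pingpong and its names are made up for illustration):
# Producer/consumer ping-pong on a bare Threads.Condition, to isolate the cost of
# wait(::Threads.Condition) from the EventUInt machinery above.
using Base.Threads: @spawn
function pingpong(n)
    cond = Threads.Condition()
    counter = Ref(0)
    consumer = @spawn for i in 1:n
        lock(cond)
        try
            while counter[] < i
                wait(cond)      # analogue of the wait(e.notify) hotspot
            end
        finally
            unlock(cond)
        end
    end
    for i in 1:n
        lock(cond)
        try
            counter[] += 1
            notify(cond)
        finally
            unlock(cond)
        end
    end
    wait(consumer)
end
pingpong(10)                    # compile
# Profile.clear_malloc_data(); pingpong(1_000_000)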
I tested mainly on julia-1.6.2, but noticed there have been some recent improvements to profiling multi-threaded code (https://github.com/JuliaLang/julia/pull/41742), so I repeated the test on the latest master (at 5fb27b2cd) and got the same results.
PS: I did the memory profiling by running the code snippet above with julia -t 2 -O3 --check-bounds=no --track-allocation=user threads_mwe.jl and reading the values saved in the threads_mwe.jl.*.mem files.
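(If it helps anyone reproduce: the .mem files can also be summarised with Coverage.jl rather than read by hand - a sketch, assuming Coverage.jl's analyze_malloc:)
# Run in the directory containing the *.mem files (assumes the Coverage.jl package).
using Coverage
mallocs = analyze_malloc(".")   # per-line allocation totals collected from the .mem files
foreach(println, mallocs)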
Sorry for the long post - thanks for reading!