Context
I’d like to use threads to parallelise a physics simulation code. The amount of work each thread has to do in a single loop iteration might be quite small, so I want to keep all the threads running all the time as a lightweight Single-Program-Multiple-Data setup, but with shared memory (whether this is a good idea is another question I have, for a separate thread).
As part of this, I wanted a barrier() (analogous to MPI.Barrier()). I tried to use Event to implement one, but ran into problems because I can’t tell the difference between set=false because threads haven’t reached the barrier yet and set=false because it was reset on leaving the barrier. I didn’t fully understand what the problem was, but it was fixed by making a copy of Event that uses UInt8 instead of Bool for its state, so that there are more than two states and I can check that all threads have reached the same barrier.
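The kind of Event-based barrier I first tried looked roughly like the sketch below (illustrative names only, not the code further down); the troublesome part is resetting the release event, because a slow worker can’t tell whether set=false means “the barrier hasn’t been reached yet” or “the barrier was already passed and reset”:

using Base.Threads: threadid, nthreads

# Hypothetical first attempt: one Base.Event per worker to signal arrival,
# plus one "release" event that thread 1 sets and then has to clear again.
const _arrive = [Base.Event() for _ ∈ 1:nthreads()-1]
const _release = Base.Event()

# Base.Event on 1.6 has no reset(), so clear the internal flag directly
# (this assumes the field is called `set`, as in base/lock.jl).
function reset_event!(e::Base.Event)
    lock(e.notify)
    try
        e.set = false
    finally
        unlock(e.notify)
    end
end

function naive_barrier()
    nthreads() == 1 && return
    if threadid() == 1
        foreach(wait, _arrive)          # wait for every worker to arrive
        foreach(reset_event!, _arrive)
        notify(_release)                # release the workers
        reset_event!(_release)          # <- ambiguous for a worker that is slow to wait()
    else
        notify(_arrive[threadid()-1])
        # If _release has already been reset, set=false here looks identical to
        # "not released yet", so this wait() can hang.
        wait(_release)
    end
end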
It turns out that my current implementation of this barrier() is much slower than using MPI.Barrier(); I’m guessing this is related to the allocations happening within barrier() - see below.
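For reference, a comparable MPI.Barrier() timing loop looks something like this (just a sketch, assuming the MPI.jl package, run with e.g. mpiexec -n 2 julia mpi_barrier_timing.jl):

using MPI

MPI.Init()
comm = MPI.COMM_WORLD

n = 1_000_000
MPI.Barrier(comm)                 # synchronise before starting the timer
t = @elapsed for _ ∈ 1:n
    MPI.Barrier(comm)
end
MPI.Comm_rank(comm) == 0 && println("time per MPI.Barrier: ", t / n, " s")

MPI.Finalize()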
The problem
When I profile the following code, there are lots of allocations when running on multiple threads (on a single thread the code short-cuts, both for speed and to avoid using a zero-element Array):
module ThreadedTest

using Base.Threads: @threads, threadid, nthreads

import Base: wait, notify

# Like Base.Event, but with a UInt8 counter instead of a Bool flag, so that
# successive barriers can be told apart.
mutable struct EventUInt
    notify::Base.Threads.Condition
    value::UInt8
    EventUInt() = new(Base.Threads.Condition(), 0)
end

# Block until the event's counter reaches `value`.
function wait(e::EventUInt, value::UInt8)
    e.value == value && return
    lock(e.notify)
    try
        while e.value != value
            wait(e.notify)
        end
    finally
        unlock(e.notify)
    end
    nothing
end

# Increment the counter, wake any waiters, and return the new counter value.
function notify(e::EventUInt)
    lock(e.notify)
    retval = UInt8(0)
    try
        e.value += UInt8(1)
        retval = e.value
        notify(e.notify)
    finally
        unlock(e.notify)
    end
    return retval
end

function get_value(e::EventUInt)
    retval = UInt8(0)
    lock(e.notify)
    try
        retval = e.value
    finally
        unlock(e.notify)
    end
    return retval
end

function reset(e::EventUInt)
    lock(e.notify)
    try
        e.value = 0
    finally
        unlock(e.notify)
    end
    return nothing
end

function reset(events::Vector{EventUInt})
    for e ∈ events
        reset(e)
    end
end

# One event per worker thread (threads 2:nthreads()), plus one for the master.
const _worker_events = [EventUInt() for _ ∈ 1:nthreads()-1]
const _master_event = EventUInt()

function barrier()
    myid = threadid()
    if nthreads() == 1
        return
    elseif myid == 1
        # Master: wait until every worker has advanced to this barrier's count,
        # then release them all.
        value = get_value(_master_event) + UInt8(1)
        for e in _worker_events
            wait(e, value)
        end
        notify(_master_event)
    else
        # Worker: announce arrival, then wait for the master's release.
        value = notify(_worker_events[myid-1])
        wait(_master_event, value)
    end
    return nothing
end

function driver(n)
    for i = 1:n
        barrier()
    end
    return nothing
end

function main(n)
    reset(_worker_events)
    reset(_master_event)
    @threads for i = 1:nthreads()
        driver(n)
    end
end

end # ThreadedTest
using .ThreadedTest: main
using Profile

main(1)                      # run once to compile before collecting allocation data
Profile.clear_malloc_data()
main(1000000)
The worst offender is the line wait(e.notify) in wait(e::EventUInt, value::UInt8), which allocates ~47MB, but even the previous line, while e.value != value, allocates 6.8MB. Can anyone explain why this is happening? Apart from changing Bool → UInt8 and passing in a value, EventUInt and wait(e::EventUInt, value::UInt8) are copies of Event and wait(e::Event) from Base: https://github.com/JuliaLang/julia/blob/v1.6.2/base/lock.jl.
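As a quick cross-check, independent of --track-allocation, something like this sketch should show the same behaviour: if barrier() itself allocates, the allocation count reported by @time grows roughly linearly with n (run with julia -t 2):

using .ThreadedTest: main

main(1)                   # compile first so @time doesn’t count compilation allocations
@time main(1_000_000)
@time main(2_000_000)     # allocations should roughly double if barrier() allocates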
I tested mainly on julia-1.6.2, but noticed there were some recent improvements to profiling multi-threaded code (https://github.com/JuliaLang/julia/pull/41742), so I repeated with the latest master (at 5fb27b2cd) and got the same results.
PS I did the memory profiling by running the code snippet above with julia -t 2 -O3 --check-bounds=no --track-allocation=user threads_mwe.jl, and reading the values saved in threads_mwe.jl.*.mem.
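To pull the biggest numbers out of the .mem files I use something along these lines (a sketch; print_worst_allocations is just an illustrative helper, and each line of a .mem file starts with the byte count, or “-” for lines that never ran):

function print_worst_allocations(; dir = ".", nshow = 10)
    results = Tuple{Int,String,Int}[]
    for fname in filter(f -> endswith(f, ".mem"), readdir(dir; join = true))
        for (lineno, line) in enumerate(eachline(fname))
            s = split(strip(line); limit = 2)
            isempty(s) && continue
            bytes = tryparse(Int, s[1])
            bytes === nothing && continue   # skip "-" (never-executed) lines
            push!(results, (bytes, fname, lineno))
        end
    end
    sort!(results; by = first, rev = true)
    for (bytes, fname, lineno) in first(results, nshow)
        println(bytes, " bytes\t", fname, ":", lineno)
    end
end

print_worst_allocations()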
Sorry for the long post - thanks for reading!