Fundamentally the issue is that threadid is only marginally better than asking “what cpu core am I currently running on”.
All the attempts to write a single loop, use @threads and then negotiate shared resources are deeply silly if you think about what that macro roughly does:
model = fit(Model, data)
state = [deepcopy(model) for i in 1:Threads.nthreads()]
Threads.@threads for i in 1:1000
    m = state[Threads.threadid()]
    predict!(m, newdata)
end
is equivalent to
model = fit(Model, data)
state = [deepcopy(model) for i in 1:Threads.nthreads()]
@sync for chunk in ChunkSplitters.index_chunks(1:1000; n = Threads.nthreads())
    Threads.@spawn for item in chunk
        # threadid() is looked up on every iteration, and the task may migrate in between
        m = state[Threads.threadid()]
        predict!(m, newdata)
    end
end
Look at that code. It is silly stupid: in the for item in chunk loop, the relevant threadid / m is constant. Or rather, according to your mental model it’s supposed to be constant, and its non-constancy is what causes the bugs.
You already know what you need to do: the same kind of transformation that is the compiler’s bread and butter, i.e. loop-invariant code motion:
model = fit(Model, data)
state = [deepcopy(model) for i in 1:Threads.nthreads()]
@sync for (chunk_idx, chunk) in enumerate(ChunkSplitters.index_chunks(1:1000; n = Threads.nthreads()))
    Threads.@spawn begin
        # state is picked once per task, by chunk index rather than by threadid()
        m = state[chunk_idx]
        for item in chunk
            predict!(m, newdata)
        end
    end
end
Acquiring the scarce resource moves out of the inner loop.
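For anyone who wants to run the pattern without fit / predict! or ChunkSplitters, here is a minimal Base-only sketch. Everything in it is a stand-in I made up for illustration: Scratch plays the role of the per-task model copy, chunked_sum is a hypothetical function name, and Iterators.partition is swapped in for ChunkSplitters.index_chunks (it produces fixed-length blocks, so the chunk sizes may differ slightly from what index_chunks would give).

```julia
using Base.Threads: @spawn, nthreads

# Hypothetical stand-in for the expensive per-task resource (e.g. a deepcopy of a model).
mutable struct Scratch
    total::Int
end

function chunked_sum(items; ntasks = nthreads())
    # Base-only chunking: contiguous blocks of (at most) chunk_len elements.
    chunk_len = max(1, cld(length(items), ntasks))
    chunks = collect(Iterators.partition(items, chunk_len))
    # One state object per chunk, indexed by chunk, never by threadid().
    state = [Scratch(0) for _ in chunks]
    @sync for (chunk_idx, chunk) in enumerate(chunks)
        @spawn begin
            s = state[chunk_idx]   # acquired once per task, outside the inner loop
            for item in chunk
                s.total += item
            end
        end
    end
    # Combine the per-chunk results after all tasks have finished.
    return sum(s.total for s in state; init = 0)
end

chunked_sum(1:1000)  # 500500
```

Each task only ever touches its own Scratch, so it does not matter which thread a task runs on or whether it migrates halfway through.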
None of you would ever write the silly stupid code that Threads.@threads expands to.
It’s just that Threads.@threads does two nice things: it deals with chunking the input, and it deals with the @sync/@spawn. Because it conflates these things, you don’t see the obvious place to put your state (inside the outer loop, outside the inner loop).
I think that’s partly the fault of Base for not exporting a blindingly obvious equivalent of ChunkSplitters.index_chunks.
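To give a sense of how little is missing, here is a rough sketch of such a splitter. simple_index_chunks is a hypothetical name of mine, not something Base or ChunkSplitters provides, and it only handles unit ranges; the real index_chunks is more general.

```julia
# Hypothetical helper, not part of Base or ChunkSplitters:
# split a unit range into at most n contiguous chunks whose lengths differ by at most 1.
function simple_index_chunks(idx::AbstractUnitRange{Int}, n::Integer)
    n >= 1 || throw(ArgumentError("n must be at least 1"))
    n = min(n, length(idx))                 # never produce empty chunks
    q, r = divrem(length(idx), max(n, 1))   # max(...) guards the empty-range case
    chunks = UnitRange{Int}[]
    start = first(idx)
    for k in 1:n
        len = q + (k <= r ? 1 : 0)          # the first r chunks get one extra element
        push!(chunks, start:(start + len - 1))
        start += len
    end
    return chunks
end

simple_index_chunks(1:10, 3)  # [1:4, 5:7, 8:10]
```

With something like this, the @sync / @spawn loop over enumerate(chunks) needs nothing outside Base.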
Long term, Threads.@threads should be deprecated. It teaches the wrong abstractions and is toxic to how people think. (It can’t be deprecated now, since Base can’t be arsed to export a proper way to split the input into chunks.)