I did some profiling with ProfileView and found that in one of the slow runs only one thread was executing code that I wrote. The other three were executing code in the ThreadingUtilities package.
In this screenshot, the left chunk is my code (or something that I recognize as my code), whereas everything to the right comes from ThreadingUtilities:
Threads 2-4 only contain the part to the right (everything to the right of the leftmost red bar). Maybe this is by design; I'm not sure.
Looks like it’s by design, since that code runs LoopVectorization.TURBO from LoopVectorization/tSQDi/src/codegen/lower_threads.jl:10 (the “tower” to the right). The two rightmost blocks are the operators + and < from int.jl (Base Julia, I assume). Does that mean that about 50% of the time is spent in these operators?