I ran into something similar in my recent attempts to parallelize Agents.jl models: [solved] Agents.jl spending ¾ of its time doing something before it even starts to compute?
Yes, the initial question was about profiling, but in the end what I was seeing was a failure to spread the load across the available threads.
One thing I discovered was that @spawn tasks do not migrate across threads yet, so if you get unlucky and two tasks land on the same thread, they run one after the other instead of in parallel.
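A quick way to check whether tasks are piling up on one thread is to have each task report the thread it runs on. This is only a rough sketch: `task_thread_ids` is a made-up helper name, not something from Agents.jl, and the thread id is just a snapshot of where the task was when it asked, so treat the counts as a hint rather than proof.

```julia
# Hypothetical helper (not part of Agents.jl): spawn `n` tasks and
# collect the thread id that each task reports.
function task_thread_ids(n::Integer)
    tasks = map(1:n) do _
        Threads.@spawn Threads.threadid()
    end
    return fetch.(tasks)
end

ids = task_thread_ids(8)

# Count how many tasks reported each thread id.
counts = Dict{Int,Int}()
for id in ids
    counts[id] = get(counts, id, 0) + 1
end

println("threads available: ", Threads.nthreads())
for id in sort(collect(keys(counts)))
    println("thread ", id, ": ", counts[id], " task(s)")
end
```

Start Julia with more than one thread (for example `julia -t 4`) before running it; with a single thread every task will necessarily report the same id. These tasks finish almost instantly, so the spread can look different from what a long-running model sees.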
Actually, here is a more closely related post: Threading usage patterns: Worker pools for Agents.jl and SQLite.jl