This is a well-known misfeature of Windows that takes a few lines of code at thread creation time to work around. Thread libraries like Intel TBB do this for you; otherwise you have to add that code yourself.
You can see an article about it here, including how to fix it in C code:
In my test case, there are 250 Windows threads created by Julia, as I asked, perhaps via pthreads, which I assume lacks this code as well.
When they run in a tight loop that is not memory bound, you will see 256 non-idle threads, but only 1/4 CPU usage in Task Manager. Using any good Windows diagnostic tool, you can see that Julia has created 250 threads, but they are all assigned to the same processor group instead of being spread over the 4 groups present on this machine. Thus, the 256 OS threads use only 64 of the 256 available hardware threads.
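The 1/4 figure falls straight out of the group arithmetic. Here is a rough sketch of it in Python; the function name is made up for illustration, and the 64-processors-per-group size is an assumption that matches the 4 groups on this 256-thread machine.

```python
def visible_cpu_fraction(hw_threads, group_size=64, groups_used=1):
    """Fraction of the machine's hardware threads a process can run on
    when all of its threads are confined to `groups_used` processor groups.
    group_size=64 is an assumption matching the machine described above."""
    reachable = min(hw_threads, group_size * groups_used)
    return reachable / hw_threads

# All threads stuck in one group: only 64 of 256 hardware threads are reachable.
print(visible_cpu_fraction(256))                 # 0.25
# Threads spread over all 4 groups: the whole machine is reachable.
print(visible_cpu_fraction(256, groups_used=4))  # 1.0
```

So no matter how many OS threads the process starts, Task Manager cannot show more than about 25% while they all sit in one group.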
If you run a third-party program called "Process Lasso", it can be told to "spread the threads out over all the processor groups" for a running process. Applying this to Julia lets it take advantage of all the cores. You will then see Task Manager show 100% CPU usage on all cores, and you get the corresponding speedups.
I managed to get a 150x speedup vs single-threaded code on this 128-core/256-thread system by doing this.
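For anyone wondering what "spreading the threads out" amounts to, here is a minimal sketch of just the assignment logic in Python. It is not what Process Lasso or Julia actually does internally; `spread_over_groups` is a hypothetical helper, and the 256 threads and four groups of 64 are assumptions taken from the machine above. On Windows, each resulting (group, mask) pair is the kind of value you would put in a GROUP_AFFINITY struct and hand to SetThreadGroupAffinity for that thread, with the group sizes coming from GetActiveProcessorGroupCount and GetActiveProcessorCount.

```python
from collections import Counter

def spread_over_groups(n_threads, group_sizes):
    """Assign thread i to a (group, affinity_mask) pair, round-robin.
    group_sizes[g] is the number of logical processors in group g."""
    plan = []
    for i in range(n_threads):
        group = i % len(group_sizes)
        # Allow the thread to run on every logical processor in its group.
        mask = (1 << group_sizes[group]) - 1
        plan.append((group, mask))
    return plan

# Assumed layout: 256 OS threads on 4 groups of 64 logical processors.
plan = spread_over_groups(256, [64, 64, 64, 64])
per_group = Counter(group for group, _ in plan)
print(sorted(per_group.items()))  # [(0, 64), (1, 64), (2, 64), (3, 64)]
```

Round-robin is just the simplest choice here; anything that lands roughly one thread per hardware thread in each group would do.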
Memory bandwidth doesn't figure into this. Even if the threads were horribly memory bound, they would consume 100% of all of the CPUs, though they might be spending more time servicing cache misses than doing math, and so not see a scalable speedup.
It's an easy problem to repro, and an easy one to fix if you have a system to test on.