This issue is almost certainly caused by Windows processor groups.
On dual-socket systems, or any system with more than 64 logical processors, Windows by default places all the threads a process creates in the same processor group, so they only ever run on a subset of the hardware threads.
This problem is easy to work around with a little code at thread creation in C++, and I have written this code for my own C++ programs. If I can figure out where to put it in Julia for Windows, I'll take a stab at it.
I have a 128-core/256-thread AMD system. Julia threads on it will only use 1/4 of the cores :-(. That matches a single group of 64 out of 256 logical processors.
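One way to check this on a given machine is to ask Windows how it has divided the logical processors into groups. The following is a minimal sketch, not a definitive tool. `visible_fraction` and `print_group_layout` are names made up for illustration. The only Windows calls used are `GetActiveProcessorGroupCount` and `GetActiveProcessorCount`. The helper assumes a process confined to one group sees at most `group_size` processors, which ignores unevenly sized groups.

```cpp
#include <cassert>
#include <cstdio>
#ifdef _WIN32
#ifndef _WIN32_WINNT
#define _WIN32_WINNT 0x0601  // processor group APIs need Windows 7 or later
#endif
#include <windows.h>
#endif

// Illustrative helper (not a Windows API): the share of the machine a process
// can use if it is confined to a single group of `group_size` processors.
double visible_fraction(unsigned total_logical, unsigned group_size) {
    assert(total_logical > 0 && group_size > 0);
    unsigned visible = (group_size < total_logical) ? group_size : total_logical;
    return static_cast<double>(visible) / total_logical;
}

// Print the number of active logical processors in each processor group.
void print_group_layout() {
#ifdef _WIN32
    WORD groups = GetActiveProcessorGroupCount();
    for (WORD g = 0; g < groups; ++g) {
        std::printf("group %u: %lu logical processors\n",
                    static_cast<unsigned>(g),
                    static_cast<unsigned long>(GetActiveProcessorCount(g)));
    }
    std::printf("total: %lu logical processors\n",
                static_cast<unsigned long>(
                    GetActiveProcessorCount(ALL_PROCESSOR_GROUPS)));
#else
    std::puts("processor groups are a Windows concept; nothing to report");
#endif
}
```

Calling `print_group_layout()` from `main` on a 256-thread machine would presumably list several groups, and `visible_fraction(256, 64)` gives the 0.25 share that a single 64-processor group covers.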
If anyone could help me with where I could put this code, I'd appreciate it. It's a simple loop that iterates over the thread handles and makes a couple of Windows calls.
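For reference, a loop of that kind could look roughly like the sketch below. It is an illustration under stated assumptions, not the exact code from my programs. `group_for_thread`, `move_thread_to_group` and `spread_across_groups` are hypothetical names. It assumes MSVC's `std::thread`, where `native_handle()` is a Win32 `HANDLE`. It also assumes the active processors of a group occupy the low bits of the affinity mask. Round-robin is just the simplest placement policy, and the real Windows call doing the work is `SetThreadGroupAffinity`.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>
#ifdef _WIN32
#ifndef _WIN32_WINNT
#define _WIN32_WINNT 0x0601  // processor group APIs need Windows 7 or later
#endif
#include <windows.h>
#endif

// Hypothetical helper: choose a processor group for the i-th worker thread
// by simple round-robin over the available groups.
unsigned group_for_thread(std::size_t thread_index, unsigned group_count) {
    assert(group_count > 0);
    return static_cast<unsigned>(thread_index % group_count);
}

#ifdef _WIN32
// Allow `thread` to run on every active processor of `group`.
// Assumption: active processors occupy the low bits of the mask.
bool move_thread_to_group(HANDLE thread, WORD group) {
    DWORD count = GetActiveProcessorCount(group);
    if (count == 0) return false;
    GROUP_AFFINITY affinity = {};
    affinity.Group = group;
    affinity.Mask = (count >= sizeof(KAFFINITY) * 8)
                        ? ~static_cast<KAFFINITY>(0)
                        : ((static_cast<KAFFINITY>(1) << count) - 1);
    return SetThreadGroupAffinity(thread, &affinity, nullptr) != 0;
}
#endif

// Spread already-created worker threads across all processor groups.
// Does nothing on non-Windows platforms.
void spread_across_groups(std::vector<std::thread>& workers) {
#ifdef _WIN32
    WORD groups = GetActiveProcessorGroupCount();
    for (std::size_t i = 0; i < workers.size(); ++i) {
        // Assumption: native_handle() is a Win32 HANDLE (true for MSVC).
        HANDLE h = static_cast<HANDLE>(workers[i].native_handle());
        move_thread_to_group(
            h, static_cast<WORD>(group_for_thread(i, groups)));
    }
#else
    (void)workers;
#endif
}
```

With four groups, worker 0 goes to group 0, worker 5 to group 1, and so on. Whether the same calls can be reached from Julia through `ccall` without touching the runtime is the open question here.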
Perhaps it can be done without modifying Julia or its libraries at all, if I can iterate over the Windows thread handles and make some calls into the Windows DLLs.