Yes, in this case. It could be the number of cores…
Well the XEON does have 8 cores, so … you think it would be good up to 7
Really interesting data points that you have provided. I really like to see these kind of metrics in the Julia forum !
edit: math is hard. 2x xeon = 16. so 1 for windows, and 1 for the virus scanner in windows leaves 14
Actually, each of the sockets has 8 cores. 16 in total then. I don’t have an explanation for the poor performance.
Thanks for investigating this issue further Prof. Krysl. Sorry, I could not find time to come back to the issue linked by Kristoffer yet (but I still follow the thread). From a performance perspective the assembly on a single thread is more or less compute bound. Parallelizing assembly with a low number of threads should not be a big issue (with sufficient memory bandwidth and cache). However, with more threads you increase the pressure on all memory lanes. Here I still think it is a mixture of cache/bandwidth issues (i.e. bad or even conflicting cache access patterns+memory bus cannot keep up with the CPUs read/write access) and frequency boosting (i.e. at lower total load each core has higher frequency). Fore some discussion I highly recommend the WorkStream paper (dx.doi.org/10.1145/2851488), because we basically reproduce Figure 4 from this paper. However, take this with a grain of salt, as I still have to confirm everything in more detailed benchmarks.
Another relevant thread is How to achieve perfect scaling with Threads (Julia 1.7.1) - #10 by carstenbauer which discusses some of the mentioned problems in more detail.
Also, I think I solved the problem with the tasks: I think I need to start (N+1) threads in order to have N tasks. Apparently, one thread needs to be the main to carry the spine of the computation. If I do this, the tasks finish in more or less the same time (there is still a bit of variation, on the order of 10 %, but nothing like the huge differences between fast and slow tasks before).
This is the speed up of Firenze/Windows 10/with tasks.
Edit: The tasks take about 0.03 sec to start and another 0.03 seconds to spawn. So that can add up.
Alas, I spoke to soon. No, the problem of the slow tasks still not solved. On the Horntail, even starting N+1 (or even N+8) threads for N tasks, some of the tasks are still delayed by a significant factor (100-200%) relative to others.
How do I debug this?