@carstenbauer Cool package! I don’t want to derail this further, so let me open a new topic.
When I first looked at it, the performance advantage of scattered pinning was puzzling to me. Intuitively, you would want to avoid touching remote memory. Then it occurred to me that not all computations require frequent cross-thread communication. So, I’m guessing scattered pinning is advantageous since it effectively “expands” the memory bus and the L3 cache? Is this intuition compatible with the benchmark?
Related to this, did you look at the effect of GC with scattered pinning? I wonder if the GC can reuse remote memory. If so, I’d expect the performance with scattered pinning to degrade over time (unless extra care is taken with the libnuma API or something similar?).
I’d say it is a valid point but not the crux of the plots (if I understand your argument correctly). What was surprising to me is that the compact pinning curve shows (modulo measurement uncertainties) linear scaling when putting threads on the second socket (# cores > 10). Intuitively, I would have expected two saturation curves, one for the memory bandwidth of each socket.
The reason why this linear scaling appears is the implicit barrier at the end of a parallel block (in Julia at the end of @threads for ..., and likewise for #pragma omp parallel for). Once you’ve saturated the first socket, each thread placed on the second (almost empty) socket could be faster than those on the filled first socket, but it isn’t, because all threads are synchronized at the barrier. That is, the threads on the second socket are forced to wait for the threads on the saturated socket to finish as well. Hence, every thread added on the second socket contributes only the average per-thread memory bandwidth of the saturated socket. The wasted time (due to waiting at the barrier) is what gives the linear scaling and also makes compact pinning suboptimal here.
With scattered pinning, this effect can be avoided since we saturate both sockets simultaneously, so the amount of useless waiting at the barrier is minimized. This results in overall better performance (except at the end point where # cores = # available cores and both curves match) and a single big saturation curve.
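To make the barrier argument concrete, here is a minimal toy model (not the actual benchmark). It assumes two sockets with 10 cores each, equal work per thread, and a barrier at the end, so the slowest thread sets the pace and the effective bandwidth is the thread count times the smallest per-thread bandwidth. The numbers `CORE_BW` and `SOCKET_BW`, as well as the function names, are made up for illustration and are not measurements.

```python
CORES_PER_SOCKET = 10  # 2 x 10 cores, as in the plots discussed above
CORE_BW = 10.0         # hypothetical bandwidth a single core can draw (GB/s)
SOCKET_BW = 50.0       # hypothetical saturation bandwidth per socket (GB/s)


def threads_per_socket(n, strategy):
    """Return [threads on socket 0, threads on socket 1] for n threads."""
    if strategy == "compact":
        # Fill the first socket completely before using the second one.
        first = min(n, CORES_PER_SOCKET)
        return [first, n - first]
    if strategy == "scatter":
        # Alternate between the sockets.
        return [(n + 1) // 2, n // 2]
    raise ValueError(f"unknown strategy: {strategy}")


def effective_bandwidth(n, strategy):
    """Aggregate bandwidth when all threads do equal work and meet at a barrier."""
    per_thread = [
        min(CORE_BW, SOCKET_BW / k)  # threads on a socket share its bandwidth
        for k in threads_per_socket(n, strategy)
        if k > 0
    ]
    # Every thread waits for the slowest one, so the slowest rate applies to all.
    return n * min(per_thread)


if __name__ == "__main__":
    for n in range(1, 2 * CORES_PER_SOCKET + 1):
        compact = effective_bandwidth(n, "compact")
        scatter = effective_bandwidth(n, "scatter")
        print(f"{n:2d} threads: compact {compact:6.1f}  scatter {scatter:6.1f}")
```

Under these assumptions, compact pinning stays at 50 GB/s up to 10 threads and then grows by 5 GB/s per added thread (the linear part), while scattered pinning already reaches 100 GB/s at 10 threads; both give 100 GB/s at 20 threads. In this toy model, odd thread counts with scattered pinning can dip slightly, because the socket holding one more thread sets the pace at the barrier.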
Hope this clears things up!
I did not. But I believe I noticed that the results were slightly different when running each measurement (i.e. each value of # cores) in a fresh Julia session compared to running them in the same Julia session one after another (the latter is what I use for the data / plots in the README.md). However, the effect was marginal, so it may well be just a fluctuation, and it might also be completely unrelated to GC.
Hmm… I was actually not super surprised by the late linear scaling of compact pinning. Since both strategies yield the same configuration at maximum cores (= 20), the two plots should meet there. Since scatter pinning saturates first, the compact pinning curve needs to “approach from below.” [^1] Of course, this argument takes as given the observation that scatter pinning outperforms compact pinning. So, maybe what is “obvious” depends on which phenomenon we noticed or were surprised by first.
Also, I was more interested in the fact that the difference is maximized at 10 cores. It’s natural that we observe a difference there, since that’s where the thread configurations are maximally different. But I wanted to explain the difference at this point without a “scaling” argument that requires varying the number of threads. Isn’t the effective use of per-socket resources like the L3 cache and local memory enough to explain this difference?
[^1]: This is a rough argument and I think your explanation illustrates more mechanical details.