Thread overhead variability across machines

Yes, the performance difference can certainly be huge when TSC is not being used. I am unsure how often TSC is turned off on multi-processor, multi-core systems, but it does not seem to be that rare.

I don't know much about the intended use, but I wonder if the code that is executed when PROFILE_JL_THREADING=1 could instead be run only when profiling is requested, similar to other profiling calls in Julia. I see very few calls to jl_threading_profile across GitHub, so it may not be commonly used.

GitHub search for jl_threading_profile

The JULIA_THREAD_SLEEP_THRESHOLD variable is also interesting: although there is a benefit in putting spinning threads to sleep, the code that is executed for it has a performance impact even when the threads are working. This also seems to be true for systems with TSC. On the OSX machine:

Standard Julia

nthreads = 1 :   258.688 ns (1 allocation: 32 bytes)
nthreads = 2 :   443.434 ns (1 allocation: 32 bytes)
nthreads = 4 :   751.000 ns (1 allocation: 32 bytes)

Setting PROFILE_JL_THREADING=0

nthreads = 1 :   97.810 ns (1 allocation: 32 bytes)
nthreads = 2 :   236.790 ns (1 allocation: 32 bytes)
nthreads = 4 :   437.035 ns (1 allocation: 32 bytes)

JULIA_THREAD_SLEEP_THRESHOLD=0 and PROFILE_JL_THREADING=0

nthreads = 1 :   52.644 ns (1 allocation: 32 bytes)
nthreads = 2 :   169.125 ns (1 allocation: 32 bytes)
nthreads = 4 :   408.315 ns (1 allocation: 32 bytes)

Not nearly as striking as for the E5 with TSC off, but it could still make a difference.
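
To put the OSX numbers in relative terms, here is a small sketch that computes how much of the standard per-call time each configuration removes. The timings are copied from the output above; the variable names and the percent_saved helper are my own, for illustration only, and are not part of Julia or of the benchmark that produced the numbers.

```julia
# Timings in nanoseconds, copied from the OSX results above.
thread_counts = [1, 2, 4]
standard      = [258.688, 443.434, 751.000]  # standard Julia
no_profile    = [97.810, 236.790, 437.035]   # PROFILE_JL_THREADING=0
no_prof_sleep = [52.644, 169.125, 408.315]   # also JULIA_THREAD_SLEEP_THRESHOLD=0

# Hypothetical helper: percentage of the baseline time that a configuration removes.
percent_saved(baseline, t) = 100 * (baseline - t) / baseline

for (i, n) in enumerate(thread_counts)
    p = round(percent_saved(standard[i], no_profile[i]), digits=1)
    s = round(percent_saved(standard[i], no_prof_sleep[i]), digits=1)
    println("nthreads = $n : profiling off saves $p%, profiling and sleep threshold off saves $s%")
end
```

The relative saving shrinks as the thread count grows, which suggests that the profiling and sleep-threshold code accounts for a smaller share of the total at higher thread counts, at least on this machine.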