Yes, the performance difference can certainly be huge when TSC is not being used. I am unclear on how often TSC is turned off on multi-processor, multi-core systems, but it does not seem to be that rare.
I don’t know much about the intended use, but I wonder if the code that is executed when PROFILE_JL_THREADING=1 could instead be run only when profiling is requested, similar to other profiling calls in Julia. I see very few calls to jl_threading_profile across GitHub, so it may not be commonly used.
GitHub search for jl_threading_profile
The JULIA_THREAD_SLEEP_THRESHOLD variable is also interesting: although there is a benefit in putting spinning threads to sleep, the code that is executed has a performance impact even when the threads are working. This also seems to be true for systems with TSC. On the OSX machine:
Standard Julia
nthreads = 1 : 258.688 ns (1 allocation: 32 bytes)
nthreads = 2 : 443.434 ns (1 allocation: 32 bytes)
nthreads = 4 : 751.000 ns (1 allocation: 32 bytes)
Setting PROFILE_JL_THREADING=0
nthreads = 1 : 97.810 ns (1 allocation: 32 bytes)
nthreads = 2 : 236.790 ns (1 allocation: 32 bytes)
nthreads = 4 : 437.035 ns (1 allocation: 32 bytes)
JULIA_THREAD_SLEEP_THRESHOLD=0 and PROFILE_JL_THREADING=0
nthreads = 1 : 52.644 ns (1 allocation: 32 bytes)
nthreads = 2 : 169.125 ns (1 allocation: 32 bytes)
nthreads = 4 : 408.315 ns (1 allocation: 32 bytes)
Not nearly as striking as for the E5 with TSC off, but it could still make a difference.
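To make the comparison easier to read, here is a small Python sketch that only does arithmetic on the OSX numbers above. It is not how the timings were produced, and the configuration labels (standard, profile_off, profile_off_sleep_0) and the compare helper are names made up for this illustration.

```python
# Timings in nanoseconds, copied from the OSX results above.
# The configuration labels are made up for this sketch.
timings_ns = {
    "standard": {1: 258.688, 2: 443.434, 4: 751.000},
    "profile_off": {1: 97.810, 2: 236.790, 4: 437.035},
    "profile_off_sleep_0": {1: 52.644, 2: 169.125, 4: 408.315},
}


def compare(baseline, config):
    """Per thread count, return (ratio, ns difference) of baseline against config."""
    result = {}
    for nthreads, base_ns in timings_ns[baseline].items():
        cfg_ns = timings_ns[config][nthreads]
        result[nthreads] = (base_ns / cfg_ns, base_ns - cfg_ns)
    return result


for config in ("profile_off", "profile_off_sleep_0"):
    print(f"standard vs {config}")
    for nthreads, (ratio, extra) in compare("standard", config).items():
        print(f"  nthreads = {nthreads} : {ratio:.2f}x slower, {extra:.1f} ns extra")
```

By this arithmetic, the standard build takes roughly 2.6x as long as the PROFILE_JL_THREADING=0 build at one thread and roughly 1.7x as long at four threads, while the absolute gap grows from about 161 ns to about 314 ns.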