Thread overhead variability across machines

I am being hit by large overheads when running Threads.@threads for loops on one machine, and I don’t understand the source of the issue. Here is a minimal working example in a file called threadOverhead.jl:

using BenchmarkTools, Compat

function foo()
  Threads.@threads for i = 1:Threads.nthreads()
  end
end

print("nthreads = $(Threads.nthreads()) : ")

@btime foo()

which I am calling from a script as follows:

#!/bin/bash

JULIA_NUM_THREADS=1 julia -O3 threadOverhead.jl
JULIA_NUM_THREADS=2 julia -O3 threadOverhead.jl
JULIA_NUM_THREADS=4 julia -O3 threadOverhead.jl
JULIA_NUM_THREADS=8 julia -O3 threadOverhead.jl
JULIA_NUM_THREADS=16 julia -O3 threadOverhead.jl

The output (which is good) on an OSX laptop is:

nthreads = 1 :   257.848 ns (1 allocation: 32 bytes)
nthreads = 2 :   443.732 ns (1 allocation: 32 bytes)
nthreads = 4 :   729.160 ns (1 allocation: 32 bytes)

The output (which is also fine) on a fairly old Xeon X5675 with CentOS 6.2 is:

nthreads = 1 :   248.243 ns (1 allocation: 32 bytes)
nthreads = 2 :   1.485 μs (1 allocation: 32 bytes)
nthreads = 4 :   1.388 μs (1 allocation: 32 bytes)
nthreads = 8 :   1.748 μs (1 allocation: 32 bytes)
nthreads = 12 :   2.050 μs (1 allocation: 32 bytes)

The output (which is not good) on a newer Xeon E5-2667 v3 with Ubuntu 16.04.3 is:

nthreads = 1 :   2.918 μs (1 allocation: 32 bytes)
nthreads = 2 :   6.258 μs (1 allocation: 32 bytes)
nthreads = 4 :   10.266 μs (1 allocation: 32 bytes)
nthreads = 8 :   21.790 μs (1 allocation: 32 bytes)
nthreads = 16 :   45.048 μs (1 allocation: 32 bytes)

In certain non-ideal but still important situations, this overhead ends up dominating the runtime.

I have tried on one other Linux machine, which had good performance. I can’t seem to find what is causing the large differences. Does anyone know? I’d also be happy to see what kind of performance people get on other machines.

Analogous OpenMP thread overhead is not large for the machine in question, which leads me to think the issue has something to do with the combination of the machine and Julia, and perhaps libuv?


Did you check for type-instability due to https://github.com/JuliaLang/julia/issues/23618 ?

Thanks. I don’t believe there is type-instability here, but would be happy to learn otherwise.
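
For what it’s worth, this is the kind of check I did on the MWE (a sketch; with real code one would look for Any, Union or Core.Box annotations on captured variables, which would point to the closure issue linked above):

using BenchmarkTools

function foo()
  Threads.@threads for i = 1:Threads.nthreads()
  end
end

@code_warntype foo()   # look for Any / Union / Core.Box highlighted in the output
@btime foo()           # still just the single expected allocation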

There seems to always be one allocation associated with any Threads.@threads for call.

I observe the same (also with a Haswell Xeon on Linux, with Julia 0.6.1 and 0.5.2).

Profiling suggests that almost all the time is spent in the spinup of ti_threadgroup_fork(), specifically in a call to clock_gettime() in the Linux runtime library.
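
One way to see this from within Julia is the following sketch (on 0.6 the profiler lives in Base.Profile, and the C frames only show up with C = true):

function foo()
  Threads.@threads for i = 1:Threads.nthreads()
  end
end
foo()   # compile first

Base.Profile.@profile for k = 1:100_000
  foo()
end
Base.Profile.print(C = true)   # ti_threadgroup_fork and clock_gettime frames show up here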

So why would clock_gettime() be slow? It is supposed to use a CPU feature called TSC (time stamp counter). But on my system

$ cat /sys/devices/system/clocksource/clocksource0/current_clocksource 

gives hpet instead of tsc. The HPET (high-precision event timer) is off-chip, so much slower than TSC, and evidently serialized (I infer from our timings).
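
A quick way to see the timer cost directly is to benchmark time_ns(), which (if I read the code correctly) goes through jl_hrtime() / uv_hrtime() and hence clock_gettime(CLOCK_MONOTONIC) on Linux:

using BenchmarkTools
@btime time_ns()   # tens of ns with a tsc clocksource, typically far more with hpet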

My boot log says

TSC synchronization [CPU#0 -> CPU#1]:
Measured 2703588 cycles TSC warp between CPUs, turning off TSC clock.
tsc: Marking TSC unstable due to check_tsc_sync_source failed

There’s also a line with “Your BIOS is broken” so maybe that’s the real problem (CPUs not properly initialized?).


Thanks Ralph. This is the same issue for me, in the syslog

TSC synchronization [CPU#0 -> CPU#1]:
Measured 2979160 cycles TSC warp between CPUs, turning off TSC clock.
tsc: Marking TSC unstable due to check_tsc_sync_source failed

and current_clocksource gives hpet. The other machines with good performance are all using TSC.

I haven’t been able to see how to fix this, and I wonder how common the issue is.

I was puzzled that using libuv’s thread pool led to all of these calls to clock_gettime(), especially since OpenMP doesn’t suffer from this issue. In fact, I found that the ultimate source of almost all of the calls to clock_gettime() is Julia’s threading.c, and in particular they can be avoided by changing

#define PROFILE_JL_THREADING 1

to

#define PROFILE_JL_THREADING 0

in https://github.com/JuliaLang/julia/blob/master/src/threading.h

Making this change to the 0.6.1 src and recompiling without any other changes or special flags gives:

nthreads = 1 :   74.727 ns (1 allocation: 32 bytes)
nthreads = 2 :   763.809 ns (1 allocation: 32 bytes)
nthreads = 4 :   1.718 μs (1 allocation: 32 bytes)
nthreads = 8 :   4.563 μs (1 allocation: 32 bytes)
nthreads = 16 :   8.521 μs (1 allocation: 32 bytes)

which is really much better. Is it the plan that PROFILE_JL_THREADING remain enabled in releases?

There are some remaining calls to uv_hrtime() in https://github.com/JuliaLang/julia/blob/master/src/threadgroup.c which lead to clock_gettime(). Looking at the code suggests that we can avoid these by setting an environment variable

JULIA_THREAD_SLEEP_THRESHOLD=0 or JULIA_THREAD_SLEEP_THRESHOLD=infinite

before calling julia. This results in

nthreads = 1 :   45.408 ns (1 allocation: 32 bytes)
nthreads = 2 :   382.115 ns (1 allocation: 32 bytes)
nthreads = 4 :   459.112 ns (1 allocation: 32 bytes)
nthreads = 8 :   490.691 ns (1 allocation: 32 bytes)
nthreads = 16 :   621.306 ns (1 allocation: 32 bytes)

Perhaps there is a reason for a non-zero JULIA_THREAD_SLEEP_THRESHOLD? My code runs fine with it set to 0 / infinite.


There are several thread performance-scaling issues discussed in #17395. I think the current status starts from around here as some issues have been fixed. Disabling PROFILE_JL_THREADING on systems with TSC problems seems very reasonable (I suspect it was done in the preprocessor for convenience, and could be made a startup option).

Yes, certainly the performance difference can be huge when TSC is not being used. I am very unclear on how often TSC is turned off on multi-processor, multi-core systems, but it doesn’t seem to be that rare.

I don’t know much about the intended use, but I wonder if the code that is executed when PROFILE_JL_THREADING=1 could instead be run only when profiling is requested, similar to other profiling calls in Julia. I see very few calls to jl_threading_profile across GitHub, so it may not be commonly used.

GitHub search for jl_threading_profile

The JULIA_THREAD_SLEEP_THRESHOLD variable is also interesting: although there is a benefit to putting spinning threads to sleep, the time-checking code that it requires has a performance impact even when the threads are working. This also seems to be true on systems with TSC. On the OSX machine:

Standard Julia

nthreads = 1 :   258.688 ns (1 allocation: 32 bytes)
nthreads = 2 :   443.434 ns (1 allocation: 32 bytes)
nthreads = 4 :   751.000 ns (1 allocation: 32 bytes)

Setting PROFILE_JL_THREADING=0

nthreads = 1 :   97.810 ns (1 allocation: 32 bytes)
nthreads = 2 :   236.790 ns (1 allocation: 32 bytes)
nthreads = 4 :   437.035 ns (1 allocation: 32 bytes)

JULIA_THREAD_SLEEP_THRESHOLD=0 and PROFILE_JL_THREADING=0

nthreads = 1 :   52.644 ns (1 allocation: 32 bytes)
nthreads = 2 :   169.125 ns (1 allocation: 32 bytes)
nthreads = 4 :   408.315 ns (1 allocation: 32 bytes)

Not nearly as striking as for the E5 with TSC off, but it could still make a difference.

To give some more realistic data on the effect of these options: I am working on a package, hopefully soon to be ready for release, providing a specific type of numerical integration method that is amenable to shared-memory, multi-processor implementation. In some important cases, this method will be run hundreds of thousands of times in a loop, so performance is important even when the times seem small. In the benchmarking results, N is the number of particles and n the number of steps in the numerical integration method. The complexity is O(Nn).
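
To make concrete why the per-call overhead matters, here is a toy sketch of the loop structure (hypothetical code, not the package itself): the @threads region is entered n times per call, so the fork/join overhead is paid n times, and the surrounding application then makes a very large number of such calls.

# Toy sketch only -- step! and integrate! are made-up stand-ins.
function step!(x)
  Threads.@threads for i in eachindex(x)
    @inbounds x[i] += 1.0   # stand-in for the per-particle work
  end
end

function integrate!(x, n)
  for k = 1:n   # n steps => n fork/join overheads per call
    step!(x)
  end
end

x = zeros(2^10)       # N = 2^10 particles
integrate!(x, 10)     # n = 10 steps; called again and again by the application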

The results indicate that disabling PROFILE_JL_THREADING on a no-TSC system is extremely beneficial. The use of JULIA_THREAD_SLEEP_THRESHOLD=0 also makes a big difference on a no-TSC system, but this is quite unfortunate since it clearly can impact the performance of serial code or any other jobs running on the system, and is generally speaking not a good idea.

It’s really great how light-weight the threads in Julia are / could be, and I wonder if there may be an option in the future to toggle the sleep checks dynamically from user code? This could be much better than running a whole application with spinning threads. Or perhaps a less expensive sleep-checking routine is being looked at?

Here are some benchmarks with standard Julia 0.6.1:

$ julia -O3 SequentialMonteCarlo.jl/test/runbenchmarks.jl 
Running Benchmarks: Tue, 28 Nov 2017 08:45:35
x86_64-pc-linux-gnu ; Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz (haswell) ; 8 Physical, 16 Logical
Linear Gaussian Model, n = 10
log₂N  Threads  Benchmark
10     01       462.701 μs (0 allocations: 0 bytes)
10     16       3.437 ms (68 allocations: 3.03 KiB)
12     01       1.868 ms (0 allocations: 0 bytes)
12     16       3.539 ms (68 allocations: 3.03 KiB)
14     01       7.519 ms (0 allocations: 0 bytes)
14     16       4.221 ms (68 allocations: 3.03 KiB)
16     01       30.168 ms (0 allocations: 0 bytes)
16     16       7.161 ms (68 allocations: 3.03 KiB)
18     01       126.488 ms (0 allocations: 0 bytes)
18     16       21.227 ms (68 allocations: 3.03 KiB)
20     01       536.513 ms (0 allocations: 0 bytes)
20     16       106.297 ms (68 allocations: 3.03 KiB)
Multivariate Linear Gaussian Model, d = 1, n = 10
log₂N  Threads  Benchmark
10     01       929.663 μs (0 allocations: 0 bytes)
10     16       3.428 ms (68 allocations: 3.03 KiB)
12     01       3.736 ms (0 allocations: 0 bytes)
12     16       3.771 ms (68 allocations: 3.03 KiB)
14     01       14.922 ms (0 allocations: 0 bytes)
14     16       5.316 ms (68 allocations: 3.03 KiB)
16     01       60.139 ms (0 allocations: 0 bytes)
16     16       11.921 ms (68 allocations: 3.03 KiB)
18     01       265.790 ms (0 allocations: 0 bytes)
18     16       43.471 ms (68 allocations: 3.03 KiB)
20     01       1.009 s (0 allocations: 0 bytes)
20     16       169.747 ms (68 allocations: 3.03 KiB)
Multivariate Linear Gaussian Model, d = 2, n = 10
log₂N  Threads  Benchmark
10     01       1.102 ms (0 allocations: 0 bytes)
10     16       3.589 ms (68 allocations: 3.03 KiB)
12     01       4.468 ms (0 allocations: 0 bytes)
12     16       3.951 ms (68 allocations: 3.03 KiB)
14     01       17.877 ms (0 allocations: 0 bytes)
14     16       5.571 ms (68 allocations: 3.03 KiB)
16     01       71.282 ms (0 allocations: 0 bytes)
16     16       12.209 ms (68 allocations: 3.03 KiB)
18     01       305.750 ms (0 allocations: 0 bytes)
18     16       49.480 ms (68 allocations: 3.03 KiB)
20     01       1.239 s (0 allocations: 0 bytes)
20     16       195.557 ms (68 allocations: 3.03 KiB)
Multivariate Linear Gaussian Model, d = 4, n = 10
log₂N  Threads  Benchmark
10     01       1.174 ms (0 allocations: 0 bytes)
10     16       3.495 ms (68 allocations: 3.03 KiB)
12     01       4.754 ms (0 allocations: 0 bytes)
12     16       3.925 ms (68 allocations: 3.03 KiB)
14     01       19.630 ms (0 allocations: 0 bytes)
14     16       5.733 ms (68 allocations: 3.03 KiB)
16     01       76.595 ms (0 allocations: 0 bytes)
16     16       12.953 ms (68 allocations: 3.03 KiB)
18     01       323.331 ms (0 allocations: 0 bytes)
18     16       52.247 ms (68 allocations: 3.03 KiB)
20     01       1.292 s (0 allocations: 0 bytes)
20     16       208.725 ms (68 allocations: 3.03 KiB)
Multivariate Linear Gaussian Model, d = 8, n = 10
log₂N  Threads  Benchmark
10     01       1.204 ms (0 allocations: 0 bytes)
10     16       3.511 ms (68 allocations: 3.03 KiB)
12     01       4.913 ms (0 allocations: 0 bytes)
12     16       3.963 ms (68 allocations: 3.03 KiB)
14     01       19.842 ms (0 allocations: 0 bytes)
14     16       5.707 ms (68 allocations: 3.03 KiB)
16     01       77.291 ms (0 allocations: 0 bytes)
16     16       12.990 ms (68 allocations: 3.03 KiB)
18     01       325.592 ms (0 allocations: 0 bytes)
18     16       52.412 ms (68 allocations: 3.03 KiB)
20     01       1.352 s (0 allocations: 0 bytes)
20     16       205.747 ms (68 allocations: 3.03 KiB)
SMC Sampler Example, n = 12
log₂N  Threads  Benchmark
10     01       7.276 ms (0 allocations: 0 bytes)
10     16       4.874 ms (82 allocations: 3.66 KiB)
12     01       29.345 ms (0 allocations: 0 bytes)
12     16       7.109 ms (82 allocations: 3.66 KiB)
14     01       117.302 ms (0 allocations: 0 bytes)
14     16       16.393 ms (82 allocations: 3.66 KiB)
16     01       469.471 ms (0 allocations: 0 bytes)
16     16       53.248 ms (82 allocations: 3.66 KiB)
18     01       1.916 s (0 allocations: 0 bytes)
18     16       218.752 ms (82 allocations: 3.66 KiB)
20     01       7.689 s (0 allocations: 0 bytes)
20     16       871.899 ms (82 allocations: 3.66 KiB)
Finished: Tue, 28 Nov 2017 09:07:39

Here are benchmarks with JULIA_THREAD_SLEEP_THRESHOLD=0 and Julia 0.6.1 compiled with PROFILE_JL_THREADING=0:

$ JULIA_THREAD_SLEEP_THRESHOLD=0 julia-fixed -O3 SequentialMonteCarlo.jl/test/runbenchmarks.jl 
Running Benchmarks: Tue, 28 Nov 2017 08:12:21
x86_64-linux-gnu ; Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz (haswell) ; 8 Physical, 16 Logical
Linear Gaussian Model, n = 10
log₂N  Threads  Benchmark
10     01       644.779 μs (0 allocations: 0 bytes)
10     16       159.169 μs (68 allocations: 3.03 KiB)
12     01       2.603 ms (0 allocations: 0 bytes)
12     16       346.625 μs (68 allocations: 3.03 KiB)
14     01       10.473 ms (0 allocations: 0 bytes)
14     16       984.768 μs (68 allocations: 3.03 KiB)
16     01       41.410 ms (0 allocations: 0 bytes)
16     16       3.673 ms (68 allocations: 3.03 KiB)
18     01       169.404 ms (0 allocations: 0 bytes)
18     16       19.562 ms (68 allocations: 3.03 KiB)
20     01       704.918 ms (0 allocations: 0 bytes)
20     16       100.459 ms (68 allocations: 3.03 KiB)
Multivariate Linear Gaussian Model, d = 1, n = 10
log₂N  Threads  Benchmark
10     01       1.591 ms (0 allocations: 0 bytes)
10     16       241.932 μs (68 allocations: 3.03 KiB)
12     01       6.250 ms (0 allocations: 0 bytes)
12     16       610.975 μs (68 allocations: 3.03 KiB)
14     01       24.829 ms (0 allocations: 0 bytes)
14     16       2.213 ms (68 allocations: 3.03 KiB)
16     01       97.583 ms (0 allocations: 0 bytes)
16     16       8.500 ms (68 allocations: 3.03 KiB)
18     01       407.901 ms (0 allocations: 0 bytes)
18     16       40.722 ms (68 allocations: 3.03 KiB)
20     01       1.596 s (0 allocations: 0 bytes)
20     16       167.738 ms (68 allocations: 3.03 KiB)
Multivariate Linear Gaussian Model, d = 2, n = 10
log₂N  Threads  Benchmark
10     01       1.781 ms (0 allocations: 0 bytes)
10     16       251.080 μs (68 allocations: 3.03 KiB)
12     01       7.323 ms (0 allocations: 0 bytes)
12     16       678.721 μs (68 allocations: 3.03 KiB)
14     01       28.616 ms (0 allocations: 0 bytes)
14     16       2.438 ms (68 allocations: 3.03 KiB)
16     01       113.704 ms (0 allocations: 0 bytes)
16     16       9.584 ms (68 allocations: 3.03 KiB)
18     01       469.149 ms (0 allocations: 0 bytes)
18     16       46.618 ms (68 allocations: 3.03 KiB)
20     01       1.872 s (0 allocations: 0 bytes)
20     16       191.293 ms (68 allocations: 3.03 KiB)
Multivariate Linear Gaussian Model, d = 4, n = 10
log₂N  Threads  Benchmark
10     01       1.999 ms (0 allocations: 0 bytes)
10     16       269.449 μs (68 allocations: 3.03 KiB)
12     01       7.959 ms (0 allocations: 0 bytes)
12     16       733.547 μs (68 allocations: 3.03 KiB)
14     01       31.521 ms (0 allocations: 0 bytes)
14     16       2.687 ms (68 allocations: 3.03 KiB)
16     01       124.232 ms (0 allocations: 0 bytes)
16     16       10.098 ms (68 allocations: 3.03 KiB)
18     01       511.506 ms (0 allocations: 0 bytes)
18     16       49.373 ms (68 allocations: 3.03 KiB)
20     01       2.052 s (0 allocations: 0 bytes)
20     16       204.563 ms (68 allocations: 3.03 KiB)
Multivariate Linear Gaussian Model, d = 8, n = 10
log₂N  Threads  Benchmark
10     01       1.967 ms (0 allocations: 0 bytes)
10     16       264.560 μs (68 allocations: 3.03 KiB)
12     01       7.878 ms (0 allocations: 0 bytes)
12     16       718.880 μs (68 allocations: 3.03 KiB)
14     01       31.305 ms (0 allocations: 0 bytes)
14     16       2.573 ms (68 allocations: 3.03 KiB)
16     01       123.987 ms (0 allocations: 0 bytes)
16     16       10.513 ms (68 allocations: 3.03 KiB)
18     01       509.400 ms (0 allocations: 0 bytes)
18     16       49.274 ms (68 allocations: 3.03 KiB)
20     01       2.045 s (0 allocations: 0 bytes)
20     16       202.500 ms (68 allocations: 3.03 KiB)
SMC Sampler Example, n = 12
log₂N  Threads  Benchmark
10     01       10.587 ms (0 allocations: 0 bytes)
10     16       880.215 μs (82 allocations: 3.66 KiB)
12     01       41.959 ms (0 allocations: 0 bytes)
12     16       3.149 ms (82 allocations: 3.66 KiB)
14     01       165.880 ms (0 allocations: 0 bytes)
14     16       12.580 ms (82 allocations: 3.66 KiB)
16     01       662.951 ms (0 allocations: 0 bytes)
16     16       48.526 ms (82 allocations: 3.66 KiB)
18     01       2.687 s (0 allocations: 0 bytes)
18     16       209.933 ms (82 allocations: 3.66 KiB)
20     01       10.740 s (0 allocations: 0 bytes)
20     16       841.519 ms (82 allocations: 3.66 KiB)
Finished: Tue, 28 Nov 2017 08:44:17

Finally, benchmarks with Julia 0.6.1 compiled with PROFILE_JL_THREADING=0 but not changing JULIA_THREAD_SLEEP_THRESHOLD:

$ julia-fixed -O3 SequentialMonteCarlo.jl/test/runbenchmarks.jl 
Running Benchmarks: Tue, 28 Nov 2017 09:08:18
x86_64-linux-gnu ; Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz (haswell) ; 8 Physical, 16 Logical
Linear Gaussian Model, n = 10
log₂N  Threads  Benchmark
10     01       449.781 μs (0 allocations: 0 bytes)
10     16       765.745 μs (68 allocations: 3.03 KiB)
12     01       1.823 ms (0 allocations: 0 bytes)
12     16       890.482 μs (68 allocations: 3.03 KiB)
14     01       7.343 ms (0 allocations: 0 bytes)
14     16       1.537 ms (68 allocations: 3.03 KiB)
16     01       29.151 ms (0 allocations: 0 bytes)
16     16       4.305 ms (68 allocations: 3.03 KiB)
18     01       120.707 ms (0 allocations: 0 bytes)
18     16       18.588 ms (68 allocations: 3.03 KiB)
20     01       514.423 ms (0 allocations: 0 bytes)
20     16       101.536 ms (68 allocations: 3.03 KiB)
Multivariate Linear Gaussian Model, d = 1, n = 10
log₂N  Threads  Benchmark
10     01       941.256 μs (0 allocations: 0 bytes)
10     16       828.741 μs (68 allocations: 3.03 KiB)
12     01       3.798 ms (0 allocations: 0 bytes)
12     16       1.179 ms (68 allocations: 3.03 KiB)
14     01       15.628 ms (0 allocations: 0 bytes)
14     16       2.711 ms (68 allocations: 3.03 KiB)
16     01       59.672 ms (0 allocations: 0 bytes)
16     16       8.953 ms (68 allocations: 3.03 KiB)
18     01       244.600 ms (0 allocations: 0 bytes)
18     16       40.924 ms (68 allocations: 3.03 KiB)
20     01       988.874 ms (0 allocations: 0 bytes)
20     16       166.979 ms (68 allocations: 3.03 KiB)
Multivariate Linear Gaussian Model, d = 2, n = 10
log₂N  Threads  Benchmark
10     01       1.066 ms (0 allocations: 0 bytes)
10     16       853.256 μs (68 allocations: 3.03 KiB)
12     01       4.316 ms (0 allocations: 0 bytes)
12     16       1.267 ms (68 allocations: 3.03 KiB)
14     01       17.859 ms (0 allocations: 0 bytes)
14     16       3.092 ms (68 allocations: 3.03 KiB)
16     01       67.002 ms (0 allocations: 0 bytes)
16     16       9.840 ms (68 allocations: 3.03 KiB)
18     01       281.559 ms (0 allocations: 0 bytes)
18     16       46.449 ms (68 allocations: 3.03 KiB)
20     01       1.135 s (0 allocations: 0 bytes)
20     16       190.629 ms (68 allocations: 3.03 KiB)
Multivariate Linear Gaussian Model, d = 4, n = 10
log₂N  Threads  Benchmark
10     01       1.205 ms (0 allocations: 0 bytes)
10     16       877.002 μs (68 allocations: 3.03 KiB)
12     01       4.876 ms (0 allocations: 0 bytes)
12     16       1.336 ms (68 allocations: 3.03 KiB)
14     01       19.420 ms (0 allocations: 0 bytes)
14     16       3.198 ms (68 allocations: 3.03 KiB)
16     01       77.090 ms (0 allocations: 0 bytes)
16     16       10.708 ms (68 allocations: 3.03 KiB)
18     01       319.223 ms (0 allocations: 0 bytes)
18     16       49.946 ms (68 allocations: 3.03 KiB)
20     01       1.283 s (0 allocations: 0 bytes)
20     16       201.582 ms (68 allocations: 3.03 KiB)
Multivariate Linear Gaussian Model, d = 8, n = 10
log₂N  Threads  Benchmark
10     01       1.181 ms (0 allocations: 0 bytes)
10     16       870.926 μs (68 allocations: 3.03 KiB)
12     01       4.841 ms (0 allocations: 0 bytes)
12     16       1.302 ms (68 allocations: 3.03 KiB)
14     01       19.055 ms (0 allocations: 0 bytes)
14     16       3.184 ms (68 allocations: 3.03 KiB)
16     01       74.834 ms (0 allocations: 0 bytes)
16     16       10.758 ms (68 allocations: 3.03 KiB)
18     01       313.737 ms (0 allocations: 0 bytes)
18     16       51.431 ms (68 allocations: 3.03 KiB)
20     01       1.262 s (0 allocations: 0 bytes)
20     16       217.092 ms (68 allocations: 3.03 KiB)
SMC Sampler Example, n = 12
log₂N  Threads  Benchmark
10     01       7.422 ms (0 allocations: 0 bytes)
10     16       1.628 ms (82 allocations: 3.66 KiB)
12     01       29.615 ms (0 allocations: 0 bytes)
12     16       3.926 ms (82 allocations: 3.66 KiB)
14     01       116.803 ms (0 allocations: 0 bytes)
14     16       13.557 ms (82 allocations: 3.66 KiB)
16     01       467.073 ms (0 allocations: 0 bytes)
16     16       49.751 ms (82 allocations: 3.66 KiB)
18     01       1.898 s (0 allocations: 0 bytes)
18     16       213.336 ms (82 allocations: 3.66 KiB)
20     01       7.602 s (0 allocations: 0 bytes)
20     16       853.096 ms (82 allocations: 3.66 KiB)
Finished: Tue, 28 Nov 2017 09:30:16

You could try changing the uv_hrtime() calls in ti_threadgroup_fork() to rdtscp(). It looks as if the logic is ok even if the TSC varies between cores (but let’s ask @yuyichao for advice). If that helps maybe it could be a build configuration option.

By the way, does your code really benefit from hyperthreading? My recollection is that thread count should usually be the number of physical cores for this sort of thing.

I don’t think that rdtscp is available on all systems, unfortunately. I do wonder how the major OpenMP implementations decide when to put spinning threads to sleep; I can only speculate from glancing at the code, which seems to be quite involved.

As for hyperthreading, I usually do obtain relatively small performance improvements (e.g. 5%) for N >= 2^12, with very little difference for smaller N; I suspect this is due to some of the memory-operation parts of my code, but I don’t know for sure.

As far as I can recall, PROFILE_JL_THREADING can be turned off and left off – it was really only intended for performance debugging (and used rdtscp() back then).

The sleep threshold exists because I expected the default JULIA_NUM_THREADS to be 2 or 4, and this may yet be the case going forward. Clearly, if we start any extra threads, we don’t want them to spin doing nothing in non-threaded codes. For a threaded code, the sleep threshold eliminates the thread wakeup penalty and removes the need to add a dummy threaded loop right before some threaded code whose performance you care about, or set some environment variable. I don’t like the idea of removing it – IMO default behavior should be fast for multi-threaded code.
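
That is, without the sleep threshold one would have to resort to something like this right before the region being timed (rough sketch):

Threads.@threads for i = 1:Threads.nthreads()
end   # dummy loop just to wake the sleeping threads
# ... immediately run the threaded code whose performance you care about ...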

And: I’m still making slow progress on the new threading code and hope to get the PR TODOs all checked off by the end of the year.


It would be great if PROFILE_JL_THREADING could be off; even on a normal machine, I think it burns cycles unnecessarily in most situations.

Just to make sure I’m not misunderstanding: JULIA_THREAD_SLEEP_THRESHOLD=0 is equivalent to JULIA_THREAD_SLEEP_THRESHOLD=infinite, i.e. it turns off both the time checks and putting the threads to sleep. So what I have been looking at removes the time checks entirely, but then the threads also never sleep.

Nevertheless, I agree completely that a mechanism is needed to keep threads from spinning needlessly for long periods of time. It could be that the issue I have on this one machine is rare, and in any case it can be worked around by calling a multi-threaded script with the sleep threshold set to infinite / 0 (the threads would have been awake anyway, so this does nothing but save the cycles spent on time checks) and then saving the resulting data.

The only thing I wonder is whether it would be at all feasible to set the sleep threshold, or toggle it between the default value and infinity / 0, in user code rather than by setting an environment variable. Even if it was a good idea and simple to implement, I realize it would probably be quite low priority.
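
Something along these lines is what I have in mind (purely hypothetical sketch; no such function exists today):

# Hypothetical API only -- Threads.set_sleep_threshold does not exist.
Threads.set_sleep_threshold(0)            # 0 / infinite: keep the threads spinning
for iter = 1:100_000
  # ... tight sequence of threaded regions ...
end
Threads.set_sleep_threshold(1_000_000)    # afterwards, back to some finite threshold in ns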

Perhaps one of the regular committers can turn off PROFILE_JL_THREADING in master. @yuyichao or @jameson?

JULIA_THREAD_SLEEP_THRESHOLD specifies, in nanoseconds, how long spinning threads should wait before sleeping. The word “infinite” is translated to 0 (haha!), which actually means: disable thread sleeping. I should have made it -1 instead.

Having a way to set the sleep threshold dynamically makes sense and I’ll likely add this to the new threading code… but.

A key goal for Julia’s threading model is high productivity and hence a simple and minimal interface is desirable (though obviously, not at the expense of important functionality). Complicated threading interfaces (like OpenMP) start getting complicated in just this way – by adding reasonable features. Before you know it, you have dozens of such features. It’s worth debating the addition of every one.
