To give some more realistic data on the effect of these options: I am working on a package, hopefully ready for release soon, that provides a specific type of numerical integration method amenable to shared-memory, multi-processor implementation. In some important cases this method will be run hundreds of thousands of times in a loop, so performance matters even when the individual times seem small. In the benchmark results, N is the number of particles and n the number of steps in the numerical integration method. The complexity is O(Nn).
The results indicate that disabling PROFILE_JL_THREADING on a no-TSC system is extremely beneficial. Setting JULIA_THREAD_SLEEP_THRESHOLD=0 also makes a big difference on a no-TSC system, but this is quite unfortunate: it can clearly hurt the performance of serial code (the single-thread timings in that run below are roughly 1.4 to 1.7 times slower than in the other two runs) and of any other jobs running on the system, so it is generally not a good idea.
It’s really great how light-weight the threads in Julia are / could be, and I wonder whether there may be options in the future to toggle the sleep checks dynamically from user code. This could be much better than running a whole application with spinning threads. Or perhaps a less expensive routine for sleep-checking is being looked at?
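For anyone who wants to see the fixed per-call cost on their own machine without my package, here is a minimal sketch. It is not taken from SequentialMonteCarlo.jl; `threaded_noop!` and `mean_call_time` are hypothetical names made up for this post. It times a threaded loop that does almost no work, so the result should be dominated by the cost of entering and leaving the threaded region. I am assuming it is launched with something like `JULIA_NUM_THREADS=16 julia overhead.jl`.

```julia
# Hypothetical micro-benchmark, not part of SequentialMonteCarlo.jl.
using Base.Threads

# A threaded loop with almost no work per iteration, so the measured time
# mostly reflects the overhead of starting and finishing the threaded region.
function threaded_noop!(out::Vector{Int})
    @threads for i in 1:length(out)
        out[i] = i
    end
    return out
end

# Mean wall-clock time per call, in seconds, averaged over `reps` calls.
function mean_call_time(reps::Int)
    out = zeros(Int, nthreads())
    threaded_noop!(out)          # warm-up call so compilation is not timed
    t0 = time_ns()
    for _ in 1:reps
        threaded_noop!(out)
    end
    return (time_ns() - t0) / reps / 1e9
end

println("threads = ", nthreads(),
        ", mean time per call = ", mean_call_time(1000), " s")
```

Comparing the printed time between a stock build and a rebuilt one, with and without JULIA_THREAD_SLEEP_THRESHOLD=0, should give a rough picture of the overhead that dominates the small-N rows below. The exact numbers will of course depend on the machine.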
Here are some benchmarks with standard Julia 0.6.1:
$ julia -O3 SequentialMonteCarlo.jl/test/runbenchmarks.jl
Running Benchmarks: Tue, 28 Nov 2017 08:45:35
x86_64-pc-linux-gnu ; Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz (haswell) ; 8 Physical, 16 Logical
Linear Gaussian Model, n = 10
log₂N Threads Benchmark
10 01 462.701 μs (0 allocations: 0 bytes)
10 16 3.437 ms (68 allocations: 3.03 KiB)
12 01 1.868 ms (0 allocations: 0 bytes)
12 16 3.539 ms (68 allocations: 3.03 KiB)
14 01 7.519 ms (0 allocations: 0 bytes)
14 16 4.221 ms (68 allocations: 3.03 KiB)
16 01 30.168 ms (0 allocations: 0 bytes)
16 16 7.161 ms (68 allocations: 3.03 KiB)
18 01 126.488 ms (0 allocations: 0 bytes)
18 16 21.227 ms (68 allocations: 3.03 KiB)
20 01 536.513 ms (0 allocations: 0 bytes)
20 16 106.297 ms (68 allocations: 3.03 KiB)
Multivariate Linear Gaussian Model, d = 1, n = 10
log₂N Threads Benchmark
10 01 929.663 μs (0 allocations: 0 bytes)
10 16 3.428 ms (68 allocations: 3.03 KiB)
12 01 3.736 ms (0 allocations: 0 bytes)
12 16 3.771 ms (68 allocations: 3.03 KiB)
14 01 14.922 ms (0 allocations: 0 bytes)
14 16 5.316 ms (68 allocations: 3.03 KiB)
16 01 60.139 ms (0 allocations: 0 bytes)
16 16 11.921 ms (68 allocations: 3.03 KiB)
18 01 265.790 ms (0 allocations: 0 bytes)
18 16 43.471 ms (68 allocations: 3.03 KiB)
20 01 1.009 s (0 allocations: 0 bytes)
20 16 169.747 ms (68 allocations: 3.03 KiB)
Multivariate Linear Gaussian Model, d = 2, n = 10
log₂N Threads Benchmark
10 01 1.102 ms (0 allocations: 0 bytes)
10 16 3.589 ms (68 allocations: 3.03 KiB)
12 01 4.468 ms (0 allocations: 0 bytes)
12 16 3.951 ms (68 allocations: 3.03 KiB)
14 01 17.877 ms (0 allocations: 0 bytes)
14 16 5.571 ms (68 allocations: 3.03 KiB)
16 01 71.282 ms (0 allocations: 0 bytes)
16 16 12.209 ms (68 allocations: 3.03 KiB)
18 01 305.750 ms (0 allocations: 0 bytes)
18 16 49.480 ms (68 allocations: 3.03 KiB)
20 01 1.239 s (0 allocations: 0 bytes)
20 16 195.557 ms (68 allocations: 3.03 KiB)
Multivariate Linear Gaussian Model, d = 4, n = 10
log₂N Threads Benchmark
10 01 1.174 ms (0 allocations: 0 bytes)
10 16 3.495 ms (68 allocations: 3.03 KiB)
12 01 4.754 ms (0 allocations: 0 bytes)
12 16 3.925 ms (68 allocations: 3.03 KiB)
14 01 19.630 ms (0 allocations: 0 bytes)
14 16 5.733 ms (68 allocations: 3.03 KiB)
16 01 76.595 ms (0 allocations: 0 bytes)
16 16 12.953 ms (68 allocations: 3.03 KiB)
18 01 323.331 ms (0 allocations: 0 bytes)
18 16 52.247 ms (68 allocations: 3.03 KiB)
20 01 1.292 s (0 allocations: 0 bytes)
20 16 208.725 ms (68 allocations: 3.03 KiB)
Multivariate Linear Gaussian Model, d = 8, n = 10
log₂N Threads Benchmark
10 01 1.204 ms (0 allocations: 0 bytes)
10 16 3.511 ms (68 allocations: 3.03 KiB)
12 01 4.913 ms (0 allocations: 0 bytes)
12 16 3.963 ms (68 allocations: 3.03 KiB)
14 01 19.842 ms (0 allocations: 0 bytes)
14 16 5.707 ms (68 allocations: 3.03 KiB)
16 01 77.291 ms (0 allocations: 0 bytes)
16 16 12.990 ms (68 allocations: 3.03 KiB)
18 01 325.592 ms (0 allocations: 0 bytes)
18 16 52.412 ms (68 allocations: 3.03 KiB)
20 01 1.352 s (0 allocations: 0 bytes)
20 16 205.747 ms (68 allocations: 3.03 KiB)
SMC Sampler Example, n = 12
log₂N Threads Benchmark
10 01 7.276 ms (0 allocations: 0 bytes)
10 16 4.874 ms (82 allocations: 3.66 KiB)
12 01 29.345 ms (0 allocations: 0 bytes)
12 16 7.109 ms (82 allocations: 3.66 KiB)
14 01 117.302 ms (0 allocations: 0 bytes)
14 16 16.393 ms (82 allocations: 3.66 KiB)
16 01 469.471 ms (0 allocations: 0 bytes)
16 16 53.248 ms (82 allocations: 3.66 KiB)
18 01 1.916 s (0 allocations: 0 bytes)
18 16 218.752 ms (82 allocations: 3.66 KiB)
20 01 7.689 s (0 allocations: 0 bytes)
20 16 871.899 ms (82 allocations: 3.66 KiB)
Finished: Tue, 28 Nov 2017 09:07:39
Here are benchmarks with JULIA_THREAD_SLEEP_THRESHOLD=0 and Julia 0.6.1 compiled with PROFILE_JL_THREADING=0:
$ JULIA_THREAD_SLEEP_THRESHOLD=0 julia-fixed -O3 SequentialMonteCarlo.jl/test/runbenchmarks.jl
Running Benchmarks: Tue, 28 Nov 2017 08:12:21
x86_64-linux-gnu ; Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz (haswell) ; 8 Physical, 16 Logical
Linear Gaussian Model, n = 10
log₂N Threads Benchmark
10 01 644.779 μs (0 allocations: 0 bytes)
10 16 159.169 μs (68 allocations: 3.03 KiB)
12 01 2.603 ms (0 allocations: 0 bytes)
12 16 346.625 μs (68 allocations: 3.03 KiB)
14 01 10.473 ms (0 allocations: 0 bytes)
14 16 984.768 μs (68 allocations: 3.03 KiB)
16 01 41.410 ms (0 allocations: 0 bytes)
16 16 3.673 ms (68 allocations: 3.03 KiB)
18 01 169.404 ms (0 allocations: 0 bytes)
18 16 19.562 ms (68 allocations: 3.03 KiB)
20 01 704.918 ms (0 allocations: 0 bytes)
20 16 100.459 ms (68 allocations: 3.03 KiB)
Multivariate Linear Gaussian Model, d = 1, n = 10
log₂N Threads Benchmark
10 01 1.591 ms (0 allocations: 0 bytes)
10 16 241.932 μs (68 allocations: 3.03 KiB)
12 01 6.250 ms (0 allocations: 0 bytes)
12 16 610.975 μs (68 allocations: 3.03 KiB)
14 01 24.829 ms (0 allocations: 0 bytes)
14 16 2.213 ms (68 allocations: 3.03 KiB)
16 01 97.583 ms (0 allocations: 0 bytes)
16 16 8.500 ms (68 allocations: 3.03 KiB)
18 01 407.901 ms (0 allocations: 0 bytes)
18 16 40.722 ms (68 allocations: 3.03 KiB)
20 01 1.596 s (0 allocations: 0 bytes)
20 16 167.738 ms (68 allocations: 3.03 KiB)
Multivariate Linear Gaussian Model, d = 2, n = 10
log₂N Threads Benchmark
10 01 1.781 ms (0 allocations: 0 bytes)
10 16 251.080 μs (68 allocations: 3.03 KiB)
12 01 7.323 ms (0 allocations: 0 bytes)
12 16 678.721 μs (68 allocations: 3.03 KiB)
14 01 28.616 ms (0 allocations: 0 bytes)
14 16 2.438 ms (68 allocations: 3.03 KiB)
16 01 113.704 ms (0 allocations: 0 bytes)
16 16 9.584 ms (68 allocations: 3.03 KiB)
18 01 469.149 ms (0 allocations: 0 bytes)
18 16 46.618 ms (68 allocations: 3.03 KiB)
20 01 1.872 s (0 allocations: 0 bytes)
20 16 191.293 ms (68 allocations: 3.03 KiB)
Multivariate Linear Gaussian Model, d = 4, n = 10
log₂N Threads Benchmark
10 01 1.999 ms (0 allocations: 0 bytes)
10 16 269.449 μs (68 allocations: 3.03 KiB)
12 01 7.959 ms (0 allocations: 0 bytes)
12 16 733.547 μs (68 allocations: 3.03 KiB)
14 01 31.521 ms (0 allocations: 0 bytes)
14 16 2.687 ms (68 allocations: 3.03 KiB)
16 01 124.232 ms (0 allocations: 0 bytes)
16 16 10.098 ms (68 allocations: 3.03 KiB)
18 01 511.506 ms (0 allocations: 0 bytes)
18 16 49.373 ms (68 allocations: 3.03 KiB)
20 01 2.052 s (0 allocations: 0 bytes)
20 16 204.563 ms (68 allocations: 3.03 KiB)
Multivariate Linear Gaussian Model, d = 8, n = 10
log₂N Threads Benchmark
10 01 1.967 ms (0 allocations: 0 bytes)
10 16 264.560 μs (68 allocations: 3.03 KiB)
12 01 7.878 ms (0 allocations: 0 bytes)
12 16 718.880 μs (68 allocations: 3.03 KiB)
14 01 31.305 ms (0 allocations: 0 bytes)
14 16 2.573 ms (68 allocations: 3.03 KiB)
16 01 123.987 ms (0 allocations: 0 bytes)
16 16 10.513 ms (68 allocations: 3.03 KiB)
18 01 509.400 ms (0 allocations: 0 bytes)
18 16 49.274 ms (68 allocations: 3.03 KiB)
20 01 2.045 s (0 allocations: 0 bytes)
20 16 202.500 ms (68 allocations: 3.03 KiB)
SMC Sampler Example, n = 12
log₂N Threads Benchmark
10 01 10.587 ms (0 allocations: 0 bytes)
10 16 880.215 μs (82 allocations: 3.66 KiB)
12 01 41.959 ms (0 allocations: 0 bytes)
12 16 3.149 ms (82 allocations: 3.66 KiB)
14 01 165.880 ms (0 allocations: 0 bytes)
14 16 12.580 ms (82 allocations: 3.66 KiB)
16 01 662.951 ms (0 allocations: 0 bytes)
16 16 48.526 ms (82 allocations: 3.66 KiB)
18 01 2.687 s (0 allocations: 0 bytes)
18 16 209.933 ms (82 allocations: 3.66 KiB)
20 01 10.740 s (0 allocations: 0 bytes)
20 16 841.519 ms (82 allocations: 3.66 KiB)
Finished: Tue, 28 Nov 2017 08:44:17
Finally, benchmarks with Julia 0.6.1 compiled with PROFILE_JL_THREADING=0 but not changing JULIA_THREAD_SLEEP_THRESHOLD:
$ julia-fixed -O3 SequentialMonteCarlo.jl/test/runbenchmarks.jl
Running Benchmarks: Tue, 28 Nov 2017 09:08:18
x86_64-linux-gnu ; Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz (haswell) ; 8 Physical, 16 Logical
Linear Gaussian Model, n = 10
log₂N Threads Benchmark
10 01 449.781 μs (0 allocations: 0 bytes)
10 16 765.745 μs (68 allocations: 3.03 KiB)
12 01 1.823 ms (0 allocations: 0 bytes)
12 16 890.482 μs (68 allocations: 3.03 KiB)
14 01 7.343 ms (0 allocations: 0 bytes)
14 16 1.537 ms (68 allocations: 3.03 KiB)
16 01 29.151 ms (0 allocations: 0 bytes)
16 16 4.305 ms (68 allocations: 3.03 KiB)
18 01 120.707 ms (0 allocations: 0 bytes)
18 16 18.588 ms (68 allocations: 3.03 KiB)
20 01 514.423 ms (0 allocations: 0 bytes)
20 16 101.536 ms (68 allocations: 3.03 KiB)
Multivariate Linear Gaussian Model, d = 1, n = 10
log₂N Threads Benchmark
10 01 941.256 μs (0 allocations: 0 bytes)
10 16 828.741 μs (68 allocations: 3.03 KiB)
12 01 3.798 ms (0 allocations: 0 bytes)
12 16 1.179 ms (68 allocations: 3.03 KiB)
14 01 15.628 ms (0 allocations: 0 bytes)
14 16 2.711 ms (68 allocations: 3.03 KiB)
16 01 59.672 ms (0 allocations: 0 bytes)
16 16 8.953 ms (68 allocations: 3.03 KiB)
18 01 244.600 ms (0 allocations: 0 bytes)
18 16 40.924 ms (68 allocations: 3.03 KiB)
20 01 988.874 ms (0 allocations: 0 bytes)
20 16 166.979 ms (68 allocations: 3.03 KiB)
Multivariate Linear Gaussian Model, d = 2, n = 10
log₂N Threads Benchmark
10 01 1.066 ms (0 allocations: 0 bytes)
10 16 853.256 μs (68 allocations: 3.03 KiB)
12 01 4.316 ms (0 allocations: 0 bytes)
12 16 1.267 ms (68 allocations: 3.03 KiB)
14 01 17.859 ms (0 allocations: 0 bytes)
14 16 3.092 ms (68 allocations: 3.03 KiB)
16 01 67.002 ms (0 allocations: 0 bytes)
16 16 9.840 ms (68 allocations: 3.03 KiB)
18 01 281.559 ms (0 allocations: 0 bytes)
18 16 46.449 ms (68 allocations: 3.03 KiB)
20 01 1.135 s (0 allocations: 0 bytes)
20 16 190.629 ms (68 allocations: 3.03 KiB)
Multivariate Linear Gaussian Model, d = 4, n = 10
log₂N Threads Benchmark
10 01 1.205 ms (0 allocations: 0 bytes)
10 16 877.002 μs (68 allocations: 3.03 KiB)
12 01 4.876 ms (0 allocations: 0 bytes)
12 16 1.336 ms (68 allocations: 3.03 KiB)
14 01 19.420 ms (0 allocations: 0 bytes)
14 16 3.198 ms (68 allocations: 3.03 KiB)
16 01 77.090 ms (0 allocations: 0 bytes)
16 16 10.708 ms (68 allocations: 3.03 KiB)
18 01 319.223 ms (0 allocations: 0 bytes)
18 16 49.946 ms (68 allocations: 3.03 KiB)
20 01 1.283 s (0 allocations: 0 bytes)
20 16 201.582 ms (68 allocations: 3.03 KiB)
Multivariate Linear Gaussian Model, d = 8, n = 10
log₂N Threads Benchmark
10 01 1.181 ms (0 allocations: 0 bytes)
10 16 870.926 μs (68 allocations: 3.03 KiB)
12 01 4.841 ms (0 allocations: 0 bytes)
12 16 1.302 ms (68 allocations: 3.03 KiB)
14 01 19.055 ms (0 allocations: 0 bytes)
14 16 3.184 ms (68 allocations: 3.03 KiB)
16 01 74.834 ms (0 allocations: 0 bytes)
16 16 10.758 ms (68 allocations: 3.03 KiB)
18 01 313.737 ms (0 allocations: 0 bytes)
18 16 51.431 ms (68 allocations: 3.03 KiB)
20 01 1.262 s (0 allocations: 0 bytes)
20 16 217.092 ms (68 allocations: 3.03 KiB)
SMC Sampler Example, n = 12
log₂N Threads Benchmark
10 01 7.422 ms (0 allocations: 0 bytes)
10 16 1.628 ms (82 allocations: 3.66 KiB)
12 01 29.615 ms (0 allocations: 0 bytes)
12 16 3.926 ms (82 allocations: 3.66 KiB)
14 01 116.803 ms (0 allocations: 0 bytes)
14 16 13.557 ms (82 allocations: 3.66 KiB)
16 01 467.073 ms (0 allocations: 0 bytes)
16 16 49.751 ms (82 allocations: 3.66 KiB)
18 01 1.898 s (0 allocations: 0 bytes)
18 16 213.336 ms (82 allocations: 3.66 KiB)
20 01 7.602 s (0 allocations: 0 bytes)
20 16 853.096 ms (82 allocations: 3.66 KiB)
Finished: Tue, 28 Nov 2017 09:30:16
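To put numbers on the differences, here is a small sketch that computes speedups relative to the stock build. It uses the 16-thread timings for the Linear Gaussian Model (n = 10), copied from the three runs above and converted to microseconds. The variable and function names are mine, for illustration only.

```julia
# 16-thread timings in microseconds, Linear Gaussian Model, n = 10.
# Columns: log₂N, stock 0.6.1, PROFILE_JL_THREADING=0,
#          PROFILE_JL_THREADING=0 plus JULIA_THREAD_SLEEP_THRESHOLD=0.
timings_us = [
    (10,  3437.0,   765.745,   159.169),
    (14,  4221.0,  1537.0,     984.768),
    (18, 21227.0, 18588.0,   19562.0),
]

# Speedup of a timing `t` relative to a `baseline` timing.
speedup(baseline, t) = baseline / t

for (log2N, stock, noprofile, nosleep) in timings_us
    println("log₂N = ", log2N,
            ": rebuilt = ", speedup(stock, noprofile), "x",
            ", rebuilt + no sleep = ", speedup(stock, nosleep), "x")
end
```

At log₂N = 10 the rebuilt binary is about 4.5x faster than stock, and about 21.6x faster with the sleep threshold also set to 0. By log₂N = 18 both are down to roughly 1.1x, which is consistent with the gain being a fixed per-call overhead rather than anything that scales with N.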