Julia seems to be running multithreaded when I don't want it to

Does Threads.nthreads() return 1 inside the REPL too?

Starting Julia with

julia -t 1

might also be a good idea…

Or put

@assert Threads.nthreads() == 1

at the beginning of your script…

I believe OpenBLAS should respect the number of Julia threads but you may want to set its variable separately to be sure.

% julia -t 1 -E 'using LinearAlgebra; BLAS.get_num_threads()'   
4
2 Likes

Hmm. That seems not ideal and possibly like a regression?

1 Like

I think we always defaulted to multithreaded BLAS.

4 Likes

Should we? If you want Julia to only use one thread, why would you want BLAS to use more?

2 Likes
julia -t 1 -E 'using LinearAlgebra; BLAS.get_num_threads()'

36

Yes.

But my script does contain:

using LinearAlgebra: BLAS
BLAS.set_num_threads(1)

Plus, I tried starting it with julia -t 1, and the execution is still multithreaded. So I really do not understand what's going on.

Okay. Problem solved. It was my fault, sorry about that.

It turns out the CGS2 function was inside a module, and so was the line BLAS.set_num_threads(1); I was using this module from my script. I finally moved the BLAS.set_num_threads(1) line from the module into the script that calls it, and everything is fine now: the execution is finally single-threaded.
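For anyone hitting the same thing, a likely explanation: top-level code in a module body runs during precompilation, and side effects like this are not replayed when the module is later loaded, so a bare BLAS.set_num_threads(1) in the module is lost at runtime. If the call really should live in the module, a minimal sketch (module name is hypothetical) using __init__, which does run at load time, would be:

```julia
module MyOrtho  # hypothetical module name
using LinearAlgebra: BLAS

# A bare `BLAS.set_num_threads(1)` here would execute during
# precompilation only; its effect would not be replayed on `using MyOrtho`.

function __init__()
    # __init__ runs each time the module is loaded at runtime,
    # so the setting actually takes effect in the calling session.
    BLAS.set_num_threads(1)
end

end # module
```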

7 Likes

And how does Julia compare to C? Perhaps add using MKL at the top of your Julia code if your C code is also using MKL…

Edit:
On my machine, when using one thread OpenBLAS is actually faster than MKL:
1:33 with OpenBLAS and 1:43 with MKL…
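For completeness, a minimal sketch of switching backends (assumes the MKL.jl package is installed; `using MKL` should come before heavy LinearAlgebra use):

```julia
using MKL                # swaps the BLAS/LAPACK backend to MKL; load it early
using LinearAlgebra
BLAS.get_config()        # should now report the MKL library instead of OpenBLAS
BLAS.set_num_threads(1)  # thread control works the same way with MKL
```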

2 Likes

Not supported by my measurements:

Threads.nthreads() = 1
BLAS.get_num_threads() = 1
[ Info: mgsortho!(A)
(m, n) = size(A) = (1000000, 250)
150.167704 seconds (142.55 k allocations: 7.550 MiB, 0.16% compilation time)
norm(A' * A - LinearAlgebra.I, Inf) = 8.326672684688674e-15
[ Info: qr!(A)
 44.681267 seconds (262.59 k allocations: 3.739 GiB, 0.10% gc time, 0.66% compilation time)
norm(A' * A - LinearAlgebra.I, Inf) = 6.217248937900877e-15
[ Info: CGS2_alt(A)
126.559068 seconds (2.74 M allocations: 2.004 GiB, 0.11% gc time, 1.52% compilation time)
norm(A' * A - LinearAlgebra.I, Inf) = 5.995204332975845e-15

Threads.nthreads() = 2
BLAS.get_num_threads() = 2
[ Info: mgsortho!(A)
(m, n) = size(A) = (1000000, 250)
150.459621 seconds (144.57 k allocations: 7.692 MiB, 0.15% compilation time)
norm(A' * A - LinearAlgebra.I, Inf) = 6.9823416119036905e-15
[ Info: qr!(A)
 36.173330 seconds (262.59 k allocations: 3.739 GiB, 0.12% gc time, 0.79% compilation time)
norm(A' * A - LinearAlgebra.I, Inf) = 5.551115123125783e-15
[ Info: CGS2_alt(A)
159.016720 seconds (2.74 M allocations: 2.004 GiB, 0.09% gc time, 1.18% compilation time)
norm(A' * A - LinearAlgebra.I, Inf) = 5.773159728050814e-15

Threads.nthreads() = 8
BLAS.get_num_threads() = 8
[ Info: mgsortho!(A)
(m, n) = size(A) = (1000000, 250)
118.425464 seconds (154.03 k allocations: 8.474 MiB, 0.22% compilation time)
norm(A' * A - LinearAlgebra.I, Inf) = 8.265606691389323e-15
[ Info: qr!(A)
 29.127472 seconds (262.59 k allocations: 3.739 GiB, 0.16% gc time, 1.00% compilation time)
norm(A' * A - LinearAlgebra.I, Inf) = 5.995204332975845e-15
[ Info: CGS2_alt(A)
147.615824 seconds (2.74 M allocations: 2.004 GiB, 0.10% gc time, 1.29% compilation time)
norm(A' * A - LinearAlgebra.I, Inf) = 5.773159728050814e-15

Threads.nthreads() = 32
BLAS.get_num_threads() = 32
[ Info: mgsortho!(A)
(m, n) = size(A) = (1000000, 250)
118.304363 seconds (190.72 k allocations: 11.566 MiB, 0.22% compilation time)
norm(A' * A - LinearAlgebra.I, Inf) = 8.43769498715119e-15
[ Info: qr!(A)
 38.874669 seconds (262.59 k allocations: 3.739 GiB, 0.12% gc time, 0.75% compilation time)
norm(A' * A - LinearAlgebra.I, Inf) = 5.995204332975845e-15
[ Info: CGS2_alt(A)
184.848760 seconds (2.74 M allocations: 2.004 GiB, 0.08% gc time, 1.04% compilation time)
norm(A' * A - LinearAlgebra.I, Inf) = 6.217248937900877e-15

The function mgsortho! runs faster as the number of threads increases, while CGS2_alt runs slower with more threads.

I'm pretty sure that if you were to run this on a distributed-memory machine, your CGS2 would eventually be faster than MGS in parallel, due to the high cost of synchronization in MGS. CGS2 is often preferred over MGS in communication-avoiding implementations. Also, CGS2 relies on BLAS-2, unlike MGS; this gives more efficient data movement and eventually compensates for the difference in FLOP count.
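The poster's CGS2_alt isn't shown in this thread, but to illustrate the BLAS-2 point: each CGS2 step projects against all previous columns at once with matrix-vector products, instead of one dot/axpy pair per column as in MGS. A generic sketch (not the CGS2_alt timed above):

```julia
using LinearAlgebra

# Classical Gram-Schmidt with reorthogonalization (CGS2).
# The projections use matrix-vector products (BLAS-2), so each column
# touches all previous columns in one sweep rather than j-1 separate
# dot/axpy pairs as in MGS.
function cgs2(X::Matrix{Float64})
    n, m = size(X)
    Q = copy(X)
    R = zeros(m, m)
    R[1, 1] = norm(@view(Q[:, 1]))
    Q[:, 1] ./= R[1, 1]
    for j in 2:m
        q = @view Q[:, j]
        Qprev = @view Q[:, 1:j-1]
        for _ in 1:2                      # two passes: the "2" in CGS2
            r = Qprev' * q                # BLAS-2 gemv
            mul!(q, Qprev, r, -1.0, 1.0)  # q .-= Qprev * r, in place
            R[1:j-1, j] .+= r
        end
        R[j, j] = norm(q)
        q ./= R[j, j]
    end
    return Q, R
end
```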

It is very weird that your CGS2 is faster than MGS on a single core, and then slower on multiple cores. This goes against my understanding of these algorithms.

Here is my MGS code in Julia:

using LinearAlgebra: norm

function MGS(X::Array{Float64,2})
  n, m = size(X)
  Q = Array{Float64,2}(undef, n, m)
  q = Vector{Float64}(undef, n)
  R = zeros(Float64, m, m)
  w = Vector{Float64}(undef, n)
  R[1, 1] = norm(X[:, 1])
  Q[:, 1] = X[:, 1] ./ R[1, 1]
  for i in 2:m 
    w .= X[:, i]
    for j in 1:i-1
      q .= Q[:, j]
      R[j, i] = q'w
      w .-= R[j, i] .* q
    end
    R[i, i] = norm(w)
    Q[:, i] = w ./ R[i, i]
  end
  return Q, R
end

I ran MGS and CGS2 with 1, 2, 4, 8, 16 and 32 cores with n = 1_000_000, m = 250 and X = rand(n, m). Here are the results I get:

1 core
MGS: 162.43s
CGS2: 169.78s

2 cores
MGS: 167.40s
CGS2: 91.04s

4 cores
MGS: 170.67s
CGS2: 49.24s

8 cores
MGS: 149.25s
CGS2: 33.92s

16 cores
MGS: 239.04s
CGS2: 32.80s

32 cores
MGS: 296.23s
CGS2: 29.36s

Here is my MGS code in C:

void MGS(int n, int m, double *X, double *Q, double *R) {
  R[0] = sqrt(cblas_ddot(n, &X[0], 1, &X[0], 1));
  cblas_dcopy(n, &X[0], 1, &Q[0], 1);
  cblas_dscal(n, 1./R[0], &Q[0], 1);
  for (int j=1; j<m; j++)
    R[j] = 0.;
  for (int j=1; j<m; j++) {
    cblas_dcopy(n, &X[j * n], 1, &Q[j * n], 1);
    for (int i=0; i<j; i++) {
      R[j * m + i] = cblas_ddot(n, &Q[j * n], 1, &Q[i * n], 1);
      cblas_daxpy(n, -R[j * m + i], &Q[i * n], 1, &Q[j * n], 1);
    }
    R[j * m + j] = sqrt(cblas_ddot(n, &Q[j * n], 1, &Q[j * n], 1));
    for (int i=j+1; i<m; i++)
      R[j * m + i] = 0.;
    cblas_dscal(n, 1./R[j * m + j], &Q[j * n], 1);
  }
}

Again, I ran MGS and CGS2 with 1, 2, 4, 8, 16 and 32 cores with n = 1_000_000, m = 250 and X = rand(n, m). Here are the results I get in C:

1 core
MGS: 64.74s
CGS2: 87.42s

2 cores
MGS: 31.51s
CGS2: 40.09s

4 cores
MGS: 18.96s
CGS2: 22.58s

8 cores
MGS: 9.35s
CGS2: 14.47s

16 cores
MGS: 4.92s
CGS2: 10.19s

32 cores
MGS: 5.28s
CGS2: 13.05s

These results are not great. The Julia results are weird for two reasons. First, the single-threaded MGS is barely faster than CGS2; I would expect the difference to be larger, since the FLOP count of CGS2 is double that of MGS. Second, for a reason I don't understand, the multithreaded MGS does not seem to be parallelized at all. In C, multithreading does impact both algorithms. However, we cannot really say that CGS2 scales better than MGS with the number of threads. If instead the parallelization were done on a distributed-memory machine, I am pretty sure that CGS2 would scale better with the number of processes than MGS, due to the larger synchronization cost of MGS. Unfortunately, I currently only have access to shared-memory machines and cannot show it to you.

Did you try the code suggested above at Julia seems to be running multithreaded when I don't want it to - #22 by mikmoore? In your Julia code you leave lots of performance on the table. If you have many memory allocations in hot loops already in single-threaded code, then when you go to multiple threads you're going to spend more time in garbage collection than in actual computation.
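To make the allocation point concrete: in the MGS posted above, every slice like X[:, i] or Q[:, j] on a right-hand side allocates a fresh length-n vector on each inner iteration. A sketch of the same loop with views and in-place BLAS-1 calls, so the inner loop allocates nothing (it orthogonalizes X in place instead of writing into a separate Q):

```julia
using LinearAlgebra: norm, dot, axpy!

# Sketch: modified Gram-Schmidt with views and in-place updates.
# Overwrites X with the orthonormal factor and returns R separately.
function mgs!(X::Matrix{Float64})
    n, m = size(X)
    R = zeros(m, m)
    @views for i in 1:m
        for j in 1:i-1
            R[j, i] = dot(X[:, j], X[:, i])
            axpy!(-R[j, i], X[:, j], X[:, i])  # X[:,i] -= R[j,i] * X[:,j]
        end
        R[i, i] = norm(X[:, i])
        X[:, i] ./= R[i, i]
    end
    return X, R
end
```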

I'll check this with more attention. Thanks for the suggestion.

Interesting, that is for sure. The MGS may have too many allocations. Here are my results:

Threads.nthreads() = 1
BLAS.get_num_threads() = 1
[ Info: mgsortho!(A)
(m, n) = size(A) = (1000000, 250)
265.021534 seconds (142.55 k allocations: 7.550 MiB, 0.09% compilation time)
norm(A' * A - LinearAlgebra.I, Inf) = 8.659739592076221e-15
[ Info: qr!(A)
 52.406777 seconds (262.59 k allocations: 3.739 GiB, 0.10% gc time, 0.64% compilation time)
norm(A' * A - LinearAlgebra.I, Inf) = 4.440892098500626e-15
[ Info: CGS2_alt(A)
127.722590 seconds (2.74 M allocations: 2.004 GiB, 0.14% gc time, 1.55% compilation time)
norm(A' * A - LinearAlgebra.I, Inf) = 4.218847493575595e-15
[ Info: MGS(A)
395.876514 seconds (63.26 k allocations: 237.511 GiB, 0.11% gc time)
norm(A' * A - LinearAlgebra.I, Inf) = 1.6686904814747563e-14
Threads.nthreads() = 2
BLAS.get_num_threads() = 2
[ Info: mgsortho!(A)
(m, n) = size(A) = (1000000, 250)
206.413285 seconds (144.42 k allocations: 7.687 MiB, 0.11% compilation time)
norm(A' * A - LinearAlgebra.I, Inf) = 8.43769498715119e-15
[ Info: qr!(A)
 52.939834 seconds (262.59 k allocations: 3.739 GiB, 0.11% gc time, 0.59% compilation time)
norm(A' * A - LinearAlgebra.I, Inf) = 5.773159728050814e-15
[ Info: CGS2_alt(A)
270.572132 seconds (2.74 M allocations: 2.004 GiB, 0.08% gc time, 0.77% compilation time)
norm(A' * A - LinearAlgebra.I, Inf) = 5.773159728050814e-15
[ Info: MGS(A)
518.225836 seconds (63.26 k allocations: 237.511 GiB, 0.10% gc time)
norm(A' * A - LinearAlgebra.I, Inf) = 6.837092402108523e-15
Threads.nthreads() = 8
BLAS.get_num_threads() = 8
[ Info: mgsortho!(A)
(m, n) = size(A) = (1000000, 250)
148.832363 seconds (153.70 k allocations: 8.464 MiB, 0.17% compilation time)
norm(A' * A - LinearAlgebra.I, Inf) = 7.105427357601002e-15
[ Info: qr!(A)
 34.746100 seconds (262.59 k allocations: 3.739 GiB, 0.16% gc time, 0.92% compilation time)
norm(A' * A - LinearAlgebra.I, Inf) = 4.551914400963142e-15
[ Info: CGS2_alt(A)
191.672508 seconds (2.74 M allocations: 2.004 GiB, 0.09% gc time, 1.08% compilation time)
norm(A' * A - LinearAlgebra.I, Inf) = 4.6629367034256575e-15
[ Info: MGS(A)
514.607145 seconds (63.26 k allocations: 237.511 GiB, 0.11% gc time)
norm(A' * A - LinearAlgebra.I, Inf) = 4.6629367034256575e-15
Threads.nthreads() = 32
BLAS.get_num_threads() = 32
[ Info: mgsortho!(A)
(m, n) = size(A) = (1000000, 250)
 88.613310 seconds (190.33 k allocations: 11.554 MiB, 0.31% compilation time)
norm(A' * A - LinearAlgebra.I, Inf) = 7.255341177836699e-15
[ Info: qr!(A)
 38.081088 seconds (262.59 k allocations: 3.739 GiB, 0.16% gc time, 0.79% compilation time)
norm(A' * A - LinearAlgebra.I, Inf) = 6.994405055138486e-15
[ Info: CGS2_alt(A)
186.352147 seconds (2.74 M allocations: 2.004 GiB, 0.10% gc time, 1.09% compilation time)
norm(A' * A - LinearAlgebra.I, Inf) = 6.5503158452884236e-15
[ Info: MGS(A)
487.275284 seconds (63.26 k allocations: 237.511 GiB, 0.11% gc time)
norm(A' * A - LinearAlgebra.I, Inf) = 6.661338147750939e-15

FWIW, you could also have set the environment variable OPENBLAS_NUM_THREADS=1 outside of Julia to control the number of BLAS threads.
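For example, in the shell before launching Julia (this assumes the default OpenBLAS backend; the variable has no effect on MKL):

```shell
# Pin OpenBLAS to one thread for everything launched from this shell,
# then start Julia as usual, e.g.: julia -t 1 script.jl
export OPENBLAS_NUM_THREADS=1
echo "OPENBLAS_NUM_THREADS=$OPENBLAS_NUM_THREADS"
```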

2 Likes