Julia seems to be running multithreaded when I don't want it to

I coded an algorithm called CGS2, i.e., re-orthogonalized classical Gram-Schmidt, which I need to benchmark. I absolutely need my test to run on a single thread. To make sure this is the case, I write the following at the beginning of my script:

using LinearAlgebra: BLAS
BLAS.set_num_threads(1)

If I run the code on a machine with 2 cores, it takes around 190 s. Now, if I load an LSF machine with 72 cores, which I need in order to have more memory, the same code runs in 40 s. It is as if Julia were running multithreaded despite my instruction to run on a single thread.

What can I do to solve this issue?

Are you basing your conclusion only on the observation that the same code takes different amounts of time on two different machines? Did you actually monitor resource usage with a tool like top or htop while the code was running? Is it possible that a single CPU on the 72-core machine is simply faster than the CPUs on your 2-core machine?

5 Likes

What about GC time? It may be that GC has to run more frequently on the machine with less memory.
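
If it helps, the fraction of the runtime spent in GC can be measured with @timed (a minimal sketch; assumes the CGS2 and X from the script posted further down):

# @timed returns a named tuple with fields value, time, bytes, gctime, ...
stats = @timed CGS2(X)
println("total: ", stats.time, " s, GC: ", stats.gctime, " s (",
        round(100 * stats.gctime / stats.time, digits=1), "%)")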

I did not check with htop. However, I have an optimized serial version of the code in C, which I also run on the machine with 72 cores, and the Julia code, which is less optimized, runs 3X faster than the C code.

Actually, I have so many cores on the LSF machine that htop cannot display the processes; the whole display is filled with processor gauges…

But you aren’t comparing apples to apples. You’re comparing

  • one code on two different machines,
  • two different codes (are you 1000% certain they’re identical?) on the same machine.

You can’t really draw any conclusion from these observations.

To ensure Julia itself runs with a single thread, export the environment variable

export JULIA_NUM_THREADS=1

before starting julia, and inside the julia process run Threads.nthreads() to get the number of threads. But this has no effect on what external libraries might be doing. You're calling BLAS.set_num_threads(1), which is good for the BLAS library, but beyond that we don't know what you're doing: whether you're calling other external libraries, whether they're multithreaded, what the bottleneck of your code is, etc.
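
A quick sanity check from inside the session could look like this (a minimal sketch; BLAS.get_num_threads requires Julia ≥ 1.6):

using LinearAlgebra: BLAS
println("Julia threads: ", Threads.nthreads())      # set via JULIA_NUM_THREADS or julia -t
println("BLAS threads:  ", BLAS.get_num_threads())  # set via BLAS.set_num_threads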

1 Like

When I run the C code and the Julia code on the 2-core machine, the C code is faster, which is what I expect. Only when I use the 72-core machine does the Julia code become 3X faster than the C code. No library is used in the Julia code, and MKL is used in the C code. Assuming OpenBLAS (used by default in Julia) and MKL have similar speeds, the C code should be a bit faster, as the Julia code carries out a few extra computations to avoid creating temporary arrays through array slicing… I tried

export JULIA_NUM_THREADS=1

and I still observe the same behavior. I am very confused.

Have you considered the possibility that the Julia code doesn't do what you think, and that there's a bug which makes it appear faster than it should?

Could you change the number of BLAS threads and check how that impacts the runtime? This should tell you whether BLAS is indeed multithreaded.
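
For example (a minimal sketch; assumes the CGS2 and X from the script posted further down):

using LinearAlgebra: BLAS

for p in (1, 2, 4)
    BLAS.set_num_threads(p)  # request p BLAS threads
    t = @elapsed CGS2(X)
    println("BLAS threads = $p: $t s")
end

If the runtime drops as p grows, the BLAS calls are indeed running multithreaded.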

Both the Julia and the C code have been proofread by several developers, and I'm pretty confident they do exactly what I intend. For a while this result led me to think that the C code was maybe not as optimized as it could be. I had several developers look at it and no one could significantly improve its runtime. So now I'm left with a Julia code that's 3X faster than my C code while it's supposed to be slower. I'm very confused…

In case it helps someone figure out my issue, this is the Julia code:

using LinearAlgebra: norm, I

function CGS2(X::Array{Float64,2})
    n, m = size(X)
    Q = zeros(Float64, n, m)
    R = zeros(Float64, m, m)
    r = zeros(Float64, m)
    w = zeros(Float64, n)
    R[1, 1] = norm(X[:, 1])
    Q[:, 1] = X[:, 1] ./ R[1, 1]
    for i in 2:m
      w .= X[:, i]
      r .= (w'Q)'
      r[i:m] .= 0.
      R[:, i] = r
      w .-= Q * r
      r .= (w'Q)'
      r[i:m] .= 0.
      w .-= Q * r
      R[i, i] = norm(w)
      Q[:, i] = w ./ R[i, i]
    end
    return Q, R
end

n = 1_000_000
m = 250
X = rand(n, m)

dt_CGS2 = @elapsed Q, R = CGS2(X)
println("dt_CGS2 = $dt_CGS2 sec")
print("extrema(Q'Q - I): ")
println(extrema(Q'Q - I))
print("extrema(X - Q * R): ")
println(extrema(X - Q * R))
println()
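
Note that the call to CGS2 timed by @elapsed also includes JIT compilation, since it is the first call. A warm-up call on a tiny input would exclude that (a minimal sketch; at runtimes of tens of seconds the difference is small):

CGS2(X[:, 1:2])             # warm-up call: compiles CGS2 on a small problem
dt_CGS2 = @elapsed CGS2(X)  # now times only the computation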

And this is my C code:

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <omp.h>
#include <sys/time.h>
#include <stdbool.h>
#include <mkl.h>

void CGS2(int n, int m, double *X, double *Q, double *R);

/*
void CGS2(int n, int m, double *X, double *Q, double *R)
Computes the classical Gram-Schmidt orthogonalization with 
re-orthogonalization of m n-vectors.

In:
  - int n: Dimension of the vectors.
  - int m: Number of vectors to orthogonalize.
  - double *X: m n-dimensional vectors assumed linearly independent
               and stored by columns in an unrolled column major array.

Out:
  - double *Q: Matrix Q of QR decomposition of X stored by columns in an 
               unrolled column major array.
  - double *R: Matrix R of QR decomposition of X stored by columns in an 
               unrolled column major array.
*/
void CGS2(int n, int m, double *X, double *Q, double *R) {
  double *Rtmp = (double*)malloc(m * sizeof(double));
  R[0] = sqrt(cblas_ddot(n, &X[0], 1, &X[0], 1));
  cblas_dcopy(n, &X[0], 1, &Q[0], 1);
  cblas_dscal(n, 1./R[0], &Q[0], 1);
  for (int j=1; j<m; j++)
    R[j] = 0.;
  for (int j=1; j<m; j++) {
    cblas_dgemv(CblasColMajor, CblasTrans, n, j, 1., &Q[0], n, &X[j * n], 1, 0., &R[j * m], 1);
    cblas_dcopy(n, &X[j * n], 1, &Q[j * n], 1);
    cblas_dgemv(CblasColMajor, CblasNoTrans, n, j, -1., &Q[0], n, &R[j * m], 1, 1., &Q[j * n], 1);
    cblas_dgemv(CblasColMajor, CblasTrans, n, j, 1., &Q[0], n, &Q[j * n], 1, 0., Rtmp, 1);
    cblas_dgemv(CblasColMajor, CblasNoTrans, n, j, -1., &Q[0], n, Rtmp, 1, 1., &Q[j * n], 1);
    R[j * m + j] = sqrt(cblas_ddot(n, &Q[j * n], 1, &Q[j * n], 1));
    cblas_dscal(n, 1./R[j * m + j], &Q[j * n], 1);
    for (int i=j+1; i<m; i++)
      R[j * m + i] = 0.;
  }
  free(Rtmp);
}
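
/* matrix_2norm is called from main below but was missing from the listing.
   This is a minimal reconstruction under the assumption that it returns the
   2-norm (largest singular value) of an n x m column-major matrix; it is
   computed here with LAPACKE_dgesvd from MKL. dgesvd overwrites its input,
   so we work on a copy. */
double matrix_2norm(double *A, int n, int m) {
  double *Acopy = (double*)mkl_malloc((size_t)n * m * sizeof(double), 64);
  cblas_dcopy(n * m, A, 1, Acopy, 1);
  int k = n < m ? n : m;
  double *s = (double*)mkl_malloc(k * sizeof(double), 64);
  double *superb = (double*)mkl_malloc((k > 1 ? k - 1 : 1) * sizeof(double), 64);
  LAPACKE_dgesvd(LAPACK_COL_MAJOR, 'N', 'N', n, m, Acopy, n, s,
                 NULL, 1, NULL, 1, superb);
  double nrm = s[0];  // singular values are returned in descending order
  mkl_free(Acopy);
  mkl_free(s);
  mkl_free(superb);
  return nrm;
}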

int main(int argc, char *argv[]) {

  int n, m, P;
  n = 1000000;
  m = 250;
  P = 1;

  // Generate random vectors
  // X contains m n-dimensional vectors stored by columns in an unrolled column major array
  double *X = (double*)mkl_malloc(n * m * sizeof(double), 64);
  for (int i=0; i<m; i++) {
    for (int j=0; j<n; j++) {
      X[i * n + j] = rand() / (double) RAND_MAX;
    }
  }

  // Set the number of OpenMP threads
  omp_set_num_threads(P);
  
  // Allocate data structures for QR decompositions
  double *Q = (double*)mkl_malloc(n * m * sizeof(double), 64);
  double *R = (double*)mkl_malloc(m * m * sizeof(double), 64);
  double *QtQ = (double*)mkl_malloc(m * m * sizeof(double), 64);
  double *QR = (double*)mkl_malloc(n * m * sizeof(double), 64);
  double dQtQ, dX;

  long seconds, microseconds;
  double dt;
  struct timeval begin, end;
  
  gettimeofday(&begin, 0);
  CGS2(n, m, X, Q, R);
  gettimeofday(&end, 0);
  seconds = end.tv_sec - begin.tv_sec;
  microseconds = end.tv_usec - begin.tv_usec;  
  dt = seconds + microseconds * 1e-6;
  printf("Time elapsed for CGS2: %f sec.\n", dt);

  cblas_dgemm(CblasColMajor, CblasTrans, CblasNoTrans, m, m, n, 1., Q, n, Q, n, 0., QtQ, m);
  for (int i=0; i<m; i++)
    QtQ[i * m + i] -= 1.;
  dQtQ = matrix_2norm(QtQ, m, m);
  printf("||I - Q^TQ||_2 = %g\n", dQtQ);
  
  cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, n, m, m, 1., Q, n, R, m, 0., QR, n);
  cblas_daxpy(n * m, -1., X, 1, QR, 1);
  dX = matrix_2norm(QR, n, m);
  printf("||X - QR||_2 = %g\n", dX);

  mkl_free(X);
  mkl_free(Q);
  mkl_free(R);
  mkl_free(QtQ);
  mkl_free(QR);

  return 0;
}

Can you share how you call the CGS2 function? The size of the array will matter. Perhaps the larger machine has a larger cache?

And if htop doesn’t work, what is the output of
ps aux | grep julia

?

Thanks for posting the code in both languages; that's very helpful. It would help if you could run the full matrix of languages and machines:

  • set JULIA_NUM_THREADS=1 in your environment everywhere
  • run the C version on both machines and report those times
  • run the Julia version on both machines and report those times

The sample code I posted above shows exactly how CGS2 is called. It's something like this for the Julia code:

n = 1_000_000
m = 250
X = rand(n, m)
dt_CGS2 = @elapsed Q, R = CGS2(X)

Yes, I have problems displaying anything other than the processor gauges with htop. The command ps aux | grep julia returns

nxf93431 27747 0.0 0.0 9104 684 pts/21 SN+ 10:02 0:00 grep --color=auto julia
on the 72-core machine.

nxf93431 73958 0.0 0.0 9096 680 pts/3 S+ 10:02 0:00 grep --color=auto julia
on the 2-core machine.

I made a mistake. To see the Julia threads (if any) you need to type:

ps auxH | grep julia

If I run the following script:

using LinearAlgebra: norm, I, BLAS

BLAS.set_num_threads(1)

function CGS2(X::Array{Float64,2})
    n, m = size(X)
    Q = zeros(Float64, n, m)
    R = zeros(Float64, m, m)
    r = zeros(Float64, m)
    w = zeros(Float64, n)
    R[1, 1] = norm(X[:, 1])
    Q[:, 1] = X[:, 1] ./ R[1, 1]
    for i in 2:m
      w .= X[:, i]
      r .= (w'Q)'
      r[i:m] .= 0.
      R[:, i] = r
      w .-= Q * r
      r .= (w'Q)'
      r[i:m] .= 0.
      w .-= Q * r
      R[i, i] = norm(w)
      Q[:, i] = w ./ R[i, i]
    end
    return Q, R
end

n = 1_000_000
m = 250
X = rand(n, m)
dt_CGS2 = @elapsed Q, R = CGS2(X)

using the command:

julia -t 1 bench.jl 

and in a second terminal:

ps auxH | grep julia

I get the output:

ufechner   45390  101 18.0 6224772 4339936 pts/0 Rl+  17:24   0:04 julia -t 1 bench.jl
ufechner   45390  0.0 18.0 6224772 4339936 pts/0 Sl+  17:24   0:00 julia -t 1 bench.jl
ufechner   45390  2.5 18.0 6224772 4339936 pts/0 Sl+  17:24   0:00 julia -t 1 bench.jl
ufechner   45390  2.2 18.0 6224772 4339936 pts/0 Sl+  17:24   0:00 julia -t 1 bench.jl
ufechner   45390  2.2 18.0 6224772 4339936 pts/0 Sl+  17:24   0:00 julia -t 1 bench.jl
ufechner   45390  2.2 18.0 6224772 4339936 pts/0 Sl+  17:24   0:00 julia -t 1 bench.jl
ufechner   45390  2.5 18.0 6224772 4339936 pts/0 Sl+  17:24   0:00 julia -t 1 bench.jl
ufechner   45390  2.2 18.0 6224772 4339936 pts/0 Sl+  17:24   0:00 julia -t 1 bench.jl
ufechner   45390  2.2 18.0 6224772 4339936 pts/0 Sl+  17:24   0:00 julia -t 1 bench.jl
ufechner   45390  2.0 18.0 6224772 4339936 pts/0 Sl+  17:24   0:00 julia -t 1 bench.jl
ufechner   45390  2.2 18.0 6224772 4339936 pts/0 Sl+  17:24   0:00 julia -t 1 bench.jl
ufechner   45390  2.5 18.0 6224772 4339936 pts/0 Sl+  17:24   0:00 julia -t 1 bench.jl
ufechner   45390  2.0 18.0 6224772 4339936 pts/0 Sl+  17:24   0:00 julia -t 1 bench.jl
ufechner   45405  0.0  0.0  11788  2448 pts/1    S+   17:24   0:00 grep --color=auto julia

This is confusing…

OK, but htop shows that only one core is active…

So ps auxH | grep julia is not the correct way to see how many cores are used… It lists every thread of the process, and a Julia process always starts a number of extra threads besides the compute threads (GC, signal handling, the BLAS thread pool), most of which sit idle.

I set JULIA_NUM_THREADS=1 everywhere and ran the full matrix of languages and machines. Here are the results:

2-core machine:
Julia code: 166.49 s
C code: 89.68 s

72-core machine:
Julia code: 35.96 s
C code: 84.54 s

Can you share the output of:

versioninfo()

on both machines?

IIUC you are using OpenBLAS; did you try setting the number of OpenBLAS threads to 1?

Here is what is returned by versioninfo() on each machine:

2-core machine:

Julia Version 1.8.1
Commit afb6c60d69a (2022-09-06 15:09 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 2 × Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, haswell)
  Threads: 1 on 2 virtual cores
Environment:
  LD_LIBRARY_PATH = /proj/advt_mica9/tools/pkg/intel-oneAPI-/2022/vpl/2022.1.0/lib:/proj/advt_mica9/tools/pkg/intel-oneAPI-/2022/tbb/2021.6.0/env/../lib/intel64/gcc4.8:/proj/advt_mica9/tools/pkg/intel-oneAPI-/2022/mpi/2021.6.0//libfabric/lib:/proj/advt_mica9/tools/pkg/intel-oneAPI-/2022/mpi/2021.6.0//lib/release:/proj/advt_mica9/tools/pkg/intel-oneAPI-/2022/mpi/2021.6.0//lib:/proj/advt_mica9/tools/pkg/intel-oneAPI-/2022/mkl/2022.1.0/lib/intel64:/proj/advt_mica9/tools/pkg/intel-oneAPI-/2022/itac/2021.6.0/slib:/proj/advt_mica9/tools/pkg/intel-oneAPI-/2022/ippcp/2021.6.0/lib/intel64:/proj/advt_mica9/tools/pkg/intel-oneAPI-/2022/ipp/2021.6.0/lib/intel64:/proj/advt_mica9/tools/pkg/intel-oneAPI-/2022/dnnl/2022.1.0/cpu_dpcpp_gpu_dpcpp/lib:/proj/advt_mica9/tools/pkg/intel-oneAPI-/2022/debugger/2021.6.0/gdb/intel64/lib:/proj/advt_mica9/tools/pkg/intel-oneAPI-/2022/debugger/2021.6.0/libipt/intel64/lib:/proj/advt_mica9/tools/pkg/intel-oneAPI-/2022/debugger/2021.6.0/dep/lib:/proj/advt_mica9/tools/pkg/intel-oneAPI-/2022/dal/2021.6.0/lib/intel64:/proj/advt_mica9/tools/pkg/intel-oneAPI-/2022/compiler/2022.1.0/linux/lib:/proj/advt_mica9/tools/pkg/intel-oneAPI-/2022/compiler/2022.1.0/linux/lib/x64:/proj/advt_mica9/tools/pkg/intel-oneAPI-/2022/compiler/2022.1.0/linux/lib/oclfpga/host/linux64/lib:/proj/advt_mica9/tools/pkg/intel-oneAPI-/2022/compiler/2022.1.0/linux/compiler/lib/intel64_lin:/proj/advt_mica9/tools/pkg/intel-oneAPI-/2022/ccl/2021.6.0/lib/cpu_gpu_dpcpp
  JULIA_NUM_THREADS = 1

72-core machine:

Julia Version 1.8.1
Commit afb6c60d69a (2022-09-06 15:09 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 72 × Intel(R) Xeon(R) Gold 6254 CPU @ 3.10GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, cascadelake)
  Threads: 1 on 72 virtual cores
Environment:
  LD_LIBRARY_PATH = /proj/advt_mica9/tools/pkg/intel-oneAPI-/2022/vpl/2022.1.0/lib:/proj/advt_mica9/tools/pkg/intel-oneAPI-/2022/tbb/2021.6.0/env/../lib/intel64/gcc4.8:/proj/advt_mica9/tools/pkg/intel-oneAPI-/2022/mpi/2021.6.0//libfabric/lib:/proj/advt_mica9/tools/pkg/intel-oneAPI-/2022/mpi/2021.6.0//lib/release:/proj/advt_mica9/tools/pkg/intel-oneAPI-/2022/mpi/2021.6.0//lib:/proj/advt_mica9/tools/pkg/intel-oneAPI-/2022/mkl/2022.1.0/lib/intel64:/proj/advt_mica9/tools/pkg/intel-oneAPI-/2022/itac/2021.6.0/slib:/proj/advt_mica9/tools/pkg/intel-oneAPI-/2022/ippcp/2021.6.0/lib/intel64:/proj/advt_mica9/tools/pkg/intel-oneAPI-/2022/ipp/2021.6.0/lib/intel64:/proj/advt_mica9/tools/pkg/intel-oneAPI-/2022/dnnl/2022.1.0/cpu_dpcpp_gpu_dpcpp/lib:/proj/advt_mica9/tools/pkg/intel-oneAPI-/2022/debugger/2021.6.0/gdb/intel64/lib:/proj/advt_mica9/tools/pkg/intel-oneAPI-/2022/debugger/2021.6.0/libipt/intel64/lib:/proj/advt_mica9/tools/pkg/intel-oneAPI-/2022/debugger/2021.6.0/dep/lib:/proj/advt_mica9/tools/pkg/intel-oneAPI-/2022/dal/2021.6.0/lib/intel64:/proj/advt_mica9/tools/pkg/intel-oneAPI-/2022/compiler/2022.1.0/linux/lib:/proj/advt_mica9/tools/pkg/intel-oneAPI-/2022/compiler/2022.1.0/linux/lib/x64:/proj/advt_mica9/tools/pkg/intel-oneAPI-/2022/compiler/2022.1.0/linux/lib/oclfpga/host/linux64/lib:/proj/advt_mica9/tools/pkg/intel-oneAPI-/2022/compiler/2022.1.0/linux/compiler/lib/intel64_lin:/proj/advt_mica9/tools/pkg/intel-oneAPI-/2022/ccl/2021.6.0/lib/cpu_gpu_dpcpp:/opt/lsf/lib
  JULIA_NUM_THREADS = 1