For distributed matmul, set the number of BLAS threads to 1 in each worker process with BLAS.set_num_threads(1); otherwise every worker starts its own multithreaded BLAS pool and you’ll oversubscribe the cores.
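A minimal sketch of how that might look with Julia's Distributed standard library. The worker count of 4 is an arbitrary assumption for illustration, and BLAS.get_num_threads assumes Julia 1.6 or newer; adapt both to your setup.

```julia
using Distributed

# Assumption: 4 local workers, chosen only for illustration.
# Pick a count that matches your physical cores.
addprocs(4)

# Load LinearAlgebra on the master and on every worker.
@everywhere using LinearAlgebra

# Limit BLAS to a single thread in every process.
@everywhere BLAS.set_num_threads(1)

# Ask each process what it now reports (BLAS.get_num_threads needs Julia 1.6+).
thread_counts = [remotecall_fetch(() -> BLAS.get_num_threads(), p) for p in procs()]
```

The call has to run on the workers themselves, which is why it is wrapped in @everywhere: setting it only on the master process leaves each worker's BLAS at its default thread count.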