Why does passing a distributed matrix to functions hurt performance?

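For distributed matmul, you should set the number of BLAS threads in each worker process to 1 with BLAS.set_num_threads(1). Otherwise every worker starts its own pool of BLAS threads, and with one worker per core you’ll oversubscribe the CPU.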
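A minimal sketch of how that setting might be applied on every process, assuming local workers started with addprocs (the worker count of 2 is an arbitrary choice for illustration, not something from your setup):

```julia
using Distributed
using LinearAlgebra

# Start local workers only if none exist yet (2 is an arbitrary, assumed count).
nprocs() == 1 && addprocs(2)

# Load LinearAlgebra on every process, then pin BLAS to a single thread on each.
@everywhere using LinearAlgebra
@everywhere BLAS.set_num_threads(1)

# Ask each worker how many BLAS threads it now reports.
blas_threads = [remotecall_fetch(() -> BLAS.get_num_threads(), w) for w in workers()]
println(blas_threads)
```

Note that @everywhere also runs on the master process, so it ends up single-threaded for BLAS as well. If the master does heavy linear algebra of its own, you may prefer to set the thread count only on workers().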