Slowdown with multiple instances per node?

sebastian-steiner · May 27, 2020, 4:33pm

I am currently comparing MPI performance between C and Julia. With only a single task per node, Julia’s performance is pretty much exactly the same as C’s. But as soon as I use all 32 cores across both sockets, I see a slowdown by a factor of ~2 compared to C. Has anybody else observed something like this?

This only happens at larger message sizes, such as 1024000 bytes.

P.S.: In case it matters, I use Slurm with the parameter --ntasks-per-node 32 to run multiple instances per node.

Oscar_Smith · May 27, 2020, 5:09pm

Is your code using BLAS (matrix multiplication and the like)? If so, set the number of BLAS threads to 1 when you are running on all cores.
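
A minimal sketch of what that could look like, assuming the BLAS wrappers from Julia's LinearAlgebra standard library are in use. The call would go near the top of the script, before any linear algebra runs:

```julia
using LinearAlgebra

# Limit BLAS to one thread per Julia process, so that 32 processes
# on one node do not each start their own multi-threaded BLAS pool.
BLAS.set_num_threads(1)
```

This only matters if the code actually reaches BLAS; a loop that does nothing but MPI calls would probably not be affected by it.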

sebastian-steiner · May 27, 2020, 5:20pm

No, not really. I only call Allreduce in a tight loop:

send = zeros(UInt8, msize)
recv = zeros(UInt8, msize)
for i in 1:args.nrep
    # start sync
    MPI.Barrier(comm)

    times[i] = MPI.Wtime()
    MPI.Allreduce!(send, recv, msize, args.operation, comm)
    times[i] = MPI.Wtime() - times[i]
end

Could I still benefit from setting the number of BLAS threads to 1?
