Slowdown with multiple instances per node?

I am currently comparing MPI performance between C and Julia. With a single task per node, Julia's performance is essentially identical to C's, but as soon as I utilize all 32 cores on both sockets, I see a slowdown of a factor of ~2 compared to C. Has anybody else observed something like this?

This happens only at larger message sizes, e.g. 1024000 bytes.

P.S.: I use Slurm with the parameter --ntasks-per-node 32 to run multiple instances per node, in case that matters.

Is your code using BLAS (matrix multiplication and the like)? If so, set the number of BLAS threads to 1 when you are running on all cores.
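Something along these lines near the top of the script does it (a minimal sketch):

using LinearAlgebra            # the BLAS interface lives in LinearAlgebra

BLAS.set_num_threads(1)        # one BLAS thread per MPI rank, so ranks don't oversubscribe cores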

No, not really. I only call Allreduce in a tight loop:

send = zeros(UInt8, msize)   # send buffer of msize bytes
recv = zeros(UInt8, msize)   # receive buffer of msize bytes
for i in 1:args.nrep
    # synchronize all ranks before timing
    MPI.Barrier(comm)

    # time a single Allreduce of msize bytes
    times[i] = MPI.Wtime()
    MPI.Allreduce!(send, recv, msize, args.operation, comm)
    times[i] = MPI.Wtime() - times[i]
end
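
For completeness, here is a self-contained sketch of the same measurement, assuming the current MPI.jl signature of Allreduce! (send buffer, receive buffer, operation, communicator) and with placeholder values for msize, nrep, and the reduction operator:

using MPI
using Statistics

MPI.Init()
comm = MPI.COMM_WORLD

msize = 1_024_000              # message size in bytes (placeholder)
nrep  = 100                    # number of repetitions (placeholder)

send  = zeros(UInt8, msize)
recv  = zeros(UInt8, msize)
times = zeros(nrep)

for i in 1:nrep
    MPI.Barrier(comm)          # synchronize all ranks before timing
    times[i] = MPI.Wtime()
    MPI.Allreduce!(send, recv, MPI.SUM, comm)   # placeholder reduction op
    times[i] = MPI.Wtime() - times[i]
end

if MPI.Comm_rank(comm) == 0
    println("median Allreduce time: ", median(times), " s")
end

MPI.Finalize()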

Could I still benefit from setting BLAS threads to 1?