I am currently comparing MPI performance between C and Julia. With only a single task per node, Julia’s performance is pretty much identical to C’s. But as soon as I use all 32 cores across both sockets, Julia is slower than C by a factor of ~2. Has anybody else observed something like this?
This only happens at larger message sizes, such as 1024000 bytes.
P.S.: I use Slurm with the parameter --ntasks-per-node 32 to have multiple instances per node, if that matters.
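
For reference, the job setup is roughly the following. This is only a sketch: the node count and the program names (bench_c, bench.jl) are placeholders I made up for illustration, and only the --ntasks-per-node value is the actual setting.

```shell
#!/bin/bash
#SBATCH --nodes=2               # assumption: placeholder node count
#SBATCH --ntasks-per-node=32    # one MPI rank per core; set to 1 for the single-task case

# Hypothetical program names; substitute the real C binary and Julia script.
srun ./bench_c
srun julia bench.jl
```

Switching --ntasks-per-node between 1 and 32 is the only difference between the two cases compared above.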