@luraess I am having trouble getting good performance with this example, which uses the hide-comms functionality. I am running it on a few nodes, each composed of four 16-core CPUs, giving 128 threads per node (USE_GPU=false).
I launch it with a script like this:
for N in {1,2,3}
do
mpirun -n ${N} julia -t 128 ../travail/julia-heatdiff/scripts/run.jl
done
I get the following results; the effective bandwidth (T_eff) decreases as I increase the number of compute nodes:
Global grid: 512x512x512 (nprocs: 1, dims: 1x1x1)
time_s=5.118296146392822 T_eff=56.64195353063296
Global grid: 1022x512x512 (nprocs: 2, dims: 2x1x1)
time_s=5.26022481918335 T_eff=55.11366955699977
Global grid: 1532x512x512 (nprocs: 3, dims: 3x1x1)
time_s=8.300725936889648 T_eff=34.92589620283643
Global grid: 1022x1022x512 (nprocs: 4, dims: 2x2x1)
time_s=8.083126068115234 T_eff=35.86611046728351
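To put these numbers in perspective, here is a minimal sketch that computes the weak-scaling efficiency relative to the single-process run, E(N) = t(1) / t(N). It assumes each process keeps a 512x512x512 local grid (which the global grid sizes above suggest), so ideal scaling would mean a constant time_s. The helper name `weak_scaling_efficiency` is only illustrative and not part of the example, and the timings are the time_s values above rounded to 6 decimals.

```shell
#!/usr/bin/env bash
# Weak-scaling efficiency relative to the first input line: E(N) = t(1) / t(N).
# Input: one "nprocs time_s" pair per line, with the 1-process run first.
# Hypothetical helper, not part of the heat diffusion example.
weak_scaling_efficiency() {
  LC_ALL=C awk 'NR == 1 { t1 = $2 }
                { printf "nprocs=%d efficiency=%.1f%%\n", $1, 100 * t1 / $2 }'
}

# time_s values from the runs above, rounded to 6 decimals
weak_scaling_efficiency <<'EOF'
1 5.118296
2 5.260225
3 8.300726
4 8.083126
EOF
```

With these timings the efficiency stays at roughly 97% for 2 processes but falls to roughly 62% and 63% for 3 and 4 processes.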
Am I launching it wrong?