I have written a version of my code that uses MPI.jl for potentially large-scale parallelization. I already had MPI installed on my PC and followed the instructions from here to make sure that MPI.jl and the system use the same `mpiexec`.
I tested the code on my PC to confirm that it runs and produces the correct outputs, which agree with the serial version; they can be found in my other thread, where I also posted the full code. The only change is that I used `dd=11` in the benchmark below, which I ran on the cluster.
However, when I run the job on the cluster, I get the following error:
[details="Error"]
[cn152][[15784,1],2][btl_openib_component.c:1705:init_one_device] error obtaining device attributes for mlx5_0 errno says No space left on device
[cn152][[15784,1],33][btl_openib_component.c:1705:init_one_device] error obtaining device attributes for mlx5_0 errno says No space left on device
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.
Local host: cn152
Local device: mlx5_0
--------------------------------------------------------------------------
[cn152][[15784,1],39][btl_openib_component.c:1705:init_one_device] error obtaining device attributes for mlx5_0 errno says No space left on device
[cn152][[15784,1],31][btl_openib_component.c:1705:init_one_device] error obtaining device attributes for mlx5_0 errno says No space left on device
[cn152][[15784,1],29][btl_openib_component.c:1705:init_one_device] error obtaining device attributes for mlx5_0 errno says No space left on device
[cn152][[15784,1],23][btl_openib_component.c:1705:init_one_device] error obtaining device attributes for mlx5_0 errno says No space left on device
[cn152][[15784,1],5][btl_openib_component.c:1705:init_one_device] error obtaining device attributes for mlx5_0 errno says No space left on device
[cn152][[15784,1],20][btl_openib_component.c:1705:init_one_device] error obtaining device attributes for mlx5_0 errno says No space left on device
[cn152][[15784,1],17][btl_openib_component.c:1705:init_one_device] error obtaining device attributes for mlx5_0 errno says No space left on device
[cn152][[15784,1],13][btl_openib_component.c:1705:init_one_device] error obtaining device attributes for mlx5_0 errno says No space left on device
[cn152][[15784,1],35][btl_openib_component.c:1705:init_one_device] error obtaining device attributes for mlx5_0 errno says No space left on device
[cn152][[15784,1],9][btl_openib_component.c:1705:init_one_device] error obtaining device attributes for mlx5_0 errno says No space left on device
[cn152][[15784,1],18][btl_openib_component.c:1705:init_one_device] error obtaining device attributes for mlx5_0 errno says No space left on device
[cn152][[15784,1],32][btl_openib_component.c:1705:init_one_device] error obtaining device attributes for mlx5_0 errno says No space left on device
[cn152][[15784,1],38][btl_openib_component.c:1705:init_one_device] error obtaining device attributes for mlx5_0 errno says No space left on device
[cn152][[15784,1],26][btl_openib_component.c:1705:init_one_device] error obtaining device attributes for mlx5_0 errno says No space left on device
[cn152][[15784,1],30][btl_openib_component.c:1705:init_one_device] error obtaining device attributes for mlx5_0 errno says No space left on device
[cn152][[15784,1],15][btl_openib_component.c:1705:init_one_device] error obtaining device attributes for mlx5_0 errno says No space left on device
[cn152][[15784,1],6][btl_openib_component.c:1705:init_one_device] error obtaining device attributes for mlx5_0 errno says No space left on device
[cn152][[15784,1],37][btl_openib_component.c:1705:init_one_device] error obtaining device attributes for mlx5_0 errno says No space left on device
[cn152][[15784,1],11][btl_openib_component.c:1705:init_one_device] error obtaining device attributes for mlx5_0 errno says No space left on device
[cn152][[15784,1],22][btl_openib_component.c:1705:init_one_device] error obtaining device attributes for mlx5_0 errno says No space left on device
[cn152][[15784,1],24][btl_openib_component.c:1705:init_one_device] error obtaining device attributes for mlx5_0 errno says No space left on device
[cn152][[15784,1],27][btl_openib_component.c:1705:init_one_device] error obtaining device attributes for mlx5_0 errno says No space left on device
[cn152][[15784,1],1][btl_openib_component.c:1705:init_one_device] error obtaining device attributes for mlx5_0 errno says No space left on device
[cn152][[15784,1],19][btl_openib_component.c:1705:init_one_device] error obtaining device attributes for mlx5_0 errno says No space left on device
[cn152][[15784,1],8][btl_openib_component.c:1705:init_one_device] error obtaining device attributes for mlx5_0 errno says No space left on device
[cn152][[15784,1],36][btl_openib_component.c:1705:init_one_device] error obtaining device attributes for mlx5_0 errno says No space left on device
[cn152][[15784,1],7][btl_openib_component.c:1705:init_one_device] error obtaining device attributes for mlx5_0 errno says No space left on device
[cn152][[15784,1],14][btl_openib_component.c:1705:init_one_device] error obtaining device attributes for mlx5_0 errno says No space left on device
[cn152][[15784,1],28][btl_openib_component.c:1705:init_one_device] error obtaining device attributes for mlx5_0 errno says No space left on device
[cn152][[15784,1],21][btl_openib_component.c:1705:init_one_device] error obtaining device attributes for mlx5_0 errno says No space left on device
[cn152][[15784,1],34][btl_openib_component.c:1705:init_one_device] error obtaining device attributes for mlx5_0 errno says No space left on device
[cn152][[15784,1],25][btl_openib_component.c:1705:init_one_device] error obtaining device attributes for mlx5_0 errno says No space left on device
[cn152][[15784,1],3][btl_openib_component.c:1705:init_one_device] error obtaining device attributes for mlx5_0 errno says No space left on device
[cn152][[15784,1],4][btl_openib_component.c:1705:init_one_device] error obtaining device attributes for mlx5_0 errno says No space left on device
[cn152][[15784,1],10][btl_openib_component.c:1705:init_one_device] error obtaining device attributes for mlx5_0 errno says No space left on device
[cn152][[15784,1],12][btl_openib_component.c:1705:init_one_device] error obtaining device attributes for mlx5_0 errno says No space left on device
[cn152][[15784,1],0][btl_openib_component.c:1705:init_one_device] error obtaining device attributes for mlx5_0 errno says No space left on device
[cn152][[15784,1],16][btl_openib_component.c:1705:init_one_device] error obtaining device attributes for mlx5_0 errno says No space left on device
[cn152:447286] 39 more processes have sent help message help-mpi-btl-openib.txt / error in device init
[cn152:447286] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
ERROR: ERROR: ERROR: ERROR: ERROR: ERROR: ERROR: ERROR: ERROR: ERROR: ERROR: ERROR: ERROR: ERROR: ERROR: ERROR: ERROR: ERROR: ERROR: ERROR: ERROR: ERROR: ERROR: ERROR: ERROR: LoadError: LoadError: LoadError: LoadError: LoadError: LoadError: LoadError: LoadError: ERROR: LoadError: LoadError: LoadError: LoadError: ERROR: LoadError: LoadError: LoadError: LoadError: LoadError: ERROR: LoadError: LoadError: LoadError: LoadError: ERROR: LoadError: LoadError: LoadError: LoadError: LoadError: LoadError: LoadError: LoadError: ArgumentError: ArgumentError: ichunk must be less or equal to nchunks
Stacktrace:
ArgumentError: ichunk must be less or equal to nchunks
Stacktrace:
[1] [1] getchunk(getchunk(array::array::ichunk must be less or equal to nchunks
Stacktrace:
ArgumentError: ichunk must be less or equal to nchunks
Stacktrace:
ArgumentError: [1] ichunk must be less or equal to nchunks
Stacktrace:
getchunk(array:: [1] getchunk(array:: [1] getchunkArgumentError: ichunk must be less or equal to nchunks
.
.
.
.
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpiexec detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[15784,1],21]
Exit code: 1
--------------------------------------------------------------------------
[/details]
I used `mpiexec -n 40 julia mpi_parallel_timeloop.jl` in the SLURM script to run my code.
It seems like the system is out of disk space (?). But it is also showing an error about `ichunk` (`ichunk must be less or equal to nchunks`), which did not occur when I ran the code on my PC.
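To make the second error concrete, here is a minimal sketch of one way it could arise. It assumes (unconfirmed) that each MPI rank requests chunk `rank + 1` while `nchunks` stays fixed at a value chosen for a smaller run. The stack trace only shows that the message comes from `getchunk`; the functions `chunk_range` and `failing_ranks` and the value `8` are hypothetical stand-ins, not the library's code or my actual settings.

```julia
# Hypothetical stand-in for the bounds check behind the error message.
# This is NOT the library's implementation, only an illustration.
function chunk_range(n::Int, ichunk::Int, nchunks::Int)
    ichunk <= nchunks ||
        throw(ArgumentError("ichunk must be less or equal to nchunks"))
    # Split 1:n into nchunks contiguous pieces; the first r pieces get one extra element.
    q, r = divrem(n, nchunks)
    lo = (ichunk - 1) * q + min(ichunk - 1, r) + 1
    hi = lo + q - 1 + (ichunk <= r ? 1 : 0)
    return lo:hi
end

# Assumption: rank k (0-based) asks for chunk k + 1.
# Returns the ranks whose request would throw the ArgumentError.
function failing_ranks(n::Int, nranks::Int, nchunks::Int)
    failed = Int[]
    for rank in 0:nranks-1
        try
            chunk_range(n, rank + 1, nchunks)
        catch err
            err isa ArgumentError || rethrow()
            push!(failed, rank)
        end
    end
    return failed
end

# 40 ranks (as in `mpiexec -n 40`) against a hypothetical nchunks of 8:
println(failing_ranks(100, 40, 8))   # ranks that would hit the error
# With nchunks equal to the number of ranks, no rank fails:
println(failing_ranks(100, 40, 40))
```

If something like this is what happens, the `ichunk` error would be unrelated to the OpenFabrics messages and would depend only on how `nchunks` compares with the number of ranks.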