Hi! I am confused about running Julia code on a Slurm cluster.
I am solving a nonlinear PDE, so I have to solve it step by step. In each step, a linear system Ax=b (from FEM, so A is sparse) is solved with the \ operator, which uses the SuiteSparse package under the hood. On my personal laptop and on my office desktop, \ is automatically multithreaded even when run with ‘julia -t 1’.
Today I tried to run my code on a Slurm cluster. When I requested 4 cores on a single node, the output file looked weird: it looks as if my code is running on each core independently. Below are the script I use to submit the job and the resulting output file.
#!/bin/bash
#SBATCH -J Test
#SBATCH -p intel
#SBATCH -N 1
#SBATCH -n 4
#SBATCH --ntasks-per-node=4
hostname
srun julia -t 4 StaMmdSys.jl
sleep 100
The weird output file, in which each incremental step and iteration step is computed several times:
cpui03
Num Ste 1
Num Ste 1
Num Ste 1
Num Ste 1
Num Ite 1
ResErr 1.1815291149415338e-26 LoaFac 31.340781306503
Max Dam 0.0 Min Deg 1.000000000000001
Num Ste 2
Num Ite 1
ResErr 1.1815291149415338e-26 LoaFac 31.340781306503
Max Dam 0.0 Min Deg 1.000000000000001
Num Ste 2
Num Ite 1
ResErr 1.1815291149415338e-26 LoaFac 31.340781306503
Max Dam 0.0 Min Deg 1.000000000000001
Num Ite 1
ResErr 5.613078015512632e-27 LoaFac 62.681562613006
Max Dam 0.0 Min Deg 1.000000000000001
Num Ste 2
Num Ste 3
Num Ite 1
ResErr 1.1815291149415338e-26 LoaFac 31.340781306503
Max Dam 0.0 Min Deg 1.000000000000001
Num Ite 1
ResErr 5.613078015512632e-27 LoaFac 62.681562613006
Max Dam 0.0 Min Deg 1.000000000000001
Num Ste 3
Num Ste 2
Num Ite 1
ResErr 3.098702727267662e-7 LoaFac 94.022343919509
Max Dam 0.0008378798567051027 Min Deg 0.983676759929105
Num Ste 4
Num Ite 1
ResErr 3.098702727267662e-7 LoaFac 94.022343919509
Max Dam 0.0008378798567051027 Min Deg 0.983676759929105
Num Ste 4
Num Ite 1
ResErr 1.565185266046758e-5 LoaFac 125.36136059678219
Max Dam 0.010911503021434944 Min Deg 0.8866271918773956
Num Ite 1
ResErr 5.613078015512632e-27 LoaFac 62.681562613006
Max Dam 0.0 Min Deg 1.000000000000001
Num Ste 3
Num Ite 1
ResErr 5.613078015512632e-27 LoaFac 62.681562613006
Max Dam 0.0 Min Deg 1.000000000000001
Num Ste 3
Num Ite 1
ResErr 1.565185266046758e-5 LoaFac 125.36136059678219
Max Dam 0.010911503021434944 Min Deg 0.8866271918773956
Num Ite 1
ResErr 3.098702727267662e-7 LoaFac 94.022343919509
Max Dam 0.0008378798567051027 Min Deg 0.983676759929105
Num Ite 1
ResErr 3.098702727267662e-7 LoaFac 94.022343919509
Max Dam 0.0008378798567051027 Min Deg 0.983676759929105
Num Ite 2
ResErr 1.645042974830353e-7 LoaFac 125.17614239795542
Max Dam 0.01105091853894029 Min Deg 0.8842490683695091
Num Ste 4
Num Ste 4
Num Ite 2
ResErr 1.645042974830353e-7 LoaFac 125.17614239795542
Max Dam 0.01105091853894029 Min Deg 0.8842490683695091
Num Ste 5
Num Ite 1
ResErr 2.9130673485754505e-5 LoaFac 156.467617106916
Max Dam 0.04471510355125102 Min Deg 0.7560656522648532
Num Ite 1
ResErr 1.565185266046758e-5 LoaFac 125.36136059678219
Max Dam 0.010911503021434944 Min Deg 0.8866271918773956
Num Ste 5
Num Ite 2
ResErr 1.645042974830353e-7 LoaFac 125.17614239795542
Max Dam 0.01105091853894029 Min Deg 0.8842490683695091
Num Ste 5
Num Ite 1
ResErr 1.565185266046758e-5 LoaFac 125.36136059678219
Max Dam 0.010911503021434944 Min Deg 0.8866271918773956
Num Ite 1
ResErr 2.9130673485754505e-5 LoaFac 156.467617106916
Max Dam 0.04471510355125102 Min Deg 0.7560656522648532
Num Ite 2
ResErr 3.927810503996626e-6 LoaFac 155.8450166806799
Max Dam 0.056704006161101106 Min Deg 0.7223674694734912
Num Ite 2
ResErr 3.927810503996626e-6 LoaFac 155.8450166806799
Max Dam 0.056704006161101106 Min Deg 0.7223674694734912
Num Ite 3
ResErr 5.410640813519663e-7 LoaFac 155.657378580163
Max Dam 0.060797662055178815 Min Deg 0.7137250018025755
A normal output file should look like this (obtained on the cluster with only 1 core):
cpui03
Num Ste 1
Num Ite 1
ResErr 1.1815291149415338e-26 LoaFac 31.340781306503
Max Dam 0.0 Min Deg 1.000000000000001
Num Ste 2
Num Ite 1
ResErr 5.613078015512632e-27 LoaFac 62.681562613006
Max Dam 0.0 Min Deg 1.000000000000001
Num Ste 3
Num Ite 1
ResErr 3.098702727267662e-7 LoaFac 94.022343919509
Max Dam 0.0008378798567051027 Min Deg 0.983676759929105
Num Ste 4
Num Ite 1
ResErr 1.565185266046758e-5 LoaFac 125.36136059678219
Max Dam 0.010911503021434944 Min Deg 0.8866271918773956
Num Ite 2
ResErr 1.645042974830353e-7 LoaFac 125.17614239795542
Max Dam 0.01105091853894029 Min Deg 0.8842490683695091
Num Ste 5
Num Ite 1
ResErr 2.9130673485754505e-5 LoaFac 156.467617106916
Max Dam 0.04471510355125102 Min Deg 0.7560656522648532
Num Ite 2
ResErr 3.927810503996626e-6 LoaFac 155.8450166806799
Max Dam 0.056704006161101106 Min Deg 0.7223674694734912
Num Ite 3
ResErr 5.410640813519663e-7 LoaFac 155.657378580163
Max Dam 0.060797662055178815 Min Deg 0.7137250018025755
Num Ste 6
Num Ite 1
ResErr 7.935999964334881e-5 LoaFac 186.77680857130252
Max Dam 0.1238986630949555 Min Deg 0.5772076022373513
Num Ite 2
ResErr 1.1999112834906476e-5 LoaFac 185.54765605633034
Max Dam 0.15347340755070013 Min Deg 0.5289334799119432
Num Ite 3
ResErr 1.5354390771604877e-6 LoaFac 185.0418806205562
Max Dam 0.16195976821638491 Min Deg 0.5141634286765151
Num Ite 4
ResErr 4.0455113498668363e-7 LoaFac 184.83622847734313
Max Dam 0.16516550863977345 Min Deg 0.5067114390938846
I wonder whether I have to modify my code to run it properly on a Slurm cluster, or whether I should submit the job in a different manner?
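One possible explanation is that -n 4 requests four tasks and srun starts one copy of the command per task, which would match the four interleaved copies of each step in the output. A sketch of a submission script that asks for a single task with four cores instead is given here. It is untested on this cluster. The NTHREADS variable and the SLURM_JOB_ID guard are additions for illustration, and passing --cpus-per-task to srun explicitly is an assumption about how the installed Slurm version forwards that setting to job steps.

```shell
#!/bin/bash
#SBATCH -J Test
#SBATCH -p intel
#SBATCH -N 1
#SBATCH --ntasks=1           # a single task, so srun starts one Julia process
#SBATCH --cpus-per-task=4    # four cores reserved for that one process

# Match the Julia thread count to the allocation (falls back to 1 outside Slurm).
NTHREADS="${SLURM_CPUS_PER_TASK:-1}"
echo "host=$(hostname) tasks=${SLURM_NTASKS:-1} threads=$NTHREADS"

# Launch only inside a Slurm job, so the script does nothing harmful if run by hand.
# Assumption: srun needs --cpus-per-task repeated here; some Slurm versions inherit it.
if [ -n "$SLURM_JOB_ID" ]; then
  srun --cpus-per-task="$NTHREADS" julia -t "$NTHREADS" StaMmdSys.jl
fi
```

If the tasks= value printed at the top of the output file is still greater than 1, the job is still being split into several tasks, and each task would again run the whole script independently.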