I am running my code on my university server. Typically I run a few random initial conditions on the same set of parameters.
However, some of these scripts run through, while others run for a while and then fail suddenly. Which ones run through seems rather random. I am sure this error is related to multithreading, since when I don't use multithreading, all my scripts run through.
The typical error message I get from Slurm is the following:
[132038] signal (11.1): Segmentation fault
in expression starting at /home/users/ttx2000/code/HF_Moire/data_collection/TMD_exciton_controlled/main_test.jl:43
Allocations: 7561571680 (Pool: 7561347099; Big: 224581); GC: 2943
/var/spool/slurmd/job44926069/slurm_script: line 23: 132038 Segmentation fault      julia /home/users/ttx2000/code/HF_Moire/data_collection/TMD_exciton_controlled/main_test.jl 0.35 0.4 0.35 8.0 270.0 14.0 190.0 10.0 80.0 5.0 220.0 2.0 0.0 2
The last line of this error shows the command; the numbers following main_test.jl are the parameters I pass to the script.
According to this error message, my understanding is that there is an error at line 43 of my main_test.jl, which calls a function. The only place that function uses multithreading is the following:
Threads.@threads for jk in 1:Nq^2
    Fk = FockMatrix[jk]
    for jk1 in 1:Nq^2
        dmk = input_DensityMatrix[jk1]
        q = allowedq[jk1] - allowedq[jk]
        for (dg, loop_dic_dg) in loop_dic
            CoulF = Coulomb(q + dg, T1, T2)
            for (gg2, loop_dic_dg_gg2) in loop_dic_dg
                for g1g3 in loop_dic_dg_gg2
                    Fk[1][g1g3[5]:g1g3[6], gg2[3]:gg2[4]] += dmk[1][g1g3[3]:g1g3[4], gg2[5]:gg2[6]] * CoulF
                    Fk[2][g1g3[5]:g1g3[6], gg2[3]:gg2[4]] += dmk[2][g1g3[3]:g1g3[4], gg2[5]:gg2[6]] * CoulF
                end
            end
        end
    end
end
So I multithread over the index jk. FockMatrix is a vector of vectors of matrices, so FockMatrix[jk] is a two-element vector of matrices. I let Fk equal FockMatrix[jk], then do some operations on Fk that do not depend on jk at all. I don't understand why I get segmentation faults seemingly at random.
FockMatrix[jk][vi] is roughly a 100-by-100 matrix; jk goes from 1 to 9 (this changes with my parameters and could be as large as 100), and vi goes from 1 to 2.
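To make the layout concrete, here is roughly how these containers are built (the sizes and element type here are illustrative, not my exact parameters):

Nq = 3    # so jk runs over 1:Nq^2 = 1:9; can be larger for other parameters
n = 100   # each block is roughly 100-by-100

# FockMatrix[jk] is a two-element vector of matrices:
# FockMatrix[jk][1] and FockMatrix[jk][2] are each n-by-n.
FockMatrix = [[zeros(ComplexF64, n, n), zeros(ComplexF64, n, n)] for _ in 1:Nq^2]
input_DensityMatrix = [[rand(ComplexF64, n, n), rand(ComplexF64, n, n)] for _ in 1:Nq^2]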
I assign 3 to 5 CPUs to each job; the exact number of CPUs used by each job is specified in my Slurm scripts. I never specify how many threads there are, as I assumed that when I use Threads.@threads, Julia will automatically use all the available threads.
No, I think the memory is sufficient.
It seems that when I specify fewer CPUs per job (for example 3, or even just 1; currently I use 5), I get less frequent segmentation faults. I think when I set CPUs per task to 1, I don't get segmentation faults at all, though occasionally I get an initialization error, but that's a separate issue.
The packages I use include LinearAlgebra, JLD2, Arpack, Combinatorics, and Plots.
The variable I call CoulF is just a number, if that's what you were worried about. dmk[1] and dmk[2] are matrices.
I am not actually sure how to do this. What I can do is write in my Slurm scripts how many CPUs I want to assign to the job.
But then how should I know how many threads I can use? (I used to think that the number of threads is just the number of CPUs.)
You need to start Julia with the correct number of threads. In your Slurm script you'll have a line that starts Julia, and there you need to pass the number of threads via the -t switch, like so:
julia -t 5 main_test.jl 0.35 0.4 ...
Inside your script you can do e.g.
@log "" Threads.nthreads()
to print out the number of threads that Julia has available.
Usually you’ll want 1 thread per physical CPU core. Hyperthreading complicates this a bit and it might be that using 2 threads per physical core (= 1 thread per logical core) is better for your workload. Generally, you need to experiment a bit with these things as they depend on your workload and hardware configuration.
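As a quick sanity check, Julia can also tell you how many logical cores it sees on the node (note this counts logical cores, so it includes hyperthreads):

println(Sys.CPU_THREADS)   # number of logical CPU cores visible to Julia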
Aside: Your inner loop allocates a lot of temporary arrays IIUC. Try using views and broadcasting:
@views for g1g3 in loop_dic_dg_gg2
    Fk[1][g1g3[5]:g1g3[6], gg2[3]:gg2[4]] .+= dmk[1][g1g3[3]:g1g3[4], gg2[5]:gg2[6]] .* CoulF
    Fk[2][g1g3[5]:g1g3[6], gg2[3]:gg2[4]] .+= dmk[2][g1g3[3]:g1g3[4], gg2[5]:gg2[6]] .* CoulF
end
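To see the difference yourself, you can compare the two forms with @allocated on a small standalone example (the arrays and sizes below are made up for illustration; run each measurement twice so compilation doesn't skew the numbers):

A = zeros(100, 100)
B = rand(100, 100)
c = 0.5

# Slicing on the right-hand side materializes temporary copies:
alloc_slice = @allocated A[1:50, 1:50] += B[1:50, 1:50] * c

# With @views and broadcasting the update happens in place:
alloc_view = @allocated @views A[1:50, 1:50] .+= B[1:50, 1:50] .* c

println((alloc_slice, alloc_view))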
I would advise running a Slurm job, and after you give the job parameters simply run:
env
env | grep SLURM
The second will show you all the Slurm-related environment variables which your batch system is configuring for you.
I always run an 'env' when constructing batch scripts - it really does show what is going on 'under the hood'.
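You can also read these variables from inside Julia, which is handy for logging what Slurm actually gave your job (a small sketch; exactly which variables are set depends on your cluster configuration):

for var in ("SLURM_JOB_ID", "SLURM_CPUS_ON_NODE", "SLURM_CPUS_PER_TASK", "SLURM_NTASKS")
    println(var, " = ", get(ENV, var, "<unset>"))
end
println("Threads.nthreads() = ", Threads.nthreads())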
There are a number of ways to set up a Slurm cluster, but most likely you should use the SLURM_CPUS_ON_NODE environment variable. I.e., in your Slurm script you should start Julia with:
julia -t $SLURM_CPUS_ON_NODE ...
or set the environment variable:
export JULIA_NUM_THREADS=$SLURM_CPUS_ON_NODE
Julia will otherwise use all the cores available on the node, unless Slurm has been set up to start your job with CPU-affinity limits (which Julia honours). Though, I find it a bit weird that you run out of memory with such relatively small matrices, even with many threads.
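If you want to guard against a mismatch, a few lines at the top of your script can catch it early (a sketch, assuming SLURM_CPUS_ON_NODE is the variable your cluster sets):

# Warn if Julia has more threads than the Slurm CPU allocation.
slurm_cpus = parse(Int, get(ENV, "SLURM_CPUS_ON_NODE", "1"))
if Threads.nthreads() > slurm_cpus
    @warn "More Julia threads than allocated CPUs" nthreads=Threads.nthreads() slurm_cpus
end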
Hi, I reviewed the way I wrote my Slurm scripts; here is the function I use for writing and submitting sbatch files. I think I have set the number of threads, though. Is what I am doing here wrong?
function submit_job(filepath, dirpath, job_prefix, args; nodes=1, ntasks=1, time="00:120:00", cpus_per_task=1, mem=64, partition="owners,simes")
    outpath = joinpath(dirpath, "out")
    slurmpath = joinpath(dirpath, "slurmfiles")
    mkpath(outpath)
    mkpath(slurmpath)
    name = "$(args[1])mt$(args[2])mm$(args[3])mb$(args[4])Vt$(args[5])phit$(args[6])Vm$(args[7])phim$(args[8])Vb$(args[9])phib$(args[10])er$(args[11])Eg$(args[12])theta$(args[13])w$(args[14])holenum$(args[15])trytime$(args[16])Nq$(args[17])seed"
    filestr = """#!/bin/bash
#SBATCH --job-name=$(job_prefix*"_"*name)
#SBATCH --partition=$partition
#SBATCH --time=$time
#SBATCH --nodes=$nodes
#SBATCH --ntasks=$ntasks
#SBATCH --cpus-per-task=$cpus_per_task
#SBATCH --mem=$(mem)G
#SBATCH --mail-type=BEGIN,FAIL,END
#SBATCH --mail-user=________
#SBATCH --output=$outpath/$(job_prefix*"_"*name)_output.txt
#SBATCH --error=$outpath/$(job_prefix*"_"*name)_error.txt
#SBATCH --open-mode=append
#SBATCH --sockets-per-node=2
# load Julia module
ml julia/1.10.0
# multithreading
export JULIA_NUM_THREADS=\$SLURM_CPUS_ON_NODE
# run the script
julia $filepath $(args[1]) $(args[2]) $(args[3]) $(args[4]) $(args[5]) $(args[6]) $(args[7]) $(args[8]) $(args[9]) $(args[10]) $(args[11]) $(args[12]) $(args[13]) $(args[14]) $(args[15]) $(args[16]) $(args[17])"""
    open("$slurmpath/$(name).slurm", "w") do io
        write(io, filestr)
    end
    run(`sbatch $(slurmpath)/$(name).slurm`)
end
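For reference, I call it along these lines (the paths and values below are placeholders, not an actual run):

args = collect(1.0:17.0)   # placeholder; the function expects 17 parameters
submit_job("path/to/main_test.jl", "path/to/rundir", "testjob", args; cpus_per_task=5, mem=64)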
Hi,
the version is 1.10.0. I believe it was installed by others on my university server.
And I just posted the function with which I write and submit the sbatch files.
I think I did set the number of threads with that export line, but I still get random segmentation faults.
Could you report the output of Threads.nthreads() before and after you set JULIA_NUM_THREADS?
Also, reporting the full output of versioninfo() would be very useful to figure out who compiled or installed Julia for you, or whether there are any modifications that could be interfering with how Julia works. It could be that there is an error in how Julia was compiled or in which libraries Julia is loading.
If you are not running Julia in interactive mode, you might need to manually load InteractiveUtils:
using InteractiveUtils
versioninfo()
The easiest way to do this might be to start an interactive Slurm job. Please see your cluster documentation for the best way to do this.
Also, if there is a simpler, self-contained script that uses Threads.@threads and also segfaults, that would be a very useful minimal working example.
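For instance, something along these lines, stripped of the physics but mimicking the access pattern of your loop, would be a good starting point (whether this particular sketch still reproduces the crash is of course the question):

Nq, n = 3, 100
F = [[zeros(n, n), zeros(n, n)] for _ in 1:Nq^2]
D = [[rand(n, n), rand(n, n)] for _ in 1:Nq^2]

# Each task writes only into its own F[jk], reading shared D.
Threads.@threads for jk in 1:Nq^2
    Fk = F[jk]
    for jk1 in 1:Nq^2, v in 1:2
        Fk[v][1:50, 1:50] += D[jk1][v][1:50, 1:50] * 0.5
    end
end
println("done with ", Threads.nthreads(), " threads")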