I’m not necessarily new to Julia, but I am new to parallel processing and writing fast code. I am trying to optimize this piece of code:
using LinearAlgebra
using Base.Threads: @spawn

BLAS.set_num_threads(16)
MAX_CONC = 4  # limit to 4 concurrent tasks

@sync begin
    sem = Base.Semaphore(MAX_CONC)
    for i in 1:length(clusters)
        Base.acquire(sem)  # block until one of the MAX_CONC slots is free
        @spawn begin
            try
                for p in 1:2*N  # loop over particle numbers
                    for totSz in range(max(-p, -2 * N + p), min(p + 1, 2 * N - p + 1), step=2)
                        states = genBasisStates(N, p, totSz)  # create basis states
                        hubHam = HubbardMatrix(U, t, tPrime, N, clusters[i], states)  # Hamiltonian matrix
                        eigenvalues, eigenvecs = eigen(hubHam)
                        # ... physics / thermal averaging stuff ...
                    end
                end
            finally
                Base.release(sem)
            end
        end
    end
end
Basically, I am looping over different cluster geometries, building the Hamiltonian for each, getting its eigenvalues, and doing some thermal averaging. What confuses me most is how many threads eigen will use. For example, if I set the BLAS thread count to 16, will all concurrent eigen calls together use at most 16 threads, or will each call use 16? When I looked at how many cores were busy, the highest I saw was 19. This may also not be the best way to do parallel calculations, but it is what I got to work.
Another thing I am not sure about is how many cores I should be using. The HPC system I am on has 32 cores per node, so should I try to use all 32 at the same time or leave a few unused? Any tips are welcome!
The BLAS library has its own thread pool, independent of Julia’s. Unfortunately, this means that you can’t currently do multithreaded BLAS calls in parallel from multiple Julia tasks, as I understand it; it will effectively do a single BLAS call at a time.
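You can see that the two pools are configured independently:

```julia
using LinearAlgebra
Threads.nthreads()       # number of Julia threads (set with julia -t N)
BLAS.get_num_threads()   # size of BLAS's own, separate thread pool
```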
It may be better to set the number of BLAS threads to 1 and do lots of serial BLAS calls in parallel.
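A minimal sketch of what I mean, reusing the names from your post (clusters, genBasisStates, HubbardMatrix, and so on) and parallelizing over clusters with Julia threads; start Julia with as many threads as cores you want to use, e.g. julia -t 32:

```julia
using LinearAlgebra
using Base.Threads: @threads

BLAS.set_num_threads(1)  # each eigen call is now single-threaded

@threads for i in eachindex(clusters)  # Julia-level parallelism over clusters
    for p in 1:2*N
        for totSz in range(max(-p, -2 * N + p), min(p + 1, 2 * N - p + 1), step=2)
            states = genBasisStates(N, p, totSz)
            hubHam = HubbardMatrix(U, t, tPrime, N, clusters[i], states)
            eigenvalues, eigenvecs = eigen(hubHam)
            # ... thermal averaging ...
        end
    end
end
```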
BLAS threading has very strange interactions with Julia. It also depends on which BLAS you use, e.g. MKL.jl behaves differently from the default OpenBLAS. See here for a write-up.
That being said: I did similar simulations. Here is my suggestion: if your total memory permits it, you’ll want to do what @stevengj suggested and use Julia’s threads while setting BLAS.set_num_threads(1). This has the lowest overhead and is also simple to code. If your Hamiltonians get too large and you can’t fit enough of them into memory at once, then you’ll need to use Distributed.jl to spawn several Julia processes (workers) and have each of them use a couple of threads for BLAS, such that the total number of BLAS threads matches the number of cores. This works because each worker has its own “instance” of BLAS, so you are “parallelizing” it that way.
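Roughly like this (a sketch, not tested; solve_cluster is a placeholder for the inner loops from the original post, and the 8 × 4 split assumes a 32-core node):

```julia
using Distributed
addprocs(8)  # 8 worker processes

@everywhere begin
    using LinearAlgebra
    BLAS.set_num_threads(4)  # each worker has its own BLAS pool: 8 × 4 = 32 threads total

    # placeholder: build the Hamiltonians for one cluster, diagonalize,
    # and do the thermal averaging, as in the original loop
    function solve_cluster(cluster)
        # ...
    end
end

results = pmap(solve_cluster, clusters)  # distribute clusters over the workers
```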
Alternatively, you can try MKL.jl and keep using Julia threads (MKL.jl spawns BLAS threads for each Julia thread, so you’ll end up with N×M threads when using N Julia threads and BLAS.set_num_threads(M)). However, I found that this doesn’t work well in practice, mostly because the GC tends to block things a lot when there are long-running external operations. So the worker-process approach is more robust and also doesn’t depend on the specific behavior of a particular BLAS implementation.
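For reference, if you want to try the MKL route anyway, the setup is just:

```julia
using MKL            # switch the BLAS backend from OpenBLAS to MKL
using LinearAlgebra
BLAS.set_num_threads(4)  # with julia -t 8 this can mean up to 8 × 4 BLAS threads,
                         # per the N×M behavior described above
```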