Threads stop working when scaling computation

I have been stumped by this for a while now, so I thought I’d put it up here.

I developed a script that calculates some properties of a simulation “self-consistently”. That is, the system is set up as a matrix, which is diagonalized; some properties are then calculated from the eigenvectors and plugged back into the matrix. This process is repeated until convergence (or until some n_max_iter is reached). I then want to do this for a bunch of different parameter values, so I distribute them over threads using Threads.@threads.
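To make the loop structure concrete, here is a minimal toy sketch of the pattern I mean (everything here — the names, the update rule, the parameters — is made up for illustration; it is not my actual code):

```julia
using LinearAlgebra

# Toy stand-in for the scheme described above: build H from a parameter,
# diagonalize, compute a scalar property from the spectrum, feed it back
# into H, and repeat until the property stops changing.
function toy_self_consistent(p; n=8, tol=1e-10, n_max_iter=200)
    H = Matrix{Float64}(I, n, n) .* p
    prop = p
    for _ in 1:n_max_iter
        E, V = eigen(Symmetric(H))
        new_prop = sum(E) / n / 2 + 1.0     # toy update; a contraction, so it converges
        abs(new_prop - prop) < tol && return new_prop
        prop = new_prop
        H[diagind(H)] .= prop               # plug the property back into the matrix
    end
    return prop
end

# Sweep over parameters, one independent self-consistency loop per task.
params = range(0.5, 2.0; length=8)
results = Vector{Float64}(undef, length(params))
Threads.@threads for j in eachindex(params)
    results[j] = toy_self_consistent(params[j])
end
```

(The toy update x ↦ x/2 + 1 has fixed point 2, so every parameter converges to the same value — the point here is only the structure: independent eigen-based fixed-point loops distributed with Threads.@threads.)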

The script works fine (fast, and the results look right) on my desktop, and for the smaller system sizes I used while developing the code. When I run the same small system size (800x800 matrices) on our Slurm cluster (96 threads per node; I usually request 72), it also works fine.

When I increase the system size to 1800x1800 matrices, the script still works fine on my desktop (16 threads, 32G memory). However, when I run it on the cluster and monitor with htop, all threads initially run at 100% until, after a minute or so, they all suddenly drop into the S (sleeping) state. They do not recover.

In fact, this already occurs with 20 threads, but not with 16 threads on the same machine. I’m stumped.

Things I tried include:

  • Making sure it’s not an out-of-memory issue. This is “easy” to rule out, since the cluster has up to 256G of memory available.
  • Forcing garbage collection. I know my code allocates a lot (a consequence of calling LinearAlgebra.eigen() hundreds of times).
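The GC forcing was along these lines (the placement is hypothetical; this standalone toy loop just illustrates the idea, it is not my actual code):

```julia
using LinearAlgebra

# Standalone illustration of the "force garbage collection" attempt: an
# allocation-heavy loop (a stand-in for the repeated eigen() calls) with a
# full collection forced every 10 iterations.
function gc_heavy_loop(n_iter)
    total = 0.0
    for i in 1:n_iter
        A = rand(200, 200)                   # allocates, like eigen() does
        total += sum(eigvals(Symmetric(A)))  # equals tr(A); just to use the result
        i % 10 == 0 && GC.gc()               # explicit full collection
    end
    return total
end

gc_heavy_loop(30)
```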

I made myself a sort of minimal example, where I diagonalize a bunch of random matrices in the same sort of loop structure as in my real code, and that doesn’t cause the sleeping to occur. So I conclude that the problem is somehow related to the way I update the matrix. For completeness, this is the exact code:

```julia
function update_H!(H, erray, N, E, uv, U, β, μ0, t, ϵ)
    erray[1] = 0.0
    erray[2] = 0.0
    for i in 1:N
        δi = 0 + 0im
        mi = μ0
        for (k, e) in enumerate(E)
            δi += U[i] * uv[2i-1, k] * conj(uv[2i, k]) * tanh(β * e / 2)
            mi -= U[i] * abs2(uv[2i-1, k]) * fermi_dirac(e, β)
        end

        # track the largest change relative to the previous iteration
        if abs(H[2i-1, 2i] - δi) > erray[1]
            erray[1] = abs(H[2i-1, 2i] - δi)
        end
        if abs(H[2i-1, 2i-1] - (4t + ϵ[i] - mi)) > erray[2]
            erray[2] = abs(H[2i-1, 2i-1] - (4t + ϵ[i] - mi))
        end

        H[2i-1, 2i] = δi
        H[2i, 2i-1] = conj(δi)

        H[2i-1, 2i-1] = 4t + ϵ[i] - mi
        H[2i, 2i] = -4t - ϵ[i] + mi
    end
    return nothing
end
```

```julia
function get_self_consistent(n, m, t, μ, μ0, ϵ, Δ, U, pbc_x, pbc_y, β; epsa=1e-4, N_iter=40)
    i = 0
    erra = 1
    erray = [0.0, 0.0]
    H = H_BdG(n, m, t, μ, ϵ, Δ, pbc_x=pbc_x, pbc_y=pbc_y)
    E = Array{Float64}(undef, n*m)
    uv = similar(H)

    while (i < N_iter) && (erra > epsa)
        E, uv = eigen(Hermitian(H))
        update_H!(H, erray, n*m, E, uv, U, β, μ0, t, ϵ)
        erra = max(erray...)

        i += 1
        print("i = ", i, " | ϵ(abs) = ", round.(erray, digits=5), "\n")
    end

    Δ = calc_Δ(n*m, E, uv, U, β)
    μ = calc_μ(n*m, E, uv, U, β, μ0)
    # print("i = ", i, " | ϵ(abs) = ", round(erra, digits=5), "\n")
    return Δ, μ
end
```

So H is the matrix being diagonalized; update_H! updates H in place and keeps track of the largest change relative to the previous iteration (in erray). There are also a bunch of parameters which I won’t explain, but have left in for completeness. The print statement is only there for debugging, and it showed me that most workers don’t make it past the first few iterations of the self-consistent loop. The bug is also present without printing anything.
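For reference, the minimal example that does *not* reproduce the problem looked roughly like this (a reconstructed sketch, not the exact code; sizes shrunk and names made up):

```julia
using LinearAlgebra

# The same threaded loop structure as the real code, but diagonalizing
# fresh random Hermitian matrices instead of the self-consistently
# updated H. This version never goes to sleep on the cluster.
function diag_only(n_iter, N)
    E = Float64[]
    for _ in 1:n_iter
        A = rand(ComplexF64, N, N)
        E, _ = eigen(Hermitian(A))   # same eigen(Hermitian(...)) call as the real code
    end
    return E
end

spectra = Vector{Vector{Float64}}(undef, 8)
Threads.@threads for j in eachindex(spectra)
    spectra[j] = diag_only(5, 50)
end
```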

I’m sorry if this is a bit much, but I would appreciate any insight into what’s going on :slight_smile: