Threads stop working when scaling computation

SpuriousEigenstate · May 17, 2024, 12:06pm

I have been stumped by this for a while now, so I thought I’d put it up here.

I developed a script that calculates some properties of a simulation “self-consistently”. I.e. the system is set up as a matrix, diagonalized, then some properties are calculated from the eigenvectors and plugged back in to the matrix. This process is then repeated until convergence (or some n_max_iter). I then want to do this for a bunch of different parameters, so I pass them to different threads using Threads.@threads.

The script works fine (fast & results look right) on my desktop, and for the smaller system sizes I used when developing the code. When I put the same, small system size (800x800 matrices), on our slurm cluster, (96 threads per node, I usually request 72) it works fine.

Increasing the system size to 1800x1800 matrices the script still works fine on my desktop (16 threads, 32G memory), however when I run it on the cluster and monitor with htop, initially all threads are running at 100% until suddenly after a minute or so, all threads go into S or sleep mode. They do not recover.

In fact, this already occurs at 20 threads, but doesn’t at 16 threads on the same machine. I’m stumped.

Things I tried include:

Making sure it’s not a (too-little) memory issue. This is “easy” since the cluster has up to 256G memory available.
Forcing garbage collection. I know my code allocates a lot (a consequence of running LinearAlgebra.eigen() hundreds of times

I made myself a sort of minimal example, where I diagonalize a bunch of random matrices in the same sort of loop structure as in my real code, and that doesn’t cause the sleeping to occur. So I conclude that the problem is somehow related to the way I update the matrix. For completeness, this is the exact code:

function update_H!(H, erray, N, E, uv, U, β, μ0, t, ϵ)
	erray[1] = 0.0
	erray[2] = 0.0
	for i in 1:N
		δi = 0+0im
		mi = μ0
		
		for (k, e) in enumerate(E)
			δi += U[i] * uv[2i-1, k] * conj(uv[2i, k]) * tanh(β * e / 2)
			mi -= U[i] * abs2(uv[2i-1, k]) * fermi_dirac(e, β)
		end

		if abs(H[2i-1, 2i] - δi) > erray[1]
			erray[1] = abs(H[2i-1, 2i] - δi)
		end
		
		if abs(H[2i-1, 2i-1] - mi) > erray[2]
			erray[2] = abs(H[2i-1, 2i-1] - (4t + ϵ[i] - mi))
		end

		H[2i-1, 2i] = δi
		H[2i, 2i-1] = conj(δi)

		H[2i-1, 2i-1] = 4t + ϵ[i] - mi
		H[2i, 2i] = -4t - ϵ[i] + mi
	end
	return nothing
end

function get_self_consistent(n, m, t, μ, μ0, ϵ, Δ, U, pbc_x, pbc_y, β; epsa=1e-4, N_iter=40)
    
    i = 0
    erra = 1
	erray = [0.0, 0.0]
	H = H_BdG(n, m, t, μ, ϵ, Δ, pbc_x=pbc_x, pbc_y=pbc_y)
	E = Array{Float64}(undef, n*m)
	uv = similar(H)

    while (i < N_iter) && (erra > epsa)

		E, uv = eigen(Hermitian(H))

		update_H!(H, erray, n*m, E, uv, U, β, μ0, t, ϵ)
		
        erra = max(erray...)

        i += 1
		print("i = ", i, " | ϵ(abs) = ", round.(erray,digits=5), "\n")
		flush(stdout)
    end

	Δ = calc_Δ(n*m, E, uv, U, β)
    μ = calc_μ(n*m, E, uv, U, β, µ0)
	# print("i = ", i, " | ϵ(abs) = ", round(erra,digits=5), "\n")
    return Δ, μ
end

So H is the matrix being diagonalized, update_H! updates H in place, and keeps track of the difference with the previous iteration (erray). There are also a bunch of parameters which I won’t explain, but left in for completeness. The print statement is only there for debugging purposes, and informed me that most workers don’t make it past the first few iterations of the self-consistent loop. The bug is also present without printing anything.

I’m sorry if this is a bit much, but I would appreciate any insight into what’s going on

Topic		Replies	Views
Inconsistent CPU utilisation in @threads loops Performance parallel , multithreading , threads , threading	20	876	February 23, 2024
BLAS vs Threads on a cluster Performance	6	522	April 23, 2024
Threads.@threads slow down when arrays become too large New to Julia	1	895	November 29, 2021
Help me understand multi-threaded scaling for matrix multiplication Performance question	22	622	April 16, 2024
Computer freezes with parallel Julia at Scale	0	496	January 28, 2020

Threads stop working when scaling computation

Related topics