Julia silently stop working while computing the rank of some not that big matrix

I wrote a custom function computing the rank of some matrices (I need an exact computation, and the entries are mod p integers for some small prime p which is why I’m not using the built-in rank function). This is nothing fancy, just a fairly straightforward Gaussian elimination, the code is below. I’ve been using it on larger and larger matrices and it’s working pretty well and giving me an answer after a couple minutes.

Yet, when the matrix is big enough (roughly 40.000x70.000), the computation starts, do something like 70% of the job fairly quickly and then just stop working, ie it’s not crashing or throwing any error, it’s just not doing anything at all.

Weirdly enough the following ugly hack works: I modified my function in order to periodically save the current state of the matrix on the disk. Every time this happens, I kill Julia, start again, load the matrix I just save and start computing the rank again. If I do this one or two times I get my result again in a couple of minutes.

It kinda seems like this is a memory issue, although my matrices are not that big… Does anyone knowns what’s going on ?

Here’s the fairly naive code for the rank. A sparse matrix is represented as a Vector of SparseVector’s. The first function swaps the first column with a short(er) one. The second is just basic Gaussian elimination, deleting the pivot after it’s been used to save some memory.

function sparseFindShort!(n,ll)
	i=0
	j=length(ll[1])+1
	@inbounds for k in 1:n 
		m=nnz(ll[k])
		if m==1
			i=k
			break
		end
		if m<j && m>0
			i=k
			j=nnz(ll[k])
		end
	end
	if i!=0
		ll[1],ll[i]=ll[i],ll[1]
	end
	return nothing
end


function sparseRank!(n,l)
	if n>=1
		sparseFindShort!(n,l)
		if nnz(l[1])==0
			deleteat!(l,1)
			return sparseRank!(n-1,l)
		else
			ind=l[1].nzind[1]
			l[1]/=(l[1].nzval[1])
			@inbounds Threads.@threads for j in 2:n
				if l[j][ind]!=0
					l[j]-= l[j][ind]*l[1]
				end
			end
			deleteat!(l,1)
			return 1+sparseRank!(n-1,l)
		end
	else
		return 0
	end
end

The first thing I would suspect is a bug in your code. Please provide a self-contained MWE, with a matrix that produces the problem. You could also test building blocks.

1 Like

what happens when you remove threading?

@Tamas_Papp I get your point of course, unfortunately this is not that easy, the matrices I’m having trouble with are too big to be shared here, and with randomly generated ones I haven’t been able to pinpoint a working example. Also the code works (and gives the expected answer) consistently with the slightly smaller matrices I’m interested in. It might, of course, be very well be that something is wrong with the way I wrote that piece of code, but I was hoping that experienced Julia users would be able to tell by looking at it. Also, I suppose if it was a bug then something would happen…

@ChrisRackauckas the exact same thing happens if I remove multi-threading.

1 Like

You can put the complete, runnable code for the MWE somewhere, e.g. in a gist, and link that here, after reducing it as much as possible.

1 Like

@dpsanders I’m aware, but as I alluded to doing that was somewhat tricky (the code producing the matrix in question is not just long and fairly dense, it also takes the good part of an hour to do so, not sure it would be useful to share that part).

Anyway, after some trial and error I think I got a MWE with random matrices. It doesn’t seems in fact this is a memory issue, rather something to do with the mere size of the matrix (namely that its length is just above 2^15), and the fact that its rank is big so that they aren’t many columns removed easily. Feel free to try and run the code below. It says hello every once in a while showing how much work remain to be done. On my computer it reaches 5000 in a couple seconds and then just stop. I just know realize: can it have something to do with my function being recursive ? Although I imagine it would throw an error if it was the case.

Note the code is slightly modified from the one below by hard including mod 5 arithmetic which in my case is implemented via a custom struct. This allows to apply this to a matrix of UInt8 and avoid heavy memory usage.

using SparseArrays
function sparseFindShort!(n,ll)
	i=0
	j=length(ll[1])+1
	for k in 1:n 
		m=nnz(ll[k])
		if m==1
			i=k
			break
		end
		if m<j && m>0
			i=k
			j=nnz(ll[k])
		end
	end
	if i!=0
		ll[1],ll[i]=ll[i],ll[1]
	end
	return nothing
end


function sparseRank!(n,l)
	if n>=1
		if n%5000==0
			println("Still working... $n")
		end
		sparseFindShort!(n,l)
		if nnz(l[1])==0
			deleteat!(l,1)
			return sparseRank!(n-1,l)
		else
			ind=l[1].nzind[1]
			l[1]=(l[1]*invmod(l[1].nzval[1],5))
			Threads.@threads for j in 2:n
				if l[j][ind] !=0
					l[j]=(l[j]-l[j][ind]*l[1]).%5
				end
			end
			deleteat!(l,1)
			return 1+sparseRank!(n-1,l)
		end
	else
		return 0
	end
end
n=35000 #I think this value is somehow key
l=Vector{SparseVector{UInt8,UInt32}}(undef,n)
for i=1:n #randomly generates a matrix which is very sparse, but still of high rank.
	u=sprand(UInt8,2*n,0.0001).%5
	u[i]=1
	l[i]=u
end
sparseRank!(length(l),l)

Well, it seems it had to do with recursion indeed, I changed to an iterative version and it works like a charm.

using SparseArrays
function sparseFindShort!(j,ll)
	i=j+1
	len=length(ll[1])+1
	for k in j:-1:1
		m=nnz(ll[k])
		if m==1
			i=k
			break
		end
		if m<len && m>0
			i=k
			len=nnz(ll[k])
		end
	end
	if i!=j+1
		ll[j],ll[i]=ll[i],ll[j]
	end
	return nothing
end


function sparseRank!(l)
	r=0
	for j in length(l):-1:1
		sparseFindShort!(j,l)
		if nnz(l[j])!=0
			ind=l[j].nzind[1]
			l[j]=(l[j]*invmod(l[j].nzval[1],5))
			Threads.@threads for s in j-1:-1:1
				if l[s][ind] !=0
					l[s]=(l[s]-l[s][ind]*l[j]).%5
				end
			end
			r+=1
		end
		deleteat!(l,j)
	end
	return r
end
n=35000
l=Vector{SparseVector{UInt8,UInt32}}(undef,n)
for i=1:n
	u=sprand(UInt8,2*n,0.0001).%5
	u[i]=1
	l[i]=u
end
sparseRank!(l)
2 Likes

Anyhow, so now everything’s working fine but I’m still curious: should that be considered a bug that Julia doesn’t throw an error in that case ?

Also, in the code above there’s a “u[i]=1” whose purpose is to make sure the rank of the randomly generated matrix is high. Strangely enough, if I don’t include this line I get a stackoverflow error almost right away.

Please try to isolate the issue to a minimal working example that is easier to understand. It is harder to say if something is a bug when one has to reason about the correctness of some algorithm.

Plug: depending on what your next steps are (other primes? extensions of 𝔽₅?) you may be interested in GitHub - tkluck/GaloisFields.jl: Finite fields for Julia . Just like you describe, it implements modular arithmetic in the smallest integer type that fits.