I do find some of your notation confusing. I'm not sure I know what the difference is between all the lower and upper case letters. In [L(a);c]*[U(a)] = [A;C] it seems clear that L(a) is part of an LU panel factorization of [A;C], but you aren't using similar notation for c, which should also be part of a lower triangular factor. In any event, from the code, it does seem like you have the idea. I would have thought it would be a bit faster than you are seeing, but I am on a 13-year-old computer at the moment and probably shouldn't benchmark anything. You might want to compare it to lu!.
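For a quick baseline, something like this works (a minimal sketch using BenchmarkTools; note that lu! on a Float64 matrix is the LAPACK routine, so this is only a rough yardstick for the blocking overhead, not a like-for-like GF(p) comparison):

```julia
using LinearAlgebra, BenchmarkTools

n = 2000
A = rand(n, n)
# benchmark the in-place library factorization on a fresh copy each run,
# since lu! mutates its argument
@btime lu!(B) setup = (B = copy($A)) evals = 1
```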
I don't work with finite fields, but I think the principles are the same. Lower-level optimizations can help a bit, but first and foremost you need to structure your algorithm to make good use of the cache. Typically this involves further blocking. The simplest approach is probably a cache-oblivious recursive algorithm. In that regard, I think these posts might be helpful:
- GEMM with some links to implementations.
- Cache oblivious matrix multiplication
The second link, with stevengj's implementation of a cache-oblivious algorithm, is probably the lowest barrier to entry for getting a faster matrix multiply, but it doesn't seem like a great starting place for CUDA. The book "Programming Massively Parallel Processors: A Hands-on Approach" by Wen-mei W. Hwu, David B. Kirk, and Izzat El Hajj works through a blocked matrix multiplication as an example and explains the importance of that approach. At least it did in the 1st edition; I don't know about the current 4th edition, but it's a nice example, so I would guess they kept it. That would be closer to what you ultimately want, although I think it would be informative to play around with the recursive algorithm first.
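To give a flavor of the recursive approach, here is a minimal cache-oblivious multiply (my own sketch in the spirit of the implementation in the second link, not a copy of it; the base-case threshold of 4096 total elements is an arbitrary assumption you would tune):

```julia
# Recursively split the largest dimension until all three blocks fit
# comfortably in cache, then run a simple base-case kernel.
# Updates C in place: C .+= A * B.
function recmul!(C, A, B)
    m, n = size(C)
    k = size(A, 2)
    if m * k + k * n + m * n <= 4096          # base case: small enough for cache
        @inbounds for j in 1:n, l in 1:k
            b = B[l, j]
            for i in 1:m
                C[i, j] += A[i, l] * b
            end
        end
    elseif m >= max(n, k)                     # split the row dimension
        h = m ÷ 2
        recmul!(view(C, 1:h, :), view(A, 1:h, :), B)
        recmul!(view(C, h+1:m, :), view(A, h+1:m, :), B)
    elseif n >= k                             # split the column dimension
        h = n ÷ 2
        recmul!(view(C, :, 1:h), A, view(B, :, 1:h))
        recmul!(view(C, :, h+1:n), A, view(B, :, h+1:n))
    else                                      # split the inner dimension
        h = k ÷ 2
        recmul!(C, view(A, :, 1:h), view(B, 1:h, :))
        recmul!(C, view(A, :, h+1:k), view(B, h+1:k, :))
    end
    return C
end
```

The point of the recursion is that it automatically produces blocks that fit in every level of cache, without having to pick block sizes for a particular machine.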
I do not think this is something that is going to end up being graceful or all that easy to describe. The short answer is that you do need to swap in columns from the end of the matrix. A longer version: in a finite field, all you need is a nonzero pivot, which simplifies things a bit. But the complication is that if you have done k steps (where k is less than the number of columns in the panel) of the panel factorization on [A;C] and find that you have no nonzero pivots, then you need to look elsewhere in the matrix. The columns corresponding to [B;D] have not had any elimination steps applied to them, so you can't just look at those unmodified columns to find a nonzero pivot. You need to apply the elimination steps to any column in which you hope to find a pivot, as if it had been part of the [A;C] panel all along. But you can't do that to the whole matrix [B;D] at that stage, because you are trying to defer that computation until after you have factored the panel so you can use level 3 BLAS (aka gemm or mul!). So one option is to start applying updates to individual columns until you find a nonzero pivot.
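Concretely, the single-column version of that update looks something like this (a sketch under my assumptions about your data layout, with mod-p integers standing in for the field elements; v is a candidate column from [B;D] and L holds the k completed columns of the lower triangular factor):

```julia
# Apply the k completed elimination steps of the panel factorization to
# one deferred column v, working mod p. The diagonal of L is implicitly 1,
# so no field inverses are needed here.
function update_column!(v::AbstractVector, L::AbstractMatrix, k::Integer, p::Integer)
    for j in 1:k
        vj = v[j]
        iszero(vj) && continue
        for i in j+1:length(v)
            v[i] = mod(v[i] - L[i, j] * vj, p)
        end
    end
    return v
end

# after the update, any nonzero entry of v[k+1:end] is a pivot candidate
```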
The rook pivoting code I linked to does this by updating individual columns and rows until a suitable pivot is found. That’s a reasonable strategy in floating point, because you are likely to find an acceptable pivot (one that is the largest magnitude element in its column and row), even if the matrix is numerically singular and every possible pivot is very small. If you were very unlucky and kept finding exactly zero pivots, you could even perturb a matrix element to create a nonzero pivot without compromising stability.
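For what it's worth, the core of a rook search looks something like this (a simplified floating-point sketch that assumes the block A has already been fully updated, which the real deferred-update code avoids doing):

```julia
# Alternate between column and row searches until an entry is (weakly)
# the largest in magnitude in both its row and its column.
function rook_pivot(A::AbstractMatrix{<:Real})
    j = 1
    i = argmax(abs.(view(A, :, j)))
    val = abs(A[i, j])
    while true
        jnew = argmax(abs.(view(A, i, :)))
        abs(A[i, jnew]) <= val && return (i, j)
        j, val = jnew, abs(A[i, jnew])
        inew = argmax(abs.(view(A, :, j)))
        abs(A[inew, j]) <= val && return (i, j)
        i, val = inew, abs(A[inew, j])
    end
end
```

The loop terminates because the candidate pivot magnitude strictly increases on every swap, and it is bounded by the largest entry of the block.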
However, in a finite field, it could be that partway through your panel factorization there are no nonzero pivots anywhere, even in [B;D] (once suitable updates have been made to those columns). Checking them one column at a time will kill your performance. I don't know what the best strategy would be for a finite field. Perhaps you can remove the columns that have no nonzero pivots from your panel factorization, bring in a significant number of other columns, update those as if they were part of [A;C] originally, and then look for pivots there. But it will be messy, and not a simple modification of what you have. And I'm just guessing at what might be reasonable to try; people who actually work with finite fields might have their own tricks for this. Maybe you could look and see what is done in FLINT.
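If I were guessing at an implementation, the block version of that candidate update might look something like this (a rough mod-p sketch with made-up names: L1 is the k×k unit lower triangular part of the completed steps, L2 the rows below it, and B1/B2 the corresponding rows of the candidate columns; for large k and p you would need to guard against integer overflow in the products):

```julia
# Apply the k completed elimination steps to a whole block of candidate
# columns at once, so the bulk of the work is a single matrix multiply.
function update_block(L1, L2, B1, B2, p)
    Y = copy(B1)
    k, n = size(Y)
    for col in 1:n, j in 1:k          # forward substitution mod p (unit diagonal)
        yj = Y[j, col]
        iszero(yj) && continue
        for i in j+1:k
            Y[i, col] = mod(Y[i, col] - L1[i, j] * yj, p)
        end
    end
    S = mod.(B2 .- L2 * Y, p)         # level-3 update of the remaining rows
    return Y, S                        # any nonzero entry of S is a pivot candidate
end
```

The appeal of doing it in blocks is that the dominant cost lands in the L2 * Y multiply, which is exactly the level 3 operation you are trying to preserve.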