Improving performance of CUDA GPU kernel: LU factorization

I am learning GPU programming via CUDA.jl. I wish to implement an efficient LU factorization for matrices over a finite field (I implemented the field elements as UInt8/UInt16/... with addition and multiplication reduced mod p).

For the sake of simplicity, however, in this post I will work over floats and focus solely on the rank-1 update (the real bottleneck), so no pivoting etc. The simplified CPU code is:

function _rank1_update!(X::Matrix{tw}, k, m, n)  where {tw}
   @inbounds begin 
      for j = k+1 : n 
         @simd for i = k+1 : m  
            X[i,j] -= X[i,k] * X[k,j] end end end end;

function main_fn(X)
   m,n = size(X)
   for k = 1 : min(m,n)
      _rank1_update!(X, k, m, n) end end;

With ChatGPT’s help, I transformed this into CUDA code in 3 different ways (parallelizing over all entries / over columns / over rows), and these kernels give the correct result:

using LinearAlgebra, CUDA

@inline _idx1D() = (blockIdx().x-Int32(1)) * blockDim().x + threadIdx().x;
@inline _idx2D() = (blockIdx().y-Int32(1)) * blockDim().y + threadIdx().y, (blockIdx().x-Int32(1)) * blockDim().x + threadIdx().x;

function _kernel1!(X::CuDeviceMatrix{tw}, k, m, n)  where {tw}
   @inbounds begin   i,j = _idx2D();   i += k;   j += k;   
      !(k < i ≤ m && k < j ≤ n) && return;
      X[i,j] -= X[i,k] * X[k,j];   
return end end;

function _kernel2!(X::CuDeviceMatrix{tw}, k, m, n)  where {tw}
   @inbounds begin   j = k + _idx1D();   !(k < j ≤ n) && return; 
      xkj = X[k,j];   i = k+1;
      while i ≤ m   X[i,j] -= X[i,k] * xkj;   i += 1; end;   
return end end;

function _kernel3!(X::CuDeviceMatrix{tw}, k, m, n)  where {tw}
   @inbounds begin   i = k + _idx1D();   !(k < i ≤ m) && return; 
      xik = X[i,k];   j = k+1;
      while j ≤ n   X[i,j] -= xik * X[k,j];   j += 1; end;   
return end end;

function _rank1_update1!(X::CuMatrix{tw}, k, m, n)  where {tw}
   @inbounds begin k==min(m,n) && return nothing;  
      ker = @cuda launch=false always_inline=true _kernel1!(X,k,m,n);
      Φ = CUDA.launch_configuration(ker.fun);  
      t = Φ.threads;  h=m-k; w=n-k; r=w/h;  
      tx = clamp(round(Int,sqrt(t*r)), 1, w);  bx = cld(w,tx);
      ty = clamp(round(Int,sqrt(t/r)), 1, h);  by = cld(h,ty);
      CUDA.@sync ker(X,k,m,n; threads=(tx,ty), blocks=(bx,by));   
   end end;  # use the compiler/Φ to determine the optimal number of threads for this input

function _rank1_update2!(X::CuMatrix{tw}, k, m, n)  where {tw}
   @inbounds begin k==min(m,n) && return nothing;  
      ker = @cuda launch=false always_inline=true _kernel2!(X,k,m,n); 
      Φ = CUDA.launch_configuration(ker.fun);  
      t = clamp(Φ.threads, Int32(1), n-k);  b = cld(n-k,t);   
      CUDA.@sync ker(X,k,m,n; threads=t, blocks=b) end end;

function _rank1_update3!(X::CuMatrix{tw}, k, m, n)  where {tw}
   @inbounds begin k==min(m,n) && return;  
      ker = @cuda launch=false always_inline=true _kernel3!(X,k,m,n);
      Φ = CUDA.launch_configuration(ker.fun);  
      t = clamp(Φ.threads, Int32(1), m-k);  b = cld(m-k,t);   
      CUDA.@sync ker(X,k,m,n; threads=t, blocks=b) end end;

function main_fn!(X::CuMatrix{tw}, mtd::Int=3)  where {tw}
   @inbounds begin m,n = size(X);  
   for k = 1 : min(m,n)  
      if     mtd==1 _rank1_update1!(X, Int32(k), Int32(m), Int32(n)) 
      elseif mtd==2 _rank1_update2!(X, Int32(k), Int32(m), Int32(n)) 
      elseif mtd==3 _rank1_update3!(X, Int32(k), Int32(m), Int32(n)) end
end end end;

When benchmarking (on Nvidia Geforce RTX 2080 Ti), we see that mtd=3 is the most performant of the 3, but still pales in comparison with the built-in lu!:

julia> m=n=30_000;  X = cu(rand(-1:Float32(1e-6):1,m,n));  CUDA.@time lu!(X);
  2.179116 seconds (73 CPU allocations: 3.000 KiB, 7.00% gc time) (1 GPU allocation: 117.188 KiB, 0.00% memmgmt time)

julia> X = cu(rand(-1:Float32(1e-6):1,m,n));  CUDA.@time main_fn!(X,1);
527.408564 seconds (1.47 M CPU allocations: 60.880 MiB)

julia> X = cu(rand(-1:Float32(1e-6):1,m,n));  CUDA.@time main_fn!(X,2);
1386.822122 seconds (389.96 k CPU allocations: 7.783 MiB, 1.05% gc time)

julia> X = cu(rand(-1:Float32(1e-6):1,m,n));  CUDA.@time main_fn!(X,3);
294.605763 seconds (389.13 k CPU allocations: 7.769 MiB, 0.05% gc time), 0.00% GPU memmgmt time

I used some of the official performance tips: passing Int32 instead of Int64 indices, maxregs=32, always_inline=true, and replacing for j = k+1:n with a while loop, but none of it helped.

  • Most of all, I would appreciate any suggestions on how to improve mtd=3 performance.
  • Why are there so many CPU allocations? How can I cut down on them?
  • How come mtd=3 is more efficient than mtd=2? I thought CuMatrix’s are column-major.

Also, note that if in main_fn! I replace _rank1_update3! with just the functions for finding the pivot, swapping the column and row, and scaling the pivot column (which are all GPU functions), the execution is very fast (4 seconds on a 100K x 100K matrix), so the problem should be in how _rank1_update3! is implemented, not in calling the kernel many times.


Anyone?

I am not experienced with writing GPU kernels, but no one else has jumped in and I at least know a little about LU on CPUs. I think it is very likely that you are not implementing a competitive algorithm. LAPACK LU does panel factorization to allow highly optimized rank k updates using xGEMM. Simple rank 1 updates are not even close to being competitive on a CPU and, with what little knowledge I have of GPU code, I think that the same is true on GPUs. The LAPACK working notes have a number of notes that seem relevant to doing LU factorization on GPUs. From what I can see, it looks like they do, in fact, consider factorization algorithms with a panel factorization and a higher rank update of a trailing matrix. If being competitive with the best code is your goal, this seems like a fairly ambitious project, although perhaps you could go part way and just implement the panel factorization yourself and use whatever optimized code is available for CUDA gemm. If you want to do more, you could then move on to your own gemm.
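
To give a sense of the difference, one 5-argument mul! (i.e. gemm) call applies an entire panel's worth of rank-1 updates at once. A toy sketch (the block size and the names L21, U12, S are mine, just for illustration):

using LinearAlgebra
m, b = 2000, 64
A = randn(m, m)
L21 = A[b+1:m, 1:b]             # stand-in for the panel's multipliers
U12 = A[1:b, b+1:m]             # stand-in for the corresponding rows of U
S = @view A[b+1:m, b+1:m]       # trailing matrix
mul!(S, L21, U12, -1.0, 1.0)    # S .= S - L21*U12: one gemm fuses b rank-1 updates

The fused version wins because the b rank-1 updates share one pass over the trailing matrix; the same consideration applies on GPUs.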


This is presumably due to memory coalescing. Put simply, the memory operations from threads within a warp (a group of 32 adjacent threads, operating (mostly) in lockstep) will be combined for higher memory throughput. As a CuMatrix is indeed column-major, you then need to put the index corresponding to threadIdx().x into the first axis. This is the case in _kernel3! (where i corresponds to the thread index), but not in _kernel2!, where the threadIdx-variable j is used for the second axis.
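
To make this concrete, here is a toy pair of copy kernels (my sketch, not from the posts above) that differ only in which axis threadIdx().x runs along; on typical hardware the first is noticeably faster on large matrices:

using CUDA

function copy_coalesced!(Y, X)   # threadIdx().x along axis 1: a warp touches consecutive addresses
   i = (blockIdx().x - Int32(1)) * blockDim().x + threadIdx().x
   j = blockIdx().y
   if i ≤ size(X, 1)
      @inbounds Y[i, j] = X[i, j]
   end
   return
end

function copy_strided!(Y, X)     # threadIdx().x along axis 2: a warp strides by the column length
   j = (blockIdx().x - Int32(1)) * blockDim().x + threadIdx().x
   i = blockIdx().y
   if j ≤ size(X, 2)
      @inbounds Y[i, j] = X[i, j]
   end
   return
end

X = CUDA.rand(Float32, 4096, 4096);  Y = similar(X);
CUDA.@time @cuda threads=256 blocks=(16, 4096) copy_coalesced!(Y, X)
CUDA.@time @cuda threads=256 blocks=(16, 4096) copy_strided!(Y, X)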


Thank you both for your replies!
@mstewart You’re saying the above algorithm itself is inappropriate for achieving high performance. Thank you for the link you posted, though it seems overwhelming (300 articles). So I have a few follow-up questions, for anyone kind enough to be willing to help.

Disclaimer: I do not know C/C++, but I do know Julia, Python, Mathematica. I have a PhD in math, though not in linear algebra (my topic was algebraic topology). I do not have any knowledge of compiler technology, like gcc, LLVM, … I have some experience using LLMs for coding.

CPU questions:

  • When I run LinearAlgebra.lu!(X::Matrix{<:AbstractFloat}), this calls a C function from the LAPACK package, correct? Is it this repo? GitHub - Reference-LAPACK/lapack: LAPACK development repository
  • My LLM tells me that the implementation is in the following links, is it correct?
  • Let’s say someone who knows their way around LAPACK minimally edited the code so that lu! also accepted matrices over a finite field, e.g. ModInt8{11} = \mathbb{Z}/11, implemented simply as UInt8 with operations mod(x+y,11), mod(x*y,11), invmod(x,11) (a minimal sketch of this type follows this list). If nothing else were changed in the code, what performance would you expect compared to Float32? I would think it should be even faster (less RAM used, simpler arithmetic except perhaps invmod, and simpler pivoting, i.e. finding the first nonzero entry), no?
  • How doable would it be to extract with an LLM a minimal implementation of lu! from LAPACK, convert it to Julia, then minimally edit it to accept ModInt8{11}? How much code does the lu! part of LAPACK require? 1K lines? 2K? 5K? 10K? Does its C/Fortran code contain any specific low-level parts for which there is no Julia counterpart?
  • I have come to know that the GCC compiler has a flag -floop-nest-optimize, which is supposed to speed up code a lot via SIMD, vectorization, cache blocking, etc. Does Julia’s LLVM compiler have any similar option I could somehow use? If I used an LLM to write the basic lu! function via rank-1 updates in C and used that gcc flag, what would the performance be? Is that flag a magic solution that gives extreme performance boosts?
  • There are also a bunch of other flags for optimization: Optimize Options (Using the GNU Compiler Collection (GCC)) Which would help most on the rank-1 implementation of lu!? How does Julia’s LLVM compiler compare to gcc? Are there any analogous optimization options in Julia?
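
To be concrete, the scalar type I have in mind is something like this minimal sketch (untested, and ignoring whatever lu! additionally requires for pivoting; just to fix the semantics):

struct ModInt8{p} <: Number
   x::UInt8
   ModInt8{p}(x::Integer) where {p} = new(UInt8(mod(x, p)))
end
Base.:+(a::ModInt8{p}, b::ModInt8{p}) where {p} = ModInt8{p}(Int(a.x) + Int(b.x))
Base.:-(a::ModInt8{p}, b::ModInt8{p}) where {p} = ModInt8{p}(Int(a.x) - Int(b.x))
Base.:*(a::ModInt8{p}, b::ModInt8{p}) where {p} = ModInt8{p}(Int(a.x) * Int(b.x))
Base.inv(a::ModInt8{p}) where {p} = ModInt8{p}(invmod(Int(a.x), p))
Base.zero(::Type{ModInt8{p}}) where {p} = ModInt8{p}(0)
Base.one(::Type{ModInt8{p}}) where {p} = ModInt8{p}(1)

a = ModInt8{11}(7);  b = ModInt8{11}(5)
(a * b + inv(a)).x   # 0x0a: 35 mod 11 = 2, inv(7) = 8, and 2 + 8 = 10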

GPU questions:

  • When running LinearAlgebra.lu!(X::CuMatrix{<:AbstractFloat}), this calls a function from cuBLAS, which is not open-source, and uses Nvidia’s CUDA compiler, which is not free, correct?
  • Is the situation with AMDGPU.jl similar? No code available to implement GPU lu!?
  • Is there any open-source code for an implementation of lu! on GPU? I’m especially interested in finite field coefficients but any implementation is welcome.

Note: I realize that Nemo.jl (wrapper around GitHub - flintlib/flint: FLINT (Fast Library for Number Theory)) supports CPU lu! on matrices over finite fields. It is fast, but incomparably slower than LAPACK’s. What also bothers me is that it is not multithreaded, causes millions of allocations, and requires way too much memory. I suspect it uses Int64/UInt64 behind the scenes, regardless of whether the modulus is small and UInt8 would suffice. So I’m looking for a faster, leaner solution.

lu! is in CUSOLVER, not CUBLAS, and the one GPT gives looks like this:

using CUDA

const TILE = 16

function tile_lu_kernel!(A, n)
    tx = threadIdx().x
    ty = threadIdx().y
    bx = (blockIdx().x - 1) * TILE
    by = (blockIdx().y - 1) * TILE

    # ✅ Correct shared memory allocation
    tile = CuStaticSharedArray(Float32, (TILE, TILE))

    # Load the tile into shared memory
    if bx + tx <= n && by + ty <= n
        @inbounds tile[ty, tx] = A[by + ty, bx + tx]
    end
    sync_threads()

    # ✅ LU factorization within the tile (no pivoting)
    for k in 1:TILE
        pivot = tile[k, k]
        sync_threads()
        if ty > k
            @inbounds tile[ty, k] /= pivot
        end
        sync_threads()
        if ty > k && tx > k
            @inbounds tile[ty, tx] -= tile[ty, k] * tile[k, tx]
        end
        sync_threads()
    end

    # Write back
    if bx + tx <= n && by + ty <= n
        @inbounds A[by + ty, bx + tx] = tile[ty, tx]
    end
    return
end

function gpu_tile_lu!(A::CuArray{Float32,2})
    n = size(A, 1)
    threads = (TILE, TILE)
    blocks = (cld(n, TILE), cld(n, TILE))
    @cuda threads=threads blocks=blocks tile_lu_kernel!(A, n)
    return A
end

I didn’t test it though, and I had to send an example from the CUDA repo to make it learn the syntax.

@yolhan_mannes Hmm, I do not understand your code. Where did you get it? ChatGPT? Running it on a copy of a 10x10 matrix X, then extracting L, U and comparing X to L*U gives incorrect results. Is there a way to make it correct and include pivoting, to make sure the pivot is nonzero?

OK, so it’s official: LLMs are not able to write a CUDA kernel for LU factorization!

Writing GPU lu! directly is too hard for today’s LLMs, but if I already had a minimal self-contained implementation in C/C++, I might be able to translate it to Julia.

OK, I got a minimal one, but it is bad for really big matrices:

using CUDA
using LinearAlgebra

# ========= Tunables =========
const TRED = 256  # threads in the single block (compile-time constant for shared buffers)

# ========= Single-block PLU kernel (partial pivoting) =========
# In-place LU with partial pivoting:
#   - A becomes packed L (unit diag implied) and U.
#   - piv[k] stores the pivot row (1-based, global index) chosen at step k.
# Relationship: P*A0 = L*U
# Works for Float32/Float64.
function lu_plu_singleblock_kernel!(A, n::Int32, piv)
    tid = threadIdx().x
    T   = blockDim().x

    # small shared arrays for reduction (constant-sized)
    sh_val = CuStaticSharedArray(eltype(A), TRED)
    sh_idx = CuStaticSharedArray(Int32,     TRED)

    @inbounds for k in 1:Int(n)  # pivot step (1-based)
        kk = Int32(k)

        # --- 1) Pivot search over rows i = k..n (parallel reduction) ---
        local_max = zero(eltype(A))
        local_idx = kk
        i = kk + (tid - 1)
        while i <= n
            v = abs(A[i, kk])
            if v > local_max
                local_max = v
                local_idx = i
            end
            i += T
        end
        sh_val[tid] = local_max
        sh_idx[tid] = local_idx
        sync_threads()

        step = T >>> 1
        while step >= 1
            if tid <= step
                if sh_val[tid + step] > sh_val[tid]
                    sh_val[tid] = sh_val[tid + step]
                    sh_idx[tid] = sh_idx[tid + step]
                end
            end
            sync_threads()
            step >>>= 1
        end

        p = sh_idx[1]           # winning pivot row
        if tid == 1
            piv[kk] = p
        end
        sync_threads()

        # if pivot is (near-)zero, skip the step (singular column)
        pivval = A[p, kk]
        if abs(pivval) == zero(eltype(A))
            sync_threads()
            continue
        end

        # --- 2) Swap rows k <-> p across ALL columns (parallel over j) ---
        if p != kk
            j = (tid - 1) + 1
            while j <= n
                tmp      = A[kk, j]
                A[kk, j] = A[p,  j]
                A[p,  j] = tmp
                j += T
            end
        end
        sync_threads()

        # --- 3) Scale below pivot: A[i,k] /= A[k,k] for i>k (parallel over i) ---
        denom = A[kk, kk]
        i2 = kk + (tid - 1)
        while i2 <= n
            if i2 > kk
                A[i2, kk] /= denom
            end
            i2 += T
        end
        sync_threads()

        # --- 4) Rank-1 update on trailing submatrix: A[i,j] -= A[i,k] * A[k,j] ---
        # Parallelize over columns j; each thread does its own column’s i-loop
        j2 = kk + 1 + (tid - 1)
        while j2 <= n
            Akj = A[kk, j2]        # read once per thread/column
            i3 = kk + 1
            while i3 <= n
                A[i3, j2] -= A[i3, kk] * Akj
                i3 += 1
            end
            j2 += T
        end
        sync_threads()
    end
    return
end

# ========= Host wrapper =========
function lu_plu!(A_d::CuArray{T,2}) where {T<:Union{Float32,Float64}}
    n = Int32(size(A_d,1))
    @assert size(A_d,2) == n "A must be square"
    piv_d = CUDA.zeros(Int32, n)
    # single-block kernel; the kernel grid-strides so any n is fine
    @cuda threads=TRED blocks=1 lu_plu_singleblock_kernel!(A_d, n, piv_d)
    return A_d, Array(piv_d)
end

# ========= Utilities =========
function permutation_from_pivots(piv::Vector{Int32})
    n = length(piv)
    P = Matrix{Float32}(I, n, n)
    for k in 1:n
        pk = Int(piv[k])
        if pk != k
            P[[k, pk], :] .= P[[pk, k], :]
        end
    end
    P
end

function extract_LU(Ah::Matrix)
    n = size(Ah,1)
    L = Matrix{eltype(Ah)}(I, n, n)
    U = zeros(eltype(Ah), n, n)
    @inbounds for i in 1:n, j in 1:n
        if i > j
            L[i,j] = Ah[i,j]
        else
            U[i,j] = Ah[i,j]
        end
    end
    L, U
end

# ========= Tests =========
function test_once(N::Int; T=Float32)
    A0_h = rand(T, N, N)
    A0_d = cu(copy(A0_h))
    A_lu_d, piv = lu_plu!(A0_d)
    Ah = Array(A_lu_d)
    L, U = extract_LU(Ah)
    P = permutation_from_pivots(piv)
    ok = isapprox(P * Float32.(A0_h), Float32.(L * U); rtol=5e-4, atol=5e-4)
    println("N=$N  ->  P*A ā‰ˆ L*U ? ", ok)
    return ok
end

# Try a few sizes
test_once(16)
test_once(64)
test_once(128)
test_once(2000)

It works because it’s only a single block doing all the work and synchronizing again and again, so it’s not how GPU code is normally written, but we’ll take it, I guess.

Yeah. That archive is not a great match for what you want. I did spend some time looking for something that focused in a clearer way on outlining a GPU LU algorithm, but didn’t find anything very helpful. LAWN 272 seems like it might be marginally helpful in that it discusses blocking.

It ultimately calls a Fortran subroutine. There is a C interface that you linked to in one of your other points. You want to look for sgetrf.f. Personally I browse through LAPACK routines here.

I don’t see any fundamental barrier to using the algorithm for another field, but you will need to implement your own matrix-matrix multiplication. Making that efficient is a good bit of work. For CPU, you probably can use FLINT. Regarding your questions about compiler flags and optimizations, this is really an algorithmic issue. It seems outside the scope of what a compiler optimization could be expected to do.

In terms of translating code, I haven’t used LLMs very much. But it’s not horribly difficult to do in Julia and I have done basic implementations of this in other languages, so I’ll give it a shot. Consider the following

using LinearAlgebra
using BenchmarkTools

# LU using level 2 BLAS.  (i.e. rank 1 updates)
function lu2!(A)
    m, n = size(A)
    p = collect(1:m)
    for k in 1:n
        # Partial pivoting
        _, jpiv = findmax(abs, A[k:m, k])
        jpiv += k - 1
        p[k], p[jpiv] = p[jpiv], p[k]
        for l in 1:n
            A[k,l], A[jpiv, l] = A[jpiv, l], A[k, l]
        end
        # Store multipliers in place for L
        for j in k+1:m
            A[j,k] /= A[k,k]
        end
        # Rank 1 update
        for l in k+1:n
            for j in k+1:m
                A[j,l] -= A[j,k] * A[k,l]
            end
        end
    end
    return p
end

# LU with a panel factorization and rank bs updates, where the block size
# bs is the number of columns in the panel.
function lu3!(A; bs = 64)
    m = LinearAlgebra.checksquare(A)
    b, rem = divrem(m, bs)
    p = collect(1:m)
    @views for l in 1:b
        j0 = (l-1) * bs + 1
        j1 = l * bs
        p2 = lu2!(A[j0:m, j0:j1]) # panel factorization
        Base.permuterows!(A[j0:m, j1+1:m], p2) # apply p2 outside of the panel
        Base.permuterows!(A[j0:m, 1:j0-1], p2)
        permute!(p[j0:m], p2) # update the permutation
        ldiv!(UnitLowerTriangular(A[j0:j1, j0:j1]), A[j0:j1, j1+1:m]) # rows of U
        mul!(A[j1+1:m, j1+1:m], A[j1+1:m, j0:j1], A[j0:j1, j1+1:m], -one(eltype(A)), 
             one(eltype(A))) # rank bs update
    end
    # factor any remaining columns
    @views if rem > 0
        j0 = b*bs + 1
        j1 = m
        p2 = lu2!(A[j0:m, j0:j1])
        Base.permuterows!(A[j0:m, 1:j0-1], p2)
        permute!(p[j0:m], p2)
    end
    return p, UnitLowerTriangular(A), UpperTriangular(A)
end

lu3! calls lu2! to do an in-place factorization of a panel of bs columns and then uses the resulting factorization to update the rest of the matrix using mul!. When I ran

n = 3000
A0 = randn(n,n)
p, L, U = lu3!(copy(A0))
@show opnorm(A0[p,:] - L*U, Inf) # check backward error.
@btime lu!(A) setup = (A = copy(A0))
@btime lu3!(A, bs = 128) setup = (A = copy(A0))
@btime lu2!(A) setup = (A = copy(A0))

I got

opnorm(A0[p, :] - L * U, Inf) = 5.569210674225e-11
  76.107 ms (4 allocations: 23.54 KiB)
  215.681 ms (8893 allocations: 35.22 MiB)
  7.965 s (8752 allocations: 34.66 MiB)

So my code is about 3 times slower than LAPACK, but a naive routine that relies solely on rank 1 updates is much slower than that. There’s probably some room to improve lu2!. The function lu3! is fairly close to what LAPACK does, but LAPACK has a function sgetrf2 (or dgetrf2 for double precision) that is recursive and should be better than what I did with lu2!. If I wanted to optimize this, I’d profile and try to get as much of the time spent in mul! as possible. The best value of bs does vary from machine to machine. I think something like this is probably closer to what you want to try to port to CUDA.
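
For reference, the recursive idea looks roughly like this. This is my own simplification, not a transcription of sgetrf2, and it omits pivoting entirely (so I test it on a symmetric positive definite matrix, where no pivoting is needed):

using LinearAlgebra

# Recursive LU without pivoting: factor the top-left block, solve the two
# triangular systems for the off-diagonal blocks, form the Schur complement,
# and recurse on it.  L and U are stored packed in A, as in lu2!/lu3!.
function lurec!(A)
    n = size(A, 1)
    n == 1 && return A
    h = n ÷ 2
    @views begin
        A11 = A[1:h, 1:h];    A12 = A[1:h, h+1:n]
        A21 = A[h+1:n, 1:h];  A22 = A[h+1:n, h+1:n]
        lurec!(A11)                                   # A11 becomes L11\U11 (packed)
        ldiv!(UnitLowerTriangular(A11), A12)          # A12 becomes U12 = L11 \ A12
        rdiv!(A21, UpperTriangular(A11))              # A21 becomes L21 = A21 / U11
        mul!(A22, A21, A12, -one(eltype(A)), one(eltype(A)))  # Schur complement
        lurec!(A22)                                   # recurse on the trailing block
    end
    return A
end

n = 512
B = randn(n, n);  A = B'B + I        # SPD, so no pivoting is required
F = lurec!(copy(A))
opnorm(A - UnitLowerTriangular(F) * UpperTriangular(F), Inf)  # small residual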


Very interesting!! Would you be so kind as to adjust the code so that lu3! accepts non-square matrices of non-full rank? (I don’t understand the algorithm well enough yet to do it myself.) For that, permuting rows and columns of A will be necessary. You could use

function _loc_max1(X::Matrix{tw}, pivot_weight::F=abs, i0=1, j0=1) where {tw, F<:Function}
    #= From first nonzero row in X, return the location of nonzero entry with largest value. =#
    @inbounds begin m,n=size(X)
    d_ = pivot_weight(X[i0,j0]);
    for i = i0 : m  
        j_ = j0
        for j = j0 : n 
            X[i,j]==0 && continue
            d = pivot_weight(X[i,j])
            if (X[i,j_] == 0 || d_ < d) 
                j_ = j;  d_ = d end end
        X[i,j_] ≠ 0 && return i, j_ end
    return i0, j0 end end;
function _loc_max2(X::Matrix{tw}, pivot_weight::F=abs, i0=1, j0=1) where {tw, F<:Function}
    #= From first nonzero column in X, return the location of nonzero entry with largest value. =#
    @inbounds begin m,n=size(X)
    d_ = pivot_weight(X[i0,j0])
    for j = j0 : n  
        i_ = i0
        for i = i0 : m 
            X[i,j] == 0 && continue
            d = pivot_weight(X[i,j])
            if (X[i_,j] == 0 || d_ < d) 
                i_ = i;  d_ = d end end
        X[i_,j] ≠ 0 && return i_,j end
    return i0, j0 end end;

I’ll see what I can do, although maybe not quickly. In the meantime, if you aren’t sure how the original code is working, making it more complicated doesn’t seem like the best next step. Given an m\times m matrix A we partition it as

A= \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix},

where A_{11} is b\times b (bs \times bs in the code). We pass the leading m\times b rectangular panel to lu2! to compute a factorization

P \begin{bmatrix} A_{11} \\ A_{21} \end{bmatrix} = \begin{bmatrix} L_{11} \\ L_{21} \end{bmatrix} U_{11}.

This is done in place, so that by the end of the panel
factorization in lu2! we have

\begin{bmatrix} A_{11}^{(1)} \\ A_{21}^{(1)} \end{bmatrix} = \begin{bmatrix} L_{11} \backslash U_{11} \\ L_{21} \end{bmatrix}

where the backslash indicates that the strictly lower triangular part of that block contains the multipliers from L_{11} and the upper triangular part contains U_{11}. I used superscripts to indicate that those blocks of A are modified and not what was there originally. lu2! then returns the permutation that was applied to the rows of A, but in fact it was just done to the panel, so we need to apply the permutation to the rest of the matrix (using Base.permuterows!) and apply it to our overall permutation for the full factorization (using permute!). At this point all columns have had P applied and, without loss of generality, let’s assume that this permutation was pre-applied to A. (It makes the notation simpler.)

At this point we finish computing some rows of U. A full LU factorization of A would be

\begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix} = \begin{bmatrix} L_{11} & 0 \\ L_{21} & L_{22} \end{bmatrix} \begin{bmatrix} U_{11} & U_{12} \\ 0 & U_{22} \end{bmatrix}

so we have L_{11} U_{12} = A_{12} which can be solved for U_{12}. You do that and store that where A_{12} was. (this is what ldiv! is doing in the code.) Finally, we don’t actually have a full factorization yet. What we have is a partial factorization

\begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix} = \begin{bmatrix} L_{11} & 0 \\ L_{21} & I \end{bmatrix} \begin{bmatrix} U_{11} & U_{12} \\ 0 & A_S \end{bmatrix}

where A_S is a Schur complement and can be computed from

A_S = A_{22} - L_{21} U_{12}

which is what mul! does, remembering that L_{21} is stored where A_{21} was and U_{12} where A_{12} was. You then repeat the process by partitioning A_S and doing a panel factorization in A_S, etc. The next panel gives you some more columns of L and rows of U in the same manner. If m is not divisible by b, there will be a smaller final panel with nothing to its right. In that case, all you need to do is the panel factorization and update the permutations.

Each panel factorization requires O(b^2m) operations and there are m/b panels, so the total cost of panel factorization is O(bm^2), which is much less than O(m^3) if the panel is not large. The overall cost of computing U_{12} is similar. If the panel is not too small, all the real work is done in mul!, which is highly optimized to use threads and make efficient use of cache (through blocking). Rank-1 updates are quite bad for this sort of thing, and most of the time will be spent on memory access instead of computation.

You need to choose b small enough to avoid spending too much time on the panel factorizations and large enough that you benefit from mul!, which isn’t going to show any benefits if you are just doing rank 1 or rank 2 updates. If lu2! is very inefficient, it can limit the best choice of b, which is not good for mul!. LAPACK uses a recursive factorization instead of my rank-1 update based lu2!. The recursive factorization should be a decent factorization algorithm in its own right, and that might be part of the performance difference with my code.
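
If you want to find a good bs empirically, a quick scan like this works (reusing lu3! and A0 from above; evals = 1 so each timing gets a fresh copy):

using BenchmarkTools
n = 3000;  A0 = randn(n, n);
for bs in (32, 64, 128, 256)
    t = @belapsed lu3!(A; bs = $bs) setup = (A = copy(A0)) evals = 1
    println("bs = $bs:  ", round(t * 1000, digits = 1), " ms")
end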


One other thought: Your pivoting scheme is not rank revealing in floating point arithmetic. That means that with that pivoting scheme, you can’t expect to have the rank revealed by a small Schur complement of size (m-r)\times (n-r), where r is the rank. It would be fine with a finite field. For floating point values, if you are looking at rank, you need complete pivoting or Rook pivoting, where you consider the possibility of using pivots initially outside of the panel. That is, you form columns of the panel dynamically as you choose pivots. This is not workable for complete pivoting, because you need to potentially scan the entire Schur complement to find the largest magnitude element, which is an O((m-k)(n-k)) operation, where k is the number of columns you have factored so far. That’s more than computing the panel factorization, which is O(b^2 m), and is going to kill performance.

So you would need to do Rook pivoting in floating point. (So named because it scans rows and columns sequentially to find a pivot that is the largest element in both its row and column.)
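
To illustrate just the search, here is a toy sketch on a fully updated matrix, ignoring ties (which real code has to guard against) and the deferred-update complication discussed in the edit below:

# Find (i, j) with i, j ≥ k such that |A[i, j]| is the largest entry in both
# its row and its column, alternately scanning a column and then a row.
function rook_pivot(A, k)
    m, n = size(A)
    j = k
    _, r = findmax(abs, @view A[k:m, j]);  i = r + k - 1
    while true
        _, c = findmax(abs, @view A[i, k:n]);  jnew = c + k - 1
        jnew == j && return (i, j)     # already maximal in its row
        j = jnew
        _, r = findmax(abs, @view A[k:m, j]);  inew = r + k - 1
        inew == i && return (i, j)     # already maximal in its column
        i = inew
    end
end

A = randn(8, 8)
rook_pivot(A, 3)   # e.g. (5, 7); |A[i,j]| grows on every move, so this terminates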

Edit: The more I think about this, the messier it sounds, even for Rook pivoting. To choose pivots in the panel factorization, you are going to have to have applied previous rank one modifications to specific columns and rows in the Schur complement of the full matrix that you might never include in your panel. Maybe you could copy the columns and rows you are scanning for pivots and do provisional updates to them to determine pivot magnitudes and then only finalize the updates to a column in which you found a suitable pivot and are therefore going to include in the panel. But it all seems quite painful to implement.


@mstewart Thank you for the explanations, those are invaluable. Yeah, I know that rank over floats is problematic, but my main focus is finite fields. Linear algebra over floats is very well done, chances of me improving something are next to none. But over finite fields, there is plenty of room for improvement.

Please give me a few days, in the evenings I will try to digest your explanations and edit your lu2!, lu3! to work for rectangular non-full rank matrices. Will report how it goes.

It might not be much help, since it sounds like you don’t know Fortran, but there is a Matlab package, LURP, that does Rook pivoting in the way I was trying to describe above (i.e. provisionally constructing rows and columns in the full Schur complement to look for pivots for the panel factorization). The Matlab part is just an interface. All the computational code is in Fortran.

Hey, I refactored the code a bit to minimize allocations and avoid bounds checking, which gave a 25% speedup on large matrices. I also renamed variables to make it easier for me to understand the code; apologies for that. Now it works for m\times n matrices, but I haven’t yet figured out how to remove the restriction that the matrix has full rank.

Here is an explanation of the algorithm that is easier for me to digest. Let L(X) / U(X) be the unit-lower / upper triangular part of X. Partition the matrix into

X = [A\; B;C\; D] = \begin{bmatrix} A & B \\ C & D \end{bmatrix}

where A has size t\times t. Run lu_rank_1! on [A;\, C] to get [a;\, c], so that [L(a);\, c]\cdot U(a) = [A;\, C]. Modify B = L(a)^{-1}B to obtain b, so that

L(X)\,U(X) = \begin{bmatrix} L(a) & 0 \\ c & L(d) \end{bmatrix} \begin{bmatrix} U(a) & b \\ 0 & U(d) \end{bmatrix} = \begin{bmatrix} A & L(a)\,L(a)^{-1}B \\ c\,U(a) & c\,b + L(d)\,U(d) \end{bmatrix} = \begin{bmatrix} A & B \\ C & c\,b + L(d)\,U(d) \end{bmatrix}.

Modify D = D - c\,b to obtain d. Now we have a partial factorization

\begin{bmatrix} L(a) & 0 \\ c & I \end{bmatrix} \begin{bmatrix} U(a) & b \\ 0 & d \end{bmatrix} = X,

where only d fails to be upper triangular. Hence we repeat the procedure on d and continue until the remaining chunk d is smaller than t\times t. Notice that when t=1, the algorithm is the same as in lu_rank_1!.

using LinearAlgebra
function _loc_pivot(X::AbstractMatrix{tw}, i0=1, j0=1, i1=size(X,1), j1=size(X,2)) where {tw}
   #= Search column-wise for the first nonzero entry in X and return its location. 
   Intended for finite fields. =#
   @inbounds for j=j0:j1, i=i0:i1  X[i,j]≠0 && return i,j end;
   return i0,j0 end;
function _loc_pivot(X::AbstractMatrix{tw}, i0=1, j0=1, i1=size(X,1), j1=size(X,2)) where {tw<:AbstractFloat}
   #= From first nonzero column in X, return the location of nonzero entry with 
   largest  absolute value. Intended for floats. =#
   @inbounds begin d_ = abs(X[i0,j0])
   for j=j0:j1  i_=i0
      for i=i0:i1
         X[i,j] == 0 && continue;  d = abs(X[i,j])
         if (X[i_,j]==0 || d_<d)  i_ = i;  d_ = d end end
      X[i_,j] ≠ 0 && return i_,j end
   return i0, j0 end end;
function _swap_col!(X::AbstractMatrix{tw}, j1, j2) where {tw}
   @inbounds begin m=size(X,1);   (j1==j2 || m<1) && return;     
   @simd for i=1:m  X[i,j1], X[i,j2] = X[i,j2], X[i,j1] end;   
   return end end;
function _swap_row!(X::AbstractMatrix{tw}, i1, i2) where {tw}
   @inbounds begin n=size(X,2);   (i1==i2 || n<1) && return;
   @simd for j=1:n  X[i1,j], X[i2,j] = X[i2,j], X[i1,j] end;   
   return end end;
function _getLU(X::AbstractMatrix{tw}, r=min(size(X)...)) where {tw}
   @inbounds begin m,n=size(X);  @assert 0≤r≤min(m,n);  
   L = zeros(tw,m,r);  U = zeros(tw,r,n);  
   for j=1:r, i=j+1:m  L[i,j] = X[i,j] end
   for i=1:r, j=i+1:n  U[i,j] = X[i,j] end
   for k=1:r  U[k,k] = X[k,k];  L[k,k] = 1; end
   return L, U end end;

function lu_rank_1!(X, σ, τ, k0=1, m=size(X,1), n=size(X,2))
   @inbounds begin r=0;  # r=rank
   @assert 1≤k0;  @assert m≤size(X,1);  @assert n≤size(X,2);  
   for k=k0:min(m,n)
      i,j = _loc_pivot(X,k,k,m,n);
      X[i,j]==0 && break;  r+=1;
      if i≠k  _swap_row!(X,k,i);  σ[k],σ[i] = σ[i],σ[k] end  # edits whole X
      if j≠k  _swap_col!(X,k,j);  τ[k],τ[j] = τ[j],τ[k] end  # edits whole X
      xkk = inv(X[k,k]);   
      @simd for i=k+1:m  X[i,k] *= xkk end  # store multipliers in-place for L
      for j=k+1:n  xkj=X[k,j]; 
         @simd for i=k+1:m  X[i,j] -= X[i,k] * xkj end end end  # rank-1 update
   return r end end;

function lu_rank_t!(X::AbstractMatrix{tw}, σ, τ, t=128) where {tw}
   @inbounds begin m,n=size(X);  q=min(m,n)÷t;  r=0;
   @views for k=1:q
      k0, k1 = (k-1)*t+1, k*t
      r += lu_rank_1!(X, σ, τ, k0, m, k1)  # lu_rank_1!([A;C])
      ldiv!(UnitLowerTriangular(X[k0:k1,k0:k1]), X[k0:k1,k1+1:n])  # B = L(a)⁻¹*B
      mul!(X[k1+1:m,k1+1:n], X[k1+1:m,k0:k1], X[k0:k1,k1+1:n], -tw(1), tw(1)) end  # D -= c*b
   r += lu_rank_1!(X, σ, τ, q*t+1, m, n)  # factor remaining columns q*t+1:n
   return r end end;

julia> m=n=10_000;  X=rand(-1:1e-6:1,m,n); 

julia> Y=copy(X);  σ,τ = (Int.(1:d) for d∈(m,n));

julia> @time lu_rank_t!(Y,σ,τ);  L,U=_getLU(Y);  opnorm(X[σ,τ]-L*U,Inf)
  4.826301 seconds (1 allocation: 16 bytes)
2.662533134912366e-10

Over floats, the performance is super good due to mul!. Over finite fields, it is awfully slow. Could you please share any words of wisdom on how I could implement a good mul! for finite fields? It doesn’t have to be perfect, like in BLAS, but a good approximation. Is this doable in 1K or 2K lines of code? Would I have to know IR and LLVM intrinsics?

Also, how could I modify the above function to gracefully handle non-full rank? (currently it is incorrect for those cases) If lu_rank_1!([A;C]) returns rank r<t, do I just swap columns r+1:t with the columns at the end of the matrix and do lu_rank_1!([A;C]) again on the new [A;C]? It could happen that this might be repeated many times.


I do find some of your notation confusing. I’m not sure I know what the difference is between all the lower and upper case letters. In [L(a);c]*[U(a)] = [A;C] it seems clear that L(a) is part of an LU panel factorization of [A;C], but you aren’t using similar notation for c, which should also be part of a lower triangular factor. In any event, from the code, it does seem like you have the idea. I would have thought it would be a bit faster than you are seeing, but I am on a 13 year old computer at the moment and probably shouldn’t benchmark anything. You might want to compare it to lu!.

I don’t work with finite fields, but I think the principles are the same. Lower level optimizations can help a bit, but first and foremost you need to structure your algorithm to make good utilization of cache. Typically this involves further blocking. The simplest thing is probably a cache oblivious recursive algorithm. In that regard, I think these posts might be helpful

The second link with stevengj’s implementation of a cache oblivious algorithm is probably the lowest barrier to entry in trying to get a faster matrix multiply, but it doesn’t seem like a great starting place for CUDA. The older version of the book ā€œProgramming Massively Parallel Processors: A Hands-on Approachā€ by Wen-mei W. Hwu, David B. Kirk, and Izzat El Hajj works through a blocked matrix multiplication as an example and explains the importance of following that approach. That was in the 1st edition; I don’t know about the 4th, but it’s a nice example, so I would guess they probably kept it. That would be closer to what you ultimately want, although I think it would be informative to play around with the recursive algorithm.
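
To give a flavor of the blocked approach for your finite-field case, here is a single-threaded tiled sketch of the update C ← (C − A·B) mod p (my naming and tile size; it accumulates in Int and reduces once per k-tile, so UInt8/UInt16 storage never overflows):

function matmul_submod!(C::Matrix{T}, A::Matrix{T}, B::Matrix{T}, p::Integer; ts::Int = 64) where {T<:Unsigned}
   m, n = size(C);  kd = size(A, 2)
   buf = zeros(Int, ts, ts)                     # Int accumulator for one tile of C
   @inbounds for j0 in 1:ts:n, i0 in 1:ts:m
      il = min(i0 + ts - 1, m) - i0 + 1
      jl = min(j0 + ts - 1, n) - j0 + 1
      fill!(buf, 0)
      for k0 in 1:ts:kd
         kl = min(k0 + ts - 1, kd) - k0 + 1
         for j in 1:jl, k in 1:kl                # jki order: the inner loop is contiguous
            b = Int(B[k0 + k - 1, j0 + j - 1])
            @simd for i in 1:il
               buf[i, j] += Int(A[i0 + i - 1, k0 + k - 1]) * b
            end
         end
         for j in 1:jl                           # reduce so the accumulator stays < p
            @simd for i in 1:il
               buf[i, j] = mod(buf[i, j], p)
            end
         end
      end
      for j in 1:jl, i in 1:il                   # mod also handles the negative case
         C[i0 + i - 1, j0 + j - 1] = T(mod(Int(C[i0 + i - 1, j0 + j - 1]) - buf[i, j], p))
      end
   end
   return C
end

p = 11;  m = 256
A, B, C = (rand(UInt8.(0:p-1), m, m) for _ in 1:3)
D = copy(C)
matmul_submod!(C, A, B, p)
C == UInt8.(mod.(Int.(D) .- Int.(A) * Int.(B), p))   # true

The jki tile order plays the same role as in the cache oblivious version: the inner loop walks down a column, which is contiguous for a column-major Matrix. Threading and tuning ts would be the next steps.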

I do not think this is something that is going to end up being graceful or all that easy to describe. The short answer is that, you do need to swap in columns from the end of the matrix. A longer version is: In a finite field, all you need is a nonzero pivot, which simplifies things a bit. But the complication is that if you have done k steps (where k is less than the number of columns in the panel) of the panel factorization on [A;C] and find that you have no nonzero pivots, then you need to look elsewhere in the matrix. But the columns corresponding to [B;D] have not had any elimination steps applied to them. You can’t just look at the unmodified columns to find a nonzero pivot. You need to do the elimination steps on any columns in which you hope to find a pivot as if they were part of the [A;C] panel all along. But you can’t do them to the whole matrix [B;D] at that stage because you are trying to defer that computation until after you have factored the panel so you can use level 3 BLAS (aka gemm or mul!). So one option is to start doing updates to individual columns to try to find a nonzero pivot.

The rook pivoting code I linked to does this by updating individual columns and rows until a suitable pivot is found. That’s a reasonable strategy in floating point, because you are likely to find an acceptable pivot (one that is the largest magnitude element in its column and row), even if the matrix is numerically singular and every possible pivot is very small. If you were very unlucky and kept finding exactly zero pivots, you could even perturb a matrix element to create a nonzero pivot without compromising stability.

However, in a finite field, it could be that after you are part way through your panel factorization, there are no nonzero pivots anywhere, even in [B;D] (once suitable updates have been made to those columns). Checking them one column at a time will kill your performance. I don’t know what the best strategy would be for a finite field. Perhaps you can just remove the columns that have no nonzero pivots from your panel factorization and bring in a significant number of other columns, update those as if they were part of [A;C] originally, and then look for pivots there. But it will be messy, and not a simple modification of what you have. And I’m just guessing at what might be reasonable to try. People who actually work with finite fields might have their own tricks for this. Maybe you could look and see what is done in FLINT.