How to parallelize `lu!` factorization?

I have a generic implementation of the LU factorization of a rectangular matrix:

function _swap_col!(X, j1, j2)
    m = size(X, 1)
    (j1 == j2 || m < 1) && return nothing
    @inbounds @simd for i = 1:m
        X[i, j1], X[i, j2] = X[i, j2], X[i, j1]
    end
end

function _swap_row!(X, i1, i2)
    n = size(X, 2)
    (i1 == i2 || n < 1) && return nothing
    @inbounds @simd for j = 1:n
        X[i1, j], X[i2, j] = X[i2, j], X[i1, j]
    end
end

function _loc_pivot(X, i0, j0)
    # In the first nonzero row of the submatrix X[i0:end, j0:end],
    # find the location of the entry with the largest absolute value.
    @inbounds begin
        m, n = size(X)
        d_ = abs(X[i0, j0])
        for i = i0:m
            j_ = j0
            for j = j0:n
                X[i, j] == 0 && continue
                d = abs(X[i, j])
                if X[i, j_] == 0 || d_ < d
                    j_ = j;  d_ = d
                end
            end
            X[i, j_] ≠ 0 && return i, j_   # row i is nonzero: pivot found
        end
        return i0, j0                      # the whole submatrix is zero
    end
end

function factor_LU!(X, σ, τ)
    m, n = size(X)
    mn = min(m, n)
    for k = 1:mn
        i, j = _loc_pivot(X, k, k)
        X[i, j] == 0 && break   # if X[i,j] is zero, then X[k:m,k:n] is zero
        if i ≠ k  _swap_row!(X, k, i);  σ[k], σ[i] = σ[i], σ[k]  end
        if j ≠ k  _swap_col!(X, k, j);  τ[k], τ[j] = τ[j], τ[k]  end
        xkk = X[k, k]
        @simd for i = k+1:m  X[i, k] /= xkk  end
        for j = k+1:n        # rank-1 update of the trailing submatrix
            xkj = X[k, j]
            @simd for i = k+1:m
                X[i, j] -= X[i, k] * xkj
            end
        end
    end
end

function getLU(X::Matrix{tw}) where tw
    # unpack the packed factorization: unit lower-triangular L, upper-triangular U
    m, n = size(X);  r = min(m, n)
    L = zeros(tw, m, r);  U = zeros(tw, r, n)
    for k = 1:r                    L[k, k] = tw(1)   end
    for i = 1:m, j = 1:min(i-1, r) L[i, j] = X[i, j] end
    for i = 1:r, j = i:n           U[i, j] = X[i, j] end
    return L, U
end

It works efficiently and correctly:

julia> using LinearAlgebra
julia> X = rand(-1:1e-6:1,1000,1000);  Y = copy(X);  σ,τ = (collect(1:d) for d ∈ size(X));  
julia> @time factor_LU!(Y,σ,τ);   L,U=getLU(Y);  norm(X[σ,τ]-L*U)
  0.324735 seconds (1 allocation: 32 bytes)
3.755177227563265e-12

But for large matrices (m,n>10_000), it is still slow. So I’d like to parallelize it.

If I just put `@threads` in front of `for j=k+1:n`, there is no improvement, since threads are spawned anew for each `k`, which incurs a time penalty. So I’d like to spawn the threads once, outside the outer loop, and reuse them in the inner loop. My attempt, with ChatGPT:

using Base.Threads

function factor_LU_parallel!(X, σ, τ)
    @inbounds begin
        m, n = size(X);  mn = min(m, n)

        nt = nthreads()
        jobs = Channel{Tuple{Int,Int,Int}}(nt)
        nactive = Atomic{Int}(0)
        workers = [  # spawn threads that wait to receive work
            @spawn while true
                job = try take!(jobs) catch; break end
                k, j0, j1 = job
                for j = j0:j1
                    xkj = X[k, j]
                    for i = k+1:m  X[i, j] -= X[i, k] * xkj  end
                end
                atomic_sub!(nactive, 1)  # subtract 1 from the number of unfinished jobs
            end for _ = 1:nt]

        for k = 1:mn
            i, j = _loc_pivot(X, k, k)
            X[i, j] == 0 && break
            if i ≠ k  _swap_row!(X, k, i);  σ[k], σ[i] = σ[i], σ[k]  end
            if j ≠ k  _swap_col!(X, k, j);  τ[k], τ[j] = τ[j], τ[k]  end
            xkk = X[k, k]
            @simd for i = k+1:m  X[i, k] /= xkk  end

            bs = max(1, cld(n - k, nt))   # split k+1:n into nt disjoint chunks
            for j = k+1:bs:n              # at each time, there are at most nt jobs open
                atomic_add!(nactive, 1)
                put!(jobs, (k, j, min(j + bs - 1, n)))
            end
            while nactive[] > 0  sleep(1e-9)  end  # wait until all jobs are done
        end
        close(jobs)
        foreach(fetch, workers)
    end
end

This is slower, has lock conflicts, and gives an incorrect result:

julia> X = rand(-1:1e-6:1,1000,1000);  Y = copy(X);  σ,τ = (collect(1:d) for d ∈ size(X));  
julia> @time factor_LU_parallel!(Y,σ,τ);   L,U=getLU(Y);  norm(X[σ,τ]-L*U)
  2.622632 seconds (3.83 k allocations: 125.250 KiB, 0.40% gc time, 34037 lock conflicts)
3311.8813323932263

I’d appreciate help in making `factor_LU!` more performant.

I think you can’t do it this way, unfortunately. I would go for block LU and parallelize the execution on each block; see “Example: LU factorization” in the Task-based parallelism in scientific computing documentation if you know a little bit of C++, or, if you prefer a paper, look at https://onlinelibrary.wiley.com/doi/10.1155/2019/3720450.

You did kind of do that with your tasks, but not exactly.
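
To make the blocked idea concrete, here is a rough sketch (not the code from the links; the name block_lu!, the block size b, and the use of `lu!(…, NoPivot())` on the diagonal block are all illustrative assumptions): a right-looking blocked LU without pivoting, which factors a diagonal block, solves the two panels, and then updates the trailing submatrix in parallel, one column block per task.

using LinearAlgebra, Base.Threads

function block_lu!(A, b = 64)
    n = size(A, 1)
    for k = 1:b:n
        K = k:min(k + b - 1, n)
        lu!(view(A, K, K), NoPivot())   # unblocked LU of the diagonal block
        R = last(K)+1:n
        isempty(R) && break
        # panel solves: A[K,R] = L_KK \ A[K,R] and A[R,K] = A[R,K] / U_KK
        ldiv!(UnitLowerTriangular(view(A, K, K)), view(A, K, R))
        rdiv!(view(A, R, K), UpperTriangular(view(A, K, K)))
        # trailing update A[R,R] -= A[R,K]*A[K,R], parallel over column blocks
        @threads for j = first(R):b:n
            J = j:min(j + b - 1, n)
            mul!(view(A, R, J), view(A, R, K), view(A, K, J), -one(eltype(A)), one(eltype(A)))
        end
    end
    return A
end

Without pivoting this breaks down on zero (or tiny) pivots, so a real implementation pivots within each panel, as in the links above.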

3 Likes

Surely there is a way in Julia to spawn worker threads, reuse them in the inner loop, then wait until they all finish before proceeding to the next k.
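
For reference, the plain “spawn, then wait” version of the inner-loop update can be written with `@sync`/`@spawn`; it re-creates tasks at each k rather than reusing them, but the barrier is exactly the wait described above, since `@sync` returns only after every task spawned inside it has finished. A minimal sketch, with update_cols_sync! a hypothetical name:

using Base.Threads

function update_cols_sync!(X, k)
    m, n = size(X)
    bs = max(1, cld(n - k, nthreads()))        # one chunk of columns per thread
    @sync for cols in Iterators.partition(k+1:n, bs)
        @spawn for j in cols                   # tasks write to disjoint columns
            xkj = X[k, j]
            @inbounds @simd for i = k+1:m
                X[i, j] -= X[i, k] * xkj
            end
        end
    end
    return nothing
end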

The issue is that you still need to know X[k,i] to calculate X[k+1,i], right?

Yes. But by the time the parallelization starts, X[k,i] is only read from, not modified, so there’s no synchronization necessary.

Well, I’m pretty sure this is where it locks: `X[i,j] -= X[i,k]*xkj`

Umm, I’m not very familiar with locks. Why do they even appear here? ChatGPT says:

A lock is a synchronization primitive used to ensure that only one task or thread accesses a shared resource (e.g., a variable or data structure) at a time. In Julia, the most common lock is:

lk = ReentrantLock()
lock(lk)
try
    # critical section
finally
    unlock(lk)
end

But I’ve never used any of this. Is it built into channels? In my case, different threads read the same data (X[i,k] and X[k,j]) but write to different columns of X.

Sorry, I mean a thread may access the X[i,k] you’re changing before you’ve changed it; not a lock, sorry. A good test would be to do that access atomically (you will get the worst performance), so you’ll know whether the wrong results come from this. I’m not familiar with Channels, though.
I’m just afraid it’s the algorithm itself that is incremental, and thus unparallelizable as written.

Oh, that. Yeah, at the end of each iteration of the k-loop, I must wait until all jobs are finished before proceeding to k+1. If that waiting works, it shouldn’t happen that X[i,k] is read before it is altered, right?

Are you suggesting changing X to contain only atomic values and then running my parallel code to see whether the result is correct?

That won’t help, actually. I don’t think two threads can write to it, so it’s not a race condition; it’s just that the algorithm is not suited to parallelism in this form. I think you need to do the math to see what you must know before finding the next coefficient, which will tell you what can be done in parallel, if it’s even possible. In my mind, the only two ways of doing parallel LU are block-wise, as on GPUs and in OpenBLAS, or recursively, as in RecursiveFactorization.jl.

I think your idea was: “I can parallelize the inner loop, but spawning each time is costly, so I’ll gather the work and run it all afterwards.” But since one iteration needs the previous one, this won’t work; and even if you build the full computational graph, I don’t think you’ll be able to go faster.

Also found a video: https://www.youtube.com/watch?v=E8aBJsC0bY8

Btw: I also tried to make ChatGPT do exactly that, but it never worked (I think in my case it was on GPU with KernelAbstractions.jl). I think this is an example of an algorithm still beating AI, for now :sweat_smile:

1 Like

You might look at the generic `lu` in LinearAlgebra.jl.

If you thread the critical loop in lines 260-264 and put an `@inbounds @simd` on the inner loop in lines 261-263, you should have something to compare to.

I did this to get something to work in half precision. See `hlu!.jl`, which is part of my package MultiPrecisionArrays.jl.
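
For concreteness, threading that critical loop would look something like this (a sketch with hypothetical names, not the actual code from LinearAlgebra.jl or hlu!.jl):

using Base.Threads

function lu_update!(A, k)
    m, n = size(A)
    @threads for j = k+1:n          # thread the critical loop over columns
        Akj = A[k, j]
        @inbounds @simd for i = k+1:m
            A[i, j] -= A[i, k] * Akj
        end
    end
end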

1 Like

Thanks for the links. I’ll wait for someone experienced with channels, locks, and waiting to tell me how I could wait for the jobs to finish in each iteration and improve the performance.

Using something like OhMyThreads makes this easier.
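
For example, the per-k column update could be written with OhMyThreads’ `tforeach`, which chunks the range across threads and returns only once all chunks are done, so the per-iteration barrier comes for free. A minimal sketch (update_cols! is a hypothetical name):

using OhMyThreads: tforeach

function update_cols!(X, k)
    m, n = size(X)
    tforeach(k+1:n) do j           # chunked over threads; blocks until all are done
        xkj = X[k, j]
        @inbounds @simd for i = k+1:m
            X[i, j] -= X[i, k] * xkj
        end
    end
end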

The point is that the algorithm you’re using doesn’t parallelize well. You need to switch LU algorithms before parallelization can be effective.

2 Likes

I’m still interested in the approach I envisioned (spawning threads outside and reusing them), but to which LU algorithm should I switch? @Oscar_Smith, I benchmarked BLAS/LAPACK vs RecursiveFactorization vs my implementation on 20Kx20K matrices, and the timings were 33s vs 700s vs 2000s. So I’m figuring that if I could multithread as I proposed, with 12 cores I could perhaps outperform RF.
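
A sketch of how such a comparison can be run (assuming random matrices as above; the original setup may have differed):

using LinearAlgebra, RecursiveFactorization

X = rand(-1:1e-6:1, 20_000, 20_000)
Y = copy(X);  @time lu!(Y);                          # LAPACK / multithreaded BLAS
Y = copy(X);  @time RecursiveFactorization.lu!(Y);
Y = copy(X);  σ, τ = (collect(1:d) for d ∈ size(X))
@time factor_LU!(Y, σ, τ);                           # the generic code above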

Do a search for “parallelizing LU” and you’ll find many links. Be aware that there’s a lot more to making LU fast (e.g. blocking for cache) — the textbook 3-loop algorithms aren’t really a good starting point.