How to parallelize `lu!` factorization?

I have a generic implementation of the LU factorization of a rectangular matrix:

function _swap_col!(X, j1, j2)
    m = size(X, 1)
    (j1 == j2 || m < 1) && return nothing
    @inbounds @simd for i = 1:m
        X[i, j1], X[i, j2] = X[i, j2], X[i, j1]
    end
end

function _swap_row!(X, i1, i2)
    n = size(X, 2)
    (i1 == i2 || n < 1) && return nothing
    @inbounds @simd for j = 1:n
        X[i1, j], X[i2, j] = X[i2, j], X[i1, j]
    end
end

function _loc_pivot(X, i0, j0)
    # In the first nonzero row of the submatrix X[i0:end, j0:end],
    # find the location of the entry with the largest absolute value.
    @inbounds begin
        m, n = size(X)
        d_ = abs(X[i0, j0])
        for i = i0:m
            j_ = j0
            for j = j0:n
                X[i, j] == 0 && continue
                d = abs(X[i, j])
                if X[i, j_] == 0 || d_ < d
                    j_ = j;  d_ = d
                end
            end
            X[i, j_] ≠ 0 && return i, j_   # row i is nonzero: pivot found
        end
        return i0, j0                      # the whole submatrix is zero
    end
end

function factor_LU!(X, σ, τ)
    m, n = size(X)
    mn = min(m, n)
    for k = 1:mn
        i, j = _loc_pivot(X, k, k)
        X[i, j] == 0 && break   # if X[i,j] is zero, then X[k:m,k:n] is zero
        if i ≠ k  _swap_row!(X, k, i);  σ[k], σ[i] = σ[i], σ[k]  end
        if j ≠ k  _swap_col!(X, k, j);  τ[k], τ[j] = τ[j], τ[k]  end
        xkk = X[k, k]
        @simd for i = k+1:m  X[i, k] /= xkk  end
        for j = k+1:n        # rank-1 update of the trailing submatrix
            xkj = X[k, j]
            @simd for i = k+1:m
                X[i, j] -= X[i, k] * xkj
            end
        end
    end
end

function getLU(X::Matrix{tw}) where tw
    # unpack the packed factorization: unit lower-triangular L, upper-triangular U
    m, n = size(X);  r = min(m, n)
    L = zeros(tw, m, r);  U = zeros(tw, r, n)
    for k = 1:r                    L[k, k] = tw(1)   end
    for i = 1:m, j = 1:min(i-1, r) L[i, j] = X[i, j] end
    for i = 1:r, j = i:n           U[i, j] = X[i, j] end
    return L, U
end

It works efficiently and correctly:

julia> using LinearAlgebra
julia> X = rand(-1:1e-6:1,1000,1000);  Y = copy(X);  σ,τ = (collect(1:d) for d ∈ size(X));  
julia> @time factor_LU!(Y,σ,τ);   L,U=getLU(Y);  norm(X[σ,τ]-L*U)
  0.324735 seconds (1 allocation: 32 bytes)
3.755177227563265e-12

But for large matrices (m,n>10_000), it is still slow. So I’d like to parallelize it.

If I just put `@threads` in front of `for j=k+1:n`, there is no improvement, since threads are spawned anew for each `k`, which incurs a time penalty. So I’d like to spawn the threads once, outside the outer loop, and reuse them in the inner loop. My attempt, with ChatGPT:

using Base.Threads

function factor_LU_parallel!(X, σ, τ)
    @inbounds begin
        m, n = size(X);  mn = min(m, n)

        nt = nthreads()
        jobs = Channel{Tuple{Int,Int,Int}}(nt)
        nactive = Atomic{Int}(0)
        workers = [  # spawn threads that wait to receive work
            @spawn while true
                job = try take!(jobs) catch; break end
                k, j0, j1 = job
                for j = j0:j1
                    xkj = X[k, j]
                    for i = k+1:m  X[i, j] -= X[i, k] * xkj  end
                end
                atomic_sub!(nactive, 1)  # subtract 1 from the number of unfinished jobs
            end for _ = 1:nt]

        for k = 1:mn
            i, j = _loc_pivot(X, k, k)
            X[i, j] == 0 && break
            if i ≠ k  _swap_row!(X, k, i);  σ[k], σ[i] = σ[i], σ[k]  end
            if j ≠ k  _swap_col!(X, k, j);  τ[k], τ[j] = τ[j], τ[k]  end
            xkk = X[k, k]
            @simd for i = k+1:m  X[i, k] /= xkk  end

            bs = max(1, cld(n - k, nt))   # split k+1:n into nt disjoint chunks
            for j = k+1:bs:n              # at each time, there are at most nt jobs open
                atomic_add!(nactive, 1)
                put!(jobs, (k, j, min(j + bs - 1, n)))
            end
            while nactive[] > 0  sleep(1e-9)  end  # wait until all jobs are done
        end
        close(jobs)
        foreach(fetch, workers)
    end
end

This is slower, has lock conflicts, and gives an incorrect result:

julia> X = rand(-1:1e-6:1,1000,1000);  Y = copy(X);  σ,τ = (collect(1:d) for d ∈ size(X));  
julia> @time factor_LU_parallel!(Y,σ,τ);   L,U=getLU(Y);  norm(X[σ,τ]-L*U)
  2.622632 seconds (3.83 k allocations: 125.250 KiB, 0.40% gc time, 34037 lock conflicts)
3311.8813323932263

I’d appreciate help in making `factor_LU!` more performant.

I think you can’t do it this way, unfortunately. I would go for block LU and parallelize the execution on each block; see “Example: LU factorization” in the Task-based parallelism in scientific computing documentation if you know a little bit of C++, or, if you prefer a paper, look at https://onlinelibrary.wiley.com/doi/10.1155/2019/3720450.

You did kind of do that with your tasks, but not exactly.
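
To make the blocked idea concrete, here is a rough sketch (not the code from the links; the name block_lu!, the block size b, and the use of `lu!(…, NoPivot())` on the diagonal block are all illustrative assumptions): a right-looking blocked LU without pivoting, which factors a diagonal block, solves the two panels, and then updates the trailing submatrix in parallel, one column block per task.

using LinearAlgebra, Base.Threads

function block_lu!(A, b = 64)
    n = size(A, 1)
    for k = 1:b:n
        K = k:min(k + b - 1, n)
        lu!(view(A, K, K), NoPivot())   # unblocked LU of the diagonal block
        R = last(K)+1:n
        isempty(R) && break
        # panel solves: A[K,R] = L_KK \ A[K,R] and A[R,K] = A[R,K] / U_KK
        ldiv!(UnitLowerTriangular(view(A, K, K)), view(A, K, R))
        rdiv!(view(A, R, K), UpperTriangular(view(A, K, K)))
        # trailing update A[R,R] -= A[R,K]*A[K,R], parallel over column blocks
        @threads for j = first(R):b:n
            J = j:min(j + b - 1, n)
            mul!(view(A, R, J), view(A, R, K), view(A, K, J), -one(eltype(A)), one(eltype(A)))
        end
    end
    return A
end

Without pivoting this breaks down on zero (or tiny) pivots, so a real implementation pivots within each panel, as in the links above.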

3 Likes

Surely there is a way in Julia to spawn worker threads, reuse them in the inner loop, then wait until they all finish before proceeding to the next k.
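
For reference, the plain “spawn, then wait” version of the inner-loop update can be written with `@sync`/`@spawn`; it re-creates tasks at each k rather than reusing them, but the barrier is exactly the wait described above, since `@sync` returns only after every task spawned inside it has finished. A minimal sketch, with update_cols_sync! a hypothetical name:

using Base.Threads

function update_cols_sync!(X, k)
    m, n = size(X)
    bs = max(1, cld(n - k, nthreads()))        # one chunk of columns per thread
    @sync for cols in Iterators.partition(k+1:n, bs)
        @spawn for j in cols                   # tasks write to disjoint columns
            xkj = X[k, j]
            @inbounds @simd for i = k+1:m
                X[i, j] -= X[i, k] * xkj
            end
        end
    end
    return nothing
end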

The issue is that you still need to know X[k,i] to calculate X[k+1,i], right?

Yes. But by the time the parallelization starts, X[k,i] is only read from, not modified, so there’s no synchronization necessary.

Well, I’m pretty sure this is where it locks: `X[i,j] -= X[i,k]*xkj`

Umm, I’m not very familiar with locks. Why do they even appear here? ChatGPT says:

A lock is a synchronization primitive used to ensure that only one task or thread accesses a shared resource (e.g., a variable or data structure) at a time. In Julia, the most common lock is:

lk = ReentrantLock()
lock(lk)
try
    # critical section
finally
    unlock(lk)
end

But I’ve never used any of this. Is it built into channels? In my case, different threads read the same data (X[i,k] and X[k,j]) but write to different columns of X.

Sorry, I mean a thread may access the X[i,k] you’re changing before you’ve changed it; not a lock, sorry. A good test would be to do that access atomically (you will get the worst performance), so you’ll know whether the wrong results come from this. I’m not familiar with Channels, though.
I’m just afraid it’s the algorithm itself that is incremental, and thus unparallelizable as written.

Oh, that. Yeah, at the end of each iteration of the k-loop, I must wait until all jobs are finished before proceeding to k+1. If that waiting works, it shouldn’t happen that X[i,k] is read before it is altered, right?

Are you suggesting changing X to contain only atomic values and then running my parallel code to see whether the result is correct?

That won’t help, actually. I don’t think two threads can write to it, so it’s not a race condition; it’s just that the algorithm is not suited to parallelism in this form. I think you need to do the math to see what you must know before finding the next coefficient, which will tell you what can be done in parallel, if it’s even possible. In my mind, the only two ways of doing parallel LU are block-wise, as on GPUs and in OpenBLAS, or recursively, as in RecursiveFactorization.jl.

I think your idea was: “I can parallelize the inner loop, but spawning each time is costly, so I’ll gather the work and run it all afterwards.” But since one iteration needs the previous one, this won’t work; and even if you build the full computational graph, I don’t think you’ll be able to go faster.

Also found a video: https://www.youtube.com/watch?v=E8aBJsC0bY8

Btw: I also tried to make ChatGPT do exactly that, but it never worked (I think in my case it was on GPU with KernelAbstractions.jl). I think this is an example of an algorithm still beating AI, for now :sweat_smile:

1 Like

You might look at the generic `lu` in LinearAlgebra.jl.

If you thread the critical loop in lines 260-264 and put an `@inbounds @simd` on the inner loop in lines 261-263, you should have something to compare to.

I did this to get something to work in half precision. See `hlu!.jl`, which is part of my package MultiPrecisionArrays.jl.
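
For concreteness, threading that critical loop would look something like this (a sketch with hypothetical names, not the actual code from LinearAlgebra.jl or hlu!.jl):

using Base.Threads

function lu_update!(A, k)
    m, n = size(A)
    @threads for j = k+1:n          # thread the critical loop over columns
        Akj = A[k, j]
        @inbounds @simd for i = k+1:m
            A[i, j] -= A[i, k] * Akj
        end
    end
end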

1 Like

Thanks for the links. I’ll wait for someone experienced with channels, locks, and waiting to tell me how I could wait for the jobs to finish in each iteration and improve the performance.

Using something like OhMyThreads makes this easier.
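
For example, the per-k column update could be written with OhMyThreads’ `tforeach`, which chunks the range across threads and returns only once all chunks are done, so the per-iteration barrier comes for free. A minimal sketch (update_cols! is a hypothetical name):

using OhMyThreads: tforeach

function update_cols!(X, k)
    m, n = size(X)
    tforeach(k+1:n) do j           # chunked over threads; blocks until all are done
        xkj = X[k, j]
        @inbounds @simd for i = k+1:m
            X[i, j] -= X[i, k] * xkj
        end
    end
end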

The point is that the algorithm you’re using doesn’t parallelize well. You need to switch LU algorithms before parallelization can be effective.

2 Likes

I’m still interested in the approach I envisioned (spawning threads outside and reusing them), but to which LU algorithm should I switch? @Oscar_Smith, I benchmarked BLAS/LAPACK vs RecursiveFactorization vs my implementation on 20Kx20K matrices, and the timings were 33s vs 700s vs 2000s. So I’m figuring that if I could multithread as I proposed, with 12 cores I could perhaps outperform RF.
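
A sketch of how such a comparison can be run (assuming random matrices as above; the original setup may have differed):

using LinearAlgebra, RecursiveFactorization

X = rand(-1:1e-6:1, 20_000, 20_000)
Y = copy(X);  @time lu!(Y);                          # LAPACK / multithreaded BLAS
Y = copy(X);  @time RecursiveFactorization.lu!(Y);
Y = copy(X);  σ, τ = (collect(1:d) for d ∈ size(X))
@time factor_LU!(Y, σ, τ);                           # the generic code above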

Do a search for “parallelizing LU” and you’ll find many links. Be aware that there’s a lot more to making LU fast (e.g. blocking for cache) — the textbook 3-loop algorithms aren’t really a good starting point.