Trying to parallelize using CUSOLVERRF.jl with @threads

Hi everyone,

I’m working with CUSOLVERRF.jl to solve multiple sparse linear systems on the GPU, using the high-level RFLU API: GitHub - exanauts/CUSOLVERRF.jl: A Julia wrapper for cusolverRF

All systems share the same sparsity pattern but have different matrix values. I’m trying to run multiple solves in parallel using Threads.@threads, with the goal of making the code as fast as possible. At the moment I don’t want to, and can’t, work with CUDA directly myself.

Typical code I’m using (works serially):

using CUDA, CUSOLVERRF, LinearAlgebra

rf = RFLU(dA; symbolic=:RF)      # symbolic + numeric factorization, only once
for i in 1:N
    dA.nzVal .= new_values[i]    # update matrix values (same sparsity pattern)
    lu!(rf, dA)                  # update numeric factorization
    copyto!(db, b[i])            # copy RHS to the device
    ldiv!(rf, db)                # solve A*x = b in place
end

This works great in serial.

What I want to do:

Run this loop in parallel using @threads to speed up processing of many systems:

using Base.Threads

rf = RFLU(dA; symbolic=:RF)      # only once
@threads for i in 1:N
    dA.nzVal .= new_values[i]
    lu!(rf, dA)
    copyto!(db, b[i])
    ldiv!(rf, db)
end

But this leads to wrong results, because the rf object is not thread-safe. Wrapping the calls in a lock (or similar) would avoid the data races, but it would also serialize the execution, which defeats the purpose.
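
For completeness, the lock-based variant would look roughly like this (just a sketch); it should give correct results, but it forces every update, factorization, and solve to run one after the other:

lck = ReentrantLock()
rf  = RFLU(dA; symbolic=:RF)
@threads for i in 1:N
    lock(lck) do
        dA.nzVal .= new_values[i]
        lu!(rf, dA)
        copyto!(db, b[i])
        ldiv!(rf, db)        # correct, but all GPU work is serialized by the lock
    end
end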

My question
Is there a safe and efficient way to reuse the same RFLU symbolic factorization across threads? With these points in mind:

  • Can I clone rf so that each thread has its own copy of the numeric buffers but reuses the symbolic structure? Plain copying does not work (see the sketch after this list).
  • I want to avoid calling RFLU(dA) in each thread, because that recomputes the symbolic factorization and is slow.
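
To illustrate the first point, this is roughly what I tried for the per-thread copies (deepcopy is just my naive attempt, not a supported way to duplicate an RFLU), and it still gives wrong results:

rf  = RFLU(dA; symbolic=:RF)
rfs = [deepcopy(rf) for _ in 1:Threads.nthreads()]   # naive per-thread "clones"
@threads for i in 1:N
    rf_t = rfs[Threads.threadid()]
    dA.nzVal .= new_values[i]    # note: dA and db are still shared here as well
    lu!(rf_t, dA)
    copyto!(db, b[i])
    ldiv!(rf_t, db)
end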

Any suggestions or workarounds would be really appreciated.

Thanks in advance!

How do you multi-thread something that uses the GPU? That doesn’t make sense.

You should go with Distributed.jl instead, since I imagine you want to use multiple machines with GPUs; see Multiple GPUs · CUDA.jl.
If you do want to use multithreading on a single GPU (which I think will be far slower than building one big block-sparse array and using the full power of your GPU on it), I think you will run out of memory really quickly (see Multithreading & GPU memory management).
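
By “building the big block sparse array” I mean something like this (untested sketch; A[i] and b[i] are hypothetical host-side matrices and vectors, all with the same sparsity pattern): stack the N systems into one block-diagonal matrix and do a single factorization and solve.

using SparseArrays, CUDA, CUDA.CUSPARSE, CUSOLVERRF, LinearAlgebra

Abig = blockdiag(A...)             # one block-diagonal matrix holding all N systems
bbig = reduce(vcat, b)             # stacked right-hand sides

dAbig = CuSparseMatrixCSR(Abig)
dbbig = CuVector(bbig)

rf = RFLU(dAbig; symbolic=:RF)     # one factorization of the big system
ldiv!(rf, dbbig)                   # one solve gives all N solutions at once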

PS: One last thing, which you may already know, but the code you showed isn’t thread-safe.

I multi-thread CPU code where each thread gets its own stream, and each thread then calls lu! or the other CUDA kernels. If the kernels don’t saturate the GPU individually (which they don’t), there should be some parallelism to gain.
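
Roughly what I have in mind is the following sketch. It assumes one RFLU and one set of device buffers per thread (so the symbolic analysis is paid nthreads() times at setup, not N times), and relies on CUDA.jl giving each Julia task its own stream, so work launched from different tasks can overlap if the GPU has spare capacity. Names such as new_values, b, and dA are from my example above.

using CUDA, CUDA.CUSPARSE, CUSOLVERRF, LinearAlgebra
using Base.Threads

nt  = nthreads()
rfs = [RFLU(dA; symbolic=:RF) for _ in 1:nt]            # one factorization object per thread
dAs = [copy(dA) for _ in 1:nt]                          # per-thread device matrices (assuming copy is supported for CuSparseMatrixCSR)
dbs = [CUDA.zeros(Float64, size(dA, 1)) for _ in 1:nt]  # per-thread RHS buffers

@sync for (t, chunk) in enumerate(Iterators.partition(1:N, cld(N, nt)))
    Threads.@spawn begin
        # each Julia task runs on its own CUDA stream by default
        for i in chunk
            dAs[t].nzVal .= new_values[i]
            lu!(rfs[t], dAs[t])
            copyto!(dbs[t], b[i])
            ldiv!(rfs[t], dbs[t])
            # ... copy the solution in dbs[t] to wherever it needs to go
        end
    end
end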