All systems share the same sparsity pattern but have different matrix values. I'm trying to run multiple solves in parallel using Threads.@threads, and my goal is to make the code as fast as possible. At the moment I don't want to (or can't) use CUDA itself directly.
Typical code I’m using (works serially):
rf = RFLU(dA; symbolic=:RF) # only once
for i in 1:N
    dA.nzVal .= new_values[i]   # update matrix values
    lu!(rf, dA)                 # update numeric factorization
    copyto!(db, b[i])           # copy RHS
    ldiv!(rf, db)               # solve A*x = b
end
This works great in serial.
What I want to do:
Run this loop in parallel using @threads to speed up processing of many systems:
rf = RFLU(dA; symbolic=:RF) # only once
@threads for i in 1:N
    dA.nzVal .= new_values[i]
    lu!(rf, dA)
    copyto!(db, b[i])
    ldiv!(rf, db)
end
But this leads to wrong results because the rf object is not thread-safe. I can't work with locks or similar methods to avoid the data races, because that would effectively serialize the execution.
My question
Is there a safe and efficient way to reuse the same RFLU symbolic factorization across threads? With these points in mind:
Can I clone rf so that each thread has its own copy of the numeric buffers, but reuses the symbolic structure? Copying does not work.
I want to avoid calling RFLU(dA) in each thread, because that recomputes the symbolic factorization and is slow.
Any suggestions or workarounds would be really appreciated.
You should go with Distributed.jl instead, since I imagine you want to use multiple machines with GPUs; see Multiple GPUs · CUDA.jl.
If you do want to use multithreading on one GPU (which I think will be far slower than building the big block sparse array and using the full power of your GPU on it; see the sketch below), I think you will run out of memory really quickly (see Multithreading & GPU memory management).
PS: One last thing, which you may already know, but the code you showed isn't thread-safe.
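As a rough sketch of the block-diagonal idea (assuming the per-system matrices and right-hand sides are stored CPU-side in vectors A and b; those names are just for illustration):

using SparseArrays, LinearAlgebra, CUDA, CUDA.CUSPARSE, CUSOLVERRF

# Assemble one big block-diagonal system out of the N small ones.
A_big  = blockdiag(A...)          # N copies of the sparsity pattern on the diagonal
b_big  = reduce(vcat, b)          # stacked right-hand sides
dA_big = CuSparseMatrixCSR(A_big)
db_big = CuVector(b_big)

rf = RFLU(dA_big; symbolic=:RF)   # one symbolic + numeric factorization
ldiv!(rf, db_big)                 # db_big now holds all N solutions, stacked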
I multi-thread CPU code where each thread gets its own stream, and then each thread calls lu! or other CUDA kernels. If the kernels don't saturate the GPU individually (which they do not), there should be parallelization.
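For reference, a minimal sketch of that pattern: in recent CUDA.jl versions each Julia task already uses its own task-local stream, so work submitted from different @threads iterations can overlap on the device (work! and device_buffers are placeholders, not part of the original code):

using CUDA, Base.Threads

@threads for i in 1:N
    # Each task submits its work to its own task-local stream (CUDA.stream()),
    # so launches from different tasks can overlap on the device as long as
    # no single kernel saturates the GPU.
    work!(device_buffers[i])   # placeholder for lu!/ldiv!/kernel calls
end
CUDA.device_synchronize()      # wait for outstanding work on all streams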
This is correct, using multiple streams will (usually) increase device utilization by providing sufficient work to engage all SMs.
To your original point, you'll most likely need to create rf on each task, so that you have one per task. You can use TaskLocalValues.jl to do something like:
using TaskLocalValues
...
RF = TaskLocalValue{RFLU}(()->RFLU(dA; symbolic=:RF))
@threads for i in 1:N
    rf = RF[] # Allocates a new `RFLU` per task, as needed
    ...
end
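A fuller sketch of this per-task approach, assuming each task also gets its own device matrix and RHS buffer so the tasks don't race on dA and db (new_values and b are from the original snippet; x_out is an assumed array for storing the solutions):

using CUDA, CUDA.CUSPARSE, CUSOLVERRF, TaskLocalValues
using LinearAlgebra, Base.Threads

# One device matrix, one factorization object and one RHS buffer per task.
DA = TaskLocalValue{typeof(dA)}(() -> copy(dA))
RF = TaskLocalValue{RFLU}(() -> RFLU(dA; symbolic=:RF))
DB = TaskLocalValue{typeof(db)}(() -> similar(db))

@threads for i in 1:N
    dA_t, rf_t, db_t = DA[], RF[], DB[]
    dA_t.nzVal .= new_values[i]   # per-task matrix values
    lu!(rf_t, dA_t)               # per-task numeric refactorization
    copyto!(db_t, b[i])
    ldiv!(rf_t, db_t)             # solve in place on this task's buffer
    copyto!(x_out[i], db_t)       # store the solution
end

Note that this still runs the symbolic analysis once per task (not once per iteration), which may or may not be acceptable for your use case.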
After some digging in the source code of CUSOLVERRF.jl I found that it has an RFBatchedLowLevel() function, which is exactly what I need. So I can restructure my code to work with that, even better than what I had in mind before. But with the new function I get a ReadOnlyMemoryError() from the function call ``. Here is the code I used for testing:
using CUDA, SparseArrays, CUSOLVERRF, CUDA.CUSPARSE
A_cpu = spdiagm(0 => ones(Float64, 4))
println(A_cpu)
A_cpu = SparseMatrixCSC{Float64, Int32}(A_cpu)
dA = CuSparseMatrixCSR(copy(A_cpu))
@show typeof(dA)
try
    dA.nzVal .= dA.nzVal .+ 1.0
    println("dA.nzVal is writable!")
catch e
    println("Error writing to dA.nzVal: ", e)
end

try
    rf_batched = CUSOLVERRF.RFBatchedLowLevel(dA, 4)
    println("Success, type: ", typeof(rf_batched))
catch e
    println("Failure: ", e)
    rethrow()
end
@mhunke
I interfaced exactly what you need a few days ago.
With the latest release of cuDSS we can now solve a batch of linear systems with the same sparsity pattern.
I interfaced the feature and added some examples in the documentation:
I worked on it for a project with MadNLP.jl, but I'm glad if it can be useful to other people.