Hi everyone,

I’m working with CUSOLVERRF.jl to solve multiple sparse linear systems on the GPU, using the high-level `RFLU` API: [exanauts/CUSOLVERRF.jl](https://github.com/exanauts/CUSOLVERRF.jl), a Julia wrapper for cusolverRF.

All systems share the same sparsity pattern but have different matrix values. I’m trying to run multiple solves in parallel using `Threads.@threads`, and my goal is to make the code as fast as possible. At the moment I can’t (and don’t want to) work with CUDA directly myself.
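For context, here is a minimal sketch of the setup the snippets below assume (the sizes and density are made up for illustration; my real data comes from elsewhere):

```julia
using CUDA, CUDA.CUSPARSE, SparseArrays, LinearAlgebra
using CUSOLVERRF
using Base.Threads

n, N = 1_000, 16                                  # illustrative problem size and number of systems
A  = sprand(n, n, 0.01) + I                       # one sparsity pattern shared by all systems
dA = CuSparseMatrixCSR(A)                         # device matrix handed to RFLU
db = CUDA.zeros(Float64, n)                       # device buffer for RHS / solution
new_values = [rand(nnz(A)) for _ in 1:N]          # per-system nonzero values (same pattern)
b  = [rand(n) for _ in 1:N]                       # per-system right-hand sides
```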
Typical code I’m using (works serially):
```julia
rf = RFLU(dA; symbolic=:RF)   # symbolic analysis + first numeric factorization, done only once

for i in 1:N
    dA.nzVal .= new_values[i] # update matrix values (same sparsity pattern)
    lu!(rf, dA)               # update the numeric factorization
    copyto!(db, b[i])         # copy RHS to the device buffer
    ldiv!(rf, db)             # solve A * x = b in place (db now holds x)
end
```
This works great in serial.
What I want to do:
Run this loop in parallel using `Threads.@threads` to speed up the processing of many systems:
```julia
rf = RFLU(dA; symbolic=:RF)   # only once

@threads for i in 1:N
    dA.nzVal .= new_values[i]
    lu!(rf, dA)
    copyto!(db, b[i])
    ldiv!(rf, db)
end
```
But this leads to wrong results because the `rf` object is not thread-safe (and `dA`/`db` are shared by all threads). I can’t use locks or similar methods to avoid the data races, because that would effectively serialize the execution.
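To be explicit about why locks don’t help: guarding the shared state keeps the results correct, but every iteration then waits on the previous one, so nothing actually runs in parallel. Roughly this (sketch only):

```julia
lck = ReentrantLock()

@threads for i in 1:N
    lock(lck) do
        dA.nzVal .= new_values[i]   # the whole body must be guarded, since rf, dA
        lu!(rf, dA)                 # and db are shared, so the factorizations and
        copyto!(db, b[i])           # solves end up running one after the other
        ldiv!(rf, db)
    end
end
```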
My question
Is there a safe and efficient way to reuse the same `RFLU` symbolic factorization across threads? With these points in mind:
- Can I clone `rf` so that each thread has its own copy of the numeric buffers but reuses the symbolic structure? Copying does not work (see the sketch after this list).
- I want to avoid calling `RFLU(dA)` in each thread, because that recomputes the symbolic factorization and is slow.
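For concreteness, this is the pattern I’m hoping for. Here `clone_numeric` is a hypothetical helper standing in for whatever mechanism (if any) could duplicate only the numeric buffers of an `RFLU` while sharing the symbolic analysis; a plain `copy`/`deepcopy` of `rf` is what I tried, and it does not work:

```julia
# `clone_numeric` is hypothetical -- it represents whatever would duplicate the
# numeric part of an RFLU without redoing the symbolic analysis.
rf  = RFLU(dA; symbolic=:RF)                          # symbolic + numeric, done once
rfs = [clone_numeric(rf) for _ in 1:Threads.nthreads()]
dAs = [copy(dA)          for _ in 1:Threads.nthreads()]  # per-thread matrix buffers
dbs = [similar(db)       for _ in 1:Threads.nthreads()]  # per-thread RHS/solution buffers

@threads :static for i in 1:N        # :static pins iterations so threadid() is stable
    tid = Threads.threadid()
    dAs[tid].nzVal .= new_values[i]
    lu!(rfs[tid], dAs[tid])
    copyto!(dbs[tid], b[i])
    ldiv!(rfs[tid], dbs[tid])
end
```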
Any suggestions or workarounds would be really appreciated.
Thanks in advance!