All systems share the same sparsity pattern but have different matrix values. I'm trying to run multiple solves in parallel using Threads.@threads, and my goal is to make the code as fast as possible. At the moment I don't want to (or can't) use CUDA itself directly.
Typical code I’m using (works serially):
rf = RFLU(dA; symbolic=:RF) # only once
for i in 1:N
    dA.nzVal .= new_values[i]   # update matrix values
    lu!(rf, dA)                 # update numeric factorization
    copyto!(db, b[i])           # copy RHS
    ldiv!(rf, db)               # solve A*x = b
end
This works great in serial.
What I want to do:
Run this loop in parallel using @threads to speed up processing of many systems:
rf = RFLU(dA; symbolic=:RF) # only once
@threads for i in 1:N
    dA.nzVal .= new_values[i]
    lu!(rf, dA)
    copyto!(db, b[i])
    ldiv!(rf, db)
end
But this leads to wrong results because the rf object is not thread-safe. I can't work with locks or similar methods to avoid the data races, because that would effectively serialize the execution.
My question
Is there a safe and efficient way to reuse the same RFLU symbolic factorization across threads? With these points in mind:
Can I clone rf so that each thread has its own copy of the numeric buffers, but reuses the symbolic structure? Copying does not work.
I want to avoid calling RFLU(dA) in each thread, because that recomputes the symbolic factorization and is slow.
Any suggestions or workarounds would be really appreciated.
You should go with Distributed.jl instead, since I imagine you want to use multiple machines with GPUs; see Multiple GPUs · CUDA.jl.
If you do want to use multithreading on one GPU (which I think will be far slower than building the big block sparse array and using the full power of your GPU on it; see the sketch below), I think you will run out of memory really quickly (see Multithreading & GPU memory management).
PS: One last thing, which you may already know, but the code you showed isn't thread-safe.
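As a rough sketch of the block-diagonal idea (assuming the per-system matrices and right-hand sides are stored CPU-side in vectors A and b; those names are just for illustration):

using SparseArrays, LinearAlgebra, CUDA, CUDA.CUSPARSE, CUSOLVERRF

# Assemble one big block-diagonal system out of the N small ones.
A_big  = blockdiag(A...)          # N copies of the sparsity pattern on the diagonal
b_big  = reduce(vcat, b)          # stacked right-hand sides
dA_big = CuSparseMatrixCSR(A_big)
db_big = CuVector(b_big)

rf = RFLU(dA_big; symbolic=:RF)   # one symbolic + numeric factorization
ldiv!(rf, db_big)                 # db_big now holds all N solutions, stacked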
I multi-thread CPU code where each thread gets its own stream, and then each thread calls lu! or other CUDA kernels. If the kernels don't saturate the GPU individually (which they do not), there should be parallelization.
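For reference, a minimal sketch of that pattern: in recent CUDA.jl versions each Julia task already uses its own task-local stream, so work submitted from different @threads iterations can overlap on the device (work! and device_buffers are placeholders, not part of the original code):

using CUDA, Base.Threads

@threads for i in 1:N
    # Each task submits its work to its own task-local stream (CUDA.stream()),
    # so launches from different tasks can overlap on the device as long as
    # no single kernel saturates the GPU.
    work!(device_buffers[i])   # placeholder for lu!/ldiv!/kernel calls
end
CUDA.device_synchronize()      # wait for outstanding work on all streams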
This is correct, using multiple streams will (usually) increase device utilization by providing sufficient work to engage all SMs.
To your original point, you'll most likely need to create rf on each task, so that you have one per task. You can use TaskLocalValues.jl to do something like:
using TaskLocalValues
...
RF = TaskLocalValue{RFLU}(()->RFLU(dA; symbolic=:RF))
@threads for i in 1:N
    rf = RF[] # Allocates a new `RFLU` per task, as needed
    ...
end
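A fuller sketch of this per-task approach, assuming each task also gets its own device matrix and RHS buffer so the tasks don't race on dA and db (new_values and b are from the original snippet; x_out is an assumed array for storing the solutions):

using CUDA, CUDA.CUSPARSE, CUSOLVERRF, TaskLocalValues
using LinearAlgebra, Base.Threads

# One device matrix, one factorization object and one RHS buffer per task.
DA = TaskLocalValue{typeof(dA)}(() -> copy(dA))
RF = TaskLocalValue{RFLU}(() -> RFLU(dA; symbolic=:RF))
DB = TaskLocalValue{typeof(db)}(() -> similar(db))

@threads for i in 1:N
    dA_t, rf_t, db_t = DA[], RF[], DB[]
    dA_t.nzVal .= new_values[i]   # per-task matrix values
    lu!(rf_t, dA_t)               # per-task numeric refactorization
    copyto!(db_t, b[i])
    ldiv!(rf_t, db_t)             # solve in place on this task's buffer
    copyto!(x_out[i], db_t)       # store the solution
end

Note that this still runs the symbolic analysis once per task (not once per iteration), which may or may not be acceptable for your use case.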
After some digging in the source code of CUSOLVERRF.jl I found that it has an RFBatchedLowLevel() function, which is exactly what I need. So I can restructure my code to work with that, even better than what I had in mind before. But with the new function I get a ReadOnlyMemoryError() from the function call ``. Here is the code I used for testing:
using CUDA, SparseArrays, CUSOLVERRF, CUDA.CUSPARSE
A_cpu = spdiagm(0 => ones(Float64, 4))
println(A_cpu)
A_cpu = SparseMatrixCSC{Float64, Int32}(A_cpu)
dA = CuSparseMatrixCSR(copy(A_cpu))
@show typeof(dA)
try
    dA.nzVal .= dA.nzVal .+ 1.0
    println("dA.nzVal is writable!")
catch e
    println("Error writing to dA.nzVal: ", e)
end

try
    rf_batched = CUSOLVERRF.RFBatchedLowLevel(dA, 4)
    println("Success, type: ", typeof(rf_batched))
catch e
    println("Failure: ", e)
    rethrow()
end
@mhunke
I interfaced exactly what you need a few days ago.
With the latest release of cuDSS we can now solve a batch of linear systems with the same sparsity pattern.
I interfaced the feature and added some examples in the documentation:
I worked on it for a project with MadNLP.jl, but I'm glad if it can be useful to other people.