# Using CUSOLVER in CuArrays.jl

I’m writing a GPU version of `gesv!` using CuArrays.jl.

The following works:

``````julia
using CuArrays, LinearAlgebra, Test

function gpugesv!(A, b)
    A, ipiv = CuArrays.CUSOLVER.getrf!(A)
    CuArrays.CUSOLVER.getrs!('N', A, ipiv, b)
    return nothing
end

A = rand(32^2, 32^2); b = rand(32^2);
A_d = CuArray(A); b_d = CuArray(b);

LAPACK.gesv!(A, b);   # CPU reference solve
gpugesv!(A_d, b_d)    # GPU solve

A_d = Array(A_d); b_d = Array(b_d);

@test isapprox(A_d, A) && isapprox(b_d, b)
``````

I’ll have to run this computation repeatedly for different `A` and `b` (of size `128^2`), but I’m not sure how to clean up the GPU after each evaluation of `gpugesv!`. The CPU becomes stuck at 100% load after several iterations.

Any suggestions on how to implement this?
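For context, the kind of cleanup I had in mind between solves is sketched below; I’m not sure `unsafe_free!` and `reclaim` are the right calls (their availability depends on the CuArrays version), so treat those names as assumptions:

```julia
# Hypothetical cleanup between iterations (API names assumed, version-dependent):
# drop the device arrays, collect them on the Julia side, and ask the pool
# to hand cached blocks back to the driver.
CuArrays.unsafe_free!(A_d)   # eagerly release the device buffers
CuArrays.unsafe_free!(b_d)
GC.gc()                      # collect any remaining unreachable CuArrays
CuArrays.reclaim()           # return freed pool memory to CUDA
```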

Do you have an MWE that shows the problematic behavior? I tried putting it in a loop, but I don’t see extremely high GC overhead:

``````julia
using CuArrays, LinearAlgebra, Test

function gpugesv!(A, b)
    A, ipiv = CuArrays.CUSOLVER.getrf!(A)
    CuArrays.CUSOLVER.getrs!('N', A, ipiv, b)
    return
end

function main(; N=32^2, i=25)
    CuArrays.pool_timings!()
    CuArrays.@time for _ in 1:i
        A = rand(N, N)
        b = rand(N)

        A_d = CuArray(A)
        b_d = CuArray(b)

        LAPACK.gesv!(A, b)
        gpugesv!(A_d, b_d)

        @test Array(A_d) ≈ A && Array(b_d) ≈ b
    end
    CuArrays.pool_timings()
end

main()
``````
``````
  2.055105 seconds (483.19 k CPU allocations: 1.001 GiB, 5.15% gc time) (150 GPU allocations: 200.297 MiB, 8.23% gc time of which 35.57% spent allocating)
 ──────────────────────────────────────────────────────────────────────────
                            Time                   Allocations
                    ──────────────────────   ───────────────────────
  Tot / % measured:       2.89s / 2.18%          1.05GiB / 0.21%

Section           ncalls     time   %tot     avg     alloc   %tot      avg
──────────────────────────────────────────────────────────────────────────
pooled alloc         150   60.3ms  96.0%   402μs   2.20MiB  100%   15.0KiB
1 try alloc         15   60.2ms  95.7%  4.01ms   2.20MiB  100%    150KiB
background task        1   2.53ms  4.03%  2.53ms   2.03KiB  0.09%  2.03KiB
scan                 1   2.04μs  0.00%  2.04μs         -  0.01%        -
reclaim              1   1.49μs  0.00%  1.49μs         -  0.00%        -
──────────────────────────────────────────────────────────────────────────
``````

Ah, maybe you’re running into the “cost” of syncing the GPU. Try wrapping your GPU code (e.g. the call to `gpugesv!`) in `CuArrays.@sync`. That will synchronize the GPU, after which a download (i.e. a call to `Array(x::CuArray)`) will be “free”.

This recreates my problem.

The following works fine:

``````julia
main(N=4, i=2)     # warmup

main(N=32^2, i=2)
``````
``````
  0.328417 seconds (404 CPU allocations: 80.108 MiB, 37.92% gc time) (12 GPU allocations: 16.024 MiB, 0.91% gc time of which 91.69% spent allocating)
 ──────────────────────────────────────────────────────────────────────────
                            Time                   Allocations
                    ──────────────────────   ───────────────────────
  Tot / % measured:        411ms / 20.5%          80.1MiB / 0.00%

Section           ncalls     time   %tot     avg     alloc   %tot      avg
──────────────────────────────────────────────────────────────────────────
background task        1   81.5ms  96.7%  81.5ms   2.03KiB  58.8%  2.03KiB
reclaim              1   2.61μs  0.00%  2.61μs         -  0.00%        -
scan                 1   1.80μs  0.00%  1.80μs         -  9.50%        -
pooled alloc          12   2.77ms  3.29%   231μs   1.42KiB  41.2%        -
1 try alloc          8   2.74ms  3.25%   342μs         -  13.6%        -
──────────────────────────────────────────────────────────────────────────

``````

However, calling `main(N=32^2,i=3)` will freeze the session.

Is it enough to use `CuArrays.@sync` like the following?

``````julia
main(; N=32^2, i=25)
...
    for _ in 1:i
        ...
        CuArrays.@sync gpugesv!(A_d, b_d)
        ...
    end
...
end
``````

Or, following steps 4 and 5 of the LU example in the cuSOLVER documentation, should I instead write:

``````julia
function gpugesv!(A, b)
    CuArrays.@sync A, ipiv = CuArrays.CUSOLVER.getrf!(A)
    CuArrays.@sync CuArrays.CUSOLVER.getrs!('N', A, ipiv, b)
    return
end
``````

It’s fine to put the `@sync` on the call to `gpugesv!`.
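In other words, one `@sync` per iteration around the whole solve should do; a sketch of how that looks in the loop body (reusing the names from the MWE above):

```julia
# Sketch: one synchronization per iteration, around the whole GPU solve.
for _ in 1:i
    A_d = CuArray(rand(N, N))
    b_d = CuArray(rand(N))
    CuArrays.@sync gpugesv!(A_d, b_d)  # waits for both getrf! and getrs!
    x = Array(b_d)                     # download after the sync is cheap
end
```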

However, if your session really freezes doing `main(N=32^2,i=3)`, there’s something else going on. Could you attach `gdb` and inspect where the process hangs?

I haven’t built Julia from source before, and I’m running this on a cluster (on a GPU node).

I’m currently building Julia with `make debug`, I’ll get back to you when I’ve tried `gdb`.

You don’t need a debug build just to get a backtrace. Just run under `gdb`, or attach `gdb` to the process afterwards, and use `bt` to dump the backtrace.
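For example, something along these lines (the `pgrep` lookup is just one way to find the PID; adjust as needed):

```shell
# Attach gdb to the newest running julia process, dump a backtrace, and detach.
# Assumes gdb is installed and a julia session is currently running.
gdb --batch -ex 'bt' -p "$(pgrep -n julia)"
```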

Alternatively, depending on where it’s stuck, interrupting the process using `CTRL-C` might print some kind of a backtrace too.

Attaching `gdb` finally allows me to interrupt the call to `main(N=32^2,i=4)`.
I manage to complete the loop when I set `i=3` in `main`, although inconsistently.
I don’t know what to make of the backtrace.

Here is a gist:

I’m not sure what you’re trying to show with that backtrace: it just points to the SIGINT handler after pressing CTRL-C (as expected), and in a case where the execution just finishes. My idea was to attach `gdb` while the process was frozen and see where it hangs, since I can’t reproduce a hang with any problem size / iteration count.

Sorry,
Here’s the backtrace during the hanging `main`-function.

``````
(gdb) bt
#0 0x00007ffd5cb7b7c2 in clock_gettime ()
#1 0x00002b6c584b993d in clock_gettime () from /usr/lib64/libc.so.6
#2 0x00002b6c88f3be5e in ?? () from /usr/lib64/nvidia/libcuda.so
#3 0x00002b6c88fc9a05 in ?? () from /usr/lib64/nvidia/libcuda.so
#4 0x00002b6c88fe813b in ?? () from /usr/lib64/nvidia/libcuda.so
#5 0x00002b6c88f1e01d in ?? () from /usr/lib64/nvidia/libcuda.so
#6 0x00002b6c88e419ba in ?? () from /usr/lib64/nvidia/libcuda.so
#7 0x00002b6c88e44f8a in ?? () from /usr/lib64/nvidia/libcuda.so
#8 0x00002b6c88f7e265 in cuMemcpyDtoH_v2 () from /usr/lib64/nvidia/libcuda.so
#9 0x00002b6c83734b2e in ?? ()
#10 0x0000000000000002 in ?? ()
#11 0x00002b6c842d2a20 in ?? ()
#12 0x0000000000000000 in ?? ()
``````

Could this be related to this issue?

That seems to deal with concurrent kernels and temporary freezes, so I’d think not.

I’m not sure how to help here, since I can’t reproduce the hang and it seems to be happening within CUDA. Your issue looks like https://devtalk.nvidia.com/default/topic/1025580/cusolver-lu-factorization-inside-a-for-loop-problem/ – maybe try upgrading CUDA / the NVIDIA driver and see if it reproduces?

I set up the CUDA toolkit (10.0) on my personal desktop at home and ran `main()` there; it completed without hanging.