Using CUSOLVER in CuArrays.jl

eliassno · February 20, 2019, 6:59pm

I’m writing a GPU version of gesv! using CuArrays.jl.

The following works:

using CuArrays, LinearAlgebra, Test

function gpugesv!(A,b)
        A, ipiv = CuArrays.CUSOLVER.getrf!(A)
        CuArrays.CUSOLVER.getrs!('N',A,ipiv,b)
        return nothing
end

###
A = rand(32^2,32^2); b = rand(32^2);
A_d = CuArray(A); b_d = CuArray(b);


LAPACK.gesv!(A,b);
gpugesv!(A_d,b_d)
A_d = Array(A_d); b_d = Array(b_d);

@test isapprox(A_d,A) && isapprox(b_d,b)
###
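
For context, here is a minimal CPU-only sketch of what the check above relies on: LAPACK.gesv! overwrites A with its LU factors and b with the solution, which is why both arrays are compared against the GPU results. The 2×2 values are hand-picked for illustration and are not from my actual problem.

```julia
using LinearAlgebra

# Small, well-conditioned system (illustrative values only).
A = [4.0 3.0; 6.0 3.0]
b = [10.0, 12.0]
A0, b0 = copy(A), copy(b)   # keep the originals; gesv! overwrites its inputs

LAPACK.gesv!(A, b)          # A now holds the LU factors, b holds the solution x

@assert A0 * b ≈ b0         # b solves the original system A0 * x = b0
@assert !(A ≈ A0)           # A no longer contains the original matrix
```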

I’ll have to do this computation repeatedly for different A and b (of size 128^2), but I’m not sure how to clean up the GPU after each evaluation of gpugesv!. The CPU becomes stuck at 100% load after several iterations.

Any suggestions on how to implement this?

maleadt · February 20, 2019, 7:57pm

Do you have an MWE that shows the problematic behavior? I tried putting it in a loop but don’t see extremely high GC overhead:

using CuArrays, LinearAlgebra, Test

function gpugesv!(A,b)
    A, ipiv = CuArrays.CUSOLVER.getrf!(A)
    CuArrays.CUSOLVER.getrs!('N',A,ipiv,b)
    return
end

function main(;N=32^2, i=25)
    CuArrays.pool_timings!()
    CuArrays.@time for _ in 1:i
        A = rand(N, N)
        b = rand(N)

        A_d = CuArray(A)
        b_d = CuArray(b)

        LAPACK.gesv!(A,b)
        gpugesv!(A_d, b_d)

        @test Array(A_d) ≈ A && Array(b_d) ≈ b
    end
    CuArrays.pool_timings()
end

main()

  2.055105 seconds (483.19 k CPU allocations: 1.001 GiB, 5.15% gc time) (150 GPU allocations: 200.297 MiB, 8.23% gc time of which 35.57% spent allocating)
 ──────────────────────────────────────────────────────────────────────────
                                   Time                   Allocations      
                           ──────────────────────   ───────────────────────
     Tot / % measured:          2.89s / 2.18%           1.05GiB / 0.21%    

 Section           ncalls     time   %tot     avg     alloc   %tot      avg
 ──────────────────────────────────────────────────────────────────────────
 pooled alloc         150   60.3ms  96.0%   402μs   2.20MiB  100%   15.0KiB
   1 try alloc         15   60.2ms  95.7%  4.01ms   2.20MiB  100%    150KiB
 background task        1   2.53ms  4.03%  2.53ms   2.03KiB  0.09%  2.03KiB
   scan                 1   2.04μs  0.00%  2.04μs         -  0.01%        -
   reclaim              1   1.49μs  0.00%  1.49μs         -  0.00%        -
 ──────────────────────────────────────────────────────────────────────────

maleadt · February 20, 2019, 8:07pm

Ah, maybe you’re running into the “cost” of syncing the GPU. Try wrapping your GPU code (e.g. the call to gpugesv!) in CuArrays.@sync. That will synchronize the GPU, after which a download (i.e. a call to Array(x::CuArray)) will be “free”.

eliassno · February 20, 2019, 9:29pm

This recreates my problem.

The following works fine:

main(N=4,i=2) #Warmup

main(N=32^2,i=2)

0.328417 seconds (404 CPU allocations: 80.108 MiB, 37.92% gc time) (12 GPU allocations: 16.024 MiB, 0.91% gc time of which 91.69% spent allocating)
 ──────────────────────────────────────────────────────────────────────────
                                   Time                   Allocations      
                           ──────────────────────   ───────────────────────
     Tot / % measured:          411ms / 20.5%           80.1MiB / 0.00%    

 Section           ncalls     time   %tot     avg     alloc   %tot      avg
 ──────────────────────────────────────────────────────────────────────────
 background task        1   81.5ms  96.7%  81.5ms   2.03KiB  58.8%  2.03KiB
   reclaim              1   2.61μs  0.00%  2.61μs         -  0.00%        -
   scan                 1   1.80μs  0.00%  1.80μs         -  9.50%        -
 pooled alloc          12   2.77ms  3.29%   231μs   1.42KiB  41.2%        -
   1 try alloc          8   2.74ms  3.25%   342μs         -  13.6%        -
 ──────────────────────────────────────────────────────────────────────────

However, calling main(N=32^2,i=3) will freeze the session.

eliassno · February 20, 2019, 9:47pm

Is it enough to use CuArrays.@sync as follows?

function main(;N=32^2,i=25)
    ...
    for _ in 1:i
        ...
        CuArrays.@sync gpugesv!(A_d,b_d)
        ...
    end
    ...
end

Or, following steps 4 and 5 of the LU example in the cuSOLVER documentation, should I write:

function gpugesv!(A,b)
    CuArrays.@sync A, ipiv = CuArrays.CUSOLVER.getrf!(A)
    CuArrays.@sync CuArrays.CUSOLVER.getrs!('N',A,ipiv,b)
    return
end

maleadt · February 21, 2019, 6:42am

It’s fine to put the @sync on the call to gpugesv!.

However, if your session really freezes doing main(N=32^2,i=3), there’s something else going on. Could you attach gdb and inspect where the process hangs?

eliassno · February 21, 2019, 2:03pm

I haven’t built Julia from source before and I’m running this on a cluster (with a GPU-node).

I’m currently building Julia with make debug; I’ll get back to you when I’ve tried gdb.

maleadt · February 21, 2019, 2:05pm

You don’t need a debug build just to get a backtrace. Just run under gdb, or attach gdb to the process afterwards, and use bt to dump the backtrace.

Alternatively, depending on where it’s stuck, interrupting the process using CTRL-C might print some kind of a backtrace too.

eliassno · February 21, 2019, 5:46pm

Attaching gdb finally allows me to interrupt the call to main(N=32^2,i=4).
I manage to complete the loop when I set i=3 in main, although inconsistently.
I don’t know what to make of the backtrace.

Here is a gist:
https://gist.github.com/elisno/0c6d316d1867b7f3f540eb8789bd837c

maleadt · February 21, 2019, 6:27pm

Not sure what you’re trying to show with that backtrace, it just points to the SIGINT handler after having pressed CTRL-C (as expected) and in a case where the execution just finishes… My idea was to attach gdb when the process was frozen and see where it hangs, since I can’t reproduce a hang with any problem size / iteration count.

eliassno · February 21, 2019, 7:34pm

Sorry, here’s the backtrace taken while main was hanging.

(gdb) bt
#0 0x00007ffd5cb7b7c2 in clock_gettime ()
#1 0x00002b6c584b993d in clock_gettime () from /usr/lib64/libc.so.6
#2 0x00002b6c88f3be5e in ?? () from /usr/lib64/nvidia/libcuda.so
#3 0x00002b6c88fc9a05 in ?? () from /usr/lib64/nvidia/libcuda.so
#4 0x00002b6c88fe813b in ?? () from /usr/lib64/nvidia/libcuda.so
#5 0x00002b6c88f1e01d in ?? () from /usr/lib64/nvidia/libcuda.so
#6 0x00002b6c88e419ba in ?? () from /usr/lib64/nvidia/libcuda.so
#7 0x00002b6c88e44f8a in ?? () from /usr/lib64/nvidia/libcuda.so
#8 0x00002b6c88f7e265 in cuMemcpyDtoH_v2 () from /usr/lib64/nvidia/libcuda.so
#9 0x00002b6c83734b2e in ?? ()
#10 0x0000000000000002 in ?? ()
#11 0x00002b6c842d2a20 in ?? ()
#12 0x0000000000000000 in ?? ()

Could this be related to this issue?

maleadt · February 25, 2019, 6:06am

That seems to deal with concurrent kernels and temporary freezes, so I’d think not.

I’m not sure how to help here, since I can’t reproduce the hang and it seems to be happening within CUDA. Your issue looks like the NVIDIA Developer Forums thread “cuSolver LU factorization inside a for loop problem” (GPU-Accelerated Libraries); maybe try upgrading CUDA / the NVIDIA driver and see if it reproduces?

eliassno · February 26, 2019, 2:21pm

Sorry for the late reply.

I set up the CUDA toolkit (10.0) on my personal desktop at home and ran main() there; it completed without hanging.

The cluster I’m working on had some slowdown issues last week. It was rebooted yesterday.
I ran the code today, and it works like a charm.

Your original suggestion certainly helps!