How to reset GPU after launch failure

If I write a kernel that fails for whatever reason, Julia becomes unable to use the GPU for the remainder of the session.

Example:
julia> cu([1.,2.])

CUDA error: unspecified launch failure (code #719, ERROR_LAUNCH_FAILED)

Sadly, CUDA doesn’t allow us to do that, so the only course of action is restarting Julia.

Since somebody just liked this post: note that nowadays, on recent hardware (Volta or newer), Julia exceptions are supported and will not leave the GPU in an unrecoverable state.

Hey maleadt,

This topic about error code 719 (ERROR_LAUNCH_FAILED) is really crucial.
How can I do the reset on Volta or newer? If I can help, let me know what you would advise me to implement to add a “reset” feature for this error case. (I think this could help thousands of people later on.)

There’s nothing to enable. If your GPU has compute capability 7 or higher, Julia exceptions will result in a recoverable KernelException. On older hardware, they will result in an unrecoverable CUDA error that requires a process restart. And of course, many CUDA errors (like illegal memory accesses) are always unrecoverable, and there’s nothing we can do about that.
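
To see which of the two cases applies to a given machine, a minimal sketch along these lines queries the compute capability of the active device. It assumes CUDA.jl is installed and a GPU is visible; the 7.0 threshold is taken from the reply above, and the variable name `cap` is only illustrative.

```julia
using CUDA

# Compute capability of the currently active device, as a VersionNumber.
cap = CUDA.capability(CUDA.device())

if cap >= v"7.0"
    println("Compute capability $cap: Julia exceptions in kernels should be recoverable.")
else
    println("Compute capability $cap: a kernel exception will likely require restarting Julia.")
end
```

This only covers Julia exceptions. Other CUDA errors, such as illegal memory accesses, remain unrecoverable regardless of the result.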

That could explain why I didn’t find any solution on this topic… :frowning:
I cannot believe this is really an issue in 2022. :smiley: Damn!

There is a solution to this topic, namely to use Volta hardware or more recent. You haven’t specified which error you are running into. If you are triggering an ILLEGAL_MEMORY_ACCESS and want to recover, complain to NVIDIA. In general, you shouldn’t be able to trigger such errors with CUDA.jl; kernel array accesses are bounds checked, and will result in a recoverable exception.
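
As a rough illustration of the bounds-checking behaviour described above, the sketch below launches a kernel that deliberately indexes out of bounds and then checks that the GPU is still usable. The kernel `oob_kernel!` is hypothetical, and the sketch assumes the failure surfaces as a `CUDA.KernelException` when synchronizing. Only try this on hardware with compute capability 7.0 or higher; on older devices it is expected to leave the session unusable.

```julia
using CUDA

# Hypothetical kernel that deliberately writes past the end of `a`.
function oob_kernel!(a)
    i = threadIdx().x
    a[i + length(a)] = 1f0   # bounds-checked access, always out of range
    return nothing
end

a = CUDA.zeros(Float32, 4)

recovered = try
    @cuda threads=4 oob_kernel!(a)
    synchronize()            # assumption: device-side exceptions are reported here
    false
catch err
    err isa CUDA.KernelException || rethrow()
    true
end

# If the exception was recoverable, the GPU should still be usable here.
b = CUDA.ones(Float32, 4)
println(sum(b))
```

If `recovered` is `true` and the final `sum` runs, the session survived the failing kernel and no restart is needed.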

Indeed, I have compute capability 6.1. :frowning: I will try to get a better video card.

I am using kernel functions and getting out-of-bounds errors somewhere. I am debugging it, but the restart times are really long, so it is hard. I had just gotten used to the instantaneous Revise development workflow…
What was also pretty interesting is that I didn’t hit the bug when I rewrote the @cuda kernelfn into a cpu version for i... kernelfn version. That was a day ago… I still don’t understand how the CPU code ran properly with out-of-bounds indexing… It was a real rollercoaster to track down this out-of-bounds indexing. That is why I came to the forum, to ask whether there is a better way to debug things like this.
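
For reference, a CPU stand-in for a kernel can look roughly like the sketch below: the per-element body is called from an ordinary loop, so an out-of-range index raises a `BoundsError` with a normal stack trace and never touches the GPU. The names `body!` and `kernelfn_cpu!` and the off-by-one bug are hypothetical, not taken from the code discussed in this thread.

```julia
# Hypothetical per-element body, standing in for the body of the GPU kernel.
function body!(out, inp, i)
    out[i] = inp[i + 1]      # bug: reads one past the end when i == length(inp)
    return nothing
end

# CPU stand-in for the kernel launch: one loop iteration per GPU thread.
function kernelfn_cpu!(out, inp)
    for i in eachindex(out)
        body!(out, inp, i)
    end
    return out
end

inp = collect(1.0:4.0)
out = zeros(4)

# Catch the error so the script keeps running and the exception can be inspected.
caught = try
    kernelfn_cpu!(out, inp)
    nothing
catch err
    err isa BoundsError || rethrow()
    err
end
println(caught)
```

If a CPU version like this does not throw, it is worth checking for `@inbounds` in the code, or starting Julia with `--check-bounds=yes`, which forces bounds checks even inside `@inbounds` blocks.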

Thank you for your answers Tim!

You can try disabling this branch: https://github.com/JuliaGPU/CUDA.jl/blob/71e9760f02bc6e1bc76aaaf32e98be10a41bae6b/src/compiler/gpucompiler.jl#L35-L37
But beware that it may lead to miscompilations, so I wouldn’t recommend using it for any other purpose than finding your bug. Also be sure to run with -g2 for better stack traces.
