CUDNNError when using Flux within a Task

Multi-tasking and threading are fairly recent additions to the Julia/CUDA stack, so some bugs are to be expected. Importantly, every task gets its own CUDNN handle, so handle-local data can’t leak between tasks. The backtrace here seems to point at the place where that handle gets created. I assume that’s the first CUDNN operation in a newly-created task? If not, something’s up with handle creation. Maybe there’s a limit on how many handles we can create. 200MB free also isn’t much, so maybe handle creation fails because we run out of memory, in which case we need to retry after running a GC iteration. You can try that with the following patch:

--- a/lib/cudnn/base.jl
+++ b/lib/cudnn/base.jl
@@ -1,6 +1,9 @@
 function cudnnCreate()
     handle = Ref{cudnnHandle_t}()
-    cudnnCreate(handle)
+    res = @retry_reclaim CUDNN_STATUS_INTERNAL_ERROR unsafe_cudnnCreate(handle)
+    if res != CUDNN_STATUS_SUCCESS
+        throw_api_error(res)
+    end
     return handle[]
 end
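
To make the retry idea concrete, here is a minimal sketch in plain Julia of "call, and if it fails with a specific status, collect garbage and try again". It is only an illustration of the pattern and not necessarily how `@retry_reclaim` is implemented: `retry_after_gc`, `flaky_create` and the `:internal_error` / `:success` symbols are hypothetical stand-ins for the real handle-creation call and CUDNN status codes.

```julia
# Hypothetical helper (not part of CUDA.jl): call `f`, and while it returns
# `retry_code`, run the garbage collector and call it again, for at most
# `attempts` calls in total.
function retry_after_gc(f, retry_code; attempts::Int = 3)
    res = f()
    for attempt in 2:attempts
        res == retry_code || break
        # First retry uses a quick incremental collection,
        # later retries use a full collection.
        GC.gc(attempt > 2)
        res = f()
    end
    return res
end

# Stand-in for a handle-creation call that fails twice before succeeding.
calls = Ref(0)
function flaky_create()
    calls[] += 1
    return calls[] < 3 ? :internal_error : :success
end

status = retry_after_gc(flaky_create, :internal_error)
```

If the status still equals the retry code after the last attempt, the caller gets it back unchanged and can throw, which mirrors the `res != CUDNN_STATUS_SUCCESS` check in the patch.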