Failed to load `CUDAdrv`

I am, for the first time, trying to learn how GPU programming works. I have two Tesla K80s on an HPC cluster. To begin, I request the proper node through Slurm and launch an interactive job.

[affans@hpc ~]$ srun -p gpuq --gres=gpu:2 --pty bash
[affans@node018 ~]$ nvidia-smi
Fri Dec 20 13:46:43 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.111                Driver Version: 384.111                   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 00000000:06:00.0 Off |                    0 |
| N/A   30C    P8    27W / 149W |      0MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           On   | 00000000:07:00.0 Off |                    0 |
| N/A   24C    P8    27W / 149W |      0MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Then I launch Julia in this shell and try to load CUDAdrv. (I have two installations: v1.2, local to my home directory, and v1.0.3, installed at the system level, which no one uses.)

[affans@node018 ~]$ export JULIA_CUDA_VERBOSE=true
[affans@node018 ~]$ ./bin/julia
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.2.0 (2019-08-20)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |

julia> using CUDAdrv
┌ Error: CUDAdrv.jl failed to initialize
│   exception =
│    CUDA error: unknown error (code 999, ERROR_UNKNOWN)
│    Stacktrace:
│     [1] throw_api_error(::CUDAdrv.cudaError_enum) at /home/affans/.julia/packages/CUDAdrv/i465Q/src/error.jl:131
│     [2] macro expansion at /home/affans/.julia/packages/CUDAdrv/i465Q/src/error.jl:144 [inlined]
│     [3] cuInit at /home/affans/.julia/packages/CUDAdrv/i465Q/src/libcuda.jl:18 [inlined]
│     [4] __init__() at /home/affans/.julia/packages/CUDAdrv/i465Q/src/CUDAdrv.jl:56
│     [5] _include_from_serialized(::String, ::Array{Any,1}) at ./loading.jl:685
│     [6] _require_search_from_serialized(::Base.PkgId, ::String) at ./loading.jl:765
│     [7] _require(::Base.PkgId) at ./loading.jl:990
│     [8] require(::Base.PkgId) at ./loading.jl:911
│     [9] require(::Module, ::Symbol) at ./loading.jl:906
│     [10] eval(::Module, ::Any) at ./boot.jl:330
│     [11] eval_user_input(::Any, ::REPL.REPLBackend) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.2/REPL/src/REPL.jl:86
│     [12] macro expansion at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.2/REPL/src/REPL.jl:118 [inlined]
│     [13] (::getfield(REPL, Symbol("##26#27")){REPL.REPLBackend})() at ./task.jl:268
└ @ CUDAdrv ~/.julia/packages/CUDAdrv/i465Q/src/CUDAdrv.jl:67

The error message is very cryptic, and I have no idea how to even start debugging it.

I think I should first use a basic C program to test the GPUs and make sure the libraries work before using Julia. Is there a basic C program that simply prints out the names of the devices?

We can’t provide you with much more information, though; that’s just what the NVIDIA driver reports. Please make sure that binaries compiled from CUDA C work (the CUDA toolkit comes with samples). This error happens when initializing CUDA, which is one of the first things CUDAdrv does, so it is typically not related to Julia or CUDAdrv.

That’s what I figured. What would be the simplest C program to run to see if everything is working as intended? Searching Google for “simple C GPU script” doesn’t really help.

See e.g.
https://docs.nvidia.com/cuda/cuda-samples/index.html#simple
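
Before compiling anything, a quick shell check on the compute node can rule out the basics. This is only a sketch under assumptions: the function names are made up for illustration, and the device file names (nvidiactl, nvidia-uvm, nvidia0) are the ones the NVIDIA driver usually creates under /dev on Linux. My understanding is that nvidia-smi can work even when the nvidia-uvm device file is missing, while CUDA initialization needs it, so that file is worth a look when cuInit fails with an unknown error. Treat that as a hunch to verify, not a diagnosis.

```shell
#!/bin/sh
# Pre-flight GPU check, meant to be run on the compute node before starting Julia.
# Assumption: a Linux node where the NVIDIA driver creates its usual device files.

# List the GPUs the driver reports, one per line (nvidia-smi's -L flag).
list_gpus() {
    if command -v nvidia-smi >/dev/null 2>&1; then
        nvidia-smi -L
    else
        echo "nvidia-smi not found on PATH" >&2
        return 1
    fi
}

# Check that the device files exist (existence only, not whether the job may open them).
# The optional directory argument defaults to /dev; it is there so the function
# can be pointed at a scratch directory when trying it out.
check_device_nodes() {
    dev_dir="${1:-/dev}"
    missing=0
    for node in nvidiactl nvidia-uvm nvidia0; do
        if [ -e "$dev_dir/$node" ]; then
            echo "ok       $dev_dir/$node"
        else
            echo "MISSING  $dev_dir/$node"
            missing=1
        fi
    done
    return "$missing"
}

list_gpus || echo "could not list GPUs"
check_device_nodes || echo "at least one device file is missing; worth asking the cluster admins"
```

If both checks pass and CUDAdrv still fails to initialize, building one of the toolkit samples linked above is the next step.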

Here’s how I installed and tested the Julia CUDA libraries on our cluster:

mkdir bin; cd bin
curl -o julia-1.3.0-linux-x86_64.tar.gz https://julialang-s3.julialang.org/bin/linux/x64/1.3/julia-1.3.0-linux-x86_64.tar.gz
tar xzvf julia-1.3.0-linux-x86_64.tar.gz
module add CUDA
ln -s julia-1.3.0 julia
julia/bin/julia
]
pkg> add CUDAapi CUDAdrv CUDAnative CuArrays BenchmarkTools DiffResults ForwardDiff Compat
cat > testcuarrays.jl
using Pkg
Pkg.test("CuArrays")
srun -p gpu --mem=50G --time=5:00:00 bash -c 'pwd;echo $SLURMD_NODENAME;~/bin/julia-1.3.0/bin/julia ~/bin/testcuarrays.jl'
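
The interactive `cat >` step is easy to get wrong, so here is a sketch of writing the same two-line test script with a here-document instead. The function name `write_test_script` is made up for illustration, and the default path assumes the layout used above, with the script in ~/bin under the name the srun line expects (testcuarrays.jl).

```shell
#!/bin/sh
# Sketch: write the Julia test script non-interactively.
# The default target mirrors the layout above; pass another path to override it.

write_test_script() {
    target="${1:-$HOME/bin/testcuarrays.jl}"
    mkdir -p "$(dirname "$target")"
    # The quoted delimiter makes the shell copy the body verbatim, without expansion.
    cat > "$target" <<'EOF'
using Pkg
Pkg.test("CuArrays")
EOF
    # Print the path that was written so the caller can reuse it.
    echo "$target"
}
```

Calling `write_test_script` with no argument writes ~/bin/testcuarrays.jl, after which the srun line above can be used unchanged.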