I am trying to learn GPU programming for the first time. I have access to two Tesla K80s on an HPC cluster. To begin, I request the appropriate node through Slurm and launch an interactive job.
[affans@hpc ~]$ srun -p gpuq --gres=gpu:2 --pty bash
[affans@node018 ~]$ nvidia-smi
Fri Dec 20 13:46:43 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.111 Driver Version: 384.111 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K80 On | 00000000:06:00.0 Off | 0 |
| N/A 30C P8 27W / 149W | 0MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla K80 On | 00000000:07:00.0 Off | 0 |
| N/A 24C P8 27W / 149W | 0MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
Then I launch Julia in this shell and try to load CUDAdrv. (I have two installations: v1.2, local to my home directory, and v1.0.3, installed at the system level, which no one uses.)
[affans@node018 ~]$ export JULIA_CUDA_VERBOSE=true
[affans@node018 ~]$ ./bin/julia
_
_ _ _(_)_ | Documentation: https://docs.julialang.org
(_) | (_) (_) |
_ _ _| |_ __ _ | Type "?" for help, "]?" for Pkg help.
| | | | | | |/ _` | |
| | |_| | | | (_| | | Version 1.2.0 (2019-08-20)
_/ |\__'_|_|_|\__'_| | Official https://julialang.org/ release
|__/ |
julia> using CUDAdrv
┌ Error: CUDAdrv.jl failed to initialize
│ exception =
│ CUDA error: unknown error (code 999, ERROR_UNKNOWN)
│ Stacktrace:
│ [1] throw_api_error(::CUDAdrv.cudaError_enum) at /home/affans/.julia/packages/CUDAdrv/i465Q/src/error.jl:131
│ [2] macro expansion at /home/affans/.julia/packages/CUDAdrv/i465Q/src/error.jl:144 [inlined]
│ [3] cuInit at /home/affans/.julia/packages/CUDAdrv/i465Q/src/libcuda.jl:18 [inlined]
│ [4] __init__() at /home/affans/.julia/packages/CUDAdrv/i465Q/src/CUDAdrv.jl:56
│ [5] _include_from_serialized(::String, ::Array{Any,1}) at ./loading.jl:685
│ [6] _require_search_from_serialized(::Base.PkgId, ::String) at ./loading.jl:765
│ [7] _require(::Base.PkgId) at ./loading.jl:990
│ [8] require(::Base.PkgId) at ./loading.jl:911
│ [9] require(::Module, ::Symbol) at ./loading.jl:906
│ [10] eval(::Module, ::Any) at ./boot.jl:330
│ [11] eval_user_input(::Any, ::REPL.REPLBackend) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.2/REPL/src/REPL.jl:86
│ [12] macro expansion at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.2/REPL/src/REPL.jl:118 [inlined]
│ [13] (::getfield(REPL, Symbol("##26#27")){REPL.REPLBackend})() at ./task.jl:268
└ @ CUDAdrv ~/.julia/packages/CUDAdrv/i465Q/src/CUDAdrv.jl:67
The error message is cryptic, and I have no idea how to start debugging it.
I think I should first test the GPUs with a basic C program, to make sure the libraries work before using Julia. Is there a basic C program that simply prints the names of the devices?