I am learning GPU programming for the first time. I have access to two Tesla K80s on an HPC cluster. To begin, I request a GPU node through Slurm and launch an interactive job.
[affans@hpc ~]$ srun -p gpuq --gres=gpu:2 --pty bash
[affans@node018 ~]$ nvidia-smi
Fri Dec 20 13:46:43 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.111                Driver Version: 384.111                   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 00000000:06:00.0 Off |                    0 |
| N/A   30C    P8    27W / 149W |      0MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           On   | 00000000:07:00.0 Off |                    0 |
| N/A   24C    P8    27W / 149W |      0MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
Then I launch Julia in this shell and try to load CUDAdrv. (I have two Julia installations: v1.2 in my home directory, and a system-wide v1.0.3 that no one uses.)
[affans@node018 ~]$ export JULIA_CUDA_VERBOSE=true
[affans@node018 ~]$ ./bin/julia
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.2.0 (2019-08-20)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |
julia> using CUDAdrv
┌ Error: CUDAdrv.jl failed to initialize
│ exception =
│ CUDA error: unknown error (code 999, ERROR_UNKNOWN)
│ Stacktrace:
│ [1] throw_api_error(::CUDAdrv.cudaError_enum) at /home/affans/.julia/packages/CUDAdrv/i465Q/src/error.jl:131
│ [2] macro expansion at /home/affans/.julia/packages/CUDAdrv/i465Q/src/error.jl:144 [inlined]
│ [3] cuInit at /home/affans/.julia/packages/CUDAdrv/i465Q/src/libcuda.jl:18 [inlined]
│ [4] __init__() at /home/affans/.julia/packages/CUDAdrv/i465Q/src/CUDAdrv.jl:56
│ [5] _include_from_serialized(::String, ::Array{Any,1}) at ./loading.jl:685
│ [6] _require_search_from_serialized(::Base.PkgId, ::String) at ./loading.jl:765
│ [7] _require(::Base.PkgId) at ./loading.jl:990
│ [8] require(::Base.PkgId) at ./loading.jl:911
│ [9] require(::Module, ::Symbol) at ./loading.jl:906
│ [10] eval(::Module, ::Any) at ./boot.jl:330
│ [11] eval_user_input(::Any, ::REPL.REPLBackend) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.2/REPL/src/REPL.jl:86
│ [12] macro expansion at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.2/REPL/src/REPL.jl:118 [inlined]
│ [13] (::getfield(REPL, Symbol("##26#27")){REPL.REPLBackend})() at ./task.jl:268
└ @ CUDAdrv ~/.julia/packages/CUDAdrv/i465Q/src/CUDAdrv.jl:67
The error message is cryptic, and I have no idea how to debug it.
I think I should first test the GPUs with a basic C program, to make sure the driver and libraries work before involving Julia. Is there a basic C program that simply prints the names of the devices?