I’m working on a shared system, and the admins are making various CUDA versions available via
module load. Unfortunately, the installed modules don’t contain
libcuda.so, but it looks like pretty much everything else is there. Is there a way to install
CuArrays in this case? There is a stub for a
libcuda.so, which fails with
“Building against CUDA driver stubs, which is not supported.”
and a different version in
/usr/lib64 (even before ‘module load’), which fails with
“Initializing CUDA driver failed: no CUDA hardware available (code 100).”
I can see that the other libs are correctly detected, just this
libcuda.so problem prevents me from proceeding. I don’t know enough about CUDA, but is it possible to build
CuArrays without it? Tensorflow and pytorch work with CUDA support on this system.
/usr/lib64/libcuda.so is sym-linked to
libcuda.so.418.39, regardless of the CUDA toolkit version I load with
module load. I tried loading CUDA 10.1.105, which is the newest, but that fails like all the others.
Sorry to be difficult (me? difficult?), but 418.39 is an Nvidia driver version, not a CUDA toolkit version.
I will check on my setup when I get to work (about an hour away)
I suggest getting some nice cookies or doughnuts. Casually leave them in the admins’ lair (sorry - their cubicle). Sweetly ask for libcuda to be installed.
I just logged in to an Ubuntu system with Nvidia GPUs.
/usr/lib/x86_64-linux-gnu/libcuda.so.1 -> libcuda.so.384.90 where 384.90 is the Nvidia driver version.
I can see a stub library in /usr/local/cuda-9.0/lib64/stubs, which does have routines in it.
Every CUDA application/library needs libcuda.so, so your system definitely has it. The stubs are only there to facilitate compilation and don’t contain any functionality.
If CUDAapi (which underpins the build system of CuArrays) detects the wrong libcuda.so, you can use the CUDA_PATH environment variable to help it find the correct library (run with JULIA_DEBUG=CUDAapi to see details on what’s detected and which paths are being searched).
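Something like the following sketch, for example. The toolkit path here is just an assumption; substitute whatever your module load actually puts on the system:

```shell
# Hypothetical toolkit root; substitute the path your `module load` sets up
# (e.g. whatever `module show` reports for the CUDA module).
export CUDA_PATH=/share/CUDA/10.1.105

# Then rebuild with CUDAapi debug logging enabled. Commented out here so the
# sketch stays runnable on a machine without Julia:
# JULIA_DEBUG=CUDAapi julia -e 'using Pkg; Pkg.build("CuArrays")'

echo "CUDA_PATH=$CUDA_PATH"
```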
Thank you for the replies and sorry, maybe I didn’t explain this correctly.
The system has one libcuda in /usr, regardless of which toolkit version (in /share/CUDA/version/…) is being loaded. The toolkit libs are found and loaded correctly, but when the system gets to libcuda, it fails with the error 100 mentioned above. The only version-specific libcuda is the stub, which also fails.
Is there a way I can dig into the 100 error more? The hardware is definitely there…
Does nvidia-smi work? Maybe you can try
stracing that binary and seeing where it picks up
libcuda, to check whether the one in
/usr/lib64 (throwing error 100) is the correct one.
EDIT: you don’t need to specify any CUDA_VISIBLE_DEVICES, do you?
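To sketch the idea, here is the ldd/strace pattern demonstrated on /bin/ls (which exists everywhere); on the cluster you’d point these at nvidia-smi or the failing binary and grep for libcuda instead:

```shell
# ldd prints, for each shared-library dependency, the file the dynamic
# loader resolves it to. On the cluster, replace /bin/ls with the binary
# in question and grep for 'libcuda' instead of 'libc'.
ldd /bin/ls | grep -o 'libc\.so\.[0-9]*' | head -n 1

# strace variant: shows which library files the process actually opens at
# runtime. Uncomment on a machine where strace is installed:
# strace -f -e trace=openat nvidia-smi 2>&1 | grep 'libcuda\.so'
```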
Nvidia-smi works fine. I’ll try the strace idea. Or ldd… Will get back with that info, but it’ll be a couple hours.
I don’t set CUDA_VISIBLE_DEVICES for other apps, but I’ll try that, too. Thanks for the suggestions.
That was it.
CUDA_VISIBLE_DEVICES was not set. It gets set automatically in our queues, but I had been running interactively.
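For anyone else hitting this in an interactive session, a minimal sketch of the fix (device index 0 is an assumption; check the device list with nvidia-smi first):

```shell
# Expose the first GPU to CUDA applications in this shell session. The batch
# queue sets this automatically on our system, but interactive sessions don't.
export CUDA_VISIBLE_DEVICES=0

echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
```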
Thanks again for taking the time to look into this.