How to ignore 1 of the 2 GPUs on my machine

JosePereiraUA · September 24, 2021, 2:13pm

I’m trying to run a piece of code on a different machine than the one it was written on. This new machine has two GPUs: a Tesla K40c and an NVIDIA GeForce GTX 650. The problem seems to be with the second one. Running ] test CUDA gives the following error:

┌ Info: System information:
│ CUDA toolkit 11.4.1, artifact installation
│ CUDA driver 11.4.0
│ NVIDIA driver 470.57.2
│ 
│ Libraries: 
│ - CUBLAS: 11.5.4
│ - CURAND: 10.2.5
│ - CUFFT: 10.5.1
│ - CUSOLVER: 11.2.0
│ - CUSPARSE: 11.6.0
│ - CUPTI: 14.0.0
│ - NVML: 11.0.0+470.57.2
│ - CUDNN: 8.20.2 (for CUDA 11.4.0)
│ - CUTENSOR: 1.3.0 (for CUDA 11.2.0)
│ 
│ Toolchain:
│ - Julia: 1.6.2
│ - LLVM: 11.0.1
│ - PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0
│ - Device capability support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80
│ 
│ 2 devices:
│   0: Tesla K40c (sm_35, 11.107 GiB / 11.173 GiB available)
└   1: NVIDIA GeForce GTX 650 (sm_30, 900.750 MiB / 978.188 MiB available)
┌ Warning: Your NVIDIA GeForce GTX 650 GPU does not meet the minimal required compute capability (3.0.0 < 3.5).
│ Some functionality might be unavailable.
└ @ CUDA ~/.julia/packages/CUDA/9T5Sq/src/state.jl:237
ERROR: LoadError: BoundsError: attempt to access 1-element Vector{Any} at index [0:1]
Stacktrace:
 [1] throw_boundserror(A::Vector{Any}, I::Tuple{UnitRange{Int64}})
   @ Base ./abstractarray.jl:651
 [2] checkbounds
   @ ./abstractarray.jl:616 [inlined]
 [3] getindex(A::Vector{Any}, I::UnitRange{Int64})
   @ Base ./array.jl:807
 [4] top-level scope
   @ ~/.julia/packages/CUDA/9T5Sq/test/runtests.jl:158
 [5] include(fname::String)
   @ Base.MainInclude ./client.jl:444
 [6] top-level scope
   @ none:6
in expression starting at /home/jpereira/.julia/packages/CUDA/9T5Sq/test/runtests.jl:158
ERROR: Package CUDA errored during testing

Running the CUDA.jl introductory example gives a similar error:

julia> x_d = CUDA.fill(1.0f0, N);
julia> y_d = CUDA.fill(2.0f0, N);
julia> y_d .+= x_d
ERROR: Device capability v3.0.0 not supported by available toolchain

Although I am not sure, the issue seems to be that the NVIDIA GeForce GTX 650 does not support compute capability 3.5 (according to https://developer.nvidia.com/cuda-gpus).

Is there a way for me to just ignore this GPU and use the Tesla K40c only?
Thank you

maleadt · September 24, 2021, 2:16pm

Select a device at the start of your session using device!, or use the CUDA_VISIBLE_DEVICES environment variable.

The bounds error is strange though, could you file an issue with more details?

JosePereiraUA · September 24, 2021, 2:25pm

I’ve tried selecting the device with device!, but the error complaining about the insufficient compute capability remains.
How can I set the CUDA_VISIBLE_DEVICES environment variable?

maleadt · September 24, 2021, 2:25pm

That depends on your platform and how you execute Julia.

On Linux you just do CUDA_VISIBLE_DEVICES=1 julia in your shell.
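A minimal sketch of the two usual ways to set the variable in a POSIX shell. The value is the index of the device to keep: on the machine in this thread the system information lists the Tesla K40c as device 0 and the GTX 650 as device 1, so 0 is assumed here. `sh -c` stands in for `julia` only so that the sketch runs without Julia installed.

```shell
# Prefix form: the variable applies to this one command only.
# 'sh -c ...' is a stand-in for 'julia' in this sketch.
child_view=$(CUDA_VISIBLE_DEVICES=0 sh -c 'printf %s "$CUDA_VISIBLE_DEVICES"')
echo "child process sees: $child_view"

# Export form: the variable applies to every command started
# from this shell session afterwards (for example a later 'julia').
export CUDA_VISIBLE_DEVICES=0
echo "this session now exports: $CUDA_VISIBLE_DEVICES"
```

With the prefix form nothing persists after the command exits; with the export form the setting lasts until the shell session ends.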

JosePereiraUA · September 24, 2021, 2:34pm

I’m working on Linux Ubuntu 18.04.
Since the Tesla K40 is on device number 0, I used CUDA_VISIBLE_DEVICES=0 julia. It now shows:

julia> CUDA.devices()
CUDA.DeviceIterator() for 1 devices:
0. Tesla K40c

So the NVIDIA GeForce GTX 650 is ignored. I was now able to run the tests successfully, although they take forever:

                                          |          | ---------------- GPU ---------------- | ---------------- CPU ---------------- |
Test                             (Worker) | Time (s) | GC (s) | GC % | Alloc (MB) | RSS (MB) | GC (s) | GC % | Alloc (MB) | RSS (MB) |
initialization                        (2) |     7.70 |   0.00 |  0.0 |       0.00 |    62.88 |   0.22 |  2.8 |     548.43 |   835.81 |
gpuarrays/indexing scalar             (2) |    57.20 |   0.00 |  0.0 |       0.01 |    69.12 |   1.63 |  2.9 |    4499.45 |   835.81 |
gpuarrays/reductions/reducedim!       (2) |   189.36 |   0.01 |  0.0 |       1.03 |    70.12 |  10.21 |  5.4 |   18727.28 |   914.50 |

A follow-up question now is: how can I set this environment variable by default, on this machine?

maleadt · September 24, 2021, 2:36pm

That again depends on your environment. Check the documentation of your shell; you could, e.g., add it to ~/.profile.

You should also run Julia with --threads=auto; the CUDA.jl tests will then make use of all your cores, reducing the execution time from e.g. 1h30 to 5min on my 32-core machine.
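For the ~/.profile suggestion, one possible approach is to append an export line to the startup file only if it is not already there. This is a sketch: `persist_cuda_device` is a hypothetical helper name, and which startup file your login shell actually reads (~/.profile, ~/.bash_profile, or another) is an assumption to check against your shell's documentation.

```shell
# Append "export CUDA_VISIBLE_DEVICES=<index>" to a startup file,
# unless that exact line is already present.
# persist_cuda_device is a hypothetical helper, not a standard command.
persist_cuda_device() {
    profile_file=$1   # e.g. "$HOME/.profile" or "$HOME/.bash_profile"
    device_index=$2   # 0 is the Tesla K40c on the machine in this thread
    line="export CUDA_VISIBLE_DEVICES=$device_index"
    # -q: quiet, -x: match the whole line, -F: treat the pattern as a fixed string
    if ! grep -qxF "$line" "$profile_file" 2>/dev/null; then
        printf '%s\n' "$line" >> "$profile_file"
    fi
}
```

Usage would be `persist_cuda_device "$HOME/.profile" 0`, followed by logging in again or sourcing the file in the current shell. Calling it twice leaves a single line in the file.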

JosePereiraUA · September 24, 2021, 2:48pm

I’ve added the --threads=auto flag, and indeed it was dramatically faster. Is there a way to always use this flag (other than a bash alias)? (Should I always use this flag?)

I’ve marked the above answer as correct. Setting the CUDA_VISIBLE_DEVICES environment variable (in the .bash_profile, in my case) solves this particular issue.

There were some errors in the tests, but I think that’s something I can live with, haha.
Thank you so much for your help, @maleadt !
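On the question of enabling threads without an alias: one option is the JULIA_NUM_THREADS environment variable, which Julia reads at startup when no --threads flag is given. The sketch below assumes GNU coreutils' `nproc` is available (it is on Ubuntu) and uses an explicit number because the Julia 1.6 series shown in the log above may not accept `auto` as a value for this variable.

```shell
# JULIA_NUM_THREADS sets Julia's default thread count at startup.
# nproc prints the number of processing units available to this process,
# which is used here as an approximation of what --threads=auto picks.
export JULIA_NUM_THREADS="$(nproc)"
echo "Julia will default to $JULIA_NUM_THREADS threads"
```

Placing the export line in the same startup file as CUDA_VISIBLE_DEVICES would make both settings apply to every new session.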

johnh · September 24, 2021, 4:05pm

As an aside, if anyone is running on a shared HPC system with multiple GPUs, one common way to handle this is using cgroups.
The batch scheduler will create a cgroup for you and assign memory, CPUs and GPU devices to it.
Your CUDA_VISIBLE_DEVICES variable will then match the GPU which is ‘your GPU’.
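Inside a job it can be useful to check what the scheduler actually handed over. The following sketch counts the comma-separated entries in CUDA_VISIBLE_DEVICES; `count_assigned_gpus` is a hypothetical helper name, and the treatment of an unset versus an empty variable reflects the usual CUDA behavior as I understand it rather than anything specific to a particular scheduler.

```shell
# Report how many GPUs CUDA_VISIBLE_DEVICES exposes to this process.
# count_assigned_gpus is a hypothetical helper, not a scheduler command.
count_assigned_gpus() {
    if [ -z "${CUDA_VISIBLE_DEVICES+set}" ]; then
        # Variable is unset: CUDA applies no restriction.
        echo "unrestricted"
    elif [ -z "$CUDA_VISIBLE_DEVICES" ]; then
        # Variable is set but empty: no GPU is visible.
        echo 0
    else
        # Entries may be indices or GPU UUIDs; count the non-empty ones.
        printf '%s\n' "$CUDA_VISIBLE_DEVICES" | tr ',' '\n' | grep -c .
    fi
}
```

Calling `count_assigned_gpus` at the top of a job script gives a quick sanity check before Julia starts.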
