How to make CUDA.jl work in WSL2

I am using WSL2 (Ubuntu) with kernel version 4.19.121-microsoft-standard, and have installed the CUDA driver provided here: NVIDIA Drivers for CUDA on WSL.
I have also installed nvidia-cuda-toolkit.
I’m curious how I can make CUDA.jl work under these circumstances.

When I test CUDA, I get the following error message:

Error: Could not initialize CUDA
│   exception =
│    CUDA error (code 100, CUDA_ERROR_NO_DEVICE)

Thank you in advance.


I take it you have a GPU? Can you use nvidia-smi or similar in WSL2? This is basically a CUDA set-up issue, so first make sure CUDA itself works (e.g. by compiling and executing a CUDA C binary).

Are you on Windows Insiders?

Also note that WSL needs a different libcuda despite being Linux, since it uses the passthrough drivers. I’m not sure how that interacts with CUDA.jl.

Thank you all for your replies.
Yes, I’m in the Fast ring of the Windows Insiders program. Running nvidia-smi in WSL2 Ubuntu gives

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.41       Driver Version: 455.41       CUDA Version: 11.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1060   WDDM  | 00000000:02:00.0 Off |                  N/A |
| N/A   32C    P8     3W /  N/A |     90MiB /  6144MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

This is indeed a CUDA set-up issue:
I could compile and execute a simple .cu file, but the CUDA kernel somehow did nothing…

EDIT:
I can now compile and execute .cu files successfully with, e.g.,
nvcc -gencode arch=compute_60,code=sm_60 add.cu -o add_cuda
The key is the -gencode option. This confirms that CUDA itself works in my WSL2 Ubuntu.
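For anyone reproducing this check: a minimal add.cu along the lines of NVIDIA’s introductory tutorial could look like the sketch below (not from the original thread; it requires a GPU to run). The -gencode flag matters here because, without device code compiled for your GPU’s architecture, the driver would normally JIT the embedded PTX at load time, and PTX JIT is unavailable on WSL2, so the kernel silently does nothing.

```cuda
// add.cu — minimal sanity check for a working CUDA toolchain (a sketch).
// Build: nvcc -gencode arch=compute_60,code=sm_60 add.cu -o add_cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void add(int n, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] += x[i];
}

int main() {
    const int n = 1 << 20;
    float *x, *y;
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; i++) { x[i] = 1.0f; y[i] = 2.0f; }

    add<<<(n + 255) / 256, 256>>>(n, x, y);
    // Synchronize and check for errors: a silently failing kernel launch
    // (e.g. no device code for this arch) shows up here, not at launch.
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("y[0] = %f (expect 3.0)\n", y[0]);
    cudaFree(x);
    cudaFree(y);
    return 0;
}
```

If the kernel "did nothing", y[0] stays at 2.0; with a matching -gencode it should print 3.0.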


The issue was solved by adding

export LD_LIBRARY_PATH="/usr/lib/wsl/lib/:$LD_LIBRARY_PATH"

to the .zshrc file.
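For anyone hitting the same CUDA_ERROR_NO_DEVICE, a quick way to check whether the dynamic loader can see the WSL passthrough driver is sketched below (assuming the standard /usr/lib/wsl/lib layout of recent Windows Insiders builds; adjust the path if your setup differs):

```shell
# The Windows-side CUDA driver is exposed to WSL2 under /usr/lib/wsl/lib.
ls /usr/lib/wsl/lib/libcuda*        # libcuda.so, libcuda.so.1, ... should be here

# Check whether the dynamic loader can already resolve it:
ldconfig -p | grep libcuda

# If not, prepend the WSL lib directory (the .zshrc fix above; use .bashrc for bash):
export LD_LIBRARY_PATH="/usr/lib/wsl/lib/:$LD_LIBRARY_PATH"
```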


I too am having difficulties running CUDA.jl under WSL 2.

This has solved the problem of not detecting the core libraries, I guess; since then I get a true response from CUDA.functional().

However, any basic operation like the one below results in:

julia> a = CuArray{Float32}(undef, 10)
10-element CuArray{Float32,1}:

julia> b = CuArray{Float32}(undef, 10)
10-element CuArray{Float32,1}:

julia> a + b
ERROR: CUDA error: PTX JIT compiler library not found (code 221, ERROR_JIT_COMPILER_NOT_FOUND)
Stacktrace:
 [1] throw_api_error(::CUDA.cudaError_enum) at /home/coz/.julia/packages/CUDA/7vLVC/lib/cudadrv/error.jl:103
 [2] CuModule(::String, ::Dict{CUDA.CUjit_option_enum,Any}) at /home/coz/.julia/packages/CUDA/7vLVC/lib/cudadrv/module.jl:42
 [3] _cufunction(::GPUCompiler.FunctionSpec{GPUArrays.var"#broadcast_kernel#20",Tuple{CUDA.CuKernelContext,CuDeviceArray{Float32,1,CUDA.AS.Global},Base.Broadcast.Broadcasted{Nothing,Tuple{Base.OneTo{Int64}},typeof(+),Tuple{Base.Broadcast.Extruded{CuDeviceArray{Float32,1,CUDA.AS.Global},Tuple{Bool},Tuple{Int64}},Base.Broadcast.Extruded{CuDeviceArray{Float32,1,CUDA.AS.Global},Tuple{Bool},Tuple{Int64}}}},Int64}}; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /home/coz/.julia/packages/CUDA/7vLVC/src/compiler/execution.jl:337
 [4] _cufunction at /home/coz/.julia/packages/CUDA/7vLVC/src/compiler/execution.jl:304 [inlined]
 [5] check_cache(::typeof(CUDA._cufunction), ::GPUCompiler.FunctionSpec{GPUArrays.var"#broadcast_kernel#20",Tuple{CUDA.CuKernelContext,CuDeviceArray{Float32,1,CUDA.AS.Global},Base.Broadcast.Broadcasted{Nothing,Tuple{Base.OneTo{Int64}},typeof(+),Tuple{Base.Broadcast.Extruded{CuDeviceArray{Float32,1,CUDA.AS.Global},Tuple{Bool},Tuple{Int64}},Base.Broadcast.Extruded{CuDeviceArray{Float32,1,CUDA.AS.Global},Tuple{Bool},Tuple{Int64}}}},Int64}}, ::UInt64; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /home/coz/.julia/packages/GPUCompiler/pCBTA/src/cache.jl:24
 [6] broadcast_kernel at /home/coz/.julia/packages/GPUArrays/PkHCM/src/host/broadcast.jl:60 [inlined]
 [7] cached_compilation(::typeof(CUDA._cufunction), ::GPUCompiler.FunctionSpec{GPUArrays.var"#broadcast_kernel#20",Tuple{CUDA.CuKernelContext,CuDeviceArray{Float32,1,CUDA.AS.Global},Base.Broadcast.Broadcasted{Nothing,Tuple{Base.OneTo{Int64}},typeof(+),Tuple{Base.Broadcast.Extruded{CuDeviceArray{Float32,1,CUDA.AS.Global},Tuple{Bool},Tuple{Int64}},Base.Broadcast.Extruded{CuDeviceArray{Float32,1,CUDA.AS.Global},Tuple{Bool},Tuple{Int64}}}},Int64}}, ::UInt64; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /home/coz/.julia/packages/GPUCompiler/pCBTA/src/cache.jl:0
 [8] cached_compilation at /home/coz/.julia/packages/GPUCompiler/pCBTA/src/cache.jl:44 [inlined]
 [9] cufunction(::GPUArrays.var"#broadcast_kernel#20", ::Type{Tuple{CUDA.CuKernelContext,CuDeviceArray{Float32,1,CUDA.AS.Global},Base.Broadcast.Broadcasted{Nothing,Tuple{Base.OneTo{Int64}},typeof(+),Tuple{Base.Broadcast.Extruded{CuDeviceArray{Float32,1,CUDA.AS.Global},Tuple{Bool},Tuple{Int64}},Base.Broadcast.Extruded{CuDeviceArray{Float32,1,CUDA.AS.Global},Tuple{Bool},Tuple{Int64}}}},Int64}}; name::Nothing, kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /home/coz/.julia/packages/CUDA/7vLVC/src/compiler/execution.jl:298
 [10] cufunction at /home/coz/.julia/packages/CUDA/7vLVC/src/compiler/execution.jl:293 [inlined]
 [11] #launch_heuristic#816 at /home/coz/.julia/packages/CUDA/7vLVC/src/gpuarrays.jl:19 [inlined]
 [12] launch_heuristic at /home/coz/.julia/packages/CUDA/7vLVC/src/gpuarrays.jl:17 [inlined]
 [13] copyto! at /home/coz/.julia/packages/GPUArrays/PkHCM/src/host/broadcast.jl:66 [inlined]
 [14] copyto!(::CuArray{Float32,1}, ::Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{1},Tuple{Base.OneTo{Int64}},typeof(+),Tuple{CuArray{Float32,1},CuArray{Float32,1}}}) at ./broadcast.jl:864
 [15] copy(::Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{1},Tuple{Base.OneTo{Int64}},typeof(+),Tuple{CuArray{Float32,1},CuArray{Float32,1}}}) at ./broadcast.jl:840
 [16] materialize at ./broadcast.jl:820 [inlined]
 [17] broadcast_preserving_zero_d at ./broadcast.jl:809 [inlined]
 [18] +(::CuArray{Float32,1}, ::CuArray{Float32,1}) at ./arraymath.jl:39
 [19] top-level scope at REPL[7]:1

Running ] test CUDA also fails, with so much output that it overflows my terminal.

┌ Info: System information:
│ CUDA toolkit 11.0.194, artifact installation
│ CUDA driver 11.1.0
│
│ Libraries:
│ - CUBLAS: 11.1.0
│ - CURAND: 10.2.1
│ - CUFFT: 10.2.0
│ - CUSOLVER: 10.5.0
│ - CUSPARSE: 11.1.0
│ - CUPTI: 13.0.0
│ - NVML: missing
│ - CUDNN: 8.0.1 (for CUDA 11.0.0)
│ - CUTENSOR: 1.2.0 (for CUDA 11.0.0)
│
│ Toolchain:
│ - Julia: 1.4.2
│ - LLVM: 8.0.1
│ - PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3
│ - Device support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75
│
│ 1 device(s):
└ - GeForce GTX 1050 Ti (sm_61, 3.297 GiB / 4.000 GiB available)
[ Info: Testing using 1 device(s): 1. GeForce GTX 1050 Ti (UUID a2d155cc-31af-3125-b9b2-bb5b43712f8d)
[ Info: Skipping the following tests: nvml, cutensor, device/wmma
                                     |          | ---------------- GPU ---------------- | ---------------- CPU ---------------- |
Test                        (Worker) | Time (s) | GC (s) | GC % | Alloc (MB) | RSS (MB) | GC (s) | GC % | Alloc (MB) | RSS (MB) |
initialization                   (2) |     3.27 |   0.00 |  0.0 |       0.00 |      N/A |   0.06 |  1.9 |     219.96 |   834.47 |
apiutils                         (2) |     0.27 |   0.00 |  0.0 |       0.00 |      N/A |   0.00 |  0.0 |       5.74 |   834.88 |
array                            (2) |         failed at 2020-08-03T16:10:57.567
.
.
.

I have tried Ubuntu 18.04 and 20.04, with Julia v1.4.2 and v1.5.0. The NVIDIA BlackScholes sample from the CUDA on WSL docs compiles and runs fine.

Any idea or glimpse of hope?

The known-limitations page says

PTX JIT is not supported (so PTX code will not be loaded from CUDA binaries for runtime compilation)
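To confirm this limitation on your own install, you could look for the JIT compiler library that the driver loads for runtime PTX compilation; a sketch below (libnvidia-ptxjitcompiler is the usual name, but treat the exact path as an assumption):

```shell
# The driver performs PTX JIT via libnvidia-ptxjitcompiler.so.1. If it is
# absent from the WSL driver directory, code 221 (ERROR_JIT_COMPILER_NOT_FOUND)
# is expected for anything that loads PTX at runtime, as CUDA.jl does.
ls /usr/lib/wsl/lib/ | grep -i ptxjit
ldconfig -p | grep ptxjitcompiler
```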

So no CUDA.jl fun on WSL for the foreseeable future? Or did you manage to overcome this somehow?

no…