How to make CUDA.jl work in WSL2

I am using WSL2 (Ubuntu) with kernel version 4.19.121-microsoft-standard, and have installed the CUDA driver provided here: NVIDIA Drivers for CUDA on WSL.
I have also installed nvidia-cuda-toolkit.
I’m curious how I can make CUDA.jl work in this setup.

When I test CUDA, I get the following error message:

Error: Could not initialize CUDA
│   exception =
│    CUDA error (code 100, CUDA_ERROR_NO_DEVICE)

Thank you in advance.


I take it you have a GPU? Can you use nvidia-smi or similar in WSL2? This is basically a CUDA set-up issue, so first make sure CUDA itself works (e.g. by compiling and executing a CUDA C binary).

Are you on Windows Insiders?

Also note that WSL needs a different libcuda despite being Linux, since it uses the passthrough drivers. Not sure how that interacts with CUDA.jl.

Thank you all for your replies.
Yes, I’m in the fast ring of the Windows Insiders program. Running nvidia-smi in WSL2 Ubuntu gives

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.41       Driver Version: 455.41       CUDA Version: 11.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1060   WDDM  | 00000000:02:00.0 Off |                  N/A |
| N/A   32C    P8     3W /  N/A |     90MiB /  6144MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

This is indeed a CUDA set-up issue:
I could compile and execute a simple .cu file, but the CUDA kernel function somehow did nothing…

EDIT:
I can now compile and execute .cu files successfully with, e.g.,
nvcc -gencode arch=compute_60,code=sm_60 add.cu -o add_cuda
The key is the -gencode option (presumably because it embeds device code for the right architecture, so no PTX JIT compilation is needed at runtime). This shows that CUDA itself works in my WSL2 Ubuntu.


The issue was solved by adding

export LD_LIBRARY_PATH="/usr/lib/wsl/lib/:$LD_LIBRARY_PATH"

to the .zshrc file.
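
A quick way to double-check from a fresh Julia session that the passthrough driver library is now the one being picked up (a minimal sketch using the Libdl standard library):

using Libdl

# Under WSL2 this should resolve to the passthrough driver in
# /usr/lib/wsl/lib/, not to a regular Linux driver installation.
println(Libdl.dlpath("libcuda"))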


I too am having difficulties with running CUDA.jl under WSL 2.

This solved the problem of not detecting the core libraries, I guess; since then I get a true response from CUDA.functional().
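
For reference, that check is just:

julia> using CUDA

julia> CUDA.functional()
true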

However, any basic operation like the one below results in:

julia> a = CuArray{Float32}(undef, 10)
10-element CuArray{Float32,1}:

julia> b = CuArray{Float32}(undef, 10)
10-element CuArray{Float32,1}:

julia> a + b
ERROR: CUDA error: PTX JIT compiler library not found (code 221, ERROR_JIT_COMPILER_NOT_FOUND)
Stacktrace:
 [1] throw_api_error(::CUDA.cudaError_enum) at /home/coz/.julia/packages/CUDA/7vLVC/lib/cudadrv/error.jl:103
 [2] CuModule(::String, ::Dict{CUDA.CUjit_option_enum,Any}) at /home/coz/.julia/packages/CUDA/7vLVC/lib/cudadrv/module.jl:42
 [3] _cufunction(::GPUCompiler.FunctionSpec{GPUArrays.var"#broadcast_kernel#20",Tuple{CUDA.CuKernelContext,CuDeviceArray{Float32,1,CUDA.AS.Global},Base.Broadcast.Broadcasted{Nothing,Tuple{Base.OneTo{Int64}},typeof(+),Tuple{Base.Broadcast.Extruded{CuDeviceArray{Float32,1,CUDA.AS.Global},Tuple{Bool},Tuple{Int64}},Base.Broadcast.Extruded{CuDeviceArray{Float32,1,CUDA.AS.Global},Tuple{Bool},Tuple{Int64}}}},Int64}}; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /home/coz/.julia/packages/CUDA/7vLVC/src/compiler/execution.jl:337
 [4] _cufunction at /home/coz/.julia/packages/CUDA/7vLVC/src/compiler/execution.jl:304 [inlined]
 [5] check_cache(::typeof(CUDA._cufunction), ::GPUCompiler.FunctionSpec{GPUArrays.var"#broadcast_kernel#20",Tuple{CUDA.CuKernelContext,CuDeviceArray{Float32,1,CUDA.AS.Global},Base.Broadcast.Broadcasted{Nothing,Tuple{Base.OneTo{Int64}},typeof(+),Tuple{Base.Broadcast.Extruded{CuDeviceArray{Float32,1,CUDA.AS.Global},Tuple{Bool},Tuple{Int64}},Base.Broadcast.Extruded{CuDeviceArray{Float32,1,CUDA.AS.Global},Tuple{Bool},Tuple{Int64}}}},Int64}}, ::UInt64; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /home/coz/.julia/packages/GPUCompiler/pCBTA/src/cache.jl:24
 [6] broadcast_kernel at /home/coz/.julia/packages/GPUArrays/PkHCM/src/host/broadcast.jl:60 [inlined]
 [7] cached_compilation(::typeof(CUDA._cufunction), ::GPUCompiler.FunctionSpec{GPUArrays.var"#broadcast_kernel#20",Tuple{CUDA.CuKernelContext,CuDeviceArray{Float32,1,CUDA.AS.Global},Base.Broadcast.Broadcasted{Nothing,Tuple{Base.OneTo{Int64}},typeof(+),Tuple{Base.Broadcast.Extruded{CuDeviceArray{Float32,1,CUDA.AS.Global},Tuple{Bool},Tuple{Int64}},Base.Broadcast.Extruded{CuDeviceArray{Float32,1,CUDA.AS.Global},Tuple{Bool},Tuple{Int64}}}},Int64}}, ::UInt64; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /home/coz/.julia/packages/GPUCompiler/pCBTA/src/cache.jl:0
 [8] cached_compilation at /home/coz/.julia/packages/GPUCompiler/pCBTA/src/cache.jl:44 [inlined]
 [9] cufunction(::GPUArrays.var"#broadcast_kernel#20", ::Type{Tuple{CUDA.CuKernelContext,CuDeviceArray{Float32,1,CUDA.AS.Global},Base.Broadcast.Broadcasted{Nothing,Tuple{Base.OneTo{Int64}},typeof(+),Tuple{Base.Broadcast.Extruded{CuDeviceArray{Float32,1,CUDA.AS.Global},Tuple{Bool},Tuple{Int64}},Base.Broadcast.Extruded{CuDeviceArray{Float32,1,CUDA.AS.Global},Tuple{Bool},Tuple{Int64}}}},Int64}}; name::Nothing, kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /home/coz/.julia/packages/CUDA/7vLVC/src/compiler/execution.jl:298
 [10] cufunction at /home/coz/.julia/packages/CUDA/7vLVC/src/compiler/execution.jl:293 [inlined]
 [11] #launch_heuristic#816 at /home/coz/.julia/packages/CUDA/7vLVC/src/gpuarrays.jl:19 [inlined]
 [12] launch_heuristic at /home/coz/.julia/packages/CUDA/7vLVC/src/gpuarrays.jl:17 [inlined]
 [13] copyto! at /home/coz/.julia/packages/GPUArrays/PkHCM/src/host/broadcast.jl:66 [inlined]
 [14] copyto!(::CuArray{Float32,1}, ::Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{1},Tuple{Base.OneTo{Int64}},typeof(+),Tuple{CuArray{Float32,1},CuArray{Float32,1}}}) at ./broadcast.jl:864
 [15] copy(::Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{1},Tuple{Base.OneTo{Int64}},typeof(+),Tuple{CuArray{Float32,1},CuArray{Float32,1}}}) at ./broadcast.jl:840
 [16] materialize at ./broadcast.jl:820 [inlined]
 [17] broadcast_preserving_zero_d at ./broadcast.jl:809 [inlined]
 [18] +(::CuArray{Float32,1}, ::CuArray{Float32,1}) at ./arraymath.jl:39
 [19] top-level scope at REPL[7]:1

] test CUDA also fails, with a gazillion outputs that overflow my terminal.

┌ Info: System information:
│ CUDA toolkit 11.0.194, artifact installation
│ CUDA driver 11.1.0
│
│ Libraries:
│ - CUBLAS: 11.1.0
│ - CURAND: 10.2.1
│ - CUFFT: 10.2.0
│ - CUSOLVER: 10.5.0
│ - CUSPARSE: 11.1.0
│ - CUPTI: 13.0.0
│ - NVML: missing
│ - CUDNN: 8.0.1 (for CUDA 11.0.0)
│ - CUTENSOR: 1.2.0 (for CUDA 11.0.0)
│
│ Toolchain:
│ - Julia: 1.4.2
│ - LLVM: 8.0.1
│ - PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3
│ - Device support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75
│
│ 1 device(s):
└ - GeForce GTX 1050 Ti (sm_61, 3.297 GiB / 4.000 GiB available)
[ Info: Testing using 1 device(s): 1. GeForce GTX 1050 Ti (UUID a2d155cc-31af-3125-b9b2-bb5b43712f8d)
[ Info: Skipping the following tests: nvml, cutensor, device/wmma
                                     |          | ---------------- GPU ---------------- | ---------------- CPU ---------------- |
Test                        (Worker) | Time (s) | GC (s) | GC % | Alloc (MB) | RSS (MB) | GC (s) | GC % | Alloc (MB) | RSS (MB) |
initialization                   (2) |     3.27 |   0.00 |  0.0 |       0.00 |      N/A |   0.06 |  1.9 |     219.96 |   834.47 |
apiutils                         (2) |     0.27 |   0.00 |  0.0 |       0.00 |      N/A |   0.00 |  0.0 |       5.74 |   834.88 |
array                            (2) |         failed at 2020-08-03T16:10:57.567
.
.
.

I have tried Ubuntu 18.04 and 20.04, with Julia v1.4.2 and v1.5.0. The NVIDIA BlackScholes sample from the CUDA on WSL docs compiles and runs fine.

Any idea or glimpse of hope?

The known-limitations page says

PTX JIT is not supported (so PTX code will not be loaded from CUDA binaries for runtime compilation)
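
This limitation hits CUDA.jl directly: it compiles Julia kernels to PTX and has the driver JIT that PTX into device code when the module is loaded (the CuModule step in the stacktrace above). A minimal sketch that shows the PTX being generated:

using CUDA

kernel() = nothing

# On a working setup this prints the generated PTX and launches the
# kernel; on WSL2 without PTX JIT support, the module load is the
# step that fails with ERROR_JIT_COMPILER_NOT_FOUND.
@device_code_ptx @cuda kernel()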

So no CUDA.jl fun on WSL for the foreseeable future? Or did you manage to overcome this somehow?

no…

Hi guys, there may be an avenue worth investigating. I just tested running the official Julia Docker image with GPU support:
sudo docker run --gpus all -it --rm julia
having followed (approximately) the guide from here:
CUDA on WSL docs
Installing nvidia-docker2 is the key part. I did it on Arch Linux in WSL2 (the AUR package nvidia-docker is the equivalent).
On the first run I did a little test:

julia> using CUDA, Test

julia> x_d = CUDA.fill(1.0f0, 2^10);
Downloading artifact: CUDA111
Downloading artifact: CUDNN_CUDA111
Downloading artifact: CUTENSOR_CUDA111

julia> y_d = CUDA.fill(2.0f0, 2^10);
julia> y_d .+= x_d;
julia> @test all(Array(y_d) .== 3.0f0)
Test Passed

The CUDA111 artifact download took forever for some reason, but it worked. I’m a total newb to Docker (and Julia too!) but I just thought I’d throw it out here. And to be clear, I’m running the Docker instance entirely within Linux on WSL.

A snippet of the ] test CUDA output:

gpuarrays/mapreduce derivatives: Error During Test at none:1
  Test threw exception
  Expression: gpuarrays/mapreduce derivatives
  On worker 58:
  NVMLError: Unknown Error (code 999)
  throw_api_error at /root/.julia/packages/CUDA/wTQsK/lib/nvml/error.jl:22
  compute_processes at /root/.julia/packages/CUDA/wTQsK/lib/nvml/device.jl:124
  runtests at /root/.julia/packages/CUDA/wTQsK/test/setup.jl:70
  #106 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/process_messages.jl:294
  run_work_thunk at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/process_messages.jl:79
  macro expansion at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/process_messages.jl:294 [inlined]
  #105 at ./task.jl:356


Test Summary:                         | Pass  Error  Total
  Overall                             |   70     58    128
    initialization                    |           1      1
    apiutils                          |           1      1
    array                             |           1      1
    broadcast                         |           1      1
    codegen                           |    8      2     10
    cublas                            |           1      1
    cudnn                             |           1      1
    cufft                             |           1      1
    curand                            |           1      1
    cusolver                          |           1      1
    cusparse                          |           1      1
    examples                          |           1      1
    exceptions                        |           1      1
    execution                         |   56      1     57
    forwarddiff                       |           1      1
    iterator                          |           1      1
    nnlib                             |           1      1
    nvml                              |    6      1      7
    nvtx                              |           1      1

PTX JIT is supported under WSL2 nowadays, so you shouldn’t need a container.
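
A quick way to confirm is any operation that compiles and loads a fresh kernel, since that is the path that used to fail with ERROR_JIT_COMPILER_NOT_FOUND. A minimal sketch:

using CUDA

a = CUDA.fill(1.0f0, 10)
b = CUDA.fill(2.0f0, 10)
# Broadcasting forces kernel compilation and a module load through the
# driver's PTX JIT; if this passes, the JIT path works.
@assert Array(a .+ b) == fill(3.0f0, 10)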

Thanks, good to know. Though since I had errors just trying within WSL itself (the preferred path), I guess I’ll look again…

I tried again but failed to use CUDA on WSL/Arch (which is not really a WSL distro supported by Windows, or even by Arch).

So I installed Ubuntu and everything works so far!