CUDA makes Julia freeze

I want to use Julia on a GPU.
So I installed CUDA.jl with add CUDA and ran the following code.

julia> using CUDA
julia> y = CUDA.fill(2.0, 3)

I have been waiting for more than an hour, but there is still no response.
It seems to have frozen. When I interrupted it with Ctrl+C, I got the following error:

Force throwing a SIGINT
ERROR: ^CInterruptException:
Stacktrace:
  [1] macro expansion
    @ ~/.julia/packages/CUDA/DfvRa/lib/cudadrv/libcuda.jl:84 [inlined]
  [2] macro expansion
    @ ~/.julia/packages/CUDA/DfvRa/lib/cudadrv/error.jl:95 [inlined]
  [3] cuDevicePrimaryCtxRetain
    @ ~/.julia/packages/CUDA/DfvRa/lib/utils/call.jl:26 [inlined]
  [4] CuContext
    @ ~/.julia/packages/CUDA/DfvRa/lib/cudadrv/context.jl:57 [inlined]
  [5] context(dev::CuDevice)
    @ CUDA ~/.julia/packages/CUDA/DfvRa/lib/cudadrv/state.jl:222
  [6] CUDA.TaskLocalState(dev::CuDevice) (repeats 2 times)
    @ CUDA ~/.julia/packages/CUDA/DfvRa/lib/cudadrv/state.jl:50
  [7] task_local_state!()
    @ CUDA ~/.julia/packages/CUDA/DfvRa/lib/cudadrv/state.jl:73
  [8] active_state
    @ ~/.julia/packages/CUDA/DfvRa/lib/cudadrv/state.jl:106 [inlined]
  [9] #_alloc#170
    @ ~/.julia/packages/CUDA/DfvRa/src/pool.jl:309 [inlined]
 [10] #alloc#169
    @ ~/.julia/packages/CUDA/DfvRa/src/pool.jl:299 [inlined]
 [11] alloc
    @ ~/.julia/packages/CUDA/DfvRa/src/pool.jl:293 [inlined]
 [12] CuArray
    @ ~/.julia/packages/CUDA/DfvRa/src/array.jl:42 [inlined]
 [13] CuArray
    @ ~/.julia/packages/CUDA/DfvRa/src/array.jl:125 [inlined]
 [14] CuArray
    @ ~/.julia/packages/CUDA/DfvRa/src/array.jl:136 [inlined]
 [15] CuArray
    @ ~/.julia/packages/CUDA/DfvRa/src/array.jl:138 [inlined]
 [16] fill(v::Float64, dims::Int64)
    @ CUDA ~/.julia/packages/CUDA/DfvRa/src/array.jl:602
 [17] top-level scope
    @ REPL[12]:1
 [18] top-level scope
    @ ~/.julia/packages/CUDA/DfvRa/src/initialization.jl:52

Here is some information about my environment.

julia> versioninfo()
Julia Version 1.8.3
Commit 0434deb161e (2022-11-14 20:14 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 40 × Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, broadwell)
  Threads: 1 on 40 virtual cores
Environment:
  LD_LIBRARY_PATH = :/usr/local/lib64:/usr/local/lib64

julia> CUDA.versioninfo()
CUDA toolkit 11.7, artifact installation
NVIDIA driver 450.66.0, for CUDA 11.0
CUDA driver 11.7

Libraries:
- CUBLAS: 11.10.1
- CURAND: 10.2.10
- CUFFT: 10.7.2
- CUSOLVER: 11.3.5
- CUSPARSE: 11.7.3
- CUPTI: 17.0.0
- NVML: 11.0.0+450.66
- CUDNN: 8.30.2 (for CUDA 11.5.0)
- CUTENSOR: 1.4.0 (for CUDA 11.5.0)

Toolchain:
- Julia: 1.8.3
- LLVM: 13.0.1
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0, 7.1, 7.2
- Device capability support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80, sm_86

4 devices:
  0: Tesla V100-DGXS-32GB (sm_70, 31.739 GiB / 31.743 GiB available)
  1: Tesla V100-DGXS-32GB (sm_70, 31.742 GiB / 31.746 GiB available)
  2: Tesla V100-DGXS-32GB (sm_70, 31.742 GiB / 31.746 GiB available)
  3: Tesla V100-DGXS-32GB (sm_70, 31.742 GiB / 31.746 GiB available)
julia> using CUDA

julia> has_cuda()
true

I also confirmed there was no waiting job on GPU.
I need your help. What should I do?

EDIT

I noticed that the system versions of CUDA and cuDNN were not up to date.
I am not an admin, so I need to request an update…

$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130
$ cat /usr/include/cudnn.h | grep CUDNN_MAJOR -A 2
#define CUDNN_MAJOR 7
#define CUDNN_MINOR 5
#define CUDNN_PATCHLEVEL 0
--
#define CUDNN_VERSION (CUDNN_MAJOR * 1000 + CUDNN_MINOR * 100 + CUDNN_PATCHLEVEL)

#include "driver_types.h"

Does this happen on all calls? Can you execute any others? For example:

y=CUDA.zeros(8,8)

Yes, it happens on other calls too.
CUDA.zeros also freezes.

cuDevicePrimaryCtxRetain hanging is unlikely to be an issue with CUDA.jl, but is indicative of a deeper issue. Can you check dmesg? Try using different devices by starting Julia with the CUDA_VISIBLE_DEVICES env var set to a device number, or by using CUDA.device! at the start of a session. It may be that certain devices are specially configured, e.g., to only allow a single compute process.
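
For anyone who wants to script the per-device check suggested above, here is a minimal shell sketch. It assumes GNU coreutils timeout is installed; the function name probe_device and the PROBE_TIMEOUT variable are made up for this illustration and are not part of CUDA.jl or the NVIDIA tools. Each probe runs in its own process with only one GPU visible and a time limit, so a device that hangs does not block the check of the others.

```shell
# Run a command with a single GPU visible, giving up after a time limit.
# Usage: probe_device <device-index> <command> [args...]
# PROBE_TIMEOUT (seconds, default 60) is a hypothetical name for this sketch.
probe_device() {
  dev=$1
  shift
  # -k 5: send SIGKILL if the command is still alive 5 s after SIGTERM.
  # timeout exits with status 124 when the time limit is reached.
  CUDA_VISIBLE_DEVICES=$dev timeout -k 5 "${PROBE_TIMEOUT:-60}" "$@"
}

# Example loop over the four devices of this machine (left commented out
# because it needs Julia and a GPU):
# for dev in 0 1 2 3; do
#   probe_device "$dev" julia -e 'using CUDA; CUDA.zeros(8, 8)' \
#     && echo "device $dev: ok" || echo "device $dev: failed or timed out"
# done
```

Keep the timeout generous, since each probe also pays for Julia startup and package loading. A process that is blocked inside the driver may not react even to SIGKILL, in which case the probe itself can hang.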

Thank you for your comments.

Trying CUDA_VISIBLE_DEVICES

Try using different devices by starting Julia with the CUDA_VISIBLE_DEVICES env var set to a device number

$ export CUDA_VISIBLE_DEVICES=0
$ julia

julia> using CUDA
julia> y = CUDA.zeros(8,8)
(freezing...)

$ export CUDA_VISIBLE_DEVICES=3
$ julia

julia> using CUDA
julia> y = CUDA.zeros(8,8)
(freezing...)

Trying CUDA.device!

using CUDA.device! at the start of a session

julia> using CUDA
julia> CUDA.device!(1)
ERROR: CUDA error: invalid device ordinal (code 101, ERROR_INVALID_DEVICE)
Stacktrace:
 [1] throw_api_error(res::CUDA.cudaError_enum)
   @ CUDA ~/.julia/packages/CUDA/DfvRa/lib/cudadrv/error.jl:89
 [2] macro expansion
   @ ~/.julia/packages/CUDA/DfvRa/lib/cudadrv/error.jl:97 [inlined]
 [3] cuDeviceGet
   @ ~/.julia/packages/CUDA/DfvRa/lib/utils/call.jl:26 [inlined]
 [4] CuDevice
   @ ~/.julia/packages/CUDA/DfvRa/lib/cudadrv/devices.jl:17 [inlined]
 [5] device! (repeats 2 times)
   @ ~/.julia/packages/CUDA/DfvRa/lib/cudadrv/state.jl:304 [inlined]
 [6] top-level scope
   @ REPL[6]:1
 [7] top-level scope
   @ ~/.julia/packages/CUDA/DfvRa/src/initialization.jl:52

julia> CUDA.device!(0)
(freezing...)

Checking dmesg.

I ran the dmesg command. Many messages like the following appeared, but I am a beginner at managing GPUs, so I don’t know what is going on.

$ dmesg
[2524465.174076] NVRM: client does not support versioning!!
[2524741.143834] NVRM: Xid (PCI:0000:07:00): 31, pid=27114, Ch 00000018, intr 00000000. MMU Fault: ENGINE HOST7 HUBCLIENT_HOST faulted @ 0x2_00618000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
[2524741.163757] NVRM: Xid (PCI:0000:07:00): 31, pid=27114, Ch 0000001c, intr 00000000. MMU Fault: ENGINE HOST8 HUBCLIENT_HOST faulted @ 0x2_00624000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
[2524741.183532] NVRM: Xid (PCI:0000:07:00): 31, pid=27114, Ch 00000020, intr 00000000. MMU Fault: ENGINE HOST9 HUBCLIENT_HOST faulted @ 0x2_00630000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
[2524741.198413] NVRM: Xid (PCI:0000:07:00): 31, pid=27114, Ch 00000010, intr 00000000. MMU Fault: ENGINE HOST0 HUBCLIENT_HOST faulted @ 0x2_00600000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
[2524741.288950] NVRM: Xid (PCI:0000:07:00): 31, pid=27114, Ch 00000024, intr 00000000. MMU Fault: ENGINE HOST10 HUBCLIENT_HOST faulted @ 0x2_0063c000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
[2524741.309166] NVRM: Xid (PCI:0000:07:00): 31, pid=27114, Ch 00000028, intr 00000000. MMU Fault: ENGINE HOST11 HUBCLIENT_HOST faulted @ 0x2_00648000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
[2526256.784500] NVRM: client does not support versioning!!
[2527544.195629] NVRM: client does not support versioning!!
[2527544.439061] NVRM: Xid (PCI:0000:07:00): 31, pid=33778, Ch 00000010, intr 00000000. MMU Fault: ENGINE HOST0 HUBCLIENT_HOST faulted @ 0x2_00600000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
[2527544.459102] NVRM: Xid (PCI:0000:07:00): 31, pid=33778, Ch 00000018, intr 00000000. MMU Fault: ENGINE HOST7 HUBCLIENT_HOST faulted @ 0x2_00618000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
[2527544.480957] NVRM: Xid (PCI:0000:07:00): 31, pid=33778, Ch 0000001c, intr 00000000. MMU Fault: ENGINE HOST8 HUBCLIENT_HOST faulted @ 0x2_00624000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
[2527544.493946] NVRM: Xid (PCI:0000:07:00): 31, pid=33778, Ch 00000020, intr 00000000. MMU Fault: ENGINE HOST9 HUBCLIENT_HOST faulted @ 0x2_00630000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
[2527544.515491] NVRM: Xid (PCI:0000:07:00): 31, pid=33778, Ch 00000024, intr 00000000. MMU Fault: ENGINE HOST10 HUBCLIENT_HOST faulted @ 0x2_0063c000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
[2527544.590972] NVRM: Xid (PCI:0000:07:00): 31, pid=33778, Ch 00000028, intr 00000000. MMU Fault: ENGINE HOST11 HUBCLIENT_HOST faulted @ 0x2_00648000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
[2528627.365021] NVRM: client does not support versioning!!
[2528627.631303] NVRM: Xid (PCI:0000:07:00): 31, pid=36529, Ch 00000010, intr 00000000. MMU Fault: ENGINE HOST0 HUBCLIENT_HOST faulted @ 0x2_00600000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
[2528627.654074] NVRM: Xid (PCI:0000:07:00): 31, pid=36529, Ch 00000018, intr 00000000. MMU Fault: ENGINE HOST7 HUBCLIENT_HOST faulted @ 0x2_00618000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
[2528627.672220] NVRM: Xid (PCI:0000:07:00): 31, pid=36529, Ch 0000001c, intr 00000000. MMU Fault: ENGINE HOST8 HUBCLIENT_HOST faulted @ 0x2_00624000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
[2528627.695145] NVRM: Xid (PCI:0000:07:00): 31, pid=36529, Ch 00000020, intr 00000000. MMU Fault: ENGINE HOST9 HUBCLIENT_HOST faulted @ 0x2_00630000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
[2528627.713821] NVRM: Xid (PCI:0000:07:00): 31, pid=36529, Ch 00000024, intr 00000000. MMU Fault: ENGINE HOST10 HUBCLIENT_HOST faulted @ 0x2_0063c000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
[2528627.731601] NVRM: Xid (PCI:0000:07:00): 31, pid=36529, Ch 00000028, intr 00000000. MMU Fault: ENGINE HOST11 HUBCLIENT_HOST faulted @ 0x2_00648000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
[2528874.212675] NVRM: client does not support versioning!!
[2528874.483560] NVRM: Xid (PCI:0000:07:00): 31, pid=37062, Ch 00000010, intr 00000000. MMU Fault: ENGINE HOST0 HUBCLIENT_HOST faulted @ 0x2_00600000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
[2528874.505255] NVRM: Xid (PCI:0000:07:00): 31, pid=37062, Ch 00000018, intr 00000000. MMU Fault: ENGINE HOST7 HUBCLIENT_HOST faulted @ 0x2_00618000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
[2528874.626627] NVRM: Xid (PCI:0000:07:00): 31, pid=37062, Ch 0000001c, intr 00000000. MMU Fault: ENGINE HOST8 HUBCLIENT_HOST faulted @ 0x2_00624000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
[2528874.737871] NVRM: Xid (PCI:0000:07:00): 31, pid=37062, Ch 00000020, intr 00000000. MMU Fault: ENGINE HOST9 HUBCLIENT_HOST faulted @ 0x2_00630000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
[2528874.754681] NVRM: Xid (PCI:0000:07:00): 31, pid=37062, Ch 00000024, intr 00000000. MMU Fault: ENGINE HOST10 HUBCLIENT_HOST faulted @ 0x2_0063c000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
[2528874.776190] NVRM: Xid (PCI:0000:07:00): 31, pid=37062, Ch 00000028, intr 00000000. MMU Fault: ENGINE HOST11 HUBCLIENT_HOST faulted @ 0x2_00648000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
[2539039.293603] NVRM: client does not support versioning!!
[2539039.472790] NVRM: Xid (PCI:0000:07:00): 31, pid=987, Ch 00000010, intr 00000000. MMU Fault: ENGINE HOST0 HUBCLIENT_HOST faulted @ 0x2_00600000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
[2539039.494545] NVRM: Xid (PCI:0000:07:00): 31, pid=987, Ch 00000018, intr 00000000. MMU Fault: ENGINE HOST7 HUBCLIENT_HOST faulted @ 0x2_00618000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
[2539039.514732] NVRM: Xid (PCI:0000:07:00): 31, pid=987, Ch 0000001c, intr 00000000. MMU Fault: ENGINE HOST8 HUBCLIENT_HOST faulted @ 0x2_00624000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
[2539039.536338] NVRM: Xid (PCI:0000:07:00): 31, pid=987, Ch 00000020, intr 00000000. MMU Fault: ENGINE HOST9 HUBCLIENT_HOST faulted @ 0x2_00630000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
[2539039.556498] NVRM: Xid (PCI:0000:07:00): 31, pid=987, Ch 00000024, intr 00000000. MMU Fault: ENGINE HOST10 HUBCLIENT_HOST faulted @ 0x2_0063c000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
[2539039.572503] NVRM: Xid (PCI:0000:07:00): 31, pid=987, Ch 00000028, intr 00000000. MMU Fault: ENGINE HOST11 HUBCLIENT_HOST faulted @ 0x2_00648000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
[2539265.428411] NVRM: client does not support versioning!!
[2539265.605529] NVRM: GPU at PCI:0000:08:00: GPU-3bb1bb37-ce5c-dd11-9db2-9fa5e1f65dcd
[2539265.605540] NVRM: GPU Board Serial Number: 0323918135338
[2539265.605543] NVRM: Xid (PCI:0000:08:00): 31, pid=1041, Ch 00000010, intr 00000000. MMU Fault: ENGINE HOST0 HUBCLIENT_HOST faulted @ 0x2_00600000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
[2539265.711866] NVRM: Xid (PCI:0000:08:00): 31, pid=1041, Ch 00000018, intr 00000000. MMU Fault: ENGINE HOST7 HUBCLIENT_HOST faulted @ 0x2_00618000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
[2539265.734734] NVRM: Xid (PCI:0000:08:00): 31, pid=1041, Ch 0000001c, intr 00000000. MMU Fault: ENGINE HOST8 HUBCLIENT_HOST faulted @ 0x2_00624000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
[2539265.800018] NVRM: Xid (PCI:0000:08:00): 31, pid=1041, Ch 00000020, intr 00000000. MMU Fault: ENGINE HOST9 HUBCLIENT_HOST faulted @ 0x2_00630000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
[2539265.821588] NVRM: Xid (PCI:0000:08:00): 31, pid=1041, Ch 00000024, intr 00000000. MMU Fault: ENGINE HOST10 HUBCLIENT_HOST faulted @ 0x2_0063c000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
[2539265.842353] NVRM: Xid (PCI:0000:08:00): 31, pid=1041, Ch 00000028, intr 00000000. MMU Fault: ENGINE HOST11 HUBCLIENT_HOST faulted @ 0x2_00648000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
[2539340.689738] NVRM: client does not support versioning!!
[2539367.320491] NVRM: Xid (PCI:0000:08:00): 31, pid=1096, Ch 00000024, intr 00000000. MMU Fault: ENGINE HOST10 HUBCLIENT_HOST faulted @ 0x2_0063c000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
[2539367.338967] NVRM: Xid (PCI:0000:08:00): 31, pid=1096, Ch 00000018, intr 00000000. MMU Fault: ENGINE HOST7 HUBCLIENT_HOST faulted @ 0x2_00618000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
[2539367.358021] NVRM: Xid (PCI:0000:08:00): 31, pid=1096, Ch 00000010, intr 00000000. MMU Fault: ENGINE HOST0 HUBCLIENT_HOST faulted @ 0x2_00600000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
[2539367.377812] NVRM: Xid (PCI:0000:08:00): 31, pid=1096, Ch 0000001c, intr 00000000. MMU Fault: ENGINE HOST8 HUBCLIENT_HOST faulted @ 0x2_00624000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
[2539367.399685] NVRM: Xid (PCI:0000:08:00): 31, pid=1096, Ch 00000020, intr 00000000. MMU Fault: ENGINE HOST9 HUBCLIENT_HOST faulted @ 0x2_00630000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
[2539367.415499] NVRM: Xid (PCI:0000:08:00): 31, pid=1096, Ch 00000028, intr 00000000. MMU Fault: ENGINE HOST11 HUBCLIENT_HOST faulted @ 0x2_00648000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ

I’m not familiar with that specific error message, but generally any NVRM exceptions in your dmesg log are bad. So I’d guess that your setup is broken.
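
A long dmesg dump like the one above is easier to read when condensed. The sketch below counts the NVRM Xid lines per PCI address and Xid code; summarize_xid is a name made up for this example, and the pattern assumes the exact line format shown in the log above. As far as I know, NVIDIA's Xid documentation lists code 31 as a GPU memory page fault, but check the documentation for your driver version before drawing conclusions.

```shell
# Read kernel log text on stdin and print one line per (PCI address, Xid code)
# pair with the number of matching messages.
# summarize_xid is a hypothetical helper name for this sketch.
summarize_xid() {
  # Keep only "NVRM: Xid (PCI:...): <code>," lines, reduced to "<pci> <code>".
  sed -n 's/.*NVRM: Xid (\(PCI:[^)]*\)): \([0-9][0-9]*\),.*/\1 \2/p' |
    sort | uniq -c |
    awk '{ printf "%s xid=%s count=%s\n", $2, $3, $1 }'
}

# Typical use (may require elevated privileges on some systems):
# dmesg | summarize_xid
```

In the log above, every fault is Xid 31, first on PCI:0000:07:00 and later on PCI:0000:08:00, so more than one GPU is affected.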

Do you mean that the setup of CUDA.jl and Julia is broken, or that the setup of the GPU driver is broken?

The NVIDIA driver. Try rebooting, and if that doesn’t help, maybe testing some CUDA C applications to see if they work.
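
One quick sanity check before reinstalling is to compare the version of the loaded kernel module with the version the user-space tools report, since a mismatch between the two is a known cause of driver trouble after a partial update. This is only a sketch: the helper names are made up, and the parsing assumes that the first line of /proc/driver/nvidia/version has the form "NVRM version: NVIDIA UNIX x86_64 Kernel Module  <version>  <build date>", which may differ on other driver builds.

```shell
# Extract the driver version from text in the assumed format of
# /proc/driver/nvidia/version (read from stdin).
# Both function names are hypothetical helpers for this sketch.
kernel_module_version() {
  sed -n 's/^NVRM version:.*Kernel Module  *\([0-9][0-9.]*\).*/\1/p'
}

# Succeed only when both version strings are non-empty and identical.
versions_match() {
  [ -n "$1" ] && [ "$1" = "$2" ]
}

# Typical use on the GPU machine (left commented out because it needs the
# NVIDIA driver and nvidia-smi):
# kmod=$(kernel_module_version < /proc/driver/nvidia/version)
# user=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
# versions_match "$kmod" "$user" || echo "kernel module and user-space driver differ"
```

Matching versions do not prove the driver is healthy; they only rule out one common failure, so testing a small CUDA C program as suggested above is still worthwhile.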

I have reinstalled the NVIDIA driver, and the problem has been solved.
Thank you.