CUDA test failure

Hi,
I’m new to Julia and GPU computing. I installed CUDA.jl and ran ]test CUDA to check that it works. When I do this, I get the following errors:

[ Info: Testing using device 0 (NVIDIA GH200 120GB). To change this, specify the `--gpu` argument to the tests, or set the `CUDA_VISIBLE_DEVICES` environment variable.
[ Info: Running 47 tests in parallel. If this is too many, specify the `--jobs` argument to the tests, or set the `JULIA_CPU_THREADS` environment variable.
┌ Warning: Running tests on a GPU in exclusive mode; reducing parallelism to 1.
└ @ Main /cluster/projects/nn9874k/aklocker/juliaup/depot/packages/CUDA/x8d2s/test/runtests.jl:181
                                                  |          | ---------------- GPU ---------------- | ---------------- CPU ---------------- |
Test                                     (Worker) | Time (s) | GC (s) | GC % | Alloc (MB) | RSS (MB) | GC (s) | GC % | Alloc (MB) | RSS (MB) |
core/initialization                           (2) |     3.57 |   0.00 |  0.0 |       0.00 |   558.00 |   0.01 |  0.2 |      61.43 |  1471.06 |
gpuarrays/reductions/sum prod                 (3) |   107.85 |   0.03 |  0.0 |       3.24 |   630.00 |   3.34 |  3.1 |   11213.62 |  3911.06 |
gpuarrays/reductions/reduce                   (3) |    63.57 |   0.02 |  0.0 |       1.53 |   634.00 |   1.69 |  2.7 |    9181.83 |  4991.06 |
gpuarrays/reductions/mapreducedim!            (3) |    41.90 |   0.01 |  0.0 |       1.54 |   636.00 |   0.79 |  1.9 |    4307.07 |  5675.06 |
gpuarrays/broadcasting                        (3) |   102.40 |   0.02 |  0.0 |       2.00 |   642.00 |   1.69 |  1.7 |   10034.98 |  8051.06 |
gpuarrays/reductions/== isequal               (3) |    36.79 |   0.01 |  0.0 |       1.07 |   646.00 |   0.94 |  2.5 |    5580.35 |  8663.06 |
gpuarrays/base                                (3) |    16.83 |   0.00 |  0.0 |       8.90 |   646.00 |   0.60 |  3.6 |    2604.37 |  9059.06 |
gpuarrays/random                              (3) |     9.21 |   0.02 |  0.2 |     392.05 |   762.00 |   0.15 |  1.6 |    1508.81 |  9491.06 |
gpuarrays/vectors                             (3) |     0.20 |   0.00 |  0.2 |       0.00 |   648.00 |   0.00 |  0.0 |      18.07 |  9491.06 |
gpuarrays/ext/jld2                            (3) |     5.40 |   0.00 |  0.0 |       0.00 |   648.00 |   0.04 |  0.7 |     325.16 |  9599.06 |
gpuarrays/constructors                        (3) |    14.41 |   0.01 |  0.0 |       0.65 |   648.00 |   0.19 |  1.3 |    1166.54 |  9707.06 |
gpuarrays/reductions/mapreduce                (3) |    19.07 |   0.01 |  0.1 |       1.83 |   652.00 |   0.32 |  1.7 |    2205.61 |  9923.06 |
gpuarrays/statistics                          (3) |    37.26 |   0.01 |  0.0 |       1.51 |   718.00 |   0.64 |  1.7 |    3696.56 | 11039.06 |
gpuarrays/linalg/norm                         (3) |    82.14 |   0.02 |  0.0 |       0.02 |   722.00 |   1.14 |  1.4 |    7597.94 | 14243.06 |
gpuarrays/linalg/NaN_false                    (3) |     9.76 |   0.00 |  0.0 |       0.00 |   724.00 |   0.09 |  0.9 |     800.22 | 14711.06 |
gpuarrays/math/intrinsics                     (3) |     1.12 |   0.00 |  0.0 |       0.00 |   724.00 |   0.00 |  0.0 |      91.09 | 14711.06 |
gpuarrays/linalg/mul!/matrix-matrix           (3) |    55.27 |   0.02 |  0.0 |       0.13 |   726.00 |   0.94 |  1.7 |    5627.97 | 15431.06 |
gpuarrays/sparse                              (3) |     0.00 |   0.00 |  0.0 |       0.00 |   726.00 |   0.00 |  0.0 |       0.15 | 15431.06 |
gpuarrays/reductions/mapreducedim!_large      (3) |     5.94 |   0.02 |  0.3 |     818.38 |   766.00 |   0.10 |  1.7 |    1985.02 | 16264.88 |
      From worker 3:	JIT session error: Cannot allocate memory
      From worker 3:	JIT session error: Cannot allocate memory
      From worker 3:	
      From worker 3:	[2030669] signal (11.1): Segmentation fault
      From worker 3:	in expression starting at none:1
gpuarrays/uniformscaling                      (3) |         failed at 2025-12-05T10:18:27.336
Worker 3 terminated.
Unhandled Task ERROR: EOFError: read end of file
Stacktrace:
 [1] (::Base.var"#wait_locked#741")(s::Sockets.TCPSocket, buf::IOBuffer, nb::Int64)
   @ Base ./stream.jl:947
 [2] unsafe_read(s::Sockets.TCPSocket, p::Ptr{UInt8}, nb::UInt64)
   @ Base ./stream.jl:955
 [3] unsafe_read
   @ ./io.jl:773 [inlined]
 [4] unsafe_read(s::Sockets.TCPSocket, p::Base.RefValue{NTuple{4, Int64}}, n::Int64)
   @ Base ./io.jl:772
 [5] read!
   @ ./io.jl:774 [inlined]
 [6] deserialize_hdr_raw
   @ /cluster/projects/nn9874k/aklocker/juliaup/depot/juliaup/julia-1.10.10+0.aarch64.linux.gnu/share/julia/stdlib/v1.10/Distributed/src/messages.jl:167 [inlined]
 [7] message_handler_loop(r_stream::Sockets.TCPSocket, w_stream::Sockets.TCPSocket, incoming::Bool)
   @ Distributed /cluster/projects/nn9874k/aklocker/juliaup/depot/juliaup/julia-1.10.10+0.aarch64.linux.gnu/share/julia/stdlib/v1.10/Distributed/src/process_messages.jl:172
 [8] process_tcp_streams(r_stream::Sockets.TCPSocket, w_stream::Sockets.TCPSocket, incoming::Bool)
   @ Distributed /cluster/projects/nn9874k/aklocker/juliaup/depot/juliaup/julia-1.10.10+0.aarch64.linux.gnu/share/julia/stdlib/v1.10/Distributed/src/process_messages.jl:133
 [9] (::Distributed.var"#103#104"{Sockets.TCPSocket, Sockets.TCPSocket, Bool})()
   @ Distributed /cluster/projects/nn9874k/aklocker/juliaup/depot/juliaup/julia-1.10.10+0.aarch64.linux.gnu/share/julia/stdlib/v1.10/Distributed/src/process_messages.jl:121

Here’s the system info:

┌ Info: System information:
│ CUDA toolchain: 
│ - runtime 12.6, local installation
│ - driver 565.57.1 for 13.0
│ - compiler 12.9
│ 
│ CUDA libraries: 
│ - CUBLAS: 12.6.3
│ - CURAND: 10.3.7
│ - CUFFT: 11.3.0
│ - CUSOLVER: 11.7.1
│ - CUSPARSE: 12.5.4
│ - CUPTI: 2024.3.2 (API 12.6.0)
│ - NVML: 12.0.0+565.57.1
│ 
│ Julia packages: 
│ - CUDA: 5.9.5
│ - CUDA_Driver_jll: 13.0.2+0
│ - CUDA_Compiler_jll: 0.3.0+0
│ - CUDA_Runtime_jll: 0.19.2+0
│ - CUDA_Runtime_Discovery: 1.0.0
│ 
│ Toolchain:
│ - Julia: 1.10.10
│ - LLVM: 15.0.7
│ 
│ Environment:
│ - JULIA_CUDA_USE_BINARY_BUILDER: false
│ - JULIA_CUDA_MEMORY_POOL: none
│ 
│ Preferences:
│ - CUDA_Runtime_jll.version: 12.6
│ - CUDA_Runtime_jll.local: true

I tried installing CUDA both with and without a local toolkit, but both setups give the same error. Can anyone point me in the right direction as to what is going wrong here and how best to address it? Thanks in advance!

What is the output of versioninfo(verbose=true) in Julia?

julia> versioninfo(verbose=true)
Julia Version 1.10.10
Commit 95f30e51f41 (2025-06-27 09:51 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (aarch64-linux-gnu)
      "SUSE Linux Enterprise Server 15 SP6"
  uname: Linux 6.4.0-150600.23.25_15.0.9-cray_shasta_c_64k #1 SMP Mon Jan 13 18:26:04 UTC 2025 (7f98b6b) aarch64 aarch64
  CPU: unknown: 
                  speed         user         nice          sys         idle          irq
       #1-288  3960 MHz    7409551 s        424 s    2031153 s  1918107590 s          0 s
  Memory: 858.0562744140625 GB (784242.6875 MB free)
  Uptime: 669691.03 sec
  Load Avg:  1.04  1.04  1.83
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, generic)
Threads: 1 default, 0 interactive, 1 GC (on 288 virtual cores)
Environment:
  LD_LIBRARY_PATH = /opt/nvidia/hpc_sdk/Linux_aarch64/24.11/comm_libs/12.6/nccl/lib:/opt/nvidia/hpc_sdk/Linux_aarch64/24.11/comm_libs/12.6/nvshmem/lib:/opt/nvidia/hpc_sdk/Linux_aarch64/24.11/math_libs/12.6/lib64:/opt/nvidia/hpc_sdk/Linux_aarch64/24.11/cuda/12.6/extras/CUPTI/lib64:/opt/nvidia/hpc_sdk/Linux_aarch64/24.11/cuda/12.6/extras/Debugger/lib64:/opt/nvidia/hpc_sdk/Linux_aarch64/24.11/cuda/12.6/nvvm/lib64:/opt/nvidia/hpc_sdk/Linux_aarch64/24.11/cuda/12.6/lib64:/opt/cray/pe/papi/7.2.0.1/lib64:/opt/cray/libfabric/1.22.0/lib64
  JULIA_CUDA_MEMORY_POOL = none
  JULIA_DEPOT_PATH = /cluster/projects/nn9874k/aklocker/juliaup/depot
  JULIA_LOAD_PATH = :/cluster/projects/nn9874k/aklocker/juhpc_setup/julia_preferences
  JULIA_CUDA_USE_BINARY_BUILDER = false
  __LMOD_REF_COUNT_INCLUDE_PATH_AARCH64 = /opt/cray/pe/cce/19.0.0/cce-clang/aarch64/lib/clang/19/include:1;/opt/cray/pe/cce/19.0.0/cce/aarch64/include/craylibs:1
  __LMOD_REF_COUNT_PE_CRAYCLANG_FIXED_PKGCONFIG_PATH = /opt/cray/pe/mpich/8.1.32/ofi/crayclang/17.0/lib/pkgconfig:1
  __LMOD_REF_COUNT_PATH = /opt/nvidia/hpc_sdk/Linux_aarch64/24.11/cuda/12.6/bin:1;/opt/nvidia/hpc_sdk/Linux_aarch64/24.11/cuda/12.6/libnvvp:1;/opt/nvidia/hpc_sdk/Linux_aarch64/24.11/profilers/Nsight_Compute:1;/opt/nvidia/hpc_sdk/Linux_aarch64/24.11/profilers/Nsight_Systems/bin:1;/opt/cray/pe/mpich/8.1.32/ofi/crayclang/17.0/bin:1;/opt/cray/pe/mpich/8.1.32/bin:1;/opt/cray/pe/craype/2.7.34/bin:1;/opt/cray/pe/cce/19.0.0/binutils/aarch64/aarch64-unknown-linux-gnu/bin:1;/opt/cray/pe/cce/19.0.0/utils/aarch64/bin:1;/opt/cray/pe/cce/19.0.0/bin:1;/opt/cray/pe/perftools/25.03.0/bin:1;/opt/cray/pe/papi/7.2.0.1/bin:1;/opt/cray/libfabric/1.22.0/bin:1;/cluster/projects/nn9874k/aklocker/juhpc_setup/juliaup_wrapper:1;/cluster/projects/nn9874k/aklocker/juliaup/bin:1;/cluster/home/aklocker/.juliaup/bin:1;/opt/clmgr/sbin:1;/opt/clmgr/bin:1;/opt/sgi/sbin:1;/opt/sgi/bin:1;/usr/local/bin:1;/usr/bin:1;/bin:1;/opt/c3/bin:1;/usr/lib/mit/bin:1;/cluster/bin:1;/opt/cray/pe/bin:1
  CRAY_LD_LIBRARY_PATH = /opt/cray/pe/libsci/25.03.0/CRAYCLANG/17.0/aarch64/lib:/opt/cray/pe/mpich/8.1.32/ofi/crayclang/17.0/lib:/opt/cray/pe/mpich/8.1.32/gtl/lib:/opt/cray/pe/dsmml/0.3.1/dsmml/lib:/opt/cray/pe/cce/19.0.0/cce-clang/aarch64/lib:/opt/cray/pe/cce/19.0.0/cce/aarch64/lib:/opt/cray/pe/perftools/25.03.0/lib64
  CRAYPAT_LD_LIBRARY_PATH = /opt/cray/pe/perftools/25.03.0/lib64
  FPATH = /opt/cray/pe/lmod/lmod/init/ksh_funcs
  __LMOD_REF_COUNT_NLSPATH = /opt/cray/pe/cce/19.0.0/cce/aarch64/share/nls/En/%N.cat:1
  JAVA_HOME = /usr/lib64/jvm/java-11-openjdk-11
  __LMOD_REF_COUNT_LD_LIBRARY_PATH = /opt/nvidia/hpc_sdk/Linux_aarch64/24.11/comm_libs/12.6/nccl/lib:1;/opt/nvidia/hpc_sdk/Linux_aarch64/24.11/comm_libs/12.6/nvshmem/lib:1;/opt/nvidia/hpc_sdk/Linux_aarch64/24.11/math_libs/12.6/lib64:1;/opt/nvidia/hpc_sdk/Linux_aarch64/24.11/cuda/12.6/extras/CUPTI/lib64:1;/opt/nvidia/hpc_sdk/Linux_aarch64/24.11/cuda/12.6/extras/Debugger/lib64:1;/opt/nvidia/hpc_sdk/Linux_aarch64/24.11/cuda/12.6/nvvm/lib64:1;/opt/nvidia/hpc_sdk/Linux_aarch64/24.11/cuda/12.6/lib64:1;/opt/cray/pe/papi/7.2.0.1/lib64:1;/opt/cray/libfabric/1.22.0/lib64:1
  __LMOD_REF_COUNT_PKG_CONFIG_PATH = /usr/lib64/pkgconfig:1;/opt/cray/pe/dsmml/0.3.1/dsmml/lib/pkgconfig:1;/opt/cray/pe/craype/2.7.34/pkg-config:1;/opt/cray/libfabric/1.22.0/lib64/pkgconfig:1
  JUHPC_HDF5_HOME = 
  __LMOD_REF_COUNT_MODULEPATH = /opt/cray/pe/lmod/modulefiles/mpi/crayclang/17.0/ofi/1.0/cray-mpich/8.0:1;/opt/cray/pe/lmod/modulefiles/comnet/crayclang/17.0/ofi/1.0:1;/opt/cray/pe/lmod/modulefiles/compiler/crayclang/17.0:1;/opt/cray/pe/lmod/modulefiles/mix_compilers:1;/opt/cray/pe/lmod/modulefiles/perftools/25.03.0:1;/opt/cray/pe/lmod/modulefiles/net/ofi/1.0:1;/opt/cray/pe/lmod/modulefiles/cpu/arm-grace/1.0:1;/opt/cray/pe/modulefiles/Linux:1;/opt/cray/pe/lmod/modulefiles/craype-targets/default:1;/opt/cray/pe/lmod/modulefiles/core:1;/opt/cray/pe/lmod/lmod/modulefiles/Core:1;/opt/cray/pe/modulefiles/Core:1;/opt/cray/modulefiles:1;/cluster/software/modules/Core:1
  __LMOD_REF_COUNT_CRAY_LD_LIBRARY_PATH = /opt/cray/pe/libsci/25.03.0/CRAYCLANG/17.0/aarch64/lib:1;/opt/cray/pe/mpich/8.1.32/ofi/crayclang/17.0/lib:1;/opt/cray/pe/mpich/8.1.32/gtl/lib:1;/opt/cray/pe/dsmml/0.3.1/dsmml/lib:1;/opt/cray/pe/cce/19.0.0/cce-clang/aarch64/lib:1;/opt/cray/pe/cce/19.0.0/cce/aarch64/lib:1;/opt/cray/pe/perftools/25.03.0/lib64:1
  LLVM_SYMBOLIZER_PATH = /opt/cray/pe/cce/19.0.0/cce-clang/aarch64/bin/llvm-symbolizer
  HOME = /cluster/home/aklocker
  CUDA_HOME = /opt/nvidia/hpc_sdk/Linux_aarch64/24.11/cuda/12.6
  XNLSPATH = /usr/X11R6/lib/X11/nls
  CPATH = /opt/nvidia/hpc_sdk/Linux_aarch64/24.11/comm_libs/12.6/nvshmem/include:/opt/nvidia/hpc_sdk/Linux_aarch64/24.11/comm_libs/12.6/nccl/include:/opt/nvidia/hpc_sdk/Linux_aarch64/24.11/cuda/12.6/include:/opt/nvidia/hpc_sdk/Linux_aarch64/24.11/cuda/12.6/nvvm/include:/opt/nvidia/hpc_sdk/Linux_aarch64/24.11/cuda/12.6/extras/Debugger/include:/opt/nvidia/hpc_sdk/Linux_aarch64/24.11/cuda/12.6/extras/CUPTI/include:/opt/nvidia/hpc_sdk/Linux_aarch64/24.11/math_libs/12.6/include
  COMPILERRT_PATH_AARCH64 = /opt/cray/pe/cce/19.0.0/cce-clang/aarch64/lib/clang/19/lib/linux
  SDK_HOME = /usr/lib64/jvm/java-11-openjdk-11
  NVHPC_CUDA_HOME = /opt/nvidia/hpc_sdk/Linux_aarch64/24.11/cuda/12.6
  NLSPATH = /opt/cray/pe/cce/19.0.0/cce/aarch64/share/nls/En/%N.cat
  PE_LIBSCI_VOLATILE_PKGCONFIG_PATH = /opt/cray/pe/libsci/25.03.0/@PRGENV@/@PE_LIBSCI_GENCOMPS@/@PE_LIBSCI_TARGET@/lib/pkgconfig
  JDK_HOME = /usr/lib64/jvm/java-11-openjdk-11
  INCLUDE_PATH_AARCH64 = /opt/cray/pe/cce/19.0.0/cce-clang/aarch64/lib/clang/19/include:/opt/cray/pe/cce/19.0.0/cce/aarch64/include/craylibs
  __LMOD_REF_COUNT_MANPATH = /opt/nvidia/hpc_sdk/Linux_aarch64/24.11/cuda/12.6/doc/man:1;/opt/cray/pe/libsci/25.03.0/share/man:1;/opt/cray/pe/mpich/8.1.32/ofi/man:1;/opt/cray/pe/mpich/8.1.32/man/mpich:1;/opt/cray/pe/dsmml/0.3.1/dsmml/man:1;/opt/cray/pe/craype/2.7.34/man:1;/opt/cray/pe/cce/19.0.0/cce-clang/aarch64/share/man:1;/opt/cray/pe/cce/19.0.0/man:1;/opt/cray/pe/perftools/25.03.0/man:1;/opt/cray/pe/papi/7.2.0.1/share/pdoc/man:1;/opt/cray/libfabric/1.22.0/share/man:1;/opt/cray/pe/lmod/lmod/share/man:1;/usr/local/man:1;/usr/share/man:1;/usr/man:1;/opt/c3/man:1;/opt/clmgr/man:1;/opt/sgi/share/man:1;/opt/clmgr/share/man:1;/opt/clmgr/lib/cm-cli/man:1
  PE_CRAYCLANG_FIXED_PKGCONFIG_PATH = /opt/cray/pe/mpich/8.1.32/ofi/crayclang/17.0/lib/pkgconfig
  JULIAUP_DEPOT_PATH = /cluster/projects/nn9874k/aklocker/juliaup/depot
  TERM = xterm-256color
  __LMOD_REF_COUNT_CPATH = /opt/nvidia/hpc_sdk/Linux_aarch64/24.11/comm_libs/12.6/nvshmem/include:1;/opt/nvidia/hpc_sdk/Linux_aarch64/24.11/comm_libs/12.6/nccl/include:1;/opt/nvidia/hpc_sdk/Linux_aarch64/24.11/cuda/12.6/include:1;/opt/nvidia/hpc_sdk/Linux_aarch64/24.11/cuda/12.6/nvvm/include:1;/opt/nvidia/hpc_sdk/Linux_aarch64/24.11/cuda/12.6/extras/Debugger/include:1;/opt/nvidia/hpc_sdk/Linux_aarch64/24.11/cuda/12.6/extras/CUPTI/include:1;/opt/nvidia/hpc_sdk/Linux_aarch64/24.11/math_libs/12.6/include:1
  CUDATOOLKIT_HOME = /opt/nvidia/hpc_sdk/Linux_aarch64/24.11/cuda/12.6
  JUHPC_CUDA_HOME = /opt/nvidia/hpc_sdk/Linux_aarch64/24.11/cuda/12.6
  MANPATH = /opt/nvidia/hpc_sdk/Linux_aarch64/24.11/cuda/12.6/doc/man:/opt/cray/pe/libsci/25.03.0/share/man:/opt/cray/pe/mpich/8.1.32/ofi/man:/opt/cray/pe/mpich/8.1.32/man/mpich:/opt/cray/pe/dsmml/0.3.1/dsmml/man:/opt/cray/pe/craype/2.7.34/man:/opt/cray/pe/cce/19.0.0/cce-clang/aarch64/share/man:/opt/cray/pe/cce/19.0.0/man:/opt/cray/pe/perftools/25.03.0/man:/opt/cray/pe/papi/7.2.0.1/share/pdoc/man:/opt/cray/libfabric/1.22.0/share/man:/opt/cray/pe/lmod/lmod/share/man:/usr/local/man:/usr/share/man:/usr/man:/opt/c3/man:/opt/clmgr/man:/opt/sgi/share/man:/opt/clmgr/share/man:/opt/clmgr/lib/cm-cli/man
  OSCAR_HOME = /opt/oscar
  MODULEPATH = /opt/cray/pe/lmod/modulefiles/mpi/crayclang/17.0/ofi/1.0/cray-mpich/8.0:/opt/cray/pe/lmod/modulefiles/comnet/crayclang/17.0/ofi/1.0:/opt/cray/pe/lmod/modulefiles/compiler/crayclang/17.0:/opt/cray/pe/lmod/modulefiles/mix_compilers:/opt/cray/pe/lmod/modulefiles/perftools/25.03.0:/opt/cray/pe/lmod/modulefiles/net/ofi/1.0:/opt/cray/pe/lmod/modulefiles/cpu/arm-grace/1.0:/opt/cray/pe/modulefiles/Linux:/opt/cray/pe/lmod/modulefiles/craype-targets/default:/opt/cray/pe/lmod/modulefiles/core:/opt/cray/pe/lmod/lmod/modulefiles/Core:/opt/cray/pe/modulefiles/Core:/opt/cray/modulefiles:/cluster/software/modules/Core
  MODULEPATH_ROOT = /opt/cray/pe/modulefiles
  LMOD_PACKAGE_PATH = /cluster/software/config/lmod/SitePackage.lua
  JRE_HOME = /usr/lib64/jvm/java-11-openjdk-11
  PATH = /cluster/projects/nn9874k/aklocker/juhpc_setup/juliaup_wrapper:/cluster/projects/nn9874k/aklocker/juliaup/bin:/cluster/projects/nn9874k/aklocker/juhpc_setup/juliaup_wrapper:/cluster/projects/nn9874k/aklocker/juliaup/bin:/opt/nvidia/hpc_sdk/Linux_aarch64/24.11/cuda/12.6/bin:/opt/nvidia/hpc_sdk/Linux_aarch64/24.11/cuda/12.6/libnvvp:/opt/nvidia/hpc_sdk/Linux_aarch64/24.11/profilers/Nsight_Compute:/opt/nvidia/hpc_sdk/Linux_aarch64/24.11/profilers/Nsight_Systems/bin:/opt/cray/pe/mpich/8.1.32/ofi/crayclang/17.0/bin:/opt/cray/pe/mpich/8.1.32/bin:/opt/cray/pe/craype/2.7.34/bin:/opt/cray/pe/cce/19.0.0/binutils/aarch64/aarch64-unknown-linux-gnu/bin:/opt/cray/pe/cce/19.0.0/utils/aarch64/bin:/opt/cray/pe/cce/19.0.0/bin:/opt/cray/pe/perftools/25.03.0/bin:/opt/cray/pe/papi/7.2.0.1/bin:/opt/cray/libfabric/1.22.0/bin:/cluster/projects/nn9874k/aklocker/juhpc_setup/juliaup_wrapper:/cluster/projects/nn9874k/aklocker/juliaup/bin:/cluster/home/aklocker/.juliaup/bin:/opt/clmgr/sbin:/opt/clmgr/bin:/opt/sgi/sbin:/opt/sgi/bin:/usr/local/bin:/usr/bin:/bin:/opt/c3/bin:/usr/lib/mit/bin:/cluster/bin:/opt/cray/pe/bin
  MODULESHOME = /opt/cray/pe/lmod/lmod
  PKG_CONFIG_PATH = /usr/lib64/pkgconfig:/opt/cray/pe/dsmml/0.3.1/dsmml/lib/pkgconfig:/opt/cray/pe/craype/2.7.34/pkg-config:/opt/cray/libfabric/1.22.0/lib64/pkgconfig

Thanks! That looks like the issue described on the "ARM (Linux)" page of the Julia documentation.

So not an issue with CUDA.jl, but rather with Julia on that platform :confused:

We could restart the test worker when we detect this issue.
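To make "restarting a worker" a bit more concrete, here is a minimal sketch using the Distributed standard library. It is not how the CUDA.jl test runner is actually written; `run_with_restart` is a hypothetical helper, and the lost worker is simulated with `rmprocs` rather than a real segfault.

```julia
using Distributed

# Hypothetical helper (not CUDA.jl's real test runner): run `f` on worker `pid`;
# if that worker process is gone, start a replacement worker and retry once.
# Returns the result together with the id of the worker that produced it.
function run_with_restart(f, pid::Integer)
    try
        return (remotecall_fetch(f, pid), pid)
    catch err
        # Distributed raises ProcessExitedException when the target worker has exited.
        err isa ProcessExitedException || rethrow()
        newpid = only(addprocs(1))   # fresh process, so it starts with a clean state
        return (remotecall_fetch(f, newpid), newpid)
    end
end

pid = only(addprocs(1))
rmprocs(pid)                         # stand-in for a worker lost to a crash
result, newpid = run_with_restart(() -> 1 + 1, pid)
println("result = ", result, " from worker ", newpid)
rmprocs(newpid)
```

The retry happens only once, so a test that crashes every time would still surface as a failure instead of looping forever.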

That does indeed look like my problem! I guess I can ask our HPC people to increase the limit on memory mappings.
Since I’m new to Julia, I’m not sure what restarting a test worker would do.
My ultimate goal is to get CUDA-aware MPI working, so I thought I needed to fix this first.
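Before asking for a change, it may help to see how close a Julia process gets to the limit. The sketch below assumes the limit in question is the Linux `vm.max_map_count` setting (exposed at `/proc/sys/vm/max_map_count`) and that each line of `/proc/self/maps` corresponds to one mapping of the current process. `read_int` and `count_mappings` are illustrative names, not part of any package.

```julia
# Linux-only sketch: compare the per-process mapping limit with current usage.

# Parse the integer stored on the first line of a file such as a /proc entry.
read_int(path) = parse(Int, readline(path))

# Count the lines of a maps file; by default the one for this Julia process.
count_mappings(path = "/proc/self/maps") = countlines(path)

limit_file = "/proc/sys/vm/max_map_count"   # assumed location of the kernel limit
if Sys.islinux() && isfile(limit_file)
    println("vm.max_map_count = ", read_int(limit_file))
    println("mappings in use  = ", count_mappings())
else
    println("No ", limit_file, " on this system; nothing to check.")
end
```

Running this inside a long session, for example after loading CUDA and compiling a few kernels, gives a rough idea of how quickly the number of mappings grows relative to the limit.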

Increasing the limit on memory mappings fixed the problem I had before, but now the tests fail with:

(@v1.10) pkg> test CUDA
     Testing CUDA
      Status `/tmp/jl_hViKlC/Project.toml`
  [621f4979] AbstractFFTs v1.5.0
  [79e6a3ab] Adapt v4.4.0
  [ab4f0b2a] BFloat16s v0.6.0
  [052768ef] CUDA v5.9.5
  [d360d2e6] ChainRulesCore v1.26.0
  [864edb3b] DataStructures v0.19.3
  [7a1cc6ca] FFTW v1.10.0
  [0c68f7d7] GPUArrays v11.3.1
  [61eb1bfa] GPUCompiler v1.7.5
⌃ [a98d9a8b] Interpolations v0.15.1
  [033835bb] JLD2 v0.6.3
  [63c18a36] KernelAbstractions v0.9.39
  [5da4648a] NVTX v1.0.1
  [a0a7dd2c] SparseMatricesCSR v0.6.9
  [276daf66] SpecialFunctions v2.6.1
  [90137ffa] StaticArrays v1.9.15
  [4ee394cb] CUDA_Driver_jll v13.0.2+0
  [76a88914] CUDA_Runtime_jll v0.19.2+0
  [ade2ca70] Dates
  [8ba89e20] Distributed
  [b77e0a4c] InteractiveUtils
  [37e2e46d] LinearAlgebra
  [44cfe95a] Pkg v1.10.0
  [de0858da] Printf
  [3fa0cd96] REPL
  [9a3f8284] Random
  [2f01184e] SparseArrays v1.10.0
  [10745b16] Statistics v1.10.0
  [8dfed614] Test
      Status `/tmp/jl_hViKlC/Manifest.toml`
  [621f4979] AbstractFFTs v1.5.0
  [79e6a3ab] Adapt v4.4.0
  [a9b6321e] Atomix v1.1.2
  [13072b0f] AxisAlgorithms v1.1.0
  [ab4f0b2a] BFloat16s v0.6.0
  [fa961155] CEnum v0.5.0
  [052768ef] CUDA v5.9.5
  [1af6417a] CUDA_Runtime_Discovery v1.0.0
  [d360d2e6] ChainRulesCore v1.26.0
  [0b6fb165] ChunkCodecCore v1.0.0
  [4c0bbee4] ChunkCodecLibZlib v1.0.0
  [55437552] ChunkCodecLibZstd v1.0.0
  [3da002f7] ColorTypes v0.12.1
  [5ae59095] Colors v0.13.1
  [34da2185] Compat v4.18.1
  [a8cc5b0e] Crayons v4.1.1
  [9a962f9c] DataAPI v1.16.0
  [a93c6f00] DataFrames v1.8.1
  [864edb3b] DataStructures v0.19.3
  [e2d170a0] DataValueInterfaces v1.0.0
  [ffbed154] DocStringExtensions v0.9.5
  [e2ba6199] ExprTools v0.1.10
  [7a1cc6ca] FFTW v1.10.0
  [5789e2e9] FileIO v1.17.1
  [53c48c17] FixedPointNumbers v0.8.5
  [0c68f7d7] GPUArrays v11.3.1
  [46192b85] GPUArraysCore v0.2.0
  [61eb1bfa] GPUCompiler v1.7.5
  [096a3bc2] GPUToolbox v1.0.0
  [076d061b] HashArrayMappedTries v0.2.0
  [842dd82b] InlineStrings v1.4.5
⌃ [a98d9a8b] Interpolations v0.15.1
  [41ab1584] InvertedIndices v1.3.1
  [92d709cd] IrrationalConstants v0.2.6
  [82899510] IteratorInterfaceExtensions v1.0.0
  [033835bb] JLD2 v0.6.3
  [692b3bcd] JLLWrappers v1.7.1
  [63c18a36] KernelAbstractions v0.9.39
  [929cbde3] LLVM v9.4.4
  [8b046642] LLVMLoopInfo v1.0.0
  [b964fa9f] LaTeXStrings v1.4.0
  [2ab3a3ac] LogExpFunctions v0.3.29
  [1914dd2f] MacroTools v0.5.16
  [e1d29d7a] Missings v1.2.0
  [5da4648a] NVTX v1.0.1
  [6fe1bfb0] OffsetArrays v1.17.0
  [bac558e1] OrderedCollections v1.8.1
  [2dfb63ee] PooledArrays v1.4.3
⌅ [aea7be01] PrecompileTools v1.2.1
  [21216c6a] Preferences v1.5.0
  [08abe8d2] PrettyTables v3.1.2
  [74087812] Random123 v1.7.1
  [e6cf234a] RandomNumbers v1.6.0
  [c84ed2f1] Ratios v0.4.5
  [189a3867] Reexport v1.2.2
  [ae029012] Requires v1.3.1
  [7e506255] ScopedValues v1.5.0
  [6c6a2e73] Scratch v1.3.0
  [91c51154] SentinelArrays v1.4.8
  [a2af1166] SortingAlgorithms v1.2.2
  [a0a7dd2c] SparseMatricesCSR v0.6.9
  [276daf66] SpecialFunctions v2.6.1
  [90137ffa] StaticArrays v1.9.15
  [1e83bf80] StaticArraysCore v1.4.4
  [892a3eda] StringManipulation v0.4.2
  [3783bdb8] TableTraits v1.0.1
  [bd369af6] Tables v1.12.1
  [e689c965] Tracy v0.1.6
  [013be700] UnsafeAtomics v0.3.0
  [efce3f68] WoodburyMatrices v1.0.0
  [d1e2174e] CUDA_Compiler_jll v0.3.0+0
  [4ee394cb] CUDA_Driver_jll v13.0.2+0
  [76a88914] CUDA_Runtime_jll v0.19.2+0
  [f5851436] FFTW_jll v3.3.11+0
  [1d5cc7b8] IntelOpenMP_jll v2025.2.0+0
  [9c1d0b0a] JuliaNVTXCallbacks_jll v0.2.1+0
  [dad2f222] LLVMExtra_jll v0.0.38+0
  [ad6e5548] LibTracyClient_jll v0.9.1+6
  [856f044c] MKL_jll v2025.2.0+0
  [e98f9f5b] NVTX_jll v3.2.2+0
  [efe28fd5] OpenSpecFun_jll v0.5.6+0
  [3161d3a3] Zstd_jll v1.5.7+1
  [1e29f10c] demumble_jll v1.3.0+0
  [1317d2d5] oneTBB_jll v2022.0.0+1
  [0dad84c5] ArgTools v1.1.1
  [56f22d72] Artifacts
  [2a0f44e3] Base64
  [ade2ca70] Dates
  [8ba89e20] Distributed
  [f43a241f] Downloads v1.6.0
  [7b1f6079] FileWatching
  [9fa8497b] Future
  [b77e0a4c] InteractiveUtils
  [4af54fe1] LazyArtifacts
  [b27032c2] LibCURL v0.6.4
  [76f85450] LibGit2
  [8f399da3] Libdl
  [37e2e46d] LinearAlgebra
  [56ddb016] Logging
  [d6f4376e] Markdown
  [a63ad114] Mmap
  [ca575930] NetworkOptions v1.2.0
  [44cfe95a] Pkg v1.10.0
  [de0858da] Printf
  [3fa0cd96] REPL
  [9a3f8284] Random
  [ea8e919c] SHA v0.7.0
  [9e88b42a] Serialization
  [1a1011a3] SharedArrays
  [6462fe0b] Sockets
  [2f01184e] SparseArrays v1.10.0
  [10745b16] Statistics v1.10.0
  [4607b0f0] SuiteSparse
  [fa267f1f] TOML v1.0.3
  [a4e569a6] Tar v1.10.0
  [8dfed614] Test
  [cf7118a7] UUIDs
  [4ec0a83e] Unicode
  [e66e0078] CompilerSupportLibraries_jll v1.1.1+0
  [deac9b47] LibCURL_jll v8.4.0+0
  [e37daf67] LibGit2_jll v1.6.4+0
  [29816b5a] LibSSH2_jll v1.11.0+1
  [c8ffd9c3] MbedTLS_jll v2.28.2+1
  [14a3606d] MozillaCACerts_jll v2023.1.10
  [4536629a] OpenBLAS_jll v0.3.23+4
  [05823500] OpenLibm_jll v0.8.5+0
  [bea87d4a] SuiteSparse_jll v7.2.1+1
  [83775a58] Zlib_jll v1.2.13+1
  [8e850b90] libblastrampoline_jll v5.11.0+0
  [8e850ede] nghttp2_jll v1.52.0+1
  [3f19e933] p7zip_jll v17.4.0+2
        Info Packages marked with ⌃ and ⌅ have new versions available. Those with ⌃ may be upgradable, but those with ⌅ are restricted by compatibility constraints from upgrading.
     Testing Running tests...
┌ Info: System information:
│ CUDA toolchain: 
│ - runtime 12.6, local installation
│ - driver 565.57.1 for 13.0
│ - compiler 12.9
│ 
│ CUDA libraries: 
│ - CUBLAS: 12.6.3
│ - CURAND: 10.3.7
│ - CUFFT: 11.3.0
│ - CUSOLVER: 11.7.1
│ - CUSPARSE: 12.5.4
│ - CUPTI: 2024.3.2 (API 12.6.0)
│ - NVML: 12.0.0+565.57.1
│ 
│ Julia packages: 
│ - CUDA: 5.9.5
│ - CUDA_Driver_jll: 13.0.2+0
│ - CUDA_Compiler_jll: 0.3.0+0
│ - CUDA_Runtime_jll: 0.19.2+0
│ - CUDA_Runtime_Discovery: 1.0.0
│ 
│ Toolchain:
│ - Julia: 1.10.10
│ - LLVM: 15.0.7
│ 
│ Environment:
│ - JULIA_CUDA_MEMORY_POOL: none
│ - JULIA_CUDA_USE_BINARYBUILDER: false
│ 
│ Preferences:
│ - CUDA_Runtime_jll.version: 12.6
│ - CUDA_Runtime_jll.local: true
│ 
│ 1 device:
└   0: NVIDIA GH200 120GB (sm_90, 94.997 GiB / 95.577 GiB available)
[ Info: Testing using device 0 (NVIDIA GH200 120GB). To change this, specify the `--gpu` argument to the tests, or set the `CUDA_VISIBLE_DEVICES` environment variable.
[ Info: Running 47 tests in parallel. If this is too many, specify the `--jobs` argument to the tests, or set the `JULIA_CPU_THREADS` environment variable.
┌ Warning: Running tests on a GPU in exclusive mode; reducing parallelism to 1.
└ @ Main /cluster/projects/nn9874k/aklocker/juliaup/depot/packages/CUDA/x8d2s/test/runtests.jl:181
                                                  |          | ---------------- GPU ---------------- | ---------------- CPU ---------------- |
Test                                     (Worker) | Time (s) | GC (s) | GC % | Alloc (MB) | RSS (MB) | GC (s) | GC % | Alloc (MB) | RSS (MB) |
core/initialization                           (2) |     3.56 |   0.00 |  0.0 |       0.00 |   558.00 |   0.01 |  0.2 |      61.43 |  1121.62 |
gpuarrays/reductions/sum prod                 (3) |   108.53 |   0.03 |  0.0 |       3.24 |   630.00 |   3.32 |  3.1 |   11212.63 |  3911.00 |
gpuarrays/reductions/reduce                   (3) |    63.61 |   0.02 |  0.0 |       1.53 |   634.00 |   1.69 |  2.7 |    9181.62 |  4919.00 |
gpuarrays/reductions/mapreducedim!            (3) |    42.20 |   0.01 |  0.0 |       1.54 |   636.00 |   0.80 |  1.9 |    4306.77 |  5675.00 |
gpuarrays/broadcasting                        (3) |   103.48 |   0.02 |  0.0 |       2.00 |   642.00 |   1.91 |  1.9 |   10033.50 |  8015.00 |
gpuarrays/reductions/== isequal               (3) |    37.30 |   0.01 |  0.0 |       1.07 |   646.00 |   0.96 |  2.6 |    5579.53 |  8627.00 |
gpuarrays/base                                (3) |    16.83 |   0.00 |  0.0 |       8.90 |   646.00 |   0.60 |  3.6 |    2604.10 |  9059.00 |
gpuarrays/random                              (3) |     9.34 |   0.02 |  0.2 |     392.05 |   766.00 |   0.14 |  1.5 |    1508.73 |  9599.00 |
gpuarrays/vectors                             (3) |     0.20 |   0.00 |  0.2 |       0.00 |   648.00 |   0.00 |  0.0 |      18.08 |  9599.00 |
gpuarrays/ext/jld2                            (3) |     5.54 |   0.00 |  0.0 |       0.00 |   648.00 |   0.04 |  0.7 |     325.20 |  9707.00 |
gpuarrays/constructors                        (3) |    14.37 |   0.01 |  0.0 |       0.65 |   648.00 |   0.18 |  1.3 |    1166.46 |  9851.00 |
gpuarrays/reductions/mapreduce                (3) |    19.01 |   0.01 |  0.1 |       1.83 |   652.00 |   0.33 |  1.7 |    2205.48 |  9995.00 |
gpuarrays/statistics                          (3) |    37.38 |   0.01 |  0.0 |       1.51 |   718.00 |   0.65 |  1.7 |    3696.44 | 11039.00 |
gpuarrays/linalg/norm                         (3) |    82.73 |   0.02 |  0.0 |       0.02 |   722.00 |   1.29 |  1.6 |    7597.49 | 14099.00 |
gpuarrays/linalg/NaN_false                    (3) |     9.72 |   0.00 |  0.0 |       0.00 |   724.00 |   0.05 |  0.5 |     800.11 | 14675.00 |
gpuarrays/math/intrinsics                     (3) |     1.14 |   0.00 |  0.0 |       0.00 |   724.00 |   0.00 |  0.0 |      91.07 | 14675.00 |
gpuarrays/linalg/mul!/matrix-matrix           (3) |    55.64 |   0.02 |  0.0 |       0.13 |   726.00 |   0.79 |  1.4 |    5628.03 | 15467.00 |
gpuarrays/sparse                              (3) |     0.00 |   0.00 |  0.0 |       0.00 |   726.00 |   0.00 |  0.0 |       0.15 | 15467.00 |
gpuarrays/reductions/mapreducedim!_large      (3) |     5.90 |   0.02 |  0.3 |     818.38 |   766.00 |   0.10 |  1.6 |    1984.85 | 16301.81 |
gpuarrays/uniformscaling                      (3) |     4.14 |   0.00 |  0.0 |       0.01 |   726.00 |   0.00 |  0.0 |     275.87 | 16301.81 |
gpuarrays/reductions/minimum maximum extrema  (3) |   106.89 |   0.02 |  0.0 |       2.19 |   732.00 |   1.89 |  1.8 |   10842.66 | 18428.56 |
gpuarrays/reductions/any all count            (3) |     5.19 |   0.00 |  0.0 |       0.00 |   734.00 |   0.06 |  1.1 |     571.55 | 18500.56 |
gpuarrays/indexing multidimensional           (3) |    29.06 |   0.00 |  0.0 |       2.07 |   822.00 |   0.45 |  1.6 |    2615.50 | 19040.56 |
gpuarrays/indexing find                       (3) |    13.49 |   0.00 |  0.0 |       0.13 |   822.00 |   0.36 |  2.6 |    1651.20 | 19364.56 |
gpuarrays/linalg/mul!/vector-matrix           (3) |    33.45 |   0.01 |  0.0 |       0.02 |   822.00 |   0.62 |  1.8 |    3597.05 | 19832.56 |
gpuarrays/math/power                          (3) |     8.32 |   0.00 |  0.0 |       0.01 |   822.00 |   0.22 |  2.7 |    1355.35 | 19868.56 |
gpuarrays/linalg/core                         (3) |   104.86 |   0.27 |  0.3 |    5409.05 |   970.00 |   2.46 |  2.3 |   34659.54 | 26517.25 |
gpuarrays/reductions/reducedim!               (3) |     0.48 |   0.00 |  0.5 |       1.03 |   832.00 |   0.00 |  0.0 |      21.95 | 26517.25 |
gpuarrays/indexing scalar                     (3) |     5.54 |   0.00 |  0.0 |       0.01 |   832.00 |   0.07 |  1.3 |     522.67 | 26517.25 |
gpuarrays/alloc cache                         (3) |     0.83 |   0.00 |  0.0 |       0.00 |   832.00 |   0.00 |  0.0 |     111.58 | 26517.25 |
libraries/cusparse                            (3) |    70.95 |   0.12 |  0.2 |      23.36 |   844.00 |   1.04 |  1.5 |    5825.04 | 26517.25 |
libraries/cusolver/dense                      (3) |   121.86 |   0.23 |  0.2 |     280.34 |  1220.00 |   2.04 |  1.7 |   11838.60 | 26517.25 |
base/array                                    (3) |    35.09 |   0.02 |  0.1 |    1316.20 |  2282.00 |   0.66 |  1.9 |    4712.71 | 27821.00 |
      From worker 3:	WARNING: Method definition var"#10662#kernel"(Any) in module Main at /cluster/projects/nn9874k/aklocker/juliaup/depot/packages/CUDA/x8d2s/test/core/execution.jl:358 overwritten at /cluster/projects/nn9874k/aklocker/juliaup/depot/packages/CUDA/x8d2s/test/core/execution.jl:366.
core/execution                                (3) |    19.61 |   0.00 |  0.0 |       0.02 |  1162.00 |   0.25 |  1.3 |    1839.64 | 27821.00 |
libraries/cublas/extensions                   (3) |    21.95 |   0.06 |  0.3 |      36.69 |  1166.00 |   0.39 |  1.8 |    2112.08 | 27821.00 |
core/cudadrv                                  (3) |         failed at 2025-12-09T09:29:03.454
libraries/cublas/level2                       (4) |    38.29 |   0.01 |  0.0 |       1.35 |   774.00 |   1.34 |  3.5 |    4123.64 |  2130.88 |
libraries/cublas/level3/gemm                  (4) |    56.03 |   0.03 |  0.0 |       8.95 |   784.00 |   1.58 |  2.8 |    6013.99 |  3030.88 |
libraries/cublas/level3                       (4) |    33.65 |   0.02 |  0.1 |       5.74 |   784.00 |   0.53 |  1.6 |    2495.24 |  3642.88 |
libraries/cusparse/generic                    (4) |    40.23 |   0.12 |  0.3 |      14.22 |   788.00 |   0.69 |  1.7 |    3458.64 |  4470.88 |
libraries/cublas/xt                           (4) |     7.39 |   0.00 |  0.0 |       0.98 |   856.00 |   0.07 |  0.9 |     468.19 |  4650.88 |
base/sorting                                  (4) |         failed at 2025-12-09T09:32:47.805
core/device/intrinsics/wmma                   (5) |    42.65 |   0.01 |  0.0 |       0.63 |   626.00 |   1.07 |  2.5 |    4834.52 |  2435.00 |
libraries/cusparse/interfaces                 (5) |    99.18 |   0.31 |  0.3 |      47.12 |   636.00 |   2.66 |  2.7 |    9399.03 |  4297.25 |
libraries/cufft                               (5) |    52.88 |   0.03 |  0.1 |     198.03 |   884.00 |   1.17 |  2.2 |    5886.47 |  5161.25 |
core/device/intrinsics/atomics                (5) |    11.62 |   0.00 |  0.0 |       0.00 |   838.00 |   0.16 |  1.4 |     971.84 |  5652.12 |
libraries/cusparse/conversions                (5) |     7.72 |   0.02 |  0.2 |       1.73 |   838.00 |   0.22 |  2.8 |     878.62 |  5760.12 |
libraries/cusolver/dense_generic              (5) |    39.17 |   0.02 |  0.1 |      15.11 |  1228.00 |   0.98 |  2.5 |    4277.75 |  6840.12 |
base/texture                                  (5) |    22.44 |   0.00 |  0.0 |       0.10 |  1224.00 |   0.63 |  2.8 |    3079.93 |  7236.12 |
core/device/intrinsics/cooperative_groups     (5) |    27.14 |   0.01 |  0.0 |      20.50 |  1222.00 |   0.26 |  1.0 |    1799.05 |  9108.12 |
core/device/intrinsics                        (5) |         failed at 2025-12-09T09:38:13.958
libraries/cublas/level1                       (6) |    31.71 |   0.01 |  0.0 |       0.03 |   688.00 |   1.33 |  4.2 |    3672.91 |  1770.69 |
libraries/cusparse/bmm                        (6) |    25.32 |   0.02 |  0.1 |       0.99 |   778.00 |   1.29 |  5.1 |    3873.55 |  2454.69 |
core/device/array                             (6) |     3.48 |   0.00 |  0.0 |       0.00 |   778.00 |   0.08 |  2.2 |     391.56 |  2490.69 |
base/random                                   (6) |    20.09 |   0.01 |  0.0 |    4352.59 |   778.00 |   0.39 |  1.9 |    1838.40 |  3282.69 |
libraries/cusolver/sparse                     (6) |    14.40 |   0.00 |  0.0 |       0.22 |   844.00 |   0.33 |  2.3 |    1354.57 |  3390.69 |
core/device/intrinsics/memory                 (6) |     4.58 |   0.00 |  0.0 |       0.02 |   844.00 |   0.06 |  1.4 |     385.94 |  3534.69 |
core/codegen                                  (6) |     2.19 |   0.00 |  0.0 |       0.00 |   844.00 |   0.03 |  1.4 |     156.67 |  3678.69 |
core/device/intrinsics/math                   (6) |    22.61 |   0.00 |  0.0 |       0.00 |   846.00 |   0.38 |  1.7 |    1921.37 |  4866.69 |
core/device/intrinsics/output                 (6) |     6.29 |   0.00 |  0.0 |       0.00 |   846.00 |   0.14 |  2.3 |     742.78 |  5010.69 |
core/device/random                            (6) |    23.24 |   0.01 |  0.0 |       0.37 |   850.00 |   0.30 |  1.3 |    1674.00 |  5946.69 |
libraries/cusparse/device                     (6) |     1.43 |   0.00 |  0.0 |       0.01 |   850.00 |   0.04 |  2.6 |     192.60 |  5982.69 |
libraries/cusolver/multigpu                   (6) |    10.35 |   0.03 |  0.3 |     545.60 |  1512.00 |   0.14 |  1.3 |     840.54 |  6522.69 |
core/device/ldg                               (6) |     4.50 |   0.00 |  0.0 |       0.00 |   858.00 |   0.12 |  2.6 |     548.39 |  6558.19 |
libraries/cusparse/broadcast                  (6) |    46.83 |   0.01 |  0.0 |       0.13 |   860.00 |   1.08 |  2.3 |    5251.58 |  7530.19 |
libraries/cusolver/base                       (6) |     0.10 |   0.00 |  0.0 |       0.00 |   860.00 |   0.00 |  0.0 |       1.86 |  7530.19 |
core/pointer                                  (6) |     0.25 |   0.00 |  0.0 |       0.00 |   860.00 |   0.00 |  0.0 |       7.63 |  7530.19 |
base/broadcast                                (6) |     9.84 |   0.00 |  0.0 |       0.00 |   862.00 |   0.16 |  1.6 |     945.92 |  7890.19 |
core/nvml                                     (6) |     0.57 |   0.00 |  0.0 |       0.00 |   862.00 |   0.00 |  0.0 |      53.32 |  7890.19 |
libraries/cusparse/linalg                     (6) |    44.54 |   0.10 |  0.2 |       6.78 |   864.00 |   1.22 |  2.7 |    5339.69 |  8934.19 |
base/exceptions                               (6) |         failed at 2025-12-09T09:49:21.644
libraries/cusolver/sparse_factorizations      (7) |    22.63 |   0.01 |  0.0 |      18.32 |   772.00 |   1.43 |  6.3 |    3396.66 |  2057.56 |
core/profile                                  (7) |   276.86 |   0.00 |  0.0 |       0.00 |   766.00 |  10.35 |  3.7 |   81968.94 |  2993.56 |
base/iterator                                 (7) |     2.64 |   0.00 |  0.0 |       1.93 |   766.00 |   0.08 |  2.9 |     392.02 |  2993.56 |
base/threading                                (7) |     3.05 |   0.01 |  0.2 |      10.94 |   832.00 |   0.10 |  3.3 |     357.20 |  2993.56 |
core/utils                                    (7) |     0.61 |   0.00 |  0.0 |       0.00 |   830.00 |   0.01 |  1.7 |      70.97 |  2993.56 |
core/pool                                     (7) |     2.31 |   0.00 |  0.0 |       0.00 |   638.00 |   0.67 | 28.9 |     244.97 |  2993.56 |
libraries/cusparse/sparse_matrices_csr        (7) |     3.58 |   0.00 |  0.1 |       1.48 |   638.00 |   0.10 |  2.9 |     370.75 |  2993.56 |
base/linalg                                   (7) |    42.14 |   0.02 |  0.1 |    1554.64 |   704.00 |   2.68 |  6.4 |   15829.82 |  5038.81 |
libraries/cusparse/reduce                     (7) |    15.85 |   0.11 |  0.7 |       0.06 |   704.00 |   0.38 |  2.4 |    1874.93 |  5038.81 |
libraries/staticarrays                        (7) |     1.02 |   0.00 |  0.0 |       0.00 |   704.00 |   0.03 |  3.4 |     193.18 |  5038.81 |
base/kernelabstractions                       (7) |    30.98 |   0.01 |  0.0 |      71.03 |   820.00 |   1.20 |  3.9 |    3575.04 |  5038.81 |
base/examples                                 (7) |     5.61 |   0.00 |  0.0 |     385.30 |  1204.00 |   0.67 | 12.0 |    1313.16 |  5195.44 |
libraries/curand                              (7) |     0.05 |   0.00 |  0.0 |       0.00 |   820.00 |   0.00 |  0.0 |       1.77 |  5195.44 |
Testing finished in 48 minutes, 39 seconds, 469 milliseconds
Worker 3 failed running test core/cudadrv:
Some tests did not pass: 2065 passed, 0 failed, 1 errored, 3 broken.
core/cudadrv: Error During Test at /cluster/projects/nn9874k/aklocker/juliaup/depot/packages/CUDA/x8d2s/test/core/cudadrv.jl:132
  Got exception outside of a @test
  CUDA error: limit is not supported on this architecture (code 215, ERROR_UNSUPPORTED_LIMIT)
  Stacktrace:
    [1] throw_api_error(res::CUDA.cudaError_enum)
      @ CUDA /cluster/projects/nn9874k/aklocker/juliaup/depot/packages/CUDA/x8d2s/lib/cudadrv/libcuda.jl:30
    [2] check
      @ /cluster/projects/nn9874k/aklocker/juliaup/depot/packages/CUDA/x8d2s/lib/cudadrv/libcuda.jl:37 [inlined]
    [3] cuCtxGetLimit
      @ /cluster/projects/nn9874k/aklocker/juliaup/depot/packages/GPUToolbox/JLBB1/src/ccalls.jl:33 [inlined]
    [4] limit(lim::CUDA.CUlimit_enum)
      @ CUDA /cluster/projects/nn9874k/aklocker/juliaup/depot/packages/CUDA/x8d2s/lib/cudadrv/context.jl:351
    [5] macro expansion
      @ /cluster/projects/nn9874k/aklocker/juliaup/depot/packages/CUDA/x8d2s/test/core/cudadrv.jl:134 [inlined]
    [6] macro expansion
      @ /cluster/projects/nn9874k/aklocker/juliaup/depot/juliaup/julia-1.10.10+0.aarch64.linux.gnu/share/julia/stdlib/v1.10/Test/src/Test.jl:1577 [inlined]
    [7] top-level scope
      @ /cluster/projects/nn9874k/aklocker/juliaup/depot/packages/CUDA/x8d2s/test/core/cudadrv.jl:134
    [8] include
      @ ./client.jl:494 [inlined]
    [9] #12
      @ /cluster/projects/nn9874k/aklocker/juliaup/depot/packages/CUDA/x8d2s/test/runtests.jl:89 [inlined]
   [10] macro expansion
      @ /cluster/projects/nn9874k/aklocker/juliaup/depot/packages/CUDA/x8d2s/test/setup.jl:70 [inlined]
   [11] macro expansion
      @ /cluster/projects/nn9874k/aklocker/juliaup/depot/juliaup/julia-1.10.10+0.aarch64.linux.gnu/share/julia/stdlib/v1.10/Test/src/Test.jl:1577 [inlined]
   [12] macro expansion
      @ /cluster/projects/nn9874k/aklocker/juliaup/depot/packages/CUDA/x8d2s/test/setup.jl:70 [inlined]
   [13] macro expansion
      @ ./timing.jl:503 [inlined]
   [14] top-level scope
      @ /cluster/projects/nn9874k/aklocker/juliaup/depot/packages/CUDA/x8d2s/test/setup.jl:69
   [15] eval
      @ ./boot.jl:385 [inlined]
   [16] (::var"#inner#3"{Serialization.__deserialized_types__.var"#12#17"{String}, String, Symbol})()
      @ Main /cluster/projects/nn9874k/aklocker/juliaup/depot/packages/CUDA/x8d2s/test/setup.jl:77
   [17] runtests(f::Function, name::String, time_source::Symbol)
      @ Main /cluster/projects/nn9874k/aklocker/juliaup/depot/packages/CUDA/x8d2s/test/setup.jl:135
   [18] invokelatest(::Any, ::Any, ::Vararg{Any}; kwargs::@Kwargs{})
      @ Base ./essentials.jl:892
   [19] invokelatest(::Any, ::Any, ::Vararg{Any})
      @ Base ./essentials.jl:889
   [20] (::Distributed.var"#110#112"{Distributed.CallMsg{:call_fetch}})()
      @ Distributed /cluster/projects/nn9874k/aklocker/juliaup/depot/juliaup/julia-1.10.10+0.aarch64.linux.gnu/share/julia/stdlib/v1.10/Distributed/src/process_messages.jl:287
   [21] run_work_thunk(thunk::Distributed.var"#110#112"{Distributed.CallMsg{:call_fetch}}, print_error::Bool)
      @ Distributed /cluster/projects/nn9874k/aklocker/juliaup/depot/juliaup/julia-1.10.10+0.aarch64.linux.gnu/share/julia/stdlib/v1.10/Distributed/src/process_messages.jl:70
   [22] (::Distributed.var"#109#111"{Distributed.CallMsg{:call_fetch}, Distributed.MsgHeader, Sockets.TCPSocket})()
      @ Distributed /cluster/projects/nn9874k/aklocker/juliaup/depot/juliaup/julia-1.10.10+0.aarch64.linux.gnu/share/julia/stdlib/v1.10/Distributed/src/process_messages.jl:287
Worker 4 failed running test base/sorting:
Some tests did not pass: 143 passed, 0 failed, 21 errored, 0 broken.
base/sorting: Error During Test at /cluster/projects/nn9874k/aklocker/juliaup/depot/packages/CUDA/x8d2s/test/base/sorting.jl:256
  Got exception outside of a @test
  CUDA error: limit is not supported on this architecture (code 215, ERROR_UNSUPPORTED_LIMIT)
  Stacktrace:
    [1] throw_api_error(res::CUDA.cudaError_enum)
      @ CUDA /cluster/projects/nn9874k/aklocker/juliaup/depot/packages/CUDA/x8d2s/lib/cudadrv/libcuda.jl:30
    [2] check
      @ /cluster/projects/nn9874k/aklocker/juliaup/depot/packages/CUDA/x8d2s/lib/cudadrv/libcuda.jl:37 [inlined]
    [3] cuCtxGetLimit
      @ /cluster/projects/nn9874k/aklocker/juliaup/depot/packages/GPUToolbox/JLBB1/src/ccalls.jl:33 [inlined]
    [4] limit(lim::CUDA.CUlimit_enum)
      @ CUDA /cluster/projects/nn9874k/aklocker/juliaup/depot/packages/CUDA/x8d2s/lib/cudadrv/context.jl:351
    [5] quicksort!(c::CuArray{UInt8, 1, CUDA.DeviceMemory}; lt::typeof(isless), by::typeof(identity), dims::Int64, partial_k::Nothing, block_size_shift::Int64)
      @ CUDA.QuickSortImpl /cluster/projects/nn9874k/aklocker/juliaup/depot/packages/CUDA/x8d2s/src/sorting.jl:477
    [6] quicksort!
      @ /cluster/projects/nn9874k/aklocker/juliaup/depot/packages/CUDA/x8d2s/src/sorting.jl:473 [inlined]
    [7] (::var"#check#129"{var"#init#127"})(block_size_shift::Int64)
      @ Main /cluster/projects/nn9874k/aklocker/juliaup/depot/packages/CUDA/x8d2s/test/base/sorting.jl:266
    [8] macro expansion
      @ /cluster/projects/nn9874k/aklocker/juliaup/depot/packages/CUDA/x8d2s/test/base/sorting.jl:273 [inlined]
    [9] macro expansion
      @ /cluster/projects/nn9874k/aklocker/juliaup/depot/juliaup/julia-1.10.10+0.aarch64.linux.gnu/share/julia/stdlib/v1.10/Test/src/Test.jl:1577 [inlined]
   [10] top-level scope
      @ /cluster/projects/nn9874k/aklocker/juliaup/depot/packages/CUDA/x8d2s/test/base/sorting.jl:257
   [11] include
      @ ./client.jl:494 [inlined]
   [12] #12
      @ /cluster/projects/nn9874k/aklocker/juliaup/depot/packages/CUDA/x8d2s/test/runtests.jl:89 [inlined]
   [13] macro expansion
      @ /cluster/projects/nn9874k/aklocker/juliaup/depot/packages/CUDA/x8d2s/test/setup.jl:66 [inlined]
   [14] macro expansion
      @ /cluster/projects/nn9874k/aklocker/juliaup/depot/juliaup/julia-1.10.10+0.aarch64.linux.gnu/share/julia/stdlib/v1.10/Test/src/Test.jl:1577 [inlined]
   [15] macro expansion
      @ /cluster/projects/nn9874k/aklocker/juliaup/depot/packages/CUDA/x8d2s/test/setup.jl:66 [inlined]
   [16] macro expansion
      @ /cluster/projects/nn9874k/aklocker/juliaup/depot/packages/CUDA/x8d2s/src/utilities.jl:35 [inlined]
   [17] macro expansion
      @ /cluster/projects/nn9874k/aklocker/juliaup/depot/packages/CUDA/x8d2s/src/memory.jl:835 [inlined]
   [18] top-level scope
      @ /cluster/projects/nn9874k/aklocker/juliaup/depot/packages/CUDA/x8d2s/test/setup.jl:65
   [19] eval
      @ ./boot.jl:385 [inlined]
   [20] (::var"#inner#3"{Serialization.__deserialized_types__.var"#12#17"{String}, String, Symbol})()
      @ Main /cluster/projects/nn9874k/aklocker/juliaup/depot/packages/CUDA/x8d2s/test/setup.jl:77
   [21] runtests(f::Function, name::String, time_source::Symbol)
      @ Main /cluster/projects/nn9874k/aklocker/juliaup/depot/packages/CUDA/x8d2s/test/setup.jl:135
   [22] invokelatest(::Any, ::Any, ::Vararg{Any}; kwargs::@Kwargs{})
      @ Base ./essentials.jl:892
   [23] invokelatest(::Any, ::Any, ::Vararg{Any})
      @ Base ./essentials.jl:889
   [24] (::Distributed.var"#110#112"{Distributed.CallMsg{:call_fetch}})()
      @ Distributed /cluster/projects/nn9874k/aklocker/juliaup/depot/juliaup/julia-1.10.10+0.aarch64.linux.gnu/share/julia/stdlib/v1.10/Distributed/src/process_messages.jl:287
   [25] run_work_thunk(thunk::Distributed.var"#110#112"{Distributed.CallMsg{:call_fetch}}, print_error::Bool)
      @ Distributed /cluster/projects/nn9874k/aklocker/juliaup/depot/juliaup/julia-1.10.10+0.aarch64.linux.gnu/share/julia/stdlib/v1.10/Distributed/src/process_messages.jl:70
   [26] (::Distributed.var"#109#111"{Distributed.CallMsg{:call_fetch}, Distributed.MsgHeader, Sockets.TCPSocket})()
      @ Distributed /cluster/projects/nn9874k/aklocker/juliaup/depot/juliaup/julia-1.10.10+0.aarch64.linux.gnu/share/julia/stdlib/v1.10/Distributed/src/process_messages.jl:287
base/sorting: Error During Test at /cluster/projects/nn9874k/aklocker/juliaup/depot/packages/CUDA/x8d2s/test/base/sorting.jl:283
  Test threw exception
  Expression: check_sort!(Int, 1000000; alg = CUDA.QuickSort)
  CUDA error: limit is not supported on this architecture (code 215, ERROR_UNSUPPORTED_LIMIT)
  Stacktrace:
    [1] throw_api_error(res::CUDA.cudaError_enum)
      @ CUDA /cluster/projects/nn9874k/aklocker/juliaup/depot/packages/CUDA/x8d2s/lib/cudadrv/libcuda.jl:30
    [2] check
      @ /cluster/projects/nn9874k/aklocker/juliaup/depot/packages/CUDA/x8d2s/lib/cudadrv/libcuda.jl:37 [inlined]
    [3] cuCtxGetLimit
      @ /cluster/projects/nn9874k/aklocker/juliaup/depot/packages/GPUToolbox/JLBB1/src/ccalls.jl:33 [inlined]
    [4] limit(lim::CUDA.CUlimit_enum)
      @ CUDA /cluster/projects/nn9874k/aklocker/juliaup/depot/packages/CUDA/x8d2s/lib/cudadrv/context.jl:351
    [5] quicksort!(c::CuArray{Int64, 1, CUDA.DeviceMemory}; lt::typeof(isless), by::typeof(identity), dims::Int64, partial_k::Nothing, block_size_shift::Int64)
      @ CUDA.QuickSortImpl /cluster/projects/nn9874k/aklocker/juliaup/depot/packages/CUDA/x8d2s/src/sorting.jl:477
    [6] sort!(c::CuArray{Int64, 1, CUDA.DeviceMemory}, alg::CUDA.QuickSortAlg; lt::Function, by::Function, rev::Bool)
      @ CUDA /cluster/projects/nn9874k/aklocker/juliaup/depot/packages/CUDA/x8d2s/src/sorting.jl:991
    [7] sort!
      @ /cluster/projects/nn9874k/aklocker/juliaup/depot/packages/CUDA/x8d2s/src/sorting.jl:985 [inlined]
    [8] #sort!#1296
      @ /cluster/projects/nn9874k/aklocker/juliaup/depot/packages/CUDA/x8d2s/src/sorting.jl:1000 [inlined]
    [9] check_sort!(T::Type, N::Int64, f::Function; kwargs::@Kwargs{alg::CUDA.QuickSortAlg})
      @ Main /cluster/projects/nn9874k/aklocker/juliaup/depot/packages/CUDA/x8d2s/test/base/sorting.jl:198
   [10] check_sort!
      @ /cluster/projects/nn9874k/aklocker/juliaup/depot/packages/CUDA/x8d2s/test/base/sorting.jl:196 [inlined]
   [11] macro expansion
      @ /cluster/projects/nn9874k/aklocker/juliaup/depot/packages/CUDA/x8d2s/test/base/sorting.jl:283 [inlined]
   [12] macro expansion
      @ /cluster/projects/nn9874k/aklocker/juliaup/depot/juliaup/julia-1.10.10+0.aarch64.linux.gnu/share/julia/stdlib/v1.10/Test/src/Test.jl:669 [inlined]
   [13] macro expansion
      @ /cluster/projects/nn9874k/aklocker/juliaup/depot/packages/CUDA/x8d2s/test/base/sorting.jl:283 [inlined]
   [14] macro expansion
      @ /cluster/projects/nn9874k/aklocker/juliaup/depot/juliaup/julia-1.10.10+0.aarch64.linux.gnu/share/julia/stdlib/v1.10/Test/src/Test.jl:1577 [inlined]
   [15] macro expansion
      @ /cluster/projects/nn9874k/aklocker/juliaup/depot/packages/CUDA/x8d2s/test/base/sorting.jl:283 [inlined]
   [16] macro expansion
      @ /cluster/projects/nn9874k/aklocker/juliaup/depot/juliaup/julia-1.10.10+0.aarch64.linux.gnu/share/julia/stdlib/v1.10/Test/src/Test.jl:1577 [inlined]
   [17] top-level scope
      @ /cluster/projects/nn9874k/aklocker/juliaup/depot/packages/CUDA/x8d2s/test/base/sorting.jl:281
base/sorting: Error During Test at /cluster/projects/nn9874k/aklocker/juliaup/depot/packages/CUDA/x8d2s/test/base/sorting.jl:284
 

Is this related to the same problem?

Having the CUDA tests pass on your platform is a worthy goal, but not necessary for that goal. The CUDA.jl tests are much broader.

The failures you are seeing are caused by querying LIMIT_DEV_RUNTIME_SYNC_DEPTH:

cudaLimitDevRuntimeSyncDepth is the maximum grid depth at which a thread can issue the device runtime call cudaDeviceSynchronize() to wait on child grid launches to complete. This functionality is removed for devices of compute capability >= 9.0, and hence will return error cudaErrorUnsupportedLimit on such devices.

from

also

Use of cudaDeviceSynchronize in device code was deprecated in CUDA 11.6 and removed for compute_90+ compilation. For compute capability < 9.0, compile-time opt-in by specifying -D CUDA_FORCE_CDP1_IF_SUPPORTED is required to continue using cudaDeviceSynchronize() in device code for now. Note that this is different from host-side cudaDeviceSynchronize, which is still supported.
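
To check this on your own machine, here is a minimal sketch that performs the same query shown in the stack traces (`CUDA.limit` with `LIMIT_DEV_RUNTIME_SYNC_DEPTH`) and treats `ERROR_UNSUPPORTED_LIMIT` as "not available" instead of an error. The helper name `sync_depth_or_nothing` is made up for illustration, and I am assuming the error arrives as a `CUDA.CuError` with a `code` field, which matches the log output but may differ between CUDA.jl versions.

```julia
using CUDA

# Hypothetical helper: query the device-runtime sync depth limit and return
# `nothing` when the driver reports that the limit is unsupported.
function sync_depth_or_nothing()
    try
        return CUDA.limit(CUDA.LIMIT_DEV_RUNTIME_SYNC_DEPTH)
    catch err
        # Swallow only the specific error from the logs; rethrow anything else.
        if err isa CUDA.CuError && err.code == CUDA.ERROR_UNSUPPORTED_LIMIT
            return nothing
        end
        rethrow()
    end
end

cap = CUDA.capability(CUDA.device())   # a VersionNumber such as v"9.0.0"
depth = sync_depth_or_nothing()

println("compute capability: ", cap)
println("sync depth limit:   ", something(depth, "unsupported"))
```

On a device with compute capability 9.0 or higher I would expect this to print "unsupported", in line with the documentation quoted above; on older GPUs it should print an integer depth.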

Could you open an issue on CUDA.jl?

Thanks. Yes, I will do that.

My issue is actually the same (as far as I can tell) as discussed here, and apparently harmless.
