Another freeze when testing CUDA

I get no errors that I can find, but when I run Pkg.test("CUDA"; test_args=`--verbose --jobs=1`), my computer freezes after a while. I have a 6-core Intel CPU with 12 GB of RAM, and I have just installed an NVIDIA RTX A2000, which also has 12 GB.

I’m running Arch (EndeavourOS) and everything is fully up to date.

julia> CUDA.versioninfo()
CUDA runtime 12.6, artifact installation
CUDA driver 12.8
NVIDIA driver 570.86.16

CUDA libraries: 
- CUBLAS: 12.6.4
- CURAND: 10.3.7
- CUFFT: 11.3.0
- CUSOLVER: 11.7.1
- CUSPARSE: 12.5.4
- CUPTI: 2024.3.2 (API 24.0.0)
- NVML: 12.0.0+570.86.16

Julia packages: 
- CUDA: 5.6.1
- CUDA_Driver_jll: 0.10.4+0
- CUDA_Runtime_jll: 0.15.5+0

Toolchain:
- Julia: 1.11.3
- LLVM: 16.0.6

1 device:
  0: NVIDIA RTX A2000 12GB (sm_86, 10.160 GiB / 11.994 GiB available)

My .bashrc contains:

export JULIA_NUM_THREADS=1
export JULIA_CPU_THREADS=1

FWIW:

[roger@roger-pc ~]$ sudo lshw -C display
[sudo] password for roger: 
  *-display                 
       description: VGA compatible controller
       product: GA106 [RTX A2000 12GB]
       vendor: NVIDIA Corporation
       physical id: 0
       bus info: pci@0000:03:00.0
       logical name: /dev/fb0
       version: a1
       width: 64 bits
       clock: 33MHz
       capabilities: pm msi pciexpress vga_controller bus_master cap_list rom fb
       configuration: depth=32 driver=nvidia latency=0 resolution=3840,2160
       resources: irq:40 memory:fa000000-faffffff memory:d0000000-dfffffff memory:ce000000-cfffffff ioport:cc00(size=128) memory:c0000-dffff

Note that I have an old motherboard:

[roger@roger-pc ~]$ inxi -Fxz
System:
  Kernel: 6.13.1-arch1-1 arch: x86_64 bits: 64 compiler: gcc v: 14.2.1
  Desktop: KDE Plasma v: 6.2.5 Distro: EndeavourOS base: Arch Linux
Machine:
  Type: Desktop Mobo: ASUSTeK model: P6X58D-E v: Rev 1.xx
    serial: <superuser required> BIOS: American Megatrends v: 0803
    date: 08/06/2012
CPU:
  Info: 6-core model: Intel Core i7 980 bits: 64 type: MT MCP arch: Nehalem
    rev: 2 cache: L1: 384 KiB L2: 1.5 MiB L3: 12 MiB
  Speed (MHz): avg: 1600 min/max: 1600/3334 boost: enabled cores: 1: 1600
    2: 1600 3: 1600 4: 1600 5: 1600 6: 1600 7: 1600 8: 1600 9: 1600 10: 1600
    11: 1600 12: 1600 bogomips: 80178
  Flags: ht lm nx pae sse sse2 sse3 sse4_1 sse4_2 ssse3 vmx
Graphics:
  Device-1: NVIDIA GA106 [RTX A2000 12GB] driver: nvidia v: 570.86.16
    arch: Ampere bus-ID: 03:00.0
  Device-2: Microsoft LifeCam Cinema driver: snd-usb-audio,uvcvideo
    type: USB bus-ID: 5-3:2
  Display: wayland server: X.org v: 1.21.1.15 with: Xwayland v: 24.1.5
    compositor: kwin_wayland driver: X: loaded: nvidia unloaded: modesetting
    gpu: nvidia,nvidia-nvswitch resolution: 3840x2160~60Hz
  API: EGL v: 1.5 drivers: nvidia platforms:
    active: gbm,wayland,x11,surfaceless,device inactive: N/A
  API: OpenGL v: 4.6.0 vendor: nvidia v: 570.86.16 glx-v: 1.4
    direct-render: yes renderer: NVIDIA RTX A2000 12GB/PCIe/SSE2
  API: Vulkan v: 1.4.303 drivers: N/A surfaces: xcb,xlib,wayland devices: 1
  Info: Tools: api: clinfo, eglinfo, glxinfo, vulkaninfo
    de: kscreen-console,kscreen-doctor gpu: nvidia-settings,nvidia-smi
    wl: wayland-info x11: xdpyinfo, xprop, xrandr
Audio:
  Device-1: Intel 82801JI HD Audio vendor: ASUSTeK driver: snd_hda_intel
    v: kernel bus-ID: 00:1b.0
  Device-2: NVIDIA GA106 High Definition Audio driver: snd_hda_intel
    v: kernel bus-ID: 03:00.1
  Device-3: Microsoft LifeCam Cinema driver: snd-usb-audio,uvcvideo
    type: USB bus-ID: 5-3:2
  API: ALSA v: k6.13.1-arch1-1 status: kernel-api
  Server-1: PipeWire v: 1.2.7 status: active
Network:
  Device-1: Marvell 88E8056 PCI-E Gigabit Ethernet vendor: ASUSTeK
    driver: sky2 v: 1.30 port: d800 bus-ID: 05:00.0
  IF: enp5s0 state: up speed: 1000 Mbps duplex: full mac: <filter>
Drives:
  Local Storage: total: 3.64 TiB used: 216.36 GiB (5.8%)
  ID-1: /dev/sda vendor: Samsung model: SSD 870 EVO 2TB size: 1.82 TiB
  ID-2: /dev/sdb vendor: Samsung model: SSD 870 EVO 2TB size: 1.82 TiB
Partition:
  ID-1: / size: 1.79 TiB used: 216.36 GiB (11.8%) fs: ext4 dev: /dev/sdb1
Swap:
  Alert: No swap data was found.
Sensors:
  System Temperatures: cpu: 31.0 C mobo: 31.0 C
  Fan Speeds (rpm): cpu: 2556 psu: 0 case-1: 0 case-2: 570 case-3: 359
  Power: 12v: 11.91 5v: N/A 3.3v: 3.15 vbat: N/A
Info:
  Memory: total: 12 GiB available: 11.67 GiB used: 4.42 GiB (37.9%)
  Processes: 304 Uptime: 48m Init: systemd
  Packages: 1522 Compilers: clang: 19.1.7 gcc: 14.2.1 Shell: Bash v: 5.2.37
    inxi: 3.3.37

I monitor the process using nvitop, which is very useful and shows memory usage growing, but after about two hours the PC freezes. No errors, no logs, no clues.
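Since nothing survives the freeze, one option is to write periodic memory samples to disk so the last one can be read after a reboot. The following is only a minimal sketch: the log file name, the 5-second interval, and the helper function names are hypothetical choices, and it assumes `nvidia-smi` is on the PATH and a single GPU is installed.

```shell
#!/usr/bin/env bash
# Sketch: append host and GPU memory usage to a log every few seconds.
# LOG and INTERVAL are hypothetical defaults, not taken from the thread.
LOG="${LOG:-mem_usage.log}"
INTERVAL="${INTERVAL:-5}"

# Host memory still available, in MiB.
# Reads /proc/meminfo unless another file is given as the first argument.
host_avail_mib() {
    awk '/^MemAvailable:/ { print int($2 / 1024) }' "${1:-/proc/meminfo}"
}

# GPU memory in use, in MiB, as reported by nvidia-smi (one line per GPU).
gpu_used_mib() {
    nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits
}

# Loop forever, writing one timestamped sample per interval.
log_memory() {
    while true; do
        echo "$(date '+%F %T') host_avail=$(host_avail_mib)MiB gpu_used=$(gpu_used_mib)MiB" >> "$LOG"
        sync "$LOG"   # flush to disk so the last sample survives a hard freeze
        sleep "$INTERVAL"
    done
}
```

Starting `log_memory &` in a second terminal before running the tests would leave a record of whether host memory or GPU memory was running out when the machine stopped responding.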

Can anyone help?

The --verbose flag should make it print when each test suite starts. What is the last test suite that starts before the PC hangs? Does running that test in isolation reproduce the issue?
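A rough way to answer both questions from the shell is sketched below. It rests on two assumptions about the CUDA.jl test runner that should be checked against its `--help` output: that verbose mode prints a line containing "started at" for each suite, and that suite names passed as extra arguments restrict the run to those suites. The log file name and function names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: keep a log of the test run and find the last suite that started.
# LOG is a hypothetical default file name.
LOG="${LOG:-cuda_test.log}"

# Run the CUDA.jl tests, optionally restricted to the suite names given as
# arguments (assumed to be accepted as positional test_args), and copy all
# output to $LOG.
run_cuda_tests() {
    julia -e 'using Pkg; Pkg.test("CUDA"; test_args=`--verbose --jobs=1 $ARGS`)' "$@" 2>&1 | tee "$LOG"
}

# Print the last line in the given log that announces a suite starting
# (assumes the verbose output marks these lines with "started at").
last_started() {
    grep 'started at' "$1" | tail -n 1
}
```

After a reboot, `last_started cuda_test.log` would show the candidate suite, and `run_cuda_tests <suite-name>` would rerun only that one to see whether the hang reproduces.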

I ran the test in headless mode, otherwise as above. I redirected all output to the uploaded files; see julia_cuda_info.txt on LimeWire.

The first file is huge.

That looks fine. There’s a handful of expected failures because of regressions in Julia which have been worked around on the master branch of CUDA.jl, but in general everything seems to be working fine. The fact that your display freezes when running the test suite is not great, but it’s also not recommended to run heavy computational jobs on the same GPU that drives your display.

OK, thanks for the information. I switched to headless mode for exactly the reason you give: to avoid driving the monitor and running a heavy computational load on the same GPU.