Running CUDA.jl with JULIA_DEPOT_PATH read-only

I’m updating from Julia 1.8 to 1.10 for the simulations I’m running. Previously everything worked fine, but now I have an issue. I can install and use CUDA.jl just fine on the HPC login nodes, but on the compute nodes the home filesystem is read-only, and CUDA.jl always tries to precompile, even if it has already been precompiled on a login node.
If I point the Julia depot (JULIA_DEPOT_PATH) at the scratch filesystem, it works just fine because it can write there.
Why has this started happening when it wasn’t an issue before (CUDA.jl 4.4 → 5.3)? Can I force it not to precompile?

I don’t think it’s ideal to use a read-only primary depot, as packages like GPUCompiler.jl expect to be able to use read-write scratch spaces, which are also stored in the depot. Can you not use a layered depot, with the primary entry still being read-write?
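
A layered depot just means DEPOT_PATH has more than one entry: Julia writes new packages, precompile caches, and scratch spaces to the first (primary) entry and treats later entries as read-only fallbacks. A minimal sketch with a hypothetical path, run early in the session (the same layout can also be set up front via the JULIA_DEPOT_PATH environment variable before starting Julia):

# Hypothetical path: make a writable scratch directory the primary depot;
# the existing home depot stays in DEPOT_PATH as a read-only secondary layer,
# so packages already installed there can still be loaded.
pushfirst!(DEPOT_PATH, "/scratch/username/julia_depot")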

Regardless of that, the package is not expected to recompile again. Try running with JULIA_DEBUG=loading to inspect why this happens.
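
For example, one way to do that from the REPL (setting the environment variable before launching Julia works just as well):

julia> ENV["JULIA_DEBUG"] = "loading"  # enable debug output from Base's code-loading machinery
julia> using CUDA                      # rejection reasons for stale cache files are now printed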

Typically, you should put your Julia depot on the parallel filesystem (PFS). Oftentimes, that’s “scratch”.

Ah, I did expect it was an unusual use case; it’s okay if this just isn’t supported. I just don’t want to gunk up that filesystem, but I suppose I must…
What do you mean by a layered depot?

Anyway, here are the loading debug messages; the rejection seems to stem from a changed libcuda:

julia> using CUDA
┌ Debug: Rejecting stale cache file /home/n/ngrisoua/atkin163/.julia-mist/compiled/v1.10/CUDA_Runtime_jll/Hs50y_SwiJs.ji (mtime 1.717012621771995e9) because file /lib64/libcuda.so.1 (mtime 1.7170137474668772e9) has changed
└ @ Base loading.jl:3143
┌ Debug: Ignoring cache file /home/n/ngrisoua/atkin163/.julia-mist/compiled/v1.10/CUDA_Runtime_jll/Hs50y_3AysM.ji for CUDA_Runtime_jll [76a88914-d11a-5bdc-97e0-2f5a05c973a2] (fafbfcfd-01e2-611a-0006-41bd32282031) since it is does not provide desired build_id (fafbfcfd-f246-903e-0006-43c02359b684)
└ @ Base loading.jl:3058
┌ Debug: Rejecting cache file /home/n/ngrisoua/atkin163/.julia-mist/compiled/v1.10/CUDA/oWw5k_SwiJs.ji because required dependency CUDA_Runtime_jll [76a88914-d11a-5bdc-97e0-2f5a05c973a2] with build ID fafbfcfd-f246-903e-0006-43c02359b684 is missing from the cache.
└ @ Base loading.jl:1560
┌ Debug: Rejecting cache file /home/n/ngrisoua/atkin163/.julia-mist/compiled/v1.10/CUDA/oWw5k_3AysM.ji because it is for file /home/n/ngrisoua/atkin163/scratch/.julia-mist/packages/CUDA/B2Z5u/src/CUDA.jl not file /home/n/ngrisoua/atkin163/.julia-mist/packages/CUDA/B2Z5u/src/CUDA.jl
└ @ Base loading.jl:3113
ERROR: IOError: open("/gpfs/fs1/home/n/ngrisoua/atkin163/.julia-mist/logs/manifest_usage.toml.pid", 194, 292): read-only file system (EROFS)
...

For the sake of completeness, here is the output of CUDA.versioninfo():

CUDA runtime 11.8, artifact installation
CUDA driver 11.7
NVIDIA driver 515.105.1

CUDA libraries:
- CUBLAS: 11.11.3
- CURAND: 10.3.0
- CUFFT: 10.9.0
- CUSOLVER: 11.4.1
- CUSPARSE: 11.7.5
- CUPTI: 18.0.0
- NVML: 11.0.0+515.105.1

Julia packages:
- CUDA: 5.3.5
- CUDA_Driver_jll: 0.8.1+0
- CUDA_Runtime_jll: 0.12.1+0

Toolchain:
- Julia: 1.10.4
- LLVM: 15.0.7

4 devices:
  0: Tesla V100-SXM2-32GB (sm_70, 25.215 GiB / 32.000 GiB available)
  1: Tesla V100-SXM2-32GB (sm_70, 26.889 GiB / 32.000 GiB available)
  2: Tesla V100-SXM2-32GB (sm_70, 26.889 GiB / 32.000 GiB available)
  3: Tesla V100-SXM2-32GB (sm_70, 27.397 GiB / 32.000 GiB available)

I suspected as much. In this case, recompilation is kind of necessary: the CUDA driver (libcuda) determines which toolkit you can use, so if the driver changes, users would want CUDA.jl to recompile so that we’re using a compatible CUDA toolkit.
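
Both versions can be checked from a Julia session, for example:

julia> CUDA.driver_version()   # CUDA version supported by the installed libcuda driver
julia> CUDA.runtime_version()  # version of the CUDA toolkit CUDA.jl is currently using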

To prevent this, you would need to make sure the same driver is used on the compute nodes, or you would need to manually specify the toolkit to use (see Yggdrasil/C/CUDA/CUDA_Runtime/platform_augmentation.jl at 7d67bcabd520057dffa52e9f160afc8fef24d720 · JuliaPackaging/Yggdrasil · GitHub for how that logic works). This can be done with a preference, which you can set by calling CUDA.set_runtime_version!.
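
Something like the following, with the version matching the cluster’s CUDA installation (the preference is written to the active environment’s LocalPreferences.toml and takes effect after restarting Julia):

julia> using CUDA
julia> CUDA.set_runtime_version!(v"11.8"; local_toolkit=true)  # use the cluster-provided CUDA 11.8 instead of the artifact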

I did that (CUDA.set_runtime_version!(v"11.8.0"; local_toolkit=true)), and it does manage to get further, but then it fails with the output below. It looks like there is no nvperf_host in the CUDA library folder? Is this an issue I need to bring up with support?

┌ Debug: Looking for library nvperf_host, no specific version, in /scinet/mist/rhel8/software/2021a/opt/base/cuda/11.8.0 or /scinet/mist/rhel8/software/2021a/opt/base/cuda/11.8.0/extras/compute-sanitizer or /scinet/mist/rhel8/software/2021a/opt/base/cuda/11.8.0/compute-sanitizer
│   all_names =
│    1-element Vector{String}:
│     "libnvperf_host.so"
│   all_locations =
│    15-element Vector{String}:
│     "/scinet/mist/rhel8/software/2021a/opt/base/cuda/11.8.0"
│     "/scinet/mist/rhel8/software/2021a/opt/base/cuda/11.8.0/lib"
│     "/scinet/mist/rhel8/software/2021a/opt/base/cuda/11.8.0/lib64"
│     ⋮
│     "/scinet/mist/rhel8/software/2021a/opt/base/cuda/11.8.0/compute-sanitizer/libx64"
│     "/scinet/mist/rhel8/software/2021a/opt/base/cuda/11.8.0/compute-sanitizer/targets/ppc64le-linux/lib"
└ @ CUDA_Runtime_Discovery ~/.julia-mist/packages/CUDA_Runtime_Discovery/ZGnCZ/src/CUDA_Runtime_Discovery.jl:136
┌ Debug: Did not find nvperf_host
└ @ CUDA_Runtime_Discovery ~/.julia-mist/packages/CUDA_Runtime_Discovery/ZGnCZ/src/CUDA_Runtime_Discovery.jl:139
┌ Debug: Looking for library nvperf_host, version 9.0.0 or 9.1.0 or 9.2.0 or 10.0.0 or 10.1.0 or 10.2.0 or 11.0.0 or 11.1.0 or 11.2.0 or 11.3.0 or 11.4.0 or 11.5.0 or 11.6.0 or 11.7.0 or 11.8.0 or 12.0.0 or 12.1.0 or 12.2.0 or 12.3.0 or 12.4.0 or 12.5.0 or 12.6.0 or 12.7.0 or 12.8.0 or 12.9.0 or 12.10.0 or 13.1.0 or 13.2.0 or 13.3.0 or 13.4.0 or 13.5.0 or 13.6.0 or 13.7.0 or 13.8.0 or 13.9.0 or 13.10.0 or 14.1.0 or 14.2.0 or 14.3.0 or 14.4.0 or 14.5.0 or 14.6.0 or 14.7.0 or 14.8.0 or 14.9.0 or 14.10.0, in /scinet/mist/rhel8/software/2021a/opt/base/cuda/11.8.0 or /scinet/mist/rhel8/software/2021a/opt/base/cuda/11.8.0/extras/compute-sanitizer or /scinet/mist/rhel8/software/2021a/opt/base/cuda/11.8.0/compute-sanitizer
│   all_names =
│    139-element Vector{String}:
│     "libnvperf_host.so"
│     "libnvperf_host.so.9.0.0"
│     "libnvperf_host.so.9.0"
│     ⋮
│     "libnvperf_host.so.14.10"
│     "libnvperf_host.so.14"
│   all_locations =
│    15-element Vector{String}:
│     "/scinet/mist/rhel8/software/2021a/opt/base/cuda/11.8.0"
│     "/scinet/mist/rhel8/software/2021a/opt/base/cuda/11.8.0/lib"
│     "/scinet/mist/rhel8/software/2021a/opt/base/cuda/11.8.0/lib64"
│     ⋮
│     "/scinet/mist/rhel8/software/2021a/opt/base/cuda/11.8.0/compute-sanitizer/libx64"
│     "/scinet/mist/rhel8/software/2021a/opt/base/cuda/11.8.0/compute-sanitizer/targets/ppc64le-linux/lib"
└ @ CUDA_Runtime_Discovery ~/.julia-mist/packages/CUDA_Runtime_Discovery/ZGnCZ/src/CUDA_Runtime_Discovery.jl:136
┌ Debug: Did not find nvperf_host
└ @ CUDA_Runtime_Discovery ~/.julia-mist/packages/CUDA_Runtime_Discovery/ZGnCZ/src/CUDA_Runtime_Discovery.jl:139
┌ Debug: Could not discover CUDA toolkit
│   exception =
│    Could not find library 'nvperf_host' in your local CUDA installation.

This is fixed in the latest CUDA_Runtime_Discovery.jl, but you’re using an old version (v0.2.4). Upgrading to CUDA.jl v5.4 should fix this.
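
A sketch of that upgrade, assuming the package registry is reachable (e.g. from a login node) and CUDA is a direct dependency of the active environment:

julia> using Pkg
julia> Pkg.add(name="CUDA", version=v"5.4")  # or `add CUDA@5.4` from the Pkg REPL; pulls in a newer CUDA_Runtime_Discovery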

Ah, thank you. Unfortunately, this cluster uses Power9 CPUs… I bit the bullet and just put the depot on scratch.

If you need specific features, such as compatibility with the latest CUDA_Runtime_Discovery.jl, feel free to create a backport PR. There is already a release branch: GitHub - JuliaGPU/CUDA.jl at release-5.3