How to use CUDA on cluster nodes without internet access

Hi there,

I’m trying to run Julia code that uses CUDA and Flux on a cluster. Only the login nodes have internet access, but they have no GPU. So I’ve run into a catch-22: if I run the code on a login node, it doesn’t download the CUDA artifacts because there is no GPU, and if I run it on a GPU node, it fails to download CUDA because there is no internet access.

Is there a way for Julia to download the artifacts for CUDA even if there is no GPU on the machine that it’s running on?
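For what it’s worth, one possible workaround (assuming CUDA.jl 3.x, which normally picks a toolkit version by querying the local driver): pinning the toolkit version via the `JULIA_CUDA_VERSION` environment variable should let CUDA.jl choose artifacts without needing a GPU present. A sketch, untested on your setup; "11.0" is a placeholder that must match what the GPU nodes’ driver supports:

```julia
# On the login node (internet access, no GPU): pin the toolkit version
# so CUDA.jl does not need to query a driver before choosing artifacts.
# "11.0" is a placeholder -- match it to the GPU nodes' driver.
ENV["JULIA_CUDA_VERSION"] = "11.0"

using CUDA  # artifact downloads are triggered lazily on first use
```

Note that some CUDA.jl operations may still warn or fail on a node without a device; the goal here is only to populate the artifact cache.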

Thanks for the help.

Sounds like a good reason to build a sysimage ;)

To explain the smiley: I’ve recently tried PackageCompiler and I don’t want to know what it means to build a sysimage…

Most clusters provide CUDA installations. Have you considered using those, i.e. JULIA_CUDA_USE_BINARYBUILDER=false (IIRC)? This way CUDA.jl wouldn’t have to download anything.
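A minimal sketch of that approach, assuming the cluster exposes its CUDA installation via environment modules (e.g. `module load cuda`) and that the variable is read before the package initializes:

```julia
# Run on a GPU node, after loading the cluster's CUDA module.
# The variable must be set before CUDA.jl initializes -- ideally
# export it in the shell before starting Julia, or set it first
# thing in a fresh session.
ENV["JULIA_CUDA_USE_BINARYBUILDER"] = "false"

using CUDA
CUDA.versioninfo()  # should report the locally installed toolkit
```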


Hi carsten,

thanks for that suggestion. It now finds CUDA, but fails to find cudnn:

"
julia> using CUDA

julia> using Flux

julia> ENV["JULIA_CUDA_USE_BINARYBUILDER"] = false
false

julia>

julia> b = Float32.(randn(64,64,3,4)) |> gpu;
Downloaded artifact: CUDA_compat
Downloaded artifact: CUDA_compat
┌ Warning: CUDA.jl found cuda, but did not find libcudnn. Some functionality will not be available.
└ @ Flux ~/.julia/packages/Flux/18YZE/src/functor.jl:189

julia> encBlock = Chain(
Conv((4,4), 3 => 128, leakyrelu;pad=(0,0),stride=(2,2)),
BatchNorm(128),
) |> gpu;

julia>

julia> enc_out = encBlock(b) |> size
ERROR: This functionality is unavailabe as CUDNN is missing.
Stacktrace:
[1] error(s::String)
@ Base ./error.jl:33
[2] libcudnn(; throw_error::Bool)
@ CUDA.Deps ~/.julia/packages/CUDA/5jdFl/deps/bindeps.jl:535
[3] libcudnn()
@ CUDA.Deps ~/.julia/packages/CUDA/5jdFl/deps/bindeps.jl:528
[4] cudnnGetVersion
@ ~/.julia/packages/CUDA/5jdFl/lib/cudnn/libcudnn.jl:5 [inlined]
[5] version()
@ CUDA.CUDNN ~/.julia/packages/CUDA/5jdFl/lib/cudnn/base.jl:14
[6] cudnnversion
@ ~/.julia/packages/NNlibCUDA/i1IW9/src/cudnn/cudnn.jl:5 [inlined]
[7] conv!(y::CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, x::CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, w::CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, cdims::DenseConvDims{2, 2, 2, 4, 2}; alpha::Int64, beta::Int64, algo::Int64)
@ NNlibCUDA ~/.julia/packages/NNlibCUDA/i1IW9/src/cudnn/conv.jl:60
[8] conv!
@ ~/.julia/packages/NNlibCUDA/i1IW9/src/cudnn/conv.jl:60 [inlined]
[9] conv(x::CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, w::CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, cdims::DenseConvDims{2, 2, 2, 4, 2}; kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
@ NNlib ~/.julia/packages/NNlib/TAcqa/src/conv.jl:88
[10] conv
@ ~/.julia/packages/NNlib/TAcqa/src/conv.jl:86 [inlined]
[11] (::Conv{2, 2, typeof(leakyrelu), CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}})(x::CuArray{Float32, 4, CUDA.Mem.DeviceBuffer})
@ Flux ~/.julia/packages/Flux/18YZE/src/layers/conv.jl:170
[12] macro expansion
@ ~/.julia/packages/Flux/18YZE/src/layers/basic.jl:53 [inlined]
[13] applychain
@ ~/.julia/packages/Flux/18YZE/src/layers/basic.jl:53 [inlined]
[14] (::Chain{Tuple{Conv{2, 2, typeof(leakyrelu), CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, BatchNorm{typeof(identity), CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}})(x::CuArray{Float32, 4, CUDA.Mem.DeviceBuffer})
@ Flux ~/.julia/packages/Flux/18YZE/src/layers/basic.jl:51
[15] top-level scope
@ REPL[6]:1
[16] top-level scope
@ ~/.julia/packages/CUDA/5jdFl/src/initialization.jl:52

"

I have loaded the CUDA module in the cluster and I’m on a GPU node. Any ideas?

Do you have CUDNN locally?

It should be installed on the cluster, since I have been able to run convolutional layers on the GPU with PyTorch in Python, and as far as I know that also uses CUDNN.

The library should be discoverable, too. You can run with JULIA_DEBUG=CUDA to have CUDA.jl print where it is looking. In general, it should be discoverable by Libdl.find_library.
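A quick way to check discoverability from Julia itself (a sketch; the exact soname may differ per system):

```julia
using Libdl

# find_library returns the resolved library name if it is on the
# loader's search path, or "" if it cannot be found.
Libdl.find_library(["libcudnn", "libcudnn.so.8"])
```

If this returns `""` on the GPU node, the cluster’s cudnn module likely isn’t loaded or isn’t on `LD_LIBRARY_PATH`.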

Thanks for prompting me to double-check. It turned out there were modules available for CUDA 11.0 and 11.1, and only 11.0 had cudnn. Problem solved!
