How to use CUDA on cluster nodes without internet access

Hi there,

I’m trying to run Julia code that uses CUDA and Flux on a cluster. Only the login nodes have internet access, but they have no GPU. So I’ve run into a catch-22: if I run the code on a login node, it doesn’t download the CUDA artifacts because there is no GPU, and if I run it on a GPU node, it fails to download CUDA because there is no internet access.

Is there a way for Julia to download the artifacts for CUDA even if there is no GPU on the machine that it’s running on?
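For what it’s worth, one possible workaround (assuming CUDA.jl 3.x, which normally picks a toolkit version by querying the local driver): pinning the toolkit version via the `JULIA_CUDA_VERSION` environment variable should let CUDA.jl choose artifacts without needing a GPU present. A sketch, untested on your setup; "11.0" is a placeholder that must match what the GPU nodes’ driver supports:

```julia
# On the login node (internet access, no GPU): pin the toolkit version
# so CUDA.jl does not need to query a driver before choosing artifacts.
# "11.0" is a placeholder -- match it to the GPU nodes' driver.
ENV["JULIA_CUDA_VERSION"] = "11.0"

using CUDA  # artifact downloads are triggered lazily on first use
```

Note that some CUDA.jl operations may still warn or fail on a node without a device; the goal here is only to populate the artifact cache.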

Thanks for the help.

Sounds like a good reason to build a sysimage ;)

To explain the smiley: I’ve recently tried PackageCompiler and I don’t want to know what it means to build a sysimage…

Most clusters provide CUDA installations. Have you considered using those, i.e. JULIA_CUDA_USE_BINARYBUILDER=false (IIRC)? This way CUDA.jl wouldn’t have to download anything.
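A minimal sketch of that approach, assuming the cluster exposes its CUDA installation via environment modules (e.g. `module load cuda`) and that the variable is read before the package initializes:

```julia
# Run on a GPU node, after loading the cluster's CUDA module.
# The variable must be set before CUDA.jl initializes -- ideally
# export it in the shell before starting Julia, or set it first
# thing in a fresh session.
ENV["JULIA_CUDA_USE_BINARYBUILDER"] = "false"

using CUDA
CUDA.versioninfo()  # should report the locally installed toolkit
```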


Hi carsten,

thanks for that suggestion. It now finds CUDA, but fails to find cudnn:

"
julia> using CUDA

julia> using Flux

julia> ENV["JULIA_CUDA_USE_BINARYBUILDER"] = false
false

julia>

julia> b = Float32.(randn(64,64,3,4)) |> gpu;
Downloaded artifact: CUDA_compat
Downloaded artifact: CUDA_compat
┌ Warning: CUDA.jl found cuda, but did not find libcudnn. Some functionality will not be available.
└ @ Flux ~/.julia/packages/Flux/18YZE/src/functor.jl:189

julia> encBlock = Chain(
Conv((4,4), 3 => 128, leakyrelu;pad=(0,0),stride=(2,2)),
BatchNorm(128),
) |> gpu;

julia>

julia> enc_out = encBlock(b) |> size
ERROR: This functionality is unavailabe as CUDNN is missing.
Stacktrace:
[1] error(s::String)
@ Base ./error.jl:33
[2] libcudnn(; throw_error::Bool)
@ CUDA.Deps ~/.julia/packages/CUDA/5jdFl/deps/bindeps.jl:535
[3] libcudnn()
@ CUDA.Deps ~/.julia/packages/CUDA/5jdFl/deps/bindeps.jl:528
[4] cudnnGetVersion
@ ~/.julia/packages/CUDA/5jdFl/lib/cudnn/libcudnn.jl:5 [inlined]
[5] version()
@ CUDA.CUDNN ~/.julia/packages/CUDA/5jdFl/lib/cudnn/base.jl:14
[6] cudnnversion
@ ~/.julia/packages/NNlibCUDA/i1IW9/src/cudnn/cudnn.jl:5 [inlined]
[7] conv!(y::CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, x::CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, w::CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, cdims::DenseConvDims{2, 2, 2, 4, 2}; alpha::Int64, beta::Int64, algo::Int64)
@ NNlibCUDA ~/.julia/packages/NNlibCUDA/i1IW9/src/cudnn/conv.jl:60
[8] conv!
@ ~/.julia/packages/NNlibCUDA/i1IW9/src/cudnn/conv.jl:60 [inlined]
[9] conv(x::CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, w::CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, cdims::DenseConvDims{2, 2, 2, 4, 2}; kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
@ NNlib ~/.julia/packages/NNlib/TAcqa/src/conv.jl:88
[10] conv
@ ~/.julia/packages/NNlib/TAcqa/src/conv.jl:86 [inlined]
[11] (::Conv{2, 2, typeof(leakyrelu), CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}})(x::CuArray{Float32, 4, CUDA.Mem.DeviceBuffer})
@ Flux ~/.julia/packages/Flux/18YZE/src/layers/conv.jl:170
[12] macro expansion
@ ~/.julia/packages/Flux/18YZE/src/layers/basic.jl:53 [inlined]
[13] applychain
@ ~/.julia/packages/Flux/18YZE/src/layers/basic.jl:53 [inlined]
[14] (::Chain{Tuple{Conv{2, 2, typeof(leakyrelu), CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, BatchNorm{typeof(identity), CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}})(x::CuArray{Float32, 4, CUDA.Mem.DeviceBuffer})
@ Flux ~/.julia/packages/Flux/18YZE/src/layers/basic.jl:51
[15] top-level scope
@ REPL[6]:1
[16] top-level scope
@ ~/.julia/packages/CUDA/5jdFl/src/initialization.jl:52

"

I have loaded the CUDA module in the cluster and I’m on a GPU node. Any ideas?

Do you have CUDNN locally?

It should be installed on the cluster, since I have been able to run convolutional layers on the GPU with PyTorch in Python, and as far as I know that also uses CUDNN.

The library should be discoverable, too. You can run with JULIA_DEBUG=CUDA to have CUDA.jl print where it is looking. In general, it should be discoverable by Libdl.find_library.
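A quick way to check discoverability from Julia itself (a sketch; the exact soname may differ per system):

```julia
using Libdl

# find_library returns the resolved library name if it is on the
# loader's search path, or "" if it cannot be found.
Libdl.find_library(["libcudnn", "libcudnn.so.8"])
```

If this returns `""` on the GPU node, the cluster’s cudnn module likely isn’t loaded or isn’t on `LD_LIBRARY_PATH`.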

Thanks for prompting me to double-check. It turned out there were modules available for CUDA 11.0 and 11.1, and only 11.0 had cudnn. Problem solved!
