Strange CUDA_ERROR_OUT_OF_MEMORY when using CUDA on HPC

I am using CUDA.jl v2.3 on an HPC GPU cluster. There is roughly a 50% chance that my job fails with an out-of-memory error; otherwise, the same script runs to completion. Any clue why this is happening?

The error stack trace is:

ERROR: LoadError: CUDA error (code 2, CUDA_ERROR_OUT_OF_MEMORY)
 Stacktrace:
  [1] throw_api_error(::CUDA.cudaError_enum) at /users/PAS0177/shah1285/.julia/packages/CUDA/YeS8q/lib/cudadrv/error.jl:97
  [2] macro expansion at /users/PAS0177/shah1285/.julia/packages/CUDA/YeS8q/lib/cudadrv/error.jl:104 [inlined]
  [3] cuDevicePrimaryCtxRetain at /users/PAS0177/shah1285/.julia/packages/CUDA/YeS8q/lib/utils/call.jl:93 [inlined]
  [4] CUDA.CuContext(::CUDA.CuPrimaryContext) at /users/PAS0177/shah1285/.julia/packages/CUDA/YeS8q/lib/cudadrv/context/primary.jl:32
  [5] context(::CUDA.CuDevice) at /users/PAS0177/shah1285/.julia/packages/CUDA/YeS8q/src/state.jl:257
  [6] device! at /users/PAS0177/shah1285/.julia/packages/CUDA/YeS8q/src/state.jl:294 [inlined]
  [7] device! at /users/PAS0177/shah1285/.julia/packages/CUDA/YeS8q/src/state.jl:273 [inlined]
  [8] initialize_thread(::Int64) at /users/PAS0177/shah1285/.julia/packages/CUDA/YeS8q/src/state.jl:122
  [9] prepare_cuda_call() at /users/PAS0177/shah1285/.julia/packages/CUDA/YeS8q/src/state.jl:80
  [10] device at /users/PAS0177/shah1285/.julia/packages/CUDA/YeS8q/src/state.jl:227 [inlined]
  [11] alloc at /users/PAS0177/shah1285/.julia/packages/CUDA/YeS8q/src/pool.jl:293 [inlined]
  [12] CUDA.CuArray{Float32,2}(::UndefInitializer, ::Tuple{Int64,Int64}) at /users/PAS0177/shah1285/.julia/packages/CUDA/YeS8q/src/array.jl:20
  [13] CuArray at /users/PAS0177/shah1285/.julia/packages/CUDA/YeS8q/src/array.jl:76 [inlined]
  [14] similar at ./abstractarray.jl:675 [inlined]
  [15] convert at /users/PAS0177/shah1285/.julia/packages/GPUArrays/ZxsKE/src/host/construction.jl:82 [inlined]
  [16] adapt_storage at /users/PAS0177/shah1285/.julia/packages/CUDA/YeS8q/src/array.jl:330 [inlined]
  [17] adapt_structure at /users/PAS0177/shah1285/.julia/packages/Adapt/8kQMV/src/Adapt.jl:42 [inlined]
  [18] adapt at /users/PAS0177/shah1285/.julia/packages/Adapt/8kQMV/src/Adapt.jl:40 [inlined]
  [19] cu at /users/PAS0177/shah1285/.julia/packages/CUDA/YeS8q/src/array.jl:342 [inlined]
  [20] fmap(::typeof(CUDA.cu), ::Array{Float64,2}; cache::IdDict{Any,Any}) at /users/PAS0177/shah1285/.julia/packages/Functors/YlETM/src/functor.jl:35
  [21] fmap at /users/PAS0177/shah1285/.julia/packages/Functors/YlETM/src/functor.jl:34 [inlined]
  [22] gpu at /users/PAS0177/shah1285/.julia/packages/Flux/q3zeA/src/functor.jl:74 [inlined]
  [23] |>(::Array{Float64,2}, ::typeof(Flux.gpu)) at ./operators.jl:834
  [24] getdata_reg(::Int64, ::Int64, ::Float64) at /fs/scratch/PAS0177/NNFitter/src/regression.jl:25
  [25] train_reg(::Int64, ::Int64, ::Float64; epochs::Int64) at /fs/scratch/PAS0177/NNFitter/src/regression.jl:55
  [26] main() at /fs/scratch/PAS0177/NNFitter/scripts/trainner.jl:15
  [27] top-level scope at /fs/scratch/PAS0177/NNFitter/scripts/trainner.jl:19
  [28] include(::Function, ::Module, ::String) at ./Base.jl:380
  [29] include(::Module, ::String) at ./Base.jl:368
  [30] exec_options(::Base.JLOptions) at ./client.jl:296
  [31] _start() at ./client.jl:506

This seems to be happening at the start of the process, right? I see a call to CuContext in the stack trace, which should only happen once. The obvious question then: is there enough free memory on this device? You could log that at job start, e.g. with something like the snippet below.
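A minimal sketch of such a check, querying free and total device memory before the script does any real work (it uses CUDA.jl's `available_memory`/`total_memory`, which should exist in v2.3, but double-check against your version):

```julia
using CUDA

# Log free/total device memory at job start, so a failed job leaves a record
# of whether the GPU was already (partially) occupied by someone else.
free_b  = CUDA.available_memory()
total_b = CUDA.total_memory()
@info "GPU memory at job start" free_GiB = free_b / 2^30 total_GiB = total_b / 2^30
```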

There is enough free memory, as the script runs to completion 50% of the time.

That doesn’t tell me whether there was free memory when the job actually failed. Did you check at the time of the failure? Also, does the error happen right at the start of the job, as I asked, or at some later point?

Could it be that other processes are using these GPUs? Please ask your systems people to run nvidia-smi.
Sometimes on HPC clusters there are ‘orphan’ processes left behind when jobs fail; these can keep running and consuming GPU memory yet do not show up in the job queue. You can also check from your own job, see the sketch below.
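If you can’t get the admins to check, the job itself can log which processes are on the GPU right before it touches CUDA; a rough sketch, assuming nvidia-smi is on the PATH of the compute node:

```julia
# Shell out to nvidia-smi and record every compute process on the GPU,
# so a failing job leaves evidence of any orphan processes.
try
    run(`nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv`)
catch err
    @warn "Could not run nvidia-smi" err
end
```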

Yes, the error occurs at the start of the script, even if I just execute CUDA.zeros(1). Originally, I was using PackageCompiler.jl to compile CUDA.jl and Flux.jl into a sysimage on an HPC GPU node and then using that compiled sysimage to run my actual jobs.
Now that I have stopped using the precompiled sysimage, the error no longer seems to occur.
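For reference, the minimal check I mean by "just execute CUDA.zeros(1)" is roughly this (script and sysimage names are placeholders):

```julia
# check_cuda.jl -- minimal CUDA smoke test.
# With `julia --sysimage=custom_sysimage.so check_cuda.jl` this hit the OOM
# error intermittently; with a plain `julia check_cuda.jl` it runs fine.
using CUDA

@info "CUDA functional?" CUDA.functional()
x = CUDA.zeros(1)   # first GPU allocation; CUDA context creation happens lazily here
@info "Allocation succeeded" summary(x)
```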