Strange CUDA_ERROR_OUT_OF_MEMORY when using CUDA on HPC

I am using CUDA.jl v2.3 on an HPC GPU cluster. There is roughly a 50% chance that my job fails with an out-of-memory error; otherwise, the same script runs to completion. Any clue why this is happening?

The error stack trace is:

ERROR: LoadError: CUDA error (code 2, CUDA_ERROR_OUT_OF_MEMORY)
 Stacktrace:
  [1] throw_api_error(::CUDA.cudaError_enum) at /users/PAS0177/shah1285/.julia/packages/CUDA/YeS8q/lib/cudadrv/error.jl:97
  [2] macro expansion at /users/PAS0177/shah1285/.julia/packages/CUDA/YeS8q/lib/cudadrv/error.jl:104 [inlined]
  [3] cuDevicePrimaryCtxRetain at /users/PAS0177/shah1285/.julia/packages/CUDA/YeS8q/lib/utils/call.jl:93 [inlined]
  [4] CUDA.CuContext(::CUDA.CuPrimaryContext) at /users/PAS0177/shah1285/.julia/packages/CUDA/YeS8q/lib/cudadrv/context/primary.jl:32
  [5] context(::CUDA.CuDevice) at /users/PAS0177/shah1285/.julia/packages/CUDA/YeS8q/src/state.jl:257
  [6] device! at /users/PAS0177/shah1285/.julia/packages/CUDA/YeS8q/src/state.jl:294 [inlined]
  [7] device! at /users/PAS0177/shah1285/.julia/packages/CUDA/YeS8q/src/state.jl:273 [inlined]
  [8] initialize_thread(::Int64) at /users/PAS0177/shah1285/.julia/packages/CUDA/YeS8q/src/state.jl:122
  [9] prepare_cuda_call() at /users/PAS0177/shah1285/.julia/packages/CUDA/YeS8q/src/state.jl:80
  [10] device at /users/PAS0177/shah1285/.julia/packages/CUDA/YeS8q/src/state.jl:227 [inlined]
  [11] alloc at /users/PAS0177/shah1285/.julia/packages/CUDA/YeS8q/src/pool.jl:293 [inlined]
  [12] CUDA.CuArray{Float32,2}(::UndefInitializer, ::Tuple{Int64,Int64}) at /users/PAS0177/shah1285/.julia/packages/CUDA/YeS8q/src/array.jl:20
  [13] CuArray at /users/PAS0177/shah1285/.julia/packages/CUDA/YeS8q/src/array.jl:76 [inlined]
  [14] similar at ./abstractarray.jl:675 [inlined]
  [15] convert at /users/PAS0177/shah1285/.julia/packages/GPUArrays/ZxsKE/src/host/construction.jl:82 [inlined]
  [16] adapt_storage at /users/PAS0177/shah1285/.julia/packages/CUDA/YeS8q/src/array.jl:330 [inlined]
  [17] adapt_structure at /users/PAS0177/shah1285/.julia/packages/Adapt/8kQMV/src/Adapt.jl:42 [inlined]
  [18] adapt at /users/PAS0177/shah1285/.julia/packages/Adapt/8kQMV/src/Adapt.jl:40 [inlined]
  [19] cu at /users/PAS0177/shah1285/.julia/packages/CUDA/YeS8q/src/array.jl:342 [inlined]
  [20] fmap(::typeof(CUDA.cu), ::Array{Float64,2}; cache::IdDict{Any,Any}) at /users/PAS0177/shah1285/.julia/packages/Functors/YlETM/src/functor.jl:35
  [21] fmap at /users/PAS0177/shah1285/.julia/packages/Functors/YlETM/src/functor.jl:34 [inlined]
  [22] gpu at /users/PAS0177/shah1285/.julia/packages/Flux/q3zeA/src/functor.jl:74 [inlined]
  [23] |>(::Array{Float64,2}, ::typeof(Flux.gpu)) at ./operators.jl:834
  [24] getdata_reg(::Int64, ::Int64, ::Float64) at /fs/scratch/PAS0177/NNFitter/src/regression.jl:25
  [25] train_reg(::Int64, ::Int64, ::Float64; epochs::Int64) at /fs/scratch/PAS0177/NNFitter/src/regression.jl:55
  [26] main() at /fs/scratch/PAS0177/NNFitter/scripts/trainner.jl:15
  [27] top-level scope at /fs/scratch/PAS0177/NNFitter/scripts/trainner.jl:19
  [28] include(::Function, ::Module, ::String) at ./Base.jl:380
  [29] include(::Module, ::String) at ./Base.jl:368
  [30] exec_options(::Base.JLOptions) at ./client.jl:296
  [31] _start() at ./client.jl:506

This seems to be happening at the start of the process, right? I see a call to CuContext in the stack trace, which should only happen once. The obvious question then: is there enough free memory on this device? You could log that at job start, e.g. with something like the snippet below.
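A minimal sketch of such a check, querying free and total device memory before the script does any real work (it uses CUDA.jl's `available_memory`/`total_memory`, which should exist in v2.3, but double-check against your version):

```julia
using CUDA

# Log free/total device memory at job start, so a failed job leaves a record
# of whether the GPU was already (partially) occupied by someone else.
free_b  = CUDA.available_memory()
total_b = CUDA.total_memory()
@info "GPU memory at job start" free_GiB = free_b / 2^30 total_GiB = total_b / 2^30
```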

There is enough free memory, as the script runs to completion 50% of the time.

That doesn’t tell me whether there was free memory when the job actually failed. Did you check at the time of the failure? Also, does the error happen right at the start of the job, as I asked, or at some later point?

Could it be that other processes are using these GPUs? Please ask your systems people to run nvidia-smi.
Sometimes on HPC clusters there are ‘orphan’ processes left behind when jobs fail; these can keep running and consuming GPU memory yet do not show up in the job queue. You can also check from your own job, see the sketch below.
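If you can’t get the admins to check, the job itself can log which processes are on the GPU right before it touches CUDA; a rough sketch, assuming nvidia-smi is on the PATH of the compute node:

```julia
# Shell out to nvidia-smi and record every compute process on the GPU,
# so a failing job leaves evidence of any orphan processes.
try
    run(`nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv`)
catch err
    @warn "Could not run nvidia-smi" err
end
```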

Yes, the error occurs at the start of the script, even if I just execute CUDA.zeros(1). Originally, I was using PackageCompiler.jl to compile CUDA.jl and Flux.jl into a sysimage on an HPC GPU node and then using that compiled sysimage to run my actual jobs.
Now that I have stopped using the precompiled sysimage, the error no longer seems to occur.
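For reference, the minimal check I mean by "just execute CUDA.zeros(1)" is roughly this (script and sysimage names are placeholders):

```julia
# check_cuda.jl -- minimal CUDA smoke test.
# With `julia --sysimage=custom_sysimage.so check_cuda.jl` this hit the OOM
# error intermittently; with a plain `julia check_cuda.jl` it runs fine.
using CUDA

@info "CUDA functional?" CUDA.functional()
x = CUDA.zeros(1)   # first GPU allocation; CUDA context creation happens lazily here
@info "Allocation succeeded" summary(x)
```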