I am using CUDA.jl v2.3 on an HPC GPU cluster. About half the time, my job fails with an out-of-memory error; the other half, the exact same script runs to completion. Any clue why this is happening?
The full stack trace is:
ERROR: LoadError: CUDA error (code 2, CUDA_ERROR_OUT_OF_MEMORY)
Stacktrace:
[1] throw_api_error(::CUDA.cudaError_enum) at /users/PAS0177/shah1285/.julia/packages/CUDA/YeS8q/lib/cudadrv/error.jl:97
[2] macro expansion at /users/PAS0177/shah1285/.julia/packages/CUDA/YeS8q/lib/cudadrv/error.jl:104 [inlined]
[3] cuDevicePrimaryCtxRetain at /users/PAS0177/shah1285/.julia/packages/CUDA/YeS8q/lib/utils/call.jl:93 [inlined]
[4] CUDA.CuContext(::CUDA.CuPrimaryContext) at /users/PAS0177/shah1285/.julia/packages/CUDA/YeS8q/lib/cudadrv/context/primary.jl:32
[5] context(::CUDA.CuDevice) at /users/PAS0177/shah1285/.julia/packages/CUDA/YeS8q/src/state.jl:257
[6] device! at /users/PAS0177/shah1285/.julia/packages/CUDA/YeS8q/src/state.jl:294 [inlined]
[7] device! at /users/PAS0177/shah1285/.julia/packages/CUDA/YeS8q/src/state.jl:273 [inlined]
[8] initialize_thread(::Int64) at /users/PAS0177/shah1285/.julia/packages/CUDA/YeS8q/src/state.jl:122
[9] prepare_cuda_call() at /users/PAS0177/shah1285/.julia/packages/CUDA/YeS8q/src/state.jl:80
[10] device at /users/PAS0177/shah1285/.julia/packages/CUDA/YeS8q/src/state.jl:227 [inlined]
[11] alloc at /users/PAS0177/shah1285/.julia/packages/CUDA/YeS8q/src/pool.jl:293 [inlined]
[12] CUDA.CuArray{Float32,2}(::UndefInitializer, ::Tuple{Int64,Int64}) at /users/PAS0177/shah1285/.julia/packages/CUDA/YeS8q/src/array.jl:20
[13] CuArray at /users/PAS0177/shah1285/.julia/packages/CUDA/YeS8q/src/array.jl:76 [inlined]
[14] similar at ./abstractarray.jl:675 [inlined]
[15] convert at /users/PAS0177/shah1285/.julia/packages/GPUArrays/ZxsKE/src/host/construction.jl:82 [inlined]
[16] adapt_storage at /users/PAS0177/shah1285/.julia/packages/CUDA/YeS8q/src/array.jl:330 [inlined]
[17] adapt_structure at /users/PAS0177/shah1285/.julia/packages/Adapt/8kQMV/src/Adapt.jl:42 [inlined]
[18] adapt at /users/PAS0177/shah1285/.julia/packages/Adapt/8kQMV/src/Adapt.jl:40 [inlined]
[19] cu at /users/PAS0177/shah1285/.julia/packages/CUDA/YeS8q/src/array.jl:342 [inlined]
[20] fmap(::typeof(CUDA.cu), ::Array{Float64,2}; cache::IdDict{Any,Any}) at /users/PAS0177/shah1285/.julia/packages/Functors/YlETM/src/functor.jl:35
[21] fmap at /users/PAS0177/shah1285/.julia/packages/Functors/YlETM/src/functor.jl:34 [inlined]
[22] gpu at /users/PAS0177/shah1285/.julia/packages/Flux/q3zeA/src/functor.jl:74 [inlined]
[23] |>(::Array{Float64,2}, ::typeof(Flux.gpu)) at ./operators.jl:834
[24] getdata_reg(::Int64, ::Int64, ::Float64) at /fs/scratch/PAS0177/NNFitter/src/regression.jl:25
[25] train_reg(::Int64, ::Int64, ::Float64; epochs::Int64) at /fs/scratch/PAS0177/NNFitter/src/regression.jl:55
[26] main() at /fs/scratch/PAS0177/NNFitter/scripts/trainner.jl:15
[27] top-level scope at /fs/scratch/PAS0177/NNFitter/scripts/trainner.jl:19
[28] include(::Function, ::Module, ::String) at ./Base.jl:380
[29] include(::Module, ::String) at ./Base.jl:368
[30] exec_options(::Base.JLOptions) at ./client.jl:296
[31] _start() at ./client.jl:506
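For reference, here is a minimal sketch of the call pattern the stack trace points at: a host `Array{Float64,2}` is moved to the GPU with Flux's `gpu`, which converts it to a `Float32` `CuArray` (frames 12–23 above). The matrix size below is a placeholder; the real data comes from `getdata_reg` in regression.jl, which I have not included here.

```julia
using Flux, CUDA

# Host data, analogous to what getdata_reg produces (dimensions are made up).
X = rand(Float64, 1024, 1024)

# First GPU touch: this retains the primary CUDA context and allocates
# device memory. In my case, this is the line where CUDA_ERROR_OUT_OF_MEMORY
# is intermittently thrown, already at cuDevicePrimaryCtxRetain.
Xg = X |> gpu
```

Note that the failure happens inside `cuDevicePrimaryCtxRetain`, i.e. while the CUDA context is first being set up, not during a later large allocation in my own code.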