`CUDA error: out of memory` with Flux

I already posted about this issue here.
After moving to Julia 1.5, I still get a similar error.
When I run yiyu-test.jl I get

ERROR: LoadError: CUDA error: out of memory (code 2, ERROR_OUT_OF_MEMORY)
Stacktrace:
 [1] throw_api_error(::CUDA.cudaError_enum) at /home/natale/.julia/packages/CUDA/d6WNR/lib/cudadrv/error.jl:103
 [2] CUDA.CuModule(::String, ::Dict{CUDA.CUjit_option_enum,Any}) at /home/natale/.julia/packages/CUDA/d6WNR/lib/cudadrv/module.jl:42
 [3] _cufunction(::GPUCompiler.FunctionSpec{CUDA.var"#kernel#871"{CUDA.var"#877#878"{Float32}},Tuple{CUDA.CuDeviceArray{Int64,1,CUDA.AS.Global},CUDA.CuDeviceArray{Float32,2,CUDA.AS.Global}}}; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /home/natale/.julia/packages/CUDA/d6WNR/src/compiler/execution.jl:337
 [4] _cufunction at /home/natale/.julia/packages/CUDA/d6WNR/src/compiler/execution.jl:304 [inlined]
 [5] check_cache(::typeof(CUDA._cufunction), ::GPUCompiler.FunctionSpec{CUDA.var"#kernel#871"{CUDA.var"#877#878"{Float32}},Tuple{CUDA.CuDeviceArray{Int64,1,CUDA.AS.Global},CUDA.CuDeviceArray{Float32,2,CUDA.AS.Global}}}, ::UInt64; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /home/natale/.julia/packages/GPUCompiler/rABm5/src/cache.jl:24
 [6] kernel at /home/natale/.julia/packages/CUDA/d6WNR/src/indexing.jl:102 [inlined]
 [7] cached_compilation(::typeof(CUDA._cufunction), ::GPUCompiler.FunctionSpec{CUDA.var"#kernel#871"{CUDA.var"#877#878"{Float32}},Tuple{CUDA.CuDeviceArray{Int64,1,CUDA.AS.Global},CUDA.CuDeviceArray{Float32,2,CUDA.AS.Global}}}, ::UInt64; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /home/natale/.julia/packages/GPUCompiler/rABm5/src/cache.jl:0
 [8] cached_compilation at /home/natale/.julia/packages/GPUCompiler/rABm5/src/cache.jl:40 [inlined]
 [9] cufunction(::CUDA.var"#kernel#871"{CUDA.var"#877#878"{Float32}}, ::Type{Tuple{CUDA.CuDeviceArray{Int64,1,CUDA.AS.Global},CUDA.CuDeviceArray{Float32,2,CUDA.AS.Global}}}; name::String, kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /home/natale/.julia/packages/CUDA/d6WNR/src/compiler/execution.jl:298
 [10] macro expansion at /home/natale/.julia/packages/CUDA/d6WNR/src/compiler/execution.jl:109 [inlined]
 [11] findfirst(::CUDA.var"#877#878"{Float32}, ::CUDA.CuArray{Float32,2}) at /home/natale/.julia/packages/CUDA/d6WNR/src/indexing.jl:120
 [12] #findmax#876 at /home/natale/.julia/packages/CUDA/d6WNR/src/indexing.jl:143 [inlined]
 [13] findmax(::CUDA.CuArray{Float32,2}) at /home/natale/.julia/packages/CUDA/d6WNR/src/indexing.jl:141
 [14] (::var"#7#8"{CUDA.CuArray{Float32,4}})(::Int64) at ./none:0
 [15] iterate at ./generator.jl:47 [inlined]
 [16] collect_to!(::Array{Int64,1}, ::Base.Generator{UnitRange{Int64},var"#7#8"{CUDA.CuArray{Float32,4}}}, ::Int64, ::Int64) at ./array.jl:732
 [17] collect_to_with_first!(::Array{Int64,1}, ::Int64, ::Base.Generator{UnitRange{Int64},var"#7#8"{CUDA.CuArray{Float32,4}}}, ::Int64) at ./array.jl:710
 [18] collect(::Base.Generator{UnitRange{Int64},var"#7#8"{CUDA.CuArray{Float32,4}}}) at ./array.jl:691
 [19] max_pred(::CUDA.CuArray{Float32,4}) at /home/natale/brainside/transflearn/yiyu-test.jl:26
 [20] accuracy(::CUDA.CuArray{Float32,4}, ::Flux.OneHotMatrix{CUDA.CuArray{Flux.OneHotVector,1}}) at /home/natale/brainside/transflearn/yiyu-test.jl:28
 [21] top-level scope at show.jl:641
 [22] top-level scope at /home/natale/brainside/transflearn/yiyu-test.jl:46
 [23] include(::Function, ::Module, ::String) at ./Base.jl:380
 [24] include(::Module, ::String) at ./Base.jl:368
 [25] exec_options(::Base.JLOptions) at ./client.jl:296
 [26] _start() at ./client.jl:506
in expression starting at /home/natale/brainside/transflearn/yiyu-test.jl:36

I posted the latest version of my code in this gist.

Well… the code now runs with batches of size 10 rather than 1000.
I vaguely remember trying even with a batch size of 1 and still getting the same error, but I didn’t save that example, so I’ll remain wondering…
By the way, I was checking GPU memory usage with nvidia-smi and, independently of the batch size, the code leaves only a few MB of RAM available while running the main loop, which I copy here for easy reference:

for epoch = 1:epochs
    @info "epoch" epoch
    for i in 1:batchnum
        # move the current batch to the GPU, compute gradients, and update
        batch = trainset[i] |> gpu
        gs = gradient(params(m)) do
            loss(batch...)
        end
        @info "batch fraction" i/batchnum
        update!(opt, params(m), gs)
    end
    @show accuracy(valX, valY)
end

The error is raised when trying to run accuracy.
I’m now doing a binary search to find the critical batch size, for what it’s worth.
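In the meantime, one workaround I might try is computing the validation accuracy in chunks, so the forward pass never has to hold activations for all of valX at once. Rough sketch only (the helper name, the chunk size, and the use of Flux.onecold instead of my findmax-based max_pred are just for illustration, assuming valX stores observations along the last dimension):

using Flux: onecold, cpu

# Sketch: evaluate the model on slices of the validation set instead of
# all of valX at once, so intermediate results stay small.
function chunked_accuracy(m, valX, valY; chunk = 100)
    labels = onecold(cpu(valY))      # integer class labels, computed once
    n = size(valX, 4)                # observations along the last dimension
    correct = 0
    for idx in Iterators.partition(1:n, chunk)
        preds = onecold(cpu(m(valX[:, :, :, idx])))
        correct += sum(preds .== labels[idx])
    end
    return correct / n
end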

On a Quadro RTX 4000, the code turns out to work with batch sizes up to 37.

Have a look at https://juliagpu.gitlab.io/CUDA.jl/usage/memory/#Batching-iterator as well. The automatic memory reclamation there might allow for a larger batch size.
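Concretely, something like this (a sketch, assuming trainset is a plain collection of CPU-side batch tuples): CuIterator from CUDA.jl uploads each batch to the GPU on demand and eagerly frees the previous one, instead of waiting for the garbage collector to reclaim it:

using CUDA  # exports CuIterator

for epoch = 1:epochs
    for batch in CuIterator(trainset)
        gs = gradient(params(m)) do
            loss(batch...)
        end
        update!(opt, params(m), gs)
    end
    @show accuracy(valX, valY)
end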


Thanks, that’s the reading I was looking for 🙂