CUDA_ERROR_ILLEGAL_ADDRESS with CuArrays/Zygote getindex

Hello, I am getting a CuError(CUDA_ERROR_ILLEGAL_ADDRESS) with the stacktrace below. My code is taking gradients of a deep learning model with Zygote. I wonder whether this is a symptom of running out of memory, or a bug in CuArrays/Zygote?

The line where it fails in Zygote corresponds to ∇getindex. My code indexes into a CuArray with a vector of integers that lives in host memory, like this (a minimal sketch of the pattern follows the stacktrace below):

x[idc, :] # typeof(x) == CuArray{Float32, 2}; typeof(idc) == Vector{Int}
ERROR: LoadError: CUDA error: an illegal memory access was encountered (code 700, ERROR_ILLEGAL_ADDRESS)
Stacktrace:
 [1] cuMemcpyHtoD_v2(::CUDAdrv.CuPtr{Int64}, ::Ptr{Int64}, ::Int64) at /rds/general/user/et517/home/.julia/packages/CUDAdrv/3EzC1/src/error.jl:123
 [2] #unsafe_copyto!#6 at /rds/general/user/et517/home/.julia/packages/CUDAdrv/3EzC1/src/memory.jl:285 [inlined]
 [3] unsafe_copyto! at /rds/general/user/et517/home/.julia/packages/CUDAdrv/3EzC1/src/memory.jl:278 [inlined]
 [4] copyto! at /rds/general/user/et517/home/.julia/packages/CuArrays/ZYCpV/src/array.jl:254 [inlined]
 [5] copyto!(::CuArrays.CuArray{Int64,1,Nothing}, ::Array{Int64,1}) at /rds/general/user/et517/home/.julia/packages/GPUArrays/1wgPO/src/abstractarray.jl:118
 [6] convert(::Type{CuArrays.CuArray}, ::Array{Int64,1}) at /rds/general/user/et517/home/.julia/packages/GPUArrays/1wgPO/src/construction.jl:84
 [7] _adapt_structure at /rds/general/user/et517/home/.julia/packages/CuArrays/ZYCpV/src/array.jl:237 [inlined]
 [8] adapt_structure at /rds/general/user/et517/home/.julia/packages/Adapt/aeQPS/src/base.jl:12 [inlined]
 [9] adapt at /rds/general/user/et517/home/.julia/packages/Adapt/aeQPS/src/Adapt.jl:6 [inlined]
 [10] adapt_structure(::CUDAnative.Adaptor, ::SubArray{Float32,2,CuArrays.CuArray{Float32,2,Nothing},Tuple{Array{Int64,1},Base.Slice{Base.OneTo{Int64}}},false}) at /rds/general/user/et517/home/.julia/packages/CuArrays/ZYCpV/src/subarray.jl:63
 [11] adapt at /rds/general/user/et517/home/.julia/packages/Adapt/aeQPS/src/Adapt.jl:6 [inlined]
 [12] cudaconvert at /rds/general/user/et517/home/.julia/packages/CUDAnative/RhbZ0/src/execution.jl:211 [inlined]
 [13] map at ./tuple.jl:141 [inlined]
 [14] macro expansion at /rds/general/user/et517/home/.julia/packages/CUDAnative/RhbZ0/src/execution.jl:174 [inlined]
 [15] macro expansion at ./gcutils.jl:87 [inlined]
 [16] macro expansion at /rds/general/user/et517/home/.julia/packages/CUDAnative/RhbZ0/src/execution.jl:173 [inlined]
 [17] _gpu_call(::CuArrays.CuArrayBackend, ::Function, ::SubArray{Float32,2,CuArrays.CuArray{Float32,2,Nothing},Tuple{Array{Int64,1},Base.Slice{Base.OneTo{Int64}}},false}, ::Tuple{SubArray{Float32,2,CuArrays.CuArray{Float32,2,Nothing},Tuple{Array{Int64,1},Base.Slice{Base.OneTo{Int64}}},false},Base.Broadcast.Broadcasted{Nothing,Tuple{Base.OneTo{Int64},Base.OneTo{Int64}},typeof(+),Tuple{Base.Broadcast.Extruded{SubArray{Float32,2,CuArrays.CuArray{Float32,2,Nothing},Tuple{Array{Int64,1},Base.Slice{Base.OneTo{Int64}}},false},Tuple{Bool,Bool},Tuple{Int64,Int64}},Base.Broadcast.Extruded{CuArrays.CuArray{Float32,2,Nothing},Tuple{Bool,Bool},Tuple{Int64,Int64}}}}}, ::Tuple{Tuple{Int64},Tuple{Int64}}) at /rds/general/user/et517/home/.julia/packages/CuArrays/ZYCpV/src/gpuarray_interface.jl:62
 [18] gpu_call at /rds/general/user/et517/home/.julia/packages/GPUArrays/1wgPO/src/abstract_gpu_interface.jl:151 [inlined]
 [19] gpu_call at /rds/general/user/et517/home/.julia/packages/GPUArrays/1wgPO/src/abstract_gpu_interface.jl:128 [inlined]
 [20] copyto! at /rds/general/user/et517/home/.julia/packages/GPUArrays/1wgPO/src/broadcast.jl:48 [inlined]
 [21] copyto! at ./broadcast.jl:842 [inlined]
 [22] materialize! at ./broadcast.jl:801 [inlined]
 [23] (::getfield(Zygote, Symbol("##984#986")){CuArrays.CuArray{Float32,2,CuArrays.CuArray{Float32,4,Nothing}},Tuple{Array{Int64,1},Colon}})(::CuArrays.CuArray{Float32,2,Nothing}) at /rds/general/user/et517/home/.julia/packages/Zygote/N2BNN/src/lib/array.jl:38
...
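
For reference, a minimal sketch of the failing pattern (the sizes and names here are made up for illustration; the real model is larger):

using CuArrays, Zygote

x   = cu(rand(Float32, 128, 64))   # CuArray{Float32,2} on the GPU
idc = rand(1:128, 32)              # Vector{Int} in host memory

loss(x) = sum(x[idc, :])           # forward pass: getindex with CPU indices
g, = gradient(loss, x)             # backward pass goes through Zygote's ∇getindex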

Don’t you see a kernel exception some time earlier? If you do, you can run with julia -g2 to see more details. If you don’t, try running with --check-bounds=yes.
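
For example (train.jl here is just a placeholder for your script):

julia -g2 --check-bounds=yes train.jl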

I get lots of error messages like this:

error in running finalizer: CUDAdrv.CuError(code=CUDAdrv.cudaError_enum(0x000002bc), meta=nothing)

I have tried running julia -g2 --check-bounds=yes but got the same error.

This must somehow be related to running out of memory, because the error goes away when I reduce the minibatch size.
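
One way to test the memory hypothesis is to check how much device memory is free right before the failing call (a sketch; CUDAdrv.Mem.info() returns the free and total device memory in bytes):

using CUDAdrv

free, total = CUDAdrv.Mem.info()
println("free: ", free ÷ 2^20, " MiB of ", total ÷ 2^20, " MiB")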

CUDA errors are sticky: once the context has hit an illegal memory access, subsequent API calls keep reporting the same failure. So the errors you are seeing in the finalizers (code 0x2bc, i.e. 700) are the same CUDA_ERROR_ILLEGAL_ADDRESS you caught earlier, as reported in your first post.

Are you switching devices, perhaps?

I am not switching devices, although the server where I run this and see the error does have multiple GPUs.
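
To rule out accidental device switching on the multi-GPU server, the process can be pinned to a single GPU (a sketch; the device index and script name are placeholders):

# restrict the process to one GPU before starting Julia
CUDA_VISIBLE_DEVICES=0 julia train.jl

# or select the device explicitly from Julia, before any GPU work
using CUDAnative
CUDAnative.device!(0)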

When I run this on a less powerful machine with a single GPU and less memory, I get a nice “Out of GPU memory” error.