Out-of-bounds error when calling map!

I’m using CUDA.jl to carry out some diffusion simulations. It’s a pretty standard setup, but it makes fairly heavy use of rem and mod to bin particles to cells in a grid. To this end, we call map! on a pair of Float32 CuArrays, e.g.:

N = 1000000
u = 1e-5  # or something
X = curand(Float32, N)
X1 = CUDA.zeros(Float32, N)
map!(x -> CUDA.rem(x, u), X1, X)

We have to use this somewhat odd call because, e.g., X1 .= rem.(X, u) doesn’t actually produce the same result (the revenge of issue 748?).

Anyway, this does work as intended, but it intermittently fails with an inexplicable out-of-bounds error. How can map! even go out of bounds? And why does this only happen sometimes?

I’ve verified this happens on both a GTX 1660 OC and a Tesla V100, with the same error.

ERROR: Out-of-bounds array access.
ERROR: a exception was thrown during kernel execution.
       Run Julia on debug level 2 for device stack traces.
ERROR: LoadError: KernelException: exception thrown during kernel execution on device GeForce GTX 1660
Stacktrace:
  [1] check_exceptions()
    @ CUDA C:\Users\James\.julia\packages\CUDA\9T5Sq\src\compiler\exceptions.jl:37
  [2] device_synchronize
    @ C:\Users\James\.julia\packages\CUDA\9T5Sq\lib\cudadrv\context.jl:322 [inlined]
  [3] CuModule(data::Vector{UInt8}, options::Dict{CUDA.CUjit_option_enum, Any})
    @ CUDA C:\Users\James\.julia\packages\CUDA\9T5Sq\lib\cudadrv\module.jl:41
  [4] CuModule
    @ C:\Users\James\.julia\packages\CUDA\9T5Sq\lib\cudadrv\module.jl:23 [inlined]
  [5] cufunction_link(job::GPUCompiler.CompilerJob, compiled::NamedTuple{(:image, :entry, :external_gvars), Tuple{Vector{UInt8}, String, Vector{String}}})
    @ CUDA C:\Users\James\.julia\packages\CUDA\9T5Sq\src\compiler\execution.jl:442
  [6] cached_compilation(cache::Dict{UInt64, Any}, job::GPUCompiler.CompilerJob, compiler::typeof(CUDA.cufunction_compile), linker::typeof(CUDA.cufunction_link))
    @ GPUCompiler C:\Users\James\.julia\packages\GPUCompiler\fG3xK\src\cache.jl:94
  [7] cufunction(f::GPUArrays.var"#map_kernel#18"{Int64}, tt::Type{Tuple{CUDA.CuKernelContext, CuDeviceVector{Float32, 1}, Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{1}, Tuple{Base.OneTo{Int64}}, DiffusionSimulator.var"#43#49"{Float32}, Tuple{Base.Broadcast.Extruded{CuDeviceVector{Float32, 1}, Tuple{Bool}, Tuple{Int64}}}}, Int64}}; name::Nothing, kwargs::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ CUDA C:\Users\James\.julia\packages\CUDA\9T5Sq\src\compiler\execution.jl:288
  [8] cufunction
    @ C:\Users\James\.julia\packages\CUDA\9T5Sq\src\compiler\execution.jl:282 [inlined]
  [9] macro expansion
    @ C:\Users\James\.julia\packages\CUDA\9T5Sq\src\compiler\execution.jl:102 [inlined]
 [10] #launch_heuristic#233
    @ C:\Users\James\.julia\packages\CUDA\9T5Sq\src\gpuarrays.jl:17 [inlined]
 [11] map!(f::Function, dest::CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, xs::CuArray{Float32, 1, CUDA.Mem.DeviceBuffer})
    @ GPUArrays C:\Users\James\.julia\packages\GPUArrays\UBzTm\src\host\broadcast.jl:130
 [12] diff_sim_gpu(I::Matrix{Int32}, seq::Seq, simu::Simu)
    @ DiffusionSimulator c:\Users\James\.julia\dev\DiffusionSimulator\src\DiffusionSimulator.jl:189
 [13] top-level scope
    @ c:\Users\James\.julia\dev\DiffusionSimulator\test\runtests.jl:70

This works fine here. I had to change a couple of things: CUDA.rem doesn’t exist anymore; it’s now just rem from Base (but with the familiar Julia semantics, not those of CUDA.rem before https://github.com/JuliaGPU/CUDA.jl/issues/1040). curand doesn’t exist anymore either; it’s now just CUDA.rand. And u needs to be a constant if you want this to work in the REPL.
So I’m thinking you’re using an old version of CUDA.jl. Could you try with an up-to-date version?

Also, with the correct rem semantics there isn’t a difference between this map! and X1 .= rem.(X, u).
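For reference, here’s roughly what I ran on a current CUDA.jl (a sketch; the sizes and the value of u are just placeholders, and u is a constant so the REPL closure is type-stable):

using CUDA

const u = 1f-5                      # constant, so the closure captures a concrete type
X = CUDA.rand(Float32, 1_000_000)   # replaces the old curand
X1 = CUDA.zeros(Float32, 1_000_000)

map!(x -> rem(x, u), X1, X)         # plain Base.rem nowadays
X1 .= rem.(X, u)                    # should be equivalent with the fixed semantics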


I lazily wrote curand as a stand-in for random values; the actual code uses CUDA.rand, on CUDA.jl v3.4.2.

u is set within the function these are called from. I’m still getting very different results when substituting
map!(x->rem(x, u), X1, X) with X1 .= rem.(X, u), though the latter doesn’t error.
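For what it’s worth, the comparison showing the difference looks roughly like this (a sketch; in the real code u is a local inside the simulation function):

ref = rem.(Array(X), u)                         # host-side reference
gpu1 = Array(map!(x -> rem(x, u), X1, X))       # the map! form
gpu2 = Array(rem.(X, u))                        # the broadcast form
@show count(gpu1 .!= ref) count(gpu2 .!= ref)   # mismatch counts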

I’m trying to write a minimum working example of this happening (and failing).

Edit: I tried replacing the line with the suggested broadcast version (which still gives incorrect results), and added a device_synchronize to see if there was some cross-talk. Instead, this produced an identical error, but on a different line:

X1 .= rem.(X, u)
CUDA.device_synchronize()

Which produced the following error:

ERROR: Out-of-bounds array access.
ERROR: Out-of-bounds array access.
ERROR: a exception was thrown during kernel execution.
       Run Julia on debug level 2 for device stack traces.
ERROR: a exception was thrown during kernel execution.
       Run Julia on debug level 2 for device stack traces.
ERROR: LoadError: KernelException: exception thrown during kernel execution on device GeForce GTX 1660
Stacktrace:
 [1] check_exceptions()
   @ CUDA C:\Users\James\.julia\packages\CUDA\9T5Sq\src\compiler\exceptions.jl:37
 [2] device_synchronize
   @ C:\Users\James\.julia\packages\CUDA\9T5Sq\lib\cudadrv\context.jl:322 [inlined]
 [3] diff_sim_gpu(I::Matrix{Int32}, seq::Seq, simu::Simu)
   @ DiffusionSimulator c:\Users\James\.julia\dev\DiffusionSimulator\src\DiffusionSimulator.jl:196
 [4] top-level scope
   @ c:\Users\James\.julia\dev\DiffusionSimulator\test\runtests.jl:70
in expression starting at c:\Users\James\.julia\dev\DiffusionSimulator\test\runtests.jl:70

I have no idea why this line should error.

GPU execution is asynchronous, so errors trigger on the first synchronizing API call. So if you call device_synchronize, the host-side stack trace will change. Try running with -g2, as the error suggests.
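To illustrate why the host-side trace points somewhere unrelated, a minimal sketch (the kernel here is hypothetical, just to force a device-side exception):

using CUDA

function faulty_kernel(a)
    a[length(a) + 1] = 0f0  # deliberate out-of-bounds write on the device
    return
end

a = CUDA.zeros(Float32, 10)
@cuda faulty_kernel(a)   # the launch itself returns without an error
# ...arbitrary host code keeps executing here...
CUDA.synchronize()       # the KernelException only surfaces at a synchronizing call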

This is a very peculiar issue. It’s related to your specific set-up, as I can’t reproduce it, and these fundamental operations are thoroughly tested on CI. Could you provide some more details (CUDA version, GPU compute capability, etc.)? Please post the output of CUDA.versioninfo().

Also try CUDA.jl#master.

That behaviour makes sense to me. When running Julia on debug level 2, we get a very different error, namely a PTX compile error:

ERROR: LoadError: Failed to compile PTX code (ptxas exited with code 1)
Error in get_form_value default
If you think this is a bug, please file an issue and attach C:\Users\James\AppData\Local\Temp\jl_FeG1alSeLn.ptx
Stacktrace:
  [1] error(s::String)
    @ Base .\error.jl:33
  [2] cufunction_compile(job::GPUCompiler.CompilerJob)
    @ CUDA C:\Users\James\.julia\packages\CUDA\9T5Sq\src\compiler\execution.jl:390
  [3] cached_compilation(cache::Dict{UInt64, Any}, job::GPUCompiler.CompilerJob, compiler::typeof(CUDA.cufunction_compile), linker::typeof(CUDA.cufunction_link))
    @ GPUCompiler C:\Users\James\.julia\packages\GPUCompiler\fG3xK\src\cache.jl:89
  [4] cufunction(f::GPUArrays.var"#map_kernel#18"{Int64}, tt::Type{Tuple{CUDA.CuKernelContext, CuDeviceVector{Float32, 1}, Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{1}, Tuple{Base.OneTo{Int64}}, DiffusionSimulator.var"#19#25"{Float32}, Tuple{Base.Broadcast.Extruded{CuDeviceVector{Float32, 1}, Tuple{Bool}, Tuple{Int64}}}}, Int64}}; name::Nothing, kwargs::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ CUDA C:\Users\James\.julia\packages\CUDA\9T5Sq\src\compiler\execution.jl:288
  [5] cufunction
    @ C:\Users\James\.julia\packages\CUDA\9T5Sq\src\compiler\execution.jl:282 [inlined]
  [6] macro expansion
    @ C:\Users\James\.julia\packages\CUDA\9T5Sq\src\compiler\execution.jl:102 [inlined]
  [7] #launch_heuristic#233
    @ C:\Users\James\.julia\packages\CUDA\9T5Sq\src\gpuarrays.jl:17 [inlined]
  [8] map!(f::Function, dest::CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, xs::CuArray{Float32, 1, CUDA.Mem.DeviceBuffer})
    @ GPUArrays C:\Users\James\.julia\packages\GPUArrays\UBzTm\src\host\broadcast.jl:130
  [9] diff_sim_gpu(I::Matrix{Int32}, seq::Seq, simu::Simu)
    @ DiffusionSimulator C:\Users\James\.julia\dev\DiffusionSimulator\src\DiffusionSimulator.jl:189
 [10] top-level scope
    @ C:\Users\James\.julia\dev\DiffusionSimulator\test\runtests.jl:70
 [11] include(fname::String)
    @ Base.MainInclude .\client.jl:444
 [12] top-level scope
    @ REPL[4]:1
in expression starting at C:\Users\James\.julia\dev\DiffusionSimulator\test\runtests.jl:70

The offending line (DiffusionSimulator.jl:189) is map!(x->rem(x, u), X1, X). Switching to the broadcast form still changes the simulation results drastically.

CUDA.versioninfo() returns:

CUDA toolkit 11.4.1, artifact installation
CUDA driver 11.4.0
NVIDIA driver 472.12.0

Libraries:
- CUBLAS: 11.5.4
- CURAND: 10.2.5
- CUFFT: 10.5.1
- CUSOLVER: 11.2.0
- CUSPARSE: 11.6.0
- CUPTI: 14.0.0
- NVML: 11.0.0+472.12
- CUDNN: 8.20.2 (for CUDA 11.4.0)
- CUTENSOR: 1.3.0 (for CUDA 11.2.0)

Toolchain:
- Julia: 1.6.2
- LLVM: 11.0.1
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0
- Device capability support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80

1 device:
  0: NVIDIA GeForce GTX 1660 (sm_75, 684.465 MiB / 6.000 GiB available)

We get the same error on the GPU server, but I’ll post this first and then check the behaviour there under the same conditions.

The only other detail I can think of that’s relevant is that we call this function twice (and sequentially). I also see that we’re referring to a cached compilation, and that this is an anonymous function. Could ptxas be trying to optimise the function twice and tripping itself up?

Ah, that’s a known bug in CUDA. Try running with -g2 but with the JULIA_CUDA_DEBUG_INFO environment variable set to false.
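i.e., something like this, set before any kernels get compiled (a sketch):

ENV["JULIA_CUDA_DEBUG_INFO"] = "false"  # skip debug-info emission, avoiding the ptxas bug
using CUDA
# then run the failing code in a session started with julia -g2;
# device backtraces should still come through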

And now back to OOB (on the same line):

ERROR: Out-of-bounds array access.
ERROR: a exception was thrown during kernel execution.
Stacktrace:
 [1] throw_boundserror at C:\Users\James\.julia\packages\CUDA\9T5Sq\src\device\quirks.jl:4
 [2] checkbounds at .\abstractarray.jl:616
 [3] arrayref at C:\Users\James\.julia\packages\CUDA\9T5Sq\src\device\array.jl:109
 [4] getindex at C:\Users\James\.julia\packages\CUDA\9T5Sq\src\device\array.jl:190
 [5] _getindex at .\abstractarray.jl:1193
 [6] getindex at .\abstractarray.jl:1170
 [7] macro expansion at C:\Users\James\.julia\packages\GPUArrays\UBzTm\src\host\indexing.jl:135
 [8] getindex_kernel at C:\Users\James\.julia\packages\GPUArrays\UBzTm\src\host\indexing.jl:129
ERROR: LoadError: KernelException: exception thrown during kernel execution on device NVIDIA GeForce GTX 1660
Stacktrace:
  [1] check_exceptions()
    @ CUDA C:\Users\James\.julia\packages\CUDA\9T5Sq\src\compiler\exceptions.jl:37
  [2] device_synchronize
    @ C:\Users\James\.julia\packages\CUDA\9T5Sq\lib\cudadrv\context.jl:322 [inlined]
  [3] CuModule(data::Vector{UInt8}, options::Dict{CUDA.CUjit_option_enum, Any})
    @ CUDA C:\Users\James\.julia\packages\CUDA\9T5Sq\lib\cudadrv\module.jl:41
  [4] CuModule
    @ C:\Users\James\.julia\packages\CUDA\9T5Sq\lib\cudadrv\module.jl:23 [inlined]
  [5] cufunction_link(job::GPUCompiler.CompilerJob, compiled::NamedTuple{(:image, :entry, :external_gvars), Tuple{Vector{UInt8}, String, Vector{String}}})
    @ CUDA C:\Users\James\.julia\packages\CUDA\9T5Sq\src\compiler\execution.jl:442
  [6] cached_compilation(cache::Dict{UInt64, Any}, job::GPUCompiler.CompilerJob, compiler::typeof(CUDA.cufunction_compile), linker::typeof(CUDA.cufunction_link))
    @ GPUCompiler C:\Users\James\.julia\packages\GPUCompiler\fG3xK\src\cache.jl:94
  [7] cufunction(f::GPUArrays.var"#map_kernel#18"{Int64}, tt::Type{Tuple{CUDA.CuKernelContext, CuDeviceVector{Float32, 1}, Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{1}, Tuple{Base.OneTo{Int64}}, DiffusionSimulator.var"#51#57"{Float32}, Tuple{Base.Broadcast.Extruded{CuDeviceVector{Float32, 1}, Tuple{Bool}, Tuple{Int64}}}}, Int64}}; name::Nothing, kwargs::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})    
    @ CUDA C:\Users\James\.julia\packages\CUDA\9T5Sq\src\compiler\execution.jl:288
  [8] cufunction
    @ C:\Users\James\.julia\packages\CUDA\9T5Sq\src\compiler\execution.jl:282 [inlined]
  [9] macro expansion
    @ C:\Users\James\.julia\packages\CUDA\9T5Sq\src\compiler\execution.jl:102 [inlined]
 [10] #launch_heuristic#233
    @ C:\Users\James\.julia\packages\CUDA\9T5Sq\src\gpuarrays.jl:17 [inlined]
 [11] map!(f::Function, dest::CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, xs::CuArray{Float32, 1, CUDA.Mem.DeviceBuffer})
    @ GPUArrays C:\Users\James\.julia\packages\GPUArrays\UBzTm\src\host\broadcast.jl:130
 [12] diff_sim_gpu(I::Matrix{Int32}, seq::Seq, simu::Simu)
    @ DiffusionSimulator C:\Users\James\.julia\dev\DiffusionSimulator\src\DiffusionSimulator.jl:189
 [13] top-level scope
    @ C:\Users\James\.julia\dev\DiffusionSimulator\test\runtests.jl:70
 [14] include(fname::String)
    @ Base.MainInclude .\client.jl:444
 [15] top-level scope
    @ REPL[9]:1
 [16] top-level scope
    @ C:\Users\James\.julia\packages\CUDA\9T5Sq\src\initialization.jl:66
in expression starting at C:\Users\James\.julia\dev\DiffusionSimulator\test\runtests.jl:70

Yes, but with a device backtrace (which doesn’t reveal anything interesting, unfortunately).

Could you try with CUDA.jl#master? There have been some changes to the compiler. (This might be a ptxas miscompilation I’ve seen before, but never on sm_75 hardware.)

Dumb question: can I choose the version I’m asking for in the package manager without calling add https://github.com/JuliaGPU/CUDA.jl?

You can do add CUDA#master (which you can even do in a temporary environment to avoid messing with your set-up), or git clone the package and run julia --project from the cloned directory.
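e.g., from the Pkg REPL (press ] first):

pkg> activate --temp   # throwaway environment, discarded when Julia exits
pkg> add CUDA#master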

This did initially produce a different behaviour: the following error was printed each time the simulation stepped forward in time (i.e., on each iteration of the top-level loop):

ERROR: Out-of-bounds array access.
ERROR: a exception was thrown during kernel execution.
Stacktrace:
 [1] throw_boundserror at C:\Users\James\.julia\packages\CUDA\O0mym\src\device\quirks.jl:4
 [2] checkbounds at .\abstractarray.jl:616
 [3] arrayref at C:\Users\James\.julia\packages\CUDA\O0mym\src\device\array.jl:109
 [4] getindex at C:\Users\James\.julia\packages\CUDA\O0mym\src\device\array.jl:192
 [5] macro expansion at C:\Users\James\.julia\packages\GPUArrays\0vqbc\src\host\indexing.jl:135
 [6] getindex_kernel at C:\Users\James\.julia\packages\GPUArrays\0vqbc\src\host\indexing.jl:129
WARNING: could not signal exception status to the host, execution will continue.
         Please file a bug.

Despite the repeated errors, the code did actually return the correct result. When run again, we now get a ptxas error. I’ve set ENV["JULIA_CUDA_DEBUG_INFO"] = false, so it shouldn’t be that this time.

ERROR: LoadError: Failed to compile PTX code (ptxas exited with code 1)
Error in get_form_value default
If you think this is a bug, please file an issue and attach C:\Users\James\AppData\Local\Temp\jl_udi9tr1Cqn.ptx
Stacktrace:
  [1] error(s::String)
    @ Base .\error.jl:33
  [2] cufunction_compile(job::GPUCompiler.CompilerJob)
    @ CUDA C:\Users\James\.julia\packages\CUDA\O0mym\src\compiler\execution.jl:399
  [3] cached_compilation(cache::Dict{UInt64, Any}, job::GPUCompiler.CompilerJob, compiler::typeof(CUDA.cufunction_compile), linker::typeof(CUDA.cufunction_link))
    @ GPUCompiler C:\Users\James\.julia\packages\GPUCompiler\7APUC\src\cache.jl:89
  [4] cufunction(f::GPUArrays.var"#map_kernel#18"{Int64}, tt::Type{Tuple{CUDA.CuKernelContext, CuDeviceVector{Float32, 1}, Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{1}, Tuple{Base.OneTo{Int64}}, DiffusionSimulator.var"#17#23"{Float32}, Tuple{Base.Broadcast.Extruded{CuDeviceVector{Float32, 1}, Tuple{Bool}, Tuple{Int64}}}}, Int64}}; name::Nothing, kwargs::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ CUDA C:\Users\James\.julia\packages\CUDA\O0mym\src\compiler\execution.jl:297
  [5] cufunction
    @ C:\Users\James\.julia\packages\CUDA\O0mym\src\compiler\execution.jl:291 [inlined]
  [6] macro expansion
    @ C:\Users\James\.julia\packages\CUDA\O0mym\src\compiler\execution.jl:102 [inlined]
  [7] #launch_heuristic#234
    @ C:\Users\James\.julia\packages\CUDA\O0mym\src\gpuarrays.jl:17 [inlined]
  [8] map!(f::Function, dest::CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, xs::CuArray{Float32, 1, CUDA.Mem.DeviceBuffer})
    @ GPUArrays C:\Users\James\.julia\packages\GPUArrays\0vqbc\src\host\broadcast.jl:130
  [9] diff_sim_gpu(I::Matrix{Int32}, seq::Seq, simu::Simu)
    @ DiffusionSimulator C:\Users\James\.julia\dev\DiffusionSimulator\src\DiffusionSimulator.jl:168
 [10] top-level scope
    @ C:\Users\James\.julia\dev\DiffusionSimulator\test\runtests.jl:70
 [11] include(fname::String)
    @ Base.MainInclude .\client.jl:444
 [12] top-level scope
    @ REPL[4]:1
 [13] top-level scope
    @ C:\Users\James\.julia\packages\CUDA\O0mym\src\initialization.jl:52
in expression starting at C:\Users\James\.julia\dev\DiffusionSimulator\test\runtests.jl:70

So just to be clear, this is the full MWE, right?

julia> using CUDA

julia> N = 1000000;

julia> const u = 1e-5
1.0e-5

julia> X = CUDA.rand(Float32, N);

julia> X1 = CUDA.zeros(Float32, N);

julia> map!(x->rem(x, u), X1, X);

From a REPL (i.e., not setting --check-bounds=yes or running under Pkg.test), using a vanilla Julia 1.6.2 and CUDA.jl 3.4.2, with a regular artifact-provided CUDA?

My shortest MWE was:

using CUDA

function test1(u, X)
    X1 = similar(X)
    X1 .= rem.(X, u)
    return X1
end

function test2(u, X)
    X2 = similar(X)
    map!(x -> rem(x, u), X2, X)
    return X2
end

function main()
    t = 0
    for i in 1:10000
        X = CUDA.rand(Float32, Int(1e6)) .- 0.5
        a = test1(0.1, X)
        b = test2(0.1, X)
        t += sum(a .!== b)
    end
    return t
end

main()

Which produced the error:

ERROR: LoadError: Failed to compile PTX code (ptxas exited with code 1)
Error in get_form_value default
If you think this is a bug, please file an issue and attach C:\Users\James\AppData\Local\Temp\jl_3S0cNQ349S.ptx
Stacktrace:
  [1] error(s::String)
    @ Base .\error.jl:33
  [2] cufunction_compile(job::GPUCompiler.CompilerJob)
    @ CUDA C:\Users\James\.julia\packages\CUDA\O0mym\src\compiler\execution.jl:399
  [3] cached_compilation(cache::Dict{UInt64, Any}, job::GPUCompiler.CompilerJob, compiler::typeof(CUDA.cufunction_compile), linker::typeof(CUDA.cufunction_link))
    @ GPUCompiler C:\Users\James\.julia\packages\GPUCompiler\7APUC\src\cache.jl:89
  [4] cufunction(f::GPUArrays.var"#map_kernel#18"{Int64}, tt::Type{Tuple{CUDA.CuKernelContext, CuDeviceVector{Float64, 1}, Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{1}, Tuple{Base.OneTo{Int64}}, var"#1#2"{Float64}, 
Tuple{Base.Broadcast.Extruded{CuDeviceVector{Float64, 1}, Tuple{Bool}, Tuple{Int64}}}}, Int64}}; name::Nothing, kwargs::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ CUDA C:\Users\James\.julia\packages\CUDA\O0mym\src\compiler\execution.jl:297
  [5] cufunction
    @ C:\Users\James\.julia\packages\CUDA\O0mym\src\compiler\execution.jl:291 [inlined]
  [6] macro expansion
    @ C:\Users\James\.julia\packages\CUDA\O0mym\src\compiler\execution.jl:102 [inlined]
  [7] #launch_heuristic#234
    @ C:\Users\James\.julia\packages\CUDA\O0mym\src\gpuarrays.jl:17 [inlined]
  [8] map!(f::Function, dest::CuArray{Float64, 1, CUDA.Mem.DeviceBuffer}, xs::CuArray{Float64, 1, CUDA.Mem.DeviceBuffer})
    @ GPUArrays C:\Users\James\.julia\packages\GPUArrays\0vqbc\src\host\broadcast.jl:130
  [9] test2(u::Float64, X::CuArray{Float64, 1, CUDA.Mem.DeviceBuffer})
    @ Main C:\Users\James\.julia\dev\DiffusionSimulator\test\modrem.jl:13
 [10] main()
    @ Main C:\Users\James\.julia\dev\DiffusionSimulator\test\modrem.jl:22
 [11] top-level scope
    @ C:\Users\James\.julia\dev\DiffusionSimulator\test\modrem.jl:28
 [12] include(fname::String)
    @ Base.MainInclude .\client.jl:444
 [13] top-level scope
    @ REPL[7]:1
 [14] top-level scope
    @ C:\Users\James\.julia\packages\CUDA\O0mym\src\initialization.jl:52
in expression starting at C:\Users\James\.julia\dev\DiffusionSimulator\test\modrem.jl:28

The offending line doesn’t always trigger an error, so the outer loop helps to force this.

I’m running this from the REPL on debug level 2, with no --check-bounds, Pkg.test, etc., on unmodified Julia 1.6.2 on both devices, with totally standard add CUDA installs.

That error is unrelated: if you’re running with -g2 (which you generally shouldn’t; that was just to get a backtrace), you need to set that environment variable due to bugs in CUDA.

So, I put a function barrier around the offending code, and that seems to stop any errors from arising (which is nice). I’m still unsure what was causing the problem.
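The barrier just pulls the call out into its own function, so the kernel closure gets compiled against concretely-typed arguments; roughly (a sketch, rem_barrier! being my name for it):

function rem_barrier!(X1, X, u)
    map!(x -> rem(x, u), X1, X)
    return X1
end

# inside diff_sim_gpu, instead of the inline map!:
rem_barrier!(X1, X, u)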

My code still periodically produces an error, which is less nice. Considering the error is intermittent, I have to assume it’s memory-related.

My code is guilty of a cardinal sin (according to the docs): I allocate GPU memory in diff_sim_gpu, which is then returned to global scope in the script that executes it. I perform some analysis on this, then (potentially) call the script again. As far as I can tell, this causes memory use to creep up until the function returns incorrect results. I’ll try ensuring everything happens in a local scope and see what happens.
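Roughly what I’m changing to (a sketch; run_once and analyse are stand-ins for our driver script and post-processing):

function run_once(I, seq, simu)
    X1 = diff_sim_gpu(I, seq, simu)  # device array stays local
    out = analyse(Array(X1))         # copy to host for the analysis step
    CUDA.unsafe_free!(X1)            # release the GPU buffer eagerly
    return out
end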

That sounds like a miscompilation… We actually know of similar miscompilations on older hardware, but they often only manifest when using --check-bounds=yes. We also have a couple of workarounds in place for them; could you try (separately) enabling the following workarounds unconditionally (i.e., removing the surrounding branch):

If either works, this is definitely a miscompilation (likely caused by ptxas).

I’ve now tested this; neither of the two workarounds had any obvious effect on the behaviour of the code. I also started Julia with --check-bounds=no, and this didn’t seem to affect the intermittent incorrect results.

Replacing the above map! with the broadcast version still substantially changes the results. I’m currently trying to simplify things for the compiler by dividing my CPU code, GPU allocation, and GPU ops into separate functions. Maybe this will help.

So to be crystal clear here: running the MWE from Out-of-bounds error when calling map! - #13 by JAgho without -g2 results in a bounds error during execution of test2, while using the broadcast version from test1 and (presumably) comparing it against a CPU reference shows significant differences?

I’ve tried reproducing these errors on several systems, always without success. So this will be hard to debug without access to the specific system.