Parallel calculations with CUDA

I’m going through the tutorial https://juliagpu.gitlab.io/CuArrays.jl/tutorials/generated/intro/ and have some problems and errors. First the problems:

I’m starting as in the tutorial

using CuArrays, CUDAnative, BenchmarkTools
N = 2^20
x = CuArrays.fill(1f0, N)
y = CuArrays.fill(2f0, N)

function gpu_add1!(y, x)
    for i = 1:length(y)
        @inbounds y[i] += x[i]
    end
    return nothing
end

Here I have a question about @inbounds, which is supposed to disable the check whether an index is in range. Does somebody have an example which explicitly shows the difference?
When looping over an Array A of length N,

for i = 1:N+1
    @inbounds A[i] = 1
end

gives the same BoundsError as without it, so I would like to understand where it makes a difference.
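My current understanding (please correct me if this is wrong) is that @inbounds is only a promise to the compiler and takes effect when the loop is compiled inside a function body; in that case the difference shows up as a timing difference on valid indices (and as undefined behavior instead of a clean BoundsError on invalid ones). A minimal sketch of the in-bounds case, to be compared with @btime:

```julia
# Identical loops over valid indices; the only difference is whether
# the compiler is allowed to elide the per-access bounds check.
function sum_checked(A)
    s = zero(eltype(A))
    for i in 1:length(A)
        s += A[i]              # bounds check on every access
    end
    return s
end

function sum_unchecked(A)
    s = zero(eltype(A))
    @inbounds for i in 1:length(A)
        s += A[i]              # compiler may skip the check
    end
    return s
end

A = rand(10^6)
sum_checked(A) ≈ sum_unchecked(A)   # same result; time both with @btime
```

With an out-of-range index, the @inbounds version would presumably read or write arbitrary memory rather than throw, which is why the loop to N+1 must never be annotated this way.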

Next I want to call gpu_add1!(y, x). What happens if I do not prefix it with @cuda? Does it then run on the CPU while accessing graphics-card memory? It seems to take much longer than @cuda gpu_add1!(y, x).
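My guess, for comparison (assuming the setup above and a working GPU; I may be misreading the docs): without @cuda the kernel is just an ordinary Julia function running on the CPU, and every y[i]/x[i] is a scalar index into device memory, i.e. one transfer per element.

```julia
# Plain call: runs on the CPU; each y[i] += x[i] scalar-indexes into
# GPU memory, one round-trip per element -- which would explain the slowness.
gpu_add1!(y, x)

# Kernel launch: compiles gpu_add1! for the GPU and runs it there
# (still with a single thread, so the loop is serial on the device).
@cuda gpu_add1!(y, x)
```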

I’m then running

@btime @cuda gpu_add1!(y, x)

and get the error:

ERROR: CUDA error: the launch timed out and was terminated (code 702, ERROR_LAUNCH_TIMEOUT)
Stacktrace:
 [1] cuLaunchKernel(::CUDAdrv.CuFunction, ::UInt32, ::UInt32, ::UInt32, ::UInt32, ::UInt32, ::UInt32, ::Int64, ::CUDAdrv.CuStream, ::Array{Ptr{Nothing},1}, ::Ptr{Nothing}) at C:\Users\Diger\.julia\packages\CUDAdrv\3EzC1\src\error.jl:123
 [2] (::CUDAdrv.var"#350#351"{Bool,Int64,CUDAdrv.CuStream,CUDAdrv.CuFunction})(::Array{Ptr{Nothing},1}) at C:\Users\Diger\.julia\packages\CUDAdrv\3EzC1\src\execution.jl:97
 [3] macro expansion at C:\Users\Diger\.julia\packages\CUDAdrv\3EzC1\src\execution.jl:63 [inlined]
 [4] macro expansion at .\gcutils.jl:91 [inlined]
 [5] macro expansion at C:\Users\Diger\.julia\packages\CUDAdrv\3EzC1\src\execution.jl:61 [inlined]
 [6] pack_arguments(::CUDAdrv.var"#350#351"{Bool,Int64,CUDAdrv.CuStream,CUDAdrv.CuFunction}, ::CuDeviceArray{Float32,1,CUDAnative.AS.Global}, ::CuDeviceArray{Float32,1,CUDAnative.AS.Global}) at C:\Users\Diger\.julia\packages\CUDAdrv\3EzC1\src\execution.jl:40
 [7] #launch#349(::Int64, ::Int64, ::Bool, ::Int64, ::CUDAdrv.CuStream, ::typeof(CUDAdrv.launch), ::CUDAdrv.CuFunction, ::CuDeviceArray{Float32,1,CUDAnative.AS.Global}, ::Vararg{CuDeviceArray{Float32,1,CUDAnative.AS.Global},N} where N) at C:\Users\Diger\.julia\packages\CUDAdrv\3EzC1\src\execution.jl:90
 [8] launch at C:\Users\Diger\.julia\packages\CUDAdrv\3EzC1\src\execution.jl:85 [inlined]
 [9] #355 at C:\Users\Diger\.julia\packages\CUDAdrv\3EzC1\src\execution.jl:164 [inlined]
 [10] macro expansion at C:\Users\Diger\.julia\packages\CUDAdrv\3EzC1\src\execution.jl:125 [inlined]
 [11] macro expansion at .\gcutils.jl:91 [inlined]
 [12] macro expansion at C:\Users\Diger\.julia\packages\CUDAdrv\3EzC1\src\execution.jl:124 [inlined]
 [13] convert_arguments at C:\Users\Diger\.julia\packages\CUDAdrv\3EzC1\src\execution.jl:108 [inlined]
 [14] #cudacall#354 at C:\Users\Diger\.julia\packages\CUDAdrv\3EzC1\src\execution.jl:163 [inlined]
 [15] cudacall at C:\Users\Diger\.julia\packages\CUDAdrv\3EzC1\src\execution.jl:163 [inlined]
 [16] #cudacall#199 at C:\Users\Diger\.julia\packages\CUDAnative\RhbZ0\src\execution.jl:282 [inlined]
 [17] cudacall at C:\Users\Diger\.julia\packages\CUDAnative\RhbZ0\src\execution.jl:279 [inlined]
 [18] macro expansion at C:\Users\Diger\.julia\packages\CUDAnative\RhbZ0\src\execution.jl:263 [inlined]
 [19] #call#187(::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::typeof(CUDAnative.call), ::CUDAnative.HostKernel{gpu_add1!,Tuple{CuDeviceArray{Float32,1,CUDAnative.AS.Global},CuDeviceArray{Float32,1,CUDAnative.AS.Global}}}, ::CuDeviceArray{Float32,1,CUDAnative.AS.Global}, ::CuDeviceArray{Float32,1,CUDAnative.AS.Global}) at C:\Users\Diger\.julia\packages\CUDAnative\RhbZ0\src\execution.jl:240
 [20] call(::CUDAnative.HostKernel{gpu_add1!,Tuple{CuDeviceArray{Float32,1,CUDAnative.AS.Global},CuDeviceArray{Float32,1,CUDAnative.AS.Global}}}, ::CuDeviceArray{Float32,1,CUDAnative.AS.Global}, ::Vararg{CuDeviceArray{Float32,1,CUDAnative.AS.Global},N} where N) at C:\Users\Diger\.julia\packages\CUDAnative\RhbZ0\src\execution.jl:240
 [21] #_#204(::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::CUDAnative.HostKernel{gpu_add1!,Tuple{CuDeviceArray{Float32,1,CUDAnative.AS.Global},CuDeviceArray{Float32,1,CUDAnative.AS.Global}}}, ::CuDeviceArray{Float32,1,CUDAnative.AS.Global}, ::Vararg{CuDeviceArray{Float32,1,CUDAnative.AS.Global},N} where N) at C:\Users\Diger\.julia\packages\CUDAnative\RhbZ0\src\execution.jl:454
 [22] (::CUDAnative.HostKernel{gpu_add1!,Tuple{CuDeviceArray{Float32,1,CUDAnative.AS.Global},CuDeviceArray{Float32,1,CUDAnative.AS.Global}}})(::CuDeviceArray{Float32,1,CUDAnative.AS.Global}, ::Vararg{CuDeviceArray{Float32,1,CUDAnative.AS.Global},N} where N) at C:\Users\Diger\.julia\packages\CUDAnative\RhbZ0\src\execution.jl:454
 [23] macro expansion at C:\Users\Diger\.julia\packages\CUDAnative\RhbZ0\src\execution.jl:178 [inlined]
 [24] macro expansion at .\gcutils.jl:91 [inlined]
 [25] macro expansion at C:\Users\Diger\.julia\packages\CUDAnative\RhbZ0\src\execution.jl:173 [inlined]
 [26] ##core#408() at C:\Users\Diger\.julia\packages\BenchmarkTools\7aqwe\src\execution.jl:297
 [27] ##sample#409(::BenchmarkTools.Parameters) at C:\Users\Diger\.julia\packages\BenchmarkTools\7aqwe\src\execution.jl:305
 [28] sample at C:\Users\Diger\.julia\packages\BenchmarkTools\7aqwe\src\execution.jl:320 [inlined]
 [29] #_lineartrial#41(::Int64, ::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::typeof(BenchmarkTools._lineartrial), ::BenchmarkTools.Benchmark{Symbol("##benchmark#407")}, ::BenchmarkTools.Parameters) at C:\Users\Diger\.julia\packages\BenchmarkTools\7aqwe\src\execution.jl:71
 [30] _lineartrial(::BenchmarkTools.Benchmark{Symbol("##benchmark#407")}, ::BenchmarkTools.Parameters) at C:\Users\Diger\.julia\packages\BenchmarkTools\7aqwe\src\execution.jl:63
 [31] #invokelatest#1 at .\essentials.jl:709 [inlined]
 [32] invokelatest at .\essentials.jl:708 [inlined]
 [33] #lineartrial#38 at C:\Users\Diger\.julia\packages\BenchmarkTools\7aqwe\src\execution.jl:33 [inlined]
 [34] lineartrial at C:\Users\Diger\.julia\packages\BenchmarkTools\7aqwe\src\execution.jl:33 [inlined]
 [35] #tune!#44(::Bool, ::String, ::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::typeof(tune!), ::BenchmarkTools.Benchmark{Symbol("##benchmark#407")}, ::BenchmarkTools.Parameters) at C:\Users\Diger\.julia\packages\BenchmarkTools\7aqwe\src\execution.jl:135
 [36] tune! at C:\Users\Diger\.julia\packages\BenchmarkTools\7aqwe\src\execution.jl:134 [inlined] (repeats 2 times)
 [37] top-level scope at C:\Users\Diger\.julia\packages\BenchmarkTools\7aqwe\src\execution.jl:391
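From what I have read about error code 702, my guess at the cause is the Windows display driver watchdog (WDDM), which terminates any kernel that runs longer than a couple of seconds on a GPU that also drives a display, and gpu_add1! loops over all 2^20 elements in a single GPU thread. If that is right, the tutorial's parallel version should stay under the limit; roughly (mirroring the tutorial, untested here):

```julia
using CuArrays, CUDAnative

N = 2^20
x = CuArrays.fill(1f0, N)
y = CuArrays.fill(2f0, N)

# Each thread handles a strided subset of the indices, so no single
# thread has to loop over all 2^20 elements.
function gpu_add2!(y, x)
    index = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = blockDim().x * gridDim().x
    for i = index:stride:length(y)
        @inbounds y[i] += x[i]
    end
    return nothing
end

numblocks = ceil(Int, N / 256)
@cuda threads=256 blocks=numblocks gpu_add2!(y, x)
```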

A last error occurs when calling

index = threadIdx().x

in the REPL to see what it does. It returns a very long error message which I cannot post, because the window crashes at the end. The same happens with

stride = blockDim().x

I’m running Julia on Windows 10.