# Simple kernel not working

Hi again guys

This time I’m facing a problem that makes me think I’m missing something very important in CUDA programming, so maybe somebody can help me understand what is going on. Consider this:

``````
using CuArrays, CUDAnative

m     = CuArrays.zeros(2);
cic   = CuArray(CartesianIndices(rand(3,4)));
res   = CuArrays.zeros(length(cic));
Ncuts = 2

function test2_CUDA(res, CI, m, Ncuts)
    index  = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = blockDim().x * gridDim().x
    # grid-stride loop over the elements of CI
    for i = index:stride:length(CI)
        for k in 1:Ncuts
            m[k] = CI[i][k] - 1
        end
        for k in 1:Ncuts
            res[i] += m[k]
        end
    end
end;

numblocks = 256
@cuda threads=256 blocks=numblocks test2_CUDA(res, cic, m, Ncuts)
res

12-element CuArray{Float32,1,Nothing}:
3.0
3.0
3.0
3.0
3.0
3.0
3.0
3.0
3.0
3.0
3.0
3.0
``````

Now the question is: how can all the elements of the array `res` have the same value? Each element is supposed to be the sum of the components of the corresponding `CartesianIndex` (minus 1 in each component), so definitely not a constant…

For the sake of comparison, here is the same calculation without CUDA:

``````
# Test functions
function test2(res, CI, m, Ncuts)
    for i in 1:length(CI)
        for k in 1:Ncuts
            m[k] = CI[i][k] - 1
        end
        for k in 1:Ncuts
            res[i] += m[k]
        end
    end
end;

m     = zeros(2);
cic   = CartesianIndices(rand(3,4));
res   = zeros(length(cic));
Ncuts = 2

test2(res, cic, m, Ncuts)
res
12-element Array{Float64,1}:
0.0
1.0
2.0
1.0
2.0
3.0
2.0
3.0
4.0
3.0
4.0
5.0
``````

Can somebody help me understand this?

Thanks a lot,

Ferran.

You are launching 256 threads per block in 256 blocks, i.e. `256*256 = 65536` threads in total, so `index` goes from 1 to 65536. Block and grid dimensions are both 256, so `stride = blockDim().x * gridDim().x = 65536` too, which means the loop in each thread performs at most a single iteration.
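The arithmetic above can be checked without a GPU. The following is a minimal CPU-only sketch, assuming the launch configuration and array length from the original post; the names `nthreads`, `nblocks`, `n` and `iterations` are made up for illustration. It counts how many loop iterations each global thread index would perform.

```julia
# CPU-only sketch: count the grid-stride loop iterations of every simulated thread.
nthreads, nblocks = 256, 256           # launch configuration from the post
n      = 12                            # length(CI) for a 3x4 index set
stride = nthreads * nblocks            # blockDim().x * gridDim().x = 65536

# Thread `index` iterates over index:stride:n, so its iteration count is the range length.
iterations = [length(index:stride:n) for index in 1:stride]

println(count(x -> x == 1, iterations))   # threads doing exactly one iteration: 12
println(count(x -> x == 0, iterations))   # threads doing nothing: 65524
println(maximum(iterations))              # no thread iterates more than once: 1
```

So the loop itself still covers all 12 elements, each one handled by a different thread.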

You can use `@cushow` (e.g. `@cushow index`) or `@cuprintln` (e.g. `@cuprintln "thread $index: stride=$stride"`) to debug your implementation.


Ah! yes, now I see. I said there was something I did not understand…
Last question then is: how could I make it go through every different element of `res`?
I mean, to make index `i` run over all values in the collection?
Ferran.
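As a rough illustration of how a grid-stride loop covers a collection, here is a CPU-only sketch that emulates a deliberately small launch. The sizes (2 blocks of 4 threads) and the `visits` counter are assumptions made for this example, not something taken from the thread.

```julia
# CPU-only sketch: emulate a grid-stride loop and count how often each element is visited.
n = 12                                 # number of elements, as in the post
nthreads, nblocks = 4, 2               # hypothetical small launch: 8 threads for 12 elements
stride = nthreads * nblocks
visits = zeros(Int, n)

for block in 1:nblocks, thread in 1:nthreads
    index = (block - 1) * nthreads + thread   # same formula as in the kernel
    for i in index:stride:n
        visits[i] += 1
    end
end

println(visits)                        # [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
```

Every element is visited exactly once, whether the launch has fewer threads than elements (as here) or far more (as in the original post).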

Sorry, I thought I understood but I didn’t. I’ve just added a `@cuprintln(i)` line to `test2_CUDA` right after the

``````
for i = index:stride:length(CI)
``````

line and get

``````
i : 1
i : 2
i : 3
i : 4
i : 5
i : 6
i : 7
i : 8
i : 9
i : 10
i : 11
i : 12
``````

So it seems index `i` runs over all possible values, which is consistent with the fact that every element of `res` is updated… but with the wrong values, because they all end up with the same final result.

I’m still struggling to understand what’s going on here… unless it really does the loop but the `m[k]` values share the same memory positions, so all threads read the same values (irrespective of `i`), namely those written by one of the threads (the ‘last’ one being processed, if that makes any sense?).

Or am I completely missing the point?
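The shared-memory hypothesis above can be illustrated on the CPU. This is only a sketch of the mechanism, assuming that all simulated threads finish the first inner loop before any of them starts the second one; real GPU scheduling is not specified this precisely, and which thread writes last is not defined, so the constant printed here need not be the 3.0 seen on the device.

```julia
# CPU-only sketch of the suspected race: every simulated thread shares one scratch array `m`.
CI    = CartesianIndices((3, 4))       # same index set as in the post
Ncuts = 2
m     = zeros(Ncuts)                   # shared by all "threads", like the CuArray `m`
res   = zeros(length(CI))

# Phase 1: all threads run the first inner loop "together", overwriting m[k].
for k in 1:Ncuts, i in 1:length(CI)
    m[k] = CI[i][k] - 1                # the last writer wins
end

# Phase 2: all threads then read the same, already overwritten, m.
for k in 1:Ncuts, i in 1:length(CI)
    res[i] += m[k]
end

println(res)                           # every entry holds the same value
println(all(x -> x == res[1], res))    # true
```

If this is indeed what happens, the usual remedy is to keep per-thread temporaries in local variables rather than in an array shared by all threads.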

Thanks again,

Ferran.

Do you want `CI[i, k]`?
Note that you can use multidimensional arrays.

I suggest you write a CPU version of your code so it’s easier to see what’s going on.
Alternatively just `@cushow` absolutely everything to see where the problem is.

Already done in the original post, where I also wrote the CPU code to compare with…
The complete problem is more complex and much harder to describe. But if I manage to make the CUDA version above give the same result as the non-CUDA one, I’ll be happy.

I think you just need to remove this loop. CUDA should do this loop implicitly.

I.e. `i = (blockIdx().x - 1) * blockDim().x + threadIdx().x`

and remove the stride stuff and the loop over `i`.
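A one-thread-per-element kernel along these lines might look like the sketch below. It is not taken from the thread: the kernel name `sum_indices_kernel`, the bounds check, and the thread-local accumulator `acc` (used instead of the shared array `m`) are assumptions of this example. The bounds check is there because a launch normally has more threads than elements, and the surplus threads must not index past the end of the arrays. It needs a CUDA-capable GPU and the packages used elsewhere in this thread.

```julia
using CuArrays, CUDAnative             # packages used in this thread

# Hypothetical kernel: one thread per element, no shared scratch array.
function sum_indices_kernel(res, CI, Ncuts)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(CI)                 # surplus threads do nothing
        acc = 0f0                      # thread-local accumulator
        for k in 1:Ncuts
            acc += CI[i][k] - 1
        end
        res[i] += acc
    end
    return nothing
end

cic = CuArray(CartesianIndices(rand(3, 4)))
res = CuArrays.zeros(length(cic))

nthreads = 256
nblocks  = cld(length(cic), nthreads)  # just enough blocks to cover every element
@cuda threads=nthreads blocks=nblocks sum_indices_kernel(res, cic, 2)
Array(res)                             # copy back to the host for inspection
```

With these assumptions the result should agree with the CPU reference in the original post, but that is worth verifying on the actual device.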

Hi,
Unfortunately, if I understood you correctly, that does not seem to work:

``````
using CuArrays
using CuArrays.CURAND
using CUDAnative

####### TEST EXAMPLE

m     = CuArrays.zeros(2);
cic   = CuArray(CartesianIndices(rand(3,4)));
res   = CuArrays.zeros(length(cic));
Ncuts = 2

function test3_CUDA(res, CI, m, Ncuts)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    for k in 1:Ncuts
        m[k] = CI[i][k] - 1
    end
    for k in 1:Ncuts
        res[i] += m[k]
    end
end;

NNN = 256
numblocks = NNN
@time @cuda threads=NNN blocks=numblocks test3_CUDA(res, cic, m, Ncuts)
res

11.693383 seconds (27.65 M allocations: 1.342 GiB, 4.19% gc time)
ERROR: a exception was thrown during kernel execution.
Run Julia on debug level 2 for device stack traces.
ERROR: a exception was thrown during kernel execution.
Run Julia on debug level 2 for device stack traces.
ERROR: a exception was thrown during kernel execution.
Run Julia on debug level 2 for device stack traces.
ERROR: a exception was thrown during kernel execution.
Run Julia on debug level 2 for device stack traces.
ERROR: a exception was thrown during kernel execution.
Run Julia on debug level 2 for device stack traces.
ERROR: a exception was thrown during kernel execution.
Run Julia on debug level 2 for device stack traces.
ERROR: a exception was thrown during kernel execution.
Run Julia on debug level 2 for device stack traces.
...
KernelException: exception thrown during kernel execution on device GeForce GTX 1050 Ti

Stacktrace:
[1] check_exceptions() at /home/mazzanti/.julia/packages/CUDAnative/C91oY/src/exceptions.jl:84
[2] prepare_cuda_call() at /home/mazzanti/.julia/packages/CUDAnative/C91oY/src/init.jl:37
[4] macro expansion at /home/mazzanti/.julia/packages/CUDAdrv/Uc14X/src/libcuda.jl:502 [inlined]
[5] macro expansion at /home/mazzanti/.julia/packages/CUDAdrv/Uc14X/src/error.jl:110 [inlined]
[6] cuMemcpyDtoH_v2(::Ptr{Float32}, ::CuPtr{Float32}, ::Int64) at /home/mazzanti/.julia/packages/CUDAapi/XuSHC/src/call.jl:93
[9] unsafe_copyto! at /home/mazzanti/.julia/packages/CuArrays/YFdj7/src/array.jl:309 [inlined]
[10] copyto! at /home/mazzanti/.julia/packages/CuArrays/YFdj7/src/array.jl:284 [inlined]
[11] copyto! at /home/mazzanti/.julia/packages/GPUArrays/JqOUg/src/host/abstractarray.jl:102 [inlined]
[12] Array{Float32,1}(::CuArray{Float32,1,Nothing}) at ./array.jl:541
[13] Array at ./boot.jl:429 [inlined]
[14] convert at ./array.jl:533 [inlined]
[18] print_array at /home/mazzanti/.julia/packages/GPUArrays/JqOUg/src/host/abstractarray.jl:52 [inlined]
[19] show(::IOContext{Base.GenericIOBuffer{Array{UInt8,1}}}, ::MIME{Symbol("text/plain")}, ::CuArray{Float32,1,Nothing}) at ./arrayshow.jl:346
[20] limitstringmime(::MIME{Symbol("text/plain")}, ::CuArray{Float32,1,Nothing}) at /home/mazzanti/.julia/packages/IJulia/DrVMH/src/inline.jl:43
[21] display_mimestring(::MIME{Symbol("text/plain")}, ::CuArray{Float32,1,Nothing}) at /home/mazzanti/.julia/packages/IJulia/DrVMH/src/display.jl:67
[22] display_dict(::CuArray{Float32,1,Nothing}) at /home/mazzanti/.julia/packages/IJulia/DrVMH/src/display.jl:96
[23] #invokelatest#1 at ./essentials.jl:712 [inlined]
[24] invokelatest at ./essentials.jl:711 [inlined]
[25] execute_request(::ZMQ.Socket, ::IJulia.Msg) at /home/mazzanti/.julia/packages/IJulia/DrVMH/src/execute_request.jl:112
[26] #invokelatest#1 at ./essentials.jl:712 [inlined]
[27] invokelatest at ./essentials.jl:711 [inlined]
[28] eventloop(::ZMQ.Socket) at /home/mazzanti/.julia/packages/IJulia/DrVMH/src/eventloop.jl:8