Simple kernel not working

Hi again guys :slight_smile:

this time I’m facing a problem that makes me think I’m missing something very important in CUDA programming, so maybe somebody can help me understand what is going on. Consider this:

m     = CuArrays.zeros(2);
cic   = CuArray(CartesianIndices(rand(3,4)));
res   = CuArrays.zeros(length(cic));
Ncuts = 2

function test2_CUDA(res,CI,m,Ncuts)  
    index    = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride   = blockDim().x * gridDim().x
    for i    = index:stride:length(CI)
        for k in 1:Ncuts
            m[k]  = CI[i][k]-1
        end;    
        for k in 1:Ncuts
            res[i] += m[k]
        end
    end
end;

numblocks      = 256
@cuda threads  = 256 blocks = numblocks test2_CUDA(res,cic,m,Ncuts)
res

12-element CuArray{Float32,1,Nothing}:
 3.0
 3.0
 3.0
 3.0
 3.0
 3.0
 3.0
 3.0
 3.0
 3.0
 3.0
 3.0

Now the question is: how can it be that all the elements in the array res end up with the same value? This is supposed to give the sum of the CartesianIndex components (minus 1 for each component), which is definitely not a constant…

For the sake of comparison, here is the same calculation without CUDA:

# Test functions
# 
function test2(res,CI,m,Ncuts)
    for i in 1:length(CI)
        for k in 1:Ncuts
            m[k]  = CI[i][k]-1
        end;    
        for k in 1:Ncuts
            res[i] += m[k]
        end
    end
end;

m     = zeros(2);
cic   = CartesianIndices(rand(3,4));
res   = zeros(length(cic));
Ncuts = 2

test2(res,cic,m,Ncuts)
res
12-element Array{Float64,1}:
 0.0
 1.0
 2.0
 1.0
 2.0
 3.0
 2.0
 3.0
 4.0
 3.0
 4.0
 5.0

Can somebody help me understand this?

Thanks a lot,

Ferran.

You are launching 256*256 threads*blocks, so index goes from 1 to 65536. Block and grid dimensions are both 256, so stride = blockDim().x * gridDim().x = 65536 as well, so your loop performs at most a single iteration per thread.

You can use @cushow (e.g. @cushow index) or @cuprintln (e.g. @cuprintln "thread $index: stride=$stride") to debug your implementation.
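To see that concretely, here is what the loop range boils down to for a 12-element CI (plain-Julia illustration, not part of the kernel):

length(1:65536:12)   # thread with index 1: a single iteration, i = 1
length(12:65536:12)  # thread with index 12: a single iteration, i = 12
length(13:65536:12)  # thread with index 13 and beyond: zero iterations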


Ah! Yes, now I see. I said there was something I did not understand…
The last question, then, is: how can I make it go through every element of res?
I mean, how do I make the index i run over all values in the collection?
Thanks for your help,
Ferran.

Sorry, I thought I understood, but I didn’t. I’ve just included a @cuprintln(i) line in test2_CUDA right after the

for i    = index:stride:length(CI)

line and get

i : 1
i : 2
i : 3
i : 4
i : 5
i : 6
i : 7
i : 8
i : 9
i : 10
i : 11
i : 12

so it seems the index i runs over all possible values, which is consistent with the fact that the array res is fully updated… but with the wrong values, because all the entries end up with the same final result.

I’m still struggling to understand what’s going on here… unless it really does perform the loop, but the m[k] values share the same memory positions, so all threads end up with the same values (irrespective of i), namely those written by one of the threads (the ‘last’ one to run, if that makes any sense?).
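(If that is the case, I imagine giving each thread its own local accumulator instead of the shared m buffer, roughly like the untested sketch below, would avoid it, but I am not sure:)

function test2_CUDA_local(res, CI, Ncuts)
    index  = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = blockDim().x * gridDim().x
    for i = index:stride:length(CI)
        acc = 0.0f0                     # private to this thread, nothing shared
        for k in 1:Ncuts
            acc += CI[i][k] - 1
        end
        res[i] += acc
    end
    return nothing
end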

Or am I completely missing the point?

Thanks again,

Ferran.

Do you want CI[i, k]?
Note that you can use multidimensional arrays.

I suggest you write a CPU version of your code so it’s easier to see what’s going on.
Alternatively just @cushow absolutely everything to see where the problem is.
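For example, something along these lines (just a sketch, using the same CUDAnative macros as above):

function debug_CUDA(CI)
    index  = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = blockDim().x * gridDim().x
    @cushow index
    @cushow stride
    for i = index:stride:length(CI)
        @cushow i
    end
    return nothing
end

@cuda threads = 4 blocks = 1 debug_CUDA(cic)   # a small launch keeps the output readable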

Already done in the original post, where I also wrote the CPU code to compare with…
The complete problem is more complex and quite a bit harder to describe. But if I manage to make the CUDA version above behave like the non-CUDA one, I’ll be happy :slight_smile:

I think you just need to remove this loop. CUDA should do this loop implicitly.

I.e. i = (blockIdx().x - 1) * blockDim().x + threadIdx().x

and remove the stride stuff and the loop over i.

Hi,
unfortunately, if I understood you correctly, that does not seem to work:

using CuArrays
using CuArrays.CURAND
using CUDAnative
using CUDAdrv

####### TEST EXAMPLE

m     = CuArrays.zeros(2);
cic   = CuArray(CartesianIndices(rand(3,4)));
res   = CuArrays.zeros(length(cic));
Ncuts = 2

function test3_CUDA(res,CI,m,Ncuts)  
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    for k in 1:Ncuts
        m[k]  = CI[i][k]-1
    end;    
    for k in 1:Ncuts
        res[i] += m[k]
    end
end;

NNN = 256
numblocks      = NNN
@time @cuda threads  = NNN blocks = numblocks test3_CUDA(res,cic,m,Ncuts)
res

11.693383 seconds (27.65 M allocations: 1.342 GiB, 4.19% gc time)
ERROR: a exception was thrown during kernel execution.
       Run Julia on debug level 2 for device stack traces.
...
KernelException: exception thrown during kernel execution on device GeForce GTX 1050 Ti

Stacktrace:
 [1] check_exceptions() at /home/mazzanti/.julia/packages/CUDAnative/C91oY/src/exceptions.jl:84
 [2] prepare_cuda_call() at /home/mazzanti/.julia/packages/CUDAnative/C91oY/src/init.jl:37
 [3] initialize_api() at /home/mazzanti/.julia/packages/CUDAdrv/Uc14X/src/error.jl:100
 [4] macro expansion at /home/mazzanti/.julia/packages/CUDAdrv/Uc14X/src/libcuda.jl:502 [inlined]
 [5] macro expansion at /home/mazzanti/.julia/packages/CUDAdrv/Uc14X/src/error.jl:110 [inlined]
 [6] cuMemcpyDtoH_v2(::Ptr{Float32}, ::CuPtr{Float32}, ::Int64) at /home/mazzanti/.julia/packages/CUDAapi/XuSHC/src/call.jl:93
 [7] #unsafe_copyto!#6 at /home/mazzanti/.julia/packages/CUDAdrv/Uc14X/src/memory.jl:324 [inlined]
 [8] unsafe_copyto! at /home/mazzanti/.julia/packages/CUDAdrv/Uc14X/src/memory.jl:317 [inlined]
 [9] unsafe_copyto! at /home/mazzanti/.julia/packages/CuArrays/YFdj7/src/array.jl:309 [inlined]
 [10] copyto! at /home/mazzanti/.julia/packages/CuArrays/YFdj7/src/array.jl:284 [inlined]
 [11] copyto! at /home/mazzanti/.julia/packages/GPUArrays/JqOUg/src/host/abstractarray.jl:102 [inlined]
 [12] Array{Float32,1}(::CuArray{Float32,1,Nothing}) at ./array.jl:541
 [13] Array at ./boot.jl:429 [inlined]
 [14] convert at ./array.jl:533 [inlined]
 [15] adapt_storage at /home/mazzanti/.julia/packages/CuArrays/YFdj7/src/array.jl:262 [inlined]
 [16] adapt_structure at /home/mazzanti/.julia/packages/Adapt/3Zk8x/src/Adapt.jl:83 [inlined]
 [17] adapt at /home/mazzanti/.julia/packages/Adapt/3Zk8x/src/Adapt.jl:81 [inlined]
 [18] print_array at /home/mazzanti/.julia/packages/GPUArrays/JqOUg/src/host/abstractarray.jl:52 [inlined]
 [19] show(::IOContext{Base.GenericIOBuffer{Array{UInt8,1}}}, ::MIME{Symbol("text/plain")}, ::CuArray{Float32,1,Nothing}) at ./arrayshow.jl:346
 [20] limitstringmime(::MIME{Symbol("text/plain")}, ::CuArray{Float32,1,Nothing}) at /home/mazzanti/.julia/packages/IJulia/DrVMH/src/inline.jl:43
 [21] display_mimestring(::MIME{Symbol("text/plain")}, ::CuArray{Float32,1,Nothing}) at /home/mazzanti/.julia/packages/IJulia/DrVMH/src/display.jl:67
 [22] display_dict(::CuArray{Float32,1,Nothing}) at /home/mazzanti/.julia/packages/IJulia/DrVMH/src/display.jl:96
 [23] #invokelatest#1 at ./essentials.jl:712 [inlined]
 [24] invokelatest at ./essentials.jl:711 [inlined]
 [25] execute_request(::ZMQ.Socket, ::IJulia.Msg) at /home/mazzanti/.julia/packages/IJulia/DrVMH/src/execute_request.jl:112
 [26] #invokelatest#1 at ./essentials.jl:712 [inlined]
 [27] invokelatest at ./essentials.jl:711 [inlined]
 [28] eventloop(::ZMQ.Socket) at /home/mazzanti/.julia/packages/IJulia/DrVMH/src/eventloop.jl:8
 [29] (::IJulia.var"#15#18")() at ./task.jl:358

Just guessing, but maybe you are trying to access the array at non-existent indices, since your arrays are tiny but your number of threads and blocks is large?
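If so, a bounds guard plus a launch that matches the array size (untested sketch; the shared m buffer is a separate concern) should at least avoid the exception:

function test3_CUDA_guarded(res, CI, m, Ncuts)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(CI)                  # threads beyond the array do nothing
        for k in 1:Ncuts
            m[k] = CI[i][k] - 1
        end
        for k in 1:Ncuts
            res[i] += m[k]
        end
    end
    return nothing
end

numblocks = cld(length(cic), 256)       # only as many blocks as needed
@cuda threads = 256 blocks = numblocks test3_CUDA_guarded(res, cic, m, Ncuts)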

I do not think so; the last problem only appeared when removing the for loop. Keeping the loop and printing its index value shows that it just runs over the elements of the array (so you could say a lot of threads are being wasted for nothing). To my eyes it looks like removing the for loop is not a very good idea, although admittedly I do not yet understand what is going on, so go figure…