Simple kernel not working

Hi again guys :slight_smile:

this time I’m facing a problem that makes me think I’m missing something very important in CUDA programming, so maybe somebody can help me understand what is going on. Consider this:

m     = CuArrays.zeros(2);
cic   = CuArray(CartesianIndices(rand(3,4)));
res   = CuArrays.zeros(length(cic));
Ncuts = 2

function test2_CUDA(res,CI,m,Ncuts)  
    index    = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride   = blockDim().x * gridDim().x
    for i    = index:stride:length(CI)
        for k in 1:Ncuts
            m[k]  = CI[i][k]-1
        end;    
        for k in 1:Ncuts
            res[i] += m[k]
        end
    end
end;

numblocks      = 256
@cuda threads  = 256 blocks = numblocks test2_CUDA(res,cic,m,Ncuts)
res

12-element CuArray{Float32,1,Nothing}:
 3.0
 3.0
 3.0
 3.0
 3.0
 3.0
 3.0
 3.0
 3.0
 3.0
 3.0
 3.0

Now the question is: how can it be that all the elements in the array res end up with the same value? This is supposed to give the sum of the CartesianIndex components (minus 1 for each component), which is definitely not a constant…

For the sake of comparison, here is the same calculation without CUDA:

# Test functions
# 
function test2(res,CI,m,Ncuts)
    for i in 1:length(CI)
        for k in 1:Ncuts
            m[k]  = CI[i][k]-1
        end;    
        for k in 1:Ncuts
            res[i] += m[k]
        end
    end
end;

m     = zeros(2);
cic   = CartesianIndices(rand(3,4));
res   = zeros(length(cic));
Ncuts = 2

test2(res,cic,m,Ncuts)
res
12-element Array{Float64,1}:
 0.0
 1.0
 2.0
 1.0
 2.0
 3.0
 2.0
 3.0
 4.0
 3.0
 4.0
 5.0

Can somebody help me understand this?

Thanks a lot,

Ferran.

You are launching 256*256 threads*blocks, so index goes from 1 to 65536. Block and grid dimensions are both 256, so stride = blockDim().x * gridDim().x = 65536 as well, so your loop performs at most a single iteration per thread.

You can use @cushow (e.g. @cushow index) or @cuprintln (e.g. @cuprintln "thread $index: stride=$stride") to debug your implementation.
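To see that concretely, here is what the loop range boils down to for a 12-element CI (plain-Julia illustration, not part of the kernel):

length(1:65536:12)   # thread with index 1: a single iteration, i = 1
length(12:65536:12)  # thread with index 12: a single iteration, i = 12
length(13:65536:12)  # thread with index 13 and beyond: zero iterations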


Ah! Yes, now I see. I said there was something I did not understand…
The last question, then, is: how can I make it go through every element of res?
I mean, how do I make the index i run over all values in the collection?
Thanks for your help,
Ferran.

Sorry, I thought I understood, but I didn’t. I’ve just included a @cuprintln(i) line in test2_CUDA right after the

for i    = index:stride:length(CI)

line and get

i : 1
i : 2
i : 3
i : 4
i : 5
i : 6
i : 7
i : 8
i : 9
i : 10
i : 11
i : 12

so it seems the index i runs over all possible values, which is consistent with the fact that the array res is fully updated… but with the wrong values, because all the entries end up with the same final result.

I’m still struggling to understand what’s going on here… unless it really does perform the loop, but the m[k] values share the same memory positions, so all threads end up with the same values (irrespective of i), namely those written by one of the threads (the ‘last’ one to run, if that makes any sense?).
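(If that is the case, I imagine giving each thread its own local accumulator instead of the shared m buffer, roughly like the untested sketch below, would avoid it, but I am not sure:)

function test2_CUDA_local(res, CI, Ncuts)
    index  = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = blockDim().x * gridDim().x
    for i = index:stride:length(CI)
        acc = 0.0f0                     # private to this thread, nothing shared
        for k in 1:Ncuts
            acc += CI[i][k] - 1
        end
        res[i] += acc
    end
    return nothing
end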

Or am I completely missing the point?

Thanks again,

Ferran.

Do you want CI[i, k]?
Note that you can use multidimensional arrays.

I suggest you write a CPU version of your code so it’s easier to see what’s going on.
Alternatively just @cushow absolutely everything to see where the problem is.
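For example, something along these lines (just a sketch, using the same CUDAnative macros as above):

function debug_CUDA(CI)
    index  = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = blockDim().x * gridDim().x
    @cushow index
    @cushow stride
    for i = index:stride:length(CI)
        @cushow i
    end
    return nothing
end

@cuda threads = 4 blocks = 1 debug_CUDA(cic)   # a small launch keeps the output readable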

Already done in the original post, where I also wrote the CPU code to compare with…
The complete problem is more complex and quite a bit harder to describe. But if I manage to make the CUDA version above behave like the non-CUDA one, I’ll be happy :slight_smile:

I think you just need to remove this loop. CUDA should do this loop implicitly.

I.e. i = (blockIdx().x - 1) * blockDim().x + threadIdx().x

and remove the stride stuff and the loop over i.

Hi,
unfortunately, if I understood you correctly, that does not seem to work:

using CuArrays
using CuArrays.CURAND
using CUDAnative
using CUDAdrv

####### TEST EXAMPLE

m     = CuArrays.zeros(2);
cic   = CuArray(CartesianIndices(rand(3,4)));
res   = CuArrays.zeros(length(cic));
Ncuts = 2

function test3_CUDA(res,CI,m,Ncuts)  
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    for k in 1:Ncuts
        m[k]  = CI[i][k]-1
    end;    
    for k in 1:Ncuts
        res[i] += m[k]
    end
end;

NNN = 256
numblocks      = NNN
@time @cuda threads  = NNN blocks = numblocks test3_CUDA(res,cic,m,Ncuts)
res

11.693383 seconds (27.65 M allocations: 1.342 GiB, 4.19% gc time)
ERROR: a exception was thrown during kernel execution.
       Run Julia on debug level 2 for device stack traces.
...
KernelException: exception thrown during kernel execution on device GeForce GTX 1050 Ti

Stacktrace:
 [1] check_exceptions() at /home/mazzanti/.julia/packages/CUDAnative/C91oY/src/exceptions.jl:84
 [2] prepare_cuda_call() at /home/mazzanti/.julia/packages/CUDAnative/C91oY/src/init.jl:37
 [3] initialize_api() at /home/mazzanti/.julia/packages/CUDAdrv/Uc14X/src/error.jl:100
 [4] macro expansion at /home/mazzanti/.julia/packages/CUDAdrv/Uc14X/src/libcuda.jl:502 [inlined]
 [5] macro expansion at /home/mazzanti/.julia/packages/CUDAdrv/Uc14X/src/error.jl:110 [inlined]
 [6] cuMemcpyDtoH_v2(::Ptr{Float32}, ::CuPtr{Float32}, ::Int64) at /home/mazzanti/.julia/packages/CUDAapi/XuSHC/src/call.jl:93
 [7] #unsafe_copyto!#6 at /home/mazzanti/.julia/packages/CUDAdrv/Uc14X/src/memory.jl:324 [inlined]
 [8] unsafe_copyto! at /home/mazzanti/.julia/packages/CUDAdrv/Uc14X/src/memory.jl:317 [inlined]
 [9] unsafe_copyto! at /home/mazzanti/.julia/packages/CuArrays/YFdj7/src/array.jl:309 [inlined]
 [10] copyto! at /home/mazzanti/.julia/packages/CuArrays/YFdj7/src/array.jl:284 [inlined]
 [11] copyto! at /home/mazzanti/.julia/packages/GPUArrays/JqOUg/src/host/abstractarray.jl:102 [inlined]
 [12] Array{Float32,1}(::CuArray{Float32,1,Nothing}) at ./array.jl:541
 [13] Array at ./boot.jl:429 [inlined]
 [14] convert at ./array.jl:533 [inlined]
 [15] adapt_storage at /home/mazzanti/.julia/packages/CuArrays/YFdj7/src/array.jl:262 [inlined]
 [16] adapt_structure at /home/mazzanti/.julia/packages/Adapt/3Zk8x/src/Adapt.jl:83 [inlined]
 [17] adapt at /home/mazzanti/.julia/packages/Adapt/3Zk8x/src/Adapt.jl:81 [inlined]
 [18] print_array at /home/mazzanti/.julia/packages/GPUArrays/JqOUg/src/host/abstractarray.jl:52 [inlined]
 [19] show(::IOContext{Base.GenericIOBuffer{Array{UInt8,1}}}, ::MIME{Symbol("text/plain")}, ::CuArray{Float32,1,Nothing}) at ./arrayshow.jl:346
 [20] limitstringmime(::MIME{Symbol("text/plain")}, ::CuArray{Float32,1,Nothing}) at /home/mazzanti/.julia/packages/IJulia/DrVMH/src/inline.jl:43
 [21] display_mimestring(::MIME{Symbol("text/plain")}, ::CuArray{Float32,1,Nothing}) at /home/mazzanti/.julia/packages/IJulia/DrVMH/src/display.jl:67
 [22] display_dict(::CuArray{Float32,1,Nothing}) at /home/mazzanti/.julia/packages/IJulia/DrVMH/src/display.jl:96
 [23] #invokelatest#1 at ./essentials.jl:712 [inlined]
 [24] invokelatest at ./essentials.jl:711 [inlined]
 [25] execute_request(::ZMQ.Socket, ::IJulia.Msg) at /home/mazzanti/.julia/packages/IJulia/DrVMH/src/execute_request.jl:112
 [26] #invokelatest#1 at ./essentials.jl:712 [inlined]
 [27] invokelatest at ./essentials.jl:711 [inlined]
 [28] eventloop(::ZMQ.Socket) at /home/mazzanti/.julia/packages/IJulia/DrVMH/src/eventloop.jl:8
 [29] (::IJulia.var"#15#18")() at ./task.jl:358

Just guessing, but maybe you are trying to access the array at non-existent indices, since your arrays are tiny but your number of threads and blocks is large?
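If so, a bounds guard plus a launch that matches the array size (untested sketch; the shared m buffer is a separate concern) should at least avoid the exception:

function test3_CUDA_guarded(res, CI, m, Ncuts)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(CI)                  # threads beyond the array do nothing
        for k in 1:Ncuts
            m[k] = CI[i][k] - 1
        end
        for k in 1:Ncuts
            res[i] += m[k]
        end
    end
    return nothing
end

numblocks = cld(length(cic), 256)       # only as many blocks as needed
@cuda threads = 256 blocks = numblocks test3_CUDA_guarded(res, cic, m, Ncuts)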

I do not think so; the last problem only appeared when removing the for loop. Keeping the loop and printing its index value shows that it just runs over the elements of the array (so you could say a lot of threads are being wasted for nothing). To my eyes it looks like removing the for loop is not a very good idea, although admittedly I do not yet understand what is going on, so go figure…