Hi everyone,

I jumped into CUDA programming mid-way without good background knowledge. It is just because `CUDAnative.jl`

is so easy to program with. Now I have a problem related to CUDA, not not the package `CUDAnative.jl`

. But I believe that there will be experts here to answer my question!

My question came from the effort in 2d numerical integral. The following function is the final step to do the summation of `BF[X,w1,w2]*dω1dω2[w1,w2]`

on the grid. The equivalent CPU code is

```
OUT[x] = sum( BF[x,:,:] .* dω1dω2[:,:] ) for x in X
```

Since the mesh for `w1,w2`

is large, I expect acceleration with GPU.

The following code consists of two parts. The first part has a double-while loop, which accumulates on the shared memory `sdata`

. The second part is a familiar one, which “fold” the shared memory into the final result. Similar code for a 1d shared memory is a standard CUDA sample code.

My question is: what is the warp execution order if I use `threadIdx().x`

and `threadIdx().y`

? Should I reverse the order of “folding”, by firstly do it on the y-dimension? The answer is crucial to the performance of the second part. If I got the wrong understanding, the second part will probably be “divergent” – different threads in a warp are executing different if-else branches. The warp size is 32.

```
#NOTE goal: sum( BF[X,:,:] .* dω1dω2[:,:] ) ==> OUT[X,b3] for each X
function Trapezoid_kernel!(
dimsBF::NTuple{5,Int64},
OUT, #signature (:X, :ω)
BF, #signature (:X, :ω1-1, :ω2-1)
dω1dω2 #signature (:ω1-1, :ω2-1)
)
# -------------------------------------------------
(dimBF1,dimBF2,dimBF3) = dimsBF
# -------------------------------------------------
sdata = @cuStaticSharedMem(ComplexF64,(32,32)) #XXX could be larger!
sdata[threadIdx().x,threadIdx().y] = zero(ComplexF64)
# -------------------------------------------------
if blockIdx().z<=dimBF1 && blockIdx().x==1 && blockIdx().y==1
tidx = threadIdx().x
while tidx <= dimBF2
tidy = threadIdx().y
while tidy <= dimBF3
#NOTE accumulate on shared memory
#NOTE BF[X,:,:] .* dω1dω2[:,:] where X = blockIdx().z
sdata[threadIdx().x,threadIdx().y] += (BF[blockIdx().z,tidx,tidy]*dω1dω2[tidx,tidy])
tidy += 32
end
tidx += 32
end
# ---------------------- REDUCE -------------------------
sync_threads()
# ------------------------------------------------------
#QUESTION the warp sequence? how to avoid digergence ?
if threadIdx().x<=16
@inbounds sdata[threadIdx().x,threadIdx().y] += sdata[threadIdx().x+16,threadIdx().y]
end
sync_threads()
if threadIdx().x<=8
@inbounds sdata[threadIdx().x,threadIdx().y] += sdata[threadIdx().x+8,threadIdx().y]
end
sync_threads()
if threadIdx().x<=4
@inbounds sdata[threadIdx().x,threadIdx().y] += sdata[threadIdx().x+4,threadIdx().y]
end
sync_threads()
if threadIdx().x<=2
@inbounds sdata[threadIdx().x,threadIdx().y] += sdata[threadIdx().x+2,threadIdx().y]
end
sync_threads()
if threadIdx().x==1
@inbounds sdata[threadIdx().x,threadIdx().y] += sdata[threadIdx().x+1,threadIdx().y]
end
sync_threads()
if threadIdx().x==1 && threadIdx().y<=16
@inbounds sdata[1,threadIdx().y] += sdata[1,threadIdx().y+16]
sync_threads()
@inbounds sdata[1,threadIdx().y] += sdata[1,threadIdx().y+8]
sync_threads()
@inbounds sdata[1,threadIdx().y] += sdata[1,threadIdx().y+4]
sync_threads()
@inbounds sdata[1,threadIdx().y] += sdata[1,threadIdx().y+2]
sync_threads()
@inbounds sdata[1,threadIdx().y] += sdata[1,threadIdx().y+1]
sync_threads()
end
sync_threads()
# ---------------------- WRITE -------------------------
if threadIdx().x==1 && threadIdx().y==1
@inbounds OUT[blockIdx().z] = sdata[1,1]
end
end
nothing
end
```