Global memory behavior explained

Can anyone explain the following behavior?

I am using CUDA and have a nested set of kernels, one first-level kernel and three second-level kernels. I’m performing three summations/reductions and storing the results in a multi-dimensional array of global and shared memory. The following lines are print statements in the top level kernel. The first line, temp1, has the values: thread index X, thread index Y, the value in shared memory, the value in a view of global memory, and the value in global memory. Lines two and three are the two other calculated values. So far, so good. However, for the temps print statement, the shared memory value of the triplet has not changed, but the two global values have, e.g., -262.088511 8461469.123870 8461469.123870. Note that the temps print statement immediately follows the temp3 print statement.

temp1 : 1 1 -262.088511 -262.088511 -262.088511
temp2 : 1 1 -13786.282061 -13786.282061 -13786.282061
temp3 : 1 1 126786.890137 126786.890137 126786.890137
temps : 1 1 -262.088511 8461469.123870 8461469.123870 -13786.282061 -777701.541176 -777701.541176 126786.890137 1076325.584789 1076325.584789

This problem appears to be a synchronization issue, but using host "synchronize’ and “device_synchronize” doesn’t appear to help. Do I need implement the algorithm using lower level code?

– Paul

Please post a MWE, I’m having a hard time understanding your issue. Do know that shared memory is only shared between threads in a block though, that’s how CUDA works.

2 Likes

Yes, I do know that shared memory is only associated with a block. The problem that I am having is that the global memory is being corrupted or modified before exiting from the parent kernel. I can get the code to return the proper values if I include several print statements, but once I remove them, the problem returns. It appears to be an issue with nested dynamic kernels. I’m still too much of a novice to understand the details of CUDA programming. I’ll see if I can replicate the problem with a MWE.

Ah, that’s a crucial part: https://github.com/JuliaGPU/CUDA.jl/issues/43. I thought I had worked around that ptxas bug, but maybe there’s some part left. However, first run under cuda-memcheck --tool=synccheck, as IIRC this behavior also triggers when you synchronize threads from a divergent point (i.e. a barrier error).

1 Like

My kernel is also performing a reduction or summation as in issue 43. The kernel does now appear to be working properly and may have for some time. I’m just not sure. However, part of the problem is that my kernel calculates complex values and assigns them to a kernel argument. The complex values do not get assigned and therefore the returned values are incorrect, i.e., zeros. This may have been my problem all along. I just assumed that if there are no compiler errors, then complex numbers should work. This does not appear to be the case. The kernels can read complex numbers just fine, but not write to them. If I return pairs of floating point values and generate the complex values on the host, then it works as expected.

1 Like

That should not be happening. Again, please provide some code. If you can demonstrate the issue, we can look into fixing it.

Sorry for the delayed response. Here is an example, although probably not minimal, demonstrating, at least for me, that complex values are not copied to global memory. All values in the final array should be 1. I’ve resorted storing the real and imaginary parts as a floating point array and then converting them to complex values on the host.

– Paul

using CUDA

function reduce!(gtemp)
temp = @cuStaticSharedMem(Float64, 512)

b::UInt32  = blockIdx().x
t::UInt32  = threadIdx().x
j1::UInt32 = (b - 1)*(2*blockDim().x) + t
j2::UInt32 = (b - 1)*(2*blockDim().x) + t + blockDim().x

temp[t] = gtemp[j1] + gtemp[j2]
sync_threads()

#  Sum (reduce) observations in each block
s::UInt32 = div(blockDim().x, 2)
while s > 32
    if t <= s
        temp[t] += temp[t+s]
    end
    s >>= 1
    sync_threads()
end

temp[t] += temp[t+32]
temp[t] += temp[t+16]
temp[t] += temp[t+ 8]
temp[t] += temp[t+ 4]
temp[t] += temp[t+ 2]
temp[t] += temp[t+ 1]

if t == 1
    gtemp[b] = temp[t]
end

return

end

function reduceA!(gtemp)
# temp = @cuDynamicSharedMem(Float64, blockDim().x)
temp = @cuStaticSharedMem(Float64, 512)

b::UInt32  = blockIdx().x
t::UInt32  = threadIdx().x
j1::UInt32 = (b - 1)*(2*blockDim().x) + t
j2::UInt32 = (b - 1)*(2*blockDim().x) + t + blockDim().x

temp[t] = gtemp[j1] + gtemp[j2]
sync_threads()

#  Sum (reduce) observations in each block
s::UInt32 = div(blockDim().x, 2)
while s > 32
    if t <= s
        temp[t] += temp[t+s]
    end
    s >>= 1
    sync_threads()
end
if blockDim().x > 32 && t <= 32
    temp[t] += temp[t+32]
end
if blockDim().x > 16 && t <= 16
    temp[t] += temp[t+16]
end
if blockDim().x >  8 && t <=  8
    temp[t] += temp[t+ 8]
end
if blockDim().x >  4 && t <=  4
    temp[t] += temp[t+ 4]
end
if blockDim().x >  2 && t <=  2
    temp[t] += temp[t+ 2]
end
if blockDim().x >  1 && t <=  1
    temp[t] += temp[t+ 1]
end

if t == 1
    gtemp[b] = temp[t]
end

return

end

function childAdd1!(gtemp, data1) #, data2, data3)

b::UInt32  = blockIdx().x
t::UInt32  = threadIdx().x
j1::UInt32 = (b - 1)*(2*blockDim().x) + t
j2::UInt32 = (b - 1)*(2*blockDim().x) + t + blockDim().x
j3::UInt32 = (b - 1)*blockDim().x + t

gtemp[j3] = data1[j1] + data1[j2]
return

end

function parentKern!(ctemp, gtemp, data1, data2, data3)

x = (blockIdx().x - 1)*blockDim().x + threadIdx().x
y = (blockIdx().y - 1)*blockDim().y + threadIdx().y

threads::UInt32 = 256
T1::UInt32 = threads
B1::UInt32 = div(length(data1), 2*256)
T2::UInt32 = threads
B2::UInt32 = div(length(data1), 4*256)

temp1 = view(gtemp, :, 1, x, y)
@cuda dynamic=true threads=T1 blocks=B1 childAdd1!(temp1, data1)
@cuda dynamic=true threads=T2 blocks=B2 reduce!(temp1)
@cuda dynamic=true threads=div(B2,2) blocks=1 reduceA!(view(temp1,1:B2))

temp2 = view(gtemp, :, 2, x, y)
@cuda dynamic=true threads=T1 blocks=B1 childAdd1!(temp2, data2)
@cuda dynamic=true threads=T2 blocks=B2 reduce!(temp2)
@cuda dynamic=true threads=div(B2,2) blocks=1 reduceA!(view(temp2,1:B2))

temp3 = view(gtemp, :, 3, x, y)
@cuda dynamic=true threads=T1 blocks=B1 childAdd1!(temp3, data3)
@cuda dynamic=true threads=T2 blocks=B2 reduce!(temp3)
@cuda dynamic=true threads=div(B2,2) blocks=1 reduceA!(view(temp3,1:B2))

ctemp = (temp1[1] + temp2[1]*im)/temp3[1]

return

end

ctemp = CuArray(zeros(Complex{Float64}, 4, 8))
gtemp = CuArray(zeros(Float64, 81024, 3, 4, 8))
data1 = CuArray(ones(Float64, 16
1024))
data2 = CuArray(ones(Float64, 161024))
data3 = CuArray(ones(Float64, 16
1024))

@cuda blocks=(1,1) threads=(4,8) parentKern!(ctemp, gtemp, data1, data2, data3)

println(Array(gtemp[1,:,:,:]))
println(Array(ctemp))

Please simplify the example, it’s too complicated to debug and I’d have to reduce it anyway before I can check what’s wrong. Or does the problem disappear when you simplify the example?

Also please use triple backticks to quote code.

I’ll try simplifying it.

The problem that I was having was not with the complex number type, but kernel synchronization. Adding device_synchronize() after each nested kernel ensured that all writes to global memory were complete before exiting a kernel. The data was therefore visible between the lower and higher level kernels. However, I was not able to solve this problem between the highest level kernel and the host. Adding synchronize() after the kernel made no impact. I solved this problem by using UnifiedBuffers, which also significantly improved performance and enabled the GPU to run at nearly 100%.

1 Like