Hi Tim,
thanks for the quick response!
I also tried this myself and reached the same conclusion.
The operation I mentioned earlier is actually followed by further CUDAnative calls. Here is the entire piece of code (it computes a frequency sum of two Green’s functions):
count = 0
for (iq,q_) in enumerate(Π_MKgrid.GRID[:])
    for (iω,ω_) in enumerate(Π_ω)
        println("count=",count)
        t0 = time_ns()
        #----------------------BLOCK 1------------------------------
        t0 = time_ns()
        ωstart = (ω_+G2.ω[1]) - G2ext.ω[1] + 1
        ωend = (ω_+G2.ω[end]) - G2ext.ω[1] + 1
        IND = indG2ext[ωstart:ωend,transformed_index(q_,G1.K)][:]
        t1 = time_ns()
        #----------------------BLOCK 2------------------------------
        t1 = time_ns()
        cu_Ind2 = CuArray(IND) # THE TROUBLESOME OPERATION
        t2 = time_ns()
        #----------------------BLOCK 3------------------------------
        t2 = time_ns()
        println("time1 = ", (t1-t0)/1e6, "ms")
        println("time2 = ", (t2-t1)/1e6, "ms")
        t2 = time_ns()
        #----------------------BLOCK 4------------------------------
        t2 = time_ns()
        @cuda threads=(1024,1,1) blocks=(NG,dimX,dimY) shmem=1024 TrAXBY_CUDA!(cuTMP,cu_G1.ωk,cu_G2ext.ωk,cu_Ind2, IV1,JV1,VV1,IV2,JV2,VV2,CTRL_PARAM)
        t3 = time_ns()
        #----------------------BLOCK 5------------------------------
        t3 = time_ns()
        println("time3 = ", (t3-t2)/1e6, "ms")
        t3 = time_ns()
        #----------------------BLOCK 6------------------------------
        t3 = time_ns()
        @cuda threads=(1024,1) blocks=(dimX,dimY) shmem=1024 REDUCE_SPECIAL_VER!(iω,iq,NG,dimX,dimY,cuTMP,Π_ωk)
        t4 = time_ns()
        t4 = time_ns()
        println("time4 = ", (t4-t3)/1e6, "ms")
        println("------------------------------")
        count += 1
    end
end
It produces a steady timing profile like the following:
count=0
time1 = 0.025915ms
time2 = 0.41788ms
time3 = 13158.3349ms
time4 = 18897.119132ms
------------------------------
count=1
time1 = 0.028788ms
time2 = 3.650909ms
time3 = 0.032666ms
time4 = 0.015446ms
------------------------------
count=2
time1 = 0.017196ms
time2 = 18585.719301ms
time3 = 0.038298ms
time4 = 0.0162ms
------------------------------
count=3
time1 = 0.025236ms
time2 = 18517.259823ms
time3 = 0.035288ms
time4 = 0.015438ms
------------------------------
count=4
time1 = 0.025769ms
time2 = 18517.42081ms
time3 = 0.035106ms
time4 = 0.01575ms
------------------------------
...
...
What troubles me is the line with the comment # THE TROUBLESOME OPERATION, which corresponds to the timing output time2 = ... ms.
I over-simplified the problem in the original question. You can see that it is anomalous: after the first couple of iterations, time2 is always ~18000 ms, which is more than four orders of magnitude longer than the ~0.4 ms measured in the first iteration.
I suspect that my timing method is wrong.
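One possibility worth testing, stated as an assumption rather than a confirmed diagnosis: GPU kernel launches are asynchronous, so time3 and time4 may only measure the launch, and the kernel's actual run time may be charged to the next operation that has to wait for the device, which here would be the CuArray(IND) upload of the following iteration. The sketch below is a CPU-only analogy of that effect in plain Julia. The names fake_kernel and elapsed_ms are hypothetical, and a Task stands in for a kernel launch.

```julia
# CPU-only analogy; no GPU or CUDA package needed.
# `fake_kernel` and `elapsed_ms` are hypothetical names for illustration.
fake_kernel() = sleep(0.2)                        # stand-in for a long-running kernel
elapsed_ms(t_start) = (time_ns() - t_start) / 1e6

# 1) Time only the asynchronous launch: it returns almost immediately.
t = time_ns()
task = @async fake_kernel()
launch_ms = elapsed_ms(t)

# 2) The next blocking operation (here `wait`) absorbs the run time.
t = time_ns()
wait(task)
blocking_ms = elapsed_ms(t)

# 3) Launch and wait inside one timed region: the cost lands where it belongs.
t = time_ns()
wait(@async fake_kernel())
synced_ms = elapsed_ms(t)

println("launch only      = ", launch_ms, "ms")
println("next blocking op = ", blocking_ms, "ms")
println("launch + wait    = ", synced_ms, "ms")
```

If this picture applies to the loop above, placing a device synchronization before reading t3 and t4 should move the ~18000 ms from time2 into time3 or time4. I believe CUDAdrv.synchronize() is the call for that, but treat the exact name as an assumption to check against the installed version.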
Strangely, if I comment out the @cuda ... call in BLOCK 4 (an essential calculation that follows # THE TROUBLESOME OPERATION and takes its result cu_Ind2 as input), the timing of # THE TROUBLESOME OPERATION returns to a normal elapsed time (I haven’t copied that timing profile here).
I am so confused …