Hi all, I am building a time-marching simulation using CUDAnative.jl,
and I found that after several time steps the performance drops significantly.
If I synchronize() all the work at some point, the performance recovers, but the synchronization itself takes a long time.
I have reduced my problem to the following minimal example.
First, define a simple matrix-addition kernel:
function madd(a, b, c)
    i = threadIdx().x   # row index: thread within the block
    j = blockIdx().x    # column index: block id
    c[i, j] = a[i, j] + b[i, j]
    return
end
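As an aside, a bounds-guarded variant of the same kernel (a sketch; the guard and the name madd_checked are my additions and are not required for the exact 1024×1024 launch below) would look like:

```julia
# Sketch: same kernel with an explicit bounds check, so the launch
# configuration does not have to match the array size exactly.
function madd_checked(a, b, c)
    i = threadIdx().x   # row within the block
    j = blockIdx().x    # column from the block id
    if i <= size(c, 1) && j <= size(c, 2)
        @inbounds c[i, j] = a[i, j] + b[i, j]
    end
    return
end
```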
Initialize the data on the GPU device:
d_a = cu(rand(2^10, 2^10))
d_b = cu(rand(2^10, 2^10))
d_c = cu(zeros(2^10, 2^10))
The main loop is:
for timestep = 1:100000   # time marching
    @time begin           # time each step
        for j = 1:50      # extra work to increase the complexity of each step
            @cuda blocks=1024 threads=1024 madd(d_a, d_b, d_c)
            d_c .= 0
        end
        if mod(timestep, 60) == 1
            synchronize()
            println("syn()")
        end
    end
end
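For comparison, a variant that synchronizes once per time step, so that each @time measures the GPU work actually performed rather than only the launch overhead, could look like this (a sketch, reusing the same madd, d_a, d_b, d_c as above):

```julia
# Sketch: synchronize every step, so each @time reflects the real GPU
# execution time instead of just the cost of queueing the launches.
for timestep = 1:100000
    @time begin
        for j = 1:50
            @cuda blocks=1024 threads=1024 madd(d_a, d_b, d_c)
            d_c .= 0
        end
        synchronize()   # wait for all queued kernels to finish
    end
end
```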
The output looks like:
syn()
19.936656 seconds (25.94 M allocations: 1.266 GiB, 4.47% gc time)
0.001374 seconds (5.25 k allocations: 174.219 KiB)
0.001409 seconds (5.25 k allocations: 174.219 KiB)
0.001331 seconds (5.25 k allocations: 174.219 KiB)
0.001368 seconds (5.25 k allocations: 174.219 KiB)
0.001205 seconds (5.25 k allocations: 174.219 KiB)
0.001572 seconds (5.25 k allocations: 174.219 KiB)
0.001526 seconds (5.25 k allocations: 174.219 KiB)
0.001383 seconds (5.25 k allocations: 174.219 KiB)
0.001252 seconds (5.25 k allocations: 174.219 KiB)
.
.
.
0.001218 seconds (5.25 k allocations: 174.219 KiB)
0.001629 seconds (5.25 k allocations: 174.219 KiB)
0.001264 seconds (5.25 k allocations: 174.219 KiB)
0.002820 seconds (5.25 k allocations: 174.219 KiB)
0.003801 seconds (5.25 k allocations: 174.219 KiB)
0.798378 seconds (5.25 k allocations: 174.219 KiB)
1.605217 seconds (5.25 k allocations: 174.219 KiB)
1.068703 seconds (5.25 k allocations: 174.219 KiB)
1.565054 seconds (5.25 k allocations: 174.219 KiB)
1.497024 seconds (5.25 k allocations: 174.219 KiB)
1.217929 seconds (5.25 k allocations: 174.219 KiB)
syn()
34.657185 seconds (5.43 k allocations: 184.656 KiB, 0.02% gc time)
0.001539 seconds (5.25 k allocations: 174.219 KiB)
.
.
.
The output shows that after several time steps (around 50 on my laptop), the cost of each time step increases dramatically from O(ms) to O(s).
If I then execute synchronize(), which itself takes a very long time, the performance recovers, but it does not seem worth the cost.
How can I improve the performance here? Thank you!