CUDAnative: performance drop after several time steps

Hi all, I am building a time-marching simulation using CUDAnative, and I found that after several time steps the performance drops significantly.
If I synchronize() at some point, the performance comes back, but the synchronize() call itself takes a long time.

I have reduced my problem to the simple case in the following code.

Define a simple matrix-addition kernel:

function madd(a, b, c)
    i = threadIdx().x    # row index: one thread per row within a block
    j = blockIdx().x     # column index: one block per column
    c[i, j] = a[i, j] + b[i, j]
    return
end

Initialize the data on the GPU device:

using CUDAnative, CuArrays

d_a = cu(rand(2^10, 2^10))
d_b = cu(rand(2^10, 2^10))
d_c = cu(zeros(2^10, 2^10))

The main loop is here:

for timestep = 1:100000    # time marching
    @time begin            # @time to measure how long each time step takes
        for j = 1:50       # just to add more work to each time step
            @cuda blocks=1024 threads=1024 madd(d_a, d_b, d_c)
            d_c .= 0
        end
        if mod(timestep, 60) == 1
            synchronize()
            println("syn()")
        end
    end
end

The output is:

syn()
 19.936656 seconds (25.94 M allocations: 1.266 GiB, 4.47% gc time)
  0.001374 seconds (5.25 k allocations: 174.219 KiB)
  0.001409 seconds (5.25 k allocations: 174.219 KiB)
  0.001331 seconds (5.25 k allocations: 174.219 KiB)
  0.001368 seconds (5.25 k allocations: 174.219 KiB)
  0.001205 seconds (5.25 k allocations: 174.219 KiB)
  0.001572 seconds (5.25 k allocations: 174.219 KiB)
  0.001526 seconds (5.25 k allocations: 174.219 KiB)
  0.001383 seconds (5.25 k allocations: 174.219 KiB)
  0.001252 seconds (5.25 k allocations: 174.219 KiB)
                                 .
                                 .
                                 .
  0.001218 seconds (5.25 k allocations: 174.219 KiB)
  0.001629 seconds (5.25 k allocations: 174.219 KiB)
  0.001264 seconds (5.25 k allocations: 174.219 KiB)
  0.002820 seconds (5.25 k allocations: 174.219 KiB)
  0.003801 seconds (5.25 k allocations: 174.219 KiB)
  0.798378 seconds (5.25 k allocations: 174.219 KiB)
  1.605217 seconds (5.25 k allocations: 174.219 KiB)
  1.068703 seconds (5.25 k allocations: 174.219 KiB)
  1.565054 seconds (5.25 k allocations: 174.219 KiB)
  1.497024 seconds (5.25 k allocations: 174.219 KiB)
  1.217929 seconds (5.25 k allocations: 174.219 KiB)
syn()
 34.657185 seconds (5.43 k allocations: 184.656 KiB, 0.02% gc time)
  0.001539 seconds (5.25 k allocations: 174.219 KiB)
                                 .
                                 .
                                 .
  

The output shows that after several time steps (around 50 on my laptop), the cost of each time step increases dramatically, from O(ms) to O(s).
If I then execute synchronize(), which takes a really long time, the performance comes back, but it hardly seems worth it.

How can I improve the performance here? Thank you!

This is expected. You’re queuing two asynchronous operations per inner loop iteration (one kernel, one broadcast), without waiting for the results. After a while, the GPU will be saturated and execution will slow down. Synchronizing just makes the CPU wait until all previously launched operations have finished, after which you can queue new operations before saturating the GPU again, giving you the impression that “performance is back”.

That said, you probably shouldn’t be launching so many kernels; try to aggregate the kernels from that inner loop into a single kernel and your performance will be much better. Launching kernels is not free.
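
For the example above, one way to do that is to move the repetition into the kernel itself, so each time step issues a single launch instead of 50 launches plus 50 broadcasts. A minimal sketch, where the madd_repeated name and the nrep argument are made up for illustration:

function madd_repeated(a, b, c, nrep)
    i = threadIdx().x
    j = blockIdx().x
    for _ in 1:nrep
        c[i, j] = a[i, j] + b[i, j]   # the addition from madd
        c[i, j] = 0                   # the per-element reset that d_c .= 0 used to do
    end
    return
end

# One launch per time step instead of 50:
@cuda blocks=1024 threads=1024 madd_repeated(d_a, d_b, d_c, 50)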

Finally, to understand what the GPU is doing and not be confused by the asynchronicity, try running under nvprof or nvvp to visualize the execution.
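
For example, running something like nvprof julia script.jl from a shell (with script.jl standing in for your program) prints per-kernel timings when the process exits, and nvvp shows the same information on a timeline, which makes the queuing behavior easy to see.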


Thank you Tim!

So what you mean is: after launching a GPU kernel, the code executes the next line immediately, without waiting for the kernel to finish computing.

That is to say, in the following example, if “function2” requires the output1 produced by “function1” as input, the GPU may run the two functions at the same time, so the result could be wrong because output1 may not be updated yet when function2 reads it.

Am I correct?

for timestep=1:100
    @cuda blocks=1024 threads=1024 function1(input1,output1)
    @cuda blocks=1024 threads=1024 function2(output1,output2)
end

Correct.

No, because you’re launching these kernels on the same stream (the default one), where operations will execute in order on the GPU (but still asynchronously wrt. the CPU). If you want these operations to possibly execute concurrently, you would launch them on separate streams.

I went to the CUDAnative.jl doc page and searched for ‘stream’, but that did not return anything. How do I launch kernels on separate streams?

There’s some docs in the docstrings of @cuda:

julia> using CUDAnative

help?> @cuda
  @cuda [kwargs...] func(args...)

  High-level interface for executing code on a GPU. 

...
  Several keyword arguments are supported that influence the behavior of @cuda.

...

    •    arguments that influence kernel launch: see CUDAnative.HostKernel and CUDAnative.DeviceKernel

help?> CUDAnative.HostKernel
  (::HostKernel)(args...; kwargs...)
  (::DeviceKernel)(args...; kwargs...)

  Low-level interface to call a compiled kernel, passing GPU-compatible arguments in args. 

...

    •    stream (defaults to the default stream)


julia> using CUDAdrv

help?> CuStream

  CuStream(flags=STREAM_DEFAULT)

  Create a CUDA stream.

Also have a look at the CUDAnative tests.
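
Putting those pieces together, a minimal sketch of launching two kernels on separate streams (the kernel names, arguments, and launch sizes are placeholders, and this assumes the two kernels are independent of each other, since work on different streams may execute concurrently):

using CUDAnative, CUDAdrv

s1 = CuStream()
s2 = CuStream()

# Independent kernels launched on different streams may overlap on the GPU.
@cuda blocks=1024 threads=1024 stream=s1 kernel1(a1, b1)
@cuda blocks=1024 threads=1024 stream=s2 kernel2(a2, b2)

synchronize()   # wait for all outstanding work before using the results on the CPU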
