CUDAnative: performance drop after several time steps

Hi all, I am building a time-marching simulation using CUDAnative, and I found that after several time steps the performance drops significantly.
If I synchronize() at some point, the performance comes back, but the synchronize() call itself takes a long time.

I have reduced my problem to the simple case in the following code.

Define a simple matrix-addition kernel:

function madd(a, b, c)
    i = threadIdx().x    # row index: one thread per row within a block
    j = blockIdx().x     # column index: one block per column
    c[i, j] = a[i, j] + b[i, j]
    return
end

Initialize the data on the GPU device:

using CUDAnative, CuArrays

d_a = cu(rand(2^10, 2^10))
d_b = cu(rand(2^10, 2^10))
d_c = cu(zeros(2^10, 2^10))

The main loop is here:

for timestep = 1:100000    # time marching
    @time begin            # @time to measure how long each time step takes
        for j = 1:50       # just to add more work to each time step
            @cuda blocks=1024 threads=1024 madd(d_a, d_b, d_c)
            d_c .= 0
        end
        if mod(timestep, 60) == 1
            synchronize()
            println("syn()")
        end
    end
end

The output is:

syn()
 19.936656 seconds (25.94 M allocations: 1.266 GiB, 4.47% gc time)
  0.001374 seconds (5.25 k allocations: 174.219 KiB)
  0.001409 seconds (5.25 k allocations: 174.219 KiB)
  0.001331 seconds (5.25 k allocations: 174.219 KiB)
  0.001368 seconds (5.25 k allocations: 174.219 KiB)
  0.001205 seconds (5.25 k allocations: 174.219 KiB)
  0.001572 seconds (5.25 k allocations: 174.219 KiB)
  0.001526 seconds (5.25 k allocations: 174.219 KiB)
  0.001383 seconds (5.25 k allocations: 174.219 KiB)
  0.001252 seconds (5.25 k allocations: 174.219 KiB)
                                 .
                                 .
                                 .
  0.001218 seconds (5.25 k allocations: 174.219 KiB)
  0.001629 seconds (5.25 k allocations: 174.219 KiB)
  0.001264 seconds (5.25 k allocations: 174.219 KiB)
  0.002820 seconds (5.25 k allocations: 174.219 KiB)
  0.003801 seconds (5.25 k allocations: 174.219 KiB)
  0.798378 seconds (5.25 k allocations: 174.219 KiB)
  1.605217 seconds (5.25 k allocations: 174.219 KiB)
  1.068703 seconds (5.25 k allocations: 174.219 KiB)
  1.565054 seconds (5.25 k allocations: 174.219 KiB)
  1.497024 seconds (5.25 k allocations: 174.219 KiB)
  1.217929 seconds (5.25 k allocations: 174.219 KiB)
syn()
 34.657185 seconds (5.43 k allocations: 184.656 KiB, 0.02% gc time)
  0.001539 seconds (5.25 k allocations: 174.219 KiB)
                                 .
                                 .
                                 .
  

The output shows that after several time steps (around 50 on my laptop), the cost of each time step increases dramatically, from O(ms) to O(s).
If I then execute synchronize(), which takes a really long time, the performance comes back, but it hardly seems worth it.

How can I improve the performance here? Thank you!

This is expected. You’re queuing two asynchronous operations per inner loop iteration (one kernel, one broadcast), without waiting for the results. After a while, the GPU will be saturated and execution will slow down. Synchronizing just makes the CPU wait until all previously launched operations have finished, after which you can queue new operations before saturating the GPU again, giving you the impression that “performance is back”.

That said, you probably shouldn’t be launching so many kernels; try to aggregate the kernels from that inner loop into a single kernel and your performance will be much better. Launching kernels is not free.
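
For the example above, one way to do that is to move the repetition into the kernel itself, so each time step issues a single launch instead of 50 launches plus 50 broadcasts. A minimal sketch, where the madd_repeated name and the nrep argument are made up for illustration:

function madd_repeated(a, b, c, nrep)
    i = threadIdx().x
    j = blockIdx().x
    for _ in 1:nrep
        c[i, j] = a[i, j] + b[i, j]   # the addition from madd
        c[i, j] = 0                   # the per-element reset that d_c .= 0 used to do
    end
    return
end

# One launch per time step instead of 50:
@cuda blocks=1024 threads=1024 madd_repeated(d_a, d_b, d_c, 50)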

Finally, to understand what the GPU is doing and not be confused by the asynchronicity, try running under nvprof or nvvp to visualize the execution.
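
For example, running something like nvprof julia script.jl from a shell (with script.jl standing in for your program) prints per-kernel timings when the process exits, and nvvp shows the same information on a timeline, which makes the queuing behavior easy to see.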


Thank you Tim!

So what you mean is: after launching a GPU kernel, the code executes the next line immediately, without waiting for the kernel to finish computing.

That is to say, in the following example, if “function2” requires the output1 produced by “function1” as input, the GPU may run the two functions at the same time, so the result could be wrong because output1 may not be updated yet when function2 reads it.

Am I correct?

for timestep=1:100
    @cuda blocks=1024 threads=1024 function1(input1,output1)
    @cuda blocks=1024 threads=1024 function2(output1,output2)
end

Correct.

No, because you’re launching these kernels on the same stream (the default one), where operations will execute in order on the GPU (but still asynchronously wrt. the CPU). If you want these operations to possibly execute concurrently, you would launch them on separate streams.

I went to the CUDAnative.jl doc page and searched for ‘stream’, but that did not return anything. How do I launch kernels on separate streams?

There’s some docs in the docstrings of @cuda:

julia> using CUDAnative

help?> @cuda
  @cuda [kwargs...] func(args...)

  High-level interface for executing code on a GPU. 

...
  Several keyword arguments are supported that influence the behavior of @cuda.

...

    •    arguments that influence kernel launch: see CUDAnative.HostKernel and CUDAnative.DeviceKernel

help?> CUDAnative.HostKernel
  (::HostKernel)(args...; kwargs...)
  (::DeviceKernel)(args...; kwargs...)

  Low-level interface to call a compiled kernel, passing GPU-compatible arguments in args. 

...

    •    stream (defaults to the default stream)


julia> using CUDAdrv

help?> CuStream

  CuStream(flags=STREAM_DEFAULT)

  Create a CUDA stream.

Also have a look at the CUDAnative tests.
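
Putting those pieces together, a minimal sketch of launching two kernels on separate streams (the kernel names, arguments, and launch sizes are placeholders, and this assumes the two kernels are independent of each other, since work on different streams may execute concurrently):

using CUDAnative, CUDAdrv

s1 = CuStream()
s2 = CuStream()

# Independent kernels launched on different streams may overlap on the GPU.
@cuda blocks=1024 threads=1024 stream=s1 kernel1(a1, b1)
@cuda blocks=1024 threads=1024 stream=s2 kernel2(a2, b2)

synchronize()   # wait for all outstanding work before using the results on the CPU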
