Some CUDA functions suddenly become very slow

I am new to GPU computation and I want to start with Julia.
I am using CUDA.jl for this and wrote some code following the documentation.

However, the following simple code takes a very long time:

module TestMyCUDA

using CUDA

function test()
  CUDA.memory_status()

  num_c = 100
  num_x = 512
  @time array1 = CuArray{Float32}(rand(Float64, num_c, 3))          # upload a small host array to the GPU
  @time array2 = CUDA.zeros(Complex{Float32}, num_x, num_x, num_x)  # allocate the lattice on the GPU

  @time @cuda dosomething!(array2, array1, num_x)
  return nothing
end

function dosomething!(array2, array1, num_x)
  index  = (blockIdx().x - 1) * blockDim().x + threadIdx().x
  stride = gridDim().x * blockDim().x

  # grid-stride loop over the num_x^3 elements, recovering (i, j, k) from the linear index N
  for N = index:stride:num_x^3
    i = (N - 1) ÷ num_x^2 + 1
    j = ((N - 1) % num_x^2) ÷ num_x + 1
    k = ((N - 1) % num_x^2) % num_x + 1

    array2[i, j, k] = 1.0f0 + 1.0f0im
  end
  return nothing
end
end

import .TestMyCUDA
TestMyCUDA.test()

In this example code, I do only two things: allocate two CuArrays and assign a value to array2.

It runs fine the first time:

Effective GPU memory usage: 19.83% (3.172 GiB/16.000 GiB)
Memory pool usage: 2.000 GiB (2.031 GiB reserved)
  0.000130 seconds (14 allocations: 4.141 KiB)
  0.014412 seconds (851 allocations: 33.094 KiB, 82.98% compilation time)
  0.371885 seconds (86.76 k allocations: 7.021 MiB, 4.08% gc time, 19.74% compilation time)

However, when running it a second time, the first allocation takes a surprisingly long time:

Effective GPU memory usage: 26.08% (4.172 GiB/16.000 GiB)
Memory pool usage: 2.000 GiB (3.031 GiB reserved)
 63.046641 seconds (15 allocations: 4.156 KiB)
  0.000133 seconds (42 allocations: 1.766 KiB)
  0.579747 seconds (67.56 k allocations: 5.685 MiB, 4.38% compilation time)

Interestingly, it is not always the allocation that takes a long time.
If I insert some other code, the dosomething! call becomes very slow instead and the allocation runs at normal speed.

Does anyone know why this happens?

CUDA APIs are asynchronous, so your initial timing only measures the time it takes to launch the kernel. The kernel actually does take 60 s to finish, because it's implemented badly: you're iterating over all the elements on a single GPU thread, which is the wrong way to use GPUs. Please read the CUDA.jl introductory tutorial; this exact pitfall is explained there: Introduction · CUDA.jl
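For illustration, here is a minimal sketch of launching the same grid-stride kernel across many threads and blocks (threads = 256 is just an example value, not a tuned choice):

threads = 256
blocks  = cld(num_x^3, threads)   # enough blocks so every element gets its own thread
CUDA.@sync @cuda threads=threads blocks=blocks dosomething!(array2, array1, num_x)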

Thank you!

I think you are right that the “delay” comes from the asynchronous nature of CUDA.

Actually, in my real code I am not using only one GPU thread.
But my real code is still very heavy even with a lot of threads.

I made a mistake by simply using @time to find the bottleneck.

You can use CUDA.@time instead, which does the synchronization for you, or just make it a habit to add CUDA.@sync to time-measurement utilities (be it @time, @btime, @benchmark, etc.). Also consider using CUDA.@profile, which would have shown you the actual execution time of your kernel.
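For example, applied to the kernel launch from the original post:

CUDA.@time @cuda dosomething!(array2, array1, num_x)        # synchronizes the GPU before reporting
@time CUDA.@sync @cuda dosomething!(array2, array1, num_x)  # same idea with plain @time
CUDA.@profile @cuda dosomething!(array2, array1, num_x)     # reports the actual kernel execution time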