Some CUDA functions suddenly become very slow

I am new to GPU computation and I want to start with Julia.
I am using CUDA.jl for this and wrote some code following the documentation.

However, the following simple code takes a very long time:

module TestMyCUDA

using CUDA

function test()
  CUDA.memory_status()

  num_c = 100
  num_x = 512
  @time array1 = CuArray{Float32}(rand(Float64, num_c, 3))          # upload a small host array to the GPU
  @time array2 = CUDA.zeros(Complex{Float32}, num_x, num_x, num_x)  # allocate the lattice on the GPU

  @time @cuda dosomething!(array2, array1, num_x)
  return nothing
end

function dosomething!(array2, array1, num_x)
  index  = (blockIdx().x - 1) * blockDim().x + threadIdx().x
  stride = gridDim().x * blockDim().x

  # grid-stride loop over the num_x^3 elements, recovering (i, j, k) from the linear index N
  for N = index:stride:num_x^3
    i = (N - 1) ÷ num_x^2 + 1
    j = ((N - 1) % num_x^2) ÷ num_x + 1
    k = ((N - 1) % num_x^2) % num_x + 1

    array2[i, j, k] = 1.0f0 + 1.0f0im
  end
  return nothing
end
end

import .TestMyCUDA
TestMyCUDA.test()

In this example code, I do only two things: allocate two CuArrays and assign a value to array2.

It runs fine the first time:

Effective GPU memory usage: 19.83% (3.172 GiB/16.000 GiB)
Memory pool usage: 2.000 GiB (2.031 GiB reserved)
  0.000130 seconds (14 allocations: 4.141 KiB)
  0.014412 seconds (851 allocations: 33.094 KiB, 82.98% compilation time)
  0.371885 seconds (86.76 k allocations: 7.021 MiB, 4.08% gc time, 19.74% compilation time)

However, when running it a second time, the first allocation takes a surprisingly long time:

Effective GPU memory usage: 26.08% (4.172 GiB/16.000 GiB)
Memory pool usage: 2.000 GiB (3.031 GiB reserved)
 63.046641 seconds (15 allocations: 4.156 KiB)
  0.000133 seconds (42 allocations: 1.766 KiB)
  0.579747 seconds (67.56 k allocations: 5.685 MiB, 4.38% compilation time)

Interestingly, it is not always the allocation that takes a long time.
If I insert some other code, the dosomething! call becomes very slow instead and the allocation runs at normal speed.

Does anyone know why this happens?

CUDA APIs are asynchronous, so your initial timing only measures the time it takes to launch the kernel. The kernel actually does take 60 s to finish, because it's implemented badly: you're iterating over all the elements on a single GPU thread, which is the wrong way to use GPUs. Please read the CUDA.jl introductory tutorial; this exact pitfall is explained there: Introduction · CUDA.jl
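For illustration, here is a minimal sketch of launching the same grid-stride kernel across many threads and blocks (threads = 256 is just an example value, not a tuned choice):

threads = 256
blocks  = cld(num_x^3, threads)   # enough blocks so every element gets its own thread
CUDA.@sync @cuda threads=threads blocks=blocks dosomething!(array2, array1, num_x)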

Thank you!

I think you are right that the “delay” comes from the asynchronous nature of CUDA.

Actually, in my real code I am not using only one GPU thread.
But my real code is still very heavy even with a lot of threads.

I made a mistake by simply using @time to find the bottleneck.

You can use CUDA.@time instead, which does the synchronization for you, or just make it a habit to add CUDA.@sync to time-measurement utilities (be it @time, @btime, @benchmark, etc.). Also consider using CUDA.@profile, which would have shown you the actual execution time of your kernel.
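For example, applied to the kernel launch from the original post:

CUDA.@time @cuda dosomething!(array2, array1, num_x)        # synchronizes the GPU before reporting
@time CUDA.@sync @cuda dosomething!(array2, array1, num_x)  # same idea with plain @time
CUDA.@profile @cuda dosomething!(array2, array1, num_x)     # reports the actual kernel execution time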