Hello!
When I run simple math operations in a loop, CUDA.jl slows down after some number of iterations. In that regime, most of the execution time is spent in cuLaunchKernel. I thought a custom CUDA kernel would cure the problem, but I got similar performance when running the kernel. The code is as follows:
using CUDA
function math2!(D, E, F)
    @. F = D * E + D / E - D * E + D^2 - E^2 + D / D - E / E
    return
end
F = CUDA.zeros(4096, 4096)
D = CUDA.rand(4096, 4096)
E = CUDA.rand(4096, 4096)
CUDA.@profile for iter = 1:10000
    math2!(D, E, F)
end
Profiler ran for 63.42 s, capturing 3060002 events.
Host-side activity: calling CUDA APIs took 47.2 s (74.42% of the trace)
┌──────────┬────────────┬───────┬────────────────────────────────┬────────────────┐
│ Time (%) │ Total time │ Calls │ Time distribution              │ Name           │
├──────────┼────────────┼───────┼────────────────────────────────┼────────────────┤
│   74.10% │     47.0 s │ 10000 │ 4.7 ms ± 17.59 ( 0.0 ‥ 390.86) │ cuLaunchKernel │
└──────────┴────────────┴───────┴────────────────────────────────┴────────────────┘
And the custom CUDA kernel is as follows:
function math1!(D, E, F)
    ix = (blockIdx().x-1) * blockDim().x + threadIdx().x
    iy = (blockIdx().y-1) * blockDim().y + threadIdx().y
    F[ix,iy] = D[ix,iy] * E[ix,iy] + D[ix,iy] / E[ix,iy] - D[ix,iy] * E[ix,iy] +
               D[ix,iy]^2 - E[ix,iy]^2 + D[ix,iy] / D[ix,iy] - E[ix,iy] / E[ix,iy]
    return
end
threads = (32, 32)
blocks = (128, 128)
nx, ny = threads[1]*blocks[1], threads[2]*blocks[2]
F = CUDA.zeros(nx, ny)
D = CUDA.rand(nx, ny)
E = CUDA.rand(nx, ny)
CUDA.@profile for iter = 1:10000
    @cuda blocks=blocks threads=threads math1!(D, E, F)
end
Profiler ran for 28.22 s, capturing 180002 events.
Host-side activity: calling CUDA APIs took 16.5 s (58.47% of the trace)
┌──────────┬────────────┬───────┬────────────────────────────────┬────────────────┐
│ Time (%) │ Total time │ Calls │ Time distribution              │ Name           │
├──────────┼────────────┼───────┼────────────────────────────────┼────────────────┤
│   58.44% │    16.49 s │ 10000 │ 1.65 ms ± 9.28 ( 0.0 ‥ 303.43) │ cuLaunchKernel │
└──────────┴────────────┴───────┴────────────────────────────────┴────────────────┘
Please note that a small number of iterations finishes almost instantly. For example, it takes around 7 ms to compute 1000 iterations.
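For anyone who wants to reproduce the short-loop timing, here is a minimal sketch of how it could be split into two parts. It relies on the fact that kernel launches in CUDA.jl are asynchronous with respect to the host, so the loop can return before the GPU has finished the queued work. The helper name `time_launches` and the keyword `N` are hypothetical and not part of my original code, and this is only one possible way to measure it, not a diagnosis.

```julia
using CUDA

# Same broadcast as math2! above.
function math2!(D, E, F)
    @. F = D * E + D / E - D * E + D^2 - E^2 + D / D - E / E
    return
end

# Hypothetical helper (not in the original post): run `n` launches and
# report host-side loop time separately from the time spent waiting
# for the GPU to finish the queued work.
function time_launches(n; N = 4096)
    D = CUDA.rand(N, N)
    E = CUDA.rand(N, N)
    F = CUDA.zeros(N, N)

    math2!(D, E, F)        # warm-up call so compilation is not timed
    CUDA.synchronize()

    # Time until the loop returns on the host.
    t_queue = @elapsed for _ in 1:n
        math2!(D, E, F)
    end

    # Time for the work still queued on the GPU to complete.
    t_drain = @elapsed CUDA.synchronize()

    return t_queue, t_drain
end

t_queue, t_drain = time_launches(1000)
println("host loop: ", t_queue, " s; waiting for GPU: ", t_drain, " s")
```

If the second number is much larger than the first, the short-loop figure mostly reflects queuing the launches rather than executing them.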
The software and hardware are as follows:
CUDA runtime 12.6, artifact installation
CUDA driver 12.2
NVIDIA driver 536.23.0
CUDA libraries:
- CUBLAS: 12.6.3
- CURAND: 10.3.7
- CUFFT: 11.3.0
- CUSOLVER: 11.7.1
- CUSPARSE: 12.5.4
- CUPTI: 2024.3.2 (API 24.0.0)
- NVML: 12.0.0+536.23
Julia packages:
- CUDA: 5.5.2
- CUDA_Driver_jll: 0.10.3+0
- CUDA_Runtime_jll: 0.15.3+0
Toolchain:
- Julia: 1.11.1
- LLVM: 16.0.6
1 device:
0: NVIDIA GeForce GTX 960 (sm_52, 1.850 GiB / 4.000 GiB available)
Can someone give me a hint about what is causing this?