CUDA.jl is slowed down after some number of iterations

Alex90 · December 17, 2024, 8:40am

Hello!
When looping simple math operations it was found that CUDA.jl is slowed down after some number of iterations. In this case, most of the execution time is consumed by CuLaunchKernel procedure. I thought that a custom CUDA kernel will cure this problem. However, i got the similar performance result when running the kernel. The code is as follows:

using CUDA
function math2!(D, E, F)
@.    F = D * E + D / E - D * E + D^2 - E^2 + D / D - E / E
    return
end
F       = CUDA.zeros(4096, 4096)
D       = CUDA.rand(4096, 4096)
E       = CUDA.rand(4096, 4096)
CUDA.@profile for iter = 1:10000
math2!(D, E, F)
end
Profiler ran for 63.42 s, capturing 3060002 events.

Host-side activity: calling CUDA APIs took 47.2 s (74.42% of the trace)
┌──────────┬────────────┬───────┬──────────────────────────────────────┬────────────────┐
│ Time (%) │ Total time │ Calls │ Time distribution                    │ Name           │
├──────────┼────────────┼───────┼──────────────────────────────────────┼────────────────┤
│   74.10% │     47.0 s │ 10000 │    4.7 ms ± 17.59  (   0.0 ‥ 390.86) │ cuLaunchKernel │
└──────────┴────────────┴───────┴──────────────────────────────────────┴────────────────┘

And the custom CUDA kernel is as follows:

function math1!(D, E, F)
    ix = (blockIdx().x-1) * blockDim().x + threadIdx().x
    iy = (blockIdx().y-1) * blockDim().y + threadIdx().y
    F[ix,iy] = D[ix,iy] * E[ix,iy] + D[ix,iy] / E[ix,iy] - D[ix,iy] * E[ix,iy] +
               D[ix,iy]^2 - E[ix,iy]^2 + D[ix,iy] / D[ix,iy] - E[ix,iy] / E[ix,iy]
    return
end
threads = (32, 32)
blocks  = (128, 128)
nx, ny  = threads[1]*blocks[1], threads[2]*blocks[2]
F       = CUDA.zeros(nx, ny)
D       = CUDA.rand(nx, ny)
E       = CUDA.rand(nx, ny)
CUDA.@profile for iter = 1:10000
@cuda blocks=blocks threads=threads math1!(D, E, F)
end
rofiler ran for 28.22 s, capturing 180002 events.

Host-side activity: calling CUDA APIs took 16.5 s (58.47% of the trace)
┌──────────┬────────────┬───────┬──────────────────────────────────────┬────────────────┐
│ Time (%) │ Total time │ Calls │ Time distribution                    │ Name           │
├──────────┼────────────┼───────┼──────────────────────────────────────┼────────────────┤
│   58.44% │    16.49 s │ 10000 │   1.65 ms ± 9.28   (   0.0 ‥ 303.43) │ cuLaunchKernel │
└──────────┴────────────┴───────┴──────────────────────────────────────┴────────────────┘

Note please that low number of iterations is executed in no time. For example, it takes around 7 ms to compute 1000 iterations.
The software and hardware are as follows:

CUDA runtime 12.6, artifact installation
CUDA driver 12.2
NVIDIA driver 536.23.0

CUDA libraries: 
- CUBLAS: 12.6.3
- CURAND: 10.3.7
- CUFFT: 11.3.0
- CUSOLVER: 11.7.1
- CUSPARSE: 12.5.4
- CUPTI: 2024.3.2 (API 24.0.0)
- NVML: 12.0.0+536.23

Julia packages: 
- CUDA: 5.5.2
- CUDA_Driver_jll: 0.10.3+0
- CUDA_Runtime_jll: 0.15.3+0

Toolchain:
- Julia: 1.11.1
- LLVM: 16.0.6

1 device:
  0: NVIDIA GeForce GTX 960 (sm_52, 1.850 GiB / 4.000 GiB available)

Can someone give me a hint in accordance with this issue?

eldee · December 20, 2024, 7:49pm

Hi,

Not really an answer, but by adding CUDA.@sync the individual kernel launches take approximately as long when using 1000 and 10000 iterations. So maybe there is some implicit synchronisation going on, perhaps because the scheduler has difficulty managing resources at 10000 (not necessarily sequential) iterations.

Profiling

julia> CUDA.@profile for iter = 1:1000
           @cuda blocks=blocks threads=threads math1!(D, E, F)
       end
Profiler ran for 1.06 s, capturing 30022 events.

Host-side activity: calling CUDA APIs took 6.12 ms (0.58% of the trace)
┌──────────┬────────────┬───────┬────────────────────────────────────┬─────────────────────┐
│ Time (%) │ Total time │ Calls │ Time distribution                  │ Name                │
├──────────┼────────────┼───────┼────────────────────────────────────┼─────────────────────┤
│    0.30% │    3.18 ms │  1000 │   3.18 µs ± 2.44   (   2.3 ‥ 31.8) │ cuLaunchKernel      │
│    0.08% │   848.9 µs │     1 │                                    │ cuModuleLoadDataEx  │
│    0.00% │    35.7 µs │     1 │                                    │ cuModuleGetFunction │
│    0.00% │     5.6 µs │     1 │                                    │ cuCtxSynchronize    │
└──────────┴────────────┴───────┴────────────────────────────────────┴─────────────────────┘

Device-side activity: GPU was busy for 763.61 ms (72.12% of the trace)
┌──────────┬────────────┬───────┬─────────────────────────────────────────┬─────────────────────────────────────────────
│ Time (%) │ Total time │ Calls │ Time distribution                       │ Name                                       ⋯
├──────────┼────────────┼───────┼─────────────────────────────────────────┼─────────────────────────────────────────────
│   72.12% │  763.61 ms │  1000 │ 763.61 µs ± 1862.44 (542.91 ‥ 28205.17) │ _Z6math1_13CuDeviceArrayI7Float32Li2ELi1EE ⋯
└──────────┴────────────┴───────┴─────────────────────────────────────────┴─────────────────────────────────────────────
                                                                                                        1 column omitted


julia> CUDA.@profile for iter = 1:10000
           @cuda blocks=blocks threads=threads math1!(D, E, F)
       end
Profiler ran for 5.88 s, capturing 300002 events.

Host-side activity: calling CUDA APIs took 3.74 s (63.66% of the trace)
┌──────────┬────────────┬───────┬─────────────────────────────────────────┬────────────────┐
│ Time (%) │ Total time │ Calls │ Time distribution                       │ Name           │
├──────────┼────────────┼───────┼─────────────────────────────────────────┼────────────────┤
│   63.22% │     3.72 s │ 10000 │ 371.75 µs ± 3344.27 (   2.3 ‥ 205377.5) │ cuLaunchKernel │
└──────────┴────────────┴───────┴─────────────────────────────────────────┴────────────────┘

Device-side activity: GPU was busy for 5.86 s (99.67% of the trace)
┌──────────┬────────────┬───────┬────────────────────────────────────────┬──────────────────────────────────────────────
│ Time (%) │ Total time │ Calls │ Time distribution                      │ Name                                        ⋯
├──────────┼────────────┼───────┼────────────────────────────────────────┼──────────────────────────────────────────────
│   99.67% │     5.86 s │ 10000 │  586.1 µs ± 758.69 (542.49 ‥ 28883.05) │ _Z6math1_13CuDeviceArrayI7Float32Li2ELi1EES ⋯
└──────────┴────────────┴───────┴────────────────────────────────────────┴──────────────────────────────────────────────
                                                                                                        1 column omitted


julia> CUDA.@profile for iter = 1:1000
           CUDA.@sync @cuda blocks=blocks threads=threads math1!(D, E, F)
       end
Profiler ran for 952.09 ms, capturing 805355 events.

Host-side activity: calling CUDA APIs took 132.22 ms (13.89% of the trace)
┌──────────┬────────────┬───────┬────────────────────────────────────────┬─────────────────────┐
│ Time (%) │ Total time │ Calls │ Time distribution                      │ Name                │
├──────────┼────────────┼───────┼────────────────────────────────────────┼─────────────────────┤
│   52.61% │  500.88 ms │  1000 │ 500.88 µs ± 1977.04 (   1.5 ‥ 14872.2) │ cuStreamSynchronize │
│    0.92% │    8.75 ms │  1000 │   8.75 µs ± 4.92   (   6.7 ‥ 109.8)    │ cuLaunchKernel      │
│    0.00% │     1.9 µs │     2 │ 949.95 ns ± 212.05 (800.01 ‥ 1099.89)  │ cuCtxSetCurrent     │
│    0.00% │   400.0 ns │     2 │  200.0 ns ± 0.0    ( 200.0 ‥ 200.0)    │ cuCtxGetDevice      │
│    0.00% │  300.12 ns │     2 │ 150.06 ns ± 70.63  (100.12 ‥ 200.0)    │ cuDeviceGetCount    │
└──────────┴────────────┴───────┴────────────────────────────────────────┴─────────────────────┘

Device-side activity: GPU was busy for 841.61 ms (88.40% of the trace)
┌──────────┬────────────┬───────┬─────────────────────────────────────────┬─────────────────────────────────────────────
│ Time (%) │ Total time │ Calls │ Time distribution                       │ Name                                       ⋯
├──────────┼────────────┼───────┼─────────────────────────────────────────┼─────────────────────────────────────────────
│   88.40% │  841.61 ms │  1000 │ 841.61 µs ± 1987.07 (543.07 ‥ 15180.51) │ _Z6math1_13CuDeviceArrayI7Float32Li2ELi1EE ⋯
└──────────┴────────────┴───────┴─────────────────────────────────────────┴─────────────────────────────────────────────
                                                                                                        1 column omitted


julia> CUDA.@profile for iter = 1:10000
           CUDA.@sync @cuda blocks=blocks threads=threads math1!(D, E, F)
       end
Profiler ran for 6.64 s, capturing 8040585 events.

Host-side activity: calling CUDA APIs took 982.98 ms (14.81% of the trace)
┌──────────┬────────────┬───────┬───────────────────────────────────────┬───────────────────────┐
│ Time (%) │ Total time │ Calls │ Time distribution                     │ Name                  │
├──────────┼────────────┼───────┼───────────────────────────────────────┼───────────────────────┤
│   34.72% │      2.3 s │ 10000 │ 230.43 µs ± 721.81 (   1.1 ‥ 37695.7) │ cuStreamSynchronize   │
│    1.37% │   90.72 ms │ 10000 │   9.07 µs ± 15.9   (   6.9 ‥ 1500.6)  │ cuLaunchKernel        │
│    0.00% │     9.0 µs │     1 │                                       │ cuMemGetInfo          │
│    0.00% │     1.1 µs │     2 │ 550.06 ns ± 636.32 (100.12 ‥ 1000.01) │ cuMemPoolGetAttribute │
└──────────┴────────────┴───────┴───────────────────────────────────────┴───────────────────────┘

Device-side activity: GPU was busy for 5.76 s (86.81% of the trace)
┌──────────┬────────────┬───────┬────────────────────────────────────────┬──────────────────────────────────────────────
│ Time (%) │ Total time │ Calls │ Time distribution                      │ Name                                        ⋯
├──────────┼────────────┼───────┼────────────────────────────────────────┼──────────────────────────────────────────────
│   86.81% │     5.76 s │ 10000 │ 576.04 µs ± 721.27 (541.05 ‥ 37987.78) │ _Z6math1_13CuDeviceArrayI7Float32Li2ELi1EES ⋯
└──────────┴────────────┴───────┴────────────────────────────────────────┴──────────────────────────────────────────────
                                                                                                        1 column omitted

I’ve also moved the post from New to Julia to GPU, as I think you’re more likely to get responses there.

Alex90 · December 21, 2024, 6:45am

Hello,
Thanks for your help! I have also tested the code with CUDA.@sync
I don’t understand why the synchronization reduces the CuLaunchKernel time whereas the overall time remains the same due to additional synchronization procedure. When searching for the info, i found the same problem in CUDA/C++. In this topic, the slowdown of Kernel was healed by copying the data from the device to the host and vice versa with the reset of GPU after some number of iterations. I tried device_reset! in Julia but there was no any effect on performance. Also, if i run the code consecutively multiple times with 1000 iterations, the Kernel slow down is appeared

maleadt · December 21, 2024, 7:38am

Kernel launches are asynchronous, this works by enqueuing the kernel in some sort of command queue structure. That buffer isn’t unbounded, so at some point you have too many pending asynchronous launches, resulting in the launch becoming “synchronous” (or at least waiting for a new entry in the command queue structure). That is why after a number of iteration, the performance characteristics change. If you sync across every iteration, you ensure the queue never fills up completely, without that affecting the performance characteristics of the application as a whole.

Alex90 · December 21, 2024, 9:37am

Many thanks for the explanation. But is there a way to fix this performance issue? I wonder why the language developed for high performance scientific computing has this type of problem since, for example, computational physics tasks are mostly iterative.

eldee · December 21, 2024, 10:52am

In which sense did this solve the slowdown? Such a copy will force synchronisation, but it should make things slower. Note for example that the code for 1000 iterations with CUDA.@sync runs slower than without it. Adding it just levels the playing field with 10000 iterations, so to speak.

You could launch fewer, but larger kernels, i.e. group multiple iterations in a single kernel by looping inside of it.

Note that this also applies to CPU code.

CPU example

using .Threads, ChunkSplitters, BenchmarkTools

function foo!(C, A, B)  # length(C) small tasks
    tasks = map(eachindex(C)) do i
        @spawn C[i] = A[i] + B[i]
    end
    wait.(tasks)
end

function foo2!(C, A, B)  # nthreads() big tasks
    tasks = map(chunks(eachindex(C), n=nthreads())) do is
        @spawn begin
            for i in is
                C[i] = A[i] + B[i]
            end
        end
    end
    wait.(tasks)
end

A = rand(1000); B = rand(1000); C = similar(A);
display(@benchmark foo!($C, $A, $B))   # median: 330.900 μs
display(@benchmark foo2!($C, $A, $B))  # median: 6.829 μs

A = rand(10000); B = rand(10000); C = similar(A);
display(@benchmark foo!($C, $A, $B))   # median: 3.560 ms
display(@benchmark foo2!($C, $A, $B))  # median: 7.883 μs

So foo! becomes 10x slower, while foo2! sees only a mild increase. I would interpret this as the task scheduling incurring a significant overhead over the actual (trivial) calculations (as is also evident from comparing foo! and foo2! at the same data length, of course).

By the way, I would consider the term ‘iterations’ in your example somewhat of a misnomer, as in my eyes iterations should be run sequentially, which is not the case here: the output F is not used as input for a later iteration.

You said before that C/C++ CUDA had the same issue, so you should probably blame it on GPU architectures instead of Julia.

Alex90 · December 21, 2024, 12:38pm

As stated in cuda kernal slows down after some iterations - #2 by little_jimmy - CUDA Programming and Performance - NVIDIA Developer Forums

I fix the problem by copying GPU data to host and do gpu device reset after 1000 iterations, then move data back to GPU and continue with iteration 1001. Maybe GPU gets “tired” after large iteration and it needs reset. My method involved extra copy, but still faster than not using reset.

I forgot to mention that in the CUDA/C++ case, it was more likely hardware problem since another GPU worked fine. In my case, it is most likely software problem since i tried three different GPU (GTX 960, GTX 1060 and Tesla T4)

You could launch fewer, but larger kernels, i.e. group multiple iterations in a single kernel by looping inside of it.

I thought about it too. But i don’t think that i can implement it in my main program where a huge amount of linear algebra operations are performed. Also one iteration in my main program (without a custom kernel) is performed in no time. The slowdown is increased with iterations as in the present example.

Note that this also applies to CPU code.

CPU code works fine. No slowdown was observed in one of my main programs

By the way, I would consider the term ‘iterations’ in your example somewhat of a misnomer, as in my eyes iterations should be run sequentially, which is not the case here: the output F is not used as input for a later iteration.

Yes, you are right. I just gave a simple example describing my problem.

maleadt · December 21, 2024, 9:23pm

I think you are misunderstanding the asynchronous nature of GPU programming. There doesn’t seem to be a real performance issue here, and the behavior would manifest identically when doing this from C++. The command queue is part of the CUDA driver, and is not managed by CUDA.jl. The fact that adding a CUDA.@sync “fixes” the issue from your point of view without affecting the total application run time demonstrates that this is not a real issue, no?

If anything, I would indeed try to reduce the amount of outstanding kernel launches to reduce the load on the CPU. If that is not necessary, i.e., if you can’t use the CPU in between kernel launches for anything useful anyway, there is no problem to keep your application as it is and have the kernel launch perceive to take up lots of time from the CPU’s point of view (which, again, is only spinning there because the command queue is filled).

Alex90 · December 22, 2024, 3:24am

Thanks for the answer. If it is an expected behavior, my Julia journey ends here since i get the same performance in Matlab (25.92 s in Julia vs 26.82 in Matlab)

maleadt · December 22, 2024, 8:27am

To be clear, this root of the problem seems that you were measuring incorrectly:

Here you were merely measuring the time to submit 1000 iterations, not the time to complete them.

julia> @time main(1000)
  0.004035 seconds (17.03 k allocations: 328.969 KiB)

julia> @time CUDA.@sync main(1000)
  0.244309 seconds (17.03 k allocations: 329.016 KiB)

julia> @time main(10000)
  2.195521 seconds (170.03 k allocations: 3.205 MiB)

julia> @time CUDA.@sync main(10000)
  2.442455 seconds (170.03 k allocations: 3.205 MiB)

With this kind of application, where you’re writing a custom low-level kernel and directly submitting it from a for loop, it is not reasonable to expect Julia to perform any different from other programming languages that support CUDA. Julia doesn’t magically give you better performance, it’s still up to you to write an application that efficiently uses the hardware you have at your disposal.

Topic		Replies	Views
CUDA.jl tutorial code kernel slower than broadcast Performance gpu , cuda	4	674	March 30, 2021
Optimizing CUDA.jl performance for small array operations GPU performance , cuda	5	2380	February 8, 2021
CUDA.jl kernel is half as fast as c++ Kernel Performance cuda , cudajl	11	1484	September 26, 2022
Occasional long delays in CUDA.jl GPU	16	1597	August 11, 2023
Some CUDA functions suddenly become very slow New to Julia	3	176	July 14, 2024

CUDA.jl is slowed down after some number of iterations

Related topics