Hi,
Not really an answer, but with CUDA.@sync added, the individual kernel launches take approximately the same time for 1000 and 10000 iterations. So maybe there is some implicit synchronisation going on; perhaps the scheduler has difficulty managing resources with 10000 (not necessarily sequential) iterations queued up.
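If it helps, here is a minimal sketch (reusing your blocks, threads, math1!, D, E and F) that times each launch on the host. If the driver's launch queue fills up, later @cuda calls should suddenly start taking roughly one kernel duration instead of a few microseconds:

using CUDA

# Time each individual launch on the host. Early launches should return in a
# few microseconds; once the launch queue is full, each new launch has to wait
# for a slot, so the host-side time jumps to roughly one kernel duration.
launch_times = zeros(10000)
for iter in 1:10000
    launch_times[iter] = @elapsed @cuda blocks=blocks threads=threads math1!(D, E, F)
end
CUDA.synchronize()

# Compare the cheap early launches with ones issued after the queue filled up.
println("first 100 launches: ", sum(launch_times[1:100]), " s")
println("last 100 launches:  ", sum(launch_times[end-99:end]), " s")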
Profiling
julia> CUDA.@profile for iter = 1:1000
@cuda blocks=blocks threads=threads math1!(D, E, F)
end
Profiler ran for 1.06 s, capturing 30022 events.
Host-side activity: calling CUDA APIs took 6.12 ms (0.58% of the trace)
┌──────────┬────────────┬───────┬────────────────────────────────────┬─────────────────────┐
│ Time (%) │ Total time │ Calls │ Time distribution │ Name │
├──────────┼────────────┼───────┼────────────────────────────────────┼─────────────────────┤
│ 0.30% │ 3.18 ms │ 1000 │ 3.18 µs ± 2.44 ( 2.3 ‥ 31.8) │ cuLaunchKernel │
│ 0.08% │ 848.9 µs │ 1 │ │ cuModuleLoadDataEx │
│ 0.00% │ 35.7 µs │ 1 │ │ cuModuleGetFunction │
│ 0.00% │ 5.6 µs │ 1 │ │ cuCtxSynchronize │
└──────────┴────────────┴───────┴────────────────────────────────────┴─────────────────────┘
Device-side activity: GPU was busy for 763.61 ms (72.12% of the trace)
┌──────────┬────────────┬───────┬─────────────────────────────────────────┬─────────────────────────────────────────────
│ Time (%) │ Total time │ Calls │ Time distribution │ Name ⋯
├──────────┼────────────┼───────┼─────────────────────────────────────────┼─────────────────────────────────────────────
│ 72.12% │ 763.61 ms │ 1000 │ 763.61 µs ± 1862.44 (542.91 ‥ 28205.17) │ _Z6math1_13CuDeviceArrayI7Float32Li2ELi1EE ⋯
└──────────┴────────────┴───────┴─────────────────────────────────────────┴─────────────────────────────────────────────
1 column omitted
julia> CUDA.@profile for iter = 1:10000
@cuda blocks=blocks threads=threads math1!(D, E, F)
end
Profiler ran for 5.88 s, capturing 300002 events.
Host-side activity: calling CUDA APIs took 3.74 s (63.66% of the trace)
┌──────────┬────────────┬───────┬─────────────────────────────────────────┬────────────────┐
│ Time (%) │ Total time │ Calls │ Time distribution │ Name │
├──────────┼────────────┼───────┼─────────────────────────────────────────┼────────────────┤
│ 63.22% │ 3.72 s │ 10000 │ 371.75 µs ± 3344.27 ( 2.3 ‥ 205377.5) │ cuLaunchKernel │
└──────────┴────────────┴───────┴─────────────────────────────────────────┴────────────────┘
Device-side activity: GPU was busy for 5.86 s (99.67% of the trace)
┌──────────┬────────────┬───────┬────────────────────────────────────────┬──────────────────────────────────────────────
│ Time (%) │ Total time │ Calls │ Time distribution │ Name ⋯
├──────────┼────────────┼───────┼────────────────────────────────────────┼──────────────────────────────────────────────
│ 99.67% │ 5.86 s │ 10000 │ 586.1 µs ± 758.69 (542.49 ‥ 28883.05) │ _Z6math1_13CuDeviceArrayI7Float32Li2ELi1EES ⋯
└──────────┴────────────┴───────┴────────────────────────────────────────┴──────────────────────────────────────────────
1 column omitted
julia> CUDA.@profile for iter = 1:1000
CUDA.@sync @cuda blocks=blocks threads=threads math1!(D, E, F)
end
Profiler ran for 952.09 ms, capturing 805355 events.
Host-side activity: calling CUDA APIs took 132.22 ms (13.89% of the trace)
┌──────────┬────────────┬───────┬────────────────────────────────────────┬─────────────────────┐
│ Time (%) │ Total time │ Calls │ Time distribution │ Name │
├──────────┼────────────┼───────┼────────────────────────────────────────┼─────────────────────┤
│ 52.61% │ 500.88 ms │ 1000 │ 500.88 µs ± 1977.04 ( 1.5 ‥ 14872.2) │ cuStreamSynchronize │
│ 0.92% │ 8.75 ms │ 1000 │ 8.75 µs ± 4.92 ( 6.7 ‥ 109.8) │ cuLaunchKernel │
│ 0.00% │ 1.9 µs │ 2 │ 949.95 ns ± 212.05 (800.01 ‥ 1099.89) │ cuCtxSetCurrent │
│ 0.00% │ 400.0 ns │ 2 │ 200.0 ns ± 0.0 ( 200.0 ‥ 200.0) │ cuCtxGetDevice │
│ 0.00% │ 300.12 ns │ 2 │ 150.06 ns ± 70.63 (100.12 ‥ 200.0) │ cuDeviceGetCount │
└──────────┴────────────┴───────┴────────────────────────────────────────┴─────────────────────┘
Device-side activity: GPU was busy for 841.61 ms (88.40% of the trace)
┌──────────┬────────────┬───────┬─────────────────────────────────────────┬─────────────────────────────────────────────
│ Time (%) │ Total time │ Calls │ Time distribution │ Name ⋯
├──────────┼────────────┼───────┼─────────────────────────────────────────┼─────────────────────────────────────────────
│ 88.40% │ 841.61 ms │ 1000 │ 841.61 µs ± 1987.07 (543.07 ‥ 15180.51) │ _Z6math1_13CuDeviceArrayI7Float32Li2ELi1EE ⋯
└──────────┴────────────┴───────┴─────────────────────────────────────────┴─────────────────────────────────────────────
1 column omitted
julia> CUDA.@profile for iter = 1:10000
CUDA.@sync @cuda blocks=blocks threads=threads math1!(D, E, F)
end
Profiler ran for 6.64 s, capturing 8040585 events.
Host-side activity: calling CUDA APIs took 982.98 ms (14.81% of the trace)
┌──────────┬────────────┬───────┬───────────────────────────────────────┬───────────────────────┐
│ Time (%) │ Total time │ Calls │ Time distribution │ Name │
├──────────┼────────────┼───────┼───────────────────────────────────────┼───────────────────────┤
│ 34.72% │ 2.3 s │ 10000 │ 230.43 µs ± 721.81 ( 1.1 ‥ 37695.7) │ cuStreamSynchronize │
│ 1.37% │ 90.72 ms │ 10000 │ 9.07 µs ± 15.9 ( 6.9 ‥ 1500.6) │ cuLaunchKernel │
│ 0.00% │ 9.0 µs │ 1 │ │ cuMemGetInfo │
│ 0.00% │ 1.1 µs │ 2 │ 550.06 ns ± 636.32 (100.12 ‥ 1000.01) │ cuMemPoolGetAttribute │
└──────────┴────────────┴───────┴───────────────────────────────────────┴───────────────────────┘
Device-side activity: GPU was busy for 5.76 s (86.81% of the trace)
┌──────────┬────────────┬───────┬────────────────────────────────────────┬──────────────────────────────────────────────
│ Time (%) │ Total time │ Calls │ Time distribution │ Name ⋯
├──────────┼────────────┼───────┼────────────────────────────────────────┼──────────────────────────────────────────────
│ 86.81% │ 5.76 s │ 10000 │ 576.04 µs ± 721.27 (541.05 ‥ 37987.78) │ _Z6math1_13CuDeviceArrayI7Float32Li2ELi1EES ⋯
└──────────┴────────────┴───────┴────────────────────────────────────────┴──────────────────────────────────────────────
1 column omitted
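For what it's worth, if you only need a fair end-to-end timing rather than per-launch numbers, a sketch using your existing kernel would be to queue all launches and synchronise once at the end, so you don't pay the cuStreamSynchronize cost on every iteration:

using CUDA

# Queue all launches asynchronously and synchronise once at the end; the
# measured time then covers the full GPU work without a per-iteration sync.
t = @elapsed CUDA.@sync for iter = 1:10000
    @cuda blocks=blocks threads=threads math1!(D, E, F)
end
println("total time for 10000 launches: ", t, " s")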
I’ve also moved the post from the “New to Julia” category to “GPU”, as I think you’re more likely to get responses there.