Hi,
Not really an answer, but with CUDA.@sync added, the individual kernel launches take approximately the same time for 1000 and 10000 iterations. So maybe there is some implicit synchronisation going on; perhaps the scheduler has difficulty managing resources with 10000 (not necessarily sequential) iterations queued up.
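If it helps, here is a minimal sketch (reusing your blocks, threads, math1!, D, E and F) that times each launch on the host. If the driver's launch queue fills up, later @cuda calls should suddenly start taking roughly one kernel duration instead of a few microseconds:

using CUDA

# Time each individual launch on the host. Early launches should return in a
# few microseconds; once the launch queue is full, each new launch has to wait
# for a slot, so the host-side time jumps to roughly one kernel duration.
launch_times = zeros(10000)
for iter in 1:10000
    launch_times[iter] = @elapsed @cuda blocks=blocks threads=threads math1!(D, E, F)
end
CUDA.synchronize()

# Compare the cheap early launches with ones issued after the queue filled up.
println("first 100 launches: ", sum(launch_times[1:100]), " s")
println("last 100 launches:  ", sum(launch_times[end-99:end]), " s")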
Profiling
julia> CUDA.@profile for iter = 1:1000
@cuda blocks=blocks threads=threads math1!(D, E, F)
end
Profiler ran for 1.06 s, capturing 30022 events.
Host-side activity: calling CUDA APIs took 6.12 ms (0.58% of the trace)
┌──────────┬────────────┬───────┬────────────────────────────────────┬─────────────────────┐
│ Time (%) │ Total time │ Calls │ Time distribution │ Name │
├──────────┼────────────┼───────┼────────────────────────────────────┼─────────────────────┤
│ 0.30% │ 3.18 ms │ 1000 │ 3.18 µs ± 2.44 ( 2.3 ‥ 31.8) │ cuLaunchKernel │
│ 0.08% │ 848.9 µs │ 1 │ │ cuModuleLoadDataEx │
│ 0.00% │ 35.7 µs │ 1 │ │ cuModuleGetFunction │
│ 0.00% │ 5.6 µs │ 1 │ │ cuCtxSynchronize │
└──────────┴────────────┴───────┴────────────────────────────────────┴─────────────────────┘
Device-side activity: GPU was busy for 763.61 ms (72.12% of the trace)
┌──────────┬────────────┬───────┬─────────────────────────────────────────┬─────────────────────────────────────────────
│ Time (%) │ Total time │ Calls │ Time distribution │ Name ⋯
├──────────┼────────────┼───────┼─────────────────────────────────────────┼─────────────────────────────────────────────
│ 72.12% │ 763.61 ms │ 1000 │ 763.61 µs ± 1862.44 (542.91 ‥ 28205.17) │ _Z6math1_13CuDeviceArrayI7Float32Li2ELi1EE ⋯
└──────────┴────────────┴───────┴─────────────────────────────────────────┴─────────────────────────────────────────────
1 column omitted
julia> CUDA.@profile for iter = 1:10000
@cuda blocks=blocks threads=threads math1!(D, E, F)
end
Profiler ran for 5.88 s, capturing 300002 events.
Host-side activity: calling CUDA APIs took 3.74 s (63.66% of the trace)
┌──────────┬────────────┬───────┬─────────────────────────────────────────┬────────────────┐
│ Time (%) │ Total time │ Calls │ Time distribution │ Name │
├──────────┼────────────┼───────┼─────────────────────────────────────────┼────────────────┤
│ 63.22% │ 3.72 s │ 10000 │ 371.75 µs ± 3344.27 ( 2.3 ‥ 205377.5) │ cuLaunchKernel │
└──────────┴────────────┴───────┴─────────────────────────────────────────┴────────────────┘
Device-side activity: GPU was busy for 5.86 s (99.67% of the trace)
┌──────────┬────────────┬───────┬────────────────────────────────────────┬──────────────────────────────────────────────
│ Time (%) │ Total time │ Calls │ Time distribution │ Name ⋯
├──────────┼────────────┼───────┼────────────────────────────────────────┼──────────────────────────────────────────────
│ 99.67% │ 5.86 s │ 10000 │ 586.1 µs ± 758.69 (542.49 ‥ 28883.05) │ _Z6math1_13CuDeviceArrayI7Float32Li2ELi1EES ⋯
└──────────┴────────────┴───────┴────────────────────────────────────────┴──────────────────────────────────────────────
1 column omitted
julia> CUDA.@profile for iter = 1:1000
CUDA.@sync @cuda blocks=blocks threads=threads math1!(D, E, F)
end
Profiler ran for 952.09 ms, capturing 805355 events.
Host-side activity: calling CUDA APIs took 132.22 ms (13.89% of the trace)
┌──────────┬────────────┬───────┬────────────────────────────────────────┬─────────────────────┐
│ Time (%) │ Total time │ Calls │ Time distribution │ Name │
├──────────┼────────────┼───────┼────────────────────────────────────────┼─────────────────────┤
│ 52.61% │ 500.88 ms │ 1000 │ 500.88 µs ± 1977.04 ( 1.5 ‥ 14872.2) │ cuStreamSynchronize │
│ 0.92% │ 8.75 ms │ 1000 │ 8.75 µs ± 4.92 ( 6.7 ‥ 109.8) │ cuLaunchKernel │
│ 0.00% │ 1.9 µs │ 2 │ 949.95 ns ± 212.05 (800.01 ‥ 1099.89) │ cuCtxSetCurrent │
│ 0.00% │ 400.0 ns │ 2 │ 200.0 ns ± 0.0 ( 200.0 ‥ 200.0) │ cuCtxGetDevice │
│ 0.00% │ 300.12 ns │ 2 │ 150.06 ns ± 70.63 (100.12 ‥ 200.0) │ cuDeviceGetCount │
└──────────┴────────────┴───────┴────────────────────────────────────────┴─────────────────────┘
Device-side activity: GPU was busy for 841.61 ms (88.40% of the trace)
┌──────────┬────────────┬───────┬─────────────────────────────────────────┬─────────────────────────────────────────────
│ Time (%) │ Total time │ Calls │ Time distribution │ Name ⋯
├──────────┼────────────┼───────┼─────────────────────────────────────────┼─────────────────────────────────────────────
│ 88.40% │ 841.61 ms │ 1000 │ 841.61 µs ± 1987.07 (543.07 ‥ 15180.51) │ _Z6math1_13CuDeviceArrayI7Float32Li2ELi1EE ⋯
└──────────┴────────────┴───────┴─────────────────────────────────────────┴─────────────────────────────────────────────
1 column omitted
julia> CUDA.@profile for iter = 1:10000
CUDA.@sync @cuda blocks=blocks threads=threads math1!(D, E, F)
end
Profiler ran for 6.64 s, capturing 8040585 events.
Host-side activity: calling CUDA APIs took 982.98 ms (14.81% of the trace)
┌──────────┬────────────┬───────┬───────────────────────────────────────┬───────────────────────┐
│ Time (%) │ Total time │ Calls │ Time distribution │ Name │
├──────────┼────────────┼───────┼───────────────────────────────────────┼───────────────────────┤
│ 34.72% │ 2.3 s │ 10000 │ 230.43 µs ± 721.81 ( 1.1 ‥ 37695.7) │ cuStreamSynchronize │
│ 1.37% │ 90.72 ms │ 10000 │ 9.07 µs ± 15.9 ( 6.9 ‥ 1500.6) │ cuLaunchKernel │
│ 0.00% │ 9.0 µs │ 1 │ │ cuMemGetInfo │
│ 0.00% │ 1.1 µs │ 2 │ 550.06 ns ± 636.32 (100.12 ‥ 1000.01) │ cuMemPoolGetAttribute │
└──────────┴────────────┴───────┴───────────────────────────────────────┴───────────────────────┘
Device-side activity: GPU was busy for 5.76 s (86.81% of the trace)
┌──────────┬────────────┬───────┬────────────────────────────────────────┬──────────────────────────────────────────────
│ Time (%) │ Total time │ Calls │ Time distribution │ Name ⋯
├──────────┼────────────┼───────┼────────────────────────────────────────┼──────────────────────────────────────────────
│ 86.81% │ 5.76 s │ 10000 │ 576.04 µs ± 721.27 (541.05 ‥ 37987.78) │ _Z6math1_13CuDeviceArrayI7Float32Li2ELi1EES ⋯
└──────────┴────────────┴───────┴────────────────────────────────────────┴──────────────────────────────────────────────
1 column omitted
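For what it's worth, if you only need a fair end-to-end timing rather than per-launch numbers, a sketch using your existing kernel would be to queue all launches and synchronise once at the end, so you don't pay the cuStreamSynchronize cost on every iteration:

using CUDA

# Queue all launches asynchronously and synchronise once at the end; the
# measured time then covers the full GPU work without a per-iteration sync.
t = @elapsed CUDA.@sync for iter = 1:10000
    @cuda blocks=blocks threads=threads math1!(D, E, F)
end
println("total time for 10000 launches: ", t, " s")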
I’ve also moved the post from the “New to Julia” category to “GPU”, as I think you’re more likely to get responses there.