Thanks for the explanation! That makes sense. The only problem is that unlike your, my result does not show the matching host- and device-side timings:
julia> CUDA.@profile CUBLAS.nrm2(d)
Profiler ran for 986.81 µs, capturing 28 events.
Host-side activity: calling CUDA APIs took 951.05 µs (96.38% of the trace)
┌──────────┬────────────┬───────┬───────────────────────────────────────┬──────────────────────────┐
│ Time (%) │ Total time │ Calls │ Time distribution │ Name │
├──────────┼────────────┼───────┼───────────────────────────────────────┼──────────────────────────┤
│ 91.62% │ 904.08 µs │ 2 │ 452.04 µs ± 613.66 ( 18.12 ‥ 885.96) │ cudaMemcpyAsync │
│ 2.83% │ 27.89 µs │ 2 │ 13.95 µs ± 11.63 ( 5.72 ‥ 22.17) │ cudaLaunchKernel │
│ 0.63% │ 6.2 µs │ 1 │ │ cudaFuncGetAttributes │
│ 0.29% │ 2.86 µs │ 1 │ │ cudaEventRecord │
│ 0.27% │ 2.62 µs │ 1 │ │ cudaEventQuery │
│ 0.24% │ 2.38 µs │ 1 │ │ cudaStreamSynchronize │
│ 0.22% │ 2.15 µs │ 3 │ 715.26 ns ± 630.8 (238.42 ‥ 1430.51) │ cudaStreamGetCaptureInfo │
│ 0.10% │ 953.67 ns │ 1 │ │ cuStreamGetCaptureInfo │
│ 0.10% │ 953.67 ns │ 4 │ 238.42 ns ± 476.84 ( 0.0 ‥ 953.67) │ cudaGetLastError │
└──────────┴────────────┴───────┴───────────────────────────────────────┴──────────────────────────┘
Device-side activity: GPU was busy for 17.4 µs (1.76% of the trace)
┌──────────┬────────────┬───────┬────────────────────────────────────┬──────────────────────────────────────────
│ Time (%) │ Total time │ Calls │ Time distribution │ Name ⋯
├──────────┼────────────┼───────┼────────────────────────────────────┼──────────────────────────────────────────
│ 1.45% │ 14.31 µs │ 2 │ 7.15 µs ± 0.67 ( 6.68 ‥ 7.63) │ _Z11nrm2_kernelIfffLi1ELi0ELi128EEv16cu ⋯
│ 0.19% │ 1.91 µs │ 1 │ │ [copy pageable to device memory] ⋯
│ 0.12% │ 1.19 µs │ 1 │ │ [copy device to pageable memory] ⋯
└──────────┴────────────┴───────┴────────────────────────────────────┴──────────────────────────────────────────
Here, the host-side cudaMemcpyAsync
took 904.08 µs, which is ~50X longer than the sum of the device-side times (14.31 + 1.91 + 1.19) µs. I am curious if you have any insights as to why this happens on my machine…
EDIT. I executed the code multiple times but got almost the same results, so I don’t think this is the time-to-first-plot (TTFP) issue.