cudaMemcpyAsync: where is it used?

Thanks for the explanation! That makes sense. The only problem is that unlike your, my result does not show the matching host- and device-side timings:

julia> CUDA.@profile CUBLAS.nrm2(d)
Profiler ran for 986.81 µs, capturing 28 events.

Host-side activity: calling CUDA APIs took 951.05 µs (96.38% of the trace)
┌──────────┬────────────┬───────┬───────────────────────────────────────┬──────────────────────────┐
│ Time (%) │ Total time │ Calls │ Time distribution                     │ Name                     │
├──────────┼────────────┼───────┼───────────────────────────────────────┼──────────────────────────┤
│   91.62% │  904.08 µs │     2 │ 452.04 µs ± 613.66 ( 18.12 ‥ 885.96)  │ cudaMemcpyAsync          │
│    2.83% │   27.89 µs │     2 │  13.95 µs ± 11.63  (  5.72 ‥ 22.17)   │ cudaLaunchKernel         │
│    0.63% │     6.2 µs │     1 │                                       │ cudaFuncGetAttributes    │
│    0.29% │    2.86 µs │     1 │                                       │ cudaEventRecord          │
│    0.27% │    2.62 µs │     1 │                                       │ cudaEventQuery           │
│    0.24% │    2.38 µs │     1 │                                       │ cudaStreamSynchronize    │
│    0.22% │    2.15 µs │     3 │ 715.26 ns ± 630.8  (238.42 ‥ 1430.51) │ cudaStreamGetCaptureInfo │
│    0.10% │  953.67 ns │     1 │                                       │ cuStreamGetCaptureInfo   │
│    0.10% │  953.67 ns │     4 │ 238.42 ns ± 476.84 (   0.0 ‥ 953.67)  │ cudaGetLastError         │
└──────────┴────────────┴───────┴───────────────────────────────────────┴──────────────────────────┘

Device-side activity: GPU was busy for 17.4 µs (1.76% of the trace)
┌──────────┬────────────┬───────┬────────────────────────────────────┬──────────────────────────────────────────
│ Time (%) │ Total time │ Calls │ Time distribution                  │ Name                                    ⋯
├──────────┼────────────┼───────┼────────────────────────────────────┼──────────────────────────────────────────
│    1.45% │   14.31 µs │     2 │   7.15 µs ± 0.67   (  6.68 ‥ 7.63) │ _Z11nrm2_kernelIfffLi1ELi0ELi128EEv16cu ⋯
│    0.19% │    1.91 µs │     1 │                                    │ [copy pageable to device memory]        ⋯
│    0.12% │    1.19 µs │     1 │                                    │ [copy device to pageable memory]        ⋯
└──────────┴────────────┴───────┴────────────────────────────────────┴──────────────────────────────────────────

Here, the host-side cudaMemcpyAsync took 904.08 µs, which is ~50X longer than the sum of the device-side times (14.31 + 1.91 + 1.19) µs. I am curious if you have any insights as to why this happens on my machine…

EDIT. I executed the code multiple times but got almost the same results, so I don’t think this is the time-to-first-plot (TTFP) issue.