cudaMemcpyAsync: where is it used?

wsshin · August 13, 2024, 6:22pm

Thanks for the explanation! That makes sense. The only problem is that unlike your, my result does not show the matching host- and device-side timings:

julia> CUDA.@profile CUBLAS.nrm2(d)
Profiler ran for 986.81 µs, capturing 28 events.

Host-side activity: calling CUDA APIs took 951.05 µs (96.38% of the trace)
┌──────────┬────────────┬───────┬───────────────────────────────────────┬──────────────────────────┐
│ Time (%) │ Total time │ Calls │ Time distribution                     │ Name                     │
├──────────┼────────────┼───────┼───────────────────────────────────────┼──────────────────────────┤
│   91.62% │  904.08 µs │     2 │ 452.04 µs ± 613.66 ( 18.12 ‥ 885.96)  │ cudaMemcpyAsync          │
│    2.83% │   27.89 µs │     2 │  13.95 µs ± 11.63  (  5.72 ‥ 22.17)   │ cudaLaunchKernel         │
│    0.63% │     6.2 µs │     1 │                                       │ cudaFuncGetAttributes    │
│    0.29% │    2.86 µs │     1 │                                       │ cudaEventRecord          │
│    0.27% │    2.62 µs │     1 │                                       │ cudaEventQuery           │
│    0.24% │    2.38 µs │     1 │                                       │ cudaStreamSynchronize    │
│    0.22% │    2.15 µs │     3 │ 715.26 ns ± 630.8  (238.42 ‥ 1430.51) │ cudaStreamGetCaptureInfo │
│    0.10% │  953.67 ns │     1 │                                       │ cuStreamGetCaptureInfo   │
│    0.10% │  953.67 ns │     4 │ 238.42 ns ± 476.84 (   0.0 ‥ 953.67)  │ cudaGetLastError         │
└──────────┴────────────┴───────┴───────────────────────────────────────┴──────────────────────────┘

Device-side activity: GPU was busy for 17.4 µs (1.76% of the trace)
┌──────────┬────────────┬───────┬────────────────────────────────────┬──────────────────────────────────────────
│ Time (%) │ Total time │ Calls │ Time distribution                  │ Name                                    ⋯
├──────────┼────────────┼───────┼────────────────────────────────────┼──────────────────────────────────────────
│    1.45% │   14.31 µs │     2 │   7.15 µs ± 0.67   (  6.68 ‥ 7.63) │ _Z11nrm2_kernelIfffLi1ELi0ELi128EEv16cu ⋯
│    0.19% │    1.91 µs │     1 │                                    │ [copy pageable to device memory]        ⋯
│    0.12% │    1.19 µs │     1 │                                    │ [copy device to pageable memory]        ⋯
└──────────┴────────────┴───────┴────────────────────────────────────┴──────────────────────────────────────────

Here, the host-side cudaMemcpyAsync took 904.08 µs, which is ~50X longer than the sum of the device-side times (14.31 + 1.91 + 1.19) µs. I am curious if you have any insights as to why this happens on my machine…

EDIT. I executed the code multiple times but got almost the same results, so I don’t think this is the time-to-first-plot (TTFP) issue.

Topic		Replies	Views
How to accelerate GPU operation？ GPU question	12	364	November 18, 2024
Most efficient way of _waiting_ for GPU results? GPU	20	3037	January 31, 2019
Accessing array elements too slow? GPU	10	593	April 23, 2021
CUDA v2 - performance regression on matrix multiplication GPU	14	1712	November 10, 2020
CUDA Profiler General Usage cuda	4	119	August 28, 2024

cudaMemcpyAsync: where is it used?

Related topics