I was able to identify one place where cudaMemcpyAsync is used: normalize!:
julia> using CUDA, LinearAlgebra
julia> d = CUDA.rand(1024);
julia> CUDA.@profile normalize!(d)
Profiler ran for 1.98 s, capturing 39 events.
Host-side activity: calling CUDA APIs took 223.64 µs (0.01% of the trace)
┌──────────┬────────────┬───────┬───────────────────────────────────────┬──────────────────────────┐
│ Time (%) │ Total time │ Calls │ Time distribution │ Name │
├──────────┼────────────┼───────┼───────────────────────────────────────┼──────────────────────────┤
│ 0.01% │ 99.18 µs │ 1 │ │ cudaFuncGetAttributes │
│ 0.00% │ 57.94 µs │ 3 │ 19.31 µs ± 11.62 ( 5.96 ‥ 27.18) │ cudaLaunchKernel │
│ 0.00% │ 40.53 µs │ 2 │ 20.27 µs ± 3.71 ( 17.64 ‥ 22.89) │ cudaMemcpyAsync │
...
Is this expected? If so, is there a way to perform normalize! on a CUDA array without copying data between the host and device? I call normalize! in a loop, so the copies are quite costly.
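For context, here is the kind of thing I was hoping would work; a minimal sketch, assuming the cudaMemcpyAsync calls come from the scalar reduction result being transferred back to the host (I have not verified that this is actually where the transfer happens). Passing the dims keyword makes the reduction return a 1-element CuArray instead of a host scalar, so the whole update should stay on the device:

julia> d ./= sqrt.(sum(abs2, d; dims=1));  # reduction result stays on the GPU as a 1-element CuArray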
UPDATE: I found that norm also uses cudaMemcpyAsync:
julia> CUDA.@profile norm(d)
Profiler ran for 801.43 ms, capturing 28 events.
Host-side activity: calling CUDA APIs took 170.23 µs (0.02% of the trace)
┌──────────┬────────────┬───────┬───────────────────────────────────────┬──────────────────────────┐
│ Time (%) │ Total time │ Calls │ Time distribution │ Name │
├──────────┼────────────┼───────┼───────────────────────────────────────┼──────────────────────────┤
│ 0.01% │ 68.9 µs │ 1 │ │ cudaFuncGetAttributes │
│ 0.00% │ 39.58 µs │ 2 │ 19.79 µs ± 0.34 ( 19.55 ‥ 20.03) │ cudaMemcpyAsync │
│ 0.00% │ 36.0 µs │ 2 │ 18.0 µs ± 16.35 ( 6.44 ‥ 29.56) │ cudaLaunchKernel │
...
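If the memcpy in norm is likewise the scalar result coming back to the host (again, an assumption on my part), the same dims trick should give a device-resident norm:

julia> n = sqrt.(sum(abs2, d; dims=1));  # 1-element CuArray; the value never leaves the device

Re-running CUDA.@profile on this expression would show whether the cudaMemcpyAsync calls actually disappear.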