cudaMemcpyAsync: where is it used?

I was able to identify one place where cudaMemcpyAsync is used: normalize!:

julia> using CUDA, LinearAlgebra

julia> d = CUDA.rand(1024);

julia> CUDA.@profile normalize!(d)
Profiler ran for 1.98 s, capturing 39 events.

Host-side activity: calling CUDA APIs took 223.64 µs (0.01% of the trace)
┌──────────┬────────────┬───────┬───────────────────────────────────────┬──────────────────────────┐
│ Time (%) │ Total time │ Calls │ Time distribution                     │ Name                     │
├──────────┼────────────┼───────┼───────────────────────────────────────┼──────────────────────────┤
│    0.01% │   99.18 µs │     1 │                                       │ cudaFuncGetAttributes    │
│    0.00% │   57.94 µs │     3 │  19.31 µs ± 11.62  (  5.96 ‥ 27.18)   │ cudaLaunchKernel         │
│    0.00% │   40.53 µs │     2 │  20.27 µs ± 3.71   ( 17.64 ‥ 22.89)   │ cudaMemcpyAsync          │
...

Is this expected? If so, is there a way to perform normalize! on a CUDA array without copying data between the host and device? I call normalize! in a loop, so these host-device transfers become quite costly.
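
In case it helps, here is a device-only workaround I have been sketching. This is only a sketch under my own assumptions: normalize_on_device! is a name I made up, it skips the overflow-safe rescaling that LinearAlgebra's norm performs, and I have not yet confirmed that it actually avoids the copies in practice.

julia> function normalize_on_device!(d::CuVector)
           # Reduce with dims so the result stays on the device as a
           # 1-element CuArray instead of being copied back as a host scalar.
           s = mapreduce(abs2, +, d; dims=1)
           # In-place broadcast; the 1-element array broadcasts against d,
           # so the rescaling also runs entirely on the device.
           d ./= sqrt.(s)
           return d
       end;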

UPDATE: I found that norm also uses cudaMemcpyAsync:

julia> CUDA.@profile norm(d)
Profiler ran for 801.43 ms, capturing 28 events.

Host-side activity: calling CUDA APIs took 170.23 µs (0.02% of the trace)
┌──────────┬────────────┬───────┬───────────────────────────────────────┬──────────────────────────┐
│ Time (%) │ Total time │ Calls │ Time distribution                     │ Name                     │
├──────────┼────────────┼───────┼───────────────────────────────────────┼──────────────────────────┤
│    0.01% │    68.9 µs │     1 │                                       │ cudaFuncGetAttributes    │
│    0.00% │   39.58 µs │     2 │  19.79 µs ± 0.34   ( 19.55 ‥ 20.03)   │ cudaMemcpyAsync          │
│    0.00% │    36.0 µs │     2 │   18.0 µs ± 16.35  (  6.44 ‥ 29.56)   │ cudaLaunchKernel         │
...
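
My current guess (an assumption on my part, not something I have confirmed in the CUDA.jl sources) is that at least one of these copies comes from norm returning a plain host scalar, which forces a device-to-host transfer. If that is right, a reduction that keeps its result on the device should profile without any cudaMemcpyAsync calls:

julia> CUDA.@profile mapreduce(abs2, +, d; dims=1);

Can anyone confirm whether the copies in norm and normalize! are avoidable, or whether returning a host scalar makes them inherent?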