Performance difference among three CUDA kernels with same results

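One caveat: BenchmarkTools does not report GPU memory allocations, so its numbers alone can't tell you whether a copy was made on the device. In this case, though, we can see from the lowering that the left-hand side of the dotted assignment goes through `Base.dotview` and is indeed not copied: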

julia> using CUDA

julia> a = CUDA.rand(10, 10, 10);

julia> Meta.@lower a[1:3, 2:4, 3:5] .*= 0.1
:($(Expr(:thunk, CodeInfo(
    @ none within `top-level scope`
1 ─ %1 = 1:3
│   %2 = 2:4
│   %3 = 3:5
│   %4 = Base.dotview(a, %1, %2, %3)
│   %5 = Base.getindex(a, %1, %2, %3)
│   %6 = Base.broadcasted(*, %5, 0.1)
│   %7 = Base.materialize!(%4, %6)
└──      return %7
))))

julia> Meta.@lower @views a[1:3, 2:4, 3:5] .*= 0.1
:($(Expr(:thunk, CodeInfo(
    @ none within `top-level scope`
1 ─ %1  = a
│         ##a#274 = %1
│   %3  = 1:3
│         i#275 = %3
│   %5  = 2:4
│         i#276 = %5
│   %7  = 3:5
│         i#277 = %7
│   %9  = Base.dotview(##a#274, i#275, i#276, i#277)
│   %10 = (Base.maybeview)(##a#274, i#275, i#276, i#277)
│   %11 = Base.broadcasted(*, %10, 0.1)
│   %12 = Base.materialize!(%9, %11)
└──       return %12
))))

julia> CUDA.@time Base.dotview(a, 1:3, 2:4, 3:5);
  0.000007 seconds (6 CPU allocations: 256 bytes)

julia> CUDA.@time a[1:3, 2:4, 3:5];
  0.017774 seconds (42 CPU allocations: 1.578 KiB) (1 GPU allocation: 108 bytes, 0.05% memmgmt time)
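
The same view-versus-copy distinction can be checked without a GPU. What follows is only a minimal sketch on a plain CPU `Array`, assuming that `CuArray` follows the same `dotview`/`getindex` semantics (which is what the timings above suggest); the variable names are made up for illustration.

```julia
# CPU-only sketch with a plain Array; no GPU or CUDA.jl needed.
# Assumption: CuArray follows the same view/copy semantics as Array here.
a = rand(Float32, 10, 10, 10)
orig = copy(a)

# Destination of a dotted assignment: Base.dotview returns a view into `a`.
dest = Base.dotview(a, 1:3, 2:4, 3:5)
dest_is_view = dest isa SubArray && parent(dest) === a

# Plain indexing with ranges: getindex materializes a new, independent array.
src = a[1:3, 2:4, 3:5]
src_is_copy = src isa Array

# With @views, the indexing expression becomes a view as well.
vsrc = @views a[1:3, 2:4, 3:5]
vsrc_is_view = vsrc isa SubArray && parent(vsrc) === a

# Both spellings update the array in place and give the same values.
b = copy(orig)
a[1:3, 2:4, 3:5] .*= 0.1
@views b[1:3, 2:4, 3:5] .*= 0.1
same_result = a == b

# The earlier copy `src` is unaffected by the in-place update of `a`.
src_untouched = src == orig[1:3, 2:4, 3:5]
```

So the destination is never copied in either spelling; what `@views` changes is the read on the right-hand side, which otherwise goes through `getindex` and allocates a temporary.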