Correct implementation of CuArray's slicing operations

Hello, sometimes I need slicing operations on a CuArray, and I would like to know the correct (and efficient) way to do this. I ran 4 benchmarks:

Datasets:

using BenchmarkTools
using CUDA
CUDA.allowscalar(true)
# size of array
dsize = 10_000_000
# index
idx_h = collect(1:dsize÷2)
idx_d = CuArray(idx_h)
# datasets
dt1_h = ones(dsize)
dt2_d = CuArray(dt1_h)
dt3_d = CuArray(dt1_h)
dt4_d = CuArray(dt1_h);

Functions:

  1. CPU version
    @views function t1!(idx_h, dt1_h)
        dt1_h[idx_h] .+= 1.0
        return nothing
    end
    
  2. GPU version: data and index on the GPU
    @views function t2!(idx_d, dt2_d)
        dt2_d[idx_d] .+= 1.0
        return nothing
    end
    
  3. GPU version: index on the CPU, data on the GPU
    @views function t3!(idx_h, dt3_d)
        dt3_d[idx_h] .+= 1.0
        return nothing
    end
    
  4. GPU version: custom kernel
    function kernel!(idx_d, dt4_d, sizeidx)
        ix = (blockIdx().x-1)*blockDim().x+threadIdx().x
        if ix≤sizeidx
            dt4_d[idx_d[ix]] += 1.0
        end
        return nothing
    end
    function t4!(idx_d, dt4_d)
        tds = 768
        bls = cld(length(idx_d), tds)
        CUDA.@sync begin
            @cuda threads=tds blocks=bls kernel!(idx_d, dt4_d, length(idx_d))
        end
        return nothing
    end
    

Results:

@benchmark t1!($idx_h, $dt1_h)
@benchmark CUDA.@sync t2!($idx_d, $dt2_d)
@benchmark CUDA.@sync t3!($idx_h, $dt3_d)
@benchmark t4!($idx_d, $dt4_d)
| Item | Name | Time |
|------|------|------|
| 1 | CPU version | 14.062 ms ± 2.516 ms |
| 2 | Data and index on the GPU | 30.188 ms ± 10.241 ms |
| 3 | Index on the CPU, data on the GPU | 18.524 ms ± 5.147 ms |
| 4 | Custom kernel | 163.619 μs ± 104.392 μs |

Questions:

  1. According to the results, Item 2 (index on the GPU) is slower than Item 3 (index on the CPU). However, from my understanding, Item 2 should be at least as fast as Item 3, since both the data and the index are already on the GPU.
  2. I noticed that while benchmarking Item 3, the GPU memory is almost fully occupied (24 GB). What causes this?
  3. From the benchmark results, the custom kernel (Item 4) is by far the most efficient approach. Can we achieve similar performance with CuArray operations alone?

view(dt4_d, idx_d) .+= 1 should essentially do the same as your custom kernel, by passing a SubArray to the broadcast kernel. That involves much more complex machinery (SubArray and broadcast), but it shouldn't warrant a two-order-of-magnitude slowdown.

Oh wait, the problem is the memory copy that happens when constructing the view, for the purpose of bounds checking. So if you do @inbounds view(data, idx) .+= 1, it's almost as fast as your custom kernel version.
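For reference, here is a minimal sketch of that broadcast-only variant, reusing the arrays from your setup; dt5_d and t5! are just illustrative names, not part of your original benchmarks:

dt5_d = CuArray(dt1_h)

function t5!(idx_d, dt5_d)
    # @inbounds elides the bounds check on the view, avoiding the index copy described above
    @inbounds view(dt5_d, idx_d) .+= 1.0
    return nothing
end

@benchmark CUDA.@sync t5!($idx_d, $dt5_d)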

This issue will be fixed by JuliaGPU/GPUArrays.jl#499 (Rework host indexing). @inbounds will still be required to reach the performance of the custom kernel, but the bounds check is now significantly faster. See JuliaGPU/CUDA.jl#1678 (view(data, idx) boundschecking is disproportionately expensive) for detailed timings.
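If you want to see on your own setup how much the bounds check costs, CUDA.@time reports time as well as CPU/GPU allocations for a single call. Here is a quick illustrative check comparing the bounds-checked broadcast (your t2!) against the @inbounds sketch above; the exact numbers will depend on your CUDA.jl/GPUArrays.jl versions:

CUDA.@time t2!(idx_d, dt2_d)   # broadcast through a bounds-checked view
CUDA.@time t5!(idx_d, dt5_d)   # same broadcast with the bounds check elided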