Hello, I sometimes need slicing operations on a CuArray and would like to know the correct (and efficient) way to achieve this. I ran 4 benchmarks:
Datasets:

```julia
using BenchmarkTools
using CUDA
CUDA.allowscalar(true)

# size of array
dsize = 10_000_000

# index (first half of the array)
idx_h = collect(1:dsize ÷ 2)
idx_d = CuArray(idx_h)

# datasets
dt1_h = ones(dsize)
dt2_d = CuArray(dt1_h)
dt3_d = CuArray(dt1_h)
dt4_d = CuArray(dt1_h);
```
Functions:

- CPU version:

```julia
@views function t1!(idx_h, dt1_h)
    dt1_h[idx_h] .+= 1.0
    return nothing
end
```

- GPU version, data and index both on the GPU:

```julia
@views function t2!(idx_d, dt2_d)
    dt2_d[idx_d] .+= 1.0
    return nothing
end
```

- GPU version, index on the CPU and data on the GPU:

```julia
@views function t3!(idx_h, dt3_d)
    dt3_d[idx_h] .+= 1.0
    return nothing
end
```

- GPU version, hand-written kernel (a variant with an occupancy-based launch configuration follows this list):

```julia
function kernel!(idx_d, dt4_d, sizeidx)
    ix = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if ix ≤ sizeidx
        dt4_d[idx_d[ix]] += 1.0
    end
    return nothing
end

function t4!(idx_d, dt4_d)
    tds = 768
    bls = cld(length(idx_d), tds)
    CUDA.@sync begin
        @cuda threads=tds blocks=bls kernel!(idx_d, dt4_d, length(idx_d))
    end
    return nothing
end
```
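As an aside, here is a variant of `t4!` that lets CUDA.jl pick the launch configuration instead of hard-coding 768 threads (a minimal sketch based on CUDA.jl's occupancy API; the name `t4_auto!` is my own and is not benchmarked below):

```julia
function t4_auto!(idx_d, dt4_d)
    n = length(idx_d)
    # Compile the kernel without launching it, so we can query
    # an occupancy-based launch configuration.
    kernel = @cuda launch=false kernel!(idx_d, dt4_d, n)
    config = launch_configuration(kernel.fun)
    threads = min(n, config.threads)
    blocks = cld(n, threads)
    CUDA.@sync kernel(idx_d, dt4_d, n; threads, blocks)
    return nothing
end
```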
Results:

```julia
@benchmark t1!($idx_h, $dt1_h)
@benchmark CUDA.@sync t2!($idx_d, $dt2_d)
@benchmark CUDA.@sync t3!($idx_h, $dt3_d)
@benchmark t4!($idx_d, $dt4_d)
```
| Item | Name | Time (mean ± σ) |
|---|---|---|
| 1 | CPU version | 14.062 ms ± 2.516 ms |
| 2 | data and index on GPU | 30.188 ms ± 10.241 ms |
| 3 | index on CPU, data on GPU | 18.524 ms ± 5.147 ms |
| 4 | kernel function | 163.619 μs ± 104.392 μs |
Questions:

- According to the results, putting the index on the GPU (Item 2) is slower than keeping it on the CPU (Item 3). However, from my understanding, Item 2 should be a bit faster than Item 3, since both the data and the index are already on the GPU. Why is it slower?
- I noticed that while benchmarking Item 3, GPU memory was almost fully occupied (24 GB). What leads to this phenomenon? (A memory-inspection snippet follows this list.)
- From the benchmark results, the hand-written kernel seems to be the most efficient approach. Can we achieve similar performance with CuArray operations alone?
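Regarding the second question, here is one way to check whether the 24 GB is live data or memory held by CUDA.jl's caching allocator (a minimal sketch using CUDA.jl's `memory_status` and `reclaim` helpers; that pooling explains the observation is only my assumption):

```julia
# Print current device memory usage; CUDA.jl's memory pool can hold
# on to freed allocations, so the reported usage may look much larger
# than the live data.
CUDA.memory_status()

# Release cached pool memory back to the driver, then check again.
GC.gc()
CUDA.reclaim()
CUDA.memory_status()
```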