Correct implementation of CuArray's slicing operations

Hello, sometimes I need slicing operations on a CuArray, and I would like to know the correct (and efficient) way to do this. I ran 4 benchmarks:

Datasets:

using BenchmarkTools
using CUDA
CUDA.allowscalar(true)
# size of array
dsize = 10_000_000
# index
idx_h = collect(1:dsize÷2)
idx_d = CuArray(idx_h)
# datasets
dt1_h = ones(dsize)
dt2_d = CuArray(dt1_h)
dt3_d = CuArray(dt1_h)
dt4_d = CuArray(dt1_h);

Functions:

  1. CPU version
    @views function t1!(idx_h, dt1_h)
        dt1_h[idx_h] .+= 1.0
        return nothing
    end
    
  2. GPU version: data and index on the GPU
    @views function t2!(idx_d, dt2_d)
        dt2_d[idx_d] .+= 1.0
        return nothing
    end
    
  3. GPU version: index on the CPU, data on the GPU
    @views function t3!(idx_h, dt3_d)
        dt3_d[idx_h] .+= 1.0
        return nothing
    end
    
  4. GPU version: custom kernel
    function kernel!(idx_d, dt4_d, sizeidx)
        ix = (blockIdx().x-1)*blockDim().x+threadIdx().x
        if ix≤sizeidx
            dt4_d[idx_d[ix]] += 1.0
        end
        return nothing
    end
    function t4!(idx_d, dt4_d)
        tds = 768
        bls = cld(length(idx_d), tds)
        CUDA.@sync begin
            @cuda threads=tds blocks=bls kernel!(idx_d, dt4_d, length(idx_d))
        end
        return nothing
    end
    

Results:

@benchmark t1!($idx_h, $dt1_h)
@benchmark CUDA.@sync t2!($idx_d, $dt2_d)
@benchmark CUDA.@sync t3!($idx_h, $dt3_d)
@benchmark t4!($idx_d, $dt4_d)
| Item | Name | Time |
|------|------|------|
| 1 | CPU version | 14.062 ms ± 2.516 ms |
| 2 | Data and index on the GPU | 30.188 ms ± 10.241 ms |
| 3 | Index on the CPU, data on the GPU | 18.524 ms ± 5.147 ms |
| 4 | Custom kernel | 163.619 μs ± 104.392 μs |

Questions:

  1. According to the results, Item 2 (index on the GPU) is slower than Item 3 (index on the CPU). However, from my understanding, Item 2 should be at least as fast as Item 3, since both the data and the index are already on the GPU.
  2. I noticed that while benchmarking Item 3, the GPU memory is almost fully occupied (24 GB). What causes this?
  3. From the benchmark results, the custom kernel (Item 4) is by far the most efficient approach. Can we achieve similar performance with CuArray operations alone?

view(dt4_d, idx_d) .+= 1 should essentially do the same as your custom kernel, by passing a SubArray to the broadcast kernel. That involves much more complex machinery (SubArray and broadcast), but it shouldn't warrant a two-order-of-magnitude slowdown.

Oh wait, the problem is the memory copy that happens when constructing the view, for the purpose of bounds checking. So if you do @inbounds view(data, idx) .+= 1, it's almost as fast as your custom kernel version.
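For reference, here is a minimal sketch of that broadcast-only variant, reusing the arrays from your setup; dt5_d and t5! are just illustrative names, not part of your original benchmarks:

dt5_d = CuArray(dt1_h)

function t5!(idx_d, dt5_d)
    # @inbounds elides the bounds check on the view, avoiding the index copy described above
    @inbounds view(dt5_d, idx_d) .+= 1.0
    return nothing
end

@benchmark CUDA.@sync t5!($idx_d, $dt5_d)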

This issue will be fixed by JuliaGPU/GPUArrays.jl#499 (Rework host indexing). @inbounds will still be required to reach the performance of the custom kernel, but the bounds check is now significantly faster. See JuliaGPU/CUDA.jl#1678 (view(data, idx) boundschecking is disproportionately expensive) for detailed timings.
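If you want to see on your own setup how much the bounds check costs, CUDA.@time reports time as well as CPU/GPU allocations for a single call. Here is a quick illustrative check comparing the bounds-checked broadcast (your t2!) against the @inbounds sketch above; the exact numbers will depend on your CUDA.jl/GPUArrays.jl versions:

CUDA.@time t2!(idx_d, dt2_d)   # broadcast through a bounds-checked view
CUDA.@time t5!(idx_d, dt5_d)   # same broadcast with the bounds check elided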