Understanding the grid-stride loop

I tried to read the performance tips :sweat_smile: (Performance Tips · CUDA.jl)

and I can’t understand the point of this example, because it seems the while loop executes only once:

function gpu_add5!(y, x)
    index = (blockIdx().x - Int32(1)) * blockDim().x + threadIdx().x
    stride = gridDim().x * blockDim().x

    i = index
    while i <= length(y)
        @inbounds y[i] += x[i]
        i += stride
    end
    return
end
function bench_gpu5!(y, x)
    kernel = @cuda launch=false gpu_add5!(y, x)
    config = launch_configuration(kernel.fun)
    threads = min(length(y), config.threads)
    blocks = cld(length(y), threads)

    CUDA.@sync kernel(y, x; threads, blocks)
end

for example:

        using CUDA

        function test!(x) 
            gpukernel = @cuda launch=false kernel_test!(x) 
            config = launch_configuration(gpukernel.fun)
            Nx = length(x)
            maxThreads = config.threads
            maxThreads = 3
            Tx  = min(maxThreads, Nx)
            Bx  = cld(Nx, Tx)
            CUDA.@sync gpukernel(x; threads = Tx, blocks = Bx)
        end
        function kernel_test!(x) 
            index  = (blockIdx().x - Int32(1)) * blockDim().x + threadIdx().x
            stride = gridDim().x * blockDim().x
            i = index
            while  i <= length(x)
                @cuprintln "i = $i, index = $index, threadIdx: $(threadIdx().x), blockIdx $(blockIdx().x), blockDim $(blockDim().x)" 
                i += stride
            end
            return nothing
        end
        test!(CUDA.zeros(10)) 

gives:

i = 10, index = 10, threadIdx: 1, blockIdx 4, blockDim 3
i = 7, index = 7, threadIdx: 1, blockIdx 3, blockDim 3
i = 8, index = 8, threadIdx: 2, blockIdx 3, blockDim 3
i = 9, index = 9, threadIdx: 3, blockIdx 3, blockDim 3
i = 1, index = 1, threadIdx: 1, blockIdx 1, blockDim 3
i = 2, index = 2, threadIdx: 2, blockIdx 1, blockDim 3
i = 3, index = 3, threadIdx: 3, blockIdx 1, blockDim 3
i = 4, index = 4, threadIdx: 1, blockIdx 2, blockDim 3
i = 5, index = 5, threadIdx: 2, blockIdx 2, blockDim 3
i = 6, index = 6, threadIdx: 3, blockIdx 2, blockDim 3

i is always equal to index.

Maybe this was the idea behind the gpu_add5! example:

        function test2!(x) 
            gpukernel = @cuda launch=false kernel_test2!(x) 
            config = launch_configuration(gpukernel.fun)
            Nx = length(x)
            maxThreads = config.threads
            maxThreads = 3
            Tx  = min(maxThreads, Nx)
            CUDA.@sync gpukernel(x; threads = Tx, blocks = 1)
        end
        function kernel_test2!(x) 
            index = threadIdx().x
            stride = blockDim().x
            i = index
            while i <= length(x)
                @cuprintln "i = $i, index = $index, threadIdx: $(threadIdx().x), blockIdx $(blockIdx().x), blockDim $(blockDim().x)" 
                i += stride
            end
            return nothing
        end
        test2!(CUDA.zeros(10))

which gives:

i = 1, index = 1, threadIdx: 1, blockIdx 1, blockDim 3
i = 2, index = 2, threadIdx: 2, blockIdx 1, blockDim 3
i = 3, index = 3, threadIdx: 3, blockIdx 1, blockDim 3
i = 4, index = 1, threadIdx: 1, blockIdx 1, blockDim 3
i = 5, index = 2, threadIdx: 2, blockIdx 1, blockDim 3
i = 6, index = 3, threadIdx: 3, blockIdx 1, blockDim 3
i = 7, index = 1, threadIdx: 1, blockIdx 1, blockDim 3
i = 8, index = 2, threadIdx: 2, blockIdx 1, blockDim 3
i = 9, index = 3, threadIdx: 3, blockIdx 1, blockDim 3
i = 10, index = 1, threadIdx: 1, blockIdx 1, blockDim 3

Using a grid-stride loop like that makes it possible to “decouple” the launch configuration from the iteration domain. Here, it doesn’t matter much indeed, but it does still result in a more flexible kernel implementation that isn’t tied to exactly how that host method launches it, so it’s a good thing to do.
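
To make the decoupling concrete, here is a minimal sketch (reusing gpu_add5! from the first post; the array size and the deliberately tiny launch configuration are arbitrary choices for illustration). The kernel covers the whole array even though far fewer threads than elements are launched, because each thread then strides over several indices:

        using CUDA

        # gpu_add5! exactly as in the first post
        function gpu_add5!(y, x)
            index = (blockIdx().x - Int32(1)) * blockDim().x + threadIdx().x
            stride = gridDim().x * blockDim().x
            i = index
            while i <= length(y)
                @inbounds y[i] += x[i]
                i += stride
            end
            return
        end

        x = CUDA.ones(10_000)
        y = CUDA.zeros(10_000)

        # Deliberately launch only 32 threads in a single block: the stride is 32,
        # so each thread walks over roughly 10_000 / 32 elements in its while loop.
        CUDA.@sync @cuda threads=32 blocks=1 gpu_add5!(y, x)
        sum(y)  # 10000.0f0 — the whole array was still processed

The result is the same for any threads/blocks combination; only the number of loop iterations each thread performs changes.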

Hi! Could you kindly explain: in gpu_add5!, doesn’t this loop make only one iteration anyway?

And

while i <= length(y)
    @inbounds y[i] += x[i]
    i += stride
end

can be replaced by:

if i <= length(y)
    @inbounds y[i] += x[i]
end

That’s not the case; the stride is smaller than the array because you may not be able to launch enough threads.

Hi! But when does it make more than one iteration?

        function test!(x) 
            gpukernel = @cuda launch=false kernel_test!(x) 
            config = launch_configuration(gpukernel.fun)
            Nx = length(x)
            maxThreads = config.threads
            Tx  = min(maxThreads, Nx)
            Bx  = cld(Nx, Tx)
            CUDA.@sync gpukernel(x; threads = Tx, blocks = Bx)
        end
        function kernel_test!(x) 
            index  = (blockIdx().x - Int32(1)) * blockDim().x + threadIdx().x
            stride = gridDim().x * blockDim().x
            i = index
            n = 1
            while  i <= length(x)
                x[i] = n
                i += stride
                n += 1
            end
            return nothing
        end
        x = CUDA.zeros(10_000_000)
        test!(x) 
        sum(x)

In this case it makes only one iteration everywhere too…

Does it make sense when you have an array with more than 65535 * 1024 elements (config.threads * config.blocks)?

In that case, is it correct to modify the kernel call like this:

        gpukernel = @cuda launch=false kernel_test!(x)
        config = launch_configuration(gpukernel.fun)
        Nx = length(x)
        maxThreads = config.threads
        maxBlocks  = config.blocks
        Tx  = min(maxThreads, Nx)
        Bx  = min(maxBlocks, cld(Nx, Tx))
        CUDA.@sync gpukernel(x; threads = Tx, blocks = Bx)

Mostly. config.blocks isn’t a maximum, it’s a suggested minimum. So in principle you never need a grid stride since you can almost always extend the block size, however, you can’t in all dimensions, and sometimes it can put additional pressure on the block scheduler where a simple while loop in a kernel doesn’t.
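
As a sketch of what that looks like in practice (reusing kernel_test! from above; the helper name occupancy_launch!, the factor of 4 “waves” per SM, and the assumption that the multiprocessor count is exposed as CUDA.DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT, analogous to the attribute mentioned further down in the thread, are my own choices): cap the grid at a modest multiple of the SM count and let the grid-stride loop pick up the remaining elements.

        using CUDA

        # Reuses kernel_test! from the post above, which records each thread's
        # grid-stride iteration count into x.
        function occupancy_launch!(x)
            gpukernel = @cuda launch=false kernel_test!(x)
            config = launch_configuration(gpukernel.fun)
            Nx = length(x)
            Tx = min(config.threads, Nx)
            # Cap the grid at a few "waves" of blocks per SM instead of
            # cld(Nx, Tx) blocks; the while loop in the kernel covers the rest.
            sms = attribute(device(), CUDA.DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT)
            Bx = min(4 * sms, cld(Nx, Tx))
            CUDA.@sync gpukernel(x; threads = Tx, blocks = Bx)
        end

        x = CUDA.zeros(10_000_000)
        occupancy_launch!(x)
        maximum(x)  # > 1: each thread now handles several elements

Depending on the GPU, each thread then performs a few dozen iterations on a 10_000_000-element array, while the launch itself stays small.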

All this isn’t CUDA.jl specific though, so refer to the NVIDIA blog post for other details and advantages: https://developer.nvidia.com/blog/cuda-pro-tip-write-flexible-kernels-grid-stride-loops/

Thank you very much for the explanation!

One off-topic question: is it possible to get the shared memory limit for the current device?

There’s attribute(device(), CUDA.DEVICE_ATTRIBUTE_MAX_SHARED_MEMORY_PER_BLOCK_OPTIN), but IIRC getting to that limit does require configuring the kernel using attributes(kernel.fun)[CUDA.FUNC_ATTRIBUTE_MAX_DYNAMIC_SHARED_SIZE_BYTES] = ....
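
A minimal sketch of how those two pieces might fit together (the toy kernel, the sizes, and the use of CuDynamicSharedArray for the dynamic shared memory buffer are my own illustration): query the opt-in limit, raise the kernel’s dynamic shared-memory cap, and pass the actual size via shmem at launch.

        using CUDA

        # A toy kernel that just stages data through dynamic shared memory.
        function shmem_kernel!(y, x)
            tid = threadIdx().x
            buf = CuDynamicSharedArray(Float32, blockDim().x)
            @inbounds buf[tid] = x[tid]
            sync_threads()
            @inbounds y[tid] = buf[tid]
            return
        end

        x = CUDA.rand(Float32, 256)
        y = CUDA.zeros(Float32, 256)

        kernel = @cuda launch=false shmem_kernel!(y, x)

        # Per-block opt-in limit for this device ...
        limit = attribute(device(), CUDA.DEVICE_ATTRIBUTE_MAX_SHARED_MEMORY_PER_BLOCK_OPTIN)

        # ... which the kernel has to opt into before it can request more than
        # the default dynamic shared memory size (this small example doesn't
        # actually need the opt-in; it just shows the mechanism).
        attributes(kernel.fun)[CUDA.FUNC_ATTRIBUTE_MAX_DYNAMIC_SHARED_SIZE_BYTES] = limit

        # This particular launch only needs 256 * 4 bytes, passed via `shmem`.
        CUDA.@sync kernel(y, x; threads = 256, blocks = 1, shmem = 256 * sizeof(Float32))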
