Combine CartesianIndices for efficient CUDA kernels

Suppose I have a multidimensional array and I want to perform some operation along one of its axes. A generic sample of the code can look like this:

s = (1, 2, 3, 4, 5)   # shape of the array
dim = 3   # dimension along which I want to perform the operation

A = zeros(s)

spre = s[1:dim-1]
sdim = s[dim]
spost = s[dim+1:end]

Cpre = CartesianIndices(spre)
Cpost = CartesianIndices(spost)

for Ipost in Cpost
    for Ipre in Cpre
        for i=1:sdim
            A[Ipre, i, Ipost] = i
        end
    end
end

Next, I would like to launch these for loops on the GPU using CUDA. To do it in the most efficient way, I have to merge the two outer loops into a single loop whose index I can iterate over with strides. For that I need to merge the two sets of Cartesian indices, Cpre and Cpost, into one. On the CPU I came to the following solution:

Cprepost = CartesianIndices((spre..., spost...))

for I in Cprepost
    Ipre = Tuple(I)[1:dim-1]
    Ipost = Tuple(I)[dim:end]
    for i=1:sdim
        A[Ipre..., i, Ipost...] = i
    end
end

By analogy, I would expect the corresponding GPU kernel to look like this:

using CUDA

function kernel(A, dim, Cprepost, sdim)
    id = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = blockDim().x * gridDim().x

    for I in Cprepost[id:stride:end]
        Ipre = Tuple(I)[1:dim-1]
        Ipost = Tuple(I)[dim:end]
        for i=1:sdim
            A[Ipre..., i, Ipost...] = i
        end
    end

    return nothing
end

Agpu = CUDA.zeros(s)

@cuda threads=prod(spre)*prod(spost) kernel(Agpu, dim, Cprepost, sdim)

However, this code causes an “unsupported call through a literal pointer” error.
The only kernel that has worked for me so far is the following:

function kernel(A, dim, Cprepost, sdim)
    id = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = blockDim().x * gridDim().x

    dim = 3   # has to be hard-coded inside the kernel, see below

    for k=id:stride:length(Cprepost)
        Ipre = Tuple(Cprepost[k])[1:dim-1]
        Ipost = Tuple(Cprepost[k])[dim:end]
        for i=1:sdim
            A[Ipre..., i, Ipost...] = i
        end
    end
    return nothing
end

@cuda threads=prod(spre)*prod(spost) kernel(Agpu, dim, Cprepost, sdim)

However, here I have to explicitly redefine the dim variable inside the kernel, because otherwise I get the same “unsupported call through a literal pointer” error.

Can you please help me write the corresponding CUDA kernel?
Thank you.

Have a look at CUDA.jl’s mapreducedim! kernel; it does things like that.


Slicing with Cprepost[id:stride:end] creates a regular CPU array, which can’t be dynamically allocated on the GPU. Try @view Cprepost[id:stride:end] instead.
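That is, something like this (only the loop header changes; untested):

function kernel(A, dim, Cprepost, sdim)
    id = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = blockDim().x * gridDim().x

    # A view avoids allocating a new array inside the kernel
    for I in @view Cprepost[id:stride:end]
        Ipre = Tuple(I)[1:dim-1]
        Ipost = Tuple(I)[dim:end]
        for i=1:sdim
            A[Ipre..., i, Ipost...] = i
        end
    end

    return nothing
end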

You can use a Val type to specialize on dim, which should make this type-stable.
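An untested sketch of the Val approach: dim becomes a compile-time constant inside the kernel, so the tuple slices have statically known lengths.

function kernel(A, ::Val{dim}, Cprepost, sdim) where {dim}
    id = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = blockDim().x * gridDim().x

    for k=id:stride:length(Cprepost)
        # dim is a type parameter here, not a runtime value
        Ipre = Tuple(Cprepost[k])[1:dim-1]
        Ipost = Tuple(Cprepost[k])[dim:end]
        for i=1:sdim
            A[Ipre..., i, Ipost...] = i
        end
    end
    return nothing
end

@cuda threads=prod(spre)*prod(spost) kernel(Agpu, Val(dim), Cprepost, sdim)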

It seems that neither of these approaches works.

P.S. I am reading the CUDA.jl sources now.

Well, it seems I found the solution:

using CUDA

s = (1, 2, 3, 4, 5)
dim = 3

spre = s[1:dim-1]
sdim = s[dim]
spost = s[dim+1:end]

Cpre = CartesianIndices(spre)
Cpost = CartesianIndices(spost)
C = CartesianIndices((length(Cpre), length(Cpost)))


function kernel(A, Cpre, Cpost, C, sdim)
    id = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = blockDim().x * gridDim().x
    for k=id:stride:length(C)
        # map the linear index k to the corresponding indices of Cpre and Cpost
        i1 = C[k][1]
        i2 = C[k][2]
        Ipre = Cpre[i1]
        Ipost = Cpost[i2]
        for i=1:sdim
            A[Ipre, i, Ipost] = i
        end
    end
    return nothing
end


Agpu = CUDA.zeros(s)

@cuda threads=length(C) kernel(Agpu, Cpre, Cpost, C, sdim)

The trick was to create an additional set of Cartesian indices, C, which maps the linear index k to the corresponding indices of Cpre and Cpost.
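As a quick sanity check, the GPU result can be compared against the CPU loop from the beginning of the post (CUDA.zeros defaults to Float32, but the stored values are small integers, so the comparison with the Float64 CPU array is exact):

A = zeros(s)
for Ipost in Cpost, Ipre in Cpre
    for i=1:sdim
        A[Ipre, i, Ipost] = i
    end
end

@assert Array(Agpu) == A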