# Combine CartesianIndices for efficient CUDA kernels

Suppose I have a multidimensional array and want to perform some operation along one of its axes. A generic sample can look like this:

```julia
s = (1, 2, 3, 4, 5)   # shape of the array
dim = 3               # dimension along which I want to perform the operation

A = zeros(s)

spre = s[1:dim-1]
sdim = s[dim]
spost = s[dim+1:end]

Cpre = CartesianIndices(spre)
Cpost = CartesianIndices(spost)

for Ipost in Cpost
    for Ipre in Cpre
        for i = 1:sdim
            A[Ipre, i, Ipost] = i
        end
    end
end
```

Next, I would like to launch these `for` loops on the GPU using CUDA. To do this in the most efficient way, I have to merge the two outer loops into a single one whose index I can iterate over with strides. For that reason I need to merge the two sets of Cartesian indices, `Cpre` and `Cpost`, into one. On the CPU I came to the following solution:

```julia
Cprepost = CartesianIndices((spre..., spost...))

for I in Cprepost
    Ipre = Tuple(I)[1:dim-1]
    Ipost = Tuple(I)[dim:end]
    for i = 1:sdim
        A[Ipre..., i, Ipost...] = i
    end
end
```
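As a small CPU sanity check (my own addition, assuming the same `s` and `dim` as above), the merged single-loop version can be compared against the original nested loops:

```julia
s = (1, 2, 3, 4, 5); dim = 3
spre, sdim, spost = s[1:dim-1], s[dim], s[dim+1:end]

# Reference: the original nested loops over Cpre and Cpost
B = zeros(s)
for Ipost in CartesianIndices(spost), Ipre in CartesianIndices(spre), i in 1:sdim
    B[Ipre, i, Ipost] = i
end

# Merged: a single loop over the combined Cartesian indices
A = zeros(s)
for I in CartesianIndices((spre..., spost...))
    Ipre, Ipost = Tuple(I)[1:dim-1], Tuple(I)[dim:end]
    for i = 1:sdim
        A[Ipre..., i, Ipost...] = i
    end
end

@assert A == B   # both variants visit exactly the same (Ipre, i, Ipost) combinations
```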

By analogy, I would expect the GPU kernel to look like this:

```julia
using CUDA

function kernel(A, dim, Cprepost, sdim)
    id = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = blockDim().x * gridDim().x

    for I in Cprepost[id:stride:end]
        Ipre = Tuple(I)[1:dim-1]
        Ipost = Tuple(I)[dim:end]
        for i = 1:sdim
            A[Ipre..., i, Ipost...] = i
        end
    end

    return nothing
end

Agpu = CUDA.zeros(s)

@cuda threads=prod(spre)*prod(spost) kernel(Agpu, dim, Cprepost, sdim)
```

However, this code causes an “unsupported call through a literal pointer” error.
The only kernel that has worked for me so far is the following:

```julia
function kernel(A, dim, Cprepost, sdim)
    id = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = blockDim().x * gridDim().x

    dim = 3

    for k = id:stride:length(Cprepost)
        Ipre = Tuple(Cprepost[k])[1:dim-1]
        Ipost = Tuple(Cprepost[k])[dim:end]
        for i = 1:sdim
            A[Ipre..., i, Ipost...] = i
        end
    end
    return nothing
end

@cuda threads=prod(spre)*prod(spost) kernel(Agpu, dim, Cprepost, sdim)
```

However, here I have to define the `dim` variable explicitly within the kernel, because otherwise I get the same “unsupported call through a literal pointer” error.

Can you please help me write the corresponding CUDA kernel?
Thank you.

Have a look at CUDA.jl’s `mapreducedim!` kernel; it does things like that.


`Cprepost[id:stride:end]` creates a CPU array, which can’t be dynamically allocated on the GPU. Try `@view Cprepost[id:stride:end]` instead.

You can use a `Val` type to specialize on `dim`, which should make this type stable.
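A minimal sketch of that idea (`split_at` is my own hypothetical helper, not part of CUDA.jl): carry `dim` in the type domain via `Val`, so the tuple split is done on a compile-time constant.

```julia
# Hypothetical helper: split an index tuple around `dim`, with `dim` encoded
# as a type parameter so the compiler can specialize the tuple slicing.
split_at(I::Tuple, ::Val{dim}) where {dim} = (I[1:dim-1], I[dim:end])

# Inside the kernel one would then write something like
#   Ipre, Ipost = split_at(Tuple(Cprepost[k]), vdim)
# and launch with `@cuda ... kernel(Agpu, Val(dim), Cprepost, sdim)`.
```

The point is that a runtime `dim` makes `Tuple(I)[1:dim-1]` type-unstable, which the GPU compiler cannot handle; with `Val(dim)` the split sizes are known at compile time.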

It seems none of the approaches works.

P.S. I am reading the CUDA.jl sources now.

Well, it seems I found the solution:

```julia
using CUDA

s = (1, 2, 3, 4, 5)
dim = 3

spre = s[1:dim-1]
sdim = s[dim]
spost = s[dim+1:end]

Cpre = CartesianIndices(spre)
Cpost = CartesianIndices(spost)
C = CartesianIndices((length(Cpre), length(Cpost)))

function kernel(A, Cpre, Cpost, C, sdim)
    id = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = blockDim().x * gridDim().x
    for k = id:stride:length(C)
        i1, i2 = Tuple(C[k])   # split the linear index into the two sub-indices
        Ipre = Cpre[i1]
        Ipost = Cpost[i2]
        for i = 1:sdim
            A[Ipre, i, Ipost] = i
        end
    end
    return nothing
end

Agpu = CUDA.zeros(s)

@cuda threads=length(C) kernel(Agpu, Cpre, Cpost, C, sdim)
```

The trick was to create an additional set of Cartesian indices, `C`, which maps the linear index onto the corresponding indices of `Cpre` and `Cpost`.
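The same index-merging trick can be checked on the CPU without a GPU (my own sketch, reusing the definitions from the kernel above):

```julia
# CPU check of the index-merging trick: walk a single linear index over C
# and recover the (Ipre, Ipost) pair for each iteration.
s = (1, 2, 3, 4, 5); dim = 3
spre, sdim, spost = s[1:dim-1], s[dim], s[dim+1:end]
Cpre, Cpost = CartesianIndices(spre), CartesianIndices(spost)
C = CartesianIndices((length(Cpre), length(Cpost)))

A = zeros(s)
for k = 1:length(C)
    i1, i2 = Tuple(C[k])        # linear sub-indices into Cpre and Cpost
    Ipre, Ipost = Cpre[i1], Cpost[i2]
    for i = 1:sdim
        A[Ipre, i, Ipost] = i
    end
end

# Every slice of A along `dim` should equal its index along that dimension.
@assert all(A[ip, i, jp] == i for ip in Cpre, i in 1:sdim, jp in Cpost)
```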