Use of CartesianIndices with CUDA?

Hi,
I have a function that iterates over all elements of a four-dimensional array of CartesianIndex elements:

21×31×41×51 Array{CartesianIndex{4},4}:
[:, :, 1, 1] =
 CartesianIndex(1, 1, 1, 1)   …  CartesianIndex(1, 31, 1, 1)
 CartesianIndex(2, 1, 1, 1)      CartesianIndex(2, 31, 1, 1)
 CartesianIndex(3, 1, 1, 1)      CartesianIndex(3, 31, 1, 1)
 CartesianIndex(4, 1, 1, 1)      CartesianIndex(4, 31, 1, 1)
 CartesianIndex(5, 1, 1, 1)      CartesianIndex(5, 31, 1, 1)
...

The work the function does is 100% parallelizable: it performs some independent operations for each element of the array and combines the results at the end (not really a product, but something similar). That seems perfect for CUDA, as there are many elements to process, each one independent of the rest.

Now the question is: can I work directly with these CartesianIndex elements in CUDA? Is this implemented? Or should I convert them to a 4-dimensional array and work with that? If I have to convert to an array, how do you properly write a nested for loop (one loop per dimension of the CartesianIndex, and here I have 4) while taking full advantage of CUDA's parallel capabilities?

Thanks a lot,

Ferran.
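
On the nested-loop part of the question: a common pattern is to give each thread a single linear index and convert it into a 4-D index, so the four nested loops are not needed. Here is a minimal CPU-only sketch of that mapping, using the array size from the post. The helper name `to_cartesian` is hypothetical, used only for illustration; on the GPU the linear index would be computed from the block and thread indices.

```julia
# CPU-only sketch: one linear index per "thread" replaces four nested loops.
dims = (21, 31, 41, 51)          # size of the array in the post
ci   = CartesianIndices(dims)    # lazy object, no 21×31×41×51 array is allocated

# Hypothetical helper: map a linear index to the matching 4-D index.
to_cartesian(ci, i) = ci[i]

I = to_cartesian(ci, 22)         # column-major order: 22 is one past the first column
println(Tuple(I))                # the four per-dimension indices
println(LinearIndices(ci)[I])    # and back to the linear index
```

Each element is visited exactly once when the linear index runs over 1:length(ci), which is what makes the per-thread mapping safe for independent work.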

You can use CartesianIndex objects on the GPU – did you try it? But the real answer to your question depends on how you’re going to use them.

EDIT: You will need to provide more information about that before anyone can give a useful answer.

For instance, this throws an error on my machine:

using CuArrays
using CuArrays.CURAND
using CUDAnative
using CUDAdrv

# include the path to user-defined modules
# 
#push!(LOAD_PATH,homedir()*"/Julia_1/Modules");
#push!(LOAD_PATH,homedir()*"/Julia_1/Modules/RBM");
#push!(LOAD_PATH,homedir()*"/Julia_1/Modules/CUDA");

A = rand(3,4,5)

aux_CI = CartesianIndices(A)

display(aux_CI)
println()
println(size(aux_CI))
println(length(aux_CI))

tot = CuArrays.zeros(length(aux_CI))

function bucle_1(y,tot)
    index    = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride   = blockDim().x * gridDim().x
    for i    = index:stride:length(y)
        tot[i] = y[1]+y[2]
    end
end;

numblocks      = 256
@cuda threads  = 256 blocks = numblocks bucle_1(aux_CI,tot)

I typed it quickly, so I may have made mistakes, but anyway…

Best regards,

Ferran.

aux_CI is an array that lives on the CPU. You need to make a CuArray version.

Change

tot[i] = y[1]+y[2]

to

tot[i] = y[i][1]+y[i][2]
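
To see why the change matters, here is a small CPU-only sketch (no GPU needed) with the same 3×4×5 shape as the example. Adding two CartesianIndex values gives another CartesianIndex, which cannot be stored in a Float32 array, whereas `y[i][1] + y[i][2]` adds two plain integers. The plain `for` loop below just stands in for the grid-stride loop of the kernel.

```julia
ci = CartesianIndices((3, 4, 5))

# y[1] + y[2] adds two whole indices component-wise and yields a CartesianIndex.
s = ci[1] + ci[2]
println(s isa CartesianIndex)

# y[i][1] + y[i][2] picks the i-th index, then adds its first two components.
tot = zeros(Float32, length(ci))
for i in 1:length(ci)            # stand-in for the kernel's strided loop
    tot[i] = ci[i][1] + ci[i][2]
end
println(tot[1:3])
```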

You can pass CartesianIndices(x) directly to GPU kernels; it will pass OneTo(len) ranges instead of constructing an array.
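
A quick CPU-side check of that point, as a sketch: the CartesianIndices object only holds its axis ranges, and it is a plain bits value, which is the kind of argument that can be handed to a kernel without a device array. The comparison against `collect(ci)` is only there to show the difference from a materialized array of indices.

```julia
A  = rand(3, 4, 5)
ci = CartesianIndices(A)

println(axes(ci))                        # one OneTo range per dimension
println(isbits(ci))                      # plain immutable data, no heap array behind it
println(sizeof(ci) < sizeof(collect(ci)))  # far smaller than the materialized version
```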

Is there a place to find examples using CartesianIndex objects on GPU? Is it as straightforward as defining the CartesianIndices object, passing it to the kernel function, and then accessing elements using a linear index constructed from the block and thread index values?

Yes, there are several kernels like that in CUDA.jl and GPUArrays.jl, e.g., https://github.com/JuliaGPU/GPUArrays.jl/blob/b988cdcc81011ded7223f250d127a1e544ea2d2a/src/host/broadcast.jl#L53-L72
(where @cartesianidx is a simple macro that creates an iterator based on the current block/thread index).
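
A rough sketch of the idea behind such a helper, emulated on the CPU so it runs without a GPU. The function name `cartesian_for_thread` and its explicit block/thread arguments are assumptions made for illustration; in a real kernel those values would come from blockIdx(), blockDim() and threadIdx(), and this is not the GPUArrays.jl implementation.

```julia
# Hypothetical stand-in for a per-thread Cartesian index helper (CPU emulation).
function cartesian_for_thread(A, block, blockdim, thread)
    i = (block - 1) * blockdim + thread    # 1-based global linear index
    i > length(A) && return nothing        # surplus threads have no element
    return CartesianIndices(A)[i]
end

dest     = zeros(Int, 3, 4, 5)
blockdim = 32
nblocks  = cld(length(dest), blockdim)     # enough blocks to cover every element

# Emulate the launch: every (block, thread) pair handles at most one element.
for block in 1:nblocks, thread in 1:blockdim
    idx = cartesian_for_thread(dest, block, blockdim, thread)
    idx === nothing && continue
    dest[idx] = idx[1] + idx[2]
end

println(dest[3, 4, 5])
```

The bounds check matters because the number of launched threads is usually rounded up to a multiple of the block size, so some threads fall past the end of the array.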