Efficient CuArray shift/rotation

Hi all,

I have a workload which regularly needs to perform operations on adjacent elements in an array and accumulate the result. I would sometimes like to be able to run this parallelised across the whole GPU, sometimes have many copies, each running sequentially within a single GPU thread, and sometimes run it sequentially on the CPU.

For the parallel option, circshift does the job (though at the cost of copying the whole array). This copy requires a memory allocation, so doesn’t work in the sequential case.

julia> a = CuArray([1,2,3]);

julia> sum(a + circshift(a, -1))

For the sequential case, I can use array indices, but getindex isn’t allowed in the parallel case.

sum(a[i] + a[mod1(i+1, length(a))] for i in eachindex(a))

It looks like ShiftedArrays would be ideal for this. It offers a lazy version of circshift which doesn’t need to allocate any memory. On the CPU, this is by far the quickest option I’ve found, but it’s not compatible with the GPU.

julia> sum(a + ShiftedArrays.circshift(a, -1))
ERROR: GPU compilation of kernel #broadcast_kernel#17(CUDA.CuKernelContext, CuDeviceVector{Int64, 1}, Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{1}, Tuple{Base.OneTo{Int64}}, typeof(+), Tuple{Base.Broadcast.Extruded{CuDeviceVector{Int64, 1}, Tuple{Bool}, Tuple{Int64}}, Base.Broadcast.Extruded{CircShiftedVector{Int64, CuArray{Int64, 1, CUDA.Mem.DeviceBuffer}}, Tuple{Bool}, Tuple{Int64}}}}, Int64) failed
KernelError: passing and using non-bitstype argument

Obviously I can have multiple options and use whichever one is appropriate, but I’d like to have a single implementation if possible. Is there an obvious way to do this?

The most common way to do this in parallel computing is not by shifting the array, but rather by using “ghost cells” — make redundant copies of boundary data in a communication step, after which point the operations on each “chunk” of the array can occur in parallel. See e.g. Principles of Parallel Programming by Lin and Snyder, quoted here: Seemingly unnecessary allocation within for loop - #9 by ptoche and the GitHub - fverdugo/PartitionedArrays.jl: Vectors and sparse matrices partitioned into pieces for parallel distributed-memory computations. package (though this is not GPU-based).