Efficient CuArray shift/rotation

swibe · September 6, 2022, 3:44pm

Hi all,

I have a workload which regularly needs to perform operations on adjacent elements in an array and accumulate the result. I would sometimes like to be able to run this parallelised across the whole GPU, sometimes have many copies, each running sequentially within a single GPU thread, and sometimes run it sequentially on the CPU.

For the parallel option, circshift does the job (though at the cost of copying the whole array). This copy requires a memory allocation, so doesn’t work in the sequential case.

julia> a = CuArray([1,2,3]);

julia> sum(a + circshift(a, -1))
12

For the sequential case, I can use array indices, but getindex isn’t allowed in the parallel case.

sum(a[i] + a[mod1(i+1, length(a))] for i in eachindex(a))

It looks like ShiftedArrays would be ideal for this. It offers a lazy version of circshift which doesn’t need to allocate any memory. On the CPU, this is by far the quickest option I’ve found, but it’s not compatible with the GPU.

julia> sum(a + ShiftedArrays.circshift(a, -1))
ERROR: GPU compilation of kernel #broadcast_kernel#17(CUDA.CuKernelContext, CuDeviceVector{Int64, 1}, Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{1}, Tuple{Base.OneTo{Int64}}, typeof(+), Tuple{Base.Broadcast.Extruded{CuDeviceVector{Int64, 1}, Tuple{Bool}, Tuple{Int64}}, Base.Broadcast.Extruded{CircShiftedVector{Int64, CuArray{Int64, 1, CUDA.Mem.DeviceBuffer}}, Tuple{Bool}, Tuple{Int64}}}}, Int64) failed
KernelError: passing and using non-bitstype argument

Obviously I can have multiple options and use whichever one is appropriate, but I’d like to have a single implementation if possible. Is there an obvious way to do this?

stevengj · September 6, 2022, 4:59pm

The most common way to do this in parallel computing is not by shifting the array, but rather by using “ghost cells” — make redundant copies of boundary data in a communication step, after which point the operations on each “chunk” of the array can occur in parallel. See e.g. Principles of Parallel Programming by Lin and Snyder, quoted here: Seemingly unnecessary allocation within for loop - #9 by ptoche and the GitHub - fverdugo/PartitionedArrays.jl: Vectors and sparse matrices partitioned into pieces for parallel distributed-memory computations. package (though this is not GPU-based).

RainerHeintzmann · April 20, 2023, 7:55pm

Please feel free to try out this fork:

It should hopefully do the trick.
Please let me know if you find any problems.

Topic		Replies	Views
Circshift() shifts array in opposite direction to circshift!() New to Julia question , function	14	643	December 13, 2023
The performance difference of transferring (SubArray, ReshapedArray) Array to GPU GPU flux	2	644	November 20, 2019
Moving ahead with CUDA support GPU images , cuda , paddedviews , juliaimages	2	279	March 17, 2025
Rotr90 of a CUDA.CuArray GPU	5	332	October 6, 2022
Any function like `push!` for `CuArray` GPU	2	156	September 8, 2024

Efficient CuArray shift/rotation

Related topics