Hi all,
I have a workload that regularly needs to operate on adjacent elements of an array and accumulate the result. Sometimes I'd like to run this parallelised across the whole GPU, sometimes run many independent copies of it, each sequentially within a single GPU thread, and sometimes run it sequentially on the CPU.
For the parallel option, circshift does the job (though at the cost of copying the whole array). This copy requires a memory allocation, so it doesn’t work in the sequential in-kernel case.
julia> a = CuArray([1,2,3]);
julia> sum(a + circshift(a, -1))
12
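(For comparison, here is the same computation on the CPU with Base's eager circshift — a minimal sketch just to show the allocation cost: the shifted copy and the elementwise sum are both temporaries.)

```julia
# CPU equivalent of the GPU snippet above, using Base.circshift.
# circshift([1, 2, 3], -1) == [2, 3, 1], so the pairwise sums are 3, 5, 4.
a = [1, 2, 3]
sum(a .+ circshift(a, -1))  # 12, at the cost of two temporary arrays
```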
For the sequential case, I can use array indices, but scalar getindex isn’t allowed in the parallel case.
sum(a[i] + a[mod1(i+1, length(a))] for i in eachindex(a))
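For reference, the sequential version wrapped up as a self-contained function — the name adjacent_sum and the explicit accumulation loop are my own choices for illustration, not from any package:

```julia
# Accumulate f over circularly adjacent pairs of a vector, sequentially
# and without allocating. mod1 wraps the last index back to the first.
function adjacent_sum(f, a::AbstractVector)
    acc = zero(f(first(a), first(a)))
    @inbounds for i in eachindex(a)
        acc += f(a[i], a[mod1(i + 1, length(a))])
    end
    return acc
end

adjacent_sum(+, [1, 2, 3])  # 12
```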
It looks like ShiftedArrays would be ideal for this: it offers a lazy version of circshift which doesn’t need to allocate any memory. On the CPU this is by far the quickest option I’ve found, but it’s not compatible with the GPU.
julia> sum(a + ShiftedArrays.circshift(a, -1))
ERROR: GPU compilation of kernel #broadcast_kernel#17(CUDA.CuKernelContext, CuDeviceVector{Int64, 1}, Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{1}, Tuple{Base.OneTo{Int64}}, typeof(+), Tuple{Base.Broadcast.Extruded{CuDeviceVector{Int64, 1}, Tuple{Bool}, Tuple{Int64}}, Base.Broadcast.Extruded{CircShiftedVector{Int64, CuArray{Int64, 1, CUDA.Mem.DeviceBuffer}}, Tuple{Bool}, Tuple{Int64}}}}, Int64) failed
KernelError: passing and using non-bitstype argument
Obviously I could keep multiple implementations and dispatch to whichever one is appropriate, but I’d like to have a single implementation if possible. Is there an obvious way to do this?