Correct implementation of CuArray's slicing operations

Oh wait, the problem is the memory copy that `view` performs in order to bounds-check the indices. So if you do `@inbounds view(data, idx) .+= 1`, that’s almost as fast as your custom kernel version.
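
For anyone who wants to try this, here is a rough sketch of the comparison. The array sizes, the index pattern, and the function names `add_one_checked!` and `add_one_unchecked!` are made up for illustration and are not from the code earlier in the thread. Whether the bounds check still costs a copy likely depends on your CUDA.jl version, so treat the timings as something to measure rather than a guarantee.

```julia
using CUDA

# Hypothetical data and indices, chosen only for illustration.
data = CUDA.zeros(Float32, 1_000_000)
idx  = CuArray(collect(1:10:length(data)))  # unique indices, so no write races

# Checked version: view validates idx against the bounds of data first.
function add_one_checked!(data, idx)
    view(data, idx) .+= 1
    return data
end

# Unchecked version: @inbounds lets view skip that validation.
# Only safe when every index is known to be within bounds.
function add_one_unchecked!(data, idx)
    @inbounds view(data, idx) .+= 1
    return data
end

# Warm-up calls so compilation time is not included in the timings.
add_one_checked!(data, idx)
add_one_unchecked!(data, idx)

CUDA.@time add_one_checked!(data, idx)
CUDA.@time add_one_unchecked!(data, idx)
```

Note that `@inbounds` removes the safety net entirely: an out-of-range entry in `idx` would then read or write invalid GPU memory instead of throwing an error, so it is only appropriate when the indices are already known to be valid.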