Increment elements of array by index

Given an array A, indices I and values v, I want to add elements of v to A at I. Something like this:

A = zeros(4)
I = [1, 3]
v = [1, 1]

# partial solution
A[I] .+= v

Now the hard part:

  1. Indices may repeat, e.g. I = [1, 3, 1]. In this case A[1] should be updated with values in both - v[1] and v[3].
  2. Code should work on CuArrays with allowscalar(false).

Best ideas I came with all operate on iterate over indices and thus don’t work on CuArrays. Is there a workaround?

1 Like

I am having a lot of trouble understanding what you mean here:

Could you clarify?

Perhaps you can show your version that iterates over indices that will make it easier to help find a solution that is GPU friendly. Your current ‘partial solution’ errors, so I don’t really get what you’re trying to do.

Further questions:

  • Does I need to be a mutable array? Can it be a CartesianIndex?

I’ve updated tge partial solution, now it should work. Iterative algorithm for 1D arrays would be:

for (i, x) in zip(I, v)
    A[i] += x

In other words, we iteratively take values from v and add them to elements of A specified by corresponding indices in I.

There are no restrictions on I.

Hm, I see now. When you say 'there are no restrictions on I, does that mean that I does not need to be a GPU array? How big do you anticipate I becoming?

Fully-GPU solution would be great, but in general I can convert I to CPU.

Let me give you some context. I’m trying to fix this function which is the derivative of getindex(A, I...). It works fine in a typical scenario where elements of I are all unique, e.g. when you have in your code something like:


the derivative of A in this case is:

dy = ...
dA = zeros(size(A))
dA[1:3] = dy

But the code breaks if indices repeat, e.g. in:

A[[1, 3, 1]]

Here the correct derivative would be:

dA = zeros(size(A))
dA[1] += dy[1]
dA[3] += dy[2]
dA[1] += dy[3]

Here I = [1, 3, 1]. In theory, it can be much larger, but if it’s not very optimal for, say, length(I) > 10_000, I think it’s fine :slight_smile:

Aha, I see! This is a rather awkard thing to to on the GPU, though it should be possible to do if we can phrase it the right way. I’m not sure I have a good answer.

I had hoped Tullio.jl would handle this, but it seems to just silently assume that I does not have repeated indices:

#+begin_src julia
using CUDA, Tullio, KernelAbstractions

let A = CUDA.zeros(4), I = cu([1,3,1]), v = cu([1,2,3])
    @tullio A[I[i]] += v[i]

: 4-element CuArray{Float32,1}:
:  1.0
:  0.0
:  2.0
:  0.0

@mcabbott Do you have any ideas here? Is this intended behaviour from Tullio?

That’s a bug, sorry. Or it was, hopefully fixed on master.

The macro does notice that the index i is (potentially) unsafe to break among different threads, and thus avoids that on the CPU. This wasn’t being passed to KernelAbstractions.

But I think the safe way will be slow, it’s got to work sequentially.

1 Like

Using Tullio#master I’m getting the following error from the code above:

julia> let A = CUDA.zeros(4), I = cu([1,3,1]), v = cu([1,2,3])
           @tullio A[I[i]] += v[i]
ERROR: AssertionError: length(__workgroupsize) <= length(ndrange)
 [1] partition at /home/user/.julia/packages/KernelAbstractions/jAutM/src/nditeration.jl:103 [inlined]
 [2] partition(::KernelAbstractions.Kernel{CUDADevice,KernelAbstractions.NDIteration.StaticSize{(256,)},KernelAbstractions.NDIteration.DynamicSize,var"#gpu_##🇨🇺#253#3"}, ::Tuple{}, ::Nothing) at /home/user/.julia/packages/KernelAbstractions/jAutM/src/KernelAbstractions.jl:385
 [3] launch_config(::KernelAbstractions.Kernel{CUDADevice,KernelAbstractions.NDIteration.StaticSize{(256,)},KernelAbstractions.NDIteration.DynamicSize,var"#gpu_##🇨🇺#253#3"}, ::Tuple{}, ::Nothing) at /home/user/.julia/packages/KernelAbstractions/jAutM/src/backends/cuda.jl:156
 [4] (::KernelAbstractions.Kernel{CUDADevice,KernelAbstractions.NDIteration.StaticSize{(256,)},KernelAbstractions.NDIteration.DynamicSize,var"#gpu_##🇨🇺#253#3"})(::CuArray{Float32,1}, ::Vararg{Any,N} where N; ndrange::Tuple{}, dependencies::KernelAbstractions.CudaEvent, workgroupsize::Nothing, progress::Function) at /home/user/.julia/packages/KernelAbstractions/jAutM/src/backends/cuda.jl:163
 [5] 𝒜𝒸𝓉! at /home/user/.julia/packages/Tullio/RAkkV/src/macro.jl:1166 [inlined]
 [6] 𝒜𝒸𝓉! at /home/user/.julia/packages/Tullio/RAkkV/src/macro.jl:1163 [inlined]
 [7] threader(::var"#𝒜𝒸𝓉!#1", ::Type{CuArray{T,1} where T}, ::CuArray{Float32,1}, ::Tuple{CuArray{Int64,1},CuArray{Int64,1}}, ::Tuple{}, ::Tuple{Base.OneTo{Int64}}, ::Function, ::Int64, ::Bool) at /home/user/.julia/packages/Tullio/RAkkV/src/eval.jl:86
 [8] top-level scope at /home/user/.julia/packages/Tullio/RAkkV/src/macro.jl:1002
 [9] top-level scope at REPL[2]:2

But on CPU it indeed does the correct thing.

Another approach I thought about was grouping indices and summing corresponding values before adding them to A, but it seems to boil down to the same issue.

That’s also no good, sorry. It’s trying to launch just one kernel (if I have the terminology straight) to do this serially, hence length(ndrange) == 1. Which KernelAbstractions allows on the CPU.

Any way to solve this seems quite serial. It’s possible that if you sorted the indices, you could then divide the work into safely disjoint sets?

It turns out Zygote has the same issue on CuArrays.

Eventually I adapted code from a nice library ScatterNNlib.jl which implements the operation in a pretty much straightforward way and achieves roughly the same performance as PyTorch implementation. Semantics of the function scatter_add!() there is a bit confusing, so here’s my wrapper for the specific case in this post:

function scatter_add2!(A::AbstractArray, v::AbstractArray, I...)
    # replace (:) with actual index in A
    I = [i == (:) ? (1:size(A, d)) : i for (d, i) in enumerate(I)]
    II = collect(Iterators.product(I...))
    if A isa CuArray
        II = cu(II)
    A_ = reshape(A, 1, size(A)...)
    v_ = reshape(v, 1, size(v)...)
    scatter_add!(A_, v_, II)
    return dropdims(A_, dims=1)

A = zeros(4);
I = [1, 3, 1];
v = [1, 1, 1];

scatter_add2!(A, v, I)
# ==> 4-element Array{Float64,1}:
# ==>  2.0
# ==>  0.0
# ==>  1.0
# ==>  0.0