How to "reduce" an array?


The wording is a bit bad in the title and I struggle to explain what I want to do properly, but basically what I want to do is;

Suppose I have an array I consisting of indices:

I = [1 1 1 2 2 2 3 3 3]

And an array of results, R:

R = [2 4 7 8 5 1 9 8 1]

Then I want to produce a sum of R based on the indices of I, such that a final result would be:

FR = [13 14 18]

Where 13 comes from the first 3 indices being 1 1 1 in I.

Hope someone got what I mean… :sweat_smile:

EDIT: And I want to do it using a broadcasting operation, no for loops

Kind regards

1 Like

I am not sure how practical this for your actual use case, but this could work:

sum(reshape(R, 3, :), dims=1)

This obviously only works if the sums are the same sizes and adjacent.

Thank you for your comment!

No that does not work for me, I could also have been:

I = [1 2 3 3 2 1 1 1 2]

Just wanted to provide a simple example since I struggled with explaining it :slight_smile:

Kind regards

I am not sure how to do this without a for loop, is there any reason you can’t use one?

julia> (unique(I) .== I)*R'
3×1 Matrix{Int64}:

…but a for-loop would be more efficient


Doing GPU programming, want to see if I can do it without making a kernel - cannot use scalar indexing

1 Like

Thank you for the suggestion!

I see why it would work, but as you say quite slow

Maybe something like this:

function sumind(a, ind)
    val = zeros(Int, maximum(a))
    for i in eachindex(ind)
        val[a[i]] += ind[i]  # fixed typo, thanks @rafael.guerra 
    return val

BTW, vectors are more fundamental and efficient than matrices. So I suggest making vectors

I = [1, 1, 1, 2, 2, 2, 3, 3, 3]
R = [2, 4, 7, 8, 5, 1, 9, 8, 1]

instead of 1xN matrices. It will probably not make a noticeable difference in this example, but it does help the compiler know what is going on.

1 Like

Typo: here R should be ind

1 Like

Thank you for all the suggestions guys!

I ended up writing my own kernel as:

I  = CuArray([1,1,1,2,2,2,3,3,3])
R  = CuArray([2,4,7,8,5,1,9,8,1])
FR = CuArray([0,0,0])

function gpu_add_simple!(y, x, I)
    index = threadIdx().x
    stride = blockDim().x
    for i = index:stride:length(I)
        @inbounds y[I[i]] +=  x[i]
    return nothing

@cuda gpu_add_simple!(FR,R,I)

3-element CuArray{Int64, 1, CUDA.Mem.DeviceBuffer}:

Inspired by some GPU Julia docs.

I am sure some with GPU knowledge could make the function even more functional, but this does what I want.

EDIT: And I had enabled “CUDA.allowscalar(false)”. I wrongly thought for loops would break this assumption.

Kind regards

I was thinking of something like this, but doesn’t this have a race condition for an arbitrary index array?

Seems like it does, results get wrong when using multiple threads/blocks - hoping someone more experienced can help out :slight_smile:

Yes to the race condition. This is NNlib.scatter(+, R, I), for which there’s a kernel which should be written so as to avoid it.

Thank you for the comment!

I just tried it out and it does work!

I have one question though, how come it does not work when R is a vector of static arrays instead of just numbers?

Shouldn’t I be able to add those together too?

And is it easy to learn how to modify my custom kernel to avoid the race condition?

Kind regards

It breaks for me when trying to use CuArray as index - but if I dont use CuArray the calculation becomes slow on the CPU

Kind regards

Without looking closely I’d assume the failure on a vector of SVector is a bug. Make an issue (with a minimal example, and code not a screenshot)… or better see if you can fix it, code is here and here.

Thanks done

Anyone who knows how to remove the race condition?

The code I am working on is being held back a lot by it being so slow only with 1 thread and block…

Kind regards

I think you can use CUDA.@atomic:

function gpu_add_scatter!(y, x, I)
    index = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    @inbounds if index<length(y)
        CUDA.@atomic y[I[index]] += x[index]

More in the source code of NNLibCUDA.jl: NNlibCUDA.jl/scatter.jl at master · FluxML/NNlibCUDA.jl · GitHub

Thank you for the suggestion!

I have not tested it, but ended up using the Flux library, which afaik refers to that package you link to, when CuArrays are input.

Instead of using StaticArrays, I had to split the invidiual coordinates out in arrays and then recombine to staticarrays again, pseudo code:

function GPUCalculateKernelGradient!(WgI,αD,Q,xᵢⱼ,H,I_,J_,N)
    WgL  = αD*5 .* (Q .- 2).^3 .* Q ./ (8H .* (Q * H .+ 1e-6)) .* xᵢⱼ

    WgIx = NNlib.scatter(+, getindex.(WgL,1), I_; dstsize = N) - NNlib.scatter(+, getindex.(WgL,1), J_; dstsize = N)
    WgIy = NNlib.scatter(+, getindex.(WgL,2), I_; dstsize = N) - NNlib.scatter(+, getindex.(WgL,2), J_; dstsize = N)
    WgIz = NNlib.scatter(+, getindex.(WgL,3), I_; dstsize = N) - NNlib.scatter(+, getindex.(WgL,3), J_; dstsize = N)
    WgI .= map((x,y,z)->SVector(x,y,z),WgIx,WgIy,WgIz)

    return WgL

Where WgL is a CuArray of SVector{3,Float64}

Just to help others out in the future, until StaticArrays can be used directly.

Kind regards