How to "reduce" an array?

Hello

The wording in the title is a bit off and I struggle to explain what I want to do properly, but basically what I want to do is:

Suppose I have an array I consisting of indices:

I = [1 1 1 2 2 2 3 3 3]

And an array of results, R:

R = [2 4 7 8 5 1 9 8 1]

Then I want to sum the entries of R grouped by the indices in I, so that the final result would be:

FR = [13 14 18]

Where 13 comes from summing the first three entries of R (2 + 4 + 7), since the first three indices in I are all 1.

Hope someone got what I mean… :sweat_smile:

EDIT: And I want to do it using a broadcasting operation, no for loops

Kind regards

I am not sure how practical this is for your actual use case, but this could work:

sum(reshape(R, 3, :), dims=1)

This obviously only works if the groups are the same size and adjacent.
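For the example above, a quick end-to-end check (plain Base Julia, nothing else assumed):

```julia
# groups of equal size 3, adjacent: reshape makes each group a column
R  = [2 4 7 8 5 1 9 8 1]              # 1×9 matrix, as in the example
FR = sum(reshape(R, 3, :), dims=1)    # sum each column
vec(FR)                               # → [13, 14, 18]
```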

Thank you for your comment!

No, that does not work for me. The array I could also have been:

I = [1 2 3 3 2 1 1 1 2]

Just wanted to provide a simple example since I struggled with explaining it :slight_smile:

Kind regards

I am not sure how to do this without a for loop; is there any reason you can’t use one?

julia> (unique(I) .== I)*R'
3×1 Matrix{Int64}:
 13
 14
 18

…but a for-loop would be more efficient
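For reference, the same one-hot idea with plain vectors (a sketch; it materializes a dense groups × length Bool matrix, which is why the for-loop wins on efficiency):

```julia
I = [1, 1, 1, 2, 2, 2, 3, 3, 3]
R = [2, 4, 7, 8, 5, 1, 9, 8, 1]
# unique(I) .== I' broadcasts to a 3×9 one-hot Bool matrix;
# the matrix-vector product then sums each group, no explicit loop
FR = (unique(I) .== I') * R   # → [13, 14, 18]
```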

I am doing GPU programming and want to see if I can do it without writing a kernel, since I cannot use scalar indexing.

Thank you for the suggestion!

I see why it would work, but as you say it is quite slow.

Maybe something like this:

function sumind(a, ind)
    val = zeros(Int, maximum(a))
    for i in eachindex(ind)
        val[a[i]] += ind[i]  # fixed typo, thanks @rafael.guerra 
    end
    return val
end 

BTW, vectors are more fundamental and efficient than matrices. So I suggest making vectors

I = [1, 1, 1, 2, 2, 2, 3, 3, 3]
R = [2, 4, 7, 8, 5, 1, 9, 8, 1]

instead of 1xN matrices. It will probably not make a noticeable difference in this example, but it does help the compiler know what is going on.

Typo: here R should be ind

Thank you for all the suggestions guys!

I ended up writing my own kernel as:

I  = CuArray([1,1,1,2,2,2,3,3,3])
R  = CuArray([2,4,7,8,5,1,9,8,1])
FR = CuArray([0,0,0])

function gpu_add_simple!(y, x, I)
    index = threadIdx().x
    stride = blockDim().x
    for i = index:stride:length(I)
        @inbounds y[I[i]] +=  x[i]
    end
    return nothing
end

@cuda gpu_add_simple!(FR,R,I)

FR
3-element CuArray{Int64, 1, CUDA.Mem.DeviceBuffer}:
 13
 14
 18

Inspired by some GPU Julia docs.

I am sure someone with GPU knowledge could improve the function further, but it does what I want.

EDIT: I had set CUDA.allowscalar(false); I wrongly thought for loops inside a kernel would violate that restriction.

Kind regards

I was thinking of something like this, but doesn’t this have a race condition for an arbitrary index array?

Seems like it does; the results come out wrong when using multiple threads/blocks. Hoping someone more experienced can help out :slight_smile:

Yes to the race condition. This operation is NNlib.scatter(+, R, I), for which there is a GPU kernel written so as to avoid it.
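A minimal sketch of that call, assuming the NNlib package is installed (afaik the GPU path is taken for CuArray inputs when NNlibCUDA is loaded):

```julia
using NNlib

I = [1, 1, 1, 2, 2, 2, 3, 3, 3]
R = [2, 4, 7, 8, 5, 1, 9, 8, 1]
# scatter(+, src, idx) accumulates src entries into bins selected by idx
FR = NNlib.scatter(+, R, I)   # → [13, 14, 18]
```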

Thank you for the comment!

I just tried it out and it does work!

I have one question though, how come it does not work when R is a vector of static arrays instead of just numbers?

Shouldn’t I be able to add those together too?

And is it easy to learn how to modify my custom kernel to avoid the race condition?

Kind regards

It breaks for me when trying to use a CuArray as the index, but if I don't use CuArray the calculation becomes slow on the CPU.

Kind regards

Without looking closely I’d assume the failure on a vector of SVector is a bug. Make an issue (with a minimal example, and code not a screenshot)… or better see if you can fix it, code is here and here.

Thanks, done.

Does anyone know how to remove the race condition?

The code I am working on is held back a lot by being limited to a single thread and block…

Kind regards

I think you can use CUDA.@atomic:

function gpu_add_scatter!(y, x, I)
    index = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    @inbounds if index <= length(I)
        # atomic update: concurrent adds to the same y[I[index]] cannot be lost
        CUDA.@atomic y[I[index]] += x[index]
    end
    nothing
end

More in the source code of NNlibCUDA.jl: NNlibCUDA.jl/scatter.jl at master · FluxML/NNlibCUDA.jl · GitHub
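As a CPU analogue of why the atomic update fixes the race, here is a sketch using Base.Threads rather than CUDA (function name is illustrative):

```julia
using Base.Threads

# Each task adds into a shared bin through atomic_add!, so two tasks
# updating the same bin cannot lose each other's contribution.
function atomic_sumind(vals, idx, nbins)
    acc = [Atomic{Int}(0) for _ in 1:nbins]
    @threads for i in eachindex(idx)
        atomic_add!(acc[idx[i]], vals[i])
    end
    return [a[] for a in acc]
end

atomic_sumind([2, 4, 7, 8, 5, 1, 9, 8, 1], [1, 1, 1, 2, 2, 2, 3, 3, 3], 3)
# → [13, 14, 18]
```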

Thank you for the suggestion!

I have not tested it, but I ended up using the Flux library, which afaik dispatches to the package you link to when CuArrays are input.

Instead of using StaticArrays directly, I had to split the individual coordinates out into plain arrays and then recombine them into static arrays again. Pseudo code:

function GPUCalculateKernelGradient!(WgI,αD,Q,xᵢⱼ,H,I_,J_,N)
    WgL  = αD*5 .* (Q .- 2).^3 .* Q ./ (8H .* (Q * H .+ 1e-6)) .* xᵢⱼ

    WgIx = NNlib.scatter(+, getindex.(WgL,1), I_; dstsize = N) - NNlib.scatter(+, getindex.(WgL,1), J_; dstsize = N)
    WgIy = NNlib.scatter(+, getindex.(WgL,2), I_; dstsize = N) - NNlib.scatter(+, getindex.(WgL,2), J_; dstsize = N)
    WgIz = NNlib.scatter(+, getindex.(WgL,3), I_; dstsize = N) - NNlib.scatter(+, getindex.(WgL,3), J_; dstsize = N)
    WgI .= map((x,y,z)->SVector(x,y,z),WgIx,WgIy,WgIz)

    return WgL
end

Where WgL is a CuArray of SVector{3,Float64}

Just to help others out in the future, until StaticArrays can be used directly.
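The split/recombine trick in isolation, for anyone hitting the same limitation (a sketch assuming the StaticArrays package is installed; variable names are illustrative):

```julia
using StaticArrays

v  = [SVector(1.0, 2.0, 3.0), SVector(4.0, 5.0, 6.0)]
xs = getindex.(v, 1)            # component arrays: [1.0, 4.0], etc.
ys = getindex.(v, 2)
zs = getindex.(v, 3)
# ...scatter/operate on the plain component arrays here...
w  = map(SVector, xs, ys, zs)   # recombine into SVectors
# w == v
```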

Kind regards