Inconsistency in `accumulate` between `Array` and `CuArray.`

Hi all.
I’m performing the following computation using GPUArrays:

julia> using GPUArrays

julia> src = CuArray(rand(0:9, 10))
10-element CuArray{Int64, 1, CUDA.DeviceMemory}:
 3
 4
 3
 7
 3
 9
 7
 2
 9
 6

julia> dst = similar(src);

julia> op = (acc, x) -> acc + (1 - ((x >> 0) & 0x1))
#211 (generic function with 1 method)

julia> GPUArrays.neutral_element(::typeof(op), T) = one(T)

julia> accumulate!(op, dst, src; dims=1)
10-element CuArray{Int64, 1, CUDA.DeviceMemory}:
 1
 2
 2
 2
 2
 2
 2
 3
 3
 2

julia> src = collect(src)
10-element Vector{Int64}:
 3
 4
 3
 7
 3
 9
 7
 2
 9
 6

julia> dst = collect(dst);

julia> accumulate!(op, dst, src; init=0)
10-element Vector{Int64}:
 0
 1
 1
 1
 1
 1
 1
 2
 2
 3

I noticed that the results differ between CPU and GPU execution.
Why is that the case? Also, could someone explain what the “neutral element” means in this context?

My guess is that the issue is that op is non-associative (and non-commutative),

julia> op(10, op(20, 30))
10

julia> op(op(10, 20), 30)
12

julia> op(2, 3)
2

julia> op(3, 2)
4

and the GPU version of accumulate! will change the order of operations to achieve parallelisation. E.g. it will compute, using infix notation, ((a op b) op c) op d as (a op b) op (c op d).


The neutral element for an operation is just some value n such that op(x, n) == x == op(n, x) for all x. In this case n = 1 indeed is appropriate. The parallel algorithm will presumably always (to avoid branching) combine two elements x and y, even when no appropriate y exists. In that case it will use y = n, so that the outcome op(x, y) == x is as if you didn’t apply op at all.

1 Like

Thank you for your clear answer.