What is a neutral element (for accumulate on gpu)?

HenriDeh · April 8, 2021, 3:37pm

Hello,

So I’m trying this simple accumulate function on a CuArray:

julia> using CUDA
julia> d_d = cu(rand(10,20));
julia> accumulate((a,d) -> d + 0.99*0.95*a, d_d, dims = 2, init = 0f0)
ERROR: GPUArrays.jl needs to know the neutral element for your operator `#19`.
Please pass it as an explicit argument to (if possible), or register it
globally your operator by defining `GPUArrays.neutral_element(::typeof(#19), T)`.
Stacktrace:
 [1] error(s::String)
   @ Base .\error.jl:33
 [2] neutral_element(op::Function, T::Type)
   @ GPUArrays ~\.julia\packages\GPUArrays\bjw3g\src\host\mapreduce.jl:13
 [3] _accumulate!
   @ ~\.julia\packages\CUDA\qEV3Y\src\accumulate.jl:205 [inlined]
 [4] #accumulate!#738
   @ .\accumulate.jl:361 [inlined]
 [5] accumulate(op::Function, A::CuArray{Float32, 2}; dims::Int64, kw::Base.Iterators.Pairs{Symbol, Float32, Tuple{Symbol}, NamedTuple{(:init,), Tuple{Float32}}})
   @ Base .\accumulate.jl:302
 [6] top-level scope
   @ REPL[24]:1

First I thought the error was asking for a init value, which I provided, but apparently that’s not it…
So, what’s a neutral element then ?

maleadt · April 8, 2021, 4:05pm

The CUDA implementation performs the accumulation on more threads than the array has elements, so it needs to know which neutral element it can use to perform a no-op. Not very clean, especially here because Base.scan doesn’t allow passing in the neutral keyword that CUDA.accumulate! understands. But you can use the proposed alternative API:

julia> f = (a,d) -> d + 0.99*0.95*a
#11 (generic function with 1 method)

julia> GPUArrays.neutral_element(::typeof(f), ::Type{T}) where T = zero(T)

julia> accumulate(f, d_d, dims = 2, init = 0f0)
10×20 CuArray{Float64, 2}:
...

HenriDeh · April 8, 2021, 4:21pm

Thanks, it works but somehow the computation is wrong when compared to a cpu equivalent:

julia> f(a,d) = d + 0.95f0*0.99f0*a
f (generic function with 1 method)

julia> GPUArrays.neutral_element(::typeof(f), ::Type{T}) where T = zero(T)

julia> d = rand(Float32, 1, 10)
1×10 Matrix{Float32}:
 0.200713  0.652811  0.242768  0.661561  0.560284  0.929563  0.96326  0.864514  0.857594  0.663797

julia> d_d = cu(d)
1×10 CuArray{Float32, 2}:
 0.200713  0.652811  0.242768  0.661561  0.560284  0.929563  0.96326  0.864514  0.857594  0.663797

julia> accumulate(f, d, dims = 2)
1×10 Matrix{Float32}:
 0.200713  0.841581  1.03428  1.6343  2.09734  2.90211  3.6927  4.33749  4.93701  5.30705

julia> accumulate(f, d_d, dims = 2)
1×10 CuArray{Float32, 2}:
 0.200713  0.83035  0.987181  1.62071  2.04754  2.91242  3.73886  4.49216  5.03409  5.59887

You can see there’s a growing error build up during the accumulation. Is this due to GPU imprecision ?

maleadt · April 8, 2021, 4:36pm

That’s a pretty significant error, please file an issue on the CUDA.jl repository for that.

tkf · April 8, 2021, 7:49pm

I don’t think this is a bug. Parallel accumulate needs an associative operator but f is not associative:

julia> f = (a,d) -> d + 0.99*0.95*a;

julia> all(f(f(x, y), z) ≈ f(x, f(y, z)) for x in rand(10), y in rand(10), z in rand(10))
false

I don’t think you can compute accumulate((a,d) -> d + 0.99*0.95*a, d_d, dims = 2) in parallel. It’s possible to do it in GPU if you want to do cumsum!(map(a -> 0.99*0.95*a, d_d)).

HenriDeh · April 9, 2021, 9:41am

Thanks for spotting that, I didn’t know it had to be associative but it makes sense that it should be.

Thank you for the work around but that would not lead to the same operation, it’s a recursive discounted sum so it must start be the first element and work forward. So yes, that cannot be computed in parallel, I was counting on the GPU to parallelize the same accumulate on several rows.

So I’ll have to use a for loop:

julia> using CUDA

julia> d = rand(1,10)
1×10 Matrix{Float64}:
 0.845519  0.408208  0.243472  0.289811  0.943654  0.292486  0.87521  0.675013  0.682746  0.768655

julia> d_d = cu(d)
1×10 CuArray{Float32, 2}:
 0.845519  0.408208  0.243472  0.289811  0.943654  0.292486  0.87521  0.675013  0.682746  0.768655

julia> f(a,d) = d .+ 0.95f0.*0.99f0.*a
f (generic function with 1 method)

julia> accumulate(f, d, dims = 2)
1×10 Matrix{Float64}:
 0.845519  1.20342  1.37529  1.58327  2.43272  2.58046  3.30213  3.78067  4.23846  4.75493

julia> for t in 2:size(d_d)[2]
       d_d[:,t] = f(d_d[:,t-1], d_d[:,t])
       end

julia> d_d
1×10 CuArray{Float32, 2}:
 0.845519  1.20342  1.37529  1.58327  2.43272  2.58046  3.30213  3.78067  4.23846  4.75493

I think that’s as efficient as I can make it.

tkf · April 9, 2021, 11:11am

Actually, since you can rewrite

y_n = \alpha y_{n-1} + x_n

as

y_n = \alpha^n (y_0 + \sum_{k=1}^n \alpha^{-k} x_k),

you can compute it with

function linrec(a, xs, y0)
    ns = eachindex(xs)
    a .^ ns .* (y0 .+ cumsum(a .^ .- ns .* xs))
end

Note that everything in linrec is parallelizable. Just as a sanity check:

julia> xs = rand(3)
3-element Vector{Float64}:
 0.3103473087262907
 0.08905740732693901
 0.39324464076213617

julia> y0 = rand()
0.5570557429416012

julia> accumulate((a, d) -> d + 0.99*0.95*a, xs; init = y0)
3-element Vector{Float64}:
 0.8342582349628667
 0.873677277309515
 1.214938120071735

julia> linrec(0.99*0.95, xs, y0)
3-element Vector{Float64}:
 0.8342582349628667
 0.8736772773095152
 1.214938120071735

a .^ ns and a .^ .- ns are a bit worrisome expressions, though. I don’t know if this is numerically well-behaving algorithm.

(Edit: Obviously, people have already worked on this ages ago. A quick searching finds https://www.sciencedirect.com/science/article/pii/S0377042702005654)

HenriDeh · April 9, 2021, 1:12pm

Nice ! Thank you for going out of your way to find me this.

Looks like I need to learn some math. Do you happen to have a nice learning reference for me to check out on that subject, since it looks like you knew this was doable ? If it’s not too much to ask.

tkf · April 10, 2021, 1:26am

Do you mean the math (“analytical solution”) part or the parallelism part? If you are talking about the math part, maybe this wikipedia section in Recurrence relation is a nice quick overview of it. I learned this in the context of dynamical systems but I’m not sure if that’s the best way to learn this.

For the parallelization part, I don’t know if there is a good tutorial or textbook about it. I remember it was mentioned in some papers as a generalization of prefix sum (your comments mentioning “linear recurrence” helped me thinking about this) so I just stared at the equation. I’ve been looking for a good textbook for this kind of thing but I have to just look into the papers usually…

Topic		Replies	Views
Inconsistency in `accumulate` between `Array` and `CuArray.` General Usage gpu , gpuarrays , cuda	2	66	March 26, 2025
Bug in CUDA, CuArray, or something I just don't know? GPU	3	264	December 25, 2022
CUDA.jl - Sub-Vector Indexing Problem Inside CUDA Kernel GPU cuda , error , cuarrays , error-message , staticarrays	2	1230	March 28, 2022
Parallel calculations with CUDA General Usage gpu , cuda	0	568	December 12, 2019
Zygote + CUDA: scalar getindex with custom activation function using multiplication GPU flux	3	713	September 3, 2020

What is a neutral element (for accumulate on gpu)?

Related topics