What is a neutral element (for accumulate on gpu)?

Hello,

So I’m trying this simple accumulate function on a CuArray:

julia> using CUDA
julia> d_d = cu(rand(10,20));
julia> accumulate((a,d) -> d + 0.99*0.95*a, d_d, dims = 2, init = 0f0)
ERROR: GPUArrays.jl needs to know the neutral element for your operator `#19`.
Please pass it as an explicit argument to (if possible), or register it
globally your operator by defining `GPUArrays.neutral_element(::typeof(#19), T)`.
Stacktrace:
 [1] error(s::String)
   @ Base .\error.jl:33
 [2] neutral_element(op::Function, T::Type)
   @ GPUArrays ~\.julia\packages\GPUArrays\bjw3g\src\host\mapreduce.jl:13
 [3] _accumulate!
   @ ~\.julia\packages\CUDA\qEV3Y\src\accumulate.jl:205 [inlined]
 [4] #accumulate!#738
   @ .\accumulate.jl:361 [inlined]
 [5] accumulate(op::Function, A::CuArray{Float32, 2}; dims::Int64, kw::Base.Iterators.Pairs{Symbol, Float32, Tuple{Symbol}, NamedTuple{(:init,), Tuple{Float32}}})
   @ Base .\accumulate.jl:302
 [6] top-level scope
   @ REPL[24]:1

First I thought the error was asking for an init value, which I provided, but apparently that's not it…
So, what's a neutral element, then?


The CUDA implementation performs the accumulation on more threads than the array has elements, so it needs to know which neutral element it can use to perform a no-op. Not very clean, especially here because Base's accumulate interface doesn't allow passing in the neutral keyword that CUDA.accumulate! understands. But you can use the proposed alternative API:

julia> f = (a,d) -> d + 0.99*0.95*a
#11 (generic function with 1 method)

julia> import CUDA: GPUArrays   # make the GPUArrays module name available

julia> GPUArrays.neutral_element(::typeof(f), ::Type{T}) where T = zero(T)

julia> accumulate(f, d_d, dims = 2, init = 0f0)
10×20 CuArray{Float64, 2}:
...
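
As a sanity check that zero(T) is indeed a neutral element here: f(a, d) = d + 0.99*0.95*a keeps the running value in its first argument, so feeding it zero must return the new input unchanged. An illustrative snippet (mine, not part of the original session):

julia> f(0f0, 0.5f0) == 0.5f0   # accumulating from the neutral element is a no-op
true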

Thanks, it works, but somehow the computation is wrong when compared to a CPU equivalent:

julia> using CUDA; import CUDA: GPUArrays

julia> f(a,d) = d + 0.95f0*0.99f0*a
f (generic function with 1 method)

julia> GPUArrays.neutral_element(::typeof(f), ::Type{T}) where T = zero(T)

julia> d = rand(Float32, 1, 10)
1×10 Matrix{Float32}:
 0.200713  0.652811  0.242768  0.661561  0.560284  0.929563  0.96326  0.864514  0.857594  0.663797

julia> d_d = cu(d)
1×10 CuArray{Float32, 2}:
 0.200713  0.652811  0.242768  0.661561  0.560284  0.929563  0.96326  0.864514  0.857594  0.663797

julia> accumulate(f, d, dims = 2)
1×10 Matrix{Float32}:
 0.200713  0.841581  1.03428  1.6343  2.09734  2.90211  3.6927  4.33749  4.93701  5.30705

julia> accumulate(f, d_d, dims = 2)
1×10 CuArray{Float32, 2}:
 0.200713  0.83035  0.987181  1.62071  2.04754  2.91242  3.73886  4.49216  5.03409  5.59887

You can see there's a growing error building up during the accumulation. Is this due to GPU imprecision?

That's a pretty significant error; please file an issue on the CUDA.jl repository for that.

I don't think this is a bug. Parallel accumulate needs an associative operator, but f is not associative:

julia> f = (a,d) -> d + 0.99*0.95*a;

julia> all(f(f(x, y), z) ≈ f(x, f(y, z)) for x in rand(10), y in rand(10), z in rand(10))
false

I don't think you can compute accumulate((a,d) -> d + 0.99*0.95*a, d_d, dims = 2) in parallel. It is possible to do it on the GPU if what you want is cumsum(map(a -> 0.99*0.95*a, d_d), dims = 2).
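
To see the failure concretely, compare the two groupings on a single triple (an illustrative example of mine; the values in the comment are rounded):

julia> f = (a,d) -> d + 0.99*0.95*a;

julia> f(f(1.0, 2.0), 3.0) ≈ f(1.0, f(2.0, 3.0))   # ≈ 5.7655 vs ≈ 5.8215
false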


Thanks for spotting that. I didn't know the operator had to be associative, but it makes sense that it should be.

Thank you for the workaround, but that would not be the same operation: it's a recursive discounted sum, so it must start at the first element and work forward. So yes, that cannot be computed in parallel; I was counting on the GPU to parallelize the same accumulate across several rows.

So I’ll have to use a for loop:

julia> using CUDA

julia> d = rand(1,10)
1×10 Matrix{Float64}:
 0.845519  0.408208  0.243472  0.289811  0.943654  0.292486  0.87521  0.675013  0.682746  0.768655

julia> d_d = cu(d)
1×10 CuArray{Float32, 2}:
 0.845519  0.408208  0.243472  0.289811  0.943654  0.292486  0.87521  0.675013  0.682746  0.768655

julia> f(a,d) = d .+ 0.95f0.*0.99f0.*a
f (generic function with 1 method)

julia> accumulate(f, d, dims = 2)
1×10 Matrix{Float64}:
 0.845519  1.20342  1.37529  1.58327  2.43272  2.58046  3.30213  3.78067  4.23846  4.75493

julia> for t in 2:size(d_d)[2]
       d_d[:,t] = f(d_d[:,t-1], d_d[:,t])
       end

julia> d_d
1×10 CuArray{Float32, 2}:
 0.845519  1.20342  1.37529  1.58327  2.43272  2.58046  3.30213  3.78067  4.23846  4.75493

I think that’s as efficient as I can make it.
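
A possible refinement (my own sketch, not from the original session; discounted_scan! and its default γ are made-up names): the same sequential scan as a function, updating each column in place with @views so no temporary column arrays are allocated per step:

using CUDA

function discounted_scan!(A, γ = 0.95f0 * 0.99f0)
    for t in 2:size(A, 2)
        # one fused broadcast kernel per time step; all rows update in parallel
        @views A[:, t] .+= γ .* A[:, t - 1]
    end
    return A
end

julia> discounted_scan!(cu(d));   # same arithmetic as the loop above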

Actually, since you can rewrite

y_n = \alpha y_{n-1} + x_n

as

y_n = \alpha^n (y_0 + \sum_{k=1}^n \alpha^{-k} x_k)

(divide the recurrence by \alpha^n: z_n = y_n / \alpha^n satisfies z_n = z_{n-1} + \alpha^{-n} x_n, a plain prefix sum), you can compute it with

function linrec(a, xs, y0)
    ns = eachindex(xs)  # n = 1, 2, ..., length(xs)
    # a^n * (y0 + cumulative sum of a^-k * x_k); note .^ binds tighter than .*
    a .^ ns .* (y0 .+ cumsum(a .^ .- ns .* xs))
end

Note that everything in linrec is parallelizable. Just as a sanity check:

julia> xs = rand(3)
3-element Vector{Float64}:
 0.3103473087262907
 0.08905740732693901
 0.39324464076213617

julia> y0 = rand()
0.5570557429416012

julia> accumulate((a, d) -> d + 0.99*0.95*a, xs; init = y0)
3-element Vector{Float64}:
 0.8342582349628667
 0.873677277309515
 1.214938120071735

julia> linrec(0.99*0.95, xs, y0)
3-element Vector{Float64}:
 0.8342582349628667
 0.8736772773095152
 1.214938120071735
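
By the way (an untested sketch on my part): linrec only uses broadcasting and cumsum, so the same definition should in principle run on a CuArray as well:

julia> xs_d = cu(rand(Float32, 1000));

julia> linrec(0.99f0 * 0.95f0, xs_d, 0f0);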

a .^ ns and a .^ .- ns are a bit worrisome expressions, though; I don't know whether this is a numerically well-behaved algorithm.
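
To illustrate the worry with a made-up example: for a < 1 the weights a^-n grow exponentially with n, so a long enough sequence overflows even Float64:

julia> a = 0.99 * 0.95;

julia> isfinite(a^-100)      # a few hundred, still fine
true

julia> isfinite(a^-12_000)   # a^-n grows like e^(0.061n) and eventually hits Inf
false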

(Edit: obviously, people have already worked on this ages ago. A quick search finds https://www.sciencedirect.com/science/article/pii/S0377042702005654)


Nice! Thank you for going out of your way to find this for me.

Looks like I need to learn some math. Do you happen to have a nice learning reference on that subject for me to check out, since it looks like you knew this was doable? If it's not too much to ask.

Do you mean the math ("analytical solution") part or the parallelism part? If you are talking about the math part, maybe this Wikipedia section in Recurrence relation is a nice quick overview. I learned this in the context of dynamical systems, but I'm not sure that's the best way to learn it.

For the parallelization part, I don't know if there is a good tutorial or textbook about it. I remember it was mentioned in some papers as a generalization of prefix sum (your comments mentioning "linear recurrence" helped me think about this), so I just stared at the equation. I've been looking for a good textbook for this kind of thing, but usually I just have to dig into the papers…