The CUDA implementation performs the accumulation on more threads than the array has elements, so it needs to know which neutral element it can use to perform a no-op. Not very clean, especially here because Base.scan doesn’t allow passing in the neutral keyword that CUDA.accumulate! understands. But you can use the proposed alternative API:

julia> f = (a,d) -> d + 0.99*0.95*a
#11 (generic function with 1 method)
julia> GPUArrays.neutral_element(::typeof(f), ::Type{T}) where T = zero(T)
julia> accumulate(f, d_d, dims = 2, init = 0f0)
10×20 CuArray{Float64, 2}:
...

I don’t think this is a bug. Parallel accumulate needs an associative operator but f is not associative:

julia> f = (a,d) -> d + 0.99*0.95*a;
julia> all(f(f(x, y), z) ≈ f(x, f(y, z)) for x in rand(10), y in rand(10), z in rand(10))
false

I don’t think you can compute accumulate((a,d) -> d + 0.99*0.95*a, d_d, dims = 2) in parallel. It’s possible to do it in GPU if you want to do cumsum!(map(a -> 0.99*0.95*a, d_d)).

Thanks for spotting that, I didn’t know it had to be associative but it makes sense that it should be.

Thank you for the work around but that would not lead to the same operation, it’s a recursive discounted sum so it must start be the first element and work forward. So yes, that cannot be computed in parallel, I was counting on the GPU to parallelize the same accumulate on several rows.

So I’ll have to use a for loop:

julia> using CUDA
julia> d = rand(1,10)
1×10 Matrix{Float64}:
0.845519 0.408208 0.243472 0.289811 0.943654 0.292486 0.87521 0.675013 0.682746 0.768655
julia> d_d = cu(d)
1×10 CuArray{Float32, 2}:
0.845519 0.408208 0.243472 0.289811 0.943654 0.292486 0.87521 0.675013 0.682746 0.768655
julia> f(a,d) = d .+ 0.95f0.*0.99f0.*a
f (generic function with 1 method)
julia> accumulate(f, d, dims = 2)
1×10 Matrix{Float64}:
0.845519 1.20342 1.37529 1.58327 2.43272 2.58046 3.30213 3.78067 4.23846 4.75493
julia> for t in 2:size(d_d)[2]
d_d[:,t] = f(d_d[:,t-1], d_d[:,t])
end
julia> d_d
1×10 CuArray{Float32, 2}:
0.845519 1.20342 1.37529 1.58327 2.43272 2.58046 3.30213 3.78067 4.23846 4.75493

Nice ! Thank you for going out of your way to find me this.

Looks like I need to learn some math. Do you happen to have a nice learning reference for me to check out on that subject, since it looks like you knew this was doable ? If it’s not too much to ask.

Do you mean the math (“analytical solution”) part or the parallelism part? If you are talking about the math part, maybe this wikipedia section in Recurrence relation is a nice quick overview of it. I learned this in the context of dynamical systems but I’m not sure if that’s the best way to learn this.

For the parallelization part, I don’t know if there is a good tutorial or textbook about it. I remember it was mentioned in some papers as a generalization of prefix sum (your comments mentioning “linear recurrence” helped me thinking about this) so I just stared at the equation. I’ve been looking for a good textbook for this kind of thing but I have to just look into the papers usually…