# Unrolling of loops with operations on CuArrays

I have a loop where at each iteration I perform some vectorized operation on a CuArray. For example:

```julia
using CUDA

N = 4
a = ones(N)

u = CUDA.zeros(1000)

@. u = a[1]
for i = 2:N
    @. u += a[i]
end
```


As far as I understand, in this case CUDA launches a separate kernel for each loop iteration, which causes increased overhead. Is there any way to unroll the loop so that it results in a single CUDA kernel? Like here:

```julia
@. u = a[1] + a[2] + a[3] + a[4]
```


Use reduce?

Do you mean this?

```julia
u .= reduce(+, a)
```
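A CPU-runnable sketch of that idea (illustrative only; the same assignment works when `u` is a CuArray, where it compiles to a single fused kernel):

```julia
a = ones(4)
u = zeros(1000)

# reduce(+, a) is evaluated once on the host and yields a scalar,
# so the broadcast writes that scalar into every element of u
# in one pass.
u .= reduce(+, a)
```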


Then what about the following situation?

```julia
using CUDA

N = 4
a = ones(N)

x = CUDA.ones(1000)
u = CUDA.zeros(1000)

for i = 1:N
    @. u += a[i] * x
end
```


mapreduce
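For example, a minimal sketch of the mapreduce idea (shown with plain Arrays so it runs anywhere; the same expression works when `x` and `u` are CuArrays):

```julia
a = ones(4)      # small coefficient vector on the host
x = ones(1000)
u = zeros(1000)

# Map each coefficient a[i] to the scaled vector a[i] * x,
# then reduce the resulting vectors elementwise with +.
u .= mapreduce(ai -> ai * x, +, a)
```

Note that each `ai * x` allocates an intermediate array, which is why this can end up slower than the fused broadcast loop.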

It’s probably also possible to express this more neatly with Tullio or another index-notation package.

It seems that both the mapreduce and the Tullio approaches are slower than the original loop:

```julia
using BenchmarkTools
using CUDA
using Tullio

CUDA.allowscalar(false)

function loop(u, a, x)
    @. u = a[1] * x
    for i = 2:length(a)
        @. u += a[i] * x
    end
    return nothing
end

function loop_unrolled(u, a, x)
    @. u = a[1] * x + a[2] * x + a[3] * x + a[4] * x
    return nothing
end

function loop_mapreduce(u, a, x)
    u .= mapreduce(y -> y * x, +, a)
    return nothing
end

function loop_tullio(u, a, x)
    @. u = 0
    @tullio u += a[i] * x
    return nothing
end

N = 4
a = ones(N)
x = CUDA.ones(1000)
u = CUDA.zeros(1000)

@btime CUDA.@sync loop($u, $a, $x)
@btime CUDA.@sync loop_unrolled($u, $a, $x)
@btime CUDA.@sync loop_mapreduce($u, $a, $x)
@btime CUDA.@sync loop_tullio($u, $a, $x)
```

```
21.863 μs (105 allocations: 5.73 KiB)
10.349 μs (7 allocations: 496 bytes)
42.210 μs (219 allocations: 12.31 KiB)
69.036 μs (242 allocations: 25.36 KiB)
```