# Unrolling of loops with operations on CuArrays

I have a loop where at each iteration I perform some vectorized operation on a CuArray. For example:

```julia
using CUDA

N = 4
a = ones(N)

u = CUDA.zeros(1000)

@. u = a[1]
for i = 2:N
    @. u += a[i]
end
```


As far as I understand, in this case CUDA launches a separate kernel for each loop iteration, which causes increased overhead. Is there any way to unroll the loop so that it results in a single CUDA kernel? Like here:

```julia
@. u = a[1] + a[2] + a[3] + a[4]
```


Use reduce?

Do you mean this?

```julia
u .= reduce(+, a)
```
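A CPU-runnable sketch of that idea (illustrative only; the same assignment works when `u` is a CuArray, where it compiles to a single fused kernel):

```julia
a = ones(4)
u = zeros(1000)

# reduce(+, a) is evaluated once on the host and yields a scalar,
# so the broadcast writes that scalar into every element of u
# in one pass.
u .= reduce(+, a)
```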


Then what about the following situation?

```julia
using CUDA

N = 4
a = ones(N)

x = CUDA.ones(1000)
u = CUDA.zeros(1000)

for i = 1:N
    @. u += a[i] * x
end
```


mapreduce
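For example, a minimal sketch of the mapreduce idea (shown with plain Arrays so it runs anywhere; the same expression works when `x` and `u` are CuArrays):

```julia
a = ones(4)      # small coefficient vector on the host
x = ones(1000)
u = zeros(1000)

# Map each coefficient a[i] to the scaled vector a[i] * x,
# then reduce the resulting vectors elementwise with +.
u .= mapreduce(ai -> ai * x, +, a)
```

Note that each `ai * x` allocates an intermediate array, which is why this can end up slower than the fused broadcast loop.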

It’s probably also possible to express this more neatly with Tullio or another index-notation package.

It seems that both the mapreduce and the Tullio approaches are slower than the original loop:

```julia
using BenchmarkTools
using CUDA
using Tullio

CUDA.allowscalar(false)

function loop(u, a, x)
    @. u = a[1] * x
    for i = 2:length(a)
        @. u += a[i] * x
    end
    return nothing
end

function loop_unrolled(u, a, x)
    @. u = a[1] * x + a[2] * x + a[3] * x + a[4] * x
    return nothing
end

function loop_mapreduce(u, a, x)
    u .= mapreduce(y -> y * x, +, a)
    return nothing
end

function loop_tullio(u, a, x)
    @. u = 0
    @tullio u += a[i] * x
    return nothing
end

N = 4
a = ones(N)
x = CUDA.ones(1000)
u = CUDA.zeros(1000)

@btime CUDA.@sync loop($u, $a, $x)
@btime CUDA.@sync loop_unrolled($u, $a, $x)
@btime CUDA.@sync loop_mapreduce($u, $a, $x)
@btime CUDA.@sync loop_tullio($u, $a, $x)
```

```
21.863 μs (105 allocations: 5.73 KiB)
10.349 μs (7 allocations: 496 bytes)
42.210 μs (219 allocations: 12.31 KiB)
69.036 μs (242 allocations: 25.36 KiB)
```