I have a loop where at each iteration I do some vectorized operation on a CuArray. For example:
using CUDA

N = 4
a = ones(N)
u = CUDA.zeros(1000)

@. u = a[1]
for i in 2:N
    @. u += a[i]
end
As far as I understand, CUDA.jl launches a separate kernel on each loop iteration, which adds overhead. Is there a way to unroll the loop so that everything fuses into a single CUDA kernel? Like here:
@. u = a[1] + a[2] + a[3] + a[4]
Do you mean this?

u .= reduce(+, a)
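Since a is a CPU array here, the reduction runs on the host and produces a scalar, so only one broadcast kernel is launched:

s = reduce(+, a)   # plain host-side sum, same as sum(a)
u .= s             # a single fused broadcast kernel on the GPU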
Then what about the following situation?
using CUDA

N = 4
a = ones(N)
x = CUDA.ones(1000)
u = CUDA.zeros(1000)

for i in 1:N
    @. u += a[i] * x
end
mapreduce
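For example, a sketch of that suggestion (note that each y * x allocates a temporary CuArray, so this still launches several kernels):

u .= mapreduce(y -> y * x, +, a)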
It’s probably also possible to express this more neatly with Tullio or another index-notation package.
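For instance, a sketch with Tullio's index notation, where the index i appears only on the right-hand side and is therefore summed over (depending on the Tullio version, the GPU path may also require loading KernelAbstractions):

using Tullio
@tullio u[j] = a[i] * x[j]   # sums over i into the existing array u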
It seems that both the mapreduce and Tullio approaches are slower than the original loop:
using BenchmarkTools
using CUDA
using Tullio
CUDA.allowscalar(false)
function loop(u, a, x)
    @. u = a[1] * x
    for i in 2:length(a)   # one broadcast kernel per iteration
        @. u += a[i] * x
    end
    return nothing
end
function loop_unrolled(u, a, x)
    @. u = a[1] * x + a[2] * x + a[3] * x + a[4] * x   # single fused kernel
    return nothing
end
function loop_mapreduce(u, a, x)
    u .= mapreduce(y -> y * x, +, a)   # each y * x allocates a temporary CuArray
    return nothing
end
function loop_tullio(u, a, x)
    @. u = 0
    @tullio u += a[i] * x
    return nothing
end
N = 4
a = ones(N)
x = CUDA.ones(1000)
u = CUDA.zeros(1000)
@btime CUDA.@sync loop($u, $a, $x)
@btime CUDA.@sync loop_unrolled($u, $a, $x)
@btime CUDA.@sync loop_mapreduce($u, $a, $x)
@btime CUDA.@sync loop_tullio($u, $a, $x)
21.863 μs (105 allocations: 5.73 KiB)    # loop
10.349 μs (7 allocations: 496 bytes)     # loop_unrolled
42.210 μs (219 allocations: 12.31 KiB)   # loop_mapreduce
69.036 μs (242 allocations: 25.36 KiB)   # loop_tullio
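One more variant that may be worth trying: since a lives on the CPU, the coefficients can be summed on the host first (sum_i a[i] * x == sum(a) * x), which leaves a single fused kernel on the GPU. A minimal sketch, with loop_hostsum being a made-up name:

function loop_hostsum(u, a, x)
    s = sum(a)      # ordinary CPU reduction to a scalar
    u .= s .* x     # one broadcast kernel launch
    return nothing
end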