I have a loop where at each iteration I do some vectorized operation on a CuArray. For example:
using CUDA

N = 4
a = ones(N)
u = CUDA.zeros(1000)

@. u = a[1]
for i in 2:N
    @. u += a[i]
end
As far as I understand, CUDA.jl launches a separate kernel on each loop iteration, which adds overhead. Is there a way to unroll the loop so that everything fuses into a single CUDA kernel? Like here:
@. u = a[1] + a[2] + a[3] + a[4]
Do you mean this?

u .= reduce(+, a)
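Since a is a CPU array here, the reduction runs on the host and produces a scalar, so only one broadcast kernel is launched:

s = reduce(+, a)   # plain host-side sum, same as sum(a)
u .= s             # a single fused broadcast kernel on the GPU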
Then what about the following situation?
using CUDA

N = 4
a = ones(N)
x = CUDA.ones(1000)
u = CUDA.zeros(1000)

for i in 1:N
    @. u += a[i] * x
end
mapreduce
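For example, a sketch of that suggestion (note that each y * x allocates a temporary CuArray, so this still launches several kernels):

u .= mapreduce(y -> y * x, +, a)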
It’s probably also possible to express this more neatly with Tullio or another index-notation package.
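For instance, a sketch with Tullio's index notation, where the index i appears only on the right-hand side and is therefore summed over (depending on the Tullio version, the GPU path may also require loading KernelAbstractions):

using Tullio
@tullio u[j] = a[i] * x[j]   # sums over i into the existing array u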
It seems that both the mapreduce and Tullio approaches are slower than the original loop:
using BenchmarkTools
using CUDA
using Tullio
CUDA.allowscalar(false)
function loop(u, a, x)
    @. u = a[1] * x
    for i in 2:length(a)   # one broadcast kernel per iteration
        @. u += a[i] * x
    end
    return nothing
end
function loop_unrolled(u, a, x)
    @. u = a[1] * x + a[2] * x + a[3] * x + a[4] * x   # single fused kernel
    return nothing
end
function loop_mapreduce(u, a, x)
    u .= mapreduce(y -> y * x, +, a)   # each y * x allocates a temporary CuArray
    return nothing
end
function loop_tullio(u, a, x)
    @. u = 0
    @tullio u += a[i] * x
    return nothing
end
N = 4
a = ones(N)
x = CUDA.ones(1000)
u = CUDA.zeros(1000)
@btime CUDA.@sync loop($u, $a, $x)
@btime CUDA.@sync loop_unrolled($u, $a, $x)
@btime CUDA.@sync loop_mapreduce($u, $a, $x)
@btime CUDA.@sync loop_tullio($u, $a, $x)
21.863 μs (105 allocations: 5.73 KiB)    # loop
10.349 μs (7 allocations: 496 bytes)     # loop_unrolled
42.210 μs (219 allocations: 12.31 KiB)   # loop_mapreduce
69.036 μs (242 allocations: 25.36 KiB)   # loop_tullio
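One more variant that may be worth trying: since a lives on the CPU, the coefficients can be summed on the host first (sum_i a[i] * x == sum(a) * x), which leaves a single fused kernel on the GPU. A minimal sketch, with loop_hostsum being a made-up name:

function loop_hostsum(u, a, x)
    s = sum(a)      # ordinary CPU reduction to a scalar
    u .= s .* x     # one broadcast kernel launch
    return nothing
end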