Unrolling loops over tuples - why so hard?

Unrolling loops over tuple elements can bring easy performance gains. Here’s a trivial example of a 250x speed-up, just by adding an @unroll macro to the code:

import Unroll.@unroll
using BenchmarkTools

function nounroll(args...)
    total = 0
    for i = 1:5
        total += args[i]
    end
    return total
end

function withunroll(args...)
    total = 0
    @unroll for i = 1:5
        total += args[i]
    end
    return total
end

args = (UInt8(1), Int16(1), UInt32(1), Float32(1.0), Float64(1.0))

@assert nounroll(args...) == withunroll(args...)

@btime nounroll(($args)...);
@btime withunroll(($args)...);

  378.676 ns (4 allocations: 64 bytes)
    1.500 ns (0 allocations: 0 bytes)

These gains are not surprising. When the tuple elements are of different types, the loop body cannot specialize on those types, leading to dynamic dispatch and/or type instability. However, since the tuple’s element types are known at compile time, unrolling the loop lets the compiler specialize each iteration, producing much more efficient code (in certain circumstances).

I’ve noticed some code in the Julia stdlib that would benefit from this (it can be fairly slow, for several reasons, but this is one). I’m sure there’s loads more. My own code would also (and does, in places) benefit from this. However, there are two problems. Firstly, I’m using an external package for the unroll macro, so it can’t be used in the stdlib. Secondly, the macro requires hard coded loop ranges (notice I cheated, and set the loop range to 1:5, not 1:length(args), knowing there were 5 inputs). I think I see a way round the latter with @generated functions (not ideal, but likely unavoidable), but what can be done about the former?
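
To sketch the @generated route I have in mind for the hard-coded ranges (hypothetical and untested; unrolled_sum is just a stand-in name):

@generated function unrolled_sum(args::NTuple{N,Any}) where {N}
    # Splice one literal-index addition per element: each args[$i] is a
    # constant index, so the compiler can specialize every term.
    terms = [:(total += args[$i]) for i in 1:N]
    quote
        total = 0
        $(terms...)
        return total
    end
end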

I’m aware of the various ntuple macros and functions. However, I’ve found them to be neither flexible enough nor always performant, and found loop unrolling to be both.

Why isn’t it easier to unroll loops over tuples, in particular in stdlib code? Isn’t this something we should be doing? It seems to me the whole value of tuples is that their lengths and (usually) element types are known at compile time. Why not exploit this?

I would unroll through defining recursive functions.
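
For example (a minimal sketch with hypothetical names): peel one element per call with Base.tail, so every step is type-stable, and use @inline so the recursion flattens for short tuples.

@inline rsum(args::Tuple{}) = 0  # base case: empty tuple
@inline rsum(args::Tuple) = first(args) + rsum(Base.tail(args))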

Excellent! You’ve found a hidden optimization! Be sure to benchmark it and make a PR. Julia could always use more keen eyes like yours.

From what I understand, it risks very large code sizes; after all it is pasting the same code over and over, even if SIMD helps do it in chunks. Loops over tuples up to length 31 are conventionally unrolled with inlined recursive function calls on the Base.tail subtuple, and beyond that you run into type-unstable allocating code. Like I said in a recent related thread, it would actually be cool if we could fill a tuple in a loop that isn’t unrolled (each iteration is like incomplete initialization, but it is complete over the whole loop); maybe this can be done in a special macro or with compiler optimizations over Accessors.jl.

That’s a solution that I’ve used in some places. However, it comes with function-call overhead (if the code cannot be inlined), and poses challenges if the code updates a lot of variables in the loop. If you have to encapsulate all the work in a function, you run into the same lack of flexibility as with the ntuple tools.

Yes. But the unroller can always be limited to unrolling no more than N iterations, if desired. Also, the unroller can group matching consecutive types into loops, so if all elements of the tuple are of the same type, there would still be just a single loop.

@inline has worked for me so far, though I can’t guarantee that it always will.

I don’t understand: the macro unrolls one loop to begin with, and how would the elements sharing a type make a difference?

When I said the unroller can group matching consecutive types into loops, I meant this. Imagine a magic @unrolltuple macro that does the following:

function myfun(args...)
    total = 0
    @unrolltuple for i = 1:length(args)
        total += args[i]
    end
    return total
end

called with args = (1, 1, 1.0, 1.0) generates

function myfun(args...)
    total = 0
    for i = 1:2
        total += args[i]::Int64
    end
    for i = 3:4
        total += args[i]::Float64
    end
    return total
end

You can’t do that with a macro; it has no idea what the type of args is. Maybe a generated function could pull it off. Even then, wouldn’t that defeat the purpose? The expression args[i] is type-unstable for a heterogeneous args; the unrolling efficiency comes from having a fixed number of type-stable args[1], args[2], etc.

I think there are likely ways around these problems. I’ve already mentioned the use of generated functions. To fix the type instability you raised, the code generator could simply annotate the type in each loop.
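
A rough, untested sketch of such a generator (grouped_sum is a hypothetical name): it reads the element types off the tuple’s type, groups consecutive runs of the same type, and emits one type-annotated loop per run.

@generated function grouped_sum(args::NTuple{N,Any}) where {N}
    types = fieldtypes(args)  # element types, known at compile time
    loops = Expr[]
    lo = 1
    for hi in 1:N
        # close a run when the next element's type differs (or at the end)
        if hi == N || types[hi + 1] !== types[hi]
            T = types[lo]
            push!(loops, :(for i in $lo:$hi
                total += args[i]::$T  # annotation keeps each loop type-stable
            end))
            lo = hi + 1
        end
    end
    quote
        total = 0
        $(loops...)
        return total
    end
end

Called with args = (1, 1, 1.0, 1.0), this would emit exactly the two typed loops shown above.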

CompTime.jl is a pretty nice way to do this in a generated function but with more ‘regular’ syntax and less explicit metaprogramming.

using CompTime
myfun(args...) = _myfun(args)

@ct_enable function _myfun(args::NTuple{N}) where {N}
    total = 0
    @ct_ctrl for i ∈ 1:N
        total += args[@ct i]
    end
    total
end

Here we see that everything has been unrolled:

julia> @code_typed myfun(1, 2, 3)
CodeInfo(
1 ─ %1 = Base.getfield(args, 1, true)::Int64
│   %2 = Base.add_int(0, %1)::Int64
│   %3 = Base.getfield(args, 2, true)::Int64
│   %4 = Base.add_int(%2, %3)::Int64
│   %5 = Base.getfield(args, 3, true)::Int64
│   %6 = Base.add_int(%4, %5)::Int64
└──      return %6
) => Int64

Thanks. This is nice. Wish there was something like this in Base.

@user664303, would you mind elaborating or providing examples of places where Base.Cartesian has not given you good performance? My go-to for “I’m not asking you, I’m telling you”-style unrolling would be a function like

myfun(args...) = _myfun(args)

@generated function _myfun(args::NTuple{N}) where {N}
  quote
    total = 0
    Base.Cartesian.@nexprs $N j->begin
      total += args[j]
    end
    total
  end
end

which seems to generate the same code as @Mason’s example. I would be very interested to see examples of when this kind of strategy leads to sub-optimal code so that I can avoid those patterns.

If it is type stable, it should be able to inline.
If it isn’t type stable, I wouldn’t worry about function call overhead. On the contrary, the function barriers could help runtime performance.

The rest of your points (that I didn’t quote) are valid. It can be cumbersome. But I do think it’s generally preferable to avoid metaprogramming.

This is a nice example. Thanks.

I wasn’t actually referring to the other tools in Base.Cartesian (like the one you gave). I was just referring to the ntuple function and the ntuple macro - but now that I think about it, it was the use of the ntuple function in both normal and generated functions. I can’t provide the code that was slow: it was some time ago that I came across it, and I found a workaround, so I no longer have the slow code to hand. I’m sure the problem could have been fixed using the ::NTuple{N} trick to get the length of the tuple as a compile-time constant. I assume using length(mytuple) doesn’t work as well as getting an explicit compile-time length N, even in a generated function.
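
For illustration, the kind of difference I mean (hypothetical functions; the Val method of ntuple is the one that specializes on the length):

double_runtime(t::Tuple) = ntuple(i -> 2 * t[i], length(t))             # Integer-length method
double_val(t::NTuple{N,Any}) where {N} = ntuple(i -> 2 * t[i], Val(N))  # Val-length method, unrolled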

Since no one has mentioned it so far, I just want to add that reduce and other folding operations unroll the implicit loop over a tuple:

julia> @code_typed reduce(+, args)
CodeInfo(
1 ─ %1  = Core.getfield(itr, 1)::UInt8
│   %2  = Core.getfield(itr, 2)::Int16
│   %3  = Core.getfield(itr, 3)::UInt32
│   %4  = Core.getfield(itr, 4)::Float32
│   %5  = Core.getfield(itr, 5)::Float64
│   %6  = Core.zext_int(Core.Int16, %1)::Int16
│   %7  = Base.add_int(%6, %2)::Int16
│   %8  = Base.sext_int(UInt32, %7)::UInt32
│   %9  = Base.add_int(%8, %3)::UInt32
│   %10 = Base.uitofp(Float32, %9)::Float32
│   %11 = Base.add_float(%10, %4)::Float32
│   %12 = Base.fpext(Base.Float64, %11)::Float64
│   %13 = Base.add_float(%12, %5)::Float64
└──       return %13
) => Float64