Efficient iteration with fixed size array types without `@generated` functions

My package, CliffordNumbers.jl, has a type BitIndices{Q,C<:AbstractCliffordNumber{Q}} <: AbstractVector{BitIndex{Q}}, whose length equals that of C and is known at compile time. Indexing into these objects depends only on the type parameter C, so every element is also known at compile time.
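For context, a minimal analogue of such a type (a sketch, not the actual CliffordNumbers.jl definitions) is a singleton AbstractVector whose size and elements are all determined by its type parameters:

```julia
# Hypothetical stand-in: a length-N "iota" vector with no fields, so its
# size and every element are recoverable from the type alone
struct StaticIota{N} <: AbstractVector{Int} end

Base.size(::StaticIota{N}) where N = (N,)
Base.getindex(::StaticIota, i::Int) = i
```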

In principle, I would expect Julia to be able to unroll loops that iterate over an object of this type, and to constant-propagate each element:

function iteration_test()
    result = BitIndex(Val(VGA(3)))
    for i in BitIndices(CliffordNumber{VGA(3)})    # 8 iterations
        result *= i
    end
    return result
end

However, when I lower this code, I find that loop control flow overhead is introduced anyway:

julia> @code_warntype iteration_test()
MethodInstance for iteration_test()
  from iteration_test() @ Main REPL[3]:1
Arguments
  #self#::Core.Const(iteration_test)
Locals
  @_2::Union{Nothing, Tuple{BitIndex{VGA(3)}, Tuple{Base.OneTo{Int64}, Int64}}}
  result::BitIndex{VGA(3)}
  i::BitIndex{VGA(3)}
Body::BitIndex{VGA(3)}
1 ─ %1  = Main.VGA(3)::Core.Const(VGA(3))
│   %2  = Main.Val(%1)::Core.Const(Val{VGA(3)}())
│         (result = Main.BitIndex(%2))
│   %4  = Main.CliffordNumber::Core.Const(CliffordNumber)
│   %5  = Main.VGA(3)::Core.Const(VGA(3))
│   %6  = Core.apply_type(%4, %5)::Core.Const(CliffordNumber{VGA(3)})
│   %7  = Main.BitIndices(%6)::Core.Const(BitIndex{VGA(3)}[BitIndex(Val(VGA(3))), BitIndex(Val(VGA(3)), 1), BitIndex(Val(VGA(3)), 2), BitIndex(Val(VGA(3)), 1, 2), BitIndex(Val(VGA(3)), 3), BitIndex(Val(VGA(3)), 1, 3), BitIndex(Val(VGA(3)), 2, 3), BitIndex(Val(VGA(3)), 1, 2, 3)])
│         (@_2 = Base.iterate(%7))
│   %9  = (@_2::Core.Const((BitIndex(Val(VGA(3))), (Base.OneTo(8), 1))) === nothing)::Core.Const(false)
│   %10 = Base.not_int(%9)::Core.Const(true)
└──       goto #4 if not %10
2 ┄ %12 = @_2::Tuple{BitIndex{VGA(3)}, Tuple{Base.OneTo{Int64}, Int64}}
│         (i = Core.getfield(%12, 1))
│   %14 = Core.getfield(%12, 2)::Tuple{Base.OneTo{Int64}, Int64}
│         (result = result * i)
│         (@_2 = Base.iterate(%7, %14))
│   %17 = (@_2 === nothing)::Bool
│   %18 = Base.not_int(%17)::Bool
└──       goto #4 if not %18
3 ─       goto #2
4 ┄       return result

This is indicated by @_2::Union{Nothing, Tuple{BitIndex{VGA(3)}, Tuple{Base.OneTo{Int64}, Int64}}}, which appears because Base.iterate(A::AbstractArray, state=(eachindex(A),)) is invoked at runtime. (A return value of nothing indicates the end of iteration.) For this reason, I have to use @generated functions to manually unroll loops that iterate over this object, particularly products.
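Concretely, the @generated workaround looks something like this (a sketch, not the package's actual code; I'm assuming length is callable on the type C, as the package makes the length recoverable from the type):

```julia
# Manually unrolled product over a BitIndices object: because the length is
# known from the type alone, we can splice out one multiplication per element
@generated function unrolled_prod(x, b::BitIndices{Q,C}) where {Q,C}
    ex = :x
    for i in 1:length(C)   # assumes length is defined on the type C
        ex = :($ex * b[$i])
    end
    return ex
end
```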

I know that there are packages that allow for the unrolling of loops using macros (Unroll.jl, Unrolled.jl), but if possible, I’d like to optimize all iteration over this type, or any types I construct with similar features (singleton array types with size and indexing known at compile time), without making the user do anything special. Are there assumptions baked into the AbstractArray subtyping that prevent this optimization from occurring, or is this even possible at all?

I doubt that it’s possible to unroll the for loop. However, if you convert BitIndices to a tuple, you can make it unroll:

@inline function Base.foreach(f, b::BitIndices)
    ntuple(i -> (@inline; @inline f(b[i])), Val(length(b)))
    nothing
end

function iteration_test2()
    result::BitIndex{VGA(3)} = BitIndex(Val(VGA(3)))
    foreach(BitIndices(CliffordNumber{VGA(3)})) do i
        result *= i
    end
    return result
end

function iteration_test3()   # with Ref
    result = Ref(BitIndex(Val(VGA(3))))
    foreach(BitIndices(CliffordNumber{VGA(3)})) do i
        result[] *= i
    end
    return result[]
end

Then

julia> @b iteration_test()
31.426 ns

julia> @b iteration_test2()
3.794 ns

julia> @b iteration_test3()
3.430 ns

The user would have to cooperate for this, so I guess it’s not really what you want.

EDIT: modified iteration_test2 to avoid allocation; modified foreach

  1. Type inference has no idea what a number in the type parameters means, so @code_warntype is too early to see any unrolling optimizations. Write a println loop over a tuple — the builtin iterable type with fixed length — and you’ll see the iterate loop there too. It’s @code_llvm where you’ll see the unrolling.

  2. Julia typically “unrolls” iteration over tuples of <=31 elements with @inline’d recursive Base.tail calls, not loops. That suggests to me that there’s a significant limitation to unrolling optimizations, which makes sense considering all the performance downsides and the sabotage of other optimizations that can outweigh the removal of iteration overhead.

  3. I think I see loop unrolling over a wrapper of Vector with Base.size returning a small (<=17) type parameter value. Check @code_llvm to see if that’s happening for you. EDIT: iterating over eachindex or 1:length(x) is crucial for unrolling.

  4. Unrolled.jl does transform code into generated functions; right after parsing source code, you don’t know what types symbols will have, let alone whether those types specify lengths. Unroll.jl doesn’t seem to use generated functions, but it looks like it gets around that by outright evaluating the indices iterable.

Sidenote: @code_warntype is a bad tool for inspecting this since it runs before optimization, so it won’t show even the loop unrolling that happens on the Julia side. I’d recommend @code_typed or Cthulhu.jl’s @descend here.
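Regarding point 2, the recursive pattern Base uses for tuples looks roughly like this sketch; inlining flattens the recursion into straight-line code for short tuples:

```julia
# Each call peels off one element with Base.tail; Base marks such methods
# @inline so the recursion disappears for tuples below the length cutoff
tsum(t::Tuple{}) = 0
tsum(t::Tuple) = first(t) + tsum(Base.tail(t))
```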


To answer the main question you asked, here’s a handy macro I sometimes use when I want unrolling but not a generated function:

macro unroll(N::Int, loop)
    Base.isexpr(loop, :for) || error("only works on for loops")
    Base.isexpr(loop.args[1], :(=)) || error("This loop pattern isn't supported")
    val, itr = esc.(loop.args[1].args)
    body = esc(loop.args[2])
    @gensym loopend
    label = :(@label $loopend)
    goto = :(@goto $loopend)
    out = Expr(:block, :(itr = $itr), :(next = iterate(itr)))
    unrolled = map(1:N) do _
        quote
            isnothing(next) && $goto
            $val, state = next
            $body
            next = iterate(itr, state)
        end
    end
    append!(out.args, unrolled)
    remainder = quote
        while !isnothing(next)
            $val, state = next
            $body
            next = iterate(itr, state)
        end
        $label
    end
    push!(out.args, remainder)
    out
end

this will unroll the first N iterations of a loop, potentially leaving the rest rolled up. For things like Tuple or your BitIndices, this should make it much easier for the compiler to understand what’s going on and produce properly optimized, type-stable code.
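Usage looks like an ordinary for loop with the macro in front, e.g. this hypothetical tuple sum:

```julia
# Unrolls (up to) the first 4 iterations; any remaining ones stay in a while loop
function tuple_sum(t::Tuple)
    s = 0
    @unroll 4 for x in t
        s += x
    end
    return s
end
```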

However, in your case this is not enough to make the compiler optimize away your whole iteration_test function. The other problem is that the compiler thinks some of the operations you’re doing here are not legal to perform at compile time, because it can’t prove that various functions in the call chain can be evaluated then. In particular, _sort_with_parity(::NTuple) creates an array, which is bad, and negative_square_bits does some other stuff that confuses the compiler.

If I recklessly just tell the compiler to assume that (*)(::BitIndex, ::BitIndex) is consistent and terminating,

@eval CliffordNumbers Base.@assume_effects :consistent :terminates_globally function (*)(a::T, b::T) where T<:BitIndex
    T(signbit_of_mult(a,b), xor(UInt(a), UInt(b)))
end

and add a zero-length shortcut for _sort_with_parity(::NTuple):

@eval CliffordNumbers function _sort_with_parity(t::NTuple{0})
    return ((), false)
end

and then add the unrolling to the loop:

function iteration_test2()
    result = BitIndex(Val(VGA(3)))
    # first 32 iterations are unrolled, rest is a loop. 
    # In this case, the compiler will figure out it can discard all the iterations past 8.
    @unroll 32 for i in BitIndices(CliffordNumber{VGA(3)})
        result *= i
    end
    return result
end

then we see the compiler is able to fully understand what’s going on here and do the whole computation at compile time:

julia> @code_typed iteration_test2()
CodeInfo(
1 ─     return $(QuoteNode(-BitIndex(Val(VGA(3)))))
) => BitIndex{VGA(3)}

I wouldn’t recommend using @assume_effects here the way I did, though. Instead, I’d recommend looking through the functions in the call chain of (*)(::BitIndex, ::BitIndex), figuring out which functions are causing the bad effects, and then seeing if there’s a way to rewrite your code such that the compiler is better able to understand what’s going on. Check out the function Base.infer_effects to view the effects of various functions.
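For example (a generic illustration, not the actual CliffordNumbers call chain):

```julia
# Query the effects the compiler infers for a given call signature
Base.infer_effects(*, (Int, Int))
# and compare with the method under suspicion, e.g.
# Base.infer_effects(*, (BitIndex{VGA(3)}, BitIndex{VGA(3)}))
```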


A construction that I sometimes use is

julia> function foo(a, ::Val{N}) where N
           length(a) == N || throw("dim mismatch")
           s = 0
           for i in 1:length(a)
               s += a[i]
           end
           s
       end

Inspect the generated code: Fully unrolled, all boundschecks replaced by the initial length check.

You use the type parameter to force specialization on N, such that it is known at compile time, and the crucial length(a)==N || throw("dim mismatch") check ensures that the compiler then knows length(a) statically as well.

If you’re more greedy, you can of course use LLVM assume shenanigans to suppress the initial length check as well.


This seems to depend on the complexity of the code inside the loop. If one takes the code from the OP (and keeps the Val argument), one gets something like

function iteration_test_val(::Val{N}) where N
    b = BitIndices(CliffordNumber{VGA(3)})
    length(b) == N || throw("dim mismatch")
    result = BitIndex(Val(VGA(3)))
    for j in 1:length(b)
        result *= b[j]
    end
    return result
end

As far as I can tell from looking at @code_llvm, the loop is not unrolled. That the timing for iteration_test_val(Val(8)) is the same as for the original iteration_test() suggests the same.


I think that the function iteration_test2 I defined above is evaluated at compile time:

julia> @code_llvm iteration_test2()
;  @ REPL[4]:1 within `iteration_test2`
define [1 x i64] @julia_iteration_test2_211() #0 {
L1163.1:
;  @ REPL[4]:6 within `iteration_test2`
  ret [1 x i64] [i64 -9223372036854775808]
}

How come this is possible without the changes to the internals of CliffordNumbers.jl that you are suggesting?

Hm, that’s quite interesting indeed. I had missed that your version actually did optimize away the loop. I’m guessing what’s happening here is that even though Julia’s compiler was unable to optimize everything away (check the @code_typed for your version, it’s still quite large), LLVM itself was still able to figure it out and decided it was okay to evaluate it all at compile time.

Generally, I like to try and get this sort of stuff to happen at the julia-level when possible, but yeah, sometimes LLVM can do stuff that julia doesn’t realize is legal.

Do you know why the use of a Ref improves speed here? (Could it just be a microbenchmark artifact? I’m not sure where @b comes from)

This is the one place in the entire package where an array is explicitly created. While it’s only used in a convenience constructor for the BitIndex type, it makes sense that it can’t be constant folded. I actually made the same optimization in a local branch, so hopefully that can solve problems for this common case. (In the future, I need to get the parity of the permutation of the indices without resorting to array allocations)
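One allocation-free approach (just a sketch, not the package’s actual method) is an insertion sort over the NTuple that counts swaps, using Base.setindex to build updated tuples:

```julia
# Sort a tuple while tracking the parity of the permutation, with no arrays:
# Base.setindex on a Tuple returns a new tuple, so nothing is heap-allocated
function sort_with_parity(t::NTuple{N,T}) where {N,T}
    parity = false
    for i in 2:N
        j = i
        while j > 1 && t[j-1] > t[j]
            # swap elements j-1 and j (both indexing expressions see the old t)
            t = Base.setindex(Base.setindex(t, t[j], j - 1), t[j-1], j)
            parity = !parity
            j -= 1
        end
    end
    return t, parity
end
```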

Can you elaborate more on this? (I figure the exact details are murky if you don’t know about Clifford algebras, but I assume the code introspection tools told you enough to conclude this)

Otherwise, Julia would box the variable result, because it is modified inside the anonymous function passed to foreach. The other way I know to avoid boxing is to declare the type of result, as in iteration_test2. I wish Julia could figure out on its own that boxing is not necessary here.
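The boxing is easy to reproduce in isolation (minimal sketch):

```julia
function boxed_counter()
    s = 0
    foreach(1:3) do i
        s += i   # s is reassigned inside the closure, so it gets wrapped in a Core.Box
    end
    return s     # @code_warntype shows s::Core.Box here
end
```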

@b is from Chairmarks.jl, a replacement for BenchmarkTools.jl.