When is a fold over a fixed-length tuple unrolled?

matthias314 · December 21, 2021, 2:44am

@code_typed tells me that the statement

foldl(+, (1,2,3); init = 0)

is unrolled into three additions. On the other hand, if I say

struct A x::Int; y::Int; z::Int end

f(a::A) = foldl((1,2,3); init = UInt(0)) do h, i
    hash(getfield(a, i), h)
end

@code_typed f(A(1,2,3))

then I see a call to Base.afoldl (also with @code_llvm and @code_native), so I assume the code is not unrolled into three separate statements. (Also, if I unroll it by hand, then I get a function that runs faster.) My question therefore is: In which cases does such an expansion happen? Does it maybe depend on the size of the functions involved?

N5N3 · December 21, 2021, 8:55am

If you use Cthulhu and decend into _afold you’ll find the kernal is unrolled.
And the benchmark show little difference on my PC:

f(a::Tuple) = foldl(a; init = UInt(0)) do h, i
    hash(i, h)
end
_f(a::Tuple) = hash(a[3], hash(a[2], hash(a[1], UInt(0))))

AA = Ref((1,2,3))
using BenchmarkTools
@btime f($AA[])  # 5.000 ns (0 allocations: 0 bytes)
@btime _f($AA[])  #  5.100 ns (0 allocations: 0 bytes)

matthias314 · December 21, 2021, 1:59pm

Maybe the time difference got lost when I simplified the code. Here is an example where I can see a difference:

struct A x::Int; y::Int; z::Int end

h1(a::A, h::UInt) = hash(a.z, hash(a.y, hash(a.x, hash(:A, h))))

function h2(x::A, h0::UInt)
    foldl(ntuple(identity, fieldcount(typeof(x)));
            init = hash(:A, h0)) do h, i
        hash(getfield(x, i), h)
    end
end

Here are the timings:

a = A(1,2,3)
@btime h1($a, UInt(0))   # 6.869 ns (0 allocations: 0 bytes)
@btime h2($a, UInt(0))   # 7.994 ns (0 allocations: 0 bytes)

Or shouldn’t I use $a for timings?

N5N3 · December 21, 2021, 3:15pm

I can reproduce the Benchmark result on my pc.
But if we use a for loop to bench:

a = A.(rand(1:100,100),rand(1:100,100),rand(1:100,100))
b = h1.(a,UInt(0))
@btime $b .= h1.($a,UInt(0)); #  664.780 ns (0 allocations: 0 bytes)
@btime $b .= h2.($a,UInt(0)); #  615.517 ns (0 allocations: 0 bytes)

Cthulhu shows that the indexed is not propgate in to the kernal as const for h2. Let’s try

function h3(x::A, h0::UInt)
    f(h, ::Val{i}) where {i} = hash(getfield(x, i), h)
    foldl(f, ntuple(Val, fieldcount(A)); init = hash(:A, h0))
end
@btime $b .= h3.($a,UInt(0)); #  579.570 ns (0 allocations: 0 bytes)

a little better.
And the faster version is transforming a::A into a Tuple

function h4(a::A, h0::UInt)
    (;x, y, z) = a
    foldl((x, y, z); init = hash(:A, h0)) do h, i
        hash(i, h)
    end
end
@btime $b .= h4.($a,UInt(0)); #  482.653 ns (0 allocations: 0 bytes)

Not sure where cause the difference.

matthias314 · December 21, 2021, 4:55pm

Thanks a lot!

The downside of this is that you need to hard-code the number of fields. (Or am I wrong?) h2 and h3 work for any type, except that :A is hard-coded. I’m thinking of using the code inside a macro that can be called as @h(A) and produces the function definition. Then :A can be inserted via QuoteNode. (In a regular function like h2, I could use Symbol(typeof(x)), but this is very slow because it first creates a string.)
Of course, one can use generated functions, fancier macro code and other things to get a macro that does what I want, see this GitHUb issue. I was trying to figure out how far one can get with folding without sacrificing speed.

N5N3 · December 22, 2021, 1:17am

Then you might want?

function h5(a::T, h0::UInt) where {T}
    temp = ntuple(i -> getfield(a, i), fieldcount(A))
    foldl(temp; init = hash(T, h0)) do h, i
        hash(i, h)
    end
end
@btime $b .= h5.($a,UInt(0)); #  482.143 ns (0 allocations: 0 bytes)

I replaced Symbol(A) with A, as every typename has a unique objectid.
But the benchmark make things more strange. Why there’s performance difference between h2 and h5

matthias314 · December 22, 2021, 1:35am

This is as fast as the hand-coded function h1. Great!

EDIT: Hashing a parametric type is slower than hashing a symbol (provided that you have the symbol and don’t need to generate it via Symbol).

I agree that it would be good to understand why h2 is slower than h5 or, even better, if it were as fast!

Topic		Replies	Views
Unrolling loops over tuples - why so hard? Performance tuple , unrolling	14	1352	September 10, 2023
Is `Core.Compiler.return_type` foldable? General Usage question	19	143	June 24, 2025
Manually unroll operations with objects of tuple New to Julia question , metaprogramming , unrolling	21	2266	June 13, 2018
Tuples of closures Performance tuple , closure	5	588	June 21, 2019
Advice for Higher Order Functions on Tuples General Usage tuple	6	209	June 12, 2024

When is a fold over a fixed-length tuple unrolled?

Related topics