Non-allocating loop over a set of structs

I have a set of structs and I want to loop over them. The following code compares two possible loops with sequential, one-by-one access to the structs:

using BenchmarkTools

struct Sample{T<:AbstractFloat, F<:Function}
    a :: T
    foo :: F
end

function main()
    foo1(x) = x^1
    foo2(x) = x^2
    foo3(x) = x^3
    s1 = Sample(1., foo1)
    s2 = Sample(2., foo2)
    s3 = Sample(3., foo3)

    samples = (s1, s2, s3)

    @btime begin
        b = 0.
        for i=1:length($samples)
            b = b + $samples[i].a * $samples[i].foo(2.)
        end
    end

    @btime begin
        b = 0.
        for s in $samples
            b = b + s.a * s.foo(2.)
        end
    end

    @btime begin
        b = 0.
        b = b + $samples[1].a * $samples[1].foo(2.)
        b = b + $samples[2].a * $samples[2].foo(2.)
        b = b + $samples[3].a * $samples[3].foo(2.)
    end
end

main()

The result is

  310.543 ns (24 allocations: 480 bytes)
  258.344 ns (17 allocations: 304 bytes)
  0.017 ns (0 allocations: 0 bytes)

As you can see, both loops result in memory allocations (though I am surprised that the number of allocations differs between them). In turn, the sequential calls to the structs give no allocations. Can you please tell me how to avoid the allocations in these kinds of loops? Thank you.

Something to note is that if you change the structure to:

struct Sample
    a :: Float64
    foo :: Function
end

The memory allocations change to (using Julia 1.1.1 and @time instead of @btime):

  0.000001 seconds (12 allocations: 192 bytes)
  0.000001 seconds (12 allocations: 192 bytes)
  0.000000 seconds

And if I use:

struct Sample{F<:Function}
    a :: Float64
    foo :: F
end

The timings become:

  0.000002 seconds (24 allocations: 480 bytes)
  0.000002 seconds (16 allocations: 288 bytes)
  0.000000 seconds

So there appears to be some magic around the Parametric Function type…

I don’t know what’s going on here, but timings like 0.017 ns (0 allocations: 0 bytes), i.e. fractions of a nanosecond, typically indicate that something’s wrong with the benchmark: the compiler has most likely constant-folded the measured expression away entirely, so you are not timing any real work.

The way you wrote Sample, s1, s2 and s3 will all have different types.

It is not possible to sensibly store more than a handful of these in a collection and retain detailed type information (which is needed for performant code). Hence, you cannot quickly loop over type-heterogeneous collections of samples. This is a fundamental property of Julia (but depending on your application, “slow” loops may be fast enough).
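To make this concrete, here is a small sketch (re-declaring the Sample struct from the question) showing why the samples are type-heterogeneous: every function in Julia has its own singleton type, and that type becomes the F parameter of Sample.

```julia
# Each function (foo1, foo2) has its own singleton type, and that type
# becomes the F parameter of Sample, so s1 and s2 have different types.
struct Sample{T<:AbstractFloat, F<:Function}
    a :: T
    foo :: F
end

foo1(x) = x^1
foo2(x) = x^2

s1 = Sample(1.0, foo1)
s2 = Sample(2.0, foo2)

typeof(s1) == typeof(s2)  # false
```

Because the element types differ, a collection of these samples can only be typed with an abstract common supertype, which is what forces the generic (allocating) code path in the loop.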

However, if you deal with known length tuples you can unroll the loop, i.e.:

b = b + s1.a * s1.foo(2.0)
b = b + s2.a * s2.foo(2.0)
b = b + s3.a * s3.foo(2.0)

Julia is sometimes able to optimize code like the one you wrote into the above, but I recommend against relying on it.

1 Like

The way you wrote Sample, s1, s2 and s3 will all have different types.

Does this happen because of the Function field? Can I do something about it, like specifying the type of the function’s output?

However, if you deal with known length tuples you can unroll the loop

The number of elements changes from run to run. Can I do the loop unrolling using some kind of metaprogramming magic?

I still do not understand why the result of the loop is so different from the result of unrolling. I thought that under the hood Julia does exactly the same.

Why did you think a loop and unrolled code would be exactly the same? Unrolling something like for i in 1:1000000 is clearly not a good idea. A loop, by definition, executes the same code multiple times. Since samples contains heterogeneous types, “the same code” needs to be something generic that can handle things with different types.

Of course, a loop can be unrolled but it is not obvious for code like this when doing so is beneficial.

1 Like

Well, that makes sense. However, in my case the number of loop cycles will be small (fewer than ten). Can I manually force the compiler to unroll the loop?

Yes, you can look at https://github.com/cstjean/Unrolled.jl.

This sounds premature to me. Looping 10 times, I’m not sure the allocations are going to break the bank. I feel it would be better to write the code first and then determine where the most time is being spent…

This is the bottleneck so far. These small loops appear inside an external loop and are executed many times. Moreover, the whole code is launched on GPU.

Recursion can often be used as a way of “unrolling” loops over small Tuples.

f() = 0.0  # type-stable base case (0.0, not 0, so the sum stays Float64)
f(s, ss...) = s.a * s.foo(2.) + f(ss...)
@btime f($samples...) # 1 ns, 0 allocations
1 Like

Thank you @kristoffer.carlsson and @Per for your solutions:

using BenchmarkTools
using Unrolled

struct Sample{T<:AbstractFloat, F<:Function}
    a :: T
    foo :: F
end

@unroll function do_sum(samples)
    b = 0.
    @unroll for s in samples
        b = b + s.a * s.foo(2.)
    end
    return b
end

frec() = 0.
frec(s, ss...) = s.a * s.foo(2.) + frec(ss...)

function main()
    foo1(x) = x^1
    foo2(x) = x^2
    foo3(x) = x^3
    s1 = Sample(1., foo1)
    s2 = Sample(2., foo2)
    s3 = Sample(3., foo3)

    samples = (s1, s2, s3)

    @btime begin
        b = 0.
        for s in $samples
            b = b + s.a * s.foo(2.)
        end
    end

    @btime $do_sum($samples)

    @btime $frec($samples...)
end

main()
  259.890 ns (17 allocations: 304 bytes)
  0.016 ns (0 allocations: 0 bytes)
  1.522 ns (0 allocations: 0 bytes)

Though I still have to check how these approaches work inside CUDA kernels. I guess there will be issues with the recursion solution (though it is really elegant and I like it).

It is worth noting that the reason why both the @unrolled and the recursive functions are fast is that the types of the three Samples are stored in the Tuple type. If any of the .foo fields were to change, then the Tuple type would change and the entire do_sum and frec functions would need to be recompiled.
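A quick sketch of that point (assuming the same Sample definition as in the question): the function types are baked into the Tuple type, so swapping a single foo changes the type of the whole tuple, and any specialized method must be compiled again for the new type.

```julia
struct Sample{T<:AbstractFloat, F<:Function}
    a :: T
    foo :: F
end

foo2(x) = x^2
foo3(x) = x^3

t1 = (Sample(1.0, foo2), Sample(2.0, foo2))
t2 = (Sample(1.0, foo2), Sample(2.0, foo3))  # only the last foo differs

# The tuples have different types, so do_sum/frec get a separate
# specialization (a fresh compilation) for each of them.
typeof(t1) == typeof(t2)  # false
```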

3 Likes

It looks like the number of objects is small in this case, so a tuple-based approach is probably best. Just wanted to note that when the number of objects is large but the number of distinct types is much smaller, you can use a tuple of concretely typed Vectors: unroll the loop over the different vectors, then iterate over each vector as normal. This approach is automated in TypeSortedCollections.jl (I’m the main author). See also Looping over different types with common behavior.
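For readers who want the gist without the package, here is a hand-rolled sketch of that pattern (the names sum_sorted, v1, v2 are mine, not from TypeSortedCollections.jl): recursion “unrolls” the small type-heterogeneous tuple of Vectors, while each inner loop runs over a homogeneous Vector and so needs no dynamic dispatch.

```julia
struct Sample{T<:AbstractFloat, F<:Function}
    a :: T
    foo :: F
end

foo1(x) = x^1
foo2(x) = x^2

# Many objects, but only two distinct concrete element types:
v1 = [Sample(Float64(i), foo1) for i in 1:100]
v2 = [Sample(Float64(i), foo2) for i in 1:100]
sorted = (v1, v2)

sum_sorted() = 0.0
function sum_sorted(v::AbstractVector, rest...)
    b = 0.0
    for s in v                      # homogeneous Vector: fast, no dispatch
        b += s.a * s.foo(2.0)
    end
    return b + sum_sorted(rest...)  # recursion unrolls over the tuple
end

sum_sorted(sorted...)  # 30300.0
```

The compiler specializes sum_sorted once per concrete Vector type, so the cost of the type heterogeneity is paid at compile time rather than on every iteration.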

3 Likes