Non-allocating loop over a set of structs

I have a set of structs and I want to loop over them. The following code compares two possible loops with sequential, one-by-one access to the structs:

using BenchmarkTools

struct Sample{T<:AbstractFloat, F<:Function}
    a :: T
    foo :: F
end

function main()
    foo1(x) = x^1
    foo2(x) = x^2
    foo3(x) = x^3
    s1 = Sample(1., foo1)
    s2 = Sample(2., foo2)
    s3 = Sample(3., foo3)

    samples = (s1, s2, s3)

    @btime begin
        b = 0.
        for i=1:length($samples)
            b = b + $samples[i].a * $samples[i].foo(2.)
        end
    end

    @btime begin
        b = 0.
        for s in $samples
            b = b + s.a * s.foo(2.)
        end
    end

    @btime begin
        b = 0.
        b = b + $samples[1].a * $samples[1].foo(2.)
        b = b + $samples[2].a * $samples[2].foo(2.)
        b = b + $samples[3].a * $samples[3].foo(2.)
    end
end

main()

The result is

  310.543 ns (24 allocations: 480 bytes)
  258.344 ns (17 allocations: 304 bytes)
  0.017 ns (0 allocations: 0 bytes)

As you can see, both loops result in memory allocations (though I am surprised that the number of allocations differs between them). In turn, the sequential calls to the structs give no allocations. Can you please tell me how to avoid the allocations in these kinds of loops? Thank you.

Something to note is that if you change the structure to:

struct Sample
    a :: Float64
    foo :: Function
end

The memory allocations change to (using Julia 1.1.1 and @time instead of @btime):

  0.000001 seconds (12 allocations: 192 bytes)
  0.000001 seconds (12 allocations: 192 bytes)
  0.000000 seconds

And if I use:

struct Sample{F<:Function}
    a :: Float64
    foo :: F
end

The timings become:

  0.000002 seconds (24 allocations: 480 bytes)
  0.000002 seconds (16 allocations: 288 bytes)
  0.000000 seconds

So there appears to be some magic around the Parametric Function type…

I don’t know what’s going on here, but timings like 0.017 ns (0 allocations: 0 bytes), i.e. fractions of a nanosecond, typically indicate that something’s wrong with the benchmark: the compiler has most likely constant-folded the measured expression away entirely, so you are not timing any real work.

The way you wrote Sample, s1, s2 and s3 will all have different types.

It is not possible to sensibly store more than a handful of these in a collection and retain detailed type information (which is needed for performant code). Hence, you cannot quickly loop over type-heterogeneous collections of samples. This is a fundamental property of Julia (but depending on your application, “slow” loops may be fast enough).
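To make this concrete, here is a small sketch (re-declaring the Sample struct from the question) showing why the samples are type-heterogeneous: every function in Julia has its own singleton type, and that type becomes the F parameter of Sample.

```julia
# Each function (foo1, foo2) has its own singleton type, and that type
# becomes the F parameter of Sample, so s1 and s2 have different types.
struct Sample{T<:AbstractFloat, F<:Function}
    a :: T
    foo :: F
end

foo1(x) = x^1
foo2(x) = x^2

s1 = Sample(1.0, foo1)
s2 = Sample(2.0, foo2)

typeof(s1) == typeof(s2)  # false
```

Because the element types differ, a collection of these samples can only be typed with an abstract common supertype, which is what forces the generic (allocating) code path in the loop.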

However, if you deal with known length tuples you can unroll the loop, i.e.:

b = b + s1.a * s1.foo(2.0)
b = b + s2.a * s2.foo(2.0)
b = b + s3.a * s3.foo(2.0)

Julia is sometimes able to optimize code like the one you wrote into the above, but I recommend against relying on it.

1 Like

The way you wrote Sample, s1, s2 and s3 will all have different types.

Does this happen because of the Function field? Can I do something about it, like specifying the type of the function’s output?

However, if you deal with known length tuples you can unroll the loop

The number of elements changes from run to run. Can I do the loop unrolling using some kind of metaprogramming magic?

I still do not understand why the result of the loop is so different from the result of unrolling. I thought that under the hood Julia does exactly the same.

Why did you think a loop and unrolled code would be exactly the same? Unrolling something like for i in 1:1000000 is clearly not a good idea. A loop, by definition, executes the same code multiple times. Since samples contains heterogeneous types, “the same code” needs to be something generic that can handle things with different types.

Of course, a loop can be unrolled but it is not obvious for code like this when doing so is beneficial.

1 Like

Well, that makes sense. However, in my case the number of loop cycles will be small (fewer than ten). Can I manually force the compiler to unroll the loop?

Yes, you can look at https://github.com/cstjean/Unrolled.jl.

This sounds premature to me. Looping 10 times, I’m not sure the allocations are going to break the bank. I feel it would be better to write the code first and then determine where the most time is being spent…

This is the bottleneck so far. These small loops appear inside an external loop and are executed many times. Moreover, the whole code is launched on GPU.

Recursion can often be used as a way of “unrolling” loops over small Tuples.

f() = 0.0  # type-stable base case (0.0, not 0, so the sum stays Float64)
f(s, ss...) = s.a * s.foo(2.) + f(ss...)
@btime f($samples...) # 1 ns, 0 allocations
1 Like

Thank you @kristoffer.carlsson and @Per for your solutions:

using BenchmarkTools
using Unrolled

struct Sample{T<:AbstractFloat, F<:Function}
    a :: T
    foo :: F
end

@unroll function do_sum(samples)
    b = 0.
    @unroll for s in samples
        b = b + s.a * s.foo(2.)
    end
    return b
end

frec() = 0.
frec(s, ss...) = s.a * s.foo(2.) + frec(ss...)

function main()
    foo1(x) = x^1
    foo2(x) = x^2
    foo3(x) = x^3
    s1 = Sample(1., foo1)
    s2 = Sample(2., foo2)
    s3 = Sample(3., foo3)

    samples = (s1, s2, s3)

    @btime begin
        b = 0.
        for s in $samples
            b = b + s.a * s.foo(2.)
        end
    end

    @btime $do_sum($samples)

    @btime $frec($samples...)
end

main()
  259.890 ns (17 allocations: 304 bytes)
  0.016 ns (0 allocations: 0 bytes)
  1.522 ns (0 allocations: 0 bytes)

Though I still have to check how these approaches work inside CUDA kernels. I guess there will be issues with the recursion solution (though it is really elegant and I like it).

It is worth noting that the reason why both the @unrolled and the recursive functions are fast is that the types of the three Samples are stored in the Tuple type. If any of the .foo fields were to change, then the Tuple type would change and the entire do_sum and frec functions would need to be recompiled.
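A quick sketch of that point (assuming the same Sample definition as in the question): the function types are baked into the Tuple type, so swapping a single foo changes the type of the whole tuple, and any specialized method must be compiled again for the new type.

```julia
struct Sample{T<:AbstractFloat, F<:Function}
    a :: T
    foo :: F
end

foo2(x) = x^2
foo3(x) = x^3

t1 = (Sample(1.0, foo2), Sample(2.0, foo2))
t2 = (Sample(1.0, foo2), Sample(2.0, foo3))  # only the last foo differs

# The tuples have different types, so do_sum/frec get a separate
# specialization (a fresh compilation) for each of them.
typeof(t1) == typeof(t2)  # false
```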

3 Likes

It looks like the number of objects is small in this case, so a tuple-based approach is probably best. Just wanted to note that when the number of objects is large but the number of distinct types is much smaller, you can use a tuple of concretely typed Vectors: unroll the loop over the different vectors, then iterate over each vector as normal. This approach is automated in TypeSortedCollections.jl (I’m the main author). See also Looping over different types with common behavior.
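For readers who want the gist without the package, here is a hand-rolled sketch of that pattern (the names sum_sorted, v1, v2 are mine, not from TypeSortedCollections.jl): recursion “unrolls” the small type-heterogeneous tuple of Vectors, while each inner loop runs over a homogeneous Vector and so needs no dynamic dispatch.

```julia
struct Sample{T<:AbstractFloat, F<:Function}
    a :: T
    foo :: F
end

foo1(x) = x^1
foo2(x) = x^2

# Many objects, but only two distinct concrete element types:
v1 = [Sample(Float64(i), foo1) for i in 1:100]
v2 = [Sample(Float64(i), foo2) for i in 1:100]
sorted = (v1, v2)

sum_sorted() = 0.0
function sum_sorted(v::AbstractVector, rest...)
    b = 0.0
    for s in v                      # homogeneous Vector: fast, no dispatch
        b += s.a * s.foo(2.0)
    end
    return b + sum_sorted(rest...)  # recursion unrolls over the tuple
end

sum_sorted(sorted...)  # 30300.0
```

The compiler specializes sum_sorted once per concrete Vector type, so the cost of the type heterogeneity is paid at compile time rather than on every iteration.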

3 Likes