I have a set of structs and I want to loop over them. The following code shows two possible loops, together with sequential one-by-one access to the structs:
using BenchmarkTools

struct Sample{T<:AbstractFloat, F<:Function}
    a :: T
    foo :: F
end

function main()
    foo1(x) = x^1
    foo2(x) = x^2
    foo3(x) = x^3

    s1 = Sample(1., foo1)
    s2 = Sample(2., foo2)
    s3 = Sample(3., foo3)

    samples = (s1, s2, s3)

    @btime begin
        b = 0.
        for i=1:length($samples)
            b = b + $samples[i].a * $samples[i].foo(2.)
        end
    end

    @btime begin
        b = 0.
        for s in $samples
            b = b + s.a * s.foo(2.)
        end
    end

    @btime begin
        b = 0.
        b = b + $samples[1].a * $samples[1].foo(2.)
        b = b + $samples[2].a * $samples[2].foo(2.)
        b = b + $samples[3].a * $samples[3].foo(2.)
    end
end

main()
As you can see, both loops result in memory allocations (though I am surprised that the number of allocations differs between them). In turn, the sequential calls to the structs give no allocations. Can you please tell me how to avoid the allocations in these kinds of loops? Thank you.
I don’t know what’s going on here, but timings like 0.017 ns (0 allocations: 0 bytes), i.e. fractions of a nanosecond, typically indicate that something’s wrong with the benchmark.
The way you wrote Sample, s1, s2 and s3 will all have different types.
It is not possible to sensibly store more than a handful of these in a collection while retaining the detailed type information that is needed for performant code. Hence, you cannot quickly loop over type-heterogeneous collections of samples. This is a fundamental property of Julia (but depending on your application, “slow” loops may be fast enough).
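To see the heterogeneity concretely, here is a minimal sketch reusing the Sample struct from the original post: every named function has its own singleton type, so the F parameter, and hence the whole Sample type, differs per instance:

```julia
struct Sample{T<:AbstractFloat, F<:Function}
    a::T
    foo::F
end

foo1(x) = x^1
foo2(x) = x^2

s1 = Sample(1.0, foo1)
s2 = Sample(2.0, foo2)

# typeof(foo1) and typeof(foo2) are distinct singleton types, so the
# Sample instances have different concrete types:
typeof(s1) == typeof(s2)  # false
```

This is why a Tuple (which records each element's type) can stay fast while an ordinary Vector cannot.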
However, if you deal with tuples of known length you can unroll the loop, i.e.:
b = b + s1.a * s1.foo(2.0)
b = b + s2.a * s2.foo(2.0)
b = b + s3.a * s3.foo(2.0)
Julia is sometimes able to optimize code like the one you wrote into the above, but I recommend against relying on it.
The way you wrote Sample, s1, s2 and s3 will all have different types.
Does this happen because of the Function field? Can I do something about it, like specify the type of the function’s output?
However, if you deal with known length tuples you can unroll the loop
The number of elements changes from run to run. Can I do the loop unrolling using some kind of metaprogramming magic?
I still do not understand why the result of the loop is so different from the result of the unrolling. I thought that under the hood Julia does exactly the same thing.
Why did you think a loop and unrolled code would be exactly the same? Unrolling something like for i in 1:1000000 is clearly not a good idea. A loop, by definition, executes the same code multiple times. Since samples contains heterogeneous types, “the same code” needs to be something generic that can handle values of different types.
Of course, a loop can be unrolled but it is not obvious for code like this when doing so is beneficial.
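To make the cost of that generic path concrete, a minimal sketch (reusing the Sample struct from the original post, but with a Vector instead of a Tuple): an array must pick a single element type, which here can only be abstract, so each iteration dispatches s.foo at runtime:

```julia
struct Sample{T<:AbstractFloat, F<:Function}
    a::T
    foo::F
end

foo1(x) = x^1
foo2(x) = x^2

# The array's element type is abstract (the compiler cannot know which
# concrete foo each element carries), so the loop body below is generic code.
samples = [Sample(1.0, foo1), Sample(2.0, foo2)]

function total(samples)
    b = 0.0
    for s in samples
        b += s.a * s.foo(2.0)  # dynamic dispatch on every iteration
    end
    return b
end

total(samples)  # 1.0*2.0 + 2.0*4.0 = 10.0
```

With a Tuple, by contrast, the element types are part of the Tuple type itself, which is what makes compile-time unrolling possible at all.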
Well, that makes sense. However, in my case the number of loop cycles will be small (less than ten). Can I manually force the compiler to unroll the loop?
This sounds premature to me. Looping 10 times, I’m not sure the allocations are going to break the bank. I feel it would be better to write the code first and then determine where the most time is being spent…
This is the bottleneck so far. These small loops appear inside an external loop and are executed many times. Moreover, the whole code is launched on GPU.
using BenchmarkTools
using Unrolled

struct Sample{T<:AbstractFloat, F<:Function}
    a :: T
    foo :: F
end

@unroll function do_sum(samples)
    b = 0.
    @unroll for s in samples
        b = b + s.a * s.foo(2.)
    end
    return b
end

frec() = 0.
frec(s, ss...) = s.a * s.foo(2.) + frec(ss...)

function main()
    foo1(x) = x^1
    foo2(x) = x^2
    foo3(x) = x^3

    s1 = Sample(1., foo1)
    s2 = Sample(2., foo2)
    s3 = Sample(3., foo3)

    samples = (s1, s2, s3)

    @btime begin
        b = 0.
        for s in $samples
            b = b + s.a * s.foo(2.)
        end
    end

    @btime $do_sum($samples)
    @btime $frec($samples...)
end

main()
Though I still have to check how these approaches work inside CUDA kernels. I guess there will be issues with the recursive solution (though it is really elegant and I like it).
It is worth noting that the reason why both the @unroll-ed and the recursive functions are fast is that the types of the three Samples are stored in the Tuple type. If any of the .foo fields were to change, the Tuple type would change and the entire do_sum and frec functions would need to be recompiled.
It looks like in this case the number of objects is small, so a tuple-based approach is probably best. Just wanted to note that in the case that the number of objects is large, but the number of different types is a lot smaller, you can use a tuple of concretely typed Vectors, unroll the loop over the different vectors, but then iterate over each of the vectors as normal. This approach is automated in TypeSortedCollections.jl (I’m the main author). See also Looping over different types with common behavior - #5 by tkoolen.
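For illustration, a hand-rolled sketch of that tuple-of-vectors idea (the names here are hypothetical, not the TypeSortedCollections.jl API): group the samples by concrete type into concretely typed Vectors, unroll over the tuple of groups via recursion, and loop over each Vector as a normal type-stable loop:

```julia
struct Sample{T<:AbstractFloat, F<:Function}
    a::T
    foo::F
end

foo1(x) = x^1
foo2(x) = x^2

# Two homogeneous, concretely typed groups (many elements, few types):
group1 = [Sample(1.0, foo1), Sample(3.0, foo1)]
group2 = [Sample(2.0, foo2)]
groups = (group1, group2)  # Tuple: number and types of groups known at compile time

# Recursion over the tuple unrolls the outer "loop" at compile time;
# the inner sum over each Vector is an ordinary type-stable loop.
sum_groups() = 0.0
sum_groups(g, rest...) = sum(s.a * s.foo(2.0) for s in g) + sum_groups(rest...)

sum_groups(groups...)  # 1*2 + 3*2 + 2*4 = 16.0
```

This keeps the per-element work allocation-free while only paying the unrolling cost once per distinct type rather than once per element.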