# Manually unroll operations with objects of tuple

I am trying to overcome regressions when moving to Julia 0.7. Let’s say I have the function

```
function next_collision(p::AbstractParticle{T}, bt::Tuple)::Tuple{T,Int} where {T}
    tmin::T = T(Inf)
    ind::Int = 0
    for i in eachindex(bt)
        tcol::T = collisiontime(p, bt[i])
        # Set minimum time:
        if tcol < tmin
            tmin = tcol
            ind = i
        end
    end # obstacle loop
    return tmin, ind
end
```

I want to make it faster, because indexing with `bt[i]` is not type-stable: the elements of the tuple `bt` are not all of the same type. They are all subtypes of the same abstract type, but they are not the same concrete type.
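To illustrate the inference problem (with hypothetical stand-in types, not my real ones): with a runtime index, the compiler can only infer `bt[i]` as the join of the element types, which is the abstract supertype, so every call on `bt[i]` dispatches dynamically.

```julia
# Hypothetical stand-in types, only to illustrate the inference problem.
abstract type Obst end
struct Wall <: Obst end
struct Disk <: Obst end

bt = (Wall(), Disk(), Wall())
# With a runtime index i, the best the compiler can infer for bt[i]
# is the join of the element types, i.e. the abstract supertype:
typejoin(Wall, Disk)  # Obst
```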

I thought I could use metaprogramming to “unroll” the loop, similar to what the package Unrolled.jl does.

But I don’t know how to get the length of a tuple from its type. This is what I have:

```
@generated function next_collision3(p::AbstractParticle{T}, bt::TUP) where {T, TUP}
    L = length(TUP)

    quote
        i::Int = 0; ind::Int = 0
        tmin::T = T(Inf)
        for i in 1:L
            let x = bt[i]
                tcol::T = collisiontime(p, x)
                # Set minimum time:
                if tcol < tmin
                    tmin = tcol
                    ind = i
                end
            end
        end
    end
end
```

`bt` is a Tuple, but `length(TUP)` doesn’t work. The actual length of a tuple is known from its type, right?

EDIT: The answer of how to do this, by slightly modifying an answer by @mauro3, is:

```
@generated function next_collision(p::AbstractParticle{T}, bt::TUP) where {T, TUP}
    L = fieldcount(TUP)

    out = quote
        i = 0; ind = 0
        tmin = T(Inf)
    end

    for j = 1:L
        push!(out.args, quote
            let x = bt[$j]
                tcol = collisiontime(p, x)
                # Set minimum time:
                if tcol < tmin
                    tmin = tcol
                    ind = $j
                end
            end
        end)
    end
    push!(out.args, :(return tmin, ind))
    return out
end
```

The result is non-allocating and benchmarks almost exactly equal to the sum of calling `collisiontime` individually for each entry in the tuple.


This works:

```
julia> length((typeof((4,5,5.0))).parameters)
3
```

Not sure that it is the best way.


Thanks mauro, I fixed it, in the sense that I made the function run:

```
@generated function next_collision3(p::AbstractParticle{T}, bt::TUP) where {T, TUP}
    quote
        L = length(TUP.parameters)
        i::Int = 0; ind::Int = 0
        tmin::T = T(Inf)
        for i in 1:L
            let x = bt[i]
                tcol::T = collisiontime(p, x)
                # Set minimum time:
                if tcol < tmin
                    tmin = tcol
                    ind = i
                end
            end
        end
        tmin, ind
    end
end
```

The problem is that it is just as slow as the function I posted originally, without any metaprogramming:

```
julia> @btime next_collision3($p, $bt.obstacles)
  200.746 ns (2 allocations: 128 bytes)
(0.39757685533455417, 1)

julia> @btime next_collision($p, $bt.obstacles)
  199.110 ns (2 allocations: 128 bytes)
(0.39757685533455417, 1)
```

Can’t you use a StaticArray instead of the TUP? My understanding is that they do those optimizations for you.

The objects of the tuple are not of the same type, which is why I use a tuple instead of an array. Maybe I have misunderstood something?

EDIT: To clarify more, all objects of this tuple are subtypes of an abstract type (`Obstacle`) but they are not the same concrete type.
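For reference, a quick sketch (again with hypothetical stand-in types) of why an array does not help here: an array has a single element type, so mixing concrete subtypes falls back to the abstract supertype, while a tuple keeps each concrete type in its type parameters.

```julia
# Hypothetical stand-in types for the obstacles.
abstract type Obstacle end
struct Wall <: Obstacle end
struct Disk <: Obstacle end

v  = [Wall(), Disk()]   # Vector{Obstacle}: every access dispatches dynamically
bt = (Wall(), Disk())   # Tuple{Wall,Disk}: each concrete type is preserved

eltype(v)    # Obstacle
typeof(bt)   # Tuple{Wall,Disk}
```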

In Julia 0.6, using the `unrolled_map` function from the Unrolled.jl package, the same function takes 40 ns with 0 allocations. Posting this just for comparison purposes.


Ok, now looking at your code more closely, I don’t see any unrolling done: you still have a loop in your quote-block. Indeed looping over a tuple with heterogeneous types will be type unstable.

Something along these lines might work:

```
@generated function next_collision3(p::AbstractParticle{T}, bt::TUP) where {T, TUP}
    L = length(TUP.parameters)

    out = quote
        i = 0; ind = 0
        tmin = T(Inf)
    end

    for i = 1:L
        push!(out.args, quote
            tcol = collisiontime(p, x)
            # Set minimum time:
            if tcol < tmin
                tmin = tcol
                ind = i
            end
        end)
    end
    push!(out.args, :(return tmin, ind))
    return out
end
```

Thanks a lot mauro, I appreciate the help. Your code gives an error about `x` being undefined.

I tried to fix it, by changing the second quote to

```
quote
    let x = bt[i]
        tcol = collisiontime(p, x)
        # Set minimum time:
        if tcol < tmin
            tmin = tcol
            ind = i
        end
    end
end
```

instead. But then something unexpected happened: when I run the function, `bt` is accessed with index `0`. I am guessing this comes from the first quote, so I changed the first quote to initialize `i = 1` instead of `0`. Now the function runs and produces the correct numbers. However, your version is dramatically slower than anything else:

```
julia> @btime next_collision3($p, $bt.obstacles)
  880.000 ns (10 allocations: 640 bytes)
(0.39757685533455417, 1)
```

Could it be because my fix for the `UndefVarError` for `x` was wrong?

```
julia> nfields(typeof((42, "bar", :foo)))
3
```

```
┌ Warning: `nfields(::DataType)` is deprecated, use `fieldcount` instead
```

but thanks for letting me know!

Sorry, I missed the fact that you are using v0.7.

Following @mauro3’s suggestion, I have “fixed” the errors using interpolation:

```
@generated function next_collision3(p::AbstractParticle{T}, bt::TUP) where {T, TUP}
    L = length(TUP.parameters)

    out = quote
        i = 0; ind = 0
        tmin = T(Inf)
    end

    for j = 1:L
        push!(out.args, quote
            x = bt[$j]
            tcol = collisiontime(p, x)
            # Set minimum time:
            if tcol < tmin
                tmin = tcol
                ind = $j
            end
        end)
    end
    push!(out.args, :(return tmin, ind))
    return out
end
```

However, this function still benchmarks just the same as the (slow) one in my third post, i.e.:

```
@generated function next_collision3(p::AbstractParticle{T}, bt::TUP) where {T, TUP}
    quote
        L = length(TUP.parameters)
        i::Int = 0; ind::Int = 0
        tmin::T = T(Inf)
        for i in 1:L
            let x = bt[i]
                tcol::T = collisiontime(p, x)
                # Set minimum time:
                if tcol < tmin
                    tmin = tcol
                    ind = i
                end
            end
        end
        tmin, ind
    end
end
```

EDIT: For clarity, doing `let x = bt[$j]` in the first function makes no difference.

Not sure what’s going wrong then. Are there still allocations? Is the output type of `collisiontime` type-stable under all types contained in TUP and is it equal to `T`? How long are your tuples (at some point it will be better to loop)?

For debugging, I would hand-translate the quoted code for an example of a small tuple and see whether this performs as expected and has the right `@code_warntype` output.

Also, another thing you can look at is how Unrolled.jl does it by expanding their macro-call.

Last you could try lispy recursion instead of a generated function, e.g.: 0.6: how do you broadcast `call` over a vector of functions?
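A minimal sketch of that lispy-recursion alternative, with a stand-in `cost` function in place of `collisiontime` and hypothetical types: peel one element off the tuple per call with `Base.tail`, so each call is compiled for the concrete type of the first element.

```julia
abstract type Obst end
struct Wall <: Obst; a::Float64; end
struct Disk <: Obst; a::Float64; end

# Stand-ins for the real collisiontime methods.
cost(o::Wall) = o.a + 1.0
cost(o::Disk) = o.a * 2.0

# Base case: tuple exhausted, return the best found so far.
_nc(bt::Tuple{}, i, tmin, ind) = (tmin, ind)
# Recursive case: first(bt) is concretely typed, so this call
# specializes; then recurse on the rest of the tuple.
function _nc(bt::Tuple, i, tmin, ind)
    t = cost(first(bt))
    t < tmin ? _nc(Base.tail(bt), i + 1, t, i) :
               _nc(Base.tail(bt), i + 1, tmin, ind)
end
nc_rec(bt::Tuple) = _nc(bt, 1, Inf, 0)

nc_rec((Wall(0.5), Disk(0.2), Wall(2.0)))  # (0.4, 2)
```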

For more help from my side I need a MWE.


@mauro3

Yeap, `collisiontime` is type-stable and always gives `T` result for all elements in `TUP`. The tuples are typically less than 10 elements. There are always 2 allocations, even though the `collisiontime` function is non-allocating.

I’ve made a MWE, but I really cannot understand what’s going on!

```
abstract type S{T} end
struct A{T} <: S{T}
    a::T
end
struct B{T} <: S{T}
    a::T
end
struct C{T} <: S{T}
    a::T
end

function collisiontime(s::S{T})::T where {T}
    sin(s.a) + 0.1s.a
end

bt = (A(0.1), A(0.2), A(0.3), B(0.4), C(0.5))

@generated function nc(t::T, bt::TUP) where {T, TUP}
    L = fieldcount(TUP)

    out = quote
        i = 0; ind = 0
        tmin = T(Inf)
    end

    for j = 1:L
        push!(out.args, quote
            let x = bt[$j]
                tcol = collisiontime(x)
                # Set minimum time:
                if tcol < tmin
                    tmin = tcol
                    ind = $j
                end
            end
        end)
    end
    push!(out.args, :(return tmin, ind))
    return out
end

julia> @btime nc(0.1, $bt)
  32.426 ns (0 allocations: 0 bytes)
(0.10983341664682816, 1)

function nc_normal(t::T, bt::TUP) where {T, TUP}
    tmin::T = T(Inf)
    ind::Int = 0
    for i in eachindex(bt)
        tcol::T = collisiontime(bt[i])
        # Set minimum time:
        if tcol < tmin
            tmin = tcol
            ind = i
        end
    end # obstacle loop
    return tmin, ind
end

julia> @btime nc_normal(0.1, $bt)
  206.136 ns (10 allocations: 320 bytes)
(0.10983341664682816, 1)
```

Here the metaprogramming approach works perfectly! Not only that, but the naive version does 2 allocations per element of the tuple, instead of 2 allocations overall…

I don’t know where to look in my actual code to see where the problem is and the metaprogramming approach doesn’t work.

I followed your suggestion and wrote out the loop explicitly:

```
function nc(p::AbstractParticle{T}, bt::TUP) where {T, TUP}
    i = 0; ind = 0
    tmin = T(Inf)
    let x = bt[1]
        tcol = collisiontime(p, x)
        # Set minimum time:
        if tcol < tmin
            tmin = tcol
            ind = 1
        end
    end
    let x = bt[2]
        tcol = collisiontime(p, x)
        # Set minimum time:
        if tcol < tmin
            tmin = tcol
            ind = 2
        end
    end
    let x = bt[3]
        tcol = collisiontime(p, x)
        # Set minimum time:
        if tcol < tmin
            tmin = tcol
            ind = 3
        end
    end
    let x = bt[4]
        tcol = collisiontime(p, x)
        # Set minimum time:
        if tcol < tmin
            tmin = tcol
            ind = 4
        end
    end
    let x = bt[5]
        tcol = collisiontime(p, x)
        # Set minimum time:
        if tcol < tmin
            tmin = tcol
            ind = 5
        end
    end
    return tmin, ind
end
```

Unfortunately, calling this function with the real structs I use in my actual code returns

```
julia> @btime nc($p, $bt.obstacles)
  193.173 ns (2 allocations: 128 bytes)
(0.39757685533455417, 1)
```

The biggest problem is that the `@code_warntype` output for `nc` is perfectly okay: there is not a single `Any` or anything else suspicious, and it deduces that the function returns a tuple of (float, int). I don’t know where to look, so any tips would be appreciated!

Could `collisiontime` allocate for some inputs without it showing up in `@code_warntype`, because you specify the return type (which inserts a conversion)? Can you write it like:

```
function collisiontime(s::S{T}) where {T}
    ...
end
```

Ok, I got an example which shows the same behavior you see. Replace this in your MWE:

```
function collisiontime(s::S{T})::T where {T}
    rand() > 0.5 ? 1 : rand() > 0.5 ? BigInt(1) : sin(s.a) + 0.1s.a  # make it type-unstable
end
```

Note that `@code_warntype nc(1.0, bt)` is clean, but

```
julia> @btime nc(0.1, $bt)
  140.382 ns (2 allocations: 45 bytes)
(0.21866933079506123, 2)
```

If you drop the return type annotation of `collisiontime`, code-warntype will show reds.
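As I understand it, the mechanism is that a `::T` return annotation wraps the returned value in `convert(T, ...)`, so inference reports a clean return type even though the body itself is type-unstable. A minimal sketch (with a hypothetical function name):

```julia
# The ternary body can produce Int or Float64, but the ::Float64
# return annotation inserts a convert, so the inferred return type
# is a clean Float64 and the instability is hidden from @code_warntype.
unstable_hidden()::Float64 = rand() > 0.5 ? 1 : 2.0

Base.return_types(unstable_hidden, Tuple{})  # only Float64, no Union
```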

Thanks for the effort mauro, but this is definitely not the root of the problem for me. I changed all my function definitions to be just like you posted, i.e. guaranteed to return a type-stable result without a return annotation, e.g.:

```
function collisiontime(p::Particle{T}, w::Wall{T}) where {T}
    n = normalvec(w, p.pos)
    denom = dot(p.vel, n)
    denom >= 0.0 ? T(Inf) : dot(w.sp - p.pos, n)/denom
end
```

Yet I get identical performance, even for the function where I explicitly wrote out the 5 steps of the loop. I still get

```
julia> @btime DynamicalBilliards.nc($p, $bt.obstacles)
  201.187 ns (2 allocations: 128 bytes)
(0.39757685533455417, 1)
```

Once again, notice that the `collisiontime` function itself takes 3 ns for any element of this tuple, so I highly doubt that the problem comes from this function.

Yet the example I posted with the dummy types `S, A, B, C` works perfectly fine. I don’t get it! My types are not that complex; they have 3 fields instead of one, but other than that not much changes…

Can you post an MWE that actually fails?

Thanks all very much for the help. I tried to create an MWE, reducing my code more and more until the problem disappeared. I think I know the problem now:

```
julia> s = rand(SVector{2})
2-element SArray{Tuple{2},Float64,1,2}:
 0.21282361366456848
 0.19330729027032745

julia> @btime normalize($s)
  156.524 ns (2 allocations: 128 bytes)
2-element Array{Float64,1}:
 0.7402320962786952
 0.6723514286731668
```

I was using `normalize` in one of my functions. In Julia 0.6 that was always fast, but here it seems not to be the case: it returns a heap-allocated `Array` instead of an `SVector`. I’ll open an issue at StaticArrays.


Great, thanks a lot everyone for the support. I am sorry; I shouldn’t have assumed anything to be working since we are switching to `0.7`.

So defining my own

```
function normalize(s::SVector{2})
    n = √(s[1]*s[1] + s[2]*s[2])
    return SVector{2}(s[1]/n, s[2]/n)
end
```

makes the metaprogramming function I posted in post 12 non-allocating and super fast:

```
  23.466 ns (0 allocations: 0 bytes)
```

I have edited my original question and now also show the answer to it!