# Manually unroll operations with objects of tuple

I am trying to overcome regressions when moving to Julia 0.7. Let’s say I have the function

```
function next_collision(p::AbstractParticle{T}, bt::Tuple)::Tuple{T,Int} where {T}
    tmin::T = T(Inf)
    ind::Int = 0
    for i in eachindex(bt)
        tcol::T = collisiontime(p, bt[i])
        # Set minimum time:
        if tcol < tmin
            tmin = tcol
            ind = i
        end
    end # obstacle loop
    return tmin, ind
end
```

I want to make it faster, because indexing with `bt[i]` is not type-stable: the elements of the tuple `bt` are not all of the same type. They are all subtypes of the same abstract type, but they are not the same concrete type.
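To illustrate the inference problem (with hypothetical stand-in types, not my real ones): with a runtime index, the compiler can only infer `bt[i]` as the join of the element types, which is the abstract supertype, so every call on `bt[i]` dispatches dynamically.

```julia
# Hypothetical stand-in types, only to illustrate the inference problem.
abstract type Obst end
struct Wall <: Obst end
struct Disk <: Obst end

bt = (Wall(), Disk(), Wall())
# With a runtime index i, the best the compiler can infer for bt[i]
# is the join of the element types, i.e. the abstract supertype:
typejoin(Wall, Disk)  # Obst
```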

I thought I could use metaprogramming to “unroll” the loop, similar to what the package Unrolled.jl does.

But I don’t know how to get the length of a tuple from its type. This is what I have:

```
@generated function next_collision3(p::AbstractParticle{T}, bt::TUP) where {T, TUP}
    L = length(TUP)

    quote
        i::Int = 0; ind::Int = 0
        tmin::T = T(Inf)
        for i in 1:L
            let x = bt[i]
                tcol::T = collisiontime(p, x)
                # Set minimum time:
                if tcol < tmin
                    tmin = tcol
                    ind = i
                end
            end
        end
    end
end
```

`bt` is a Tuple, but `length(TUP)` doesn’t work. The actual length of a tuple is known from its type, right?

EDIT: The answer of how to do this, by slightly modifying an answer by @mauro3, is:

```
@generated function next_collision(p::AbstractParticle{T}, bt::TUP) where {T, TUP}
    L = fieldcount(TUP)

    out = quote
        i = 0; ind = 0
        tmin = T(Inf)
    end

    for j = 1:L
        push!(out.args, quote
            let x = bt[$j]
                tcol = collisiontime(p, x)
                # Set minimum time:
                if tcol < tmin
                    tmin = tcol
                    ind = $j
                end
            end
        end)
    end
    push!(out.args, :(return tmin, ind))
    return out
end
```

The result is non-allocating and benchmarks almost exactly equal to the sum of calling `collisiontime` individually for each entry in the tuple.


This works:

```
julia> length((typeof((4,5,5.0))).parameters)
3
```

Not sure that it is the best way.


Thanks mauro, I fixed it, in the sense that I made the function run:

```
@generated function next_collision3(p::AbstractParticle{T}, bt::TUP) where {T, TUP}
    quote
        L = length(TUP.parameters)
        i::Int = 0; ind::Int = 0
        tmin::T = T(Inf)
        for i in 1:L
            let x = bt[i]
                tcol::T = collisiontime(p, x)
                # Set minimum time:
                if tcol < tmin
                    tmin = tcol
                    ind = i
                end
            end
        end
        tmin, ind
    end
end
```

The problem is that it is just as slow as the function I posted originally, without any metaprogramming:

```
julia> @btime next_collision3($p, $bt.obstacles)
  200.746 ns (2 allocations: 128 bytes)
(0.39757685533455417, 1)

julia> @btime next_collision($p, $bt.obstacles)
  199.110 ns (2 allocations: 128 bytes)
(0.39757685533455417, 1)
```

Can’t you use a StaticArray instead of the TUP? My understanding is that they do those optimizations for you.

The objects of the tuple are not of the same type, which is why I use a tuple instead of an array. Maybe I have misunderstood something?

EDIT: To clarify more, all objects of this tuple are subtypes of an abstract type (`Obstacle`) but they are not the same concrete type.
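For reference, a quick sketch (again with hypothetical stand-in types) of why an array does not help here: an array has a single element type, so mixing concrete subtypes falls back to the abstract supertype, while a tuple keeps each concrete type in its type parameters.

```julia
# Hypothetical stand-in types for the obstacles.
abstract type Obstacle end
struct Wall <: Obstacle end
struct Disk <: Obstacle end

v  = [Wall(), Disk()]   # Vector{Obstacle}: every access dispatches dynamically
bt = (Wall(), Disk())   # Tuple{Wall,Disk}: each concrete type is preserved

eltype(v)    # Obstacle
typeof(bt)   # Tuple{Wall,Disk}
```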

In Julia 0.6, using the `unrolled_map` function from the Unrolled.jl package, the same function takes 40 ns with 0 allocations. Posting this just for comparison purposes.


Ok, now looking at your code more closely, I don’t see any unrolling done: you still have a loop in your quote-block. Indeed looping over a tuple with heterogeneous types will be type unstable.

Something along these lines might work:

```
@generated function next_collision3(p::AbstractParticle{T}, bt::TUP) where {T, TUP}
    L = length(TUP.parameters)

    out = quote
        i = 0; ind = 0
        tmin = T(Inf)
    end

    for i = 1:L
        push!(out.args, quote
            tcol = collisiontime(p, x)
            # Set minimum time:
            if tcol < tmin
                tmin = tcol
                ind = i
            end
        end)
    end
    push!(out.args, :(return tmin, ind))
    return out
end
```

Thanks a lot mauro, I appreciate the help. Your code gives an error about `x` being undefined.

I tried to fix it, by changing the second quote to

```
quote
    let x = bt[i]
        tcol = collisiontime(p, x)
        # Set minimum time:
        if tcol < tmin
            tmin = tcol
            ind = i
        end
    end
end
```

instead. But then something unexpected happened: when I run the function, `bt` is accessed with index `0`. I am guessing this comes from the first quote, so I changed the first quote to initialize `i = 1` instead of `0`. Now the function runs and produces the correct numbers. However, your version is dramatically slower than anything else:

```
julia> @btime next_collision3($p, $bt.obstacles)
  880.000 ns (10 allocations: 640 bytes)
(0.39757685533455417, 1)
```

Could it be because my fix for the `UndefVarError` for `x` was wrong?

```
julia> nfields(typeof((42, "bar", :foo)))
3
```

```
┌ Warning: `nfields(::DataType)` is deprecated, use `fieldcount` instead
```

but thanks for letting me know!

Sorry, I missed the fact that you are using v0.7.

Following @mauro3’s suggestion, I have “fixed” the errors using interpolation:

```
@generated function next_collision3(p::AbstractParticle{T}, bt::TUP) where {T, TUP}
    L = length(TUP.parameters)

    out = quote
        i = 0; ind = 0
        tmin = T(Inf)
    end

    for j = 1:L
        push!(out.args, quote
            x = bt[$j]
            tcol = collisiontime(p, x)
            # Set minimum time:
            if tcol < tmin
                tmin = tcol
                ind = $j
            end
        end)
    end
    push!(out.args, :(return tmin, ind))
    return out
end
```

However, this function still benchmarks just the same as the (slow) one in my third post, i.e.:

```
@generated function next_collision3(p::AbstractParticle{T}, bt::TUP) where {T, TUP}
    quote
        L = length(TUP.parameters)
        i::Int = 0; ind::Int = 0
        tmin::T = T(Inf)
        for i in 1:L
            let x = bt[i]
                tcol::T = collisiontime(p, x)
                # Set minimum time:
                if tcol < tmin
                    tmin = tcol
                    ind = i
                end
            end
        end
        tmin, ind
    end
end
```

EDIT: For clarity, doing `let x = bt[$j]` in the first function makes no difference.

Not sure what’s going wrong then. Are there still allocations? Is the output type of `collisiontime` type-stable under all types contained in TUP and is it equal to `T`? How long are your tuples (at some point it will be better to loop)?

For debugging, I would hand-translate the quoted code for an example of a small tuple and see whether this performs as expected and has the right `@code_warntype` output.

Also, another thing you can look at is how Unrolled.jl does it by expanding their macro-call.

Last you could try lispy recursion instead of a generated function, e.g.: 0.6: how do you broadcast `call` over a vector of functions?
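A minimal sketch of that lispy-recursion alternative, with a stand-in `cost` function in place of `collisiontime` and hypothetical types: peel one element off the tuple per call with `Base.tail`, so each call is compiled for the concrete type of the first element.

```julia
abstract type Obst end
struct Wall <: Obst; a::Float64; end
struct Disk <: Obst; a::Float64; end

# Stand-ins for the real collisiontime methods.
cost(o::Wall) = o.a + 1.0
cost(o::Disk) = o.a * 2.0

# Base case: tuple exhausted, return the best found so far.
_nc(bt::Tuple{}, i, tmin, ind) = (tmin, ind)
# Recursive case: first(bt) is concretely typed, so this call
# specializes; then recurse on the rest of the tuple.
function _nc(bt::Tuple, i, tmin, ind)
    t = cost(first(bt))
    t < tmin ? _nc(Base.tail(bt), i + 1, t, i) :
               _nc(Base.tail(bt), i + 1, tmin, ind)
end
nc_rec(bt::Tuple) = _nc(bt, 1, Inf, 0)

nc_rec((Wall(0.5), Disk(0.2), Wall(2.0)))  # (0.4, 2)
```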

For more help from my side I need a MWE.


@mauro3

Yeap, `collisiontime` is type-stable and always gives `T` result for all elements in `TUP`. The tuples are typically less than 10 elements. There are always 2 allocations, even though the `collisiontime` function is non-allocating.

I’ve made a MWE, but I really cannot understand what’s going on!

```
abstract type S{T} end
struct A{T} <: S{T}
    a::T
end
struct B{T} <: S{T}
    a::T
end
struct C{T} <: S{T}
    a::T
end

function collisiontime(s::S{T})::T where {T}
    sin(s.a) + 0.1s.a
end

bt = (A(0.1), A(0.2), A(0.3), B(0.4), C(0.5))

@generated function nc(t::T, bt::TUP) where {T, TUP}
    L = fieldcount(TUP)

    out = quote
        i = 0; ind = 0
        tmin = T(Inf)
    end

    for j = 1:L
        push!(out.args, quote
            let x = bt[$j]
                tcol = collisiontime(x)
                # Set minimum time:
                if tcol < tmin
                    tmin = tcol
                    ind = $j
                end
            end
        end)
    end
    push!(out.args, :(return tmin, ind))
    return out
end

julia> @btime nc(0.1, $bt)
  32.426 ns (0 allocations: 0 bytes)
(0.10983341664682816, 1)

function nc_normal(t::T, bt::TUP) where {T, TUP}
    tmin::T = T(Inf)
    ind::Int = 0
    for i in eachindex(bt)
        tcol::T = collisiontime(bt[i])
        # Set minimum time:
        if tcol < tmin
            tmin = tcol
            ind = i
        end
    end # obstacle loop
    return tmin, ind
end

julia> @btime nc_normal(0.1, $bt)
  206.136 ns (10 allocations: 320 bytes)
(0.10983341664682816, 1)
```

Here the metaprogramming approach works perfectly! Not only that, but the naive version does 2 allocations per element of the tuple, instead of 2 allocations overall…

I don’t know where to look in my actual code to see where the problem is and the metaprogramming approach doesn’t work.

I followed your suggestion and wrote out the loop explicitly:

```
function nc(p::AbstractParticle{T}, bt::TUP) where {T, TUP}
    i = 0; ind = 0
    tmin = T(Inf)
    let x = bt[1]
        tcol = collisiontime(p, x)
        # Set minimum time:
        if tcol < tmin
            tmin = tcol
            ind = 1
        end
    end
    let x = bt[2]
        tcol = collisiontime(p, x)
        # Set minimum time:
        if tcol < tmin
            tmin = tcol
            ind = 2
        end
    end
    let x = bt[3]
        tcol = collisiontime(p, x)
        # Set minimum time:
        if tcol < tmin
            tmin = tcol
            ind = 3
        end
    end
    let x = bt[4]
        tcol = collisiontime(p, x)
        # Set minimum time:
        if tcol < tmin
            tmin = tcol
            ind = 4
        end
    end
    let x = bt[5]
        tcol = collisiontime(p, x)
        # Set minimum time:
        if tcol < tmin
            tmin = tcol
            ind = 5
        end
    end
    return tmin, ind
end
```

Unfortunately, calling this function with the real structs I use in my actual code returns

```
julia> @btime nc($p, $bt.obstacles)
  193.173 ns (2 allocations: 128 bytes)
(0.39757685533455417, 1)
```

The biggest problem is that the `@code_warntype` output for `nc` is perfectly okay: there is not a single `Any` or anything else suspicious, and it deduces that the function returns a tuple of (float, int). I don’t know where to look, so any tips would be appreciated!

Could `collisiontime` allocate for some inputs without it showing up in `@code_warntype`, because you specify the return type (which inserts a conversion)? Can you write it like:

```
function collisiontime(s::S{T}) where {T}
    ...
end
```

Ok, I got an example which shows the same behavior you see. Replace this in your MWE:

```
function collisiontime(s::S{T})::T where {T}
    rand() > 0.5 ? 1 : rand() > 0.5 ? BigInt(1) : sin(s.a) + 0.1s.a  # make it type-unstable
end
```

Note that `@code_warntype nc(1.0, bt)` is clean, but

```
julia> @btime nc(0.1, $bt)
  140.382 ns (2 allocations: 45 bytes)
(0.21866933079506123, 2)
```

If you drop the return type annotation of `collisiontime`, code-warntype will show reds.
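As I understand it, the mechanism is that a `::T` return annotation wraps the returned value in `convert(T, ...)`, so inference reports a clean return type even though the body itself is type-unstable. A minimal sketch (with a hypothetical function name):

```julia
# The ternary body can produce Int or Float64, but the ::Float64
# return annotation inserts a convert, so the inferred return type
# is a clean Float64 and the instability is hidden from @code_warntype.
unstable_hidden()::Float64 = rand() > 0.5 ? 1 : 2.0

Base.return_types(unstable_hidden, Tuple{})  # only Float64, no Union
```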

Thanks for the effort mauro, but this is definitely not the root of the problem for me. I changed all my function definitions to be just like you posted, i.e. guaranteed to return a type-stable result without a return annotation, e.g.:

```
function collisiontime(p::Particle{T}, w::Wall{T}) where {T}
    n = normalvec(w, p.pos)
    denom = dot(p.vel, n)
    denom >= 0.0 ? T(Inf) : dot(w.sp - p.pos, n)/denom
end
```

Yet I get identical performance, even for the function where I explicitly wrote out the 5 steps of the loop. I still get

```
julia> @btime DynamicalBilliards.nc($p, $bt.obstacles)
  201.187 ns (2 allocations: 128 bytes)
(0.39757685533455417, 1)
```

Once again, notice that the `collisiontime` function itself takes 3 ns for any element of this tuple, so I highly doubt that the problem comes from this function.

Yet the example I posted with the dummy types `S, A, B, C` works perfectly fine. I don’t get it! My types are not that complex; they have 3 fields instead of one, but other than that not much changes…

Can you post an MWE that actually fails?

Thanks all very much for the help. I tried to create an MWE, reducing my code more and more until the problem disappeared. I think I know the problem now:

```
julia> s = rand(SVector{2})
2-element SArray{Tuple{2},Float64,1,2}:
 0.21282361366456848
 0.19330729027032745

julia> @btime normalize($s)
  156.524 ns (2 allocations: 128 bytes)
2-element Array{Float64,1}:
 0.7402320962786952
 0.6723514286731668
```

I was using `normalize` in one of my functions. In Julia 0.6 that was always fast, but here it seems not to be the case: it returns a heap-allocated `Array` instead of an `SVector`. I’ll open an issue at StaticArrays.


Great, thanks a lot everyone for the support. I am sorry; I shouldn’t have assumed anything to be working since we are switching to `0.7`.

So defining my own

```
function normalize(s::SVector{2})
    n = √(s[1]*s[1] + s[2]*s[2])
    return SVector{2}(s[1]/n, s[2]/n)
end
```

makes the metaprogramming function I posted in post 12 non-allocating and super fast:

```
  23.466 ns (0 allocations: 0 bytes)
```

I have edited my original question and now also show the answer to it!