Recursion base vs hand-made recursion

Trying to solve the problem d17 of this year’s AOC, I found that, among the proposed solutions, the most efficient seems to be the one that uses recursion.
for example, compared to an iterative one that works on the same principle (see below, under the title iterative solution), it is more than 3 times faster.
I then tried to simulate by hand what (I imagine) a recursive function does,
but I can’t get the same performance.
Compared to the initial version that had 15 us against 1.8 us of the recursive version, I obtained a clear improvement (reaching 2.2 us) “simply” declaring the type of a vector: astk.
However, I couldn’t do better.
I would expect that by doing things properly, the “ad hoc recursion” would be more efficient than the general one.

sol with recursion

function fss(sn, seq, n=0)
    n == length(seq) && return sn
    for a in 0:7
        A = (sn << 3) + a
        out = ((A % 8) ⊻ 1 ⊻ A÷(2^((A % 8) ⊻ 6))) % 8
        if out== seq[n+1]
            pA = fss(A, seq, n + 1)
            !isnothing(pA) && return pA
        end
    end
end
Pr=[2,4,1,6,7,5,4,4,1,7,0,3,5,5,3,0]
using BenchmarkTools
@btime fss(0,  reverse($Pr))  # 1.8us
sol iterative

Pr=[2,4,1,6,7,5,4,4,1,7,0,3,5,5,3,0]

function prevs((A,n),rPr)
    res=Tuple{Int,Int}[]
    for a in 0:7
        Aa=(A << 3)+a
        pA = ((Aa % 8) ⊻ 1 ⊻ Aa ÷ (2^((Aa % 8) ⊻ 6))) % 8
            if pA == rPr[n+1]
            push!(res,(Aa,n+1))
        end
    end
    res
end

function rfss(rPr)
    vpn=prevs((0,0),rPr)
    for _ in 1:15
        vpn=vcat(prevs.(vpn,[rPr])...)
    end
    vpn[1]
end

using BenchmarkTools
@btime rfss(reverse(Pr)) # 6us
hand-made recursion solution

ePr=[2,4,1,6,7,5,4,4,1,7,0,3,5,5,3,0]

function prev(rPr,(A,n),astk=Tuple{Tuple{Int,Int},Int}[],s=0)
    n==length(rPr) && return (A,n)
    res=(-1,0) 
    for a in s:7
        Aa=(A << 3)+a
        a8=(Aa % 8)
        pA = (a8 ⊻ 1 ⊻ Aa ÷ (2^(a8 ⊻ 6))) % 8
        if pA == rPr[n+1]
            res = (Aa,n+1)
            push!(astk,((A,n),a+1))
            break
        end
    end
    res, astk
end

function hr(rPr)
    res, astk = prev(rPr,(0,0))
    while res[2] != length(rPr)
        if res[1] == -1 
            res,s = pop!(astk)
            res, astk = prev(rPr,res,astk,s)
        else
            res, astk = prev(rPr,res,astk)
        end
    end
    res
end


using BenchmarkTools
@btime hr(reverse(ePr)) #15us --> 2.2/3.2 us dopo aver esplicitato tipo del vettore astk

How can the performance of the hand-made recursion solution be further improved?

Repeatedly push!ing and pop!ing could lead to unwanted temporary allocations. A possible fix is to use sizehint! on astk if you have any clue about its max possible size.

1 Like

Sorry, but can you add more parens here, not sure what was the intended precedence regarding and ÷. I could check the actual precedence with JuliaSyntax.jl, but then I still wouldn’t be sure if that’s as intended, so I thought it’s better to ask you to clarify.

This is a non recursion based version (I think):

function fss2(seq)
    A = 0
    i = 1
    while i <= length(seq)
        good = false
        for j in 7:-1:(A % 8)
            t = ((A % 8) ⊻ 1 ⊻ A÷(2^((A % 8) ⊻ 6))) % 8
            seq[i] == t && (good = true; break)
            A += 1
        end
        good && (A <<= 3; i += 1; continue)
        while (A % 8)==0 && A > 0
            A >>= 3
            i -= 1
        end
    end
    return A >> 3
end

Also:

t = ((A % 8) ⊻ 1 ⊻ A÷(2^((A % 8) ⊻ 6))) % 8

seems equivalent to:

t = ((A % 8) ⊻ 1 ⊻ (A >> ((A % 8) ⊻ 6))) % 8

which is 2.5x faster.

1 Like

Should be even better:

function div_8(x)
    y = x % UInt8
    y & 0x7
end
function cryptic_function(a)
    b = div_8(a)
    c = b ⊻ 0x6
    d = a >> c
    e = b ⊻ 0x1 ⊻ d
    div_8(e)
end
1 Like

Using div_8 and cryptic_function(a) this is the simplified function:

function fss2(seq)
    A = 0; i = 1
    while i <= length(seq)
        if seq[i] == cryptic_function(A)
            A <<= 3; i += 1
        else
            A += 1
            while div_8(A)==0 && A > 0
                A >>= 3; i -= 1
            end
        end
    end
    return A >> 3
end
1 Like

This is the desired sequence of operations,

pA = ((Aa % 8) ⊻ 1 ⊻ (Aa ÷ (2^((Aa % 8) ⊻ 6)))) % 8

which is the compact form of this sequence

             B=(A%8)⊻6
            #C= A÷(2^(B%8))  
            C= A÷(2^B) #  B è sempre < 8
            B=B⊻C⊻7
...
            push!(out, B%8) 

which, in turn, is the simplified form of the sequence of operations (indicated in the text of the question) to which the input is subjected to obtain the output.

But here the question I wanted to ask is not so much about these aspects of the solution.
These are the same in both the basic and the hand-made recursion.
Therefore if improved, it would be an improvement for both forms and the gap between the two would not be reduced.

I tried to imitate what (I think but I’m not sure) the iteration mechanism does internally.
For this purpose I thought of using a vector that should act as a stack.
In a first attempt I had initialized the vector as a generic void.
astk=, obtaining an execution time of 15 us.
This execution time is reduced to 2.2/3 us simply by changing the initialization of astk:
astk=Tuple{Tuple{Int,Int},Int}

the question then is: what specific changes to the hand-made solution can improve it.
For example, instead of defining a single vector as a stack, use a different structure. Or store different variables in the stack.
Or, to put it another way, what does the basic recursion do that the hand-made version doesn’t?

1 Like

I’m not entirely sure but at first glance I’d say a stack of 16 (==length(sequence)) values ​​is enough. But I’d have to manually implement a mechanism that does what pop!() does. For example, keep track of the last index where I saved the value to retrieve.
Do you think this is preferable to using push!() and pop!()?

@DNF 's suggestion brings further improvement, but still not enough.

version wo push!() and pop!()
julia> ePr=[2,4,1,6,7,5,4,4,1,7,0,3,5,5,3,0]
16-element Vector{Int64}:
 

julia> function prev(rPr,(A,n),astk,lstk=0,s=0)
       
           n==length(rPr) && return (A,n)
           res=(-1,0) 
           for a in s:7
               Aa=(A << 3)+a
               a8=(Aa % 8)
               pA = (a8 ⊻ 1 ⊻ Aa ÷ (2^(a8 ⊻ 6))) % 8
               if pA == rPr[n+1]
                   res = (Aa,n+1)
                   lstk+=1
                   astk[lstk] = ((A,n),a+1) #push!(astk,((A,n),a+1))
                   break
               end
           end
           res, astk, lstk
       end
prev (generic function with 3 methods)

julia> function hr(rPr)
           astk=Vector{Tuple{Tuple{Int,Int},Int}}(undef,16)
           res, astk, lstk = prev(rPr,(0,0),astk)
           while res[2] != length(rPr)
               if res[1] == -1
                   #res,s = pop!(astk)
                   res,s = astk[lstk]
                   lstk-=1
                   res, astk, lstk = prev(rPr,res,astk,lstk,s)
               else
                   res, astk, lstk = prev(rPr,res,astk,lstk)
               end
           end
           res
       end
hr (generic function with 1 method)

julia> using BenchmarkTools

julia> @btime hr(reverse(ePr)) 
  2.122 μs (86 allocations: 4.55 KiB)
(47910079998866, 16)

julia> @btime hr(reverse(ePr))
  2.044 μs (86 allocations: 4.55 KiB)
(47910079998866, 16)

When benchmarking, remember to interpolate the input variable into the expression, like this:

@btime hr(reverse($ePr))

Here’s what I get.


julia> using BenchmarkTools

julia> @btime hr(reverse(ePr))
  2.044 μs (86 allocations: 4.44 KiB)
(47910079998866, 16)

julia> @btime hr(reverse($ePr))
  1.980 μs (85 allocations: 4.41 KiB)
(47910079998866, 16)

I don’t know if the change is statistically significant

The general reflection that I am able to do (not knowing the internal details of the recursion nor many aspects in the passing of the parameters of the functions) is the following:
the recursion must save the “state of the system” to be able to recover it in case of need. So in general, I think, it saves “everything”.
With the recursion by hand, using the specific knowledge of the case, you can better target the action of “push!”.
The iterative algorithm creates at each step for a certain value of A all the descendants, thus creating a tree that branches up to level 16. Of the leaves of this level, the first one is taken (the one with the lowest value).
The recursive algorithm, on the other hand, generates a single descendant at each step and saves the state from which to continue the search in case it later has to return to this level to change path because the other paths tried do not allow reaching level 16.
This “state” essentially consists of the value A, the level n and the index s (in 0:7) from which to start again to search for the next descendant.

Descendants can be a maximum of 8.
I wonder what is the minimum quantity and the most efficient way to save this state.
For example, if I don’t use push!() and pop!(), can I pass between the various function calls only the address of the “state” and not pass the entire vector together with the state index? How?
In the current version I save as state A, n and a+1 (the point from which to start the search for a new descendant at that level).
The structure used is Vector{Tuple{Tuple{Int,Int},Int}}.
Is it better to use, for example, two distinct vectors: Vector{Tuple{Int,Int}} and Vector{Int}? or Vector{Int} and Vector{Tuple{UInt8,UInt8}} or something else?

PS
I did a test with Vector{Tuple{Int,UInt8}} and Vector{UInt8} and the situation gets much worse (20 times worse) >40us

Here is what I get:

julia> @btime fss2(reverse($Pr))
  333.679 ns (2 allocations: 192 bytes)
47910079998866

(with code in my last post)
The trick is that A is the whole recursion stack with << and >> serving as push! and pop!. The stack A contains the choice of branch 0:7 in each level.

Anyway, even the 2 allocations are the result of reverse and feeding the sequence in reverse gives 0 allocations.

2 Likes

Wonderful solution. I hadn’t seen it yet and I had stopped at the iterative version, which for the moment interested me less. Given the size of the numbers, the problem is intractable via brute force. Even the solutions that exploit the idea of ​​reconstructing the sequence backwards have execution times in the order of ms (the iterative ones) or tens of us (the recursive ones).

PS
I still have to figure out if it is possible to get a better result than the basic recursion, without this “trick”, but by changing something in the recursion hand-made.

PPS
A contains all the state information, even level n would be deducible, in theory, without the need to maintain a specific index.
It would be enough to use log(8,A), but this would be less efficient.

qes=reverse(seq)
julia> @btime fss2($qes)
  368.116 ns (0 allocations: 0 bytes)
47910079998866
2 Likes