Recursive call vs while loop

I have been playing with parametric methods to imitate the template metaprogramming of, say, C++. I see some performance overhead when I use method recursion instead of a loop.

Compare, for instance, the recursion below

julia> function fact1(::Type{Val{N}})::Int where N
         n::Int = N
         return n*fact1(Val{n-1})
       end
fact1 (generic function with 2 methods)

julia> fact1(::Type{Val{0}}) = 1
fact1 (generic function with 2 methods)

julia> @code_llvm fact1(Val{5})

define i64 @julia_fact1_62903(i8**) #0 !dbg !5 {
top:
  %ptls_i8 = call i8* asm "movq %fs:0, $0;\0Aaddq $$-10888, $0", "=r,~{dirflag},~{fpsr},~{flags}"() #3
  %ptls = bitcast i8* %ptls_i8 to i8****
  %1 = alloca [7 x i8**], align 8
  %.sub = getelementptr inbounds [7 x i8**], [7 x i8**]* %1, i64 0, i64 0
  %2 = getelementptr [7 x i8**], [7 x i8**]* %1, i64 0, i64 3
  %3 = getelementptr [7 x i8**], [7 x i8**]* %1, i64 0, i64 5
  %4 = getelementptr [7 x i8**], [7 x i8**]* %1, i64 0, i64 2
  %5 = bitcast i8*** %2 to i8*
  call void @llvm.memset.p0i8.i32(i8* %5, i8 0, i32 32, i32 8, i1 false)
  %6 = bitcast [7 x i8**]* %1 to i64*
  store i64 10, i64* %6, align 8
  %7 = getelementptr [7 x i8**], [7 x i8**]* %1, i64 0, i64 1
  %8 = bitcast i8* %ptls_i8 to i64*
  %9 = load i64, i64* %8, align 8
  %10 = bitcast i8*** %7 to i64*
  store i64 %9, i64* %10, align 8
  store i8*** %.sub, i8**** %ptls, align 8
  store i8** null, i8*** %4, align 8
  %11 = getelementptr [7 x i8**], [7 x i8**]* %1, i64 0, i64 6
  %12 = getelementptr [7 x i8**], [7 x i8**]* %1, i64 0, i64 4
  store i8** inttoptr (i64 140648119871760 to i8**), i8*** %2, align 8
  store i8** inttoptr (i64 140648119910736 to i8**), i8*** %12, align 8
  %13 = call i8** @jl_f_apply_type(i8** null, i8*** %2, i32 2)
  store i8** %13, i8*** %11, align 8
  store i8** inttoptr (i64 140648179986664 to i8**), i8*** %3, align 8
  %14 = call i8** @jl_apply_generic(i8*** %3, i32 2)
  store i8** %14, i8*** %4, align 8
  %15 = bitcast i8** %14 to i64*
  %16 = load i64, i64* %15, align 16
  %17 = mul i64 %16, 5
  %18 = load i64, i64* %10, align 8
  store i64 %18, i64* %8, align 8
  ret i64 %17
}

to the while-loop equivalent:

julia> function fact2(::Type{Val{N}})::Int where N
         n::Int = N - 1
         res::Int = N
         while n > 1
           res *= n
           n -= 1
         end
         res
       end
fact2 (generic function with 1 method)

julia> @code_llvm fact2(Val{5})

define i64 @julia_fact2_62904(i8**) #0 !dbg !5 {
top:
  ret i64 120
}

Both versions in C++, when compiled with clang++, emit the same LLVM code as the while-loop version above. In Julia, however, there is a significant difference between using recursive calls and loops in parametric methods.

Is this something that could be fixed? Or is this something I should remember as a rule of thumb: should I simply prefer loops to recursion?

Thanks in advance for your time.

1 Like

I am pretty sure this is a case of tail recursion. Julia currently doesn’t implement tail-call optimization (TCO). References:

https://groups.google.com/forum/?fromgroups=#!searchin/julia-users/tail$20call/julia-users/qHRDj80rIvA/T3AylpjsASEJ

tl;dr: LLVM has limited support for tail-call optimization and it’s hard to guarantee, so no one has found it to be worth their time yet. However, LLVM does catch limited cases, and in this simple case Clang++ probably eliminates the call.

But, there’s a surprising amount of stuff that exists in Julia packages. As you might expect, a form of TCO has been implemented in a package. Check this out from Lazy.jl:
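For illustration only, here is a hand-rolled trampoline sketch of the same idea (this is not Lazy.jl’s actual API): the recursive step returns a closure instead of calling itself, and a driver loop keeps invoking the result until a non-function value comes back, so the stack never grows.

```julia
# Hand-rolled trampoline sketch (hypothetical; not Lazy.jl's API).
# The recursive step returns a thunk instead of calling itself;
# `trampoline` loops until something other than a function comes back.
fact_step(n, acc) = n <= 1 ? acc : () -> fact_step(n - 1, n * acc)

function trampoline(f, args...)
    result = f(args...)
    while result isa Function
        result = result()
    end
    return result
end

trampoline(fact_step, 20, 1)  # == 2432902008176640000
```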

3 Likes

Wow, thank you for your very fast response! I will have a look at them all.

I am not a language expert, but I believe that C++ compilers can aggressively eliminate all similar (and sometimes harder) recursions.

The above is a simple, stupid example. But I am trying to understand/learn the capabilities of Julia when it comes to compile-time known things.

By the way, just for the fun of it, here is a comparison between the parametric method vs macro:

julia> using BenchmarkTools

julia> macro fact3(N)
         ex = N
         n = N - 1
         while n > 1
           ex = :($n * $ex)
           n -= 1
         end
         ex
       end
@fact3 (macro with 1 method)

julia> @benchmark fact2(Val{20}) # as before
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     3.253 ns (0.00% GC)
  median time:      3.752 ns (0.00% GC)
  mean time:        3.773 ns (0.00% GC)
  maximum time:     20.543 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

julia> @benchmark @fact3(20)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.817 ns (0.00% GC)
  median time:      1.892 ns (0.00% GC)
  mean time:        1.884 ns (0.00% GC)
  maximum time:     14.722 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

Do you know the difference between the two? Might it be the case that the former initiates a function call, whereas the latter does not?

That’s exactly what I am saying. LLVM and other compilers can do a form of TCO. For some reason I don’t exactly know, Julia isn’t able to make use of that from LLVM yet, hence the difference.

1 Like

Sorry, but this is not a tail call. The recursive call must return so that its result can be multiplied by n.

2 Likes

Oh. Well then it’s just a case where Julia cannot optimize on computed types.

function fact1(::Type{Val{N}}) where N
    return Val{N-1}
end

@code_warntype fact1(Val{10})

Variables:
  #self# <optimized out>
  #unused# <optimized out>
  n <optimized out>

Body:
  begin  # line 3:
      return (Core.apply_type)(Main.Val, (Base.sub_int)($(Expr(:static_parameter, 1)), 1)::Int64)::Type{Val{_}} where _
  end::Type{Val{_}} where _

I forget the issue for it.

(I would’ve thought this would be simple enough to still be a case that’s handled by TCO, but I guess that’s why the issue says it’s too fragile to matter in most cases :smile:.)

This might be fixed on v0.7 with Base.@pure pushing constants though.

1 Like

Which example are you referring to?

The OP’s first example:

function fact1(::Type{Val{N}})::Int where N
    n::Int = N
    return n*fact1(Val{n-1})
end

I assume that this is all on Julia 0.6? The problem with fact1 is that in Val{n-1}, the n-1 will not be evaluated at compile time; hence the type of Val{n-1} is not inferred, so it has to be evaluated at run time and dispatch needs to be invoked to perform the recursive call. This might be different in 0.7.

In fact2, since N itself is known at compile time, there is just an explicit loop with known boundaries and a simple body, which LLVM optimizes away into a constant. But calling fact2 is still a function call.

I am not sure about the function call fact2 versus the macro fact3; I leave it to the experts to explain the time difference, if it is even significant. I think in that case your macro just expands to *(20,19,18,...,1), so the multiplication itself is not optimized away and is performed at runtime.
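For what it’s worth, the expansion can be inspected with @macroexpand: the macro body runs at expansion time, so the call site ends up containing only literal multiplications (a small sketch):

```julia
# The macro body runs at expansion time: the loop builds the nested
# expression 2 * (3 * (4 * 5)) before any code is compiled, so the
# call site contains only literal multiplications.
macro fact3(N)
    ex = N
    n = N - 1
    while n > 1
        ex = :($n * $ex)
        n -= 1
    end
    ex
end

@fact3(5)  # == 120
```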

Which does return…

Why is it the case that in the first one N is known at compile time, whereas in the second it is not?

This was my intuition, but I am not sure, either :slight_smile:

Interestingly, on master, if you jump ahead to some modest number of recursions:

julia> @code_warntype fact1(Val{9})
Variables:

Body:
  begin
      #= line 3 =#
      # meta: location REPL[8] fact1 3
      # meta: location REPL[8] fact1 3
      SSAValue(6) = $(Expr(:invoke, MethodInstance for fact1(::Type{Val{6}}), :(Main.fact1), Val{6}))::Int64
      # meta: pop locations (2)
      return (Base.mul_int)(9, (Base.mul_int)(8, (Base.mul_int)(7, SSAValue(6))::Int64)::Int64)::Int64
  end::Int64

But if you do them one by one, the constants fold (for me, 3 works to start but 4 does not):

julia> @code_warntype fact1(Val{3})
Variables:
  n <optimized out>

Body:
  begin
      #= line 3 =#
      return 6
  end::Int64

julia> @code_warntype fact1(Val{4})
Variables:
  n <optimized out>

Body:
  begin
      #= line 3 =#
      return 24
  end::Int64

julia> @code_warntype fact1(Val{5})
Variables:
  n <optimized out>

Body:
  begin
      #= line 3 =#
      return 120
  end::Int64

So that if we run them one after another, we can get

julia> @code_warntype fact1(Val{20})
Variables:

Body:
  begin
      return 2432902008176640000
  end::Int64

julia> @benchmark fact1(Val{20})
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     0.030 ns (0.00% GC)
  median time:      0.040 ns (0.00% GC)
  mean time:        0.036 ns (0.00% GC)
  maximum time:     0.051 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

1 Like

Wow! Nice observation. I hadn’t checked the manually stepping case when I posted this :slight_smile:

In a way, your observation is similar to the clang++ -O0 case: here in Julia you do not have recursion, while there is recursion in the compiled C++ code. Then, with clang++ -O3, you get the desired behaviour.

Another thing I noted is the impressive median time: why, on master, is the timing significantly better, i.e., 3 ns vs 0.03 ns? Are they not both return statements?

In both cases N is known at compile time, that is, if you call fact1(Val{20}) or fact2(Val{20}) a specific version is compiled with N=20. However, for fact2, this just yields the body:

n::Int = 20 - 1
res::Int = 20
while n > 1
    res *= n
    n -= 1
end
res

which LLVM will be able to optimize away to the actual value. In fact1, you get the body

n::Int = 20
return n*fact1(Val{n-1})

with the problem being that the n-1 does not get evaluated (on Julia 0.6). This has nothing to do with first assigning N to n; even if you typed Val{N-1} directly it would not get evaluated. As such, method dispatch does not know which method of fact1 to call next and has to decide at runtime. Use @code_warntype instead of @code_llvm to see this behaviour:

julia> @code_warntype fact1(Val{20})
Variables:
  #self# <optimized out>
  #unused# <optimized out>
  n <optimized out>

Body:
  begin  # line 3:
      SSAValue(2) = (Main.fact1)((Core.apply_type)(Main.Val, (Base.sub_int)($(Expr(:static_parameter, 1)), 1)::Int64)::Type{Val{_}} where _)::Int64
      return (Base.mul_int)($(Expr(:static_parameter, 1)), SSAValue(2))::Int64
  end::Int64

You see the Val{_} appearing with unknown _.

In Julia 0.7, due to constant propagation, this does actually get evaluated.
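For reference, here is a sketch in the 0.7+ style, where Val(N) value types replace Val{N} (whether the recursion fully folds to a constant depends on the Julia version):

```julia
# 0.7+-style value types: Val(N) instead of Val{N}.
# With constant propagation, the N - 1 in Val(N - 1) can be evaluated
# during inference, so each recursive call dispatches statically.
fact(::Val{0}) = 1
fact(::Val{N}) where {N} = N * fact(Val(N - 1))

fact(Val(20))  # == 2432902008176640000
```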

1 Like

Oh, I see. Thank you for the clarification. I just kept the assignment to n, which was needed in the looping case, to keep everything but the choice of recursive calls the same. Indeed, you are right: even when I tried getting rid of n, the behaviour was the same.

It’s not enough that the function just returns; to be tail recursive, a function must return a naked recursive call. The OP’s first example doesn’t return a naked call: it transforms the result by multiplying it by N.

A tail recursive version of factorial would have to use an accumulator:

factorial(::Type{Val{0}}, acc) = acc
factorial(::Type{Val{N}}, acc) where N = factorial(Val{N - 1}, N * acc)
factorial(n) = factorial(Val{n}, 1)

Recursion can be a powerful and elegant way to code. It can also sometimes lead to higher performance because of improved cache locality, and can even sometimes lead to more accurate algorithms.

The one rule of thumb to keep in mind when coding recursion, in any language, is that if you care about performance you need to enlarge the base case to amortize the function-call overhead.

Some examples of this include cache-oblivious matrix multiplication or transposition (from my course notes) and pairwise summation in Julia Base.

This is also called recursion coarsening, and my understanding is that it is still a research problem for compilers to perform this kind of transformation automatically in any general language.
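As a sketch of the enlarged base case (the cutoff of 128 is an arbitrary illustration, not the tuned value Base uses):

```julia
# Pairwise summation with a coarsened base case: below the cutoff we
# fall back to a plain loop, amortizing the call overhead while keeping
# the improved error growth of the pairwise scheme on large inputs.
function pairwise_sum(x::AbstractVector, lo::Int=1, hi::Int=length(x))
    if hi - lo < 128                      # enlarged base case (arbitrary cutoff)
        s = zero(eltype(x))
        for i in lo:hi
            s += x[i]
        end
        return s
    end
    mid = (lo + hi) >>> 1                 # split in half, recurse on each part
    return pairwise_sum(x, lo, mid) + pairwise_sum(x, mid + 1, hi)
end

pairwise_sum(collect(1.0:1000.0))  # == 500500.0
```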

If you have a tail-recursive algorithm, on the other hand, recursion is usually uninteresting and has no particular advantages; you might as well rewrite it into the equivalent loop. That’s why there’s not much interest in implementing TCO for Julia, in contrast to languages like Scheme where it is essential because tail calls are the primary iteration construct.
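For completeness, the mechanical rewrite of the accumulator version above into a loop looks like this (a sketch):

```julia
# The tail-recursive accumulator version rewritten as the equivalent
# loop: each "recursive call" becomes one update of (n, acc).
function fact_loop(n::Integer)
    acc = one(n)
    while n > 0
        acc *= n
        n -= 1
    end
    return acc
end

fact_loop(20)  # == 2432902008176640000
```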

24 Likes

I’ve just begun to learn about recursion and came across this. Has this been improved, or is the lack of optimization still existent?

Yes, still existent AFAIK.

What is AFAIK?