Recursive inner functions a thousand times slower?

In the following code

function f(y)
    loop(x::Int)::Int = x > 0 ? 1 + loop(x-1) : 0
    loop(y)
end

g(x::Int)::Int = x > 0 ? 1 + g(x-1) : 0

I get the following timings:

julia> @btime f(100)
  12.770 μs (2 allocations: 32 bytes)
100

julia> @btime g(100)
  106.513 ns (0 allocations: 0 bytes)
100

with the inner recursive function being roughly 100 times slower than an equivalent top-level function. Obviously, adding up a hundred 1s should only take around 5 ns.

In pprof I can see that loop calls jl_apply_generic, which calls jl_invoke, which calls loop again, which I’m guessing is what makes f particularly slow.

It seems (see Performance of recursive function and Type instability of nested recursive function - #3 by ma-chengyuan) that this has been an issue for a while, but does anybody know of a workaround that doesn’t involve hand-unrolling the recursion by maintaining a stack manually? My actual code is quite a lot bigger than this, and unrolling it by hand or hoisting out the recursive loop would be difficult.

3 Likes

This seems like a self-referential version of this issue, which links to the core GitHub issue #15276. The problem in that case seems to be that the compiler easily gives up on inferring a captured variable if it can change at runtime, which can be worked around by replacing the closure with a functor: a function-like instance whose type is a struct containing the captured variables (the struct and the associated call method are defined at global scope).

I’m not exactly sure why the compiler gives up on inferring a nested recursive function (this doesn’t look like a world-age problem, which involves global-scope definitions, and nested method definitions aren’t actually redone as dynamically as global ones), but the workaround is to define the method at global scope like g. If you need to capture other variables, you could make a functor as described above. (EDIT: To be clear, a functor could only capture itself if you went to the trouble of using an uninitialized mutable struct, and capturing itself is unnecessary just to call it. I’m talking about capturing other things, like a step size if you wanted to iterate with g(x-step).)
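
A minimal sketch of that functor pattern, with a made-up Stepper struct and step field standing in for whatever you actually need to capture:

# The captured variable becomes a struct field, and the recursion is a
# call method on the struct; both are defined at global scope.
struct Stepper
    step::Int
end
(s::Stepper)(x::Int)::Int = x > 0 ? 1 + s(x - s.step) : 0

count_down(y, step) = Stepper(step)(y)   # no closure, so nothing gets boxed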

1 Like

I was playing around with this, and can someone explain this to me: just by removing the type annotation I get 4x faster execution, while the @code_warntype output stays the same:

function f2(y)
    loop(x) = x > 0 ? 1 + loop(x-1) : 0
    loop(y)
end

On my machine f(100) runs in 14 μs and f2(100) in 3 μs.

julia> @btime f2(100)
  3.066 μs (2 allocations: 32 bytes)

What surprises me is that even the @code_llvm output looks the same. Is there any way to understand what is going on here?

4 Likes

I can replicate this (17.8 μs vs 5.3 μs) on a Mac, and the @code_native output also matches.

Here’s one workaround:

function f3(y)
    loop(loop, x) = x > 0 ? 1 + loop(loop, x-1) : 0
    loop(loop, y)
end
@btime f3(100) # ~100ns

Like those other issues mention, the problem is that the loop variable is used inside the function (to call itself recursively), so it’s a closed-over variable; for various Julia reasons it then gets Boxed (wrapped in a Core.Box), which leads to the bad performance. Here, you just manually pass the function around as an argument instead.
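
If the inner function also needed other locals from the enclosing scope, the same trick extends: you can thread them through as extra arguments too. A small sketch with a hypothetical step parameter:

function f4(y, step)
    # the inner function refers only to its own arguments, so nothing is boxed
    loop(loop, x, step) = x > 0 ? 1 + loop(loop, x - step, step) : 0
    loop(loop, y, step)
end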

5 Likes

That’s just like the solution everybody uses in C++!

Another thread made me remember this and consider whether @btime is the problem here, but unlike that example, @time-ing a loop actually corroborates the 3-4x difference seen with @btime, not the matching @code_native/@code_llvm:

julia> @time for i in 1:10_000
        f(i÷i*100) # prevent constant hoisting
       end
  0.181906 seconds (20.00 k allocations: 312.500 KiB)

julia> @time for i in 1:10_000
        f2(i÷i*100) # prevent constant hoisting
       end
  0.056144 seconds (20.00 k allocations: 312.500 KiB)

julia> 0.181906/0.056144
3.239990025648333

If the issue you want to avoid is namespace pollution, you could put the recursive function into the top level of a submodule.
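
Something like this, for instance (the module and function names here are just placeholders):

# The recursive function lives at the top level of a small module, so it
# compiles like the global g without polluting the enclosing namespace.
module RecursionImpl
    loop(x::Int)::Int = x > 0 ? 1 + loop(x-1) : 0
end

f_submodule(y) = RecursionImpl.loop(y)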

No, the issue I had was that the recursive function needs access to ~twelve different variables from its scope, and what was killing performance was the recursive call itself rather than accesses to variables from inside the closure. I ended up using this macro:

macro closure_struct(name, args...)
    @assert isa(name, Symbol)
    @assert all(s->isa(s,Symbol), args)
    # one uppercase type parameter per captured variable, e.g. x -> X
    typenames = map(s->esc(Symbol(uppercase(string(s)))), args)
    # parametric struct header: Name{X, Y, Z}
    type_args = Expr(:curly, esc(name), typenames...)
    # field declarations: x::X, y::Y, z::Z
    body = Expr(:block, map(p -> Expr(:(::), p...), zip(args,typenames))...)
    # immutable struct definition (the false means "not mutable")
    decl = Expr(:struct, false, type_args, body)
    # keyword constructor: Name(; x, y, z) = Name(x, y, z)
    constructor = let
        lhs = Expr(:call, esc(name), Expr(:parameters, args...))
        rhs = Expr(:call, esc(name), args...)
        :($lhs = $rhs)
    end
    quote
        $decl
        $constructor
    end
end

That way you can define a recursive function with

@closure_struct _loop x y z

# @unpack comes from UnPack.jl (it is also exported by Parameters.jl)
function (rec::_loop)(a::Int)::Int
    @unpack x, y, z = rec
    a > 0 ? rec(a-1) : a
end

at the top level, and then I can use it directly as _loop(;x,y,z)(a).
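
For reference, this is roughly what @closure_struct _loop x y z expands to (modulo macro-hygiene details), based on the Expr construction above:

# approximate expansion of the macro invocation
struct _loop{X, Y, Z}
    x::X
    y::Y
    z::Z
end
_loop(; x, y, z) = _loop(x, y, z)   # keyword constructor generated by the macro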