with the internal recursive function being 100 times slower than an equivalent top-level function. Naively, adding up 100 ones should take around 5ns.
In pprof I can see that loop calls jl_apply_generic, which calls jl_invoke, which calls loop again — I'm guessing this dynamic dispatch is what makes f particularly slow.
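For context, the functions under discussion presumably look something like this (a minimal reconstruction — the exact originals aren't shown in this excerpt):

```julia
# Slow version: `loop` is a closure that captures itself, so the
# recursive call goes through dynamic dispatch (jl_apply_generic).
function f(y)
    loop(x) = x > 0 ? 1 + loop(x-1) : 0
    loop(y)
end

# Equivalent top-level function for comparison; this one infers fine.
g(x) = x > 0 ? 1 + g(x-1) : 0
```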
It seems (see Performance of recursive function and Type instability of nested recursive function - #3 by ma-chengyuan) that this has been an issue for a while, but does anybody know of a workaround that doesn’t involve hand-unrolling the recursion by maintaining a stack manually? My actual code is quite a lot bigger than this, and unrolling it by hand or hoisting out the recursive loop would be difficult.
Seems like a self-referential version of this, which links to the core GitHub issue #15276. The problem in that case seems to be that the compiler gives up easily on inferring a captured variable if it changes at runtime, which can be worked around by replacing the closure with a functor: a function-like instance whose type is a struct containing the captured variables (the struct and the associated method are defined at global scope).
I’m not exactly sure why the compiler gives up on inferring a nested recursive function (this doesn’t look like a world-age problem, which involves global-scope definitions, and nested method definitions aren’t actually redone as dynamically as global ones). The workaround is to define the method at global scope, like g. If you need to capture other variables, you could make a functor as described before. (EDIT: To be clear, a functor could only capture *itself* if you wrestle with incompletely initialized mutable structs, and that is unnecessary just for calling the function part. I’m talking about capturing other things, e.g. if you wanted to iterate with a different step, g(x-step).)
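As a sketch of that functor workaround (the names here are my own, not from the thread): define a struct at global scope holding the captured state, and make its instances callable. The recursive call then goes through a value with a concrete type, so inference succeeds:

```julia
# Functor: a callable struct whose fields play the role of captured variables.
struct Counter
    step::Int
end

# The recursive call goes through `c` itself, whose type is concrete.
(c::Counter)(x) = x > 0 ? 1 + c(x - c.step) : 0

c = Counter(1)
c(100)  # counts the number of steps down to zero
```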
I was playing around with this — can someone explain this to me? Just by removing the type annotation I get 4x faster execution, while the @code_warntype output remains the same:
function f2(y)
loop(x) = x > 0 ? 1 + loop(x-1) : 0
loop(y)
end
On my machine f(100) runs in 14μs and f2(100) in 3μs.
function f3(y)
loop(loop, x) = x > 0 ? 1 + loop(loop, x-1) : 0
loop(loop, y)
end
@btime f3(100) # ~100ns
As those other issues mention, the problem is that the loop variable is used inside the function (to call itself recursively), so it’s a closed-over variable; for various Julia reasons it then gets boxed (wrapped in a Core.Box), which leads to bad performance. Here, you just pass it around manually as an argument instead.
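The same argument-passing trick extends to any other variables the closure would capture — thread them through as explicit arguments too. A sketch of my own (f4 and step are not from the thread), combining the self-passing trick with an extra parameter:

```julia
# Every would-be capture (`loop` itself and `step`) is an explicit
# argument, so nothing is boxed and the recursion infers concretely.
function f4(y, step)
    loop(loop, x, step) = x > 0 ? 1 + loop(loop, x - step, step) : 0
    loop(loop, y, step)
end
```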
Another thread made me remember this and wonder whether @btime is the problem here, but unlike that example, @time-ing a loop actually corroborates the 3-4x difference seen with @btime, rather than the matching @code_native/@code_llvm:
julia> @time for i in 1:10_000
f(i÷i*100) # prevent constant hoisting
end
0.181906 seconds (20.00 k allocations: 312.500 KiB)
julia> @time for i in 1:10_000
f2(i÷i*100) # prevent constant hoisting
end
0.056144 seconds (20.00 k allocations: 312.500 KiB)
julia> 0.181906/0.056144
3.239990025648333
No, the issue I had was that the recursive function needs access to ~twelve different variables from its scope, and what was killing performance was the recursive call itself rather than accesses to variables from inside the closure. I ended up using this macro: