Experiments with Julia Stack Sizes and Enzyme

I’ve been trying to get Enzyme.jl working as an AD backend for SymbolicRegression.jl for a couple of years now, and recently got a working demonstration with a hack to increase the stack size: Extremely long compilation time on recursive, branching functions (DynamicExpressions.jl) · Issue #1156 · EnzymeAD/Enzyme.jl · GitHub.

Some background: I noticed it was hard to reproduce my issues in a MWE, and I also ran into a mysterious stack overflow once in a while with no debug info. It seemed I could only get the compilation freezes in the full, more complicated version of my code.

So in a sort of a last ditch effort to fix this issue I found this stackoverflow.com post about manually reserving stack space:

function with_stacksize(f::F, n) where {F<:Function}
    fetch(schedule(Task(f, n)))
end

with_stacksize(8 * 1024 * 1024) do  # 8 MB stack (default=4 MB)
    ...
end

Weirdly enough, this fixed the issue. I wrapped my Enzyme.jl call with it (full code here):

with_stacksize(8 * 1024 * 1024) do
    autodiff(
        Reverse,
        evaluator,
        Duplicated(g.f.tree, g.extra.storage_tree),
        Duplicated(g.f.dataset, g.extra.storage_dataset),
        Const(g.f.options),
        Const(g.f.idx),
        Duplicated(output, doutput),
    )
end

and, lo and behold, I could actually run SymbolicRegression.jl with Enzyme.jl as the AD. Even multi-threaded searches now works. And the derivatives are fast :metal:

So, some questions:

  1. What could the issue be coming from? Is this simply too hard of a problem to differentiate with Enzyme using the normal Julia stack size?
  2. Task(f, n) is undocumented. Is it safe to use? (Am I going to run into some mysterious segfaults in the future?)
  3. What is the “proper” way to deal with a limited stack size in Julia?
8 Likes

Some experiments with stack sizes:

function test_recursion_depth(depth::Int, args...)
    depth >= 1000000000000 && return nothing
    print("\33[2K\rHello from depth $(depth)")
    test_recursion_depth(depth + 1, args...)
    print("$(depth) $(args)")  # Just to prevent LLVM removing args
end

So with no arguments, we can get up to a call depth of 86660:

julia> test_recursion_depth(0)
Hello from depth 86660ERROR: StackOverflowError:

Let’s see if we have 10 arguments:

julia> test_recursion_depth(0, ones(10)...)
Hello from depth 12998

meaning we hit a stack overflow at about 13,000 call depth.

And, for 100 arguments:

julia> test_recursion_depth(0, ones(100)...)
Hello from depth 6117

Does the type of the argument affect the stack frame size?

julia> test_recursion_depth(0, ones(BigInt, 100)...)
Hello from depth 6117

julia> test_recursion_depth(0, ntuple(_ -> ones(10), 100)...)
Hello from depth 6117

Nope, just the number of arguments. What about kwargs?

julia> function test_recursion_depth_kws(depth::Int; kws...)
           depth >= 1000000000000 && return nothing
           print("\33[2K\rHello from depth $(depth)")
           test_recursion_depth_kws(depth + 1; kws...)
           print("$(depth) $(kws)")
       end

julia> test_recursion_depth_kws(0; NamedTuple(Symbol("x$i") => 1 for i in 1:100)...)
Hello from depth 2511

The all-keywords example seems to have a ~2.4x larger stack frame size.

What about when you have local arguments within the function body?

julia> function test_recursion_depth_locals(::Val{n}, depth::Int, args...) where {n}
           depth >= 1000000000000 && return nothing
           print("\33[2K\rHello from depth $(depth)")
           a = ntuple(_ -> rand(), Val(n))
           test_recursion_depth_locals(Val(n), depth + 1, args...)
           print("$(depth) $(a) $(args)")  # Just to prevent LLVM removing args
       end;

Let’s see 1 value, 50 values, and 500 values in a tuple, in the function body:

julia> test_recursion_depth_locals(Val(1), 0, ones(10)...)
Hello from depth 12091

julia> test_recursion_depth_locals(Val(50), 0, ones(10)...)
Hello from depth 7646

julia> test_recursion_depth_locals(Val(500), 0, ones(10)...)
Hello from depth 1710

So the function body seems to affect the stack frame size quite a bit.

Okay, now lets try with different stack sizes via Task(f, n) (still undocumented, but seems to work pretty well) –

1 MB

julia> fetch(schedule(Task(() -> test_recursion_depth(0, ones(10)...), 1024 * 1024)))
Hello from depth 1532

4 MB (default for a task)

julia> fetch(schedule(Task(() -> test_recursion_depth(0, ones(10)...), 4 * 1024 * 1024)))
Hello from depth 6447

julia> fetch(schedule(Task(() -> test_recursion_depth(0, ones(10)...))))
Hello from depth 6447

8 MB (seems to be default for root task)

julia> fetch(schedule(Task(() -> test_recursion_depth(0, ones(10)...), 8 * 1024 * 1024)))
Hello from depth 13001

Note: it can get up a bit higher than the initial test – because the task has a fresh stack, whereas the root task has the REPL in its stack.

Now, 64 MB:

julia> fetch(schedule(Task(() -> test_recursion_depth(0, ones(10)...), 64 * 1024 * 1024)))
Hello from depth 104751

Or 1 GB:

julia> fetch(schedule(Task(() -> test_recursion_depth(0, ones(10)...), 1024 * 1024 * 1024)))
Hello from depth 1677615

(Note: it’s just a stack size – that memory is virtual. Only if you use that stack size do you actually allocate it.)

3 Likes