Memory allocation in recursion

Does Julia allocate memory when recursing? I see a ton of memory allocation
in a recursive quicksort…

No, you shouldn’t be seeing heap allocations when recursing, unless your function heap allocates regardless.

Often, though, inference fails when argument types change across recursive calls and the compiler can’t prove the chain of type changes terminates.
Do you have inference failures?
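For instance, something like this (a made-up illustration, not the OP’s code) tends to defeat inference because the argument type grows on every recursive call:

```julia
# Hypothetical example: each call wraps the argument in one more tuple,
# so the argument type grows without bound and inference eventually
# widens the return type (typically to Any).
nest(x, n) = n == 0 ? x : nest((x,), n - 1)

# Inspect with: @code_warntype nest(1, 3)
```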

4 Likes

No flagged problems in @code_warntype output. It is a puzzle…

Do you have a minimal reproducer?

2 Likes

I was about to say “but Julia doesn’t do TCO”. But you’re absolutely right: it is both educational and more readable / honest / transparent to write

function foo(x,y,z)
  #...
    return foo(a,b,c) # tail-call
  #...
end

as

function foo(x,y,z)
  @label entry
  #...
    x,y,z = a,b,c
    @goto entry
  #...
end
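As a concrete, runnable version of the same transformation (a toy example, not related to the sorting code), a tail-recursive sum and its `@goto` equivalent:

```julia
# Tail-recursive accumulator version: the recursive call is in tail position.
sum_rec(n, acc=0) = n == 0 ? acc : sum_rec(n - 1, acc + n)

# Same computation with the tail call rewritten as a labeled goto,
# as sketched above; no new stack frame per "call".
function sum_goto(n, acc=0)
    @label entry
    n == 0 && return acc
    n, acc = n - 1, acc + n
    @goto entry
end

sum_rec(10) == sum_goto(10) == 55
```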

I would not be surprised if the recursion stumped escape analysis and caused views to be allocated instead of SROA-ed.

1 Like

No, there’s no problem with views and recursion in general — we combine the two all the time. For example here is a recursive implementation of pairwise summation with views:

julia> using BenchmarkTools

julia> function foo(a)
          n = length(a)
          n ≤ 1 && return sum(a)
          n2 = n >> 1
          @views return foo(a[1:n2]) + foo(a[n2+1:end])
       end
foo (generic function with 1 method)

julia> a = rand(1000);

julia> @btime foo($a);
  11.377 μs (0 allocations: 0 bytes)

(It’s slow because I didn’t coarsen the base case, but it doesn’t allocate.)
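Coarsening the base case means switching to a plain loop below some cutoff so the recursion overhead is amortized; a sketch of what that might look like (the cutoff of 128 is arbitrary, chosen only for illustration):

```julia
# Same recursive pairwise sum as above, but with a coarsened base case:
# below the cutoff, fall back to a simple sequential loop.
function foo2(a)
    n = length(a)
    if n ≤ 128                       # cutoff is an arbitrary illustration
        s = zero(eltype(a))
        @inbounds @simd for i in eachindex(a)
            s += a[i]
        end
        return s
    end
    n2 = n >> 1
    @views return foo2(a[1:n2]) + foo2(a[n2+1:end])
end
```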

Really, it’s pointless to speculate about why @PetrKryslUCSD is seeing allocations without seeing example code. But allocations are certainly not an inherent price of recursion.

6 Likes

The code (batch.jl):

module PQuickSort

using LinearAlgebra

@inline function _partition!(A, perm, pivot, left, right)
    @inbounds while left <= right
        while A[left] < pivot
            left += 1
        end
        while A[right] > pivot
            right -= 1
        end
        if left <= right
            A[left], A[right] = A[right], A[left]
            perm[left], perm[right] = perm[right], perm[left]
            left += 1
            right -= 1
        end
    end
    return (left,right)
end

function quicksortperm!(A, perm, i=1, j=length(A))
    if j > i
        left, right = _partition!(A, perm, A[(j+i) >>> 1], i, j)
        quicksortperm!(A, perm, i, right)
        quicksortperm!(A, perm, left, j)
    end
end

function __pquicksortperm!(A, perm, i, j, task)
    if task <= 0
        quicksortperm!(A, perm, i, j)
        return
    end
    left, right = _partition!(A, perm, A[(j+i) >>> 1], i, j)
    t = Threads.@spawn __pquicksortperm!(A, perm, $i, $right, task - 1)
    __pquicksortperm!(A, perm, left, j, task - 1)
    wait(t)
    return
end

# This has the same order of the arguments as the built in sortperm!
function pquicksortperm!(perm, A, ntasks = Threads.nthreads())
    __pquicksortperm!(A, perm, 1, length(A), ntasks)
end

end # module

using LinearAlgebra
using Main.PQuickSort
function g()
    N = 960_000_000
    # N = 10
    a = collect((N):-1:1)
    p = similar(a)
    @time sortperm!(p, a)
    # @show a, prm
    aref = sort(deepcopy(a))

    a = collect((N):-1:1)
    prm = collect(1:length(a))
    @time PQuickSort.pquicksortperm!(prm, a)
    # @show a, prm

    @assert norm(a - aref) == 0  # aref is already sorted

    a = collect((N):-1:1)
    prm = collect(1:length(a))
    @time PQuickSort.pquicksortperm!(prm, a)
end

This is what I see:

sorting % ../julia-1.10.0/bin/julia -t 1


julia> include("batch.jl")
g (generic function with 1 method)

julia> g()
  1.764360 seconds
 12.726564 seconds (686 allocations: 47.734 KiB, 0.01% compilation time)
 12.728211 seconds (5 allocations: 528 bytes)

julia> g()
  1.771723 seconds
 12.724387 seconds (5 allocations: 528 bytes)
 12.751161 seconds (5 allocations: 528 bytes)

julia>
sorting % ../julia-1.10.0/bin/julia -t 4

julia> include("batch.jl")
g (generic function with 1 method)

julia> g()
  1.765504 seconds
  3.860578 seconds (756 allocations: 54.953 KiB, 0.06% compilation time)
  3.834740 seconds (75 allocations: 7.734 KiB)

julia>
sorting % ../julia-1.10.0/bin/julia -t 8

julia> include("batch.jl")
g (generic function with 1 method)

julia> g()
  1.764240 seconds
  2.564240 seconds (1.96 k allocations: 178.703 KiB, 0.09% compilation time)
  2.506170 seconds (1.27 k allocations: 131.484 KiB)

julia> g()
  1.572977 seconds
  2.533586 seconds (1.27 k allocations: 131.484 KiB)
  2.469899 seconds (1.27 k allocations: 131.484 KiB)

julia> g()
  1.572899 seconds
  2.531733 seconds (1.27 k allocations: 131.484 KiB)
  2.476805 seconds (1.27 k allocations: 131.484 KiB)

julia>
sorting % ../julia-1.10.0/bin/julia -t 16


julia> include("batch.jl")
g (generic function with 1 method)

julia> g()
  1.763108 seconds
  1.939059 seconds (328.39 k allocations: 34.418 MiB, 0.12% compilation time)
  1.908631 seconds (327.71 k allocations: 34.465 MiB)

julia> g()
  1.563598 seconds
  1.930559 seconds (327.71 k allocations: 34.130 MiB)
  1.885926 seconds (327.68 k allocations: 33.367 MiB)

julia>

Platform:

julia> versioninfo()
Julia Version 1.10.0
Commit 3120989f39b (2023-12-25 18:01 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: macOS (arm64-apple-darwin22.4.0)
  CPU: 24 × Apple M2 Ultra
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, apple-m1)
  Threads: 2 on 16 virtual cores

Have you tried it without Threads.@spawn to check whether the allocations are just due to the threading?
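E.g. something like this (an illustrative patch, assuming the `_partition!` and `quicksortperm!` definitions from the module above are in scope): keep the same recursion shape but drop the `Threads.@spawn`. If the allocations disappear, they were coming from the `Task` objects.

```julia
# Serial stand-in for the threaded branch: identical recursion, no task spawn.
function __pquicksortperm_serial!(A, perm, i, j, task)
    if task <= 0
        quicksortperm!(A, perm, i, j)
        return
    end
    left, right = _partition!(A, perm, A[(j+i) >>> 1], i, j)
    __pquicksortperm_serial!(A, perm, i, right, task - 1)
    __pquicksortperm_serial!(A, perm, left, j, task - 1)
    return
end
```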

2 Likes

Nothing new can be said about TCO and Julia at this point, but please bear in mind that self-recursion is the only use of TCO which can be transformed into a loop or labeled goto. It is among the less interesting applications. Dynamic programming and continuation-passing style both call for mutual recursion, where “mutual” can extend to dozens of functions.
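The classic minimal example is the mutually tail-recursive even/odd pair: each call is in tail position, but without tail-call elimination neither function can be rewritten as a loop on its own, and deep inputs consume a stack frame per call:

```julia
# Mutually tail-recursive pair; with TCO this would run in constant stack,
# but Julia pushes a frame per call, so a very large n overflows the stack.
is_even(n) = n == 0 ? true  : is_odd(n - 1)
is_odd(n)  = n == 0 ? false : is_even(n - 1)

is_even(10)  # true; is_even(10^9) would raise a StackOverflowError
```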

So I hope someone finds the energy to add explicit tail calls to Julia. It would be a good addition to the language. Some of the arguments against doing it implicitly and automatically are good ones, but there’s no case to be made against providing the ability to eliminate a tail call explicitly.

1 Like

Dynamic programming (which can involve either mutual or self recursion) doesn’t necessarily (or even usually) involve tail calls. Ditto for many other applications of mutual recursion. (And mutually recursive tail calls can still be written as an imperative loop, albeit with a merged function body.)

But yes, the possibility of an explicit tail-call annotation has been raised before, e.g. for people who want to write in functional styles such as continuation passing.

And of course, this has nothing to do with the code the OP of this thread is concerned with, which isn’t a tail call.

2 Likes

You are right. It was the task spawning. Seems I seriously miscalculated how many tasks were going to be spawned.
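For the record (my own back-of-the-envelope, not from the thread): since every call with `task > 0` spawns one task and then recurses twice with `task - 1`, the spawn tree is a full binary tree of depth `ntasks`, so the total number of spawned tasks is `2^ntasks - 1` — it explodes with the thread count rather than matching it, which lines up with the jump in allocations at `-t 16`:

```julia
# S(t) = 1 + 2*S(t-1), S(0) = 0  =>  S(t) = 2^t - 1 spawned tasks.
spawncount(ntasks) = 2^ntasks - 1

spawncount(4)    # 15 tasks at -t 4
spawncount(16)   # 65535 tasks at -t 16
```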

1 Like

Many, even most, interesting dynamic programming techniques can be implemented to use tail calls. Schemers have made a specialty of this, for obvious reasons. Of course, if they aren’t available in the first place, there’s no point.

At the limit this boils down to the well-known finding that any system which meets certain minimal standards (i.e. is Turing-complete) can run any program.

Anything I do in Julia which uses multiple dispatch, for a less general example, I could write in a language which doesn’t have it. But it’s a very nice feature to have. When you talk about a merged function body, that’s manually writing a state machine with a bunch of labels and goto. Hardly a satisfactory solution. One might even say that writing efficient state machines which aren’t a single function containing a maze of gotos is the main value of tail call elimination!

Yeah I think that’s a great proposal… but it’s closed, and the only way it gets opened again is if people keep speaking up. Continuation-passing style is a choice way to write interpreters, and involves a tail call pretty much by definition. The alternatives are consuming the stack, or writing the entire interpreter as a single state machine (see above), and neither of those are good.

Or writing a bytecode interpreter with its own stack, but now we’re implementing a different program because we can’t write the one we want. Also no bueno.

I was a bit confused where the bit about TCO came from in the first place, yes. But here we are. Julia certainly consumes stack memory in recursion, even when it doesn’t have to, and with no way of preventing it other than not to use recursion.

Anyway. Back to your regularly scheduled, um, program.

2 Likes