Does Julia allocate memory when recursing? I see a ton of memory allocation
in a recursive quicksort…
No, you shouldn’t be seeing heap allocations when recursing, unless your function heap allocates regardless.
Often, inference will fail when types change on recursion and the compiler can’t prove the recursion/type changes terminate.
Do you have inference failures?
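(For instance — a contrived sketch, not your code — a recursion whose return type flips between the base case and the recursive case forces inference to a `Union`, which `@code_warntype` would flag:)

```julia
# Contrived example of types changing on recursion: the base case returns an
# Int but the recursive case returns a Float64, so the inferred return type
# widens to Union{Int, Float64}. @code_warntype unstable(3) shows the widening.
unstable(n) = n == 0 ? 0 : 1.0 + unstable(n - 1)
```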
No flagged problems in @code_warntype output. It is a puzzle…
Do you have a minimal reproducer?
I was about to say “but Julia doesn’t do TCO”. But you’re absolutely right: it is both educational and more readable / honest / transparent to write
function foo(x,y,z)
#...
return foo(a,b,c) # tail-call
#...
end
as
function foo(x,y,z)
@label entry
#...
x,y,z = a,b,c
@goto entry
#...
end
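As a concrete, runnable sketch of that transformation (with hypothetical names), here is a tail-recursive accumulator and its manual `@label`/`@goto` rewrite, which runs in constant stack space:

```julia
# Tail-recursive version: without TCO, each call consumes a stack frame.
sumto(n, acc=0) = n == 0 ? acc : sumto(n - 1, acc + n)

# Manual tail-call elimination: same logic, rebind the arguments and jump back.
function sumto_loop(n, acc=0)
    @label entry
    n == 0 && return acc
    n, acc = n - 1, acc + n   # "pass the new arguments"
    @goto entry
end
```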
I would not be surprised if the recursion stumped escape analysis and caused views to be allocated instead of SROA-ed.
No, there’s no problem with views and recursion in general — we combine the two all the time. For example here is a recursive implementation of pairwise summation with views:
julia> using BenchmarkTools
julia> function foo(a)
           n = length(a)
           n ≤ 1 && return sum(a)
           n2 = n >> 1
           @views return foo(a[1:n2]) + foo(a[n2+1:end])
       end
foo (generic function with 1 method)
julia> a = rand(1000);
julia> @btime foo($a);
11.377 μs (0 allocations: 0 bytes)
(It’s slow because I didn’t coarsen the base case, but it doesn’t allocate.)
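(For what it’s worth, coarsening the base case looks like this — the threshold of 128 is an arbitrary illustration, not a tuned value — and it still allocates nothing:)

```julia
# Same pairwise summation, but recursion bottoms out at a chunk that is
# summed directly, removing most of the call overhead.
function foo2(a)
    n = length(a)
    n ≤ 128 && return sum(a)   # coarsened base case (threshold is arbitrary)
    n2 = n >> 1
    @views return foo2(a[1:n2]) + foo2(a[n2+1:end])
end
```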
Really, it’s pointless to speculate about why @PetrKryslUCSD is seeing allocations without seeing example code. But allocations are certainly not an inherent price of recursion.
The code (batch.jl):
module PQuickSort

using LinearAlgebra

@inline function _partition!(A, perm, pivot, left, right)
    @inbounds while left <= right
        while A[left] < pivot
            left += 1
        end
        while A[right] > pivot
            right -= 1
        end
        if left <= right
            A[left], A[right] = A[right], A[left]
            perm[left], perm[right] = perm[right], perm[left]
            left += 1
            right -= 1
        end
    end
    return (left, right)
end

function quicksortperm!(A, perm, i=1, j=length(A))
    if j > i
        left, right = _partition!(A, perm, A[(j+i) >>> 1], i, j)
        quicksortperm!(A, perm, i, right)
        quicksortperm!(A, perm, left, j)
    end
end

function __pquicksortperm!(A, perm, i, j, task)
    if task <= 0
        quicksortperm!(A, perm, i, j)
        return
    end
    left, right = _partition!(A, perm, A[(j+i) >>> 1], i, j)
    t = Threads.@spawn __pquicksortperm!(A, perm, $i, $right, task - 1)
    __pquicksortperm!(A, perm, left, j, task - 1)
    wait(t)
    return
end

# This has the same order of arguments as the built-in sortperm!
function pquicksortperm!(perm, A, ntasks = Threads.nthreads())
    __pquicksortperm!(A, perm, 1, length(A), ntasks)
end

end # module
using LinearAlgebra
using Main.PQuickSort

function g()
    N = 960_000_000
    # N = 10
    a = collect((N):-1:1)
    p = similar(a)
    @time sortperm!(p, a)
    # @show a, prm
    aref = sort(deepcopy(a))
    a = collect((N):-1:1)
    prm = collect(1:length(a))
    @time PQuickSort.pquicksortperm!(prm, a)
    # @show a, prm
    norm(a - sort(aref)) == 0
    a = collect((N):-1:1)
    prm = collect(1:length(a))
    @time PQuickSort.pquicksortperm!(prm, a)
end
This is what I see:
sorting % ../julia-1.10.0/bin/julia -t 1
julia> include("batch.jl")
g (generic function with 1 method)
julia> g()
1.764360 seconds
12.726564 seconds (686 allocations: 47.734 KiB, 0.01% compilation time)
12.728211 seconds (5 allocations: 528 bytes)
julia> g()
1.771723 seconds
12.724387 seconds (5 allocations: 528 bytes)
12.751161 seconds (5 allocations: 528 bytes)
julia>
sorting % ../julia-1.10.0/bin/julia -t 4
julia> include("batch.jl")
g (generic function with 1 method)
julia> g()
1.765504 seconds
3.860578 seconds (756 allocations: 54.953 KiB, 0.06% compilation time)
3.834740 seconds (75 allocations: 7.734 KiB)
julia>
sorting % ../julia-1.10.0/bin/julia -t 8
julia> include("batch.jl")
g (generic function with 1 method)
julia> g()
1.764240 seconds
2.564240 seconds (1.96 k allocations: 178.703 KiB, 0.09% compilation time)
2.506170 seconds (1.27 k allocations: 131.484 KiB)
julia> g()
1.572977 seconds
2.533586 seconds (1.27 k allocations: 131.484 KiB)
2.469899 seconds (1.27 k allocations: 131.484 KiB)
julia> g()
1.572899 seconds
2.531733 seconds (1.27 k allocations: 131.484 KiB)
2.476805 seconds (1.27 k allocations: 131.484 KiB)
julia>
sorting % ../julia-1.10.0/bin/julia -t 16
julia> include("batch.jl")
g (generic function with 1 method)
julia> g()
1.763108 seconds
1.939059 seconds (328.39 k allocations: 34.418 MiB, 0.12% compilation time)
1.908631 seconds (327.71 k allocations: 34.465 MiB)
julia> g()
1.563598 seconds
1.930559 seconds (327.71 k allocations: 34.130 MiB)
1.885926 seconds (327.68 k allocations: 33.367 MiB)
julia>
Platform:
julia> versioninfo()
Julia Version 1.10.0
Commit 3120989f39b (2023-12-25 18:01 UTC)
Build Info:
Official https://julialang.org/ release
Platform Info:
OS: macOS (arm64-apple-darwin22.4.0)
CPU: 24 × Apple M2 Ultra
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-15.0.7 (ORCJIT, apple-m1)
Threads: 2 on 16 virtual cores
Have you tried it without Threads.@spawn, to check whether the allocations are just due to the threading?
Nothing new can be said about TCO and Julia at this point, but please bear in mind that self-recursion is the only use of TCO which can be transformed into a loop or labeled goto. It is among the less interesting applications. Dynamic programming and continuation-passing style both call for mutual recursion, where “mutual” can extend to dozens of functions.
So I hope someone finds the energy to add explicit tail calls to Julia. It would be a good addition to the language. Some of the arguments against doing it implicitly and automatically are good ones, but there’s no case to be made against providing the ability to eliminate a tail call explicitly.
Dynamic programming (which can involve either mutual or self recursion) doesn’t necessarily (or even usually) involve tail calls. Ditto for many other applications of mutual recursion. (And mutually recursive tail calls can still be written as an imperative loop, albeit with a merged function body.)
But yes, the possibility of an explicit tail-call annotation has been raised before, e.g. for people who want to write in functional styles such as continuation passing.
And of course, this has nothing to do with the code the OP of this thread is concerned with, which isn’t a tail call.
You are right. It was the thread spawning. Seems I seriously miscalculated how many threads were going to be spawned.
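(Makes sense: since `task` is decremented on both recursive branches, the code spawns one task per internal node of a binary tree of depth `ntasks` — roughly 2^ntasks − 1 spawns, so ~65,535 at 16 rather than 16. Each `Threads.@spawn` heap-allocates a `Task`, which a minimal sketch like this makes visible:)

```julia
# Minimal sketch: every Threads.@spawn allocates a Task object (plus scheduler
# bookkeeping), even when the spawned body itself allocates nothing.
function spawn_many(k)
    for _ in 1:k
        t = Threads.@spawn nothing
        wait(t)
    end
end

spawn_many(1)                        # warm-up so compilation isn't measured
allocs = @allocated spawn_many(100)  # nonzero: at least one allocation per spawn
```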
Many, even most, interesting dynamic programming techniques can be implemented to use tail calls. Schemers have made a specialty of this, for obvious reasons. Of course, if tail calls aren’t available in the first place, there’s no point.
At the limit this boils down to the well-known finding that any system meeting certain minimal standards (Turing completeness) can run any program.
Anything I do in Julia which uses multiple dispatch, for a less general example, I could write in a language which doesn’t have it. But it’s a very nice feature to have. When you talk about a merged function body, that’s manually writing a state machine with a bunch of labels and goto. Hardly a satisfactory solution. One might even say that writing efficient state machines which aren’t a single function containing a maze of gotos is the main value of tail call elimination!
Yeah I think that’s a great proposal… but it’s closed, and the only way it gets opened again is if people keep speaking up. Continuation-passing style is a choice way to write interpreters, and involves a tail call pretty much by definition. The alternatives are consuming the stack, or writing the entire interpreter as a single state machine (see above), and neither of those are good.
Or writing a bytecode interpreter with its own stack, but now we’re implementing a different program because we can’t write the one we want. Also no bueno.
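(The other classic workaround, for completeness, is a trampoline — a sketch with hypothetical names, and note the returned closures are themselves heap allocations, so it isn’t free: each function returns a thunk in place of its tail call, and a driver loop bounces until a non-function value appears, keeping stack depth constant.)

```julia
# Mutually recursive even/odd written trampoline-style: the tail calls
# become zero-argument closures instead of direct calls.
is_even(n) = n == 0 ? true  : () -> is_odd(n - 1)
is_odd(n)  = n == 0 ? false : () -> is_even(n - 1)

# Driver: keep invoking thunks until the result is no longer callable.
function trampoline(x)
    while x isa Function
        x = x()
    end
    return x
end
```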
I was a bit confused where the bit about TCO came from in the first place, yes. But here we are. Julia certainly stack allocates memory in recursion, even when it doesn’t have to, and with no way of preventing it other than to not use recursion.
Anyway. Back to your regularly scheduled, um, program.