Understanding allocation pattern when using sort!()

Looked a bit closer into the allocations of sort!(), prompted by the thread `sort!()` and `sort()` give me runtime dispatches, and noticed the following, which I’m trying to understand.

using BenchmarkTools

const N = 100

function makevector(n::Integer)::Vector{Int}
    rand(1:50, N)  # note: uses the global N, not the argument n
end

v1 = rand(1:50, N)
v2 = makevector(N)
w = randn(N)

@btime sort!($v1)
@btime sort!($v2)
@btime sort!($w)

I get these allocation patterns when doing multiple runs:

melis@juggle 09:13:~/examples/julia$ j sort_allocates.jl 
  331.107 ns (0 allocations: 0 bytes)
  271.924 ns (1 allocation: 448 bytes)
  357.009 ns (0 allocations: 0 bytes)
melis@juggle 09:13:~/examples/julia$ j sort_allocates.jl 
  345.598 ns (0 allocations: 0 bytes)
  346.390 ns (0 allocations: 0 bytes)
  343.257 ns (0 allocations: 0 bytes)
melis@juggle 09:14:~/examples/julia$ j sort_allocates.jl 
  356.360 ns (0 allocations: 0 bytes)
  365.132 ns (0 allocations: 0 bytes)
  359.219 ns (0 allocations: 0 bytes)
melis@juggle 09:14:~/examples/julia$ j sort_allocates.jl 
  367.053 ns (0 allocations: 0 bytes)
  339.612 ns (0 allocations: 0 bytes)
  358.938 ns (0 allocations: 0 bytes)
melis@juggle 09:14:~/examples/julia$ j sort_allocates.jl 
  326.630 ns (0 allocations: 0 bytes)
  336.044 ns (0 allocations: 0 bytes)
  335.163 ns (0 allocations: 0 bytes)
melis@juggle 09:14:~/examples/julia$ j sort_allocates.jl 
  376.132 ns (0 allocations: 0 bytes)
  339.516 ns (0 allocations: 0 bytes)
  347.260 ns (0 allocations: 0 bytes)
melis@juggle 09:14:~/examples/julia$ j sort_allocates.jl 
  325.187 ns (0 allocations: 0 bytes)
  273.010 ns (1 allocation: 448 bytes)
  344.654 ns (0 allocations: 0 bytes)
melis@juggle 09:14:~/examples/julia$ j sort_allocates.jl 
  343.237 ns (0 allocations: 0 bytes)
  272.489 ns (1 allocation: 448 bytes)
  357.061 ns (0 allocations: 0 bytes)
melis@juggle 09:14:~/examples/julia$ j sort_allocates.jl 
  333.879 ns (0 allocations: 0 bytes)
  357.373 ns (0 allocations: 0 bytes)
  343.618 ns (0 allocations: 0 bytes)

So only the variant using makevector() allocates in certain runs. Is the allocation related to the use of N inside makevector(), or to something else?

I did find this blog post detailing the choice of sorting algorithm based on element type, but all the element types here are numeric, so the (non-allocating) QuickSort should be used, right?

julia> versioninfo()
Julia Version 1.7.3
Commit 742b9abb4d (2022-05-06 12:58 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i5-4460  CPU @ 3.20GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, haswell)

A const global is type-stable and sometimes constant-propagated, so it should not be a problem.
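
A quick way to see this (the function and variable names here are mine, for illustration):

using InteractiveUtils  # for @code_warntype outside the REPL

const N_CONST = 100
N_NONCONST = 100

f_const() = rand(1:50, N_CONST)
f_nonconst() = rand(1:50, N_NONCONST)

@code_warntype f_const()     # Body::Vector{Int64} -- inferred via the const global
@code_warntype f_nonconst()  # Body::Any -- the non-const global defeats inference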

AFAIK, sort!($v1) and sort!($v2) should not be meaningfully different. It is legitimately bizarre that only some runs show that 1 allocation and a shorter minimum runtime. I’m able to replicate this behavior by rerunning lines in the REPL, sometimes on the v1 run and sometimes on the v2 run.

It appears to depend on the vector’s random values. When either v1 or v2 is sort!ed with 1 allocation, I tried rerunning only the @btime lines, and the same vector causes the 1 allocation every time. I couldn’t tell you why, though; I never learned to use a debugger. I checked length(unique(v1)) and length(unique(v2)) and saw no pattern in when the 1 allocation shows up.
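
One way to make the data-dependence reproducible (my suggestion, not something done in the thread): seed the RNG so the same vectors come back on every run.

using Random, BenchmarkTools
Random.seed!(1234)  # arbitrary seed, for illustration
v = rand(1:50, 100)
@btime sort!($v)    # identical data every run, so the allocation count is stable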

I’m almost certain it’s a coincidence.

Because you’re generating random vectors, sort! occasionally hits sort_int_range!, which is indeed allocating, at the benefit of speed.


Ah, so it is data-dependent. I didn’t think that would be a sensible explanation. Nice find.

That’s what made me check length(unique(v1)) and length(unique(v2)) across runs, but I don’t see a pattern of the 1 allocation happening for “vectors of few unique integers”, as the comment over sort_int_range! suggests. Sometimes a vector with 47 unique elements will allocate while a vector with 43 will not, and sometimes a vector with 41 will allocate while a vector with 44 will not. But maybe I’m just missing something; the condition !o1 && !o2 && rangelen < div(n,2) does seem more complicated than just “few unique integers” at first glance.

It’s about the range in conjunction with the vector size. If extrema(v) == (1, 50) and N == 100, then rangelen == 50, which is not less than div(100, 2) == 50, so the branch is missed and the plain sort! is used, which does not allocate.

On the other hand, set N = 102 and sort_int_range! is hit every time, since rangelen can be at most 50, which is less than div(102, 2) == 51. Make N smaller and the chances of hitting that branch diminish.
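
For reference, a minimal sketch of that condition (the helper name hits_counting_branch is mine, not Base’s; the real check lives inside Base’s sort! for integer vectors):

using Base.Checked: sub_with_overflow, add_with_overflow

# Sketch of the dispatch heuristic discussed above: take the counting-sort
# branch only when the value range is small relative to the vector length.
function hits_counting_branch(v::AbstractVector{<:Integer})
    n = length(v)
    mn, mx = extrema(v)
    diff, o1 = sub_with_overflow(mx, mn)
    rangelen, o2 = add_with_overflow(diff, oneunit(diff))
    return !o1 && !o2 && rangelen < div(n, 2)
end

hits_counting_branch(rand(1:50, 100))  # false only when extrema == (1, 50); hence the occasional allocation
hits_counting_branch(rand(1:50, 102))  # always true: rangelen <= 50 < div(102, 2) == 51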


Well, that probably explains it, unless someone debugs this and finds something else. I don’t like it, though. I know the !-named functions conventionally promise to modify their arguments, not to allocate minimally. But eliminating allocations is a major reason people reach for ! functions, and it’s strange to have something that is 0 allocations only sometimes. The good news is that it looks like counts is fully contained within the method, so the work on escape analysis might mitigate that allocation one day (it could easily be too big to go on the stack entirely, but it could at least be freed upon return so the GC doesn’t need to work on it).

Curious, what is the counts you are referring to? I thought it was the where = fill(0, rangelen) in sort_int_range! that caused the allocation?

Ah my bad, I’m looking at master, where the line in sort_int_range! is written counts = fill(0, rangelen).
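
For anyone following along, a toy version of the idea (a sketch in the spirit of sort_int_range!, not the actual Base code):

function toy_count_sort!(v::Vector{Int}, mn::Int, rangelen::Int)
    counts = fill(0, rangelen)  # the single allocation BenchmarkTools reports;
                                # for rangelen == 50 that is 50 Ints, ~400 bytes
                                # of data, roughly the 448 bytes seen above
    for x in v
        counts[x - mn + 1] += 1
    end
    i = 1
    for (offset, c) in enumerate(counts)
        val = mn + offset - 1
        for _ in 1:c            # write each value back c times, in order
            v[i] = val
            i += 1
        end
    end
    return v
end

v = rand(1:50, 102)
mn, mx = extrema(v)
toy_count_sort!(v, mn, mx - mn + 1)
issorted(v)  # true

Everything besides the fill call works in place, which is why eliding that one allocation makes the whole routine allocation-free.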


Good news! On a somewhat recent master (Commit 4bb6528bc0 (2022-05-09 09:13 UTC)), count sort is almost three times as fast and allocation-free :tada: even though the implementation is identical. It seems the compiler has become quite a bit smarter.
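
If anyone wants to reproduce the comparison on their own build, a quick check (my snippet) using a vector that always takes the counting-sort branch, per the rangelen < div(n, 2) discussion above:

using BenchmarkTools
v = rand(1:50, 102)  # always hits sort_int_range! (see above)
@btime sort!(x) setup = (x = copy($v)) evals = 1  # each sample sorts a fresh copy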


I read that eliding allocations usually happens by moving something from the heap to the stack, but surely that cannot be done for a very large counts array. Maybe it allocates if the array gets too big? Or maybe an allocation is only counted if it has to be handled by the GC?

The latter is probably most likely. Since the array is only used locally and doesn’t contain any references, it’s easy to generate code that allocates and frees it within the function. Not sure whether that falls under the definition of “allocation” as far as the profiling goes, though.

I really wish it were documented in detail what @time or @benchmark counts as an allocation; it’s not even stated explicitly that only heap allocations (as opposed to stack allocations) are counted. Also, people try to eliminate allocations for at least a couple of reasons: 1) reducing or eliminating GC latency, and 2) eliminating nondeterministic malloc/free. So it’s not ideal to be unsure how many of the allocations are being saddled on the GC, and important information is obscured if heap allocations freed by the compiler, rather than by the GC, aren’t counted.
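
As a small illustration of why the distinction matters (my example, not from the docs): @allocated reports GC-tracked heap allocations, so an isbits value that never touches the heap reports zero.

make_tuple() = (1, 2, 3)    # isbits tuple: stack/registers, no heap
make_array() = [1, 2, 3]    # Vector: heap-allocated and GC-tracked

make_tuple(); make_array()  # warm up so compilation isn't measured
@allocated make_tuple()     # 0
@allocated make_array()     # nonzero: the Vector's heap allocation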