Why does sort! allocate inside loops when the input is large?

Hi,

I have noticed that sort! allocates when called repetitively, but I am not sure why since I did not expect it to allocate:

v=rand(1000);
@btime sort!(v)
#  1.744 μs (0 allocations: 0 bytes)

In my code, I need to analyze columns of a large matrix of data. Typically, it is better to sort each column first before doing anything (like calculating quantiles, medians,…). Here’s a MWE

function quant(U)
    d,n=size(U)
    v=zeros(d)
    quants=zeros(n)
    for j=1:n
        copy!(v,view(U,:,j))
        sort!(v)
        quants[j]=quantile!(v,.2;sorted=true)
    end
    return quants
end

However, this function allocates and it is seems that the culprit is the sort!(v) line (without it, there are only 5-6 allocations).

U=rand(1000,1000)
@btime quant($U)
# 20.244 ms (6006 allocations: 9.82 MiB)

Curiously enough, when size(U,1) is small, there are no allocations

U=rand(10,1000);
@btime quant($U)
#  160.864 μs (5 allocations: 8.02 KiB)

Any ideas on why this is happening?

1 Like

TLDR is that Julia uses an algorithm called Scratch quick sort that uses an allocation in order to be both faster than regular quicksort and stable. See https://www.youtube.com/watch?v=RIhCBTx5TYA for details (this also might be a case where we use Radix sort which also allocates). We have an API for passing in a scratch buffer to sort but I’m not sure if that is part of the public interface. @Lilith can you confirm?

8 Likes

Specifically, defining:

function quant(U)
    d,n=size(U)
    v=zeros(d)
    quants=zeros(n)
    scratch = similar(quants)
    for j=1:n
        copy!(v,view(U,:,j))
        sort!(v; scratch)
        quants[j]=quantile!(v,.2;sorted=true)
    end
    return quants
end

deletes half of the allocations and is 25% faster. (the other half come from a line in Radix sort that has the comment # TODO use scratch for this

1 Like

Thanks for sharing. I notice that Julia is not using ScratchQuickSort by default (at least for the example U=rand(1000,1000)), but when when I specify the algorithm, the allocations disappear. I will include some benchmarks below in case someone stumbles upon this thread in the future.

function quant(U,alg)
    d,n=size(U)
    v=zeros(d)
    quants=zeros(n)
    scratch=zeros(d)
    for j=1:n
        copy!(v,view(U,:,j))
        sort!(v;alg=alg,scratch=scratch)

        quants[j]=quantile!(v,.2;sorted=true)
    end
    return quants
end
U=rand(1000,1000);
@btime quant($U,$(Base.Sort.ScratchQuickSort()));
#  42.819 ms (9 allocations: 23.65 KiB)

This is faster than the regular QuickSort

@btime quant($U,$(QuickSort))
# 55.677 ms (9 allocations: 23.65 KiB)

But it is slower than the DefaultStable algorithm (which alllocates)

@btime quant($U,$(Base.Sort.DefaultStable()));
# 16.796 ms (3009 allocations: 2.14 MiB)

As far as I can tell, there is a trade-off between allocations and speed,

Yep, this is all correct. And no scratch is not part of the public interface, largely because the appropriate eltype for the scratch array is non-obvious.

3 Likes

In this particular case it is better to not sort first, and just leave it to the quantile! function (which can make do with a partial sort.) This might be different in a more realistic scenario, perhaps.

1 Like

I couldn’t find any sorting functions that do sort!(output, input). That would be a good solution in many cases, I think, and would allow you to do

sort!(output, view(U, :, j))

or similar with partialsort!