Why is sorting matrices allocating and why is it sometimes slower than the naive approach?

I noticed that sort! is allocating here:

x = rand(10, 10);
@time sort!(x, dims=1);
#  0.000016 seconds (20 allocations: 960 bytes)

And I’m not sure why. Here is the investigation I’ve done:

using BenchmarkTools

function f(x::Matrix)
    for i in axes(x,2)
        sort!(@view x[:,i])
    end
    x
end
function g(x::Matrix)
    sort!(x, dims=1)
end

r = rand(1000, 1000);
r2 = copy(r);
f(r) == g(r2)

@btime f(x) setup=(x=rand(1000,1000)); # 39.049 ms (0 allocations: 0 bytes)
@btime g(x) setup=(x=rand(1000,1000)); # 38.280 ms (2000 allocations: 93.75 KiB)

# and worse:
@btime f(x) setup=(x=rand(10,100_000)); #  12.799 ms (0 allocations: 0 bytes)
@btime g(x) setup=(x=rand(10,100_000)); #  16.860 ms (200000 allocations: 9.16 MiB)

Why does the builtin method allocate and what are the runtime implications?

@edit sort!(rand(1000,1000), dims=1) reveals:

function sort!(A::AbstractArray;
               dims::Integer,
               alg::Algorithm=defalg(A),
               lt=isless,
               by=identity,
               rev::Union{Bool,Nothing}=nothing,
               order::Ordering=Forward)
    ordr = ord(lt, by, rev, order)
    nd = ndims(A)
    k = dims

    1 <= k <= nd || throw(ArgumentError("dimension out of range"))

    remdims = ntuple(i -> i == k ? 1 : size(A, i), nd)
    for idx in CartesianIndices(remdims)
        Av = view(A, ntuple(i -> i == k ? Colon() : idx[i], nd)...)
        sort!(Av, alg, ordr)
    end
    A
end
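The problem appears to be in that ntuple call: the index tuple depends on the runtime value of dims, so the concrete view type cannot be inferred. A minimal sketch of the same pattern (makeview is a hypothetical helper, not from Base) showing that the return type depends on the value, not the type, of k:

```julia
# Minimal sketch of the instability (makeview is hypothetical, not from Base):
# because `k` is only known at runtime, the element type of `inds` is inferred
# as Union{Colon, Int64}, and the concrete SubArray type cannot be determined.
function makeview(A, k::Integer, idx)
    inds = ntuple(i -> i == k ? Colon() : idx[i], ndims(A))
    view(A, inds...)
end

A = rand(3, 3)
v1 = makeview(A, 1, (0, 2))  # view(A, :, 2) -- a column
v2 = makeview(A, 2, (2, 0))  # view(A, 2, :) -- a row
typeof(v1) == typeof(v2)     # false: return type depends on the *value* of k
```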

I don’t know how sort! is implemented, but just because something allocates doesn’t mean it has to be slow.

2 Likes

True, but unexpected allocations should be taken seriously when benchmarking, and I can’t see how it makes sense to allocate in this context.

I also refined my benchmarks to highlight the problem:

@btime f(x) setup=(x=rand(10,100_000)); #  12.799 ms (0 allocations: 0 bytes)
@btime g(x) setup=(x=rand(10,100_000)); #  16.860 ms (200000 allocations: 9.16 MiB)

(also added to OP)

Juno profiler shows allocations due to dynamic dispatch

sort!(Av, alg, ordr)

Surprising if it is intended.

But if it didn’t allocate, it might be faster ;)

1 Like

That’s useful profiling detail! Do you happen to know of a way to get those results in Atom or with a bare Julia REPL?

Here is what I used in Juno/Atom:

using Profile
Profile.clear()
g(rand(1000, 1000))
@profile for i in 1:1000; g(rand(1000, 1000)); end
Juno.profiler()

There is also

Profile.print()

but I still have to learn how to detect dynamic dispatch in the console output (it seems possible that the profile GUI uses something like @code_warntype internally?)
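Without the GUI, one option is to look at inference directly with @code_warntype from a bare REPL; non-concrete types such as Union{Colon, Int64} are highlighted and are a strong hint that the call using them will be dynamically dispatched. A minimal sketch, using the g from above:

```julia
using InteractiveUtils  # exports @code_warntype (loaded by default in the REPL)

g(x) = sort!(x; dims=1)

# Non-concrete (Union / Any) annotations in the printed body point at values
# whose types depend on runtime information, such as `dims` here.
@code_warntype g(rand(4, 4))
```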

Thanks so much @georch! The Juno.profiler() GUI is super cool; it might make my benchmarking life much more fun (I forgot for a moment that I am using Juno within Atom). Is the color coding of the flame graph the same as the color coding of @code_warntype? I have yet to spot how to attribute allocations to dynamic dispatch using the Juno flame graph. Could you explain this to me, or point me to a manual if you’d rather do that instead?

Staying centered on the original topic: it seems like there is a performance problem with Base’s multidimensional array sorting, and it seems plausible that it is easy to fix. (I suspect this issue is still present on master, but I don’t run the nightlies at the moment, so I can’t check.)

Juno profiler docs.

Thanks for the link. It seems to me that the original call to sort! was dynamically dispatched (because I called it with a global variable as an argument), but I don’t think that establishes a causal relationship between that dispatch and the per-column allocations inside the loop.

Still think we can do better, though!

Or slower :grinning:

This is all well and good; I think we all know that allocations ≠ time.

However, when it comes to unexpected memory allocations, we have “Unexpected memory allocation is almost always a sign of some problem with your code, usually a problem with type-stability or creating many small temporary arrays. Consequently, in addition to the allocation itself, it’s very likely that the code generated for your function is far from optimal. Take such indications seriously and follow the advice below.”

In this case we have benchmarking results that confirm that a version that does not allocate is, unsurprisingly, faster.

I found the type instability. dims is a runtime value, so the index tuple used to construct the view is not inferrable (it could be a Tuple{Colon, Int} or a Tuple{Int, Colon}). @code_warntype shows this: Base.Sort.ntuple(%35, %36)::Tuple{Union{Colon, Int64}, Union{Colon, Int64}}.

Fundamentally, I think we can only fix this instability with compile-time information about dims. However! I think we should be able to pull the dynamic dispatch outside of the loop, which will make its impact negligible.
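For illustration, here is one possible shape of such a fix (a sketch under my own assumptions, not Base’s implementation): lift dims into the type domain with Val, so the single dynamic dispatch happens once, outside the loop, and the per-slice sort! calls can be statically dispatched.

```julia
# Sketch of a possible fix (not Base's actual implementation): with `dims`
# moved into the type domain via Val, `k` is a compile-time constant inside
# sortdim!, the index tuple has a concrete type, and the inner sort! calls
# no longer need dynamic dispatch.
function sortdim!(A::AbstractArray, ::Val{k}) where {k}
    remdims = ntuple(i -> i == k ? 1 : size(A, i), ndims(A))
    for idx in CartesianIndices(remdims)
        Av = view(A, ntuple(i -> i == k ? Colon() : idx[i], ndims(A))...)
        sort!(Av)
    end
    A
end

# The one dynamic dispatch (on the value of dims) happens here, outside the loop.
sortdim!(A::AbstractArray; dims::Integer) = sortdim!(A, Val(dims))
```

This should give the same result as sort!(A, dims=k); the trade-off is that a separate method gets compiled for each dims value used.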

1 Like