# Why is sorting matrices allocating and why is it sometimes slower than the naive approach?

I noticed that `sort!` is allocating here:

```julia
x = rand(10, 10);
@time sort!(x, dims=1);
#  0.000016 seconds (20 allocations: 960 bytes)
```

And I’m not sure why. Here is the investigation I’ve done:

```julia
using BenchmarkTools

function f(x::Matrix)
    for i in axes(x, 2)
        sort!(@view x[:, i])
    end
    x
end

function g(x::Matrix)
    sort!(x, dims=1)
end

r = rand(1000, 1000);
r2 = copy(r);
f(r) == g(r2)

@btime f(x) setup=(x=rand(1000,1000)); # 39.049 ms (0 allocations: 0 bytes)
@btime g(x) setup=(x=rand(1000,1000)); # 38.280 ms (2000 allocations: 93.75 KiB)

# and worse:
@btime f(x) setup=(x=rand(10,100_000)); #  12.799 ms (0 allocations: 0 bytes)
@btime g(x) setup=(x=rand(10,100_000)); #  16.860 ms (200000 allocations: 9.16 MiB)
``````

Why does the builtin method allocate and what are the runtime implications?

`@edit sort!(rand(1000,1000), dims=1)` reveals:

```julia
function sort!(A::AbstractArray;
               dims::Integer,
               alg::Algorithm=defalg(A),
               lt=isless,
               by=identity,
               rev::Union{Bool,Nothing}=nothing,
               order::Ordering=Forward)
    ordr = ord(lt, by, rev, order)
    nd = ndims(A)
    k = dims

    1 <= k <= nd || throw(ArgumentError("dimension out of range"))

    remdims = ntuple(i -> i == k ? 1 : size(A, i), nd)
    for idx in CartesianIndices(remdims)
        Av = view(A, ntuple(i -> i == k ? Colon() : idx[i], nd)...)
        sort!(Av, alg, ordr)
    end
    A
end
```
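To make the quoted loop concrete: for a matrix and `dims = 1`, `remdims` becomes `(1, size(A, 2))`, so the loop visits one `CartesianIndex` per column and sorts a column view each time. A sketch of the same logic with the `ntuple` expanded by hand:

```julia
A = rand(4, 3)
# dims = 1  =>  remdims = (1, 3); the loop visits CartesianIndex(1, 1),
# (1, 2), and (1, 3), each selecting one column.
for idx in CartesianIndices((1, size(A, 2)))
    Av = view(A, :, idx[2])   # what ntuple(i -> i == k ? Colon() : idx[i], 2) builds for k = 1
    sort!(Av)
end
@assert all(issorted, eachcol(A))
```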

I don’t know how `sort!` is implemented, but just because something allocates doesn’t mean it has to be slow.


True, but unexpected allocations should be taken seriously when benchmarking, and I can’t see how it makes sense to allocate in this context.

I also refined my benchmarks to highlight the problem:

```julia
@btime f(x) setup=(x=rand(10,100_000)); #  12.799 ms (0 allocations: 0 bytes)
@btime g(x) setup=(x=rand(10,100_000)); #  16.860 ms (200000 allocations: 9.16 MiB)
```

(also added to OP)

The Juno profiler attributes the allocations to dynamic dispatch at this line:

```julia
sort!(Av, alg, ordr)
```

Surprising if it is intended.

But if it didn’t allocate, it might be faster ;)


That’s useful profiling detail! Do you happen to know of a way to get those results in Atom or with a bare Julia REPL?

Here is what I used in Juno/Atom:

```julia
using Profile
Profile.clear()
g(rand(1000, 1000))   # run once first so compilation is not profiled
@profile for i in 1:1000; g(rand(1000, 1000)); end
Juno.profiler()
```

There is also

```julia
Profile.print()
```

but I still have to learn how to detect dynamic dispatch in the console output (it seems possible that the profile GUI uses something like `@code_warntype` internally?)
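For what it's worth, on a bare REPL the usual combination is `Profile.print` for a textual call tree and `@code_warntype` for spotting the non-concrete types that cause dynamic dispatch (printed in red in a color terminal). A sketch of how that could look here:

```julia
using Profile, InteractiveUtils

g(x::Matrix) = sort!(x, dims=1)

g(rand(10, 10))                            # warm up so compilation is not profiled
Profile.clear()
@profile for _ in 1:100; g(rand(1000, 1000)); end
Profile.print(format=:tree, maxdepth=15)   # textual call tree with sample counts

# Inference view of the call; look for Union/Any-typed lines
@code_warntype sort!(rand(10, 10), dims=1)
```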

Thanks so much @georch! The `Juno.profiler()` GUI is super cool; it might make my benchmarking life much more fun. I forgot for a moment that I am using Juno within Atom. Is the color coding of the flame graph the same as the color coding of `@code_warntype`? I have yet to spot how we can attribute allocations to dynamic dispatch using the Juno flame graph. Could you explain this to me (or refer me to a manual if you’d rather do that instead)?

Staying centered on the original topic: it seems there is a performance problem with Base’s multidimensional array sorting, and it seems plausible that it is easy to fix. (I suspect the issue is still present on master, but I don’t currently run the nightlies, so I can’t say for sure.)

Thanks for the link. It seems to me that the original call to sort was dynamically dispatched (because I called it with a global variable as an argument), but I don’t think that means we have a causal relationship here.

Still think we can do better, though!

Or slower

This is all well and good; I think we all know that allocations ≠ time.

In this case we have benchmarking results that confirm that a version that does not allocate is, unsurprisingly, faster.

I found the type instability. `dims` is a dynamic parameter, which means the index used to construct the view is not inferable (it could be a `Tuple{Colon, Int}` or a `Tuple{Int, Colon}`). `@code_warntype` shows this: `Base.Sort.ntuple(%35, %36)::Tuple{Union{Colon, Int64}, Union{Colon, Int64}}`.
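One way to see the same instability without a profiler (a sketch; `make_index` is a hypothetical stand-in for the `ntuple` call in the Base source):

```julia
# Stand-in for the view-index construction in Base's sort!(A; dims):
# `k` is a plain runtime Int, so inference cannot decide which positions
# hold Colon() and which hold an Int.
make_index(k, idx, nd) = ntuple(i -> i == k ? Colon() : idx[i], nd)

t = make_index(1, CartesianIndex(2, 3), 2)
@assert t === (Colon(), 3)   # at runtime the value is fine

# Inference, however, only sees Union-typed (non-concrete) elements:
println(Base.return_types(make_index, Tuple{Int, CartesianIndex{2}, Int}))
```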

Fundamentally, I think we can only fix this instability with compile-time information about `dims`. However! I think we should be able to pull the dynamic dispatch outside of the loop, which will make its impact negligible.
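To sketch what pulling the dispatch outside the loop could look like: lift `dims` into the type domain once with `Val`, so that inside the barrier function the view type is concrete. (The names `mysort_dims!` and `_mysort!` are illustrative, not anything in Base.)

```julia
function mysort_dims!(A::AbstractArray; dims::Integer)
    1 <= dims <= ndims(A) || throw(ArgumentError("dimension out of range"))
    _mysort!(A, Val(Int(dims)))   # the one dynamic dispatch, outside the loop
end

function _mysort!(A::AbstractArray{<:Any,N}, ::Val{K}) where {N,K}
    remdims = ntuple(i -> i == K ? 1 : size(A, i), Val(N))
    for idx in CartesianIndices(remdims)
        # K is now a compile-time constant, so this index tuple has a concrete type
        Av = view(A, ntuple(i -> i == K ? Colon() : idx[i], Val(N))...)
        sort!(Av)
    end
    A
end

A = rand(100, 100)
@assert mysort_dims!(copy(A); dims=1) == sort(A; dims=1)
@assert mysort_dims!(copy(A); dims=2) == sort(A; dims=2)
```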
