Why is sorting matrices allocating and why is it sometimes slower than the naive approach?

I noticed that sort! is allocating here:

x = rand(10, 10);
@time sort!(x, dims=1);
#  0.000016 seconds (20 allocations: 960 bytes)

And I’m not sure why. Here is the investigation I’ve done:

using BenchmarkTools

function f(x::Matrix)
    for i in axes(x,2)
        sort!(@view x[:,i])
    end
    x
end
function g(x::Matrix)
    sort!(x, dims=1)
end

r = rand(1000, 1000);
r2 = copy(r);
f(r) == g(r2)

@btime f(x) setup=(x=rand(1000,1000)); # 39.049 ms (0 allocations: 0 bytes)
@btime g(x) setup=(x=rand(1000,1000)); # 38.280 ms (2000 allocations: 93.75 KiB)

# and worse:
@btime f(x) setup=(x=rand(10,100_000)); #  12.799 ms (0 allocations: 0 bytes)
@btime g(x) setup=(x=rand(10,100_000)); #  16.860 ms (200000 allocations: 9.16 MiB)

Why does the builtin method allocate and what are the runtime implications?

@edit sort!(rand(1000,1000), dims=1) reveals:

function sort!(A::AbstractArray;
               dims::Integer,
               alg::Algorithm=defalg(A),
               lt=isless,
               by=identity,
               rev::Union{Bool,Nothing}=nothing,
               order::Ordering=Forward)
    ordr = ord(lt, by, rev, order)
    nd = ndims(A)
    k = dims

    1 <= k <= nd || throw(ArgumentError("dimension out of range"))

    remdims = ntuple(i -> i == k ? 1 : size(A, i), nd)
    for idx in CartesianIndices(remdims)
        Av = view(A, ntuple(i -> i == k ? Colon() : idx[i], nd)...)
        sort!(Av, alg, ordr)
    end
    A
end
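The problem appears to be in that ntuple call: the index tuple depends on the runtime value of dims, so the concrete view type cannot be inferred. A minimal sketch of the same pattern (makeview is a hypothetical helper, not from Base) showing that the return type depends on the value, not the type, of k:

```julia
# Minimal sketch of the instability (makeview is hypothetical, not from Base):
# because `k` is only known at runtime, the element type of `inds` is inferred
# as Union{Colon, Int64}, and the concrete SubArray type cannot be determined.
function makeview(A, k::Integer, idx)
    inds = ntuple(i -> i == k ? Colon() : idx[i], ndims(A))
    view(A, inds...)
end

A = rand(3, 3)
v1 = makeview(A, 1, (0, 2))  # view(A, :, 2) -- a column
v2 = makeview(A, 2, (2, 0))  # view(A, 2, :) -- a row
typeof(v1) == typeof(v2)     # false: return type depends on the *value* of k
```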

I don’t know how sort! is implemented, but just because something allocates doesn’t mean it has to be slow.

2 Likes

True, but unexpected allocations should be taken seriously when benchmarking, and I can’t see how it makes sense to allocate in this context.

I also refined my benchmarks to highlight the problem:

@btime f(x) setup=(x=rand(10,100_000)); #  12.799 ms (0 allocations: 0 bytes)
@btime g(x) setup=(x=rand(10,100_000)); #  16.860 ms (200000 allocations: 9.16 MiB)

(also added to OP)

Juno profiler shows allocations due to dynamic dispatch

sort!(Av, alg, ordr)

Surprising if it is intended.

But if it didn’t allocate, it might be faster ;)

1 Like

That’s useful profiling detail! Do you happen to know of a way to get those results in Atom or with a bare Julia REPL?

Here is what I used in Juno/Atom:

using Profile
Profile.clear()
g(rand(1000, 1000))
@profile for i in 1:1000; g(rand(1000, 1000)); end
Juno.profiler()

There is also

Profile.print()

but I still have to learn how to detect dynamic dispatch in the console output (it seems possible that the profile GUI uses something like @code_warntype internally?)
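Without the GUI, one option is to look at inference directly with @code_warntype from a bare REPL; non-concrete types such as Union{Colon, Int64} are highlighted and are a strong hint that the call using them will be dynamically dispatched. A minimal sketch, using the g from above:

```julia
using InteractiveUtils  # exports @code_warntype (loaded by default in the REPL)

g(x) = sort!(x; dims=1)

# Non-concrete (Union / Any) annotations in the printed body point at values
# whose types depend on runtime information, such as `dims` here.
@code_warntype g(rand(4, 4))
```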

Thanks so much @georch! The Juno.profiler() GUI is super cool; it might make my benchmarking life much more fun (I forgot for a moment that I am using Juno within Atom). Is the color coding of the flame graph the same as the color coding of @code_warntype? I have yet to spot how to attribute allocations to dynamic dispatch using the Juno flame graph. Could you explain this to me, or point me to a manual if you’d rather do that instead?

Staying centered on the original topic: it seems like there is a performance problem with Base’s multidimensional array sorting, and it seems plausible that it is easy to fix. (I suspect this issue is still present on master, but I don’t run the nightlies at the moment, so I can’t check.)

Juno profiler docs.

Thanks for the link. It seems to me that the original call to sort! was dynamically dispatched (because I called it with a global variable as an argument), but I don’t think that establishes a causal relationship between that dispatch and the per-column allocations inside the loop.

Still think we can do better, though!

Or slower :grinning:

This is all well and good; I think we all know that allocations ≠ time.

However, when it comes to unexpected memory allocations, we have “Unexpected memory allocation is almost always a sign of some problem with your code, usually a problem with type-stability or creating many small temporary arrays. Consequently, in addition to the allocation itself, it’s very likely that the code generated for your function is far from optimal. Take such indications seriously and follow the advice below.”

In this case we have benchmarking results that confirm that a version that does not allocate is, unsurprisingly, faster.

I found the type instability. dims is a runtime value, so the index tuple used to construct the view is not inferrable (it could be a Tuple{Colon, Int} or a Tuple{Int, Colon}). @code_warntype shows this: Base.Sort.ntuple(%35, %36)::Tuple{Union{Colon, Int64}, Union{Colon, Int64}}.

Fundamentally, I think we can only fix this instability with compile-time information about dims. However! I think we should be able to pull the dynamic dispatch outside of the loop, which will make its impact negligible.
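For illustration, here is one possible shape of such a fix (a sketch under my own assumptions, not Base’s implementation): lift dims into the type domain with Val, so the single dynamic dispatch happens once, outside the loop, and the per-slice sort! calls can be statically dispatched.

```julia
# Sketch of a possible fix (not Base's actual implementation): with `dims`
# moved into the type domain via Val, `k` is a compile-time constant inside
# sortdim!, the index tuple has a concrete type, and the inner sort! calls
# no longer need dynamic dispatch.
function sortdim!(A::AbstractArray, ::Val{k}) where {k}
    remdims = ntuple(i -> i == k ? 1 : size(A, i), ndims(A))
    for idx in CartesianIndices(remdims)
        Av = view(A, ntuple(i -> i == k ? Colon() : idx[i], ndims(A))...)
        sort!(Av)
    end
    A
end

# The one dynamic dispatch (on the value of dims) happens here, outside the loop.
sortdim!(A::AbstractArray; dims::Integer) = sortdim!(A, Val(dims))
```

This should give the same result as sort!(A, dims=k); the trade-off is that a separate method gets compiled for each dims value used.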

1 Like