Allocation / performance for 2d array sorts

I’m just getting started with Julia so I’m trying to optimize every piece of code, not just for practicality, but to learn more about the language. I ran into an interesting case, where the number of allocation seems to have an inverse relationship with speed, and I cant really figure out how to optimize this code:

using Profile, BenchmarkTools

function double_sort1(base_arr::Array{Int64, 2})::Array{Int64, 2}
    n = size(base_arr)[1]
    arr::Array{Int64, 2} = Array{Int64}(undef, n, 4)
    p::Array{Int64, 1} =  Array{Int64}(undef, n)
    for (side, r) in enumerate([1:2, 3:4])
        sortperm!(p, base_arr[:, side])
        for (i_out, i_in) in enumerate(p)
            for (j_in, j_out) in enumerate(r)
                arr[i_out, j_out] = base_arr[i_in, j_in]
            end
        end
    end
    arr
end

function double_sort2(base_arr::Array{Int64, 2})::Array{Int64, 2}
    n = size(base_arr)[1]
    arr::Array{Int64, 2} = Array{Int64}(undef, n, 4)
    for (side, r) in enumerate([1:2, 3:4])
        for (i_out, i_in) in enumerate(sortperm(base_arr[:, side]))
            for (j_in, j_out) in enumerate(r)
                arr[i_out, j_out] = base_arr[i_in, j_in]
            end
        end
    end
    arr
end

function double_sort3(base_arr::Array{Int64, 2})::Array{Int64, 2}
    n = size(base_arr)[1]
    arr::Array{Int64, 2} = Array{Int64}(undef, n, 4)
    for (side, r) in enumerate([1:2, 3:4])
        arr[:, r] = base_arr[sortperm(base_arr[:, side]), :]
    end
    arr
end

function double_sort4(base_arr::Array{Int64, 2})::Array{Int64, 2}
    cat([base_arr[sortperm(base_arr[:, side]), :] for side in 1:2]...,dims=2)
end

a = rand(1:200000, (3_000_000, 2));
double_sort1(a) == double_sort2(a), double_sort2(a) == double_sort3(a), double_sort3(a) == double_sort4(a)
@btime double_sort1($a);
@btime double_sort2($a);
@btime double_sort3($a);
@btime double_sort4($a);

1.725 s (11 allocations: 160.22 MiB)
427.311 ms (15 allocations: 186.16 MiB)
417.247 ms (19 allocations: 277.71 MiB)
403.048 ms (55 allocations: 277.71 MiB)

1 Like

Here are some attempts…

double_sort5(a) = hcat(sortslices(a, dims=1, by=first), sortslices(a, dims=1, by=last));

function double_sort6(base)
    p1, p2 = sortperm(base[:,1]), sortperm(base[:,2])
    @inbounds hcat(base[p1,:], base[p2,:])
end

function double_sort7(base::AbstractMatrix)
    size(base,2) == 2 || error("expected two columns")
    arr = similar(base, size(base,1), 4)
    @inbounds for side in 1:2
        perm = sortperm(base[:, side])
        for row in axes(arr,1)
            pr = perm[row]
            arr[row, 2side-1] = base[pr, 1]
            arr[row, 2side]= base[pr, 2]
        end
    end
    arr
end

The last is very much like your second one, plus @inbounds, but perhaps more readable. Note that there is no need for type annotations, except for dispatch, or as the occasional reminder to yourself.

As you noticed, sortperm! seems to be very slow, I’m not sure why.

julia> @btime double_sort5(r)    setup=(r=rand(1:20_000, 300_000, 2));
  148.299 ms (38 allocations: 54.93 MiB)

julia> @btime double_sort6(r)    setup=(r=rand(1:20_000, 300_000, 2));
  9.652 ms (18 allocations: 27.77 MiB)

julia> @btime double_sort7(r)    setup=(r=rand(1:20_000, 300_000, 2));
  7.522 ms (14 allocations: 18.62 MiB)

julia> @btime sortperm(v)      setup=(v=rand(1:20_000, 300_000));
  1.711 ms (4 allocations: 2.44 MiB)

julia> @btime sortperm!(p, v)  setup=(v=rand(1:20_000, 300_000); p=similar(v)); # also surprising?
  23.541 ms (1 allocation: 16 bytes)
2 Likes

wow, thanks. this was very quick and very useful.

I did not deeply examine the code samples, but it may be related to the problems I have gone through to re-implement an heuristic which was mostly made of applying sortperm!. I had to gut sort! and implement a hacky _swap_permute! to get the best performance. This has come up recently in this Discourse thread, in which I link to my repository and the aforementioned code.

Also, I noticed that basically only 1 CPU is working during @btime. why is that?

If I understood this problem, the following implementation might not be the most efficient but might be easier to read:

function double_sort8(A::AbstractMatrix)
    p1 = sortperm(A[:,1])
    p2 = sortperm(A[:,2])
    [A[p1,1] A[p1,2] A[p2,1] A[p2,2]]
end

# or in one line:
[A[sortperm(A[:,1]),:] A[sortperm(A[:,2]),:]]

Returning the result as tuples instead of array would further improve allocation & performance.

PS: the original OP used n = 3_000_000