I have large arrays which I would like to sort along a single, small dimension, as in the example below. Any suggestions for increasing performance would be welcome.
using BenchmarkTools
a = rand(Float32, 1000, 1000, 30);
@btime sort!($a, dims = 3); # => 517.336 ms (5934000 allocations: 197.36 MiB)
Yes, but since the dimensions come in the given order, the time for the permutations needs to be added as well. Besides, as the sort operation is just one step in the data manipulation, the permutations are somewhat inconvenient. I was thinking more in the direction of how to make use of all cores.
julia> @btime permutedims($a, [3,1,2]);
60.995 ms (7 allocations: 114.44 MiB)
julia> @btime permutedims($a, [3,2,1]);
243.528 ms (7 allocations: 114.44 MiB)
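For completeness, here is a sketch (my own illustration, not from the thread) of the full permute-sort-permute-back round trip those timings refer to; the second permutation, restoring the original layout, is the extra cost mentioned above:
b = permutedims(a, (3, 1, 2))   # copy with the small dimension first (30×1000×1000)
sort!(b, dims = 1)              # sorting along the first, contiguous dimension
permutedims!(a, b, (2, 3, 1))   # permute back into the original 1000×1000×30 layout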
Maybe worth noting, if you do permute, that it’s possible to beat Base’s permutedims. There is some clever code for this in Strided.jl, which is (in part) the back-end for TensorOperations.jl, and one way to access it looks like this:
julia> a = rand(Float32, 1000, 1000, 30);
julia> @btime permutedims($a, (3,1,2));
20.933 ms (2 allocations: 114.44 MiB)
julia> using TransmuteDims
julia> @btime transmutedims($a, (3,1,2));
7.905 ms (64 allocations: 114.45 MiB)
Note also that timing sort! like this might be misleading, as all runs but the first will get a pre-sorted input.
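One way around that (a sketch using BenchmarkTools’ setup and evals options) is to hand the benchmark a fresh copy on every evaluation, so sort! never sees its own output:
julia> b = rand(Float32, 1000, 1000, 30);
julia> @btime sort!(c, dims = 3) setup = (c = copy($b)) evals = 1;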
My real data arrays have one more dimension, as in the example below, where I also try multi-threading.
function sort2!(d)
    Threads.@threads for k in 1:size(d,4)
        @inbounds for j in 1:size(d,2)
            @inbounds for i in 1:size(d,1)
                d[i,j,:,k] = sort(d[i,j,:,k])  # copies the slice, sorts the copy, writes it back
            end
        end
    end
end
a = rand(Float32, 1000, 1000, 30, 60);
@time sort!(a, dims = 3);
# 70.350123 seconds (417.71 M allocations: 12.555 GiB, 1.05% gc time, 1.21% compilation time)
a = rand(Float32, 1000, 1000, 30, 60);
@time sort2!(a);
# 24 threads: 27.988482 seconds (120.08 M allocations: 23.250 GiB, 80.31% gc time, 0.05% compilation time)
# 12 threads: 22.279251 seconds (120.42 M allocations: 23.270 GiB, 61.47% gc time, 0.14% compilation time)
# 8 threads: 21.185633 seconds (120.42 M allocations: 23.270 GiB, 50.91% gc time, 0.19% compilation time)
# 4 threads: 26.546660 seconds (120.42 M allocations: 23.270 GiB, 29.08% gc time, 1.33% compilation time)
# 2 threads: 38.829076 seconds (120.42 M allocations: 23.270 GiB, 11.74% gc time, 0.10% compilation time)
# 1 thread : 45.502133 seconds (120.00 M allocations: 23.246 GiB, 2.21% gc time)
Is this a safe way of using multi-threading? Further, the performance does not seem to scale well with the number of threads. For each experiment, Julia is restarted with julia -t <no. of threads>. The code is run with Julia 1.6.1 on a machine with two Intel Xeon E5-2650 v4 2.20 GHz CPUs, each with 12 physical cores.
On my computer this code (which is the same as what @lmiq suggests)
function mysort3!(a::AbstractArray{<:Any, 4})
    Threads.@threads for k in axes(a, 4)
        for j in axes(a, 2)
            for i in axes(a, 1)
                sort!(view(a, i, j, :, k))
            end
        end
    end
    return a
end
gives more than 8x speedup for 8 threads.
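As a quick sanity check (my own addition, with a smaller array so it runs quickly), the threaded version gives the same result as sort! with dims = 3:
b = rand(Float32, 100, 100, 30, 8);
c = copy(b);
mysort3!(b);
sort!(c, dims = 3);
b == c  # true: each (i, j, :, k) slice is written by exactly one thread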
Unfortunately, I cannot find a way to select which dimension to sort, since there is apparently no generalization of eachrow and eachcol. eachslice and selectdim do not have the right behaviour. But hardcoding the dimension works.
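One way to avoid hardcoding the dimension (a sketch, not from the thread; sortdim! is a name I made up) is to split the remaining axes into the parts before and after the sorted dimension and loop over their CartesianIndices:
function sortdim!(a::AbstractArray, dim::Integer)
    pre  = CartesianIndices(axes(a)[1:dim-1])    # axes before the sorted dimension
    post = CartesianIndices(axes(a)[dim+1:end])  # axes after the sorted dimension
    Threads.@threads for Ipost in post
        for Ipre in pre
            # a CartesianIndex splats into the view, so this selects a[i..., :, k...]
            sort!(view(a, Ipre, :, Ipost))
        end
    end
    return a
end

sortdim!(a, 3)  # same result as the hardcoded mysort3! above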
Thanks! Using views makes a big difference (I forgot).
function sort3!(d)
    Threads.@threads for k in 1:size(d,4)
        @inbounds for j in 1:size(d,2)
            @inbounds for i in 1:size(d,1)
                sort!(@view(d[i,j,:,k]))
            end
        end
    end
end
a = rand(Float32, 1000, 1000, 30, 60);
@time sort!(a, dims = 3);
# 80.604359 seconds (417.29 M allocations: 12.534 GiB, 0.97% gc time, 0.62% compilation time)
a = rand(Float32, 1000, 1000, 30, 60);
@time sort3!(a);
# 24 threads: 4.182823 seconds (528.22 k allocations: 28.570 MiB, 1.61% compilation time)
# 8 threads: 6.576613 seconds (559.60 k allocations: 30.554 MiB, 1.02% compilation time)
# 2 threads: 23.737438 seconds (559.49 k allocations: 30.546 MiB, 0.40% gc time, 2.03% compilation time)
Also, it seems that using broadcasting slightly improves the performance (not sure why):
edit: ignore
julia> function sort2!(d)
           Threads.@threads for k in 1:size(d,4)
               for j in 1:size(d,2)
                   for i in 1:size(d,1)
                       d[i,j,:,k] .= sort!(@view(d[i,j,:,k])) # using .=
                   end
               end
           end
       end
sort2! (generic function with 1 method)
julia> @btime sort2!(a) setup=setup evals=1
48.173 ms (1200041 allocations: 146.49 MiB)
julia> function sort2!(d)
           Threads.@threads for k in 1:size(d,4)
               for j in 1:size(d,2)
                   for i in 1:size(d,1)
                       d[i,j,:,k] = sort!(@view(d[i,j,:,k]))
                   end
               end
           end
       end
sort2! (generic function with 1 method)
julia> @btime sort2!(a) setup=setup evals=1
54.690 ms (1200042 allocations: 146.49 MiB)
which is more general, robust, and safe, and less verbose. If you get an array with zero-based indexing and you use @inbounds, then 1:size(d, ...) can create some nasty problems. In general, prefer axes, eachindex, pairs, etc. over 1:size and 1:length.
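A minimal illustration of the problem (my own example, assuming OffsetArrays.jl is available):
using OffsetArrays

v = OffsetArray([10, 20, 30], -1:1)  # valid indices are -1:1, not 1:3

# 1:size(v, 1) is 1:3, but only index 1 is valid here; with @inbounds the
# reads at 2 and 3 would be undefined behaviour instead of a BoundsError.

# axes iterates the array's actual index range, so this is always safe:
for i in axes(v, 1)
    v[i] += 1
end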
Yeah, right, that doesn’t make sense; it was just a silly mistake of mine from copying/pasting the code (the original code used sort instead of sort!).