Speeding up sort along an array dimension

I have large arrays which I would like to sort along a single dimension of small size as in the example below. Any suggestions to increase performance would be welcome.

using BenchmarkTools
a = rand(Float32, 1000, 1000, 30);
@btime sort!($a, dims = 3);   #  => 517.336 ms (5934000 allocations: 197.36 MiB)
julia> a = rand(Float32, 30, 1000, 1000);

julia> @btime sort!($a, dims = 1);
  352.697 ms (4956000 allocations: 182.43 MiB)

changing the permutation seems to help a bit

Yes, but as the dimensions are in the given order, the time for permutations needs to be added also. Besides as the sort operation is just one step in the data manipulation, permutations are somewhat inconvenient. I was thinking more in the direction of how to make us of all cores.

julia> @btime permutedims($a, [3,1,2]);
  60.995 ms (7 allocations: 114.44 MiB)
julia> @btime permutedims($a, [3,2,1]);
  243.528 ms (7 allocations: 114.44 MiB)

Maybe worth noting, if you do permute, that it’s possible to beat Base’s permutedims. There is some clever code for this in Strided.jl which is (in part, part of) the back-end for TensorOperations.jl, and one way to access it looks like this:

julia> a = rand(Float32, 1000, 1000, 30);

julia> @btime permutedims($a, (3,1,2));
  20.933 ms (2 allocations: 114.44 MiB)

julia> using TransmuteDims

julia> @btime transmutedims($a, (3,1,2));
  7.905 ms (64 allocations: 114.45 MiB)

Note also that timing sort! like this might be misleading, as all but the first runs will get a pre-sorted input.

julia> @btime sort!($a, dims = 3);
  260.115 ms (5934000 allocations: 197.36 MiB)

julia> @btime sort!(a, dims = 3)  setup=(a = rand(Float32, 1000, 1000, 30)) evals=1;
  629.847 ms (5934000 allocations: 197.36 MiB)

julia> @btime sort($a, dims = 3);
  563.034 ms (15 allocations: 228.88 MiB)

julia> @btime sort(a, dims = 3)  setup=(a = rand(Float32, 1000, 1000, 30)) evals=1;
  563.652 ms (15 allocations: 228.88 MiB)
4 Likes

Thank you for your suggestions.

My real data arrays have one more dimension as in the example below where I also try multi-threading.

function sort2!(d)
    Threads.@threads for k in 1:size(d,4)
        @inbounds for j in 1:size(d,2)
            @inbounds for i in 1:size(d,1)
                d[i,j,:,k] = sort(d[i,j,:,k])
            end
        end
    end
end

a = rand(Float32, 1000, 1000, 30, 60);
@time sort!(a, dims = 3);
#  70.350123 seconds (417.71 M allocations: 12.555 GiB, 1.05% gc time, 1.21% compilation time)

a = rand(Float32, 1000, 1000, 30, 60);
@time sort2!(a);
# 24 threads: 27.988482 seconds (120.08 M allocations: 23.250 GiB, 80.31% gc time, 0.05% compilation time)
# 12 threads: 22.279251 seconds (120.42 M allocations: 23.270 GiB, 61.47% gc time, 0.14% compilation time)
#  8 threads: 21.185633 seconds (120.42 M allocations: 23.270 GiB, 50.91% gc time, 0.19% compilation time)
#  4 threads: 26.546660 seconds (120.42 M allocations: 23.270 GiB, 29.08% gc time, 1.33% compilation time)
#  2 threads: 38.829076 seconds (120.42 M allocations: 23.270 GiB, 11.74% gc time, 0.10% compilation time)
#  1 thread : 45.502133 seconds (120.00 M allocations: 23.246 GiB, 2.21% gc time)type or paste code here

Is this a safe way of using multi-threading? Further, the performance does not seem to scale well with the number of threads. In each experiment Julia is re-started with julia -t <no. of threads>. The code is run with Julia version 1.6.1 on a machine with 2 Intel Xeon E5-2650 v4 2.20GHz CPUs each with 12 physical cores.

Use sort!(@view(d[i,j,:,k])) here.

And last run things twice, you are measuring compilation time there.

1 Like

On my computer this code (which is the same as @lmiq suggests)

function mysort3!(a::AbstractArray{<:Any, 4})
    Threads.@threads for k in axes(a, 4)
        for j in axes(a, 2)
            for i in axes(a, 1)
                sort!(view(a, i, j, :, k))
            end
        end
    end
    return a
end

gives more than 8x speedup for 8 threads.

Unfortunately, I cannot find a way to select which dimension to sort, since there is apparently no generalization of eachrow and eachcol. eachslice and selectdim does not have the right behaviour. But hardcoding the dimension works.

1 Like

Thanks! Using views makes a big difference (I forgot).

function sort3!(d)
    Threads.@threads for k in 1:size(d,4)
        @inbounds for j in 1:size(d,2)
            @inbounds for i in 1:size(d,1)
                sort!(@view(d[i,j,:,k]))
            end
        end
    end
end

a = rand(Float32, 1000, 1000, 30, 60);
@time sort!(a, dims = 3);
#   80.604359 seconds (417.29 M allocations: 12.534 GiB, 0.97% gc time, 0.62% compilation time)

a = rand(Float32, 1000, 1000, 30, 60);
@time sort3!(a);
# 24 threads:  4.182823 seconds (528.22 k allocations: 28.570 MiB, 1.61% compilation time)
#  8 threads:  6.576613 seconds (559.60 k allocations: 30.554 MiB, 1.02% compilation time)
#  2 threads: 23.737438 seconds (559.49 k allocations: 30.546 MiB, 0.40% gc time, 2.03% compilation time)

I don’t think those @inbounds are helping there.

Also, it seems that using broadcasting improves slightly the performance (not sure why):

edit: ignore
julia> function sort2!(d)
           Threads.@threads for k in 1:size(d,4)
               for j in 1:size(d,2)
                   for i in 1:size(d,1)
                       d[i,j,:,k] .= sort!(@view(d[i,j,:,k]))  #  using .=
                   end
               end
           end
       end
sort2! (generic function with 1 method)

julia> @btime sort2!(a) setup=setup evals=1
  48.173 ms (1200041 allocations: 146.49 MiB)

julia> function sort2!(d)
           Threads.@threads for k in 1:size(d,4)
               for j in 1:size(d,2)
                   for i in 1:size(d,1)
                       d[i,j,:,k] = sort!(@view(d[i,j,:,k]))
                   end
               end
           end
       end
sort2! (generic function with 1 method)

julia> @btime sort2!(a) setup=setup evals=1
  54.690 ms (1200042 allocations: 146.49 MiB)


Yes, but be careful to to just test on pre-sorted data:

julia> a = rand(Float32, 100, 100, 30, 60);

julia> @time sort!(a, dims = 3);
  0.787410 seconds (3.88 M allocations: 180.190 MiB, 7.49% gc time, 30.54% compilation time)

julia> @time sort!(a, dims = 3);
  0.308969 seconds (2.40 M allocations: 100.708 MiB, 12.33% gc time)

julia> a .= rand.();

julia> @time sort!(a, dims = 3);
  0.504254 seconds (2.40 M allocations: 100.708 MiB, 1.59% gc time)
1 Like

:face_with_raised_eyebrow: That seems weird. Isn’t this equivalent to

sort!(@view(d[i,j,:,k]))
d[i,j,:,k] .= d[i,j,:,k]

? In other words, just extra work?

I suggest using

for k in axes(d, 4)

which is more general, robust and safe. And less verbose. If you get a zero-based indexing array, and you use @inbounds, then 1:size(d, ...) can create some nasty problems. In general, prefer axes, eachindex, pairs, etc. over 1:size and 1:length.

2 Likes

Yeah, right, that doesn’t make sense, it was just a stupid mistake of mine because of copying/pasting the code (the original code used sort instead of sort!)