I have large arrays which I would like to sort along a single, small dimension, as in the example below. Any suggestions for increasing performance would be welcome.
using BenchmarkTools
a = rand(Float32, 1000, 1000, 30);
@btime sort!($a, dims = 3); # => 517.336 ms (5934000 allocations: 197.36 MiB)
Yes, but since the dimensions come in the given order, the time for the permutations needs to be added as well. Besides, as the sort operation is just one step in the data manipulation, the permutations are somewhat inconvenient. I was thinking more in the direction of how to make use of all cores.
julia> @btime permutedims($a, [3,1,2]);
60.995 ms (7 allocations: 114.44 MiB)
julia> @btime permutedims($a, [3,2,1]);
243.528 ms (7 allocations: 114.44 MiB)
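For completeness, here is a sketch (my own illustration, not from the thread) of the full permute-sort-permute-back round trip those timings refer to; the second permutation, restoring the original layout, is the extra cost mentioned above:
b = permutedims(a, (3, 1, 2))   # copy with the small dimension first (30×1000×1000)
sort!(b, dims = 1)              # sorting along the first, contiguous dimension
permutedims!(a, b, (2, 3, 1))   # permute back into the original 1000×1000×30 layout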
Maybe worth noting, if you do permute, that it’s possible to beat Base’s permutedims. There is some clever code for this in Strided.jl, which is (in part) the back-end for TensorOperations.jl, and one way to access it looks like this:
julia> a = rand(Float32, 1000, 1000, 30);
julia> @btime permutedims($a, (3,1,2));
20.933 ms (2 allocations: 114.44 MiB)
julia> using TransmuteDims
julia> @btime transmutedims($a, (3,1,2));
7.905 ms (64 allocations: 114.45 MiB)
Note also that timing sort! like this might be misleading, as all runs but the first will get a pre-sorted input.
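One way around that (a sketch using BenchmarkTools’ setup and evals options) is to hand the benchmark a fresh copy on every evaluation, so sort! never sees its own output:
julia> b = rand(Float32, 1000, 1000, 30);
julia> @btime sort!(c, dims = 3) setup = (c = copy($b)) evals = 1;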
My real data arrays have one more dimension, as in the example below, where I also try multi-threading.
function sort2!(d)
    Threads.@threads for k in 1:size(d,4)
        @inbounds for j in 1:size(d,2)
            @inbounds for i in 1:size(d,1)
                d[i,j,:,k] = sort(d[i,j,:,k])  # copies the slice, sorts the copy, writes it back
            end
        end
    end
end
a = rand(Float32, 1000, 1000, 30, 60);
@time sort!(a, dims = 3);
# 70.350123 seconds (417.71 M allocations: 12.555 GiB, 1.05% gc time, 1.21% compilation time)
a = rand(Float32, 1000, 1000, 30, 60);
@time sort2!(a);
# 24 threads: 27.988482 seconds (120.08 M allocations: 23.250 GiB, 80.31% gc time, 0.05% compilation time)
# 12 threads: 22.279251 seconds (120.42 M allocations: 23.270 GiB, 61.47% gc time, 0.14% compilation time)
# 8 threads: 21.185633 seconds (120.42 M allocations: 23.270 GiB, 50.91% gc time, 0.19% compilation time)
# 4 threads: 26.546660 seconds (120.42 M allocations: 23.270 GiB, 29.08% gc time, 1.33% compilation time)
# 2 threads: 38.829076 seconds (120.42 M allocations: 23.270 GiB, 11.74% gc time, 0.10% compilation time)
# 1 thread : 45.502133 seconds (120.00 M allocations: 23.246 GiB, 2.21% gc time)
Is this a safe way of using multi-threading? Further, the performance does not seem to scale well with the number of threads. For each experiment, Julia is restarted with julia -t <no. of threads>. The code is run with Julia 1.6.1 on a machine with two Intel Xeon E5-2650 v4 2.20 GHz CPUs, each with 12 physical cores.
On my computer this code (which is the same as what @lmiq suggests)
function mysort3!(a::AbstractArray{<:Any, 4})
    Threads.@threads for k in axes(a, 4)
        for j in axes(a, 2)
            for i in axes(a, 1)
                sort!(view(a, i, j, :, k))
            end
        end
    end
    return a
end
gives more than 8x speedup for 8 threads.
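As a quick sanity check (my own addition, with a smaller array so it runs quickly), the threaded version gives the same result as sort! with dims = 3:
b = rand(Float32, 100, 100, 30, 8);
c = copy(b);
mysort3!(b);
sort!(c, dims = 3);
b == c  # true: each (i, j, :, k) slice is written by exactly one thread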
Unfortunately, I cannot find a way to select which dimension to sort, since there is apparently no generalization of eachrow and eachcol. eachslice and selectdim do not have the right behaviour. But hardcoding the dimension works.
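One way to avoid hardcoding the dimension (a sketch, not from the thread; sortdim! is a name I made up) is to split the remaining axes into the parts before and after the sorted dimension and loop over their CartesianIndices:
function sortdim!(a::AbstractArray, dim::Integer)
    pre  = CartesianIndices(axes(a)[1:dim-1])    # axes before the sorted dimension
    post = CartesianIndices(axes(a)[dim+1:end])  # axes after the sorted dimension
    Threads.@threads for Ipost in post
        for Ipre in pre
            # a CartesianIndex splats into the view, so this selects a[i..., :, k...]
            sort!(view(a, Ipre, :, Ipost))
        end
    end
    return a
end

sortdim!(a, 3)  # same result as the hardcoded mysort3! above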
Thanks! Using views makes a big difference (I forgot).
function sort3!(d)
    Threads.@threads for k in 1:size(d,4)
        @inbounds for j in 1:size(d,2)
            @inbounds for i in 1:size(d,1)
                sort!(@view(d[i,j,:,k]))
            end
        end
    end
end
a = rand(Float32, 1000, 1000, 30, 60);
@time sort!(a, dims = 3);
# 80.604359 seconds (417.29 M allocations: 12.534 GiB, 0.97% gc time, 0.62% compilation time)
a = rand(Float32, 1000, 1000, 30, 60);
@time sort3!(a);
# 24 threads: 4.182823 seconds (528.22 k allocations: 28.570 MiB, 1.61% compilation time)
# 8 threads: 6.576613 seconds (559.60 k allocations: 30.554 MiB, 1.02% compilation time)
# 2 threads: 23.737438 seconds (559.49 k allocations: 30.546 MiB, 0.40% gc time, 2.03% compilation time)
Also, it seems that using broadcasting slightly improves the performance (not sure why):
edit: ignore
julia> function sort2!(d)
           Threads.@threads for k in 1:size(d,4)
               for j in 1:size(d,2)
                   for i in 1:size(d,1)
                       d[i,j,:,k] .= sort!(@view(d[i,j,:,k])) # using .=
                   end
               end
           end
       end
sort2! (generic function with 1 method)
julia> @btime sort2!(a) setup=setup evals=1
48.173 ms (1200041 allocations: 146.49 MiB)
julia> function sort2!(d)
           Threads.@threads for k in 1:size(d,4)
               for j in 1:size(d,2)
                   for i in 1:size(d,1)
                       d[i,j,:,k] = sort!(@view(d[i,j,:,k]))
                   end
               end
           end
       end
sort2! (generic function with 1 method)
julia> @btime sort2!(a) setup=setup evals=1
54.690 ms (1200042 allocations: 146.49 MiB)
which is more general, robust, and safe, and less verbose. If you get an array with zero-based indexing and you use @inbounds, then 1:size(d, ...) can create some nasty problems. In general, prefer axes, eachindex, pairs, etc. over 1:size and 1:length.
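A minimal illustration of the problem (my own example, assuming OffsetArrays.jl is available):
using OffsetArrays

v = OffsetArray([10, 20, 30], -1:1)  # valid indices are -1:1, not 1:3

# 1:size(v, 1) is 1:3, but only index 1 is valid here; with @inbounds the
# reads at 2 and 3 would be undefined behaviour instead of a BoundsError.

# axes iterates the array's actual index range, so this is always safe:
for i in axes(v, 1)
    v[i] += 1
end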
Yeah, right, that doesn’t make sense; it was just a silly mistake of mine from copying/pasting the code (the original code used sort instead of sort!).