 # Fast non dominated sorting

#1

I need a fast implementation of the so-called non dominated sorting algorithm. The best I could come up with so far is the following (the sorting is to be done on the basis of three criteria):

``````dominates(x, y) = all(x .<= y) && any(x .< y)

function nonDominatedSorting(arr::Array{Float64,2})
fronts::Array{Array,1} = Array[]
ind::Array{Int64,1} = collect(1:size(arr, 1))
while !isempty(arr)
s = size(arr, 1)
red = dropdims(sum([ dominates(arr[i,:], arr[j,:]) for i in 1:s, j in 1:s ], dims=1) .== 0, dims=1)
a = 1:s
sel::Array{Int64,1} = a[red]
push!(fronts, ind[sel])
da::Array{Int64,1} = deleteat!(collect(1:s), sel)
ind = deleteat!(ind, sel)
arr = arr[da, :]
end
return fronts
end
``````

The timing is not too bad:

``````using BenchmarkTools

scores = rand(Float64, (50, 3));

@btime nonDominatedSorting(\$scores)
# 10.039 ms (31998 allocations: 21.81 MiB)
``````

But I need this for simulations, so at the moment I use the faster

``````using RCall

function nds(arr::Array{Float64,2})
@rput arr
R"""
library(nsga2R)
ranking <- fastNonDominatedSorting(arr)
"""
return @rget ranking
end
``````

which gives

``````@btime nds(\$scores)
# 2.354 ms (227 allocations: 9.97 KiB)
``````

That’s fine for what I want, but I’d still be happier to have a pure Julia implementation. So I’d be grateful for any suggestions for making the first version faster.

#2

You might have more answers in https://discourse.julialang.org/c/user/perf (reference this post, if you post there).
Here is a first improvement:

``````julia> x = rand(Int, 100000);

julia> y = rand(Int, 100000);

julia> dominates(x, y) = all(x .<= y) && any(x .< y)

julia> dominates2(x, y) = all(i -> x[i] <= y[i], eachindex(x)) && any(i -> x[i] < y[i], eachindex(x))
dominates2 (generic function with 1 method)

julia> _time(f, x, y) = @time f(x, y) # When `@time` is called from global scope, it has a fixed cost in term of speed and allocation so we make this dummy _time function
_time (generic function with 1 method)

julia> _time(dominates, x, y)
0.000118 seconds (3 allocations: 16.594 KiB)
false

julia> _time(dominates, x, y)
0.000186 seconds (3 allocations: 16.594 KiB)
false

julia> _time(dominates2, x, y)
0.000001 seconds
false

julia> _time(dominates2, x, y)
0.000001 seconds
false
``````

The reason for this improvement is that `dominates2` does create the intermediate array `x .<= y` so it does not allocate as much and it does not evaluate `x <= y` all the way and stops when it is false for one `i`.

4 Likes
#3

You could try storing all data transposed so you can access `arr[:,i]` instead, as this is how it is stored in memory.

Also, this line
`sum([ dominates(arr[i,:], arr[j,:]) for i in 1:s, j in 1:s ], dims=1)`
creates a matrix just to sum over it. Writing out this loop would probably save you both a lot of allocations and time.

3 Likes
#4

Thank you, both @blegat and @baggepinnen – these suggestions helped a lot. I now have

``````dominates(x, y) = all(i -> x[i] <= y[i], eachindex(x)) && any(i -> x[i] < y[i], eachindex(x))

function nonDominatedSorting(arr0::Array{Float64,2})
arr = arr0'
fronts::Array{Array,1} = Array[]
ind::Array{Int64,1} = collect(1:size(arr, 2))
while !isempty(arr)
s = size(arr, 2)
red = Array{Bool,2}(undef, s, s)
for i in 1:s, j in 1:s
red[i, j] = dominates(arr[:, i], arr[:, j])
end
sel::Array{Int64,1} = collect(1:s)[ dropdims(sum(red, dims=1) .== 0, dims=1) ]
push!(fronts, ind[sel])
da::Array{Int64,1} = deleteat!(collect(1:s), sel)
ind = deleteat!(ind, sel)
arr = arr[:,da]
end
return fronts
end
``````

which is a lot faster than the R function I used:

``````scores = rand(Float64, (50, 3))

@btime nonDominatedSorting(\$scores)
# 394.493 μs (8801 allocations: 987.38 KiB)
``````
#5

This should give you another order of magnitude or so:

``````function dominates2(x, y)
strict_inequality_found = false
for i in eachindex(x)
y[i] < x[i] && return false
strict_inequality_found |= x[i] < y[i]
end
return strict_inequality_found
end

function nds2(arr)
fronts = Vector{Int64}[]
ind = collect(1:size(arr, 1))
a = [arr[i, :] for i = 1:size(arr, 1)]
while !isempty(a)
s = length(a)
red = [all(x -> !dominates2(x, y), a) for y in a]
push!(fronts, ind[red])
da = deleteat!(collect(1:s), red)
ind = deleteat!(ind, red)
a = a[da]
end
return fronts
end
``````

Edit: You may want to check correctness, especially in the presence of equal values. Having a testset of tricky cases is good practice anyway.

4 Likes
#6

Many thanks @GunnarFarneback – this is correct, as far as I have been able to check, and is indeed much faster:

``````@btime nds2(\$scores)
# 37.822 μs (266 allocations: 14.38 KiB)
``````
#7

Note that trsnaposing is lazy, this does not actually change the memory layout, so your `arr[:,i]` is still ineffective. If you do `arr = copy(arr0')` the data is copied and put into the correct memory layout.

2 Likes
#8

Thanks, I didn’t know. This made the function about twice as fast, but it’s still slower than @GunnarFarneback’s `nds2`, so for now, I’ll go with that.

1 Like
#9

I’m not sure what this thing is doing, but you can get more speedup (>2x over the one from @GunnarFarneback) by doing this:

``````function nds3(arr)
fronts = Vector{Int64}[]
ind = collect(axes(arr, 1))
a = collect(eachrow(arr))
while !isempty(a)
red = [all(x -> !dominates2(x, y), a) for y in a]
push!(fronts, ind[red])
deleteat!(ind, red)
deleteat!(a, red)
end
return fronts
end
``````
5 Likes
#10

That’s fantastic, many thanks!

What a nice community this is: within a couple of hours, the function got about 500 times faster thanks to the joint help of people on here!

7 Likes
#11

BTW, if the `scores` matrix has a fixed, known and small size in the second dimension, you could restructure it as a vector of `SVector`s, to possibly get even further speedups.

3 Likes
#12

Thanks, the only way I can get this to work is by using

``````scores = @SArray rand(50, 3)
``````

but then the timing actually gets worse.

#13

No I was thinking more along the lines of a `Vector` of `SVector`s, not a large `SMatrix`. StaticArrays are not particularly performant when they become large (>100 elements, roughly). Also, you may have to change the code, I think.

1 Like
#14

Maybe something similar to this:

``````function nds4(arr)
fronts = Vector{Int64}[]
ind = collect(axes(arr, 1))
a = SVector{3}.(eachrow(arr))
while !isempty(a)
red = [all(x -> !dominates2(x, y), a) for y in a]
push!(fronts, ind[red])
deleteat!(ind, red)
deleteat!(a, red)
end
return fronts
end
``````
4 Likes
#15

It seems we just broke 1000-fold speedup limit 6 Likes
#16

Amazing! This gives

``````@btime nds4(scores)
# 5.510 μs (100 allocations: 7.30 KiB)
``````

Many thanks to all!

1 Like
#17

Julia. Come for the speed. Stay for the community.

6 Likes
#18

Is there a sorting package or some pace to post this version to…?

2 Likes
#19

It looks like I’m arriving late here, but here are a few more thoughts:

I tend to think that dynamic operations like `deleteat!` and `push!` are costly in general. One might want to avoid them by performing the appropriate swaps in order to move the elements to the beginning of the array. The algorithm can then continue working on an `arr` array of unchanged length, but looking only at the end of it. It is also possible to examine the dominance on elements one by one, instead of computing the full `red` array storing whether each element is dominated.

Here is an implementation trying to take advantage of these observations.

``````function nds5(arr)
fronts = Vector{Int64}[]
ind = collect(axes(arr, 1))
a = SVector{3}.(eachrow(arr))
N = length(a)

# All elements in arr[1:ii-1] are dominated
ii = 1
while ii<=N
start = ii
@inbounds for i in ii:N
for j in start:N
dominates2(a[j], a[i]) && @goto l1
end

# Swap values at indices `ii` and `i` in `ind` and `a`
ind[ii], ind[i] = ind[i], ind[ii]
a[ii],   a[i]   = a[i],   a[ii]
ii += 1

@label l1
end
push!(fronts, ind[start:ii-1])
end
return fronts
end
``````

Note that it is not strictly equivalent to the original functions, since the sorting is not stable : the same indices are output in a different order (whereas the original function would always have indices sorted in ascending order).

``````julia> nds4(scores)
9-element Array{Array{Int64,1},1}:
[6, 42, 62, 76, 79, 91]
[1, 14, 20, 23, 30, 37, 40, 50, 69, 75, 88]
[4, 10, 12, 18, 19, 22, 36, 48, 52, 61, 63, 64, 68, 83, 84, 87, 97, 98]
[2, 3, 11, 13, 21, 25, 26, 29, 38, 41  …  56, 57, 59, 60, 66, 71, 77, 80, 90, 96]
[15, 16, 17, 35, 39, 43, 44, 45, 67, 70, 81, 82, 89, 94, 100]
[5, 9, 24, 31, 32, 46, 49, 58, 65, 78, 85, 86]
[8, 27, 33, 34, 47, 53, 72, 73, 92]
[28, 74, 93, 95]
[7, 99]
``````
``````julia> nds5(scores)
9-element Array{Array{Int64,1},1}:
[6, 42, 62, 76, 79, 91]
[14, 20, 23, 30, 37, 40, 50, 69, 75, 88, 1]
[18, 19, 22, 10, 36, 12, 48, 52, 61, 63, 64, 68, 4, 83, 84, 87, 97, 98]
[11, 38, 41, 2, 13, 51, 25, 54, 55, 56  …  26, 3, 66, 29, 71, 21, 77, 80, 90, 96]
[67, 39, 70, 15, 43, 44, 45, 81, 82, 16, 89, 17, 94, 35, 100]
[9, 78, 5, 65, 31, 32, 85, 86, 24, 46, 49, 58]
[8, 33, 47, 27, 92, 53, 34, 72, 73]
[95, 28, 74, 93]
[99, 7]
``````

In the implementation above, it is interesting to observe that the `ind` array is shuffled in such a way that it contains all indices of the output partition in the correct order. All that is needed to reconstruct the output partition of `1:n` is the list of part boundaries to correctly split the `ind` array.

The following implementation relies on a custom `Partition` type, which allows manipulating partitions of `1:n` in the form of a 1D array of indices (which is a reordering of `1:n`) and an array of boundaries. This avoids manipulating arrays of arrays:

``````struct Partition
ind    :: Vector{Int64}
bounds :: Vector{Int64}
end

Partition(n::Int64) = Partition(collect(1:n), )

function push_bound!(p::Partition, i::Int64)
push!(p.bounds, i)
p
end

Base.getindex(p::Partition, i::Int64) = if i==lastindex(p.bounds)
view(p.ind, p.bounds[i]:lastindex(p.ind))
else
view(p.ind, p.bounds[i]:p.bounds[i+1]-1)
end

function Base.show(io::IO, p::Partition)
n = length(p.bounds)
if p.bounds[n] == length(p.ind)+1
n -= 1
end
println("\$n-part Partition:")
for i in 1:n
println(" ", p[i])
end
end

function nds6(arr)
a = SVector{3}.(eachrow(arr))
N = length(a)
p = Partition(N)

ii = 1
while ii<=N
start = ii
@inbounds for i in ii:N
for j in start:N
dominates2(a[j], a[i]) && @goto l1
end

# Swap values at indices `ii` and `i` in `ind` and `a`
p.ind[ii], p.ind[i] = p.ind[i], p.ind[ii]
a[ii],     a[i]     = a[i],     a[ii]
ii += 1

@label l1
end
push_bound!(p, ii)
end
return p
end
``````

which yields the same results as `nds5`, except stored in another fashion:

``````julia> nds6(scores)
9-part Partition:
[6, 42, 62, 76, 79, 91]
[14, 20, 23, 30, 37, 40, 50, 69, 75, 88, 1]
[18, 19, 22, 10, 36, 12, 48, 52, 61, 63, 64, 68, 4, 83, 84, 87, 97, 98]
[11, 38, 41, 2, 13, 51, 25, 54, 55, 56, 57, 59, 60, 26, 3, 66, 29, 71, 21, 77, 80, 90, 96]
[67, 39, 70, 15, 43, 44, 45, 81, 82, 16, 89, 17, 94, 35, 100]
[9, 78, 5, 65, 31, 32, 85, 86, 24, 46, 49, 58]
[8, 33, 47, 27, 92, 53, 34, 72, 73]
[95, 28, 74, 93]
[99, 7]
``````

Now some benchmarks:

``````using BenchmarkTools

function bench(N)
scores = rand(Float64, (N, 3));
@assert nds1(scores) == nds2(scores) == nds3(scores) == nds4(scores)
print("nds1");  @btime nds1(\$scores)
print("nds2");  @btime nds2(\$scores)
print("nds3");  @btime nds3(\$scores)
print("nds4");  @btime nds4(\$scores)
print("nds5");  @btime nds5(\$scores)
print("nds6");  @btime nds6(\$scores)

nothing
end
``````
``````julia> bench(50)
nds1  12.494 ms (40026 allocations: 27.14 MiB)
nds2  42.062 μs (265 allocations: 14.36 KiB)
nds3  13.697 μs (240 allocations: 7.97 KiB)
nds4  5.633 μs (94 allocations: 7.00 KiB)
nds5  4.675 μs (64 allocations: 5.77 KiB)
nds6  3.211 μs (59 allocations: 4.94 KiB)

julia> bench(50)
nds1  14.268 ms (46021 allocations: 31.52 MiB)
nds2  48.169 μs (295 allocations: 15.48 KiB)
nds3  17.536 μs (266 allocations: 8.58 KiB)
nds4  6.547 μs (100 allocations: 7.30 KiB)
nds5  5.779 μs (65 allocations: 5.86 KiB)
nds6  3.303 μs (59 allocations: 4.94 KiB)
``````

Most of the remaining allocations in `nds6` come from the Array-to-SArray conversion, which takes negligible time and is therefore probably not worth optimizing:

``````julia> f(x) = SVector{3}.(eachrow(x))
f (generic function with 1 method)

julia> @btime f(\$scores)
693.521 ns (54 allocations: 4.20 KiB)
``````

Full code (for anyone wanting to reproduce the results)
``````dominates(x, y) = all(x .<= y) && any(x .< y)

function dominates2(x, y)
strict_inequality_found = false
for i in eachindex(x)
y[i] < x[i] && return false
strict_inequality_found |= x[i] < y[i]
end
return strict_inequality_found
end

function nds1(arr::Array{Float64,2})
fronts::Array{Array,1} = Array[]
ind::Array{Int64,1} = collect(1:size(arr, 1))
while !isempty(arr)
s = size(arr, 1)
red = dropdims(sum([ dominates(arr[i,:], arr[j,:]) for i in 1:s, j in 1:s ], dims=1) .== 0, dims=1)
a = 1:s
sel::Array{Int64,1} = a[red]
push!(fronts, ind[sel])
da::Array{Int64,1} = deleteat!(collect(1:s), sel)
ind = deleteat!(ind, sel)
arr = arr[da, :]
end
return fronts
end

function nds2(arr)
fronts = Vector{Int64}[]
ind = collect(1:size(arr, 1))
a = [arr[i, :] for i = 1:size(arr, 1)]
while !isempty(a)
s = length(a)
red = [all(x -> !dominates2(x, y), a) for y in a]
push!(fronts, ind[red])
da = deleteat!(collect(1:s), red)
ind = deleteat!(ind, red)
a = a[da]
end
return fronts
end

function nds3(arr)
fronts = Vector{Int64}[]
ind = collect(axes(arr, 1))
a = collect(eachrow(arr))
while !isempty(a)
red = [all(x -> !dominates2(x, y), a) for y in a]
push!(fronts, ind[red])
deleteat!(ind, red)
deleteat!(a, red)
end
return fronts
end

using StaticArrays

function nds4(arr)
fronts = Vector{Int64}[]
ind = collect(axes(arr, 1))
a = SVector{3}.(eachrow(arr))
while !isempty(a)
red = [all(x -> !dominates2(x, y), a) for y in a]
push!(fronts, ind[red])
deleteat!(ind, red)
deleteat!(a, red)
end
return fronts
end

function nds5(arr)
fronts = Vector{Int64}[]
ind = collect(axes(arr, 1))
a = SVector{3}.(eachrow(arr))
N = length(a)

ii = 1
while ii<=N
start = ii
@inbounds for i in ii:N
for j in start:N
dominates2(a[j], a[i]) && @goto l1
end

# Swap values at indices `ii` and `i` in `ind` and `a`
ind[ii], ind[i] = ind[i], ind[ii]
a[ii],   a[i]   = a[i],   a[ii]
ii += 1

@label l1
end
push!(fronts, ind[start:ii-1])
end
return fronts
end

struct Partition
ind    :: Vector{Int64}
bounds :: Vector{Int64}
end

Partition(n::Int64) = Partition(collect(1:n), )

function push_bound!(p::Partition, i::Int64)
push!(p.bounds, i)
p
end

Base.getindex(p::Partition, i::Int64) = if i==lastindex(p.bounds)
view(p.ind, p.bounds[i]:lastindex(p.ind))
else
view(p.ind, p.bounds[i]:p.bounds[i+1]-1)
end

function Base.show(io::IO, p::Partition)
n = length(p.bounds)
if p.bounds[n] == length(p.ind)+1
n -= 1
end
println("\$n-part Partition:")
for i in 1:n
println(" ", p[i])
end
end

function nds6(arr)
a = SVector{3}.(eachrow(arr))
N = length(a)
p = Partition(N)

ii = 1
while ii<=N
start = ii
@inbounds for i in ii:N
for j in start:N
dominates2(a[j], a[i]) && @goto l1
end

# Swap values at indices `ii` and `i` in `ind` and `a`
p.ind[ii], p.ind[i] = p.ind[i], p.ind[ii]
a[ii],     a[i]     = a[i],     a[ii]
ii += 1

@label l1
end
push_bound!(p, ii)
end
return p
end

using BenchmarkTools

function bench(N)
scores = rand(Float64, (N, 3));
@assert nds1(scores) == nds2(scores) == nds3(scores) == nds4(scores)
print("nds1");  @btime nds1(\$scores)
print("nds2");  @btime nds2(\$scores)
print("nds3");  @btime nds3(\$scores)
print("nds4");  @btime nds4(\$scores)
print("nds5");  @btime nds5(\$scores)
print("nds6");  @btime nds6(\$scores)

nothing
end
``````
5 Likes
#20

The ordering of the indices is unimportant: the function calculates the so-called Pareto fronts, and within any front all solutions are equivalent. In other words, `nds5` would have done. I can’t now quite follow why, despite the extra work, `nds6` is still faster. In any case, fantastic to see that a further speedup was still possible.

In response to @yakir12, the non-dominated sorting algorithm is a key component in Deb’s NSGA-II algorithm, which is one of the most popular algorithms for agent-based optimization. At the moment, there are three Julia packages for this, but two seem to have been abandoned and one is, at least for my purposes, not flexible enough. For me, it’s much more useful to have the components of the NSGA-II algorithm available (another key component is the crowding sorting distance) and then to combine them in whichever way best suits my needs for the case at hand. I could imagine a package with fast versions of the necessary functions (like the one developed in this thread for non dominated sorting) to be useful for some people.

2 Likes