# Fastest way to partition array, given a condition

I need to partially sort arrays that are usually small (5 to 20 elements), though the algorithm cannot scale badly. That is, the elements must be split into those smaller than or equal to a cutoff and those greater than the cutoff, and I need the number of elements on the smaller side. It must be in place and single-threaded.

The fastest I could do for now is this:

```julia
julia> function partialsort_cutoff!(x, cutoff)
           iswap = 1
           for i in eachindex(x)
               if x[i] <= cutoff
                   if iswap != i
                       x[iswap], x[i] = x[i], x[iswap]
                   end
                   iswap = iswap + 1
               end
           end
           return iswap - 1
       end
partialsort_cutoff! (generic function with 1 method)

julia> @benchmark partialsort_cutoff!(y,0.5) setup=(y=rand(20)) evals=1
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):   88.000 ns … 444.000 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     172.000 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   172.809 ns ±  25.730 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▂ ▄▁▁▆▂▇▃█▃▃█▃█▂▆▁▂▄ ▃
  ▂▂▂▁▂▂▂▂▂▂▃▃▃▃▄▄▅▄▆▅▆▇▇██████████████████████▇█▆▅▇▅▆▄▅▃▃▄▃▃▃▂ ▅
  88 ns            Histogram: frequency by time          236 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
```

Can anyone think of a faster alternative?


Small style suggestion: you can use `x[iswap], x[i] = x[i], x[iswap]` instead of three lines.



Why do I get different timing results? On Julia Version 1.5.3:

```julia
julia> @btime partialsort_cutoff!(y,0.5) setup=(y=copy($x)) evals=1
  0.001 ns (0 allocations: 0 bytes)
7
```

EDIT:
The bogus timing persists until `length(x) == 134`:

```julia
julia> x = rand(134);

julia> @btime partialsort_cutoff!(y,0.5) setup=(y=copy($x)) evals=1
  100.000 ns (0 allocations: 0 bytes)
65
```

That is a strange result, and it is certainly not meaningful. I’m on Julia 1.6.2, if that matters.

Anyway, it is probably better to benchmark the function with:

```julia
julia> @benchmark partialsort_cutoff!(y,0.5) setup=(y=rand(20)) evals=1
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):   88.000 ns … 444.000 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     172.000 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   172.809 ns ±  25.730 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▂ ▄▁▁▆▂▇▃█▃▃█▃█▂▆▁▂▄ ▃
  ▂▂▂▁▂▂▂▂▂▂▃▃▃▃▄▄▅▄▆▅▆▇▇██████████████████████▇█▆▅▇▅▆▄▅▃▃▄▃▃▃▂ ▅
  88 ns            Histogram: frequency by time          236 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
```

(updated the original post)


Is the initial distribution of the numbers totally random?

Yes, it is.

What might be useful, on the other hand, is that I don’t need the values greater than the cutoff. I would be happy just getting the elements smaller than the cutoff and their count. But I don’t know if anything can be faster than swapping the elements directly.

This should minimize the number of swaps but the overall higher complexity doesn’t seem to pay off unless the swaps are more expensive than in the benchmark.

```julia
function partialsort_cutoff2!(x, cutoff)
    i, j = extrema(eachindex(x))
    while true
        while i < j && x[i] <= cutoff
            i += 1
        end
        while j > i && x[j] > cutoff
            j -= 1
        end
        if j <= i
            return i - 1
        end
        x[i], x[j] = x[j], x[i]
    end
end
```

Edit: There’s a corner case bug if all elements are lower than the cutoff.
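For what it’s worth, one way that corner case might be guarded (a sketch under a hypothetical name, not exhaustively tested, assuming 1-based indexing like the original): when the two scans meet, the meeting element itself decides the count.

```julia
# Sketch: same two-scan loop, but with a guard for the all-below-cutoff case.
# When the scans meet, check whether the meeting element belongs to the
# lower part before returning the count.
function partialsort_cutoff2_guarded!(x, cutoff)
    i, j = extrema(eachindex(x))
    while true
        while i < j && x[i] <= cutoff
            i += 1
        end
        while j > i && x[j] > cutoff
            j -= 1
        end
        if j <= i
            # x[i] is where the scans met; include it in the count if it
            # is at or below the cutoff (assumes 1-based indexing).
            return x[i] <= cutoff ? i : i - 1
        end
        x[i], x[j] = x[j], x[i]
    end
end
```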


This is called “partitioning” the array, and it is a subroutine of the quicksort algorithm. See the `Base.Sort.partition!` function.


Thanks. But the swaps are pretty cheap, yes. They are swaps of a struct of shape:

```julia
struct S
    i::Int
    x::Float64
    v::SVector{3,Float64}
end
```

and the sorting is on the `x` values.

Probably a little bit more expensive to swap, but still very cheap.
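If it helps, the swap loop adapts directly to such a struct by projecting out the field being compared. A hypothetical sketch with a `by` keyword, using an `NTuple` stand-in for the `SVector` field so the snippet needs no packages:

```julia
# Hypothetical stand-in for the struct above (NTuple instead of SVector,
# to keep this sketch dependency-free).
struct S
    i::Int
    x::Float64
    v::NTuple{3,Float64}
end

# Same swap loop as partialsort_cutoff!, generalized with a `by` projection
# so it can compare on a struct field.
function partialsort_cutoff_by!(x, cutoff; by = identity)
    iswap = firstindex(x)
    for i in eachindex(x)
        if by(x[i]) <= cutoff
            if iswap != i
                x[iswap], x[i] = x[i], x[iswap]
            end
            iswap += 1
        end
    end
    return iswap - firstindex(x)  # number of elements at or below the cutoff
end

v = [S(k, xk, (0.0, 0.0, 0.0)) for (k, xk) in enumerate([0.9, 0.1, 0.7, 0.3])]
n = partialsort_cutoff_by!(v, 0.5; by = s -> s.x)  # n == 2
```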

`filter!` ?


I cannot deallocate the remainder of the array, because it will be reused. Could that be adapted to this situation?

`Base.Sort.partition!` is interesting, but I will have to understand exactly how to use it in this case.

Also, `filter!` seems to be fast. Another function to study how it is implemented, even if only to adapt it to what I need here.

If you use the `partition!` algorithm you don’t need to swap; you can just do a single assignment into the lower half.


Thanks all! I have some clear directions to study alternatives. I need to take care of the kid now; I will post what I get from those suggestions asap.


The original:

```julia
function filter!(f, a::AbstractVector)
    j = firstindex(a)
    for ai in a
        @inbounds a[j] = ai
        j = ifelse(f(ai), nextind(a, j), j)
    end
    j > lastindex(a) && return a
    if a isa Vector
        resize!(a, j-1)
        sizehint!(a, j-1)
    else
        deleteat!(a, j:lastindex(a))
    end
    return a
end
```

```julia
function yourFilter!(f, a::AbstractVector)
    j = firstindex(a)
    for ai in a
        @inbounds a[j] = ai
        j = ifelse(f(ai), nextind(a, j), j)
    end
    return j-1
end
```
``````

You can check whether `@inbounds` helps your swapping version too.


Thanks for that. But I realized now that, the way I have things implemented, I actually need the swapping.

I will see how much faster the `partition!` function is.

The adapted `Base.Sort.partition!` function appears to be slightly faster than what I originally had. If someone sees anything else to improve, I’ll appreciate it. But if that is what QuickSort uses by default, I guess it can’t get much faster.

```julia
julia> @benchmark partition!(y,0.5) setup=(y=rand(20)) evals=1
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):   71.000 ns …  14.398 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     157.000 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   159.097 ns ± 145.189 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▂ ▁▅▂▆▂▂█▄▂▇▃▆▂▁▅  ▃
  ▂▁▁▂▁▂▂▂▂▂▃▂▂▃▃▃▃▄▅▄▄▆▅█▇█████████████████████▇█▆▅▇▄▄▆▄▄▃▃▃▃▃ ▅
  71 ns            Histogram: frequency by time          216 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark partialsort_cutoff!(y,0.5) setup=(y=rand(20)) evals=1
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):   76.000 ns …  31.006 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     169.000 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   172.485 ns ± 309.801 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▁ ▃ ▄▁▆▂▇▃▇▃▇▃█▃▆▁▆▁▂ ▁
  ▂▁▁▂▂▁▂▂▂▂▂▂▂▃▃▃▃▄▃▅▄▆▅▇▆███████████████████████▆█▅▇▅▅▄▄▃▄▃▃▃ ▅
  76 ns            Histogram: frequency by time          229 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
```
Code
```julia
using BenchmarkTools

function partition!(v::AbstractVector, cutoff)
    i, j = 0, length(v)+1
    @inbounds while true
        i += 1; j -= 1
        while v[i] <= cutoff
            i += 1
        end
        while v[j] > cutoff
            j -= 1
        end
        i >= j && break
        v[i], v[j] = v[j], v[i]
    end
    return j
end

function partialsort_cutoff!(x, cutoff)
    iswap = 1
    @inbounds for i in eachindex(x)
        if x[i] <= cutoff
            if iswap != i
                x[iswap], x[i] = x[i], x[iswap]
            end
            iswap = iswap + 1
        end
    end
    return iswap - 1
end
```
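As a quick sanity check, here is a sketch that exercises the adapted `partition!` on a deterministic input with values on both sides of the cutoff, which the unchecked scans rely on; the definition is repeated so the snippet runs on its own:

```julia
# Sanity check of the adapted partition! (definition repeated from above so
# this snippet runs standalone) on an input guaranteed to have elements both
# below and above the cutoff.
function partition!(v::AbstractVector, cutoff)
    i, j = 0, length(v)+1
    @inbounds while true
        i += 1; j -= 1
        while v[i] <= cutoff
            i += 1
        end
        while v[j] > cutoff
            j -= 1
        end
        i >= j && break
        v[i], v[j] = v[j], v[i]
    end
    return j
end

x = [0.9, 0.1, 0.7, 0.3, 0.6, 0.2]
n = partition!(x, 0.5)  # n == 3
# The first n elements are now at or below the cutoff, the rest above it.
```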

Do you have guarantees that you have values both below and above the cutoff? Otherwise you will go out of bounds in one of the while loops. (The Base `partition!` avoids this with the `selectpivot!` semantics.)


As a rule of thumb, sub-nanosecond timings don’t make any sense. Usually this means constant propagation is going on, so that the result is effectively computed at compile time and `@btime` ends up benchmarking a function that directly returns a constant value. See the BenchmarkTools.jl manual.
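A minimal illustration of the effect and the usual workaround, interpolating `Ref`-wrapped values so they are opaque to the compiler (a sketch; exact timings vary by machine, so none are shown):

```julia
using BenchmarkTools

f(a, b) = a + b

# With literal arguments the call can be constant-folded away, which may
# yield meaningless sub-nanosecond timings:
@btime f(1, 2)

# Interpolating Ref-wrapped values hides them from constant propagation,
# so the call is actually executed and timed:
@btime f($(Ref(1))[], $(Ref(2))[])
```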


Indeed, I only tested that on the toy example, so I missed it. I’ll add bounds checks to the loops, as the use of a pivot does not seem to be exactly what I need. But maybe that will narrow the difference to the initial implementation even further.

@GunnarFarneback now I notice that this is exactly what you posted above. Thanks for that.

Probably what he is seeing deserves a new thread, because if he ran exactly the code I posted, that should not be happening.