Best `separate` performance?

Hi all, I’ve been trying out some ways of separating a Vector into a true and false part based on predicate f, and the basic code below is what I ended up with. My question/challenge is whether anyone knows a faster way of doing something like this.

function separate(f::Function, A::AbstractVector)
    la = length(A)
    ix = Vector{Int}(undef, la)
    true_id = 1
    false_id = 0
    for (i, a) in enumerate(A)
	if f(a)
		ix[true_id] = i
		true_id += 1
	else
		ix[end - false_id] = i
		false_id += 1
	end
    end
    reverse!(ix, true_id, la)
    return permute!(A, ix), true_id
end

#julia> @btime separate(x -> x>5, $(collect(1:10)))
#  68.969 ns (3 allocations: 352 bytes)
#([6, 7, 8, 9, 10, 1, 2, 3, 4, 5], 6)

When asking these type of questions it is quite important to mention what types of input the function needs to handle. For example:

  • Is a length of 10 very typical or could the length be 10^6?
  • Will the separation function be as simple as x > 5 or will it be more general.
  • Does the output need to be exactly like what you have right now?

Optimization is often exploiting problem specific properties so to do best you need to know the constraints you are working under.

1 Like

Yes this is very true, maybe this wasn’t really a good question. It’s supposed to be just a relatively general tool that works with Vectors up to a length of the order of around 1e3, and the function not drastically more complex than this one, something one would use in a findfirst(x-> ...., A) or map(x->...., A) kind of way (not with a gigantic do end block).

Output can be different but it should at least be structured similarly, i.e. either like it is now, or something like return trues, falses where they are just two arrays instead of one.
I forgot to mention: The ordering of the true and false elements should be kept intact.

aside from this being almost 10x slower, true_id that is returned is the index of the first false element.

The simplest would just be to use trues = findall(f, arr) to get the trues and then a setdiff(1:length(arr), trues) to get the falses, but maybe that has some overhead?

Something like this?

function separate2(f::Function, A::AbstractVector)
	trues = findall(f, A)
	falses = setdiff(1:length(A), trues)
	return A[trues], A[falses]
end

gives me

@btime separate2(x->x>5, $(collect(1:10)))
  757.307 ns (28 allocations: 1.63 KiB)
([6, 7, 8, 9, 10], [1, 2, 3, 4, 5])

A simple one and a fast one – sep2 is about 2x quicker at 10, 4x quicker at 10^3, for me:

sep(f,A) = filter(f,A), filter(!f,A)

function sep2(f,A::Vector)
    out = similar(A)
    nt, nf = 0, length(A)+1
    @inbounds for a in A
        if f(a)
            out[nt+=1] = a
        else
            out[nf-=1] = a
        end
    end
    reverse!(@view out[1+nt:end])
    out
end
1 Like

Ah nice, basically the same but without the index buffer, great!

Is there any way to do this fully in-place (no allocations)? I gave it a couple of tries but I couldn’t figure it out.

I wrote the code below for fun. It is fast and in-place (no extra allocation), however it does not fill your pre-requisite that:

The ordering of the true and false elements should be kept intact.

I am not sure if this entirely possible to be done in a O(n)-time and O(1)-memory algorithm, seems like you need a O(n)-time-and-memory (like your first algorithm), or a O(1)-memory but O(n^2)-time. If you really not want the allocation (but can pay O(n^2) time), you can do a single pass where the first true element after some false ones is saved to a variable, and then shift the whole false block one step by the side (copy i:(j-1) where i is the first false and j is the first true after the falses to (i+1):j) and put the true element in position i, but this is horrid.

I think the only sane thing if you need to keep the ordering and do not want to allocate at each call is to take the permutation Vector as a parameter, and make sure its has the right size and none of the old values will be used, then you can reuse it between calls.

using BenchmarkTools

sep(f::Function, A::AbstractVector) = filter(f,A), filter(!f,A)

# does not work for vectors with different index types
function separate(f::Function, A::AbstractVector)
  la = length(A)
  ix = Vector{Int}(undef, la)
  true_id = 1
  false_id = 0
  for (i, a) in enumerate(A)
	  if f(a)
	  	ix[true_id] = i
	  	true_id += 1
	  else
	  	ix[end - false_id] = i
	  	false_id += 1
	  end
  end
  reverse!(ix, true_id, la)
  return permute!(A, ix), true_id
end

# does not work for vectors with different index types
function sep3(f::Function, A::AbstractVector)
  i, j = 1, length(A)
  @inbounds while i < j
    if f(A[i]) # i element at right position
      if f(A[j]) # j element at wrong position
        i += 1
        while i < j && f(A[i]); i += 1; end
        if i < j
          A[i], A[j] = A[j], A[i]
          i += 1
          j -= 1
        else
          @assert i == j
          @views return A[1:j], A[(j+1):length(A)]
        end
      else # both elements at right position
        i += 1
        j -= 1
      end
    else
      if f(A[j]) # i at wrong possition but j at right position
        A[i], A[j] = A[j], A[i]
        i += 1
        j -= 1
      else # both alements at wrong position
        j -= 1
        while j > i && !f(A[j]); j -= 1; end
        if j > i
          A[i], A[j] = A[j], A[i]
          i += 1
          j -= 1
        else
          @assert i == j
          @views return A[1:(i-1)], A[i:length(A)]
        end
      end
    end
  end
  if i == j
    if f(A[i])
      @views return A[1:i], A[(i+1):length(A)]
    else
      @views return A[1:(i-1)], A[i:length(A)]
    end
  else
    @assert i == j + 1
    @views return A[1:j], A[i:length(A)]
  end
end

function run_bench()
  N = 10^3
  b = rand(N)
  a = deepcopy(b)
  ts1,fs1 = @btime sep(x -> x > 0.5, $a)
  a = deepcopy(b)
  ts2,fs2 = @btime separate(x -> x > 0.5, $a)
  a = deepcopy(b)
  ts3,fs3 = @btime sep3(x -> x > 0.5, $a)
end

run_bench()

#   7.882 μs (19 allocations: 16.69 KiB)
#   3.272 μs (3 allocations: 15.91 KiB)
# 519.591 ns (3 allocations: 128 bytes)

Is there any way to do this fully in-place (no allocations)? I gave it a couple of tries but I couldn’t figure it out.

Here’s a stable, in-place partition algorithm that runs in O(n*log(n)) (I think…). It’s a lot like a merge sort. Based on an implementation in D’s standard library, described here.

This version puts all the elements for which the predicate returns true at the front, and returns the number of elements that evaluate to true.

function partition!(pred, arr)
    if length(arr) === 0
        return 0
    end
    
    # Base case
    if length(arr) === 1
        if pred(arr[1])
            return 1
        else
            return 0
        end
    end

    # Recursively partition upper and lower halves
    middle = length(arr) >> 1
    lower = partition!(pred, @view arr[1:middle])
    upper = partition!(pred, @view arr[middle+1:end])

    # Swap upper block of falses and lower block of trues
    reverse!(@view arr[lower+1:middle])
    reverse!(@view arr[middle+1:middle+upper])
    reverse!(@view arr[lower+1:middle+upper])

    # Return total number of trues
    return lower + upper
end

Example:

julia> let data = collect(1:10)
           partition!(x->x>5, data), data
       end
(5, [6, 7, 8, 9, 10, 1, 2, 3, 4, 5])
1 Like

Here’s a paper claiming you can do it in O(n) time, but it looks a lot trickier to implement: Stable Minimum Space Partitioning in Linear Time [PDF].

I think Block Sort (I’d link to Wikipedia, but this forum limits new users to two links per post) uses some similar ideas to achieve an in-place, stable sort with worst case O(n*log(n)) running time.

In retrospective, this is a kinda obvious transformation I have forgotten about. You can treat falses as zeros and trues as ones (or the contrary) and solve by sorting the array by the value associated, I had also forgotten that mergesort did not need a temporary vector.

However, your solution is the one that takes more time and memory, for at least an order of magnitude more than the others, I have run the code below and what I get is:

# one-liner that create two separate vectors using two filters
  7.490 μs (19 allocations: 16.69 KiB)
# first post code, uses aux vector for permutation 
  3.255 μs (3 allocations: 15.91 KiB)
# first post code, but I take the permutation vector as parameter
# and discovered that permute! makes a copy of  the permutation
# vector so I use Base.permute!! instead
  2.385 μs (1 allocation: 32 bytes)
# my code that does not preserve order and return views
  507.447 ns (3 allocations: 128 bytes)
# your code based on mergesort
  49.923 μs (4995 allocations: 234.14 KiB)

The problem with your code is that it calls recursively while allocating views, and you base case is a single element. So in the end you make 2*length(arr) calls and allocate considerably more 2*length(arr) views (because there is not only the views for calling recursively as there are the ones for reverse! too). Each view is allocated in the heap and take more memory than a Float64, that is the eltype of the array I used to test, so this allocates far more than an aux vector, which @louisponet wanted to avoid. Maybe if a pair of Int64 are used as indexes and passed as parameters recursively and to reverse!, this way no allocation happens in heap, and at max log2(length(arr)) tuples of paramters will be stacked in the stack memory.

Also, @louisponet, by what I have seen in permute! implementation (https://github.com/JuliaLang/julia/blob/ac8c5b70514fc0bd743f4e5df6a13dccdaccddb9/base/combinatorics.jl#L97) it seems to enforce one-based indexing. Unfortunately your effort to make the code index-agnostic is not sufficient.

The code used for the tests is below:

using BenchmarkTools

sep(f::Function, A::AbstractVector) = filter(f,A), filter(!f,A)

# does not work for vectors with different index types
function separate(f::Function, A::AbstractVector)
  la = length(A)
  ix = Vector{Int}(undef, la)
  true_id = 1
  false_id = 0
  for (i, a) in enumerate(A)
	  if f(a)
	  	ix[true_id] = i
	  	true_id += 1
	  else
	  	ix[end - false_id] = i
	  	false_id += 1
	  end
  end
  reverse!(ix, true_id, la)
  return permute!(A, ix), true_id
end

# does not work for vectors with different index types
function separate2(f :: Function, A :: AbstractVector, B :: AbstractVector)
  i, j = 0, length(A) + 1
  for (k, a) in enumerate(A)
	  if f(a)
	  	B[i += 1] = k
	  else
	  	B[j -= 1] = k
	  end
  end
  reverse!(B, i, length(A))
  return Base.permute!!(A, B), i
end

# does not work for vectors with different index types
function sep3(f::Function, A::AbstractVector)
  i, j = 1, length(A)
  @inbounds while i < j
    if f(A[i]) # i element at right position
      if f(A[j]) # j element at wrong position
        i += 1
        while i < j && f(A[i]); i += 1; end
        if i < j
          A[i], A[j] = A[j], A[i]
          i += 1
          j -= 1
        else
          @assert i == j
          @views return A[1:j], A[(j+1):length(A)]
        end
      else # both elements at right position
        i += 1
        j -= 1
      end
    else
      if f(A[j]) # i at wrong possition but j at right position
        A[i], A[j] = A[j], A[i]
        i += 1
        j -= 1
      else # both alements at wrong position
        j -= 1
        while j > i && !f(A[j]); j -= 1; end
        if j > i
          A[i], A[j] = A[j], A[i]
          i += 1
          j -= 1
        else
          @assert i == j
          @views return A[1:(i-1)], A[i:length(A)]
        end
      end
    end
  end
  if i == j
    if f(A[i])
      @views return A[1:i], A[(i+1):length(A)]
    else
      @views return A[1:(i-1)], A[i:length(A)]
    end
  else
    @assert i == j + 1
    @views return A[1:j], A[i:length(A)]
  end
end

function partition!(pred, arr)
    if length(arr) === 0
        return 0
    end
    
    # Base case
    if length(arr) === 1
        if pred(arr[1])
            return 1
        else
            return 0
        end
    end

    # Recursively partition upper and lower halves
    middle = length(arr) >> 1
    lower = partition!(pred, @view arr[1:middle])
    upper = partition!(pred, @view arr[middle+1:end])

    # Swap upper block of falses and lower block of trues
    reverse!(@view arr[lower+1:middle])
    reverse!(@view arr[middle+1:middle+upper])
    reverse!(@view arr[lower+1:middle+upper])

    # Return total number of trues
    return lower + upper
end

function run_bench()
  N = 10^3
  b = rand(N)
  a = deepcopy(b)
  @btime sep(x -> x > 0.5, $a)
  a = deepcopy(b)
  @btime separate(x -> x > 0.5, $a)
  a = deepcopy(b)
  B = Vector{Int64}(undef, length(a))
  @btime separate2(x -> x > 0.5, $a, $B)
  a = deepcopy(b)
  @btime sep3(x -> x > 0.5, $a)
  a = deepcopy(b)
  @btime partition!(x -> x > 0.5, $a)
end

run_bench()
1 Like

Very interesting! I also had a solution exactly based on replacing with ones and zeros and using the by keyword of the sort! function with something like x -> f(x) ? 0 : 1. But indeed I gave up on that since I wanted to find a way that keeps the order.

I actually also tried to pass it a closure that mutated two counters (one for true one for false), but that gave me a read-only memory error.

I didn’t know that, thanks for that!

But indeed I gave up on that since I wanted to find a way that keeps the order.

uhhhh, I think your solution based in the default sort! is a good idea. You can choose the algorithm used by sort! and if you choose mergesort, then it will be stable.

function sort_sep!(f::Function, A::AbstractVector)
  sort!(A; by = f, alg = Base.MergeSort, rev = true)
end

function run_bench()
  N = 10^3
  b = rand(N)
  a = deepcopy(b)
  @btime sort_sep!(x -> x > 0.5, $a)
  ...
end

gives

  7.262 μs (2 allocations: 4.08 KiB)

What is not without allocation, but is has about the same time than the two-filter solution (and number of lines, XD), and allocates just one quarter of the memory. The only problem is that it does not give the number of trues, however you can use a binary search after the sort, to find the first false position in O(log n).

The problem with your code is that it calls recursively while allocating views, and you base case is a single element. So in the end you make 2*length(arr) calls and allocate considerably more 2*length(arr) views (because there is not only the views for calling recursively as there are the ones for reverse! too). Each view is allocated in the heap and take more memory than a Float64, that is the eltype of the array I used to test, so this allocates far more than an aux vector, which @louisponet wanted to avoid. Maybe if a pair of Int64 are used as indexes and passed as parameters recursively and to reverse! , this way no allocation happens in heap, and at max log2(length(arr)) tuples of paramters will be stacked in the stack memory.

Good point. There’s an issue discussing stack allocated array views along with some ideas for work-arounds.

Oh, this would be very interesting. Every method that takes an AbstractArray would work on these views and they would be stack allocated, it would be effortless to write efficient code operating over array subsets.

Finally had some time to look through the comments, this was very useful for me to understand different ways of doing this and how to leverage some things in the language. Thanks!