Best `separate` performance?

louisponet · June 29, 2019, 4:26pm

Hi all, I’ve been trying out some ways of separating a Vector into a true and false part based on predicate f, and the basic code below is what I ended up with. My question/challenge is whether anyone knows a faster way of doing something like this.

function separate(f::Function, A::AbstractVector)
    la = length(A)
    ix = Vector{Int}(undef, la)
    true_id = 1
    false_id = 0
    for (i, a) in enumerate(A)
	if f(a)
		ix[true_id] = i
		true_id += 1
	else
		ix[end - false_id] = i
		false_id += 1
	end
    end
    reverse!(ix, true_id, la)
    return permute!(A, ix), true_id
end

#julia> @btime separate(x -> x>5, $(collect(1:10)))
#  68.969 ns (3 allocations: 352 bytes)
#([6, 7, 8, 9, 10, 1, 2, 3, 4, 5], 6)

kristoffer.carlsson · June 29, 2019, 4:36pm

When asking these type of questions it is quite important to mention what types of input the function needs to handle. For example:

Is a length of 10 very typical or could the length be 10^6?
Will the separation function be as simple as x > 5 or will it be more general.
Does the output need to be exactly like what you have right now?

Optimization is often exploiting problem specific properties so to do best you need to know the constraints you are working under.

louisponet · June 29, 2019, 4:42pm

Yes this is very true, maybe this wasn’t really a good question. It’s supposed to be just a relatively general tool that works with Vectors up to a length of the order of around 1e3, and the function not drastically more complex than this one, something one would use in a findfirst(x-> ...., A) or map(x->...., A) kind of way (not with a gigantic do end block).

Output can be different but it should at least be structured similarly, i.e. either like it is now, or something like return trues, falses where they are just two arrays instead of one.
I forgot to mention: The ordering of the true and false elements should be kept intact.

louisponet · June 29, 2019, 4:51pm

aside from this being almost 10x slower, true_id that is returned is the index of the first false element.

kristoffer.carlsson · June 29, 2019, 4:58pm

The simplest would just be to use trues = findall(f, arr) to get the trues and then a setdiff(1:length(arr), trues) to get the falses, but maybe that has some overhead?

louisponet · June 29, 2019, 5:02pm

Something like this?

function separate2(f::Function, A::AbstractVector)
	trues = findall(f, A)
	falses = setdiff(1:length(A), trues)
	return A[trues], A[falses]
end

gives me

@btime separate2(x->x>5, $(collect(1:10)))
  757.307 ns (28 allocations: 1.63 KiB)
([6, 7, 8, 9, 10], [1, 2, 3, 4, 5])

improbable22 · June 29, 2019, 5:04pm

A simple one and a fast one – sep2 is about 2x quicker at 10, 4x quicker at 10^3, for me:

sep(f,A) = filter(f,A), filter(!f,A)

function sep2(f,A::Vector)
    out = similar(A)
    nt, nf = 0, length(A)+1
    @inbounds for a in A
        if f(a)
            out[nt+=1] = a
        else
            out[nf-=1] = a
        end
    end
    reverse!(@view out[1+nt:end])
    out
end

louisponet · June 29, 2019, 5:08pm

Ah nice, basically the same but without the index buffer, great!

louisponet · June 29, 2019, 5:12pm

Is there any way to do this fully in-place (no allocations)? I gave it a couple of tries but I couldn’t figure it out.

Henrique_Becker · June 29, 2019, 9:10pm

I wrote the code below for fun. It is fast and in-place (no extra allocation), however it does not fill your pre-requisite that:

The ordering of the true and false elements should be kept intact.

I am not sure if this entirely possible to be done in a O(n)-time and O(1)-memory algorithm, seems like you need a O(n)-time-and-memory (like your first algorithm), or a O(1)-memory but O(n^2)-time. If you really not want the allocation (but can pay O(n^2) time), you can do a single pass where the first true element after some false ones is saved to a variable, and then shift the whole false block one step by the side (copy i:(j-1) where i is the first false and j is the first true after the falses to (i+1):j) and put the true element in position i, but this is horrid.

I think the only sane thing if you need to keep the ordering and do not want to allocate at each call is to take the permutation Vector as a parameter, and make sure its has the right size and none of the old values will be used, then you can reuse it between calls.

using BenchmarkTools

sep(f::Function, A::AbstractVector) = filter(f,A), filter(!f,A)

# does not work for vectors with different index types
function separate(f::Function, A::AbstractVector)
  la = length(A)
  ix = Vector{Int}(undef, la)
  true_id = 1
  false_id = 0
  for (i, a) in enumerate(A)
	  if f(a)
	  	ix[true_id] = i
	  	true_id += 1
	  else
	  	ix[end - false_id] = i
	  	false_id += 1
	  end
  end
  reverse!(ix, true_id, la)
  return permute!(A, ix), true_id
end

# does not work for vectors with different index types
function sep3(f::Function, A::AbstractVector)
  i, j = 1, length(A)
  @inbounds while i < j
    if f(A[i]) # i element at right position
      if f(A[j]) # j element at wrong position
        i += 1
        while i < j && f(A[i]); i += 1; end
        if i < j
          A[i], A[j] = A[j], A[i]
          i += 1
          j -= 1
        else
          @assert i == j
          @views return A[1:j], A[(j+1):length(A)]
        end
      else # both elements at right position
        i += 1
        j -= 1
      end
    else
      if f(A[j]) # i at wrong possition but j at right position
        A[i], A[j] = A[j], A[i]
        i += 1
        j -= 1
      else # both alements at wrong position
        j -= 1
        while j > i && !f(A[j]); j -= 1; end
        if j > i
          A[i], A[j] = A[j], A[i]
          i += 1
          j -= 1
        else
          @assert i == j
          @views return A[1:(i-1)], A[i:length(A)]
        end
      end
    end
  end
  if i == j
    if f(A[i])
      @views return A[1:i], A[(i+1):length(A)]
    else
      @views return A[1:(i-1)], A[i:length(A)]
    end
  else
    @assert i == j + 1
    @views return A[1:j], A[i:length(A)]
  end
end

function run_bench()
  N = 10^3
  b = rand(N)
  a = deepcopy(b)
  ts1,fs1 = @btime sep(x -> x > 0.5, $a)
  a = deepcopy(b)
  ts2,fs2 = @btime separate(x -> x > 0.5, $a)
  a = deepcopy(b)
  ts3,fs3 = @btime sep3(x -> x > 0.5, $a)
end

run_bench()

#   7.882 μs (19 allocations: 16.69 KiB)
#   3.272 μs (3 allocations: 15.91 KiB)
# 519.591 ns (3 allocations: 128 bytes)

jwmerrill · June 29, 2019, 11:15pm

Is there any way to do this fully in-place (no allocations)? I gave it a couple of tries but I couldn’t figure it out.

Here’s a stable, in-place partition algorithm that runs in O(n*log(n)) (I think…). It’s a lot like a merge sort. Based on an implementation in D’s standard library, described here.

This version puts all the elements for which the predicate returns true at the front, and returns the number of elements that evaluate to true.

function partition!(pred, arr)
    if length(arr) === 0
        return 0
    end
    
    # Base case
    if length(arr) === 1
        if pred(arr[1])
            return 1
        else
            return 0
        end
    end

    # Recursively partition upper and lower halves
    middle = length(arr) >> 1
    lower = partition!(pred, @view arr[1:middle])
    upper = partition!(pred, @view arr[middle+1:end])

    # Swap upper block of falses and lower block of trues
    reverse!(@view arr[lower+1:middle])
    reverse!(@view arr[middle+1:middle+upper])
    reverse!(@view arr[lower+1:middle+upper])

    # Return total number of trues
    return lower + upper
end

Example:

julia> let data = collect(1:10)
           partition!(x->x>5, data), data
       end
(5, [6, 7, 8, 9, 10, 1, 2, 3, 4, 5])

jwmerrill · June 29, 2019, 11:17pm

Here’s a paper claiming you can do it in O(n) time, but it looks a lot trickier to implement: Stable Minimum Space Partitioning in Linear Time [PDF].

I think Block Sort (I’d link to Wikipedia, but this forum limits new users to two links per post) uses some similar ideas to achieve an in-place, stable sort with worst case O(n*log(n)) running time.

Henrique_Becker · June 30, 2019, 5:03pm

In retrospective, this is a kinda obvious transformation I have forgotten about. You can treat falses as zeros and trues as ones (or the contrary) and solve by sorting the array by the value associated, I had also forgotten that mergesort did not need a temporary vector.

However, your solution is the one that takes more time and memory, for at least an order of magnitude more than the others, I have run the code below and what I get is:

# one-liner that create two separate vectors using two filters
  7.490 μs (19 allocations: 16.69 KiB)
# first post code, uses aux vector for permutation 
  3.255 μs (3 allocations: 15.91 KiB)
# first post code, but I take the permutation vector as parameter
# and discovered that permute! makes a copy of  the permutation
# vector so I use Base.permute!! instead
  2.385 μs (1 allocation: 32 bytes)
# my code that does not preserve order and return views
  507.447 ns (3 allocations: 128 bytes)
# your code based on mergesort
  49.923 μs (4995 allocations: 234.14 KiB)

The problem with your code is that it calls recursively while allocating views, and you base case is a single element. So in the end you make 2*length(arr) calls and allocate considerably more 2*length(arr) views (because there is not only the views for calling recursively as there are the ones for reverse! too). Each view is allocated in the heap and take more memory than a Float64, that is the eltype of the array I used to test, so this allocates far more than an aux vector, which @louisponet wanted to avoid. Maybe if a pair of Int64 are used as indexes and passed as parameters recursively and to reverse!, this way no allocation happens in heap, and at max log2(length(arr)) tuples of paramters will be stacked in the stack memory.

Also, @louisponet, by what I have seen in permute! implementation (https://github.com/JuliaLang/julia/blob/ac8c5b70514fc0bd743f4e5df6a13dccdaccddb9/base/combinatorics.jl#L97) it seems to enforce one-based indexing. Unfortunately your effort to make the code index-agnostic is not sufficient.

The code used for the tests is below:

using BenchmarkTools

sep(f::Function, A::AbstractVector) = filter(f,A), filter(!f,A)

# does not work for vectors with different index types
function separate(f::Function, A::AbstractVector)
  la = length(A)
  ix = Vector{Int}(undef, la)
  true_id = 1
  false_id = 0
  for (i, a) in enumerate(A)
	  if f(a)
	  	ix[true_id] = i
	  	true_id += 1
	  else
	  	ix[end - false_id] = i
	  	false_id += 1
	  end
  end
  reverse!(ix, true_id, la)
  return permute!(A, ix), true_id
end

# does not work for vectors with different index types
function separate2(f :: Function, A :: AbstractVector, B :: AbstractVector)
  i, j = 0, length(A) + 1
  for (k, a) in enumerate(A)
	  if f(a)
	  	B[i += 1] = k
	  else
	  	B[j -= 1] = k
	  end
  end
  reverse!(B, i, length(A))
  return Base.permute!!(A, B), i
end

# does not work for vectors with different index types
function sep3(f::Function, A::AbstractVector)
  i, j = 1, length(A)
  @inbounds while i < j
    if f(A[i]) # i element at right position
      if f(A[j]) # j element at wrong position
        i += 1
        while i < j && f(A[i]); i += 1; end
        if i < j
          A[i], A[j] = A[j], A[i]
          i += 1
          j -= 1
        else
          @assert i == j
          @views return A[1:j], A[(j+1):length(A)]
        end
      else # both elements at right position
        i += 1
        j -= 1
      end
    else
      if f(A[j]) # i at wrong possition but j at right position
        A[i], A[j] = A[j], A[i]
        i += 1
        j -= 1
      else # both alements at wrong position
        j -= 1
        while j > i && !f(A[j]); j -= 1; end
        if j > i
          A[i], A[j] = A[j], A[i]
          i += 1
          j -= 1
        else
          @assert i == j
          @views return A[1:(i-1)], A[i:length(A)]
        end
      end
    end
  end
  if i == j
    if f(A[i])
      @views return A[1:i], A[(i+1):length(A)]
    else
      @views return A[1:(i-1)], A[i:length(A)]
    end
  else
    @assert i == j + 1
    @views return A[1:j], A[i:length(A)]
  end
end

function partition!(pred, arr)
    if length(arr) === 0
        return 0
    end
    
    # Base case
    if length(arr) === 1
        if pred(arr[1])
            return 1
        else
            return 0
        end
    end

    # Recursively partition upper and lower halves
    middle = length(arr) >> 1
    lower = partition!(pred, @view arr[1:middle])
    upper = partition!(pred, @view arr[middle+1:end])

    # Swap upper block of falses and lower block of trues
    reverse!(@view arr[lower+1:middle])
    reverse!(@view arr[middle+1:middle+upper])
    reverse!(@view arr[lower+1:middle+upper])

    # Return total number of trues
    return lower + upper
end

function run_bench()
  N = 10^3
  b = rand(N)
  a = deepcopy(b)
  @btime sep(x -> x > 0.5, $a)
  a = deepcopy(b)
  @btime separate(x -> x > 0.5, $a)
  a = deepcopy(b)
  B = Vector{Int64}(undef, length(a))
  @btime separate2(x -> x > 0.5, $a, $B)
  a = deepcopy(b)
  @btime sep3(x -> x > 0.5, $a)
  a = deepcopy(b)
  @btime partition!(x -> x > 0.5, $a)
end

run_bench()

louisponet · June 30, 2019, 6:10pm

Very interesting! I also had a solution exactly based on replacing with ones and zeros and using the by keyword of the sort! function with something like x -> f(x) ? 0 : 1. But indeed I gave up on that since I wanted to find a way that keeps the order.

I actually also tried to pass it a closure that mutated two counters (one for true one for false), but that gave me a read-only memory error.

I didn’t know that, thanks for that!

Henrique_Becker · June 30, 2019, 7:17pm

But indeed I gave up on that since I wanted to find a way that keeps the order.

uhhhh, I think your solution based in the default sort! is a good idea. You can choose the algorithm used by sort! and if you choose mergesort, then it will be stable.

function sort_sep!(f::Function, A::AbstractVector)
  sort!(A; by = f, alg = Base.MergeSort, rev = true)
end

function run_bench()
  N = 10^3
  b = rand(N)
  a = deepcopy(b)
  @btime sort_sep!(x -> x > 0.5, $a)
  ...
end

gives

  7.262 μs (2 allocations: 4.08 KiB)

What is not without allocation, but is has about the same time than the two-filter solution (and number of lines, XD), and allocates just one quarter of the memory. The only problem is that it does not give the number of trues, however you can use a binary search after the sort, to find the first false position in O(log n).

jwmerrill · June 30, 2019, 10:02pm

The problem with your code is that it calls recursively while allocating views, and you base case is a single element. So in the end you make 2*length(arr) calls and allocate considerably more 2*length(arr) views (because there is not only the views for calling recursively as there are the ones for reverse! too). Each view is allocated in the heap and take more memory than a Float64, that is the eltype of the array I used to test, so this allocates far more than an aux vector, which @louisponet wanted to avoid. Maybe if a pair of Int64 are used as indexes and passed as parameters recursively and to reverse! , this way no allocation happens in heap, and at max log2(length(arr)) tuples of paramters will be stacked in the stack memory.

Good point. There’s an issue discussing stack allocated array views along with some ideas for work-arounds.

Henrique_Becker · June 30, 2019, 10:45pm

Oh, this would be very interesting. Every method that takes an AbstractArray would work on these views and they would be stack allocated, it would be effortless to write efficient code operating over array subsets.

louisponet · July 2, 2019, 9:45pm

Finally had some time to look through the comments, this was very useful for me to understand different ways of doing this and how to leverage some things in the language. Thanks!

goerch · November 1, 2022, 3:13pm

I stumbled upon this thread when searching for something like std::partition or std::stable_partition in Julia. Do we have these in Base?

Topic		Replies	Views
Fastest way to partition array, given a condition Performance sort	26	2399	July 26, 2021
Fast permutation vector: code needed Performance sorting	38	643	March 21, 2023
In place sorting for views Performance question , sort , views	12	2743	March 1, 2021
Split vector into two using a boolean test General Usage question	1	821	September 28, 2018
Compare permutaion of two numbers efficiently? Performance	32	854	August 4, 2020

Best `separate` performance?

Related topics