# Best `separate` performance?

Hi all, I’ve been trying out some ways of separating a `Vector` into a `true` and `false` part based on predicate `f`, and the basic code below is what I ended up with. My question/challenge is whether anyone knows a faster way of doing something like this.

``````function separate(f::Function, A::AbstractVector)
la = length(A)
ix = Vector{Int}(undef, la)
true_id = 1
false_id = 0
for (i, a) in enumerate(A)
if f(a)
ix[true_id] = i
true_id += 1
else
ix[end - false_id] = i
false_id += 1
end
end
reverse!(ix, true_id, la)
return permute!(A, ix), true_id
end

#julia> @btime separate(x -> x>5, \$(collect(1:10)))
#  68.969 ns (3 allocations: 352 bytes)
#([6, 7, 8, 9, 10, 1, 2, 3, 4, 5], 6)
``````

When asking these type of questions it is quite important to mention what types of input the function needs to handle. For example:

• Is a length of 10 very typical or could the length be 10^6?
• Will the separation function be as simple as `x > 5` or will it be more general.
• Does the output need to be exactly like what you have right now?

Optimization is often exploiting problem specific properties so to do best you need to know the constraints you are working under.

1 Like

Yes this is very true, maybe this wasn’t really a good question. It’s supposed to be just a relatively general tool that works with `Vector`s up to a length of the order of around `1e3`, and the function not drastically more complex than this one, something one would use in a `findfirst(x-> ...., A)` or `map(x->...., A)` kind of way (not with a gigantic `do end` block).

Output can be different but it should at least be structured similarly, i.e. either like it is now, or something like `return trues, falses` where they are just two arrays instead of one.
I forgot to mention: The ordering of the true and false elements should be kept intact.

aside from this being almost 10x slower, `true_id` that is returned is the index of the first false element.

The simplest would just be to use `trues = findall(f, arr)` to get the trues and then a `setdiff(1:length(arr), trues)` to get the falses, but maybe that has some overhead?

Something like this?

``````function separate2(f::Function, A::AbstractVector)
trues = findall(f, A)
falses = setdiff(1:length(A), trues)
return A[trues], A[falses]
end
``````

gives me

``````@btime separate2(x->x>5, \$(collect(1:10)))
757.307 ns (28 allocations: 1.63 KiB)
([6, 7, 8, 9, 10], [1, 2, 3, 4, 5])
``````

A simple one and a fast one – `sep2` is about 2x quicker at 10, 4x quicker at 10^3, for me:

``````sep(f,A) = filter(f,A), filter(!f,A)

function sep2(f,A::Vector)
out = similar(A)
nt, nf = 0, length(A)+1
@inbounds for a in A
if f(a)
out[nt+=1] = a
else
out[nf-=1] = a
end
end
reverse!(@view out[1+nt:end])
out
end
``````
1 Like

Ah nice, basically the same but without the index buffer, great!

Is there any way to do this fully in-place (no allocations)? I gave it a couple of tries but I couldn’t figure it out.

I wrote the code below for fun. It is fast and in-place (no extra allocation), however it does not fill your pre-requisite that:

The ordering of the true and false elements should be kept intact.

I am not sure if this entirely possible to be done in a `O(n)`-time and `O(1)`-memory algorithm, seems like you need a `O(n)`-time-and-memory (like your first algorithm), or a `O(1)`-memory but `O(n^2)`-time. If you really not want the allocation (but can pay `O(n^2)` time), you can do a single pass where the first true element after some false ones is saved to a variable, and then shift the whole false block one step by the side (copy `i:(j-1)` where `i` is the first false and `j` is the first true after the falses to `(i+1):j`) and put the true element in position `i`, but this is horrid.

I think the only sane thing if you need to keep the ordering and do not want to allocate at each call is to take the permutation `Vector` as a parameter, and make sure its has the right size and none of the old values will be used, then you can reuse it between calls.

``````using BenchmarkTools

sep(f::Function, A::AbstractVector) = filter(f,A), filter(!f,A)

# does not work for vectors with different index types
function separate(f::Function, A::AbstractVector)
la = length(A)
ix = Vector{Int}(undef, la)
true_id = 1
false_id = 0
for (i, a) in enumerate(A)
if f(a)
ix[true_id] = i
true_id += 1
else
ix[end - false_id] = i
false_id += 1
end
end
reverse!(ix, true_id, la)
return permute!(A, ix), true_id
end

# does not work for vectors with different index types
function sep3(f::Function, A::AbstractVector)
i, j = 1, length(A)
@inbounds while i < j
if f(A[i]) # i element at right position
if f(A[j]) # j element at wrong position
i += 1
while i < j && f(A[i]); i += 1; end
if i < j
A[i], A[j] = A[j], A[i]
i += 1
j -= 1
else
@assert i == j
@views return A[1:j], A[(j+1):length(A)]
end
else # both elements at right position
i += 1
j -= 1
end
else
if f(A[j]) # i at wrong possition but j at right position
A[i], A[j] = A[j], A[i]
i += 1
j -= 1
else # both alements at wrong position
j -= 1
while j > i && !f(A[j]); j -= 1; end
if j > i
A[i], A[j] = A[j], A[i]
i += 1
j -= 1
else
@assert i == j
@views return A[1:(i-1)], A[i:length(A)]
end
end
end
end
if i == j
if f(A[i])
@views return A[1:i], A[(i+1):length(A)]
else
@views return A[1:(i-1)], A[i:length(A)]
end
else
@assert i == j + 1
@views return A[1:j], A[i:length(A)]
end
end

function run_bench()
N = 10^3
b = rand(N)
a = deepcopy(b)
ts1,fs1 = @btime sep(x -> x > 0.5, \$a)
a = deepcopy(b)
ts2,fs2 = @btime separate(x -> x > 0.5, \$a)
a = deepcopy(b)
ts3,fs3 = @btime sep3(x -> x > 0.5, \$a)
end

run_bench()

#   7.882 μs (19 allocations: 16.69 KiB)
#   3.272 μs (3 allocations: 15.91 KiB)
# 519.591 ns (3 allocations: 128 bytes)
``````

Is there any way to do this fully in-place (no allocations)? I gave it a couple of tries but I couldn’t figure it out.

Here’s a stable, in-place partition algorithm that runs in `O(n*log(n))` (I think…). It’s a lot like a merge sort. Based on an implementation in D’s standard library, described here.

This version puts all the elements for which the predicate returns true at the front, and returns the number of elements that evaluate to true.

``````function partition!(pred, arr)
if length(arr) === 0
return 0
end

# Base case
if length(arr) === 1
if pred(arr)
return 1
else
return 0
end
end

# Recursively partition upper and lower halves
middle = length(arr) >> 1
lower = partition!(pred, @view arr[1:middle])
upper = partition!(pred, @view arr[middle+1:end])

# Swap upper block of falses and lower block of trues
reverse!(@view arr[lower+1:middle])
reverse!(@view arr[middle+1:middle+upper])
reverse!(@view arr[lower+1:middle+upper])

return lower + upper
end
``````

Example:

``````julia> let data = collect(1:10)
partition!(x->x>5, data), data
end
(5, [6, 7, 8, 9, 10, 1, 2, 3, 4, 5])
``````
1 Like

Here’s a paper claiming you can do it in `O(n)` time, but it looks a lot trickier to implement: Stable Minimum Space Partitioning in Linear Time [PDF].

I think Block Sort (I’d link to Wikipedia, but this forum limits new users to two links per post) uses some similar ideas to achieve an in-place, stable sort with worst case `O(n*log(n))` running time.

In retrospective, this is a kinda obvious transformation I have forgotten about. You can treat falses as zeros and trues as ones (or the contrary) and solve by sorting the array by the value associated, I had also forgotten that mergesort did not need a temporary vector.

However, your solution is the one that takes more time and memory, for at least an order of magnitude more than the others, I have run the code below and what I get is:

``````# one-liner that create two separate vectors using two filters
7.490 μs (19 allocations: 16.69 KiB)
# first post code, uses aux vector for permutation
3.255 μs (3 allocations: 15.91 KiB)
# first post code, but I take the permutation vector as parameter
# and discovered that permute! makes a copy of  the permutation
# vector so I use Base.permute!! instead
2.385 μs (1 allocation: 32 bytes)
# my code that does not preserve order and return views
507.447 ns (3 allocations: 128 bytes)
# your code based on mergesort
49.923 μs (4995 allocations: 234.14 KiB)
``````

The problem with your code is that it calls recursively while allocating views, and you base case is a single element. So in the end you make `2*length(arr)` calls and allocate considerably more `2*length(arr)` views (because there is not only the views for calling recursively as there are the ones for `reverse!` too). Each view is allocated in the heap and take more memory than a Float64, that is the `eltype` of the array I used to test, so this allocates far more than an aux vector, which @louisponet wanted to avoid. Maybe if a pair of Int64 are used as indexes and passed as parameters recursively and to `reverse!`, this way no allocation happens in heap, and at max `log2(length(arr))` tuples of paramters will be stacked in the stack memory.

Also, @louisponet, by what I have seen in `permute!` implementation (https://github.com/JuliaLang/julia/blob/ac8c5b70514fc0bd743f4e5df6a13dccdaccddb9/base/combinatorics.jl#L97) it seems to enforce one-based indexing. Unfortunately your effort to make the code index-agnostic is not sufficient.

The code used for the tests is below:

``````using BenchmarkTools

sep(f::Function, A::AbstractVector) = filter(f,A), filter(!f,A)

# does not work for vectors with different index types
function separate(f::Function, A::AbstractVector)
la = length(A)
ix = Vector{Int}(undef, la)
true_id = 1
false_id = 0
for (i, a) in enumerate(A)
if f(a)
ix[true_id] = i
true_id += 1
else
ix[end - false_id] = i
false_id += 1
end
end
reverse!(ix, true_id, la)
return permute!(A, ix), true_id
end

# does not work for vectors with different index types
function separate2(f :: Function, A :: AbstractVector, B :: AbstractVector)
i, j = 0, length(A) + 1
for (k, a) in enumerate(A)
if f(a)
B[i += 1] = k
else
B[j -= 1] = k
end
end
reverse!(B, i, length(A))
return Base.permute!!(A, B), i
end

# does not work for vectors with different index types
function sep3(f::Function, A::AbstractVector)
i, j = 1, length(A)
@inbounds while i < j
if f(A[i]) # i element at right position
if f(A[j]) # j element at wrong position
i += 1
while i < j && f(A[i]); i += 1; end
if i < j
A[i], A[j] = A[j], A[i]
i += 1
j -= 1
else
@assert i == j
@views return A[1:j], A[(j+1):length(A)]
end
else # both elements at right position
i += 1
j -= 1
end
else
if f(A[j]) # i at wrong possition but j at right position
A[i], A[j] = A[j], A[i]
i += 1
j -= 1
else # both alements at wrong position
j -= 1
while j > i && !f(A[j]); j -= 1; end
if j > i
A[i], A[j] = A[j], A[i]
i += 1
j -= 1
else
@assert i == j
@views return A[1:(i-1)], A[i:length(A)]
end
end
end
end
if i == j
if f(A[i])
@views return A[1:i], A[(i+1):length(A)]
else
@views return A[1:(i-1)], A[i:length(A)]
end
else
@assert i == j + 1
@views return A[1:j], A[i:length(A)]
end
end

function partition!(pred, arr)
if length(arr) === 0
return 0
end

# Base case
if length(arr) === 1
if pred(arr)
return 1
else
return 0
end
end

# Recursively partition upper and lower halves
middle = length(arr) >> 1
lower = partition!(pred, @view arr[1:middle])
upper = partition!(pred, @view arr[middle+1:end])

# Swap upper block of falses and lower block of trues
reverse!(@view arr[lower+1:middle])
reverse!(@view arr[middle+1:middle+upper])
reverse!(@view arr[lower+1:middle+upper])

return lower + upper
end

function run_bench()
N = 10^3
b = rand(N)
a = deepcopy(b)
@btime sep(x -> x > 0.5, \$a)
a = deepcopy(b)
@btime separate(x -> x > 0.5, \$a)
a = deepcopy(b)
B = Vector{Int64}(undef, length(a))
@btime separate2(x -> x > 0.5, \$a, \$B)
a = deepcopy(b)
@btime sep3(x -> x > 0.5, \$a)
a = deepcopy(b)
@btime partition!(x -> x > 0.5, \$a)
end

run_bench()
``````
1 Like

Very interesting! I also had a solution exactly based on replacing with ones and zeros and using the `by` keyword of the `sort!` function with something like `x -> f(x) ? 0 : 1`. But indeed I gave up on that since I wanted to find a way that keeps the order.

I actually also tried to pass it a closure that mutated two counters (one for true one for false), but that gave me a read-only memory error.

I didn’t know that, thanks for that!

But indeed I gave up on that since I wanted to find a way that keeps the order.

uhhhh, I think your solution based in the default `sort!` is a good idea. You can choose the algorithm used by `sort!` and if you choose mergesort, then it will be stable.

``````function sort_sep!(f::Function, A::AbstractVector)
sort!(A; by = f, alg = Base.MergeSort, rev = true)
end

function run_bench()
N = 10^3
b = rand(N)
a = deepcopy(b)
@btime sort_sep!(x -> x > 0.5, \$a)
...
end
``````

gives

``````  7.262 μs (2 allocations: 4.08 KiB)
``````

What is not without allocation, but is has about the same time than the two-filter solution (and number of lines, XD), and allocates just one quarter of the memory. The only problem is that it does not give the number of trues, however you can use a binary search after the sort, to find the first false position in `O(log n)`.

The problem with your code is that it calls recursively while allocating views, and you base case is a single element. So in the end you make `2*length(arr)` calls and allocate considerably more `2*length(arr)` views (because there is not only the views for calling recursively as there are the ones for `reverse!` too). Each view is allocated in the heap and take more memory than a Float64, that is the `eltype` of the array I used to test, so this allocates far more than an aux vector, which @louisponet wanted to avoid. Maybe if a pair of Int64 are used as indexes and passed as parameters recursively and to `reverse!` , this way no allocation happens in heap, and at max `log2(length(arr))` tuples of paramters will be stacked in the stack memory.

Good point. There’s an issue discussing stack allocated array views along with some ideas for work-arounds.

Oh, this would be very interesting. Every method that takes an AbstractArray would work on these views and they would be stack allocated, it would be effortless to write efficient code operating over array subsets.

Finally had some time to look through the comments, this was very useful for me to understand different ways of doing this and how to leverage some things in the language. Thanks!