Exploring Efficient not in Operations in Julia: Benchmarking Different Methods

So, it turns out there are a number of ways to write code in Julia that checks if a scalar is or is not within a vector.

See benchmark repo.

Exploring Efficient not in Operations in Julia

Introduction

Julia provides multiple ways to implement a not in (\notin) operation, which checks whether a scalar is not present in a vector. While the syntax !(scalar in vector) is straightforward, there are alternative methods like using any, all, and manual loops, each with potential trade-offs in performance.

I decided to benchmark different implementations to evaluate their efficiency under various scenarios. In this post, I will:

  • Present the methods I tested.
  • Outline the benchmarking process I followed.
  • Seek advice from the Julia community on potential improvements or overlooked methods.

Methods Tested

Here are the five approaches I tested:

  1. !(node in route): The standard not in operation in Julia.
  2. !any(isequal(node), route): Uses any to check for equality, then negates the result.
  3. all(x -> x != node, route): Verifies that all elements in the vector are not equal to the scalar.
  4. Manual for loop: Iterates through the vector and breaks early if a match is found.
  5. SIMD-optimized for loop: A manual loop using the @simd macro for potential vectorization and faster execution.

Benchmarking Setup

The benchmarking script:

  • Generates random-sized vectors (routes) containing integers between 1 and NB_VERTICES.
  • Tests each method against random scalar values (nodes).
  • Measures execution time using the BenchmarkTools.jl package over multiple iterations.

Key Parameters

  • NB_VERTICES: Total number of unique vertices.
  • NB_ROUTES: Number of routes (vectors) to generate.
  • MAX_ROUTE_SIZE: Maximum size of each vector.
  • ITERATIONS: Number of benchmarking repetitions.

Discussion Points

  • Are there any better approaches to write a not in operation that I may have missed?
  • For small and large vectors, which method should theoretically be the most efficient? Why?
  • How would SIMD optimizations (@simd) impact this operation in practice?

I need to optimize this operation for some custom Dynamic-Programming algorithms.


Request for Feedback

I’d love to hear your thoughts on:

  • The choice of methods I benchmarked.
  • Additional ideas for writing efficient not in operations in Julia.
  • Any relevant insights into optimizing this operation, particularly with respect to memory and computational efficiency.

Feel free to share your expertise or suggest any improvements to my benchmarking approach. Let’s discuss the best way to handle such operations in Julia! You can review my code, and propose improvements on the GitHub repo

Results

Here are results with a decent amount of iterations:

Method: !in Results:  Mean: 4.741224284129747e-7 Std: 1.905167918928099e-8
Method: any Results:  Mean: 4.74334978316611e-7 Std: 1.5981341335909413e-8
Method: all Results:  Mean: 4.7388496802150037e-7 Std: 1.7295353118322162e-8
Method: for Results:  Mean: 4.747417947615649e-7 Std: 1.7384824081541754e-8
Method: sfor Results: Mean: 4.759807047715669e-7 Std: 1.9178525661399665e-8

So it seems there is not really any significant difference between those alternative writings.

4 Likes

If this is a workload you care about, you really shouldn’t be using a Vector. a Set or tree-based structure will do these operations way way way faster.

7 Likes

I’m relieved by this conclusion. These should all be equivalent unless one of them is not getting optimized appropriately.

2 Likes

For a hash-based approach with a nice interface, see AcceleratedArrays.jl.

2 Likes
  • How would SIMD optimizations (@simd) impact this operation in practice?

If only using Vectors, this implementation seems faster by 30-40% than !(x in X) for length 1000 on my PC:

sum(X .== x) == 0

Probably due to easier SIMD. This does two passes over data, so I expect a hand-written loop to gain another 2x.

1 Like

I guess you are speaking about when X is a Vector of Vectors or Matrix. I will test the not in in this situation later.

So for now, I have tested:

function method_1(node, route)
    !(node in route)
end

function method_2(node, route)
    !any(isequal(node), route)
end

function method_3(node, route)
    all(x -> x != node, route)
end

function method_4(node, route)
    for r in route
        if r == node
            return false
        end
    end
    return true
end

function method_5(node, route)
    @simd for r in route
        if r == node
            return false
        end
    end
    return true
end

function method_6(node, route)
    return sum(route .== node) == 0
end

function method_7(node, route)
    sum = 0
    @simd for r in route
            sum += r == node
    end
    return sum == 0
end

Where I name each method according to:

METHODS::Vector{Tuple{String, Function}} = [
    ("!in", method_1),
    ("any", method_2),
    ("all", method_3),
    ("for", method_4),
    ("sfor", method_5),
    ("sum", method_6),
    ("ssum", method_7),
]

And I get:

Method: !in Mean: 4.76043449380637e-7 Std: 1.671772030037151e-8
Method: any Mean: 4.748514428584657e-7 Std: 1.8416113584003793e-8
Method: all Mean: 4.773872523225068e-7 Std: 1.565677259120496e-8
Method: for Mean: 4.775637474296272e-7 Std: 1.840971015276048e-8
Method: sfor Mean: 4.771978918301575e-7 Std: 1.8409102121552966e-8
Method: sum Mean: 9.987918e-5 Std: 5.949062057266853e-6
Method: ssum Mean: 4.768339770688275e-7 Std: 1.918197393282483e-8

Which looks like this:

So again, all methods are equivalent, except for sum method sum(route .== node) == 0 which is clearly slower.

You get the same vectorization, without allocation, by doing

sum(==(x), X)

or, even a bit faster (for some reason)

count(==(x), X)

Early bailout with any is good when x is present, but will ruin simd. A hand-vectorized early bailout version is probably optimal.

julia> @btime sum($x .== $X);
  110.394 ns (3 allocations: 224 bytes)

julia> @btime count(==($x), $X);
  62.245 ns (0 allocations: 0 bytes)

This is with X of length 1024.

It’s also relevant to note that your benchmark code is invalid in the sense of PSA: Microbenchmarks remember branch history

This doesn’t matter for your specific code, because it’s not branch-heavy. But you absolutely need a different benchmarking setup.

Thanks for checking! I imagine GC has an effect on aggregate timings, since method_6 allocates. To me, that is expected, but the extent of the slowdown is a bit surprising (200x, I guess?).

I wonder what timings you get for

using BenchmarkTools
X = rand(Int, 1000);
x = rand(Int);
@btime method_1($x, $X);
@btime method_6($x, $X);

On my laptop:

  601.136 ns (0 allocations: 0 bytes)
  152.242 ns (3 allocations: 224 bytes)
julia> @btime count(==($x), $X);
  81.612 ns (0 allocations: 0 bytes)

Ah, the llvm ir is so neat, nice! (EDIT: LOL. Not IR, assembly, too little sleep, sorry.)

.LBB0_6:                                # %vector.body
                                        # =>This Inner Loop Header: Depth=1
        vpcmpeqq        ymm5, ymm2, ymmword ptr [rdx + 8*rax + 8]
        vpsubq  ymm0, ymm0, ymm5
        vpcmpeqq        ymm5, ymm2, ymmword ptr [rdx + 8*rax + 40]
        vpsubq  ymm1, ymm1, ymm5
        vpcmpeqq        ymm5, ymm2, ymmword ptr [rdx + 8*rax + 72]
        vpsubq  ymm3, ymm3, ymm5
        vpcmpeqq        ymm5, ymm2, ymmword ptr [rdx + 8*rax + 104]
        vpsubq  ymm4, ymm4, ymm5
        add     rax, 16
        cmp     rsi, rax
        jne     .LBB0_6

FWIW, @DNF’s proposal:

julia> function block_in(needle, haystack)
       found = 0
       len = length(haystack)
       for i=1:(len>>6)
       start = 1 + (i-1)<<6
       stop = min(len, i <<6)
       @simd for j=start:stop
       @inbounds straw = haystack[j]
       found += needle==straw
       end
       found > 0 && return true
       end
julia> r=zeros(Int, 100_000); @btime in(1, $r); @btime block_in(1, $r); @btime count($(==(1)), $r)
  29.678 μs (0 allocations: 0 bytes)
  13.516 μs (0 allocations: 0 bytes)
  11.892 μs (0 allocations: 0 bytes)