Best way to count element pairs (x,y) satisfying a condition?

Hello everyone! I was trying to optimize the following MWE:

using BenchmarkTools
N = 2^20
a = rand(N);
b = rand(N);
@btime count(a+b .< 1)
  2.492 ms (9 allocations: 8.13 MiB)

This seems to be allocating memory for the intermediate a+b vector as well as a BitVector, resulting in huge memory costs. I figured that by using arrow function syntax and sum() instead of count() I could improve the performance:

@btime sum(x-> x[1] + x[2] < 1, (a,b));
  367.150 ns (2 allocations: 64 bytes)

Is there a way to achieve better performance? Can I even improve the cost of the RNG?

I think you want to measure the performance of @btime sum(x-> x[1] + x[2] < 1, zip(a,b)). Your second example is not equivalent to the initial one.

julia> @btime count(x->x[1]+x[2]<1, zip($a,$b))
  1.002 ms (0 allocations: 0 bytes)

is allocation-free, and so is using sum. The reason you were seeing allocations is that you weren’t interpolating the variables (with $) into the benchmark expression.


For the specific MWE in the OP, this also works (probabilistically, given N=2^20):

julia> using Distributions

julia> @btime rand(Binomial(2^20, 0.5))
  15.981 ns (0 allocations: 0 bytes)
524692

This is tongue-in-cheek, but it is true that a less minimal MWE might get more optimization help from the community.

Typically it’s most natural to keep these pairs together from the beginning – in a lightweight columnar table. More convenient, less error-prone, and better performance:

julia> using StructArrays

julia> data = StructArray(; a, b)

julia> @btime count(x->x.a+x.b<1, $data)
  338.458 μs (0 allocations: 0 bytes)

But even if you have these arrays separately and don’t want to always keep them together, creating a table from individual arrays in Julia is free anyway :slight_smile: :

julia> @btime count(x->x.a+x.b<1, StructArray(a=$a, b=$b))
  337.750 μs (0 allocations: 0 bytes)  # same time as above

You can write a simple loop as well:

julia> @btime count(x->x.a+x.b<1, StructArray(a=$a, b=$b))
  660.625 μs (0 allocations: 0 bytes)
524354

julia> function mycount(a,b)
           n = 0
           for i in eachindex(a,b)
               n += a[i] + b[i] < 1
           end
           return n
       end
mycount (generic function with 1 method)

julia> @btime mycount($a,$b)
  658.293 μs (0 allocations: 0 bytes)
524354

Since we are seeing very different values here, I would remind everyone that 1 μs = 1000 ns and 1 ms = 1000 μs…

To complete the picture (of the various possible combinations), I’ll add this variant:

count(i-> a[i] + b[i] < 1, eachindex(a))

Many thanks to everyone for responding to my first post! I feel like I learned a lot from your answers.

Thanks for noticing, my second example was not even correct :sweat_smile:

Actually, what I am trying to count is the points inside the circle inscribed in the unit square, for a Monte Carlo \pi approximation:

(x-0.5)^2 + (y-0.5)^2 \leq (0.5)^2, \qquad \pi \simeq 4 \cdot \frac{\text{\# inside circle}}{N}
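The estimator above can be sketched as a plain loop (the name `approx_pi` is mine, not from the thread):

```julia
using Random

# Monte Carlo estimate of pi: draw N points in the unit square and
# count how many fall inside the inscribed circle of radius 0.5.
function approx_pi(N; rng=Random.default_rng())
    n = 0
    for _ in 1:N
        x, y = rand(rng), rand(rng)
        # Bool promotes to Int, so this adds 1 when inside, 0 otherwise
        n += (x - 0.5)^2 + (y - 0.5)^2 <= 0.25
    end
    return 4n / N
end
```

With N = 2^20 the estimate typically lands within about 0.01 of π.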

Your solution seems to be quite fast, although I would prefer not to use additional dependencies for this.

I wrote a script to compare your solutions and some of my own:

function mycount(x,y)
    n = 0
    for i in eachindex(x,y)
        n +=  (x[i]-.5)^2 + (y[i]-.5)^2 <= .25
    end
    return n
end

function naive_count(N)
    n = 0
    for _ in 1:N
        x, y = rand(2)
        n +=  (x-.5)^2 + (y-.5)^2 <= .25
    end
    return n
end

function compare_impl(N=2^20)
    x = rand(N);
    y = rand(N);
    println("Time for method 1")
    @btime sum(($x .- .5).^2 .+ ($y .- .5).^2 .<= .25);
    println("Time for method 2")  
    @btime sum(p-> (p[1]-.5)^2 + (p[2]-.5)^2 <= .25, zip($x, $y));
    #@btime count(p-> (p[1]-.5)^2 + (p[2]-.5)^2 <= .25, zip($x, $y))
    println("Time for method 3")
    @btime mycount($x,$y);
    println("Time for method 4")
    @btime naive_count($N);
    println("Time for method 5")
    @btime sum(i-> ($x[i]-.5)^2 + ($y[i]-.5)^2 <= .25, eachindex($x));
    println("Time for RNG")
    @btime rand();
    println("Time for RNG vector")
    @btime rand($N);
    println("Time for RNG matrix")
    @btime rand($N,2);
    return
end

These are the results I get for N=2^{20}:

Time for method 1
  834.600 μs (4 allocations: 132.27 KiB)
Time for method 2
  759.300 μs (0 allocations: 0 bytes)
Time for method 3
  649.200 μs (0 allocations: 0 bytes)
Time for method 4
  47.157 ms (1048576 allocations: 80.00 MiB)
Time for method 5
  684.200 μs (0 allocations: 0 bytes)
Time for RNG
  5.400 ns (0 allocations: 0 bytes)
Time for RNG vector
  1.936 ms (2 allocations: 8.00 MiB)
Time for RNG matrix
  4.281 ms (2 allocations: 16.00 MiB)

The simple for loop by @lmiq seems to be the best! :partying_face:


The reason I don’t use count() is that I keep getting a strange error,

ERROR: UndefVarError: count not defined

as if I don’t have the count method for bit arrays. Very weird, although that is another topic.

The RNG part is now the critical part of my code in terms of cost. Any further ideas on how to improve it?

function fast_pi(N=2^20)
    x = rand(N)
    y = rand(N)
    return 4/N * mycount(x,y)
end

I’d use @tturbo from LoopVectorization (make sure you start Julia with multiple threads). Unlike Threads.@threads, it can handle reduction loops like this one, which otherwise wouldn’t be trivially parallelizable.

using BenchmarkTools, LoopVectorization

N=2^20

x = rand(N);
y = rand(N);


function mycount(x,y)
    n = 0
    for i in eachindex(x,y)
        n += (x[i] - 0.5)^2 + (y[i] - 0.5)^2 <= 0.25
    end
    return n
end

function mycount_turbo(x,y) # non-threaded version
    n = 0
    @turbo for i in eachindex(x,y)
        n += (x[i] - 0.5)^2 + (y[i] - 0.5)^2 <= 0.25
    end
    return n
end

function mycount_tturbo(x,y)
    n = 0
    @tturbo for i in eachindex(x,y)
        n += (x[i] - 0.5)^2 + (y[i] - 0.5)^2 <= 0.25
    end
    return n
end


@btime mycount($x,$y)
@btime mycount_turbo($x,$y)
@btime mycount_tturbo($x,$y)

The results in my computer are:

206.600 μs (0 allocations: 0 bytes)
189.000 μs (0 allocations: 0 bytes)
35.200 μs (0 allocations: 0 bytes)

I’m not sure @turbo is doing anything the standard Julia compiler doesn’t already do, at least not on my machine:

julia> @benchmark mycount(x,y) setup=(x=rand(2^20); y=rand(2^20)) evals=1
BenchmarkTools.Trial: 1687 samples with 1 evaluation.
 Range (min … max):  657.078 μs …  1.430 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     803.207 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   823.925 μs ± 81.649 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

              ▆█▄▁▆▇▂                                           
  ▂▂▂▂▂▂▂▁▂▂▂▅████████▅▅▅▄▄▄▄▃▃▃▃▃▃▃▃▂▂▃▂▃▂▂▂▂▂▂▂▁▂▂▂▂▂▂▂▂▂▂▂▂ ▃
  657 μs          Histogram: frequency by time         1.18 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark mycount_turbo(x,y) setup=(x=rand(2^20); y=rand(2^20)) evals=1
BenchmarkTools.Trial: 1684 samples with 1 evaluation.
 Range (min … max):  648.392 μs …  1.492 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     798.282 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   807.451 μs ± 50.628 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                     ▃▃▅█▆▃▆▄▇▄▃                                
  ▂▁▁▂▁▂▁▁▂▂▂▂▂▁▂▂▁▂▅████████████▇▅▅▄▄▅▄▃▃▃▃▃▃▂▂▂▃▃▂▂▂▂▂▂▂▂▂▁▂ ▄
  648 μs          Histogram: frequency by time          995 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

(with @tturbo and parallelization, of course)

Yes, I suspect the same. That’s why I was recommending @tturbo (I included @turbo just in case someone knew why).

Separate the allocation from the RNG, which you can do using Random.rand!.
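A minimal sketch of that buffer-reuse idea, using the distance check from earlier in the thread (the name `pi_into!` is hypothetical):

```julia
using Random

# Preallocate once, then refill in place on every call: Random.rand!
# mutates the existing vectors, so the hot path performs no allocations.
function pi_into!(x::Vector{Float64}, y::Vector{Float64})
    Random.rand!(x)
    Random.rand!(y)
    n = 0
    for i in eachindex(x, y)
        n += (x[i] - 0.5)^2 + (y[i] - 0.5)^2 <= 0.25
    end
    return 4n / length(x)
end

x = Vector{Float64}(undef, 2^20)
y = Vector{Float64}(undef, 2^20)
```

Calling `pi_into!(x, y)` repeatedly reuses the same buffers, so only the first allocation is ever paid.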

Also, I guess it would make sense to use a single vector instead of two separate vectors, for more efficient CPU cache usage. Example with cute functional code:

module M

import Random

export mocapi!

predicate(
  p::NTuple{2, F}, ::Val{c} = Val(one(F)/UInt8(2)),
) where {F<:AbstractFloat, c} =
  +(map((x -> x*x) ∘ (x -> x - c), p)...) < c*c

accessor(v::AbstractVector) = let u = v
  (i::Integer) -> u[begin + i]
end

access(v::AbstractVector, i::Integer) = map(accessor(v), (i, i + 1))

count(v::AbstractVector{<:AbstractFloat}) = let
  map = let u = v
    (i::Integer) -> predicate(access(u, i))
  end
  len = length(v)
  (len % Bool) && error("odd length")
  mapreduce(map, +, 0:2:(len - 1), init = 0)
end

mocapi(v::AbstractVector{<:AbstractFloat}) = let
  len = length(v)
  n = len >>> true
  c = count(v)
  c/n*4
end

mocapi!(v::AbstractVector{<:AbstractFloat}) = mocapi(Random.rand!(v))

end

NB: on Julia nightly, Random.rand! actually works for tuple types, so the vector could be Vector{NTuple{2, Float64}} instead of Vector{Float64}, which would simplify the code a bit, if you’re willing to use nightly.
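On stable Julia, a similar pair-wise view is available without nightly’s tuple `rand!` support, by filling a flat `Vector{Float64}` and reinterpreting it (this is my sketch, not the module above):

```julia
using Random

Random.seed!(0)                             # for a reproducible estimate
v = rand(2^21)                              # one flat buffer; must have even length
pairs = reinterpret(NTuple{2, Float64}, v)  # zero-copy view of 2^20 (x, y) pairs
inside = count(((x, y),) -> (x - 0.5)^2 + (y - 0.5)^2 <= 0.25, pairs)
est = 4 * inside / length(pairs)
```

`reinterpret` allocates nothing; `pairs` shares memory with `v`.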

Usage example:

julia> include("/tmp/mocapi/M.jl")
Main.M

julia> using .M

julia> mocapi!(Vector{Float64}(undef, 2^21))
3.1428489685058594

julia> mocapi!(Vector{Float64}(undef, 2^20))
3.1403274536132812

I also tried various other stuff that didn’t seem to make a difference:

  1. @inbounds
  2. Letting the compiler know the vector length is greater than some large constant, to encourage vectorization.
  3. Letting the compiler know the vector length is a multiple of a power of two. This can help the compiler avoid generating unnecessary code for finishing up processing vectorized loops.

One other thing that could maybe help is ensuring the vector’s data has big alignment, and telling the compiler’s optimizer about it. My package UnsafeAssume is relevant for this, but I doubt a significant improvement would materialize.

If you write something like this inside a function,

count = count(x)

then the compiler treats every count in that function as the same local variable, so when count(x) is called, the local count is not yet defined.

You can even have spooky action at a distance, like this:

function foo(x)
    y = count(x)
    z = 2y
    count = z
    return count + 1
end
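Two ways out of the shadowing (a sketch; `foo_fixed` is a hypothetical name): rename the local, or qualify the function as `Base.count` so the name is never captured as a local:

```julia
# Qualifying with Base.count means no local `count` is ever created,
# so there is no UndefVarError from shadowing.
function foo_fixed(x)
    y = Base.count(x)   # counts the `true` elements
    total = 2y          # or simply pick a name other than `count`
    return total + 1
end
```

For example, `foo_fixed([true, false, true])` counts two `true`s and returns `2*2 + 1 = 5`.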

I’ll share some considerations while waiting for input from someone who knows the internals of the functions involved better.

First of all, the @lmiq loop is not “simple”, or, to be more precise, the way it exploits the particulars of the problem to avoid doing the N ifs that some other proposals do is not entirely simple.

Below are some comparisons between the following functions and the use of count()

function cepsac(a,b, cnd)
	cnt=0
	for (x,y) in zip(a,b)
		if cnd(x,y)
			cnt+=1
		end
	end
	cnt
end
function cepsac1(a,b, cnd)
	cnt=0
	for i in eachindex(a)
		cnt+=cnd(a[i],b[i])
	end
	cnt
end

julia> @btime cepsac1($a,$b, (x,y)->(x - 0.5)^2 + (y - 0.5)^2 <= 0.25)
  578.800 μs (0 allocations: 0 bytes)
822831

julia> @btime cepsac($a,$b, (x,y)->(x - 0.5)^2 + (y - 0.5)^2 <= 0.25)
  884.200 μs (0 allocations: 0 bytes)
822831

julia> @btime count(x-> ($a[x]-0.5)^2 + ($b[x]-0.5)^2 <= 0.25, eachindex($a))
  585.400 μs (0 allocations: 0 bytes)
822831

It seems that count() manages to avoid if statements and just does sums (taking advantage of the fact that false cases contribute nothing to the total).
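The branch-avoiding trick can be seen directly: the two functions below (hypothetical names) give the same count, but the second accumulates the Bool without any if statement, which is what lets the compiler vectorize it:

```julia
# Explicit branch: increments only when the condition holds.
function count_branchy(a, b)
    n = 0
    for i in eachindex(a, b)
        if (a[i] - 0.5)^2 + (b[i] - 0.5)^2 <= 0.25
            n += 1
        end
    end
    return n
end

# Branch-free: Bool promotes to Int, adding 1 when true and 0 when false.
function count_branchless(a, b)
    n = 0
    for i in eachindex(a, b)
        n += (a[i] - 0.5)^2 + (b[i] - 0.5)^2 <= 0.25
    end
    return n
end

a, b = rand(10_000), rand(10_000)
```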

Another consideration I wanted to submit for comment is the following.
Starting from Dan’s observation that optimization can depend on the specifics of the problem (in the abstract, one could answer the OP’s question with N/2), I wondered how the different solutions behave depending on the fraction p of cases that make the condition true.
In other words, how much do the sums and the if-then statements each weigh?
Below are the extreme cases: all if-then, and all sums.


julia> @btime cepsac($a,$b, (x,y)->x+y>2)
  756.200 μs (0 allocations: 0 bytes)
0

julia> @btime cepsac1($a,$b, (x,y)->x+y>2)
  535.600 μs (0 allocations: 0 bytes)
0

Another optimization,

N = 2^22
v = rand(UInt64, N)

struct DoubleInt32
    a::Int32
    b::Int32
end

function mycapi(v::Vector{UInt64})
    w = reinterpret(reshape, DoubleInt32, v)
    S = sum(w) do x
        Int64(x.a)^2 + Int64(x.b)^2 < 2^62
    end
    return S/length(w)*4
end

and this can be benchmarked:

julia> @btime mycapi($v);
  1.274 ms (0 allocations: 0 bytes)
3.1413373947143555

which is around 2x faster than the other alternatives (note that N is 2^22 here). The random number generation, which should also be included in the timing, will also be faster. So it’s a double win.

It would be nice to see a more proper way of getting at the Int32 internals of a UInt64 than the reinterpret above.
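One dependency-free alternative (my sketch, not from the thread; `lo`/`hi` are hypothetical names) is plain bit manipulation: mask or shift out each half, then truncate-reinterpret the 32-bit pattern with `%`:

```julia
# Split a UInt64 into two Int32 halves without reinterpret(reshape, ...).
# `% Int32` keeps the low 32 bits of the value and reinterprets them as signed.
lo(x::UInt64) = (x & 0xffffffff) % Int32   # low 32 bits as a signed Int32
hi(x::UInt64) = (x >>> 32) % Int32         # high 32 bits as a signed Int32
```

For example, `hi(0x0123456789abcdef)` gives `Int32(0x01234567)` and `lo` gives the (negative) Int32 with bit pattern `0x89abcdef`; the two halves reassemble to the original UInt64.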


FTR, this sum loop is just mapreduce in my version.

Indeed, the loop has the trick of avoiding the branch, and the same happens with the count version you posted. That’s simply because the comparison (condition <= 0.25) returns 0 or 1 (false or true).

The more naive implementation of the loop that includes an explicit if statement is probably slower.

Actually, I’d say this performance improvement is caused simply by using 32-bit elements instead of 64-bit ones, thus decreasing the required memory throughput.

A simple adaptation of my above code to your idea of using integer arithmetic (just added another predicate method and relaxed some constraints):

module M

import Random

export mocapi!

square(x) = x*x

predicate(
  p::NTuple{2, F}, ::Val{c} = Val(one(F)/UInt8(2)),
) where {F<:AbstractFloat, c} =
  +(map(square ∘ (x -> x - c), p)...) < c*c

predicate(
  p::NTuple{2, F},
) where {F<:Signed} =
  +(map(square ∘ widen, p)...) < (square ∘ (x -> x + true) ∘ widen ∘ typemax)(F)

accessor(v::AbstractVector) = let u = v
  (i::Integer) -> u[begin + i]
end

access(v::AbstractVector, i::Integer) = map(accessor(v), (i, i + 1))

count(v::AbstractVector{<:Real}) = let
  map = let u = v
    (i::Integer) -> predicate(access(u, i))
  end
  len = length(v)
  (len % Bool) && error("odd length")
  mapreduce(map, +, 0:2:(len - 1), init = 0)
end

mocapi(v::AbstractVector{<:Real}) = let
  len = length(v)
  n = len >>> true
  c = count(v)
  c/n*4
end

mocapi!(v::AbstractVector{<:Real}) = mocapi(Random.rand!(v))

end

Benchmarking reveals that using Int32 is just as fast as using Float32, and likewise Int64 matches Float64:

julia> include("/tmp/M.jl")
Main.M

julia> using BenchmarkTools, .M

julia> @btime mocapi!(v) setup=(v = Vector{Float32}(undef, 2^21);)
  1.985 ms (0 allocations: 0 bytes)
3.143360137939453

julia> @btime mocapi!(v) setup=(v = Vector{Float64}(undef, 2^21);)
  4.097 ms (0 allocations: 0 bytes)
3.1419525146484375

julia> @btime mocapi!(v) setup=(v = Vector{Int32}(undef, 2^21);)
  1.918 ms (0 allocations: 0 bytes)
3.14276123046875

julia> @btime mocapi!(v) setup=(v = Vector{Int64}(undef, 2^21);)
  4.077 ms (0 allocations: 0 bytes)
3.1389427185058594

julia> versioninfo()
Julia Version 1.11.0-DEV.708
Commit fb01dd28a5f (2023-10-21 20:20 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 8 × AMD Ryzen 3 5300U with Radeon Graphics
  WORD_SIZE: 64
  LLVM: libLLVM-15.0.7 (ORCJIT, znver2)
  Threads: 11 on 8 virtual cores
Environment:
  JULIA_EDITOR = code
  JULIA_NUM_THREADS = 8

One nice thing to note is that the Int64 version actually uses Int128 arithmetic, after the widening, so the Int128 arithmetic is blazing fast, considering it doesn’t have full hardware support.
