Broad-phase collision detection: bitboard approach

I have recently tried a new approach to detecting potential collisions in 2D space, just for the broad phase (in this phase we just try to quickly reduce the number of potential collisions to check). I came up with a new idea, but I don’t want to miss something.

My idea is simple: you use a grid to subdivide the plane into small cells, and each cell has a given index.
Each object is (just for this phase) represented as a rectangle.
This way we can map each object to a set of cells given by its dimensions.

x0,y0,w0,h0 = get_dimensions(obj)

Next we map each of these values to integers that represent cells:

x, w = _map_to_range(x0,1:size,grid.size[1]), _map_to_range(w0,1:size,grid.size[1])
y, h = _map_to_range(y0,1:size,grid.size[2]), _map_to_range(h0,1:size,grid.size[2])

So x being 1 means cell number 1 on the x axis.
size is the size of the grid; we assume the grid is a square matrix.
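
A minimal sketch of what _map_to_range might do (the real helper isn’t shown here and may differ; this assumes world coordinates in [0, extent] and 1-based cell indices):

# Hypothetical sketch, not the actual implementation:
# quantise a world coordinate `v` in [0, extent] to an integer cell index in `rng`.
function _map_to_range(v, rng::UnitRange, extent)
    cell = first(rng) + floor(Int, v / extent * length(rng))
    return clamp(cell, first(rng), last(rng))
end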

Then we encode the size of the rectangle in a UInt:

traits = (h << BIT_WIDTH) | (w)

We encode each dimension on 5 bits, so BIT_WIDTH = 5.
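
As a small worked example of the packing (hypothetical values, just to show the bit layout):

BIT_WIDTH = 5
w, h = 3, 7
traits = UInt64((h << BIT_WIDTH) | w)  # 0b00111_00011 == 227
w_back = traits & 0x1f                 # low 5 bits  -> 3
h_back = traits >> BIT_WIDTH           # high 5 bits -> 7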
Now we use a PEXT bitboard to directly get all the cells covered by this object.
I have used Arceus.jl for that.

zones::Vector{UInt64} = grid.bitboard[traits]

bitboard is a MagicBitboard; you can check Arceus.jl for more information, but basically, you give it a UInt with each trait encoded as bits and it returns the corresponding value in less than 30 ns.
So for each possible combination of dimensions (which fits in 10 bits), we get all the matching cells from the origin of the grid.
That’s why we need to offset the result:

data = grid.zones # All the possible zones for collisions
offset = (y-1) * size + x
L = length(data) 
@inbounds for i in zones # For all the cell indices the object is covering
    idx = i + offset
    # We ignore collisions out of the grid
    idx <= L && push!(data[idx], obj) # we add the object to it
end

Next we only have to do this with every object, and then filter for the cells with more than one object in them.
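
Roughly like this (a sketch reusing the grid.zones layout from above; the real filtering step may look different):

# Keep only cells containing at least two objects: these are the
# candidate cells to hand over to the narrow phase.
candidates = filter(cell -> length(cell) > 1, grid.zones)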

As for the bitboard, we precompute it for every possible traits combination; for each combination, we do this:

"""
    MakeRoute(x::UInt64)

This is our lookup function, it takes a bit bounding box as input and output all the zones he is in.
Using magic bitboard, foreach possible bounding box, a collection of zones are directly mapped to it
"""
function MakeRoute(x::UInt64)
	# We collect our data just by decoding them from `x`.
	# I guess it's not so hard to understand.
	# We first use the mask to put to zero all irrelevants data with a AND
	# Then if necessary we use shift to put the data at their correct position,
	sz = BIT_WIDTH
	w, h = (x & DEFAULT_MASK) + 1, ((x & (DEFAULT_MASK << sz)) >> sz) + 1

    result::Vector{UInt64} = UInt64[]
    size = 1 << sz # Faster than doing 2^sz
	
	for i in 1:w
		for j in 1:h
			push!(result, (j-1)*size+i)
		end
	end

	return result
end

result will then be stored in the lookup table at the index corresponding to this set of traits.
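
Conceptually, the precomputation fills one entry per traits value. With a plain vector as a stand-in for the magic bitboard (Arceus.jl builds and indexes the real table differently), it would look roughly like:

# Illustration only: one zones vector per possible traits value.
N_TRAITS = 1 << (2 * BIT_WIDTH)                        # all (w, h) combinations fit in 10 bits
lookup = [MakeRoute(UInt64(t)) for t in 0:(N_TRAITS - 1)]
# A query is then just lookup[traits + 1]; Arceus.jl does the indexing via PEXT.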

Now, I tested this and here are the results on an Intel Pentium T4400 with just 1 thread and a software PEXT function (10 times slower than the hardware version):

# Uniform distribution
# 28.931 μs (2 allocations: 16.25 KiB) for 100 objects
# 117.588 μs (2 allocations: 16.25 KiB) for 1k objects
# 1.030 ms (2 allocations: 16.25 KiB) for 10k objects

# Concentrated distribution
# 38.263 μs (2 allocations: 16.25 KiB) for 100 objects
# 212.312 μs (2 allocations: 16.25 KiB) for 1k objects
# 1.960 ms (2 allocations: 16.25 KiB) for 10k objects

What do you think about this? Does it seem cool, or did I miss something important?
Anyway, I find this pretty fast: on a modern computer with BMI2, we can get < 1 ms for 10k objects, even when concentrated.

I might be misunderstanding, but:

Why are the 2D cell indices covered by this object(’s quantised bounding box) not just $\{(p, q) \mid x \leq p \leq x + w,\ y \leq q \leq y + h\}$ (or something similar, depending on your conventions)?

This doesn’t look right. What about the right or bottom boundary cells?


In general I’d say the idea looks fine. I’m not really an expert, but it sounds like a very standard approach, one of the most basic spatial partitioning strategies for this broad phase.

I used bitboards because they allow an almost instantaneous lookup. Using a loop would introduce too much overhead.

And that is the twist: this approach is based on the most basic collision detection strategy for the broad phase, but with a little trick, bitboards. Naive grid partitioning is mostly $O(n \times m)$, but bitboards reduce it to $O(n)$ by totally removing the need to manually enumerate cells as you mentioned earlier. So theoretically, this is the most time-efficient spatial structure. But the memory cost is $O(2^n)$ (where $n$ is the number of bits used for the traits) because of the bitboards.
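
To give a rough idea of that memory cost, here is a back-of-the-envelope estimate (assuming one Vector{UInt64} of w*h cell indices per traits value, as MakeRoute produces):

BIT_WIDTH = 5
n_entries = 1 << (2 * BIT_WIDTH)                             # 1024 possible (w, h) pairs
avg_cells = sum(w * h for w in 1:32, h in 1:32) / n_entries  # ≈ 272 cells per entry on average
n_entries * avg_cells * sizeof(UInt64)                       # ≈ 2.2 MB of indices in total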

Oh, I just forgot to check :sweat_smile:
my bad

Isn’t your for i in zones loop (with zones from the look-up table / bitboard) essentially equivalent to a for p = x:x+w, q = y:y+h (or x:min(x+w, size) etc.)?

No, no. It’s $O(n)$ while what you show is $O(n \times m)$. Plus, it’s way cheaper since zones already contains all the matching cells. We just need to offset them with the object position.

What exactly does zones contain then? I thought it was all the cells (or 1D indices thereof) covered by a rectangle of width w and height h (packed into traits) with top-left corner (1, 1). (Note that this would be $\{1, \ldots, w\} \times \{1, \ldots, h\}$.)

And what are n and m?

Yeah, that’s what zones contains. What I mean is that it’s way cheaper than a double loop. First of all, it’s a single Vector{UInt}, so the CPU is able to load the whole array into the cache, so the iteration is way faster, even if it’s the same number of loop iterations.
When doing a double loop, we have the hassle of managing them both, and you don’t access contiguous data as with the vector.
For a double for loop, it’s $O(n \times m)$ complexity, where m is the length of the first loop and n is the length of the second one:

for i in x:(x+w)
    # We will have to generate this loop m times where m is the width
    for j in y:(y+h)
        ...
    end
end

This is way slower than a simple

for i in zones
    # If `zones` is small enough, it's stored in the cache and the CPU won't have to get it from the RAM
    # `zones` is contiguous in memory, so the CPU can process it without jumping from one memory slot to another
    # We can say it will just slide through the array
    ...
end

I think what you propose is what cell lists are. Having performant implementations of that depends a bit on the exact problem to be solved, but here we have CellListMap.jl and PointNeighbors.jl (at least) that implement something of the sort, with different use cases in mind. For very inhomogeneous distributions of points the tree methods as those of NearestNeighbors.jl tend to be faster.

For these definitions of n and m, both loops have $n \cdot m$ iterations, so there is no difference in complexity. If anything, the ‘double’ for approach is faster as you can precompute the upper index bound. With regards to memory: the for i in x:x+w, j in y:y+h (or for j, i, depending on the grid layout) does not need to store any indices in memory (unlike the zones approach), so it will be faster.

Yeah, I acknowledge that; even the benchmarks show that the more inhomogeneous the distribution, the more performance seems to go down. A tree structure may be more efficient.

But these packages probably don’t implement it the way I am thinking.

Yeah, but for each loop in the double for approach, we need to check bounds and more.
Plus, the double for loop does store indices: the y:y+h needs to be stored somewhere for the loop to just look at it for the next indices.
Using a single vector will be way faster. The indices are just there, you just have to take them (note that the double for needs to recompute linear indices + the offset).
The thing is, the CPU will be faster with a single array of indices because it’s just a simple traversal; with a double for, we need to initialize a loop each time.
So a zone is way faster, and even if it’s the same number of iterations, the complexity is O(n) since we just have a single loop.

y and h will be stored in registers. You don’t need to store anything else.

As a simple example to demonstrate that this is not the case:

using BenchmarkTools

function s1(x, y, w, h)
    s = 0.
    for i in x:x+w, j = y:y+h
        s += sqrt(i+j)  # Need to do something sufficiently non-trivial so that the compiler doesn't just turn this into a closed-form formula
    end
    return s
end

function s2(x, y, w, h)
    s = 0.
    for i in x:x+w
        for j = y:y+h
            s += sqrt(i+j)
        end
    end
    return s
end

function s3(x, y, w, h)
    s = 0.
    i = x
    while i <= x + w
        j = y
        while j <= y + h
             s += sqrt(i + j)
             j += 1
        end
        i += 1
    end
    return s
end

function create_indices(x, y, w, h, sz)
    ids = Int[]
    LI = LinearIndices((sz, sz))
    for i in x:x+w, j = y:y+h  # assume for simplicity that x + w <= sz && y + h <= sz
        push!(ids, LI[j, i])
    end
    return ids
end

function s4(ids)
    s = 0.
    for i in ids
        s += sqrt(i)
    end
    return s
end

@btime s1(5, 10, 20, 30)   #  929.032 ns (0 allocations: 0 bytes)
@btime s2(5, 10, 20, 30)   #  929.032 ns (0 allocations: 0 bytes)
@btime s3(5, 10, 20, 30)   #  927.273 ns (0 allocations: 0 bytes)


ids = create_indices(5, 10, 20, 30, 100)  # [410, 411, ...,  2440]
@btime s4($ids);  #   930.303 ns (0 allocations: 0 bytes) 
# Note that @btime will make sure ids gets cached.

So there’s really no difference between these approaches, if ids gets cached.

In almost all situations the overhead from looping itself will be negligible (and you need to be an expert in Julia and LLVM to predict in advance what this looping will look like in assembly: e.g. s1 could in principle use a single loop jmp; turns out s1 and s2 are identical, but distinct from s3). All these approaches thus have the same time complexity: linear in the number of cells covered by our bounding box (wh).

Well, my own benchmarks don’t agree with you:
I tried replacing the bitboards with a double for loop in my package (the one using these bitboard things),
and I got 4.998 ms (2 allocations: 16.25 KiB), while with bitboards I get 212.312 μs (2 allocations: 16.25 KiB).

Also, let’s take your example and add LoopVectorization.jl

using BenchmarkTools
using LoopVectorization

function s1(x, y, w, h)
    s = 0.
    @turbo for i in x:x+w, j = y:y+h
        s += sqrt(i+(j-1)*w)  # Need to do something sufficiently non-trivial so that the compiler doesn't just turn this into a closed-form formula
    end
    return s
end

function s2(x, y, w, h)
    s = 0.
    @turbo for i in x:x+w
        for j = y:y+h
            s += sqrt(i+(j-1)*w)
        end
    end
    return s
end

function s3(x, y, w, h)
    s = 0.
    i = x
    while i <= x + w
        j = y
        while j <= y + h
             s += sqrt(i + (j-1)*w)
             j += 1
        end
        i += 1
    end
    return s
end

function create_indices(x, y, w, h, sz)
    ids = Int[]
    LI = LinearIndices((sz, sz))
    for i in x:x+w, j = y:y+h  # assume for simplicity that x + w <= sz && y + h <= sz
        push!(ids, LI[j, i])
    end
    return ids
end

function s4(ids::Vector{Int64})
    s = 0.
    @turbo for i in eachindex(ids)
        s += sqrt(ids[i])
    end
    return s
end

@btime s1(5, 10, 20, 30)   #  3.007 μs (0 allocations: 0 bytes)
@btime s2(5, 10, 20, 30)   #  3.007 μs (0 allocations: 0 bytes)
@btime s3(5, 10, 20, 30)   #  5.677 μs (0 allocations: 0 bytes) Doesn't support LoopVectorization


ids = create_indices(5, 10, 20, 30, 100)  # [410, 411, ...,  2440]
@btime s4($ids);  #   2.903 μs (0 allocations: 0 bytes) 
# Note that @btime will make sure ids gets cached.

Could you post this benchmark (as a copy-pasteable MWE)?