Broad-phase collision detection: a bitboard approach

I have recently tried a new approach to detecting potential collisions in 2D space, just for the broad phase (the phase where we quickly reduce the number of potential collisions to check). I have come up with an idea, but I don’t want to miss something.

So my idea is simple: you use a grid to subdivide the plane into small cells, each cell having a given index.
Now each object is (just for this phase) represented as a rectangle.
This way we can map each object to a set of cells given by its dimensions.

x0,y0,w0,h0 = get_dimensions(obj)

Next we map each of these values to integer cell coordinates:

x, w = _map_to_range(x0,1:size,grid.size[1]), _map_to_range(w0,1:size,grid.size[1])
y, h = _map_to_range(y0,1:size,grid.size[2]), _map_to_range(h0,1:size,grid.size[2])

So x being 1 means cell number 1 on the x axis.
size is the size of the grid; we assume the grid is a square matrix.
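
For reference, _map_to_range is not shown in this post; here is a minimal sketch of what it could look like (hypothetical, the real helper may differ):

# Hypothetical sketch: scale a world-space value `v` (measured against
# `world_size`) onto the integer range `r`, e.g. 1:32 for a 32-cell axis.
_map_to_range(v::Real, r::AbstractUnitRange, world_size::Real) =
    floor(Int, v * length(r) / world_size)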

Then we encode the size of the rectangle in a UInt:

traits = (h << BIT_WIDTH) | (w)

We encode each dimension on 5 bits, so BIT_WIDTH = 5.
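For instance, with a quantised width of 3 and height of 2, the packing from above gives:

w, h = UInt64(3), UInt64(2)
traits = (h << BIT_WIDTH) | w   # 0b00010_00011 == 67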
Now we use a PEXT bitboard to directly get all the cells covered by this object.
I have used Arceus.jl for that.

zones::Vector{UInt64} = grid.bitboard[traits]

bitboard is a MagicBitboard; you can check Arceus.jl for more information, but basically you give it a UInt with each trait encoded as a bit and it returns the corresponding value in less than 30 ns.
So for each possible combination of dimensions (which fits in 10 bits), we get all the matching cells relative to the origin of the grid.
That’s why we need to offset the result:

data = grid.zones           # The per-cell object lists
offset = (y - 1) * size + x # Linear index of the object's origin cell
L = length(data)
@inbounds for i in zones    # For every cell index the object covers
    idx = i + offset
    # We will ignore collisions out of the grid
    idx <= L && push!(data[idx], obj) # Register the object in that cell
end

Next we only have to do this for every object, then filter for cells with more than one object in them.
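
That filtering step could look like this sketch (assuming grid.zones is the vector of per-cell object lists filled above):

# Only cells holding at least two objects can produce a collision pair
candidates = filter(cell -> length(cell) > 1, grid.zones)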

As for the bitboard, we precompute it for every possible traits combination. For each combination, we do this:

"""
    MakeRoute(x::UInt64)

This is our lookup function, it takes a bit bounding box as input and output all the zones he is in.
Using magic bitboard, foreach possible bounding box, a collection of zones are directly mapped to it
"""
function MakeRoute(x::UInt64)
	# We collect our data just by decoding them from `x`.
	# I guess it's not so hard to understand.
	# We first use the mask to put to zero all irrelevants data with a AND
	# Then if necessary we use shift to put the data at their correct position,
	sz = BIT_WIDTH
	w, h = (x & DEFAULT_MASK) + 1, ((x & (DEFAULT_MASK << sz)) >> sz) + 1

    result::Vector{UInt64} = UInt64[]
    size = 1 << sz # Faster than doing 2^sz
	
	for i in 1:w
		for j in 1:h
			push!(result, (j-1)*size+i)
		end
	end

	return result
end

The result will be assigned in the lookup table at the index corresponding to this set of traits.
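
In other words, the precomputation could be sketched like this (a sketch only; Arceus.jl’s actual construction API may differ, and TABLE is a hypothetical name):

# One entry per possible 10-bit traits value; each entry holds the
# cell indices MakeRoute computes for that (w, h) combination.
const TABLE = [MakeRoute(UInt64(t)) for t in 0:(1 << (2 * BIT_WIDTH)) - 1]

# A lookup is then a single indexed load:
zones = TABLE[traits + 1]   # +1 since traits starts at 0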

Now, I tested this, and here are the results on an Intel Pentium T4400 with just 1 thread and a software PEXT function (10 times slower than the hardware version):

# Uniform distribution
# 28.931 μs (2 allocations: 16.25 KiB) for 100 objects
# 117.588 μs (2 allocations: 16.25 KiB) for 1k objects
# 1.030 ms (2 allocations: 16.25 KiB) for 10k objects

# Concentrated distribution
# 38.263 μs (2 allocations: 16.25 KiB) for 100 objects
# 212.312 μs (2 allocations: 16.25 KiB) for 1k objects
# 1.960 ms (2 allocations: 16.25 KiB) for 10k objects

What do you think about this? Does it seem cool, or did I miss something important?
Anyway, I find this pretty fast: on a modern computer with BMI2, we can get < 1 ms for 10k objects, even when concentrated.

I might be misunderstanding, but:

Why are the 2D cell indices covered by this object('s quantised bounding box) not just \{(p, q) \; | \; x \leq p \leq x + w, y \leq q \leq y + h\} (or something similar, depending on your conventions)?

This doesn’t look right. What about the right or bottom boundary cells?


In general I’d say the idea looks fine. I’m not really an expert, but it sounds like a very standard approach, one of the most basic spatial partitioning strategies for this broad phase.

I used bitboards because they allow almost instantaneous lookup. Using a loop would introduce too much overhead.

And that is the twist: this approach is based on the most basic collision detection strategy for the broad phase, but with a little trick, bitboards. Naive grid partitioning is mostly O({n \times m}), but bitboards reduce it to O(n) by totally removing the need to manually enumerate cells, as you mentioned earlier. So theoretically, this is the most time-efficient spatial structure. But the memory cost is O({2^n}) (where n is the number of bits used for the traits) because of the bitboards.
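
To put a rough number on that memory cost (a back-of-the-envelope estimate, not a measured figure): with the 10-bit traits above, every (w, h) pair from 1:32 × 1:32 stores w × h cell indices, so

total_cells = sum(w * h for w in 1:32, h in 1:32)    # == 278_784
total_cells * sizeof(UInt64) / 2^20                  # ≈ 2.1 MiB, plus per-Vector overhead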

Oh, I just forgot to check :sweat_smile:
my bad

Isn’t your for i in zones loop (with zones from the look-up table / bitboard) essentially equivalent to a for p = x:x+w, q = y:y+h (or x:min(x+w, size) etc.)?

No, no. It’s O(n) while what you show is O({n\times m}). Plus it’s way cheaper, since zones already contains all the matching cells; we just need to offset them with the object position.

What exactly does zones contain then? I thought it was all the cells (or 1D indices thereof) covered by a rectangle of width w and height h (packed into traits) with top-left corner (1, 1). (Note that this would be \{1, \ldots, w\} \times \{1, \ldots, h\}.)

And what are n and m?

Yeah, that’s what zones contains. What I mean is that it’s way cheaper than a double loop. First of all, it’s a single Vector{UInt}, so the CPU is able to load the whole array into the cache, so the iteration is way faster, even if it’s the same number of iterations.
When doing a double loop, we have the hassle of managing them both, and you don’t access contiguous data as with the vector.
For a double for loop, it’s O(n×m) complexity, where m is the length of the first loop and n is the length of the second one:

for i in x:(x+w)
    # This inner loop has to be set up m times, where m is the width
    for j in y:(y+h)
        ...
    end
end

This is way slower than a simple


for i in zones
    # If `zones` is small enough, it's stored in the cache and the CPU won't have to fetch it from RAM.
    # `zones` is contiguous in memory, so the CPU can process it without jumping from one memory slot to another;
    # we can say it will just slide through the array.
    ...
end
I think what you propose is what cell lists are. Having performant implementations of that depends a bit on the exact problem to be solved, but here we have CellListMap.jl and PointNeighbors.jl (at least) that implement something of the sort, with different use cases in mind. For very inhomogeneous distributions of points the tree methods as those of NearestNeighbors.jl tend to be faster.


For these definitions of n and m, both loops have n \cdot m iterations, so there is no difference in complexity. If anything, the ‘double’ for approach is faster, as you can precompute the upper index bound. With regards to memory: the for i in x:x+w, j in y:y+h (or for j, i depending on the grid layout) does not need to store any indices in memory (unlike the zones approach), so will be faster.

Yeah, I acknowledge that; even the benchmarks show that the more inhomogeneous the distribution, the more performance seems to go down. A tree structure may be more efficient.

But these packages probably don’t implement it the way I am thinking.

Yeah, but in each iteration of the double for approach, we need to check bounds and more.
Plus, the double for loop does store indices: the y:y+h needs to be stored somewhere for the loop to look at for the next indices.
Using a single vector will be way faster. The indices are just there, you just have to take them (note that the double for needs to recompute linear indices plus the offset).
The thing is, the CPU will be faster with a single array of indices because it’s just a simple traversal; with a double for, we need to initialize a loop each time.
So a zone is way faster, and even if it’s the same number of iterations, the complexity is O(n) since we just have a single loop.

y and h will be stored in registers. You don’t need to store anything else.

As a simple example to demonstrate that this is not the case:

using BenchmarkTools

function s1(x, y, w, h)
    s = 0.
    for i in x:x+w, j = y:y+h
        s += sqrt(i+j)  # Need to do something sufficiently non-trivial so that the compiler doesn't just turn this into a closed-form formula
    end
    return s
end

function s2(x, y, w, h)
    s = 0.
    for i in x:x+w
        for j = y:y+h
            s += sqrt(i+j)
        end
    end
    return s
end

function s3(x, y, w, h)
    s = 0.
    i = x
    while i <= x + w
        j = y
        while j <= y + h
             s += sqrt(i + j)
             j += 1
        end
        i += 1
    end
    return s
end

function create_indices(x, y, w, h, sz)
    ids = Int[]
    LI = LinearIndices((sz, sz))
    for i in x:x+w, j = y:y+h  # assume for simplicity that x + w <= sz && y + h <= sz
        push!(ids, LI[j, i])
    end
    return ids
end

function s4(ids)
    s = 0.
    for i in ids
        s += sqrt(i)
    end
    return s
end

@btime s1(5, 10, 20, 30)   #  929.032 ns (0 allocations: 0 bytes)
@btime s2(5, 10, 20, 30)   #  929.032 ns (0 allocations: 0 bytes)
@btime s3(5, 10, 20, 30)   #  927.273 ns (0 allocations: 0 bytes)


ids = create_indices(5, 10, 20, 30, 100)  # [410, 411, ...,  2440]
@btime s4($ids);  #   930.303 ns (0 allocations: 0 bytes) 
# Note that @btime will make sure ids gets cached.

So there’s really no difference between these approaches, if ids gets cached.

In almost all situations the overhead from looping itself will be negligible (and you need to be an expert in Julia and LLVM to predict in advance what this looping will look like in assembly: e.g. s1 could in principle use a single loop jmp; turns out s1 and s2 are identical, but distinct from s3). All these approaches thus have the same time complexity: linear in the number of cells covered by our bounding box (wh).


Well, my own benchmarks don’t agree with you.
I tried replacing the bitboards with a double for loop in my package (the one using these bitboard things), and I got 4.998 ms (2 allocations: 16.25 KiB), while with bitboards I get 212.312 μs (2 allocations: 16.25 KiB).

Also, let’s take your example and add LoopVectorization.jl

using BenchmarkTools
using LoopVectorization

function s1(x, y, w, h)
    s = 0.
    @turbo for i in x:x+w, j = y:y+h
        s += sqrt(i+(j-1)*w)  # Need to do something sufficiently non-trivial so that the compiler doesn't just turn this into a closed-form formula
    end
    return s
end

function s2(x, y, w, h)
    s = 0.
    @turbo for i in x:x+w
        for j = y:y+h
            s += sqrt(i+(j-1)*w)
        end
    end
    return s
end

function s3(x, y, w, h)
    s = 0.
    i = x
    while i <= x + w
        j = y
        while j <= y + h
             s += sqrt(i + (j-1)*w)
             j += 1
        end
        i += 1
    end
    return s
end

function create_indices(x, y, w, h, sz)
    ids = Int[]
    LI = LinearIndices((sz, sz))
    for i in x:x+w, j = y:y+h  # assume for simplicity that x + w <= sz && y + h <= sz
        push!(ids, LI[j, i])
    end
    return ids
end

function s4(ids::Vector{Int64})
    s = 0.
    @turbo for i in eachindex(ids)
        s += sqrt(ids[i])
    end
    return s
end

@btime s1(5, 10, 20, 30)   #  3.007 μs (0 allocations: 0 bytes)
@btime s2(5, 10, 20, 30)   #  3.007 μs (0 allocations: 0 bytes)
@btime s3(5, 10, 20, 30)   #  5.677 μs (0 allocations: 0 bytes) Doesn't support LoopVectorization


ids = create_indices(5, 10, 20, 30, 100)  # [410, 411, ...,  2440]
@btime s4($ids);  #   2.903 μs (0 allocations: 0 bytes) 
# Note that @btime will make sure ids gets cached.

Could you post this benchmark (as a copy-pasteable MWE)?


It’s part of a bigger package. For example, here is what I ran to get these results:

using .Interactions
using BenchmarkTools

world = IWorld2D()
const Vec2f = Interactions.Vec2f

for i in 1:1000
	push!(world.particles, Particle2D(rand(), Vec2f(rand(1:40), rand(1:20)), Vec2f(rand(-5:5),rand(-5:5)), 
		Vec2f(rand(0:5),rand(0:5)), Vec2f(0,0), 0.95))
end

grid = world.grid
particles = world.particles
@btime ProcessObjects($grid, $particles)

I just did this in the internals:

function ProcessObjects(grid::CollisionGrid, obj::Vector)
	empty!.(grid.zones)
	for o in obj
		SetInZone(grid, o)
	end

	zones = filter(v -> length(v) > 1, grid.zones)
	for zone in zones
		# The collision resolution
	end
end

I changed the function SetInZone from this:

function SetInZone(grid::CollisionGrid{T,N,L}, obj::T) where {T,N,L}
    x0, y0, w0, h0 = get_dimensions(obj)
    size = 1 << BIT_WIDTH # The size of our mask

    x, w = _map_to_range(x0, 1:size, grid.size[1]), _map_to_range(w0, 1:size, grid.size[1])
    y, h = _map_to_range(y0, 1:size, grid.size[2]), _map_to_range(h0, 1:size, grid.size[2])

    !(0 ≤ x ≤ 31 && 0 ≤ y ≤ 31) && return

    traits = to_bit_boundingbox(w, h, 1, 1)
    zones::Vector{UInt64} = grid.bitboard[traits]

    data = grid.zones
    offset = _compute_coordinate((x, y), (size, size))
    @inbounds for i in zones
        idx = i + offset
        idx <= L && push!(data[idx], obj)
    end
end

To this

function SetInZone(grid::CollisionGrid{T,N,L}, obj::T) where {T,N,L}
    x0, y0, w0, h0 = get_dimensions(obj)
    size = 1 << BIT_WIDTH # The size of our mask

    x, w = _map_to_range(x0, 1:size, grid.size[1]), _map_to_range(w0, 1:size, grid.size[1])
    y, h = _map_to_range(y0, 1:size, grid.size[2]), _map_to_range(h0, 1:size, grid.size[2])

    !(0 ≤ x ≤ 31 && 0 ≤ y ≤ 31) && return
    data = grid.zones
    offset = _compute_coordinate((x, y), (size, size))
    @inbounds for i = x:x+w, j = y:y+h
        idx = i + (j - 1) * size + offset
        idx <= L && push!(data[idx], obj)
    end
end

Doing a double for loop means doing more instructions. It’s hard to see the difference at a small scale, but in my example I have 1000 Particle2D, and for each of them I have to put them in zones.
The two SetInZone versions would have similar performance for 1 or 2 particles, but the more we increase the number, the bigger the gap gets.
There is the hassle of managing 2 loops (plus one loop set up m times, where m is the width), plus the linear index calculation.
In your previous example, it’s the same thing that made the vector version outperform the others after I added @turbo: a double for can’t be correctly unrolled by the CPU, while a simple loop is kind of straightforward for it.

If Julia allowed us to see branch mispredictions or cache misses, you would see the difference.

I can’t get Cruise.jl installed:

  • It does not seem to be registered, so ] add Cruise does not work.
  • When adding the GitHub repo I get unsatisfiable requirements for the environment.
  • Project.toml is missing dependencies like StaticArrays and BenchmarkTools.
  • Even when manually setting up the environment, and fixing some module end without a module start in Crates.jl, I keep getting errors, and can’t be bothered to further try to fix them.

Could you provide some self-contained code, where you strip out all the non-essentials?


Regardless, note that in the second version of SetInZone, you don’t need the offset, and you also do not need to have computed grid.bitboard. That should not explain the performance difference, but still would need to be changed for a fair comparison.

Also, w = floor(Int, w0 * size / grid.size[1]) is most likely not what you want: a small object with w0 * size < grid.size[1] would get w == 0. Furthermore, if you have a very narrow bounding box just across the border of two cells, I suppose you’d want w == 2?
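
One way to address both issues (a sketch with a hypothetical helper name, not the package’s actual code) would be to quantise the two corners of the box and derive the covered cell span from them:

# Map both corners to cells and take the span between them: a box
# straddling a cell border then covers both cells, and no box covers zero cells.
function cell_span(x0, w0, ncells, world_size)
    first_cell = clamp(floor(Int, x0 * ncells / world_size) + 1, 1, ncells)
    last_cell  = clamp(floor(Int, (x0 + w0) * ncells / world_size) + 1, 1, ncells)
    return first_cell:last_cell
end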

I’m not quite sure how @turbo works, but the compiler does not know how many iterations there are in the loops (as this is not encoded in the types), so it can’t unroll either type of loop. The difference in timing between s1 and s4 was also only 0.1 µs on a scale of 3 µs, which I would describe as insignificant.

You’re correct that the proper order of the double loop would be for j = y:y+h, i=x:x+w (assuming the idx calculation is correct), so indeed this was not cache optimal. Are there any other cache misses you’re referring to? And which branches would get mispredicted? How would they differ between the two versions?
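
For reference, the cache-friendlier order could be sketched as follows (assuming 1 ≤ x, y and a size × size grid, so the absolute index can be computed directly and the separate offset becomes unnecessary):

# Rows in the outer loop, so consecutive idx values are contiguous in data;
# clamping to `size` also makes the idx <= L check redundant.
@inbounds for j = y:min(y + h, size), i = x:min(x + w, size)
    idx = (j - 1) * size + i
    push!(data[idx], obj)
end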

It’s normal, I’m actually completely remaking it. I am working on the physics engine as a standalone package; you don’t need Cruise.jl for that.
The thing is, the physics engine also has dependencies (that are not yet registered) and I’m still testing. You would need to copy the math library I made and the code I’m currently making.
I will push it to GitHub so you can clone it.
Here is the link: GitHub - Gesee-y/Interactions.jl: A 2D/3D Game physic engine. It's a mix of a mass aggregate and rigid body engine.
Here is the math library (I changed the name, though):
GitHub - Gesee-y/GDMathLib.jl: Mathematic library for game development
Lastly, you need this: GitHub - Gesee-y/Arceus.jl: My take on this: Arcane julia entity management system

Huh, technically, the compiler does know how many iterations there are. Isn’t that the very definition of a for loop? A loop where we already know the number of iterations?

I wouldn’t say that: 0.1 µs is far from negligible. In a game, we rarely have just one object. If we have 10k objects, this gives us 0.1 µs {\times} 10000 = 1 ms (and we only get 16.6 ms per frame).

I’m talking about this:

@inbounds for i = x:x+w, j = y:y+h
    idx = i + (j - 1) * size + offset
    idx <= L && push!(data[idx], obj)
end

There is that check at the end that could get mispredicted because of the calculation before it.
Plus, in the next version, there will be more checks, so potentially more mispredictions.