GitHub - JuliaPerf/LinuxPerf.jl is your friend.
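If it helps, here is a minimal sketch of how I would use it (I believe the package exports a @pstats macro; the exact event-name string below is an assumption on my part, check the README):
using LinuxPerf   # Linux only; the kernel must allow reading performance counters

@pstats "cpu-cycles,instructions,branch-instructions,branch-misses" begin
    foreach(x -> sin(x)^2, 1:10^6)   # the code you want to profile
end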
Does it work on Windows? I'm not on Linux.
I don't think so. You can try via WSL2, but AFAIU perf under WSL2 is a bit buggy.
I don't know what the Windows people are using. Maybe take a look at CPU Performance Counters on Windows | Random ASCII – tech blog of Bruce Dawson (this is, generally speaking, a very good blog).
No idea whether any julia wrappers exist.
There are still many issues, like MagicBitboard not being exported, IWorld2D not being defined, … So I'll reiterate my request to just post minimal working code here, without internal dependencies. It's fine if this is a couple hundred lines of code.
You're assuming here that there is no timing variance / uncertainty. In
myalloc1(n) = rand(n)
myalloc2(n) = rand(n)
@time myalloc1(10); # 0.000005 seconds (2 allocations: 144 bytes)
@time myalloc2(10); # 0.000003 seconds (2 allocations: 144 bytes)
I hope you wouldn't take this as evidence that myalloc2 is superior. BenchmarkTools.jl is of course better in this regard, but can never be perfect. Also look at e.g. the standard deviation from @benchmark to make more informed judgements. (Don't get me wrong, it's possible that this 0.1 µs is significant, but currently there is not enough evidence.)
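For instance, something along these lines (a minimal sketch reusing myalloc1/myalloc2 from above; which estimator you compare is a matter of taste):
using BenchmarkTools, Statistics

b1 = @benchmark myalloc1(10)
b2 = @benchmark myalloc2(10)

std(b1.times), std(b2.times)    # spread of the raw sample times, in ns
judge(median(b2), median(b1))   # improvement, regression, or indistinguishable from noise?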
No, we will just decrement a counter until it reaches 0. In this regard there is no difference between while loops and for loops. To give you an example, consider
function f(n)
for i = 1:n
println(i)
end
end
n = rand(1:10) # or n = parse(Int, readline(stdin)), etc.
f(n)
What should the compiler make of f, when n is only known at run time?
idx will always be identical between both versions, if you use the correct order for i and j. So the branching behaviour is also identical.
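A small illustration of what the correct order buys you (sizes made up; LinearIndices stands in for the idx computation):
# Julia arrays are column-major: with the outer loop over i (columns) and the
# inner loop over j (rows), LI[j, i] advances by 1 on every inner iteration.
LI = LinearIndices((4, 4))
for i in 1:2, j in 1:3
    @show (i, j, LI[j, i])
end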
Huh, have you got the package directly from their repo? Everything is exported there.
I would be glad to give an example, but if you want to reproduce EXACTLY what I did, you can't without them.
It would be too much to copy, plus I would probably forget things.
0.1 µs is significant. I worked on an ECS; a difference between 1.3 µs and 1.2 µs is far from negligible, that's what makes the difference between the top-notch ECSs. If 100 ns were negligible, then type instability wouldn't be a problem, since it's more or less the overhead it has (at least on my computer).
Also, there is a difference between a single loop and a double for loop; look at this.
If we go back to your example:
function s1(x, y, w, h)
s = 0.
for i in x:x+w, j = y:y+h
s += sqrt(i+(j-1)*w) # Need to do something sufficiently non-trivial so that the compiler doesn't just turn this into a closed-form formula
end
return s
end
function create_indices(x, y, w, h, sz)
ids = Int[]
LI = LinearIndices((sz, sz))
for i in x:x+w, j = y:y+h # assume for simplicity that x + w <= sz && y + h <= sz
push!(ids, LI[j, i])
end
return ids
end
function s4(ids::Vector{Int64})
s = 0.
for i in eachindex(ids)
s += sqrt(ids[i])
end
return s
end
And try looking at the generated instructions:
julia> @code_lowered s1(5, 10, 20, 30)
CodeInfo(
1 ─ s = 0.0
│ %2 = x + w
│ %3 = x:%2
│ @_6 = Base.iterate(%3)
│ %5 = @_6 === nothing
│ %6 = Base.not_int(%5)
└── goto #7 if not %6
2 ┄ %8 = @_6
│ i@_9 = Core.getfield(%8, 1)
│ %10 = Core.getfield(%8, 2)
│ %11 = y + h
│ %12 = y:%11
│ @_8 = Base.iterate(%12)
│ %14 = @_8 === nothing
│ %15 = Base.not_int(%14)
└── goto #5 if not %15
3 ┄ %17 = i@_9
│ i@_11 = %17
│ %19 = @_8
│ j = Core.getfield(%19, 1)
│ %21 = Core.getfield(%19, 2)
│ %22 = s
│ %23 = i@_11
│ %24 = j - 1
│ %25 = %24 * w
│ %26 = %23 + %25
│ %27 = Main.sqrt(%26)
│ s = %22 + %27
│ @_8 = Base.iterate(%12, %21)
│ %30 = @_8 === nothing
│ %31 = Base.not_int(%30)
└── goto #5 if not %31
4 ─ goto #3
5 ┄ @_6 = Base.iterate(%3, %10)
│ %35 = @_6 === nothing
│ %36 = Base.not_int(%35)
└── goto #7 if not %36
6 ─ goto #2
7 ┄ return s
)
julia> @code_lowered s4(ids)
CodeInfo(
1 ─ s = 0.0
│ %2 = Main.eachindex(ids)
│ @_3 = Base.iterate(%2)
│ %4 = @_3 === nothing
│ %5 = Base.not_int(%4)
└── goto #4 if not %5
2 ┄ %7 = @_3
│ i = Core.getfield(%7, 1)
│ %9 = Core.getfield(%7, 2)
│ %10 = s
│ %11 = Base.getindex(ids, i)
│ %12 = Main.sqrt(%11)
│ s = %10 + %12
│ @_3 = Base.iterate(%2, %9)
│ %15 = @_3 === nothing
│ %16 = Base.not_int(%15)
└── goto #4 if not %16
3 ─ goto #2
4 ┄ return s
)
A single loop has fewer steps; iterating over a single array should logically be faster than a double for loop.
Thanks for this, you just showed me how to reduce misprediction. Since I'm using a bitboard, I can just sort the array beforehand.
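If it helps, a tiny sketch of that idea, reusing create_indices from above (sizes made up):
ids = create_indices(5, 10, 20, 30, 64)
sort!(ids)   # one-time cost; later passes over ids now walk memory in increasing order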
It's hard to compare these, since showing more calls does not mean that the code is slower. For example, Base.iterate for a UnitRange boils down to an addition (and a subtraction, apparently), while for the Vector{Int} you would get a memory load.
A key point to note here is that a UnitRange (like x:x+w) is not an Array (like collect(x:x+w) or ids).
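A small sketch of the difference (values chosen arbitrarily):
r = 5:10           # UnitRange: stored as just (start, stop); iteration is integer arithmetic
v = collect(r)     # Vector{Int}: the values live in heap memory; iteration loads them

Base.iterate(r)    # (5, 5): next value and iteration state, computed without touching memory
Base.iterate(v)    # (5, 2): value read from v[1], state is the next index
sizeof(r), sizeof(v)   # 16 vs 48 bytes of data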
But coming back to @code_lowered, perhaps more importantly, it shows unoptimised code. As an extreme example, compare these two implementations of the classical Fibonacci sequence:
function fib(n)
(n == 0 || n == 1) && return 1
return fib(n - 1) + fib(n - 2)
end
fibv(::Val{0}) = 1
fibv(::Val{1}) = 1
fibv(::Val{n}) where n = fibv(Val(n - 1)) + fibv(Val(n - 2))
fibv(n) = fibv(Val(n))
Then
julia> @code_lowered fib(100)
CodeInfo(
1 ─ %1 = Main.:(==)
│ %2 = (%1)(n, 0)
└── goto #3 if not %2
2 ─ goto #4
3 ─ %5 = Main.:(==)
│ %6 = (%5)(n, 1)
└── goto #6 if not %6
4 ┄ return 1
5 ─ goto #6
6 ┄ %10 = Main.:+
│ %11 = Main.fib
│ %12 = Main.:-
│ %13 = (%12)(n, 1)
│ %14 = (%11)(%13)
│ %15 = Main.fib
│ %16 = Main.:-
│ %17 = (%16)(n, 2)
│ %18 = (%15)(%17)
│ %19 = (%10)(%14, %18)
└── return %19
)
julia> @code_lowered fibv(Val(100)) # @code_lowered fibv(100) just shows the fibv(Val(100)) call
CodeInfo(
1 ─ %1 = Main.:+
│ %2 = Main.fibv
│ %3 = Main.Val
│ %4 = Main.:-
│ %5 = $(Expr(:static_parameter, 1))
│ %6 = (%4)(%5, 1)
│ %7 = (%3)(%6)
│ %8 = (%2)(%7)
│ %9 = Main.fibv
│ %10 = Main.Val
│ %11 = Main.:-
│ %12 = $(Expr(:static_parameter, 1))
│ %13 = (%11)(%12, 2)
│ %14 = (%10)(%13)
│ %15 = (%9)(%14)
│ %16 = (%1)(%8, %15)
└── return %16
)
So these are roughly equally "long". But
julia> @btime fib(100) # Takes forever
julia> @btime fibv(100)
1.200 ns (0 allocations: 0 bytes)
1298777728820984005
The reason is how the compiler optimised these methods:
julia> @code_native fib(100)
.text
.file "fib"
.globl julia_fib_4451 # -- Begin function julia_fib_4451
.p2align 4, 0x90
.type julia_fib_4451,@function
julia_fib_4451: # @julia_fib_4451
; Function Signature: fib(Int64)
; ┌ @ REPL[1]:1 within `fib`
.cfi_startproc
# %bb.0: # %top
; │ @ REPL[1] within `fib`
#DEBUG_VALUE: fib:n <- $rcx
push rbp
.cfi_def_cfa_offset 16
.cfi_offset rbp, -16
mov rbp, rsp
.cfi_def_cfa_register rbp
push rsi
push rdi
push rbx
sub rsp, 40
.cfi_offset rbx, -40
.cfi_offset rdi, -32
.cfi_offset rsi, -24
mov eax, 1
; │ @ REPL[1]:2 within `fib`
cmp rcx, 1
jbe .LBB0_2
# %bb.1: # %L7
mov rsi, rcx
; │ @ REPL[1]:3 within `fib`
; │┌ @ int.jl:86 within `-`
dec rcx
; │└
movabs rbx, offset julia_fib_4451
call rbx
mov rdi, rax
; │┌ @ int.jl:86 within `-`
add rsi, -2
; │└
mov rcx, rsi
call rbx
; │┌ @ int.jl:87 within `+`
add rax, rdi
.LBB0_2: # %common.ret
; │└
; │ @ REPL[1] within `fib`
add rsp, 40
pop rbx
pop rdi
pop rsi
pop rbp
ret
.Lfunc_end0:
.size julia_fib_4451, .Lfunc_end0-julia_fib_4451
.cfi_endproc
; └
# -- End function
.section ".note.GNU-stack","",@progbits
julia> @code_native fibv(Val(100))
.text
.file "fibv"
.globl julia_fibv_4858 # -- Begin function julia_fibv_4858
.p2align 4, 0x90
.type julia_fibv_4858,@function
julia_fibv_4858: # @julia_fibv_4858
; Function Signature: fibv(Base.Val{100})
; ┌ @ REPL[5]:1 within `fibv`
.cfi_startproc
# %bb.0: # %top
push rbp
.cfi_def_cfa_offset 16
.cfi_offset rbp, -16
mov rbp, rsp
.cfi_def_cfa_register rbp
movabs rax, 1298777728820984005
pop rbp
ret
.Lfunc_end0:
.size julia_fibv_4858, .Lfunc_end0-julia_fibv_4858
.cfi_endproc
; └
# -- End function
.section ".note.GNU-stack","",@progbits
So fibv(Val(100)) is essentially just return 1298777728820984005, because we forced compile-time memoisation. (As a side note, this result is wrong due to us working with Int64, but fib(100) should be the same.)
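To make that side note concrete, here is a small sketch (the iterative helper is mine, not part of the benchmark above) showing the wrap-around:
# Iterative Fibonacci with BigInt, same convention as above: fib(0) = fib(1) = 1.
function fib_big(n)
    a, b = big(1), big(1)
    for _ in 1:n
        a, b = b, a + b
    end
    return a
end

fib_big(100)           # 573147844013817084101, the true value
fib_big(100) % Int64   # 1298777728820984005, the wrapped value the Int64 versions return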
I'm not sure if this would fit into your framework, and you're likely already aware of this, but if not, sweep and prune is a popular broad-phase algorithm for collision detection. See for example
In fact my problem is not really about for loops, or collision detection.
I just want to know if bitboards can short-circuit operations that take forever otherwise.
For example, if finding the path to an object takes forever in a quadtree, can we use a bitboard to get it immediately?
So that's why I wanted to know.
There are probably people who master collision detection techniques better than me, so I want to know if the ability to directly get a value from a set of traits (like integer cell indices) could be a plus.
Ah, turns out I was still working with the Cruise.jl version. My bad. The code in the links you posted does run, and I can replicate your results that ProcessObjects, using for i in zones, is much faster than ProcessObjects_forfor, which is the version using a double for j = y:y+h, i = x:x+w and no offset (the reverse order seems equally fast):
julia> @btime ProcessObjects($grid, $particles)
37.400 μs (6 allocations: 16.15 KiB)
julia> @btime ProcessObjects_forfor($grid, $particles)
1.137 ms (7 allocations: 16.20 KiB)
But:
julia> ProcessObjects(grid, particles); grid.zones[5] # All grid zones are empty
Particle2D[]
julia> ProcessObjects_forfor(grid, particles); grid.zones[5]
6-element Vector{Particle2D}:
Particle2D(0.38222343f0, Float32[4.0, 1.0], Float32[-5.0, 5.0], Float32[5.0, 2.0], Float32[0.0, 0.0], 0.95f0)
Particle2D(0.31700322f0, Float32[2.0, 1.0], Float32[0.0, -3.0], Float32[5.0, 4.0], Float32[0.0, 0.0], 0.95f0)
Particle2D(0.007850104f0, Float32[1.0, 1.0], Float32[-4.0, 4.0], Float32[5.0, 1.0], Float32[0.0, 0.0], 0.95f0)
Particle2D(0.3172211f0, Float32[4.0, 1.0], Float32[-1.0, 1.0], Float32[5.0, 5.0], Float32[0.0, 0.0], 0.95f0)
Particle2D(0.040334694f0, Float32[4.0, 1.0], Float32[2.0, -1.0], Float32[3.0, 1.0], Float32[0.0, 0.0], 0.95f0)
Particle2D(0.11214117f0, Float32[2.0, 1.0], Float32[0.0, -2.0], Float32[0.0, 2.0], Float32[0.0, 0.0], 0.95f0)
So ProcessObjects is faster by virtue of not doing anything. The problem lies in MakeRoute:
for i in x_start:w
for j in y_start:h
push!(result, _compute_coordinate((i,j),(size,size)))
end
end
This should be 1:w and 1:h. If we correct this, then ProcessObjects_forfor turns out to be faster.
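For reference, the corrected loop reads (a sketch; _compute_coordinate, result, and size as in the original MakeRoute, which is not reproduced here):
for i in 1:w            # was x_start:w
    for j in 1:h        # was y_start:h
        push!(result, _compute_coordinate((i, j), (size, size)))
    end
end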
julia> @btime ProcessObjects($grid, $particles) # Using a new world (to recreate grid.bitboard) and particles
1.510 ms (6 allocations: 16.15 KiB)
julia> grid.zones[5]
3-element Vector{Particle2D}:
Particle2D(0.07529773f0, Float32[2.0, 1.0], Float32[0.0, 2.0], Float32[1.0, 3.0], Float32[0.0, 0.0], 0.95f0)
Particle2D(0.47455254f0, Float32[2.0, 1.0], Float32[-1.0, 2.0], Float32[4.0, 4.0], Float32[0.0, 0.0], 0.95f0)
Particle2D(0.94236004f0, Float32[3.0, 1.0], Float32[-4.0, -2.0], Float32[3.0, 4.0], Float32[0.0, 0.0], 0.95f0)
julia> @btime ProcessObjects_forfor($grid, $particles)
1.122 ms (7 allocations: 16.20 KiB)
julia> grid.zones[5]
3-element Vector{Particle2D}:
Particle2D(0.07529773f0, Float32[2.0, 1.0], Float32[0.0, 2.0], Float32[1.0, 3.0], Float32[0.0, 0.0], 0.95f0)
Particle2D(0.47455254f0, Float32[2.0, 1.0], Float32[-1.0, 2.0], Float32[4.0, 4.0], Float32[0.0, 0.0], 0.95f0)
Particle2D(0.94236004f0, Float32[3.0, 1.0], Float32[-4.0, -2.0], Float32[3.0, 4.0], Float32[0.0, 0.0], 0.95f0)
Also note that the timing comparison is still unfair, as we are ignoring the time to create the bitboard, which we do not use in the _forfor version. This time is significant:
julia> @btime MagicBitboard(UInt64((1 << (2BIT_WIDTH))-1), 0, 44, MakeRoute, Vector{UInt64}; PEXT=true);
1.829 ms (5715 allocations: 13.76 MiB)
though this is only a one-time cost, presumably.
So that was the real issue. Fortunately I posted it here; otherwise I would not have seen it. I should have tested the result.
I think the overhead comes mostly from getting the correct zone from the bitboard. It's not negligible.
It's a one-time cost, so it's okay. Unless the size of the grid changes frequently.
Using (potentially lazily computed) look-up tables (the eager variant of which is what I understand your bitboard to be) could certainly help in certain situations. Now, problems like collision detection and nearest-neighbour search are well studied, so it will be very hard to improve upon the state of the art.
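As a rough sketch of the lazy variant (all names here are made up, not from your package):
const route_cache = Dict{Int, Vector{Int}}()

fake_route(key) = collect(1:key)   # placeholder for the real, expensive route computation

# get! only runs the function when key has not been computed yet; afterwards the
# cached value is returned directly.
get_route(key) = get!(() -> fake_route(key), route_cache, key)

get_route(5)   # computed on first access
get_route(5)   # served from the cache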
But don't let that stop you from trying stuff. Often major breakthroughs (think e.g. calculus) seem easy in hindsight, and one might wonder why no one thought of it before.
Lazily computed…
Why didn't I think of that? It would greatly reduce the cost of building a bitboard.
Anyway, I was thinking that it might be better to use these bitboards on a tree, e.g. getting precomputed paths to avoid searching for the cells the object matches.
Anyway, I will give up on grids and try a tree to see how it turns out.
This time I won't forget to test before coming to a conclusion.