GitHub - JuliaPerf/LinuxPerf.jl is your friend.
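If it helps, here is a minimal sketch of how I would use it (I believe the package exports a @pstats macro; the exact event-name string below is an assumption on my part, check the README):
using LinuxPerf   # Linux only; the kernel must allow reading performance counters

@pstats "cpu-cycles,instructions,branch-instructions,branch-misses" begin
    foreach(x -> sin(x)^2, 1:10^6)   # the code you want to profile
end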
Does it work on Windows? I'm not on Linux.
I don't think so. You can try via WSL2, but AFAIU perf under WSL2 is a bit buggy.
I don't know what the Windows people are using. Maybe take a look at CPU Performance Counters on Windows | Random ASCII – tech blog of Bruce Dawson (this is, generally speaking, a very good blog).
No idea whether any julia wrappers exist.
There are still many issues, like MagicBitboard not being exported, IWorld2D not being defined, … So I'll reiterate my request to just post minimal working code here, without internal dependencies. It's fine if this is a couple hundred lines of code.
You're assuming here that there is no timing variance / uncertainty. In
myalloc1(n) = rand(n)
myalloc2(n) = rand(n)
@time myalloc1(10); # 0.000005 seconds (2 allocations: 144 bytes)
@time myalloc2(10); # 0.000003 seconds (2 allocations: 144 bytes)
I hope you wouldn't take this as evidence that myalloc2 is superior. BenchmarkTools.jl is of course better in this regard, but can never be perfect. Also look at e.g. the standard deviation from @benchmark to make more informed judgements. (Don't get me wrong, it's possible that this 0.1 µs is significant, but currently there is not enough evidence.)
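For instance, something along these lines (a minimal sketch reusing myalloc1/myalloc2 from above; which estimator you compare is a matter of taste):
using BenchmarkTools, Statistics

b1 = @benchmark myalloc1(10)
b2 = @benchmark myalloc2(10)

std(b1.times), std(b2.times)    # spread of the raw sample times, in ns
judge(median(b2), median(b1))   # improvement, regression, or indistinguishable from noise?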
No, we will just decrement a counter until it reaches 0. In this regard there is no difference between while loops and for loops. To give you an example, consider
function f(n)
for i = 1:n
println(i)
end
end
n = rand(1:10) # or n = parse(Int, readline(stdin)), etc.
f(n)
What should the compiler make of f, when n is only known at run time?
idx will always be identical between both versions, if you use the correct order for i and j. So the branching behaviour is also identical.
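A small illustration of what the correct order buys you (sizes made up; LinearIndices stands in for the idx computation):
# Julia arrays are column-major: with the outer loop over i (columns) and the
# inner loop over j (rows), LI[j, i] advances by 1 on every inner iteration.
LI = LinearIndices((4, 4))
for i in 1:2, j in 1:3
    @show (i, j, LI[j, i])
end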
Huh, have you got the package directly from their repo? Everything is exported there.
I would be glad to give an example, but if you want to reproduce EXACTLY what I did, you can't without them.
It would be too much to copy, plus I would probably forget things.
0.1 µs is significant. I worked on an ECS; a difference between 1.3 µs and 1.2 µs is far from negligible, that's what makes the difference between the top-notch ECSs. If 100 ns were negligible, then type instability wouldn't be a problem, since it's more or less the overhead it has (at least on my computer).
Also, there is a difference between a single loop and a double for loop; look at this.
If we go back to your example:
function s1(x, y, w, h)
s = 0.
for i in x:x+w, j = y:y+h
s += sqrt(i+(j-1)*w) # Need to do something sufficiently non-trivial so that the compiler doesn't just turn this into a closed-form formula
end
return s
end
function create_indices(x, y, w, h, sz)
ids = Int[]
LI = LinearIndices((sz, sz))
for i in x:x+w, j = y:y+h # assume for simplicity that x + w <= sz && y + h <= sz
push!(ids, LI[j, i])
end
return ids
end
function s4(ids::Vector{Int64})
s = 0.
for i in eachindex(ids)
s += sqrt(ids[i])
end
return s
end
And try looking at the generated instructions:
julia> @code_lowered s1(5, 10, 20, 30)
CodeInfo(
1 ─ s = 0.0
│ %2 = x + w
│ %3 = x:%2
│ @_6 = Base.iterate(%3)
│ %5 = @_6 === nothing
│ %6 = Base.not_int(%5)
└── goto #7 if not %6
2 ┄ %8 = @_6
│ i@_9 = Core.getfield(%8, 1)
│ %10 = Core.getfield(%8, 2)
│ %11 = y + h
│ %12 = y:%11
│ @_8 = Base.iterate(%12)
│ %14 = @_8 === nothing
│ %15 = Base.not_int(%14)
└── goto #5 if not %15
3 ┄ %17 = i@_9
│ i@_11 = %17
│ %19 = @_8
│ j = Core.getfield(%19, 1)
│ %21 = Core.getfield(%19, 2)
│ %22 = s
│ %23 = i@_11
│ %24 = j - 1
│ %25 = %24 * w
│ %26 = %23 + %25
│ %27 = Main.sqrt(%26)
│ s = %22 + %27
│ @_8 = Base.iterate(%12, %21)
│ %30 = @_8 === nothing
│ %31 = Base.not_int(%30)
└── goto #5 if not %31
4 ─ goto #3
5 ┄ @_6 = Base.iterate(%3, %10)
│ %35 = @_6 === nothing
│ %36 = Base.not_int(%35)
└── goto #7 if not %36
6 ─ goto #2
7 ┄ return s
)
julia> @code_lowered s4(ids)
CodeInfo(
1 ─ s = 0.0
│ %2 = Main.eachindex(ids)
│ @_3 = Base.iterate(%2)
│ %4 = @_3 === nothing
│ %5 = Base.not_int(%4)
└── goto #4 if not %5
2 ┄ %7 = @_3
│ i = Core.getfield(%7, 1)
│ %9 = Core.getfield(%7, 2)
│ %10 = s
│ %11 = Base.getindex(ids, i)
│ %12 = Main.sqrt(%11)
│ s = %10 + %12
│ @_3 = Base.iterate(%2, %9)
│ %15 = @_3 === nothing
│ %16 = Base.not_int(%15)
└── goto #4 if not %16
3 ─ goto #2
4 ┄ return s
)
A single loop has fewer steps; iterating over a single array should logically be faster than a double for loop.
Thanks for this, you just showed me how to reduce misprediction. Since I'm using a bitboard, I can just sort the array beforehand.
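If it helps, a tiny sketch of that idea, reusing create_indices from above (sizes made up):
ids = create_indices(5, 10, 20, 30, 64)
sort!(ids)   # one-time cost; later passes over ids now walk memory in increasing order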
It's hard to compare these, since showing more calls does not mean that the code is slower. For example, Base.iterate for a UnitRange boils down to an addition (and a subtraction, apparently), while for the Vector{Int} you would get a memory load.
A key point to note here is that a UnitRange (like x:x+w) is not an Array (like collect(x:x+w) or ids).
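A small sketch of the difference (values chosen arbitrarily):
r = 5:10           # UnitRange: stored as just (start, stop); iteration is integer arithmetic
v = collect(r)     # Vector{Int}: the values live in heap memory; iteration loads them

Base.iterate(r)    # (5, 5): next value and iteration state, computed without touching memory
Base.iterate(v)    # (5, 2): value read from v[1], state is the next index
sizeof(r), sizeof(v)   # 16 vs 48 bytes of data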
But coming back to @code_lowered, perhaps more importantly, it shows unoptimised code. As an extreme example, compare these two implementations of the classical Fibonacci sequence:
function fib(n)
(n == 0 || n == 1) && return 1
return fib(n - 1) + fib(n - 2)
end
fibv(::Val{0}) = 1
fibv(::Val{1}) = 1
fibv(::Val{n}) where n = fibv(Val(n - 1)) + fibv(Val(n - 2))
fibv(n) = fibv(Val(n))
Then
julia> @code_lowered fib(100)
CodeInfo(
1 ─ %1 = Main.:(==)
│ %2 = (%1)(n, 0)
└── goto #3 if not %2
2 ─ goto #4
3 ─ %5 = Main.:(==)
│ %6 = (%5)(n, 1)
└── goto #6 if not %6
4 ┄ return 1
5 ─ goto #6
6 ┄ %10 = Main.:+
│ %11 = Main.fib
│ %12 = Main.:-
│ %13 = (%12)(n, 1)
│ %14 = (%11)(%13)
│ %15 = Main.fib
│ %16 = Main.:-
│ %17 = (%16)(n, 2)
│ %18 = (%15)(%17)
│ %19 = (%10)(%14, %18)
└── return %19
)
julia> @code_lowered fibv(Val(100)) # @code_lowered fibv(100) just shows the fibv(Val(100)) call
CodeInfo(
1 ─ %1 = Main.:+
│ %2 = Main.fibv
│ %3 = Main.Val
│ %4 = Main.:-
│ %5 = $(Expr(:static_parameter, 1))
│ %6 = (%4)(%5, 1)
│ %7 = (%3)(%6)
│ %8 = (%2)(%7)
│ %9 = Main.fibv
│ %10 = Main.Val
│ %11 = Main.:-
│ %12 = $(Expr(:static_parameter, 1))
│ %13 = (%11)(%12, 2)
│ %14 = (%10)(%13)
│ %15 = (%9)(%14)
│ %16 = (%1)(%8, %15)
└── return %16
)
So these are roughly equally "long". But
julia> @btime fib(100) # Takes forever
julia> @btime fibv(100)
1.200 ns (0 allocations: 0 bytes)
1298777728820984005
The reason is how the compiler optimised these methods:
julia> @code_native fib(100)
.text
.file "fib"
.globl julia_fib_4451 # -- Begin function julia_fib_4451
.p2align 4, 0x90
.type julia_fib_4451,@function
julia_fib_4451: # @julia_fib_4451
; Function Signature: fib(Int64)
; ┌ @ REPL[1]:1 within `fib`
.cfi_startproc
# %bb.0: # %top
; │ @ REPL[1] within `fib`
#DEBUG_VALUE: fib:n <- $rcx
push rbp
.cfi_def_cfa_offset 16
.cfi_offset rbp, -16
mov rbp, rsp
.cfi_def_cfa_register rbp
push rsi
push rdi
push rbx
sub rsp, 40
.cfi_offset rbx, -40
.cfi_offset rdi, -32
.cfi_offset rsi, -24
mov eax, 1
; │ @ REPL[1]:2 within `fib`
cmp rcx, 1
jbe .LBB0_2
# %bb.1: # %L7
mov rsi, rcx
; │ @ REPL[1]:3 within `fib`
; │┌ @ int.jl:86 within `-`
dec rcx
; │└
movabs rbx, offset julia_fib_4451
call rbx
mov rdi, rax
; │┌ @ int.jl:86 within `-`
add rsi, -2
; │└
mov rcx, rsi
call rbx
; │┌ @ int.jl:87 within `+`
add rax, rdi
.LBB0_2: # %common.ret
; │└
; │ @ REPL[1] within `fib`
add rsp, 40
pop rbx
pop rdi
pop rsi
pop rbp
ret
.Lfunc_end0:
.size julia_fib_4451, .Lfunc_end0-julia_fib_4451
.cfi_endproc
; └
# -- End function
.section ".note.GNU-stack","",@progbits
julia> @code_native fibv(Val(100))
.text
.file "fibv"
.globl julia_fibv_4858 # -- Begin function julia_fibv_4858
.p2align 4, 0x90
.type julia_fibv_4858,@function
julia_fibv_4858: # @julia_fibv_4858
; Function Signature: fibv(Base.Val{100})
; ┌ @ REPL[5]:1 within `fibv`
.cfi_startproc
# %bb.0: # %top
push rbp
.cfi_def_cfa_offset 16
.cfi_offset rbp, -16
mov rbp, rsp
.cfi_def_cfa_register rbp
movabs rax, 1298777728820984005
pop rbp
ret
.Lfunc_end0:
.size julia_fibv_4858, .Lfunc_end0-julia_fibv_4858
.cfi_endproc
; └
# -- End function
.section ".note.GNU-stack","",@progbits
So fibv(Val(100)) is essentially just return 1298777728820984005, because we forced compile-time memoisation. (As a side note, this result is wrong due to us working with Int64, but fib(100) should be the same.)
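To make that side note concrete, here is a small sketch (the iterative helper is mine, not part of the benchmark above) showing the wrap-around:
# Iterative Fibonacci with BigInt, same convention as above: fib(0) = fib(1) = 1.
function fib_big(n)
    a, b = big(1), big(1)
    for _ in 1:n
        a, b = b, a + b
    end
    return a
end

fib_big(100)           # 573147844013817084101, the true value
fib_big(100) % Int64   # 1298777728820984005, the wrapped value the Int64 versions return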
I'm not sure if this would fit into your framework, and you're likely already aware of this, but if not, sweep and prune is a popular broad-phase algorithm for collision detection. See for example
In fact my problem is not really about for loops, or collision detection.
I just want to know if bitboards can short-circuit operations that take forever otherwise.
For example, if finding the path to an object takes forever in a quadtree, can we use a bitboard to get it immediately?
So that's why I wanted to know.
There are probably people who master collision detection techniques better than me, so I want to know if the ability to directly get a value from a set of traits (like integer cell indices) could be a plus.
Ah, turns out I was still working with the Cruise.jl version. My bad. The code in the links you posted does run, and I can replicate your results that ProcessObjects, using for i in zones, is much faster than ProcessObjects_forfor, which is the version using a double for j = y:y+h, i = x:x+w and no offset (the reverse order seems equally fast):
julia> @btime ProcessObjects($grid, $particles)
37.400 μs (6 allocations: 16.15 KiB)
julia> @btime ProcessObjects_forfor($grid, $particles)
1.137 ms (7 allocations: 16.20 KiB)
But:
julia> ProcessObjects(grid, particles); grid.zones[5] # All grid zones are empty
Particle2D[]
julia> ProcessObjects_forfor(grid, particles); grid.zones[5]
6-element Vector{Particle2D}:
Particle2D(0.38222343f0, Float32[4.0, 1.0], Float32[-5.0, 5.0], Float32[5.0, 2.0], Float32[0.0, 0.0], 0.95f0)
Particle2D(0.31700322f0, Float32[2.0, 1.0], Float32[0.0, -3.0], Float32[5.0, 4.0], Float32[0.0, 0.0], 0.95f0)
Particle2D(0.007850104f0, Float32[1.0, 1.0], Float32[-4.0, 4.0], Float32[5.0, 1.0], Float32[0.0, 0.0], 0.95f0)
Particle2D(0.3172211f0, Float32[4.0, 1.0], Float32[-1.0, 1.0], Float32[5.0, 5.0], Float32[0.0, 0.0], 0.95f0)
Particle2D(0.040334694f0, Float32[4.0, 1.0], Float32[2.0, -1.0], Float32[3.0, 1.0], Float32[0.0, 0.0], 0.95f0)
Particle2D(0.11214117f0, Float32[2.0, 1.0], Float32[0.0, -2.0], Float32[0.0, 2.0], Float32[0.0, 0.0], 0.95f0)
So ProcessObjects is faster by virtue of not doing anything. The problem lies in MakeRoute:
for i in x_start:w
for j in y_start:h
push!(result, _compute_coordinate((i,j),(size,size)))
end
end
This should be 1:w and 1:h. If we correct this, then ProcessObjects_forfor turns out to be faster.
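For reference, the corrected loop reads (a sketch; _compute_coordinate, result, and size as in the original MakeRoute, which is not reproduced here):
for i in 1:w            # was x_start:w
    for j in 1:h        # was y_start:h
        push!(result, _compute_coordinate((i, j), (size, size)))
    end
end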
julia> @btime ProcessObjects($grid, $particles) # Using a new world (to recreate grid.bitboard) and particles
1.510 ms (6 allocations: 16.15 KiB)
julia> grid.zones[5]
3-element Vector{Particle2D}:
Particle2D(0.07529773f0, Float32[2.0, 1.0], Float32[0.0, 2.0], Float32[1.0, 3.0], Float32[0.0, 0.0], 0.95f0)
Particle2D(0.47455254f0, Float32[2.0, 1.0], Float32[-1.0, 2.0], Float32[4.0, 4.0], Float32[0.0, 0.0], 0.95f0)
Particle2D(0.94236004f0, Float32[3.0, 1.0], Float32[-4.0, -2.0], Float32[3.0, 4.0], Float32[0.0, 0.0], 0.95f0)
julia> @btime ProcessObjects_forfor($grid, $particles)
1.122 ms (7 allocations: 16.20 KiB)
julia> grid.zones[5]
3-element Vector{Particle2D}:
Particle2D(0.07529773f0, Float32[2.0, 1.0], Float32[0.0, 2.0], Float32[1.0, 3.0], Float32[0.0, 0.0], 0.95f0)
Particle2D(0.47455254f0, Float32[2.0, 1.0], Float32[-1.0, 2.0], Float32[4.0, 4.0], Float32[0.0, 0.0], 0.95f0)
Particle2D(0.94236004f0, Float32[3.0, 1.0], Float32[-4.0, -2.0], Float32[3.0, 4.0], Float32[0.0, 0.0], 0.95f0)
Also note that the timing comparison is still unfair, as we are ignoring the time to create the bitboard, which we do not use in the _forfor version. This time is significant:
julia> @btime MagicBitboard(UInt64((1 << (2BIT_WIDTH))-1), 0, 44, MakeRoute, Vector{UInt64}; PEXT=true);
1.829 ms (5715 allocations: 13.76 MiB)
though this is only a one-time cost, presumably.
So that was the real issue. Fortunately I posted it here; otherwise I would not have seen it. I should have tested the result.
I think the overhead comes mostly from getting the correct zone from the bitboard. It's not negligible.
It's a one-time cost, so it's okay. Unless the size of the grid changes frequently.
Using (potentially lazily computed) look-up tables (the eager variant of which is what I understand your bitboard to be) could certainly help in certain situations. Now, problems like collision detection and nearest-neighbour search are well studied, so it will be very hard to improve upon the state of the art.
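As a rough sketch of the lazy variant (all names here are made up, not from your package):
const route_cache = Dict{Int, Vector{Int}}()

fake_route(key) = collect(1:key)   # placeholder for the real, expensive route computation

# get! only runs the function when key has not been computed yet; afterwards the
# cached value is returned directly.
get_route(key) = get!(() -> fake_route(key), route_cache, key)

get_route(5)   # computed on first access
get_route(5)   # served from the cache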
But don't let that stop you from trying stuff. Often major breakthroughs (think e.g. calculus) seem easy in hindsight, and one might wonder why no one thought of it before.
Lazily computed…
Why didn't I think of that? It would greatly reduce the cost of building a bitboard.
Anyway, I was thinking that it might be better to use these bitboards on a tree, e.g. getting precomputed paths to avoid searching for the cells the object matches.
Anyway, I will give up on grids and try a tree to see how it turns out.
This time I won't forget to test before coming to a conclusion.