Help needed for improving the performance of a simple table-lookup-function (~hashing) in a chess-engine

Hi,

I’m in the process of migrating my basic chess-engine from java to julia.

As I’m still very new to julia, I decided to start evaluating some design choices early, to increase learning-opportunities. In particular, I need to find appropriate data-structures to perform the bread-and-butter tasks of my engine in the most efficient way that is naturally achievable in julia.

Note that performance is one of the two pillars of a chess-engine’s strength, the other being effective pruning of the search-space. No strong chess-engine can do without trying hard to optimize both of those aspects.

In particular, the engine implements traditional alpha-beta search (with modern refinements), that is: sequentially traversing a tree of chess-positions in an (assumed) best-first (depth-first) manner and, while doing so, successively narrowing down the expected best overall evaluation of the current position, according to an eval-algo, which is (only ever!) applied at the leaves. “Leaf”, here, means either a terminal node in the tree (won / lost / tied position, where no further moves are possible) OR a node at the maximum search-depth of the current search (which can vary, according to expectations and fluctuations of parameters within a particular subtree).

More specifically, traversing the tree depth-first and applying heuristics (= move-ordering) means that at every position (node) in the tree, (all) legal moves (edges) have to be generated, and (according to some heuristics) one best move is picked (and executed in a simulation) first, to get to the next-deeper position (node), and so on. When returning to a node, not all of the other sibling-nodes get evaluated; a significant share of them is usually (hopefully!) pruned away from the search-tree.
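The traversal described above can be sketched as plain negamax with alpha-beta cutoffs. This is a toy illustration only, not the engine’s actual code: the nested-vector tree representation and the function name are made up, leaves are plain Ints standing in for leaf evaluations, and move-ordering is omitted.

```julia
# Toy negamax with alpha-beta pruning over a nested-vector game tree.
# Leaves are Ints (evaluation from the side-to-move's perspective);
# inner nodes are Vectors of child nodes. Illustrative only.
function alphabeta(node, alpha::Int, beta::Int)
    node isa Int && return node            # eval is only ever applied at leaves
    best = typemin(Int) + 1
    for child in node                      # move-ordering would sort children here
        score = -alphabeta(child, -beta, -alpha)
        best  = max(best, score)
        alpha = max(alpha, score)
        alpha >= beta && break             # cutoff: prune remaining siblings
    end
    return best
end
```

Calling `alphabeta([[3, 12], [2, 4]], typemin(Int) + 1, typemax(Int))` evaluates the first subtree fully, then prunes the last leaf of the second subtree, since that subtree can no longer beat the first.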

As a general orientation, the purely technical part of move-generation (= finding and executing moves) has to be able to produce a throughput of at least some tens of millions of nodes per second on a regular mid-range laptop for any engine to even be halfway competitive (the actual throughput with evaluations and heuristics applied will then be a lot lower, ofc.).

In an attempt to migrate some (purely technical) part of move-generation (no heuristics, etc.) to julia first, I picked rook-movements, since rooks produce the most moves on average (I’m including queens here, as their movements are just the union of rook- and bishop-moves) and are therefore the most performance-critical aspect. Also, rooks + bishops are sliding-pieces, which makes their move-generation more involved, since movement is restricted by other pieces positioned on the rays of the (sliding) move-directions (as opposed to knights & kings).

Specifically, one important part of my move-generation for rooks (quite analogous for bishops + queens) can be understood just in terms of (2-stage) lookups of pre-computed values in a hashtable. The critical part of the code (which needs to be executed millions of times per second, with different values) boils down to the following (a few lines of code + lots of comments):

# =========================================================================================
# move-generator, implementing "fancy var-shift magics" for ROOKS
# reference: https://www.chessprogramming.org/Magic_Bitboards#Fancy
# sq:  0..63                                  : all possible positions a1..h8 of a rook
# occ: 0x0000000000000000..0xffffffffffffffff : all combos of occupied sq. by other pieces
# -----------------------------------------------------------------------------------------
# all UPPERCASE-vars: pre-initialized, global vars for quick lookup
# LOOKUP_OFFSET_IDX_HASHT_ROOK  :      64   x UInt64
# LOOKUP_MAGICS_ROOK            :      64   x UInt64
# LOOKUP_BITSHIFT_ROOK          :      64   x UInt8
# HASHT_SLIDE_ROOK              : 102.400   x UInt64 (min. perfect hasht. for rook-moves)
# =========================================================================================
function movegen_rook(sq::UInt8, occ::UInt64)
    # pre-lookups (stage #1)
    magic   = LOOKUP_MAGICS_ROOK[sq]                % UInt64
    offset  = LOOKUP_OFFSET_IDX_HASHT_ROOK[sq]      % UInt32
    shift   = SQUARES - LOOKUP_BITSHIFT_ROOK[sq]    % UInt8
    # magic hash-key
    key     = ( (magic * occ) >> shift )            % UInt16
    # final lookup (stage #2)
    return HASHT_SLIDE_ROOK[offset + key]           % UInt64
end
# =========================================================================================

Semantically, it can be understood in terms of 3 lookups + 1 multiplication + 1 shift:
magic : 1 lookup in a 64 x UInt64 table
shift : 1 lookup in a 64 x UInt8 table
key : a simple computation, involving 1x mult- + 1x shift-operation
returns : 1 lookup in a 102.400 x UInt64 table (~820 KBytes)

Technically, there is one additional lookup, of an offset. This is a variant I chose to make the final lookup into HASHT_SLIDE_ROOK 1-dimensional instead of 2-dimensional (which is what I had implemented in java, as that worked neatly with java’s ragged arrays, i.e. variable sizes in the 2nd dimension).
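As a sketch of that layout choice (with made-up sizes and hypothetical variable names): the per-square offsets can be precomputed with a cumulative sum, so that one flat vector emulates a ragged 2-d table.

```julia
# Toy illustration of the 1-d-with-offset layout:
# flat[offsets[sq] + key] plays the role of ragged[sq][key],
# emulating java's ragged arrays with a single flat vector.
sizes   = [4, 2, 3]                           # per-square sub-table sizes (toy values)
offsets = cumsum(vcat(0, sizes[1:end-1]))     # 0-based start of each square's block
flat    = collect(1:sum(sizes))               # stand-in for the precomputed table
ragged  = [flat[offsets[sq]+1 : offsets[sq]+sizes[sq]] for sq in eachindex(sizes)]
```

With 1-based julia indexing, `key` runs over `1:sizes[sq]`, and both layouts agree element-for-element.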

For testing with BenchmarkTools and obtaining results in a comparable manner, I wrapped the calls to movegen_rook in a wrapper-function, using the same (trivial 64-bit-xorshift) rng as in java to pre-compute the values which I fed to movegen_rook:

function benchm_pseudo_legal_moves(iterations)
    rn = 0xabcedf0123456789 % UInt64 # rng-seed
    dummy = 0 % UInt64
    for i = 1:iterations
        # just benchmark on random numbers, for now...
        rn = rng(rn)
        # do same as in java-code, here, for fairness-reasons, as java's jit-compiler
        # removes unused results & dead code, without returning something, using the results...
        dummy ⊻= movegen_rook( ((rn & 0x3f)+1) % UInt8, rn )
    end
    return dummy
end

…these are the results I’m getting…

…sadly, the throughput is “just” about 1M movegen_rook-calls / sec. (a bit less, even). This is more than 2 orders of magnitude slower than my implementation in java. :disappointed_relieved:

Now, I’m pretty sure I’m mishandling something.

Things I am uncertain about:

  • How do my global lookup-tables impact performance, and what would even be a sensible alternative implementation? By their very nature, they are simply precomputed, global values (constants!) that are not supposed to do anything other than be available in RAM for lookups (preferably in L1-cache, as much as possible :sweat_smile:).
    Note: I’ve read that globals are not a good choice in julia, but I still don’t quite understand why, or how I could avoid them in this case.

  • How do my chosen datatypes impact performance? I have tried statically typing (and narrowing down!) everything as much as possible, and have likely annotated types rather excessively. But I figured that, in case a wider datatype would be more efficient, the compiler should choose that for me? And if nothing else, more densely allocated RAM should result in fewer cache-misses, etc.

  • Speaking of memory-allocation: I’m not quite sure what it means, but @benchmark estimates an allocation of ~110 MByte (!!!) of RAM per 1M calls of movegen_rook. Since I put the call to that function in a wrapper where every result is just xor’d into the same single UInt64 (dummy), there should not be any allocations happening, apart from stack-pushes and -pops, I think? :thinking:

Does anyone have some hints, what I could do or try, to improve performance?


I just realized that I typed a whole essay here, and it’s likely still pretty cryptic to most who are not familiar with chess-engines.

This is an illustration of what the 5 lines in movegen_rook actually achieve:

sq: Index (0…63) of the square of a rook (white O in the pic).
occ: Masks the squares of potential blockers / targets of rook-moves (black Os in the 1st pic).
Return value of the function: mask(sq, occ) of all target-squares of rook-moves ([ ] in the 1st pic). The 2nd & 3rd pics just show subsets of that mask for non-capturing / capturing moves, respectively.

1 Like

Welcome to Julia!
This is probably a type-stability problem. Specifically, you are using global variables that, I’m guessing, aren’t marked const. I strongly suggest reading the performance tips. (If you have any specific questions: I’ve worked on Lc0 a bit, so I know a decent amount about chess programming.)

Non-constant global variables may have their type change at any point, which prevents many compiler optimizations. It’s OK to have globals, but they absolutely must be const (or, in newer julia versions, annotated with a type) for performance. See Performance Tips · The Julia Language

The fact that your benchmark allocates any memory at all is suspicious. Accessing non-const globals can allocate, so perhaps that’s the problem?
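To illustrate the point (with hypothetical names, not the thread’s actual tables): the two definitions below differ only in `const`, but only the second lets the compiler assume the table’s element type at every call site.

```julia
# Illustrative only: a non-const vs. a const global lookup table.
TABLE_SLOW = rand(UInt64, 64)         # non-const global: its type may change later,
                                      # so every access is dynamically dispatched
const TABLE_FAST = rand(UInt64, 64)   # const global: element type is fixed forever

lookup_slow(i) = TABLE_SLOW[i]   # @code_warntype flags the result as Any
lookup_fast(i) = TABLE_FAST[i]   # compiles down to a plain array load
```

Benchmarking the two should reproduce the allocation difference seen above: the slow version boxes its result on every call, the fast one allocates nothing.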

Can you include your constants and helper functions as well? It’s easier to help with performance if others can simply copy-paste all necessary code into their own REPLs to run your example.

Yes, ofc, I’ll try doing that! It’s not easy to give a digestible minimal working example, though.

Constants / globals…

using ShiftedArrays
using BenchmarkTools

const SQUARES   = 64
# =========================================================================================
# Hashtables & consts for move-gen., implementing "fancy var-shift magics"
# =========================================================================================
# total size:  102.400 elements @ 8 bytes, each ==> 819.200 bytes (~820 KBytes)
HASHT_SLIDE_ROOK_SIZE         = 4*4096 + 24*2048 + 36*1024 # 4 corner + 4*6 edge + 6*6 inner = SQUARES squares
HASHT_SLIDE_ROOK              = Array{UInt64}( undef, HASHT_SLIDE_ROOK_SIZE )
LOOKUP_MAGICS_ROOK            = Array{UInt64}( undef, SQUARES )
# Bitshifts: 12 bits for corner-squares, 11 bits for edge-squares and 10 bits for inner squares, like so...
#   12  11  11  11  11  11  11  12
#   11  10  10  10  10  10  10  11
#   11  10  10  10  10  10  10  11
#   11  10  10  10  10  10  10  11
#   11  10  10  10  10  10  10  11
#   11  10  10  10  10  10  10  11
#   11  10  10  10  10  10  10  11
#   12  11  11  11  11  11  11  12
const LOOKUP_BITSHIFT_ROOK          = Matrix( [10 10 10 10 10 10 10 10] .+ [1;0;0;0;0;0;0;1] .+ [1 0 0 0 0 0 0 1] )[:]
# Offset-Idx: 0 for [Square=1]   ...   HASHT_SLIDE_ROOK_SIZE - size[Square=64]
const LOOKUP_OFFSET_IDX_HASHT_ROOK  = ShiftedArray( cumsum(1 .<< LOOKUP_BITSHIFT_ROOK), 1, default=0 ) .+ 1
# =========================================================================================

For the (purely technical) benchmarks, I didn’t bother actually filling the hashtable with meaningful values yet, but just initialized everything with random values, like so (within a function that I call before benchmarking)…

    global HASHT_SLIDE_ROOK = rand(UInt64,HASHT_SLIDE_ROOK_SIZE)
    global LOOKUP_MAGICS_ROOK = rand(UInt64,SQUARES)

I’m generating the random numbers with this piece of code…

# simple 64bit-xorshift-rng
function rng(x::UInt64)
    x ⊻= (x << 21)
    x ⊻= (x >>> 35)
    x ⊻= (x << 4)
    return x % UInt64
end

…I’m aware that julia (like any other computing-language) likely has a much better rng-algo, but I want java + julia to incur the same overhead when comparing benchmark-results.
And with this rng it’s at least meaningful overhead, as my chess-engine is full of bitshifts on 64-bit ints, anyways. :sweat_smile:
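Since the generator is deterministic, a few outputs can be asserted on both sides (java + julia) to make sure the two implementations really consume identical input streams. A sketch, repeating the xorshift under a different name to stay self-contained; the expected value below was computed by hand for the toy seed 1:

```julia
# Same 64-bit xorshift as above, repeated here to be self-contained.
function xorshift64(x::UInt64)
    x ⊻= (x << 21)
    x ⊻= (x >>> 35)
    x ⊻= (x << 4)
    return x
end

# seed 1: bit {0} -> bits {0,21} -> (>>> 35 is a no-op here) -> bits {0,4,21,25}
@assert xorshift64(UInt64(1)) == 0x0000000002200011
```

The same assertion on the java side would then catch any divergence (e.g. from `>>` vs. `>>>`) before it skews the benchmark comparison.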

…I hope I didn’t forget anything that is relevant for calling the benchmark, like so…

@benchmark benchm_pseudo_legal_moves(10000)

…I have tried calling my benchmark-wrapper-fct with varying parameters (# of iterations), but it doesn’t seem to matter much (which I expected), i.e. calling it for 1M iterations gives roughly 1000x the results of calling it for 1K iterations, etc.

Changing to

const HASHT_SLIDE_ROOK_SIZE         = 4*4096 + 24*2048 + 36*1024 # 4 corner + 4*6 edge + 6*6 inner = SQUARES squares
const HASHT_SLIDE_ROOK              = Array{UInt64}( undef, HASHT_SLIDE_ROOK_SIZE )
const LOOKUP_MAGICS_ROOK            = Array{UInt64}( undef, SQUARES )

should fix it.

2 Likes

what is rng?

(since the function in julia is rand).

Concerning the global lookup tables: from the other post it seems that you have declared them constants, so that should be fine.

Type-annotating everything does not usually affect performance in Julia, unless you have a type-instability, in which case typing the return value of a function may help. But in that case, you’d better try to remove the instability itself.

edit: I see you have answered all that in the subsequent post.

Yep, allocations are gone and runtime for 1000 iterations drops two orders of magnitude:

julia> @btime benchm_pseudo_legal_moves(1000) # non-const globals
  414.100 μs (7459 allocations: 116.55 KiB)
0x9b1a006fa64ebcad
julia> @btime benchm_pseudo_legal_moves(1000) # const globals
  2.444 μs (0 allocations: 0 bytes)
0x0000000000000000
2 Likes

wow, thx so much! :heart::+1:

Initially, I thought it got worse (from 1.14 time-units per call earlier to 2.44 now) and re-started my vs-code, until I realized those are micro-seconds (now) vs. milli-seconds (earlier). :sweat_smile:

Those 3x consts gave roughly a 500x speedup!!! :rofl:

Also, memory-allocation literally dropped to zero, which is very reassuring.

Translating this to calls of movegen_rook per time-unit gives ~410M calls / sec., which is already almost on par with my java-implementation (a pretty consistent 625…630M calls / sec.). Is there something I can still do to push this further?

One of the things I’m really unsure about is how to design that last table(-lookup) into HASHT_SLIDE_ROOK (~820 KByte) in julia. A 2d-table [square, magic-index] would be the natural way to index into it from a semantic perspective, but I’m not sure how I could do that in julia without allocating a larger array than necessary, as…

table[ a1, .. ] …would have to contain 4096 UInt64-entries (corner-square), while…
table[ c5, .. ] …would only need 1024 UInt64-entries (inner-square).

…for example.

You can use a Vector{Vector{UInt64}} for this.
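For instance, a sketch of that ragged allocation, using the per-square bit-counts quoted earlier in the thread (12 for corners, 11 for edges, 10 for inner squares); the table name is hypothetical:

```julia
# One sub-vector per square, sized 2^bits for that square's magic index.
# No wasted space: each square gets exactly as many slots as it needs.
const ROOK_BITS     = [10 10 10 10 10 10 10 10] .+ [1;0;0;0;0;0;0;1] .+ [1 0 0 0 0 0 0 1]
const ROOK_TABLE_2D = [Vector{UInt64}(undef, 1 << b) for b in vec(ROOK_BITS)]

@assert length(ROOK_TABLE_2D)      == 64
@assert length(ROOK_TABLE_2D[1])   == 4096      # corner square: 12 bits
@assert sum(length, ROOK_TABLE_2D) == 102_400   # same total as the flat table
```

The inner vectors are heap-allocated independently, so the lookup costs one extra pointer indirection compared to the flat-with-offsets layout, but the total element count is identical.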

If you’re curious how things compile down, I recommend looking at @code_native (also @code_warntype which will catch type stability problems). Without looking at the java version, it’s not obvious to me where we’re losing time.

thx for this pointer - I’ll be looking into this, for sure!

If you’re curious how things compile down, I recommend looking at @code_native (also @code_warntype which will catch type stability problems). Without looking at the java version, it’s not obvious to me where we’re losing time.

Awesome, I’m definitely curious! Though I kind of lack familiarity with contemporary assembler-code to easily judge whether the compiled result is adequate or not. But I had already pushed that goal (looking at compile-results) onto my stack when I saw that julia has that tool. I just dug into it a bit again and saw something looking like array-bounds-checks. :yum: Will be exploring that for sure, too!

Oh, right. You probably want @inbounds if you know (but the compiler doesn’t) that your lookups aren’t going to be out of bounds. Be warned, however, that marking @inbounds incorrectly can result in very weird behavior (either crashes or silently wrong answers).

2 Likes

What about starting julia with

julia --check-bounds=no

at least for production? Should it be equivalent to adding @inbounds everywhere?
(That is something I never completely understood.)

1 Like

The difference is that @inbounds is local (per indexing expression) while --check-bounds is global and overrides @inbounds (i.e. --check-bounds=yes will check bounds even where @inbounds is present).
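A toy illustration of that scoping (the function is made up for this example):

```julia
# @inbounds applies only to the expression (or block) it annotates;
# starting julia with --check-bounds=yes re-enables the check even here,
# while --check-bounds=no removes checks everywhere.
function table_xor(v::Vector{UInt64}, idxs::Vector{Int})
    acc = zero(UInt64)
    for i in idxs
        @inbounds acc ⊻= v[i]   # unchecked access: caller must guarantee validity
    end
    return acc
end
```

Passing an out-of-range index to this function is undefined behavior under the default flags, which is exactly the trade-off warned about above.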

3 Likes

Hard to know how this will affect performance, but with 64 elements that may still be in the realm where StaticArrays.jl is useful. For instance, you could replace that one with:

using StaticArrays
const LOOKUP_MAGICS_ROOK = zeros(MVector{64,UInt64})

and similarly for other smallish-sized arrays.

ps: The idiomatic way to write Array{UInt64}(undef, SQUARES) is Vector{UInt64}(undef, SQUARES), which is an alias for Array{UInt64,1}. Without the 1, Array{UInt64} covers arrays of any dimensionality; you only get the concrete vector type as output here because the single-length constructor happens to disambiguate it.

It actually won’t be useful here. StaticArrays is only useful when you are doing math-like operations. For this use-case, the only thing you care about is getindex speed.

2 Likes

Edit: I couldn’t post for a while, as I had exceeded “the maximum number of replies a new user can create”, but tried further optimizations in the meantime… :sweat_smile:

Also, I realized that I had gotten confused when interpreting the benchmark-results earlier: I had looked at the SD instead of the mean of the results… :man_facepalming:

Initially, I thought, it got worse (from 1.14, earlier, to 2.44 time-units […]

But I tested a bit more and this is what I ended up, with, for now:

# typeof(HASHT_SLIDE_ROOK) = Vector{Vector{UInt64}} <== this changed
function movegen_rook(sq::UInt8, occ::UInt64)
    # pre-lookup of magic + shift in ONE step (stage #1)
    @inbounds magic = LOOKUP_MAGICS_ROOK[sq]                      % UInt64 
    shift = (magic >>> 56)                                        % UInt8
    # magic hash-key
    key = ((((magic & 0x00ffffffffffffff) * occ) >>> shift) + 1)  % UInt16
    # final lookup (stage #2)
    return @inbounds HASHT_SLIDE_ROOK[sq][key]                    % UInt64
end

Changes, wrt. to the start of this thread:

  1. Everything global is now also const, as suggested by @Oscar_Smith & @rdeits (huge win, performance-wise, roughly 2 orders of magnitude).

  2. Vector{Vector{}} (2d-lookup of ragged size) for the last lookup-table (HASHT_SLIDE_ROOK), @Oscar_Smith.
    (I’m not sure this improved anything performance-wise, but it didn’t hurt either, and it is much nicer coding-wise / less error-prone, as it saved computing my own offset into the 1d-array.)

  3. @inbounds got me another (I think) 10%…30% or so improvement.

  4. With the 2d-hashtable (2.), I already saved 1 lookup (3 instead of 4), but figured that I could save one more, as the highest 8 bits in LOOKUP_MAGICS_ROOK are never used (= 0) and the previously computed shift (an additional lookup per square) can be encoded in only 6 bits.
    It’s basically trading 1 lookup for 2 basic ops (>>>, &), which seems to yield another ~10%…20% improvement on my laptop.
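The packing from point 4 can be sketched like this (pack_magic is a hypothetical helper, and the magic value is made up; real magics would come from the usual search):

```julia
# Store the 6-bit shift in the (otherwise unused) top byte of the magic,
# then recover both parts with one >>> and one & at lookup-time.
pack_magic(magic56::UInt64, shift::UInt8) =
    (magic56 & 0x00ffffffffffffff) | (UInt64(shift) << 56)

m = pack_magic(0x0000123456789abc, UInt8(52))
@assert (m >>> 56) % UInt8 == 52                      # recovered shift
@assert m & 0x00ffffffffffffff == 0x0000123456789abc  # recovered magic
```

This works because valid rook-magics never need their top byte, and the shift (SQUARES minus 10…12 bits) always fits in 6 bits.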

…I’ll likely roll back 3. & 4. again, until I have some stable version of an engine running, as those changes potentially make debugging harder, but I learned a bunch about julia in the meantime.

Thanks, everyone, who contributed!

And for this…!

Welcome to Julia!

I’ve worked on Lc0 a bit so I know a decent amount about chess programming

…nice to know, I might hit you up some time, though I’ll likely never be exploring the Lc0-way (as much as I like and admire it), as I simply lack any meaningful computing-power that would be necessary to do the training. But I will likely include some NNUE in my engine at some point, and am interested in playing with other neural-network architectures that are feasible to run on a cpu. :+1:

In case anyone cares, this is where I came out, in the end, with the changes:

…translating to ~524M calls of movegen_rook per sec. on my laptop. This is very respectable, I think (esp. for the only ~2 days I have invested in this as a newcomer to julia), and the final bit (~20% or so) by which my java-engine is still ahead might vanish as I become more familiar with julia. Or it won’t even matter in the big picture, where this part of the code is just a fraction of what needs to be done for every chess-position.

The @code_native also looks (to me) like there’s not much left to be optimized…
(annotation-lines stripped from the output, for brevity)

.globl  julia_movegen_rook_3150         # -- Begin
.cfi_startproc
        pushq   %rbp
.cfi_def_cfa_offset 16
.cfi_offset %rbp, -16
        movq    %rsp, %rbp
.cfi_def_cfa_register %rbp
        subq    $32, %rsp
        movq    215647376, %rax
        movzbl  %cl, %ecx
        movq    -8(%rax,%rcx,8), %r8
        testq   %r8, %r8
        je      .LBB0_2
        movq    165832272, %rax
        movq    -8(%rax,%rcx,8), %rax
        movb    $56, %cl
        bzhiq   %rcx, %rax, %rcx
        shrq    $56, %rax
        imulq   %rdx, %rcx
        xorl    %edx, %edx
        cmpb    $64, %al
        shrxq   %rax, %rcx, %rcx
        cmovbq  %rcx, %rdx
        movq    (%r8), %rcx
        movq    (%rcx,%rdx,8), %rax
        addq    $32, %rsp
        popq    %rbp
        retq
.LBB0_2:
        movabsq $ijl_throw, %rax
        movl    $288560368, %ecx                # imm = 0x113314F0
        callq   *%rax
.Lfunc_end0:

The only thing I’m not sure about is that testq %r8, %r8 statement. I’m not sure what’s actually being tested (maybe function-parameters being within the domain of their type? I’d have to brush up on my assembler-knowledge), but it likely doesn’t matter much, especially as it seems to be related to throwing an exception, which simply doesn’t happen, so this should be cheap for current branch-predictors.

1 Like

Unless I missed it, I don’t think you shared code where you iterate over array elements. If you aren’t doing so already, note that the idiomatic way to iterate over the elements of an array x is with

for el in x
    # do something with element `el`...
end

or, if you instead need the indices

for idx in eachindex(x)
    # do something with `x[idx]`...
end

In these ways, you won’t need manual @inbounds at all, as the compiler knows that you’re iterating over valid indices/elements.