I’m in the process of migrating my basic chess engine from Java to Julia.
As I’m still very new to Julia, I decided to start evaluating some design choices early, to maximize learning opportunities. In particular, I need to find appropriate data structures to perform the bread-and-butter tasks of my engine as efficiently as is naturally achievable in Julia.
Note that performance is one of the pillars of a chess engine’s strength, the other being effective pruning of the search space. No strong chess engine can do without trying hard to optimize both of these aspects.
In particular, the engine implements traditional alpha-beta search (with modern refinements), that is: sequentially traversing a tree of chess positions depth-first, in an (assumed) best-first order, and, while doing so, successively narrowing down the expected best overall evaluation of the current position, according to an eval-algo which is (only ever!) applied at the leaves. “Leaf”, here, means either a terminal node of the tree (a won / lost / drawn position, where no further moves are possible) OR a node at the maximum search depth of the current search (which can vary, according to expectations and fluctuations of parameters within a particular subtree).
More specifically, traversing the tree depth-first and applying heuristics (= move-ordering) means that at every position (node) in the tree, (all) legal moves (edges) have to be generated and, according to some heuristic, one best move is picked (and executed in a simulation) first, to get to the next-deeper position (node), and so on. When returning to a node, not all of its remaining sibling nodes get evaluated; a significant number of them is usually (hopefully!) pruned away from the search tree.
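To illustrate the traversal and pruning described above (this is not my engine code — just a toy sketch where the game tree is modeled as nested vectors, with plain numbers standing in for leaf evaluations):

```julia
# minimal alpha-beta sketch: a leaf is a number, an inner node is a
# vector of children; real move generation + move ordering would
# replace the plain iteration over `node`
function alphabeta(node, α, β, maximizing)
    node isa Number && return float(node)       # eval only at the leaves
    if maximizing
        v = -Inf
        for child in node                       # move ordering would sort here
            v = max(v, alphabeta(child, α, β, false))
            α = max(α, v)
            α >= β && break                     # β-cutoff: prune remaining siblings
        end
        return v
    else
        v = Inf
        for child in node
            v = min(v, alphabeta(child, α, β, true))
            β = min(β, v)
            α >= β && break                     # α-cutoff
        end
        return v
    end
end

# tiny example: a max-root with two min-children
alphabeta([[3, 5], [6, 9]], -Inf, Inf, true)    # → 6.0
```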
As a general orientation, the purely technical part of move generation (= finding and executing moves) has to achieve a throughput of at least some double-digit millions of nodes per second on a regular mid-range laptop for an engine to be even halfway competitive (the actual throughput with evaluations and heuristics applied will then be a lot lower, of course).
As a first step, I migrated one (purely technical) part of move generation (no heuristics, etc.) to Julia: rook movements. Rooks produce the largest number of moves on average (I’m including queens here, as their movements are just the union of rook + bishop moves) and are therefore the most performance-critical aspect. Also, rooks + bishops are sliding pieces, which makes their move generation more involved, since movement is restricted by other pieces positioned on the rays of the (sliding) move directions (as opposed to knights & kings).
Specifically, one important part of my move generation for rooks (quite analogous for bishops + queens) can be understood just in terms of (2-stage) lookups of precomputed values in a hashtable. The critical part of the code (which needs to be executed millions of times per second, with different values) boils down to the following (4 lines of code + lots of comments):
```julia
# =========================================================================================
# move-generator, implementing "fancy var-shift magics" for ROOKS
# reference: https://www.chessprogramming.org/Magic_Bitboards#Fancy
# sq:  0..63 : all possible positions a1..h8 of a rook
# occ: 0x0000000000000000..0xffffffffffffffff : all combos of occupied sq. by other pieces
# -----------------------------------------------------------------------------------------
# all UPPERCASE vars: pre-initialized, global vars for quick lookup
# LOOKUP_MAGICS_ROOK           :      64 x UInt64
# LOOKUP_BITSHIFT_ROOK         :      64 x UInt8
# LOOKUP_OFFSET_IDX_HASHT_ROOK :      64 x UInt64
# HASHT_SLIDE_ROOK             : 102,400 x UInt64 (min. perfect hasht. for rook-moves)
# =========================================================================================
function movegen_rook(sq::UInt8, occ::UInt64)
    # pre-lookups (stage #1)
    magic  = LOOKUP_MAGICS_ROOK[sq] % UInt64
    offset = LOOKUP_OFFSET_IDX_HASHT_ROOK[sq] % UInt32
    shift  = SQUARES - LOOKUP_BITSHIFT_ROOK[sq] % UInt8
    # magic hash-key
    key = ((magic * occ) >> shift) % UInt16
    # final lookup (stage #2)
    return HASHT_SLIDE_ROOK[offset + key] % UInt64
end
# =========================================================================================
```
Semantically, it can be understood in terms of 3 lookups, 1 multiplication + 1 shift:

- `magic` : 1 lookup in a 64 × UInt64 table
- `shift` : 1 lookup in a 64 × UInt8 table
- `key` : simple computation, involving 1× mult + 1× shift
- return value : 1 lookup in a 102,400 × UInt64 table (820 KBytes)
Technically, there is one additional lookup of an `offset`, which is a variant I chose to make the final lookup into `HASHT_SLIDE_ROOK` 1-dimensional instead of 2-dimensional (the latter is what I had implemented in Java, as it worked neatly with Java’s ragged arrays, i.e. variable sizes in the 2nd dimension).
For testing with BenchmarkTools and obtaining results in a comparable manner, I wrapped the calls to `movegen_rook` in a wrapper function, using the same (trivial, 64-bit-xorshift) RNG as in Java to generate the input values:

```julia
function benchm_pseudo_legal_moves(iterations)
    rn = 0xabcedf0123456789 % UInt64  # rng-seed
    dummy = 0 % UInt64
    for i = 1:iterations
        # just benchmark on random numbers, for now...
        rn = rng(rn)
        # do the same as in the java-code, here, for fairness-reasons, as java's
        # jit-compiler removes unused results & dead code without something
        # returning / using the results...
        dummy ⊻= movegen_rook(((rn & 0x3f) + 1) % UInt8, rn)
    end
    return dummy
end
```
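For completeness, the `rng` above is the usual 64-bit xorshift; a minimal version (I’m assuming Marsaglia’s 13/7/17 shift constants here — the exact constants in my Java code may differ):

```julia
# minimal 64-bit xorshift step (Marsaglia's 13/7/17 variant); this is
# what I mean by "trivial 64-bit xorshift" -- shift constants are an
# assumption, not copied from my Java source
function rng(x::UInt64)
    x ⊻= x << 13
    x ⊻= x >> 7
    x ⊻= x << 17
    return x
end
```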
…these are the results I’m getting…
…sadly, the throughput is “just” about 1M `movegen_rook` calls/sec (a bit less, even). This is more than 2 orders of magnitude slower than my implementation in Java.
Now, I’m pretty sure I’m mishandling something.
Things I am uncertain about:
How do my global lookup-tables impact performance, and what would a sensible alternative implementation even look like? By their very nature they are simply precomputed, global values (constants!) that are not supposed to do anything other than be available in RAM for lookups (preferably in L1-cache, as much as possible).
Note: I’ve read that globals are not a good choice in Julia, but I still don’t quite understand why, or how I could avoid them in this case.
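From what I’ve gathered from the performance tips so far, the problem may be that non-`const` globals are untyped, so every access goes through dynamic dispatch — and the recommended fix seems to be declaring the tables `const`. A sketch of what I think that would look like (the table contents below are dummies, not my real precomputed values):

```julia
# declaring the lookup tables as `const` globals: their types are then
# fixed, so the compiler can specialize movegen_rook on them; the
# zeros(...) initializers are placeholders for the real precomputed data
const SQUARES = 64
const LOOKUP_MAGICS_ROOK           = zeros(UInt64, 64)
const LOOKUP_BITSHIFT_ROOK         = zeros(UInt8, 64)
const LOOKUP_OFFSET_IDX_HASHT_ROOK = zeros(UInt64, 64)
const HASHT_SLIDE_ROOK             = zeros(UInt64, 102_400)
```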
How do my chosen datatypes impact performance? I have statically typed (and narrowed down!) everything as much as possible; I have likely annotated types rather excessively. But I figured that, in case a wider datatype were more efficient, the compiler would choose it for me? And if nothing else, more densely allocated RAM should result in fewer cache misses, etc.
Speaking of memory allocation, I’m not quite sure what it means, but `@benchmark` estimates an allocation of ~110 MByte (!!!) of RAM per 1M calls of `movegen_rook`. Since I put the call of that function in a wrapper, where every result is just xor’d into the same single `dummy`, there should be no allocations happening, apart from stack pushes and pops, I think?
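To narrow down where those bytes come from, I suppose one could measure a single call with `@allocated` instead of the whole loop. A sketch with a hypothetical stand-in function (`f` below is not my real `movegen_rook`, just a pure bit-twiddling placeholder, which I’d expect to allocate nothing once compiled):

```julia
# measuring per-call allocations; `f` is a hypothetical stand-in for
# movegen_rook -- a type-stable, pure bit-twiddling function like this
# should report 0 bytes per call after warm-up
f(sq::UInt8, occ::UInt64) = ((0x9e3779b97f4a7c15 * occ) >> (64 - sq)) % UInt64

function measure_alloc()
    f(UInt8(12), UInt64(0xff))             # warm-up, so compilation isn't measured
    return @allocated f(UInt8(12), UInt64(0xff))
end
```

If the real `movegen_rook` reports non-zero bytes here, the allocation presumably comes from accessing the untyped globals, not from the arithmetic itself.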
Does anyone have some hints on what I could do or try to improve performance?