Hi,
I’m in the process of migrating my basic chess engine from Java to Julia.
As I’m still very new to Julia, I decided to evaluate some design choices early, to maximize the learning opportunities. In particular, I need to find appropriate data structures to perform the bread-and-butter tasks of my engine as efficiently as is naturally achievable in Julia.
Note that performance is one of the pillars of a chess engine’s strength, the other being effective pruning of the search space. No strong chess engine can do without working hard to optimize both of these aspects.
In particular, the engine implements traditional alpha-beta search (with modern refinements). That is: it sequentially traverses a tree of chess positions depth-first, in an (assumed) best-first order, and while doing so it successively narrows down the expected best overall evaluation of the current position, according to an eval-algorithm that is (only ever!) applied at the leaves. A “leaf” here means either a terminal node of the tree (a won / lost / drawn position, where no further moves are possible) OR a node at the maximum search depth of the current search (which can vary, according to expectations and fluctuations of parameters within a particular subtree).
More specifically, traversing the tree depth-first and applying heuristics (= move ordering) means that at every position (node) in the tree, all legal moves (edges) have to be generated and, according to some heuristic, one best move is picked first and executed in a simulation to get to the next-deeper position (node), and so on. When returning to a node, not all of its remaining sibling nodes are evaluated; a significant share of them is usually (hopefully!) pruned away from the search tree.
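For context, this is just the standard negamax formulation of alpha-beta; a minimal, purely illustrative sketch (with placeholder functions like is_terminal, evaluate, generate_moves, order_moves!, make_move! and unmake_move!, none of which are shown here) looks roughly like this:
# =========================================================================================
# illustrative sketch only (not the engine code): negamax alpha-beta with move-ordering
# =========================================================================================
function alphabeta(pos, depth, alpha, beta)
    if depth == 0 || is_terminal(pos)
        return evaluate(pos)                  # eval is only ever applied at the leaves
    end
    moves = generate_moves(pos)               # all legal moves (edges) at this node
    order_moves!(moves, pos)                  # heuristics: try the (assumed) best move first
    best = -Inf
    for m in moves
        make_move!(pos, m)                    # execute the move in the simulation
        score = -alphabeta(pos, depth - 1, -beta, -alpha)
        unmake_move!(pos, m)
        best  = max(best, score)
        alpha = max(alpha, score)
        alpha >= beta && break                # cutoff: the remaining siblings are pruned
    end
    return best
end
# =========================================================================================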
As a general orientation, the purely technical part of move generation (= finding and executing moves) has to sustain a throughput of at least some two-digit millions of nodes per second on a regular mid-range laptop for an engine to be even halfway competitive (the actual throughput, with evaluations and heuristics applied, will of course be a lot lower).
As a first (purely technical) part of move generation to migrate to Julia (no heuristics, etc.), I picked rook movements, since rooks produce the largest number of moves on average (I’m including queens here, as their movement is just the union of rook and bishop moves) and are therefore the most performance-critical piece. Also, rooks and bishops are sliding pieces, which makes their move generation more involved, since movement is restricted by other pieces positioned on the rays of the sliding move directions (as opposed to knights & kings).
Specifically, one important part of my move generation for rooks (quite analogous for bishops and queens) can be understood purely in terms of (2-stage) lookups of pre-computed values in a hash table. The critical part of the code (which needs to be executed millions of times per second, with different values) boils down to the following 4 lines of code plus lots of comments:
# =========================================================================================
# move-generator, implementing "fancy var-shift magics" for ROOKS
# reference: https://www.chessprogramming.org/Magic_Bitboards#Fancy
# sq: 1..64 : all possible positions a1..h8 of a rook (1-based index, see benchmark below)
# occ: 0x0000000000000000..0xffffffffffffffff : all combos of squares occupied by other pieces
# -----------------------------------------------------------------------------------------
# all UPPERCASE vars: pre-initialized, global vars for quick lookup
# LOOKUP_MAGICS_ROOK           : 64 x UInt64
# LOOKUP_OFFSET_IDX_HASHT_ROOK : 64 x UInt64
# LOOKUP_BITSHIFT_ROOK         : 64 x UInt8
# SQUARES                      : 64 (number of board squares)
# HASHT_SLIDE_ROOK             : 102,400 x UInt64 (min. perfect hash table for rook moves)
# =========================================================================================
function movegen_rook(sq::UInt8, occ::UInt64)
    # pre-lookups (stage #1)
    magic  = LOOKUP_MAGICS_ROOK[sq] % UInt64
    offset = LOOKUP_OFFSET_IDX_HASHT_ROOK[sq] % UInt32
    shift  = SQUARES - LOOKUP_BITSHIFT_ROOK[sq] % UInt8
    # magic hash-key
    key = ( (magic * occ) >> shift ) % UInt16
    # final lookup (stage #2)
    return HASHT_SLIDE_ROOK[offset + key] % UInt64
end
# =========================================================================================
Semantically, it can be understood in terms of 3 lookups, 1 multiplication and 1 shift:
- magic: 1 lookup in a 64 x UInt64 table
- shift: 1 lookup in a 64 x UInt8 table
- key: a simple computation, involving 1 multiplication and 1 shift operation
- return value: 1 lookup in a 102,400 x UInt64 table (820 KBytes)
Technically, there is one additional lookup of an offset, which is a variant I chose in order to make the final lookup into HASHT_SLIDE_ROOK 1-dimensional instead of 2-dimensional (which is what I had implemented in Java, where that worked neatly with Java’s ragged arrays, i.e. variable sizes in the 2nd dimension). A sketch of how such an offset table can be derived is shown right below.
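(For illustration only, and an assumption rather than my actual initialization code: the offsets are just the running sum of the per-square sub-table sizes, each sub-table holding 2^bits entries, where bits is the number of relevant occupancy bits stored in LOOKUP_BITSHIFT_ROOK.)
# =========================================================================================
# sketch (assumption, not the actual init-code): derive 1-dimensional offsets into
# HASHT_SLIDE_ROOK from the per-square sub-table sizes (replaces Java's ragged 2-d array)
# =========================================================================================
function build_offsets(bitshift_rook::Vector{UInt8})
    offsets = Vector{UInt64}(undef, 64)
    next = UInt64(1)                              # 1-based indexing into HASHT_SLIDE_ROOK
    for sq in 1:64
        offsets[sq] = next                        # start index of this square's sub-table
        next += UInt64(1) << bitshift_rook[sq]    # ...followed by 2^bits entries
    end
    return offsets                                # next - 1 ends up at 102,400 for rooks
end
# =========================================================================================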
For testing with BenchmarkTools and obtaining results in a comparable manner, I wrapped the calls to movegen_rook in a wrapper function, using the same trivial rng (a 64-bit xorshift) as in Java to pre-compute the values that get fed to movegen_rook:
…
function benchm_pseudo_legal_moves(iterations)
    rn = 0xabcedf0123456789 % UInt64    # rng-seed
    dummy = 0 % UInt64
    for i = 1:iterations
        # just benchmark on random numbers, for now...
        rn = rng(rn)
        # do the same as in the Java code here, for fairness: Java's JIT-compiler removes
        # unused results & dead code, so something has to consume the results...
        dummy ⊻= movegen_rook( ((rn & 0x3f)+1) % UInt8, rn )
    end
    return dummy
end
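(rng is not shown above; it is just a plain 64-bit xorshift, roughly like the following sketch - the exact shift constants in my engine may differ:)
# =========================================================================================
# sketch of the (assumed) xorshift64 rng used by the benchmark wrapper
# =========================================================================================
@inline function rng(x::UInt64)
    x ⊻= x << 13
    x ⊻= x >> 7
    x ⊻= x << 17
    return x
end
# =========================================================================================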
…these are the results I’m getting…
…sadly the throughput is “just” about 1M movegen_rook calls / sec (a bit less, even). This is more than 2 orders of magnitude slower than my implementation in Java.
Now, I’m pretty sure I’m mishandling something.
Things I am uncertain about:
- How do my global lookup tables impact performance, and if they do, what would even be a sensible alternative implementation? By their very nature they are simply precomputed global values (constants!), which are not supposed to do anything other than being available in RAM for lookups (preferably in L1 cache, as much as possible). Note: I’ve read that globals are not a good choice in Julia, but I still don’t quite understand why, nor how I could avoid them in this case (see the sketch after this list for what I have in mind).
- How do my chosen datatypes impact performance? I have tried statically typing (and narrowing down!) everything as much as possible, and have likely annotated types rather excessively. But I figured that, in case a wider datatype would be more efficient, the compiler should be choosing that for me? And if nothing else, more densely allocated RAM should result in fewer cache misses, etc.
- Speaking of memory allocation, I’m not quite sure what it means, but @benchmark estimates an allocation of ~110 MByte (!!!) of RAM per 1M calls of movegen_rook. Since I put the call to that function in a wrapper, where every result is just xor’d into the same single UInt64 (dummy), there should not be any allocations happening, apart from stack pushes and pops, I think?
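(For reference, the sketch mentioned in the first point: the tables are currently plain, non-const globals; would simply declaring them const, roughly like below, already be the idiomatic way to avoid the problem? The initialization shown is just an assumption to illustrate the declarations; the real values are filled in by init-code that is not shown.)
# =========================================================================================
# illustration / assumption: the lookup-tables declared as const globals
# =========================================================================================
const SQUARES                      = 64
const LOOKUP_MAGICS_ROOK           = zeros(UInt64, 64)
const LOOKUP_OFFSET_IDX_HASHT_ROOK = zeros(UInt64, 64)
const LOOKUP_BITSHIFT_ROOK         = zeros(UInt8, 64)
const HASHT_SLIDE_ROOK             = zeros(UInt64, 102_400)
# =========================================================================================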
Does anyone have some hints on what I could do or try to improve performance?