Hi,
I’m in the process of migrating my basic chess engine from Java to Julia.
As I’m still very new to Julia, I decided to start evaluating some design choices early, to increase the learning opportunities. In particular, I need to find appropriate data structures to perform the bread-and-butter tasks of my engine as efficiently as is naturally achievable in Julia.
Note that performance is one of the pillars of a chess engine’s strength, the other being effective pruning of the search space. No strong chess engine can do without trying hard to optimize both of these aspects.
In particular, the engine implements traditional alpha-beta search (with modern refinements), that is: sequentially traversing a tree of chess positions depth-first, in an (assumed) best-first order, and while doing so successively narrowing down the expected best overall evaluation of the current position, according to an evaluation algorithm, which is (only ever!) applied at the leaves. “Leaf”, here, means either a terminal node in the tree (a won / lost / drawn position, where no further moves are possible) OR a node at the maximum search depth of the current search (which can vary, according to expectations and fluctuations of parameters within a particular subtree).
More specifically, traversing the tree depth-first and applying heuristics (= move ordering) means that at every position (node) in the tree, all legal moves (edges) have to be generated, and (according to some heuristics) one best move is picked first and executed in a simulation, to get to the next-deeper position (node), and so on. When returning to a node, not all of its sibling nodes get evaluated; a significant number of them is usually (hopefully!) pruned away from the search tree.
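(For context, here is a minimal negamax-style sketch of this traversal; names like generate_moves and evaluate are placeholders for engine internals, not my actual code:)

function alphabeta(pos, depth::Int, alpha::Int, beta::Int)
    # the evaluation algorithm is only ever applied at the leaves
    if depth == 0 || is_terminal(pos)
        return evaluate(pos)
    end
    # generate_moves is assumed to yield moves best-first (move ordering)
    for move in generate_moves(pos)
        make_move!(pos, move)
        score = -alphabeta(pos, depth - 1, -beta, -alpha)
        unmake_move!(pos, move)
        alpha = max(alpha, score)
        alpha >= beta && break # beta cutoff: prune the remaining sibling nodes
    end
    return alpha
end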
As a general orientation, the purely technical part of move generation (= finding and executing moves) has to be able to produce a throughput of at least some tens of millions of nodes per second on a regular midrange laptop, for any engine to even be halfway competitive (the actual throughput with evaluations and heuristics applied will then be a lot lower, of course).
In an attempt to migrate some (purely technical) part of move generation (no heuristics, etc.) to Julia first, I picked rook movements, since rooks produce the largest number of moves on average (I’m including queens here, as their movements are just the union of rook and bishop moves) and are therefore the most performance-critical aspect. Also, rooks and bishops are sliding pieces, which makes their move generation more involved, since movement is restricted by other pieces positioned on the rays of the (sliding) move directions (as opposed to knights and kings).
Specifically, one important part of my move generation for rooks (quite analogous for bishops and queens, too) can be understood purely in terms of (2-stage) lookups of precomputed values in a hash table. The critical part of the code (which needs to be executed millions of times per second, with different values) boils down to the following (4 lines of code plus lots of comments):
# =========================================================================================
# move generator, implementing "fancy var-shift magics" for ROOKS
# reference: https://www.chessprogramming.org/Magic_Bitboards#Fancy
# sq:  1..64 : all possible positions a1..h8 of a rook (1-based, to match Julia's arrays)
# occ: 0x0000000000000000..0xffffffffffffffff : all combos of squares occupied by other pieces
#
# all UPPERCASE vars: preinitialized, global vars for quick lookup
# SQUARES                      : 64 (number of squares)
# LOOKUP_OFFSET_IDX_HASHT_ROOK : 64 x UInt64
# LOOKUP_MAGICS_ROOK           : 64 x UInt64
# LOOKUP_BITSHIFT_ROOK         : 64 x UInt8
# HASHT_SLIDE_ROOK             : 102,400 x UInt64 (min. perfect hash table for rook moves)
# =========================================================================================
function movegen_rook(sq::UInt8, occ::UInt64)
    # pre-lookups (stage #1)
    magic  = LOOKUP_MAGICS_ROOK[sq] % UInt64
    offset = LOOKUP_OFFSET_IDX_HASHT_ROOK[sq] % UInt32
    shift  = SQUARES - LOOKUP_BITSHIFT_ROOK[sq] % UInt8
    # magic hash key
    key = ((magic * occ) >> shift) % UInt16
    # final lookup (stage #2)
    return HASHT_SLIDE_ROOK[offset + key] % UInt64
end
# =========================================================================================
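(For illustration: movegen_rook(UInt8(1), occ) would return the precomputed attack bitboard for a rook on a1, given the occupancy occ.)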
Semantically, it can be understood in terms of 3 lookups, 1 multiplication + 1 shift:
- magic: 1 lookup in a 64 x UInt64 table
- shift: 1 lookup in a 64 x UInt8 table
- key: a simple computation, involving 1x multiply + 1x shift operation
- return value: 1 lookup in a 102,400 x UInt64 table (~820 KBytes)
Technically, there is one additional lookup, of an offset, which is a variant I chose to make the final lookup into HASHT_SLIDE_ROOK one-dimensional instead of two-dimensional (the latter is what I had implemented in Java, as it worked neatly with Java’s ragged arrays, i.e. variable sizes in the 2nd dimension):
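(A sketch of how I understand the flattening; the construction below is illustrative, not my actual init code:)

# each square sq owns a contiguous block of 2^bits entries in the flat table,
# and LOOKUP_OFFSET_IDX_HASHT_ROOK[sq] points at the start of that block
block_sizes = [1 << Int(LOOKUP_BITSHIFT_ROOK[sq]) for sq in 1:64]
offsets = accumulate(+, [1; block_sizes[1:end-1]]) # offsets[1] == 1 (1-based)
# the final lookup HASHT_SLIDE_ROOK[offset + key] then lands inside square sq's
# block; sum(block_sizes) == 102_400 for rooks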
For testing with BenchmarkTools and obtaining results in a comparable manner, I wrapped the calls to movegen_rook in a wrapper function, using the same (trivial, 64-bit xorshift) rng as in Java, to precompute the values fed to movegen_rook:
…
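(The rng itself isn’t shown above; it’s a plain xorshift64 step, sketched here with the textbook Marsaglia constants, which are incidental to the benchmark:)

function rng(x::UInt64)
    # one Marsaglia xorshift64 step (shift constants 13 / 7 / 17)
    x ⊻= x << 13
    x ⊻= x >> 7
    x ⊻= x << 17
    return x
end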
function benchm_pseudo_legal_moves(iterations)
    rn = 0xabcedf0123456789 % UInt64 # rng seed
    dummy = 0 % UInt64
    for i = 1:iterations
        # just benchmark on random numbers, for now...
        rn = rng(rn)
        # do the same as in the Java code here, for fairness reasons, as Java's JIT
        # compiler removes unused results & dead code when nothing consumes the results...
        dummy ⊻= movegen_rook(((rn & 0x3f) + 1) % UInt8, rn)
    end
    return dummy
end
…these are the results I’m getting…
…sadly, the throughput is “just” about 1M movegen_rook calls / sec. (a bit less, even). This is more than 2 orders of magnitude slower than my implementation in Java.
Now, I’m pretty sure I’m mishandling something. Things I am uncertain about:
- How do my global lookup tables impact performance, and, if they hurt, what would even be a sensible alternative implementation? By their very nature they are simply precomputed, global values (constants!), which are not supposed to do anything other than being available in RAM for lookups (preferably in L1 cache, as much as possible). Note: I’ve read about globals not being a good choice in Julia, but I still don’t quite understand why, or how I could avoid them in this case (see the sketch after this list for what I think the remedy looks like).
- How do my chosen datatypes impact performance? I have statically typed (and narrowed down!) everything as much as possible, and have likely annotated types rather excessively. But I figured that, in case a wider datatype were more efficient, the compiler should be choosing that for me? And if nothing else, more densely allocated RAM should result in fewer cache misses, etc.
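From what I’ve read so far, the usual remedy seems to be declaring the tables as const with concrete element types, so that the compiler can infer the type of every lookup; a sketch of what I think that would look like (assuming the tables are filled once at startup):

const SQUARES                      = 64
const LOOKUP_MAGICS_ROOK           = Vector{UInt64}(undef, 64)
const LOOKUP_OFFSET_IDX_HASHT_ROOK = Vector{UInt64}(undef, 64)
const LOOKUP_BITSHIFT_ROOK         = Vector{UInt8}(undef, 64)
const HASHT_SLIDE_ROOK             = Vector{UInt64}(undef, 102_400)
# the array contents can still be filled/mutated after the `const` binding;
# only the binding itself (name -> array, with its type) is constant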
Speaking of memory allocation: I’m not quite sure what it means, but @benchmark estimates an allocation of ~110 MByte (!!!) of RAM per 1M calls of movegen_rook. Since I put the call to that function in a wrapper, where every result is just being xor’d into and consumed by the same single UInt64 (dummy), there should not be any allocations happening, apart from stack pushes and pops, I think?
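(If I understand the docs correctly, this is how one would check whether those allocations come from type instability; with non-const globals, the lookups should show up as Any:)

using BenchmarkTools
using InteractiveUtils # for @code_warntype outside the REPL

# inspect inferred types; `Any` in the body points at the non-const globals
@code_warntype movegen_rook(UInt8(1), UInt64(0xff))
# time + allocation estimate for the wrapper
@btime benchm_pseudo_legal_moves(1_000_000)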
Does anyone have any hints on what I could do or try, to improve performance?