New AlphaZero replica

I posted a new AlphaZero replica on GitHub:
It's (very, very) messy but fast, and can easily be tweaked provided you already know AlphaZero.
It uses Flux as its only backend.
The objective is to train a good 9x9 Go bot, but currently only 4 games are implemented (gobang, four-in-a-row, ultimate tic-tac-toe, and 6x6 reversi).
Feel free to ask for details or use it.


Just wondering: have you looked at AlphaZero.jl? If so, what are the main differences?

From a quick look through the MCTS source code, it looks pretty good, although you have a couple of very short functions (e.g. addVisit) that I feel are simple enough not to be worth having as functions. Also, since your search is multi-threaded, what's your strategy for dealing with collisions?

I have looked at AlphaZero.jl. The first notable difference is that my code is uglier and less modular, but maybe simpler due to staying very close to a standard MCTS implementation. I also didn't bother building a fancy interface, which is very helpful for monitoring training, though I do keep track of the loss to prevent overfitting.
A more profound difference, if I recall correctly, is that AlphaZero.jl does not play games in parallel, but I believe it would be easy to add.
Regarding MCTS, it follows the standard implementation. The only additions are that it handles many trees at the same time and that (optionally) many workers descend the same tree at the same time. There is no lock: the workers descending the same tree run in sequential order. It is possible for several workers to end up in the same leaf, in which case you waste a little time evaluating it several times, but thanks to batching this is not a big deal. Virtual loss is used to prevent this from happening too often. The multithreading gain is also not very important, since inference time is much higher than everything else, so you could remove @threads and see almost no difference.
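The virtual-loss idea can be sketched as follows. This is an illustrative Python sketch, not the actual implementation; the node fields and the `puct_score` helper are my own assumptions. Each worker that selects a child temporarily adds a penalty there, so the next worker's PUCT score steers it toward a different branch before any real backup has happened.

```python
# Illustrative sketch of virtual loss in parallel MCTS selection
# (hypothetical code, not taken from the project discussed here).
import math

class Node:
    def __init__(self, prior):
        self.prior = prior          # P(s, a) from the network
        self.visits = 0             # real visit count N(s, a)
        self.value_sum = 0.0        # accumulated value W(s, a)
        self.virtual_loss = 0       # temporary penalty while a worker is below

def puct_score(parent_visits, child, c_puct=1.5):
    # Virtual loss counts as extra visits with the worst possible value,
    # so a branch currently being explored looks less attractive.
    n = child.visits + child.virtual_loss
    q = (child.value_sum - child.virtual_loss) / n if n > 0 else 0.0
    u = c_puct * child.prior * math.sqrt(parent_visits) / (1 + n)
    return q + u

def select_child(children):
    total = sum(c.visits + c.virtual_loss for c in children)
    best = max(children, key=lambda c: puct_score(total, c))
    best.virtual_loss += 1   # applied on the way down, reverted on backup
    return best
```

With two children of equal prior, two consecutive calls to `select_child` pick different children: the first selection leaves a virtual loss behind that repels the second worker, which is exactly the diversity effect described above.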
From my testing, with many workers in the same tree the MCTS is weaker. But when training, one of the major problems is that convergence is fast, so it sometimes takes very long to move away from a bad move (there are many discussions about this on the Leela forum). To prevent this I tried adding randomness at the start of games, and some kind of meta-MCTS kept during the whole training (it somewhat creates an opening book that you follow as in a standard MCTS, so all moves get tested, even ones believed to be very bad). That was disappointing, but it made me realize that having many workers descend the tree helps add diversity.

As for the small functions: I noticed that accessing a struct sometimes caused allocations even when I was only reading a value. This was very noticeable when I tried to fully optimize the vanilla MCTS, to see whether it was possible to reach C speed (I managed 1.5 to 2 times slower), so I kept them. As I'm mostly a noob in computing, I don't really know what matters and what doesn't; my mantra is to prevent as many allocations as possible, though when you use neural networks they are so much of a bottleneck that these things matter less.


As the developer of AlphaZero.jl, congratulations on the good work!

Here are a few comments:

On parallelism:

I think that simulating games in parallel during self-play is a great idea! I haven’t implemented it yet in AlphaZero.jl because I’m already reaching a pretty high GPU usage by batching MCTS requests within each simulation. This would be a huge win with multiple GPUs though.

More generally, inference requests must be batched together for AlphaZero to scale. There are basically two ways to batch requests:

  • Within a single run of MCTS, as implemented in AlphaZero.jl. This is the so-called asynchronous MCTS algorithm. A virtual loss is used to ensure that all workers are not exploring the same branches, but this creates an exploration bias and potentially reduces MCTS performance compared to a synchronous version.
  • Across different game simulations: here, MCTS can be executed synchronously on each thread, and requests are batched across simulations. I did not implement this solution because it is harder to make it as nice and modular as the alternative. I would be curious to explore it though.
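As a rough illustration of the second option, here is a hypothetical Python sketch (not AlphaZero.jl code; the `net` callable and the `pending` format are assumptions): each synchronous MCTS yields the leaf positions it needs evaluated, and a collector groups them into batched forward passes so the GPU sees large batches even though every individual search is sequential.

```python
# Sketch of cross-simulation batching (assumed design, for illustration only):
# many independent games each run a synchronous MCTS and submit leaf states;
# a collector evaluates them in batches with a single network call per batch.
def batched_inference(pending, net, batch_size=4):
    """pending: list of (game_id, state) pairs; returns {game_id: value}."""
    results = {}
    for i in range(0, len(pending), batch_size):
        batch = pending[i:i + batch_size]
        states = [s for (_, s) in batch]
        values = net(states)              # one forward pass for the whole batch
        for (gid, _), v in zip(batch, values):
            results[gid] = v
    return results
```

The design choice here is that each game's MCTS stays pure and synchronous; only the evaluation layer knows that requests from different games are being aggregated.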

My understanding is that you implemented a mix of those two. Am I correct?

On your experimental results:

On your README, you are reporting the following connect-four results:

On 4IARow: 6 hours produce a net that can beat the perfect player when starting, though it is far from perfect (tested on the hardest sets, the net alone makes 5 to 10% mistakes depending on the set)

What do you mean exactly by "the net can beat the perfect player when starting"? Do you mean that if your agent and the perfect solver both play their best moves deterministically, the resulting game ends in a win for your agent? Have you tried running the experiment with a nonzero temperature to get a more robust percentage figure? Also, this statement seems to conflict with the statement that your agent is "far from perfect". I would find it surprising that a "far from perfect" agent could win even a single game against a perfect solver.
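The nonzero-temperature evaluation suggested above can be sketched like this (a hypothetical Python helper, not part of either project): instead of always playing the argmax of the MCTS visit counts, the agent samples from the counts raised to the power 1/tau, so repeated games against a deterministic solver trace out different lines and yield a win percentage rather than a single deterministic game.

```python
# Hypothetical helper: move selection from MCTS visit counts with temperature.
# tau = 0 recovers deterministic argmax play; tau > 0 adds randomness.
import random

def sample_move(visit_counts, tau=1.0):
    if tau == 0:
        return max(range(len(visit_counts)), key=lambda i: visit_counts[i])
    weights = [n ** (1.0 / tau) for n in visit_counts]
    total = sum(weights)
    probs = [w / total for w in weights]
    return random.choices(range(len(visit_counts)), weights=probs)[0]
```

For example, `sample_move([10, 1, 1], tau=0)` always plays the most-visited move, while a small positive tau occasionally deviates, which is what makes the win rate against a fixed solver a meaningful statistic.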

Moreover, the mistake rates you are reporting do not really make sense to me. 5% seems pretty poor on the easy test sets (on which you should easily get close to 0%), whereas 10% for the network alone on the hardest sets looks extremely impressive (my full AlphaZero agent barely reaches 20%).

I am asking because if you indeed managed to train an AlphaZero agent that is competitive with a perfect player in 6 hours on a single GPU, then this is a major result. This would suggest that both I and the Oracle blog series I link to in my tutorial missed something big (which, to be clear, is totally possible) :slight_smile:.

For the first question, the answer is yes: my implementation is a mix of batching within a single run and across games.
For the second part: first, I tried to train 4IARow as you did, with 600 readouts and 32 workers. It somehow worked, and I think it is much faster than what you get, because 100 games are played in parallel in my implementation.
But I noticed a huge boost when using only 400 readouts with 4 workers per search. From some testing, it seems that performance degrades too much with that many workers.
Second, I'm not sure about this, but I think that both you and the Oracle series are doing too much:
- The nets you use are too big, inducing huge memory usage. I only used 64x3 or 64x4 and recently tried 64x7; that's more than enough if you want a good player.
- You don't need that many games per iteration: you go for 4k (still reasonable); Oracle, if I remember correctly, goes for 20k (compare to AGZ's 25k for 19x19 Go!).
That's where I got another boost in training speed. It is actually quite hard to find the sweet spot: too many games and it is too slow, too few and it cannot improve. But SAI managed to train a superhuman 9x9 Go bot playing only 2k games per iteration, and with a net structure far smaller than e.g. Oracle's (in their first try on 7x7, the net is only 64x3 and they report no improvement when going deeper).
But of course there is a trade-off: from my very limited testing, with a smaller net you progress faster, but generalization is worse.
That brings me to the second point: the mistake rates are only on the hardest sets (beginning hard and middle hard), and I used your implementation to run the test. I didn't bother trying the easy sets, so I don't know whether the mistake rate there is close to 0, though I believe it is.
As for beating the perfect player: I played a game against the solver, and when starting, the net+MCTS wins. So how can it still be far from perfect?
My explanation is the following: during training, the net becomes stronger and stronger along one or a few lines of play. If it can force the opponent to play along those lines, it is indeed very strong (the net alone can beat vanilla MCTS with 100k rollouts). But that doesn't mean it finds good moves far from those lines of play, which is exactly what testing on positions measures, as opposed to playing a game. Imagine you learn by heart a very strong strategy that can force a win: then you are strong, even if you don't know the rules of the game. But if I give you a position outside of it, you will probably blunder.
According to the mistake rates Oracle gives, their nets are way better than mine (they are bigger and trained on many more games); if you want that, I believe it will take time. But you can create a strong bot on a much smaller budget.
As a side note, I'm not a researcher, and I wrote to some people doing AlphaZero research in Python. I pointed them to your implementation, as it is way nicer than mine, but they stick to Python, even though runs I can replicate in 2 hours take, they say, 1 to 2 days in Python… I really don't get this.
For example, yesterday I started a run on morpion solitaire: it took me less time to code a single-player AGZ version, code the game, and make a first run than a single run takes in Python. (Best result so far is 65 with only 200 readouts, over one night; of course it may take 5 more days to reach their score of 67, but maybe not :).)

PS: I can send you the nets, or the parameters to replicate them, if you want to check.
PPS: On 6x6 reversi I didn't manage to win regularly against Tothello, which is almost perfect. There's always a blunder or two in the middle of the game.


This is very interesting and I would strongly encourage you to add a section in your README where you provide the exact series of commands needed to replicate your results. In particular, I would love to play with your connect-four agent a little bit. :slight_smile:

To summarize, these would be my takeaways from your results, if confirmed:

1. The performance of asynchronous MCTS degrades pretty fast with the number of workers.

This should not be a huge surprise in theory but I’m still surprised by the apparent magnitude of the effect. In particular, the Oracle series clearly downplayed this:

Technically, virtual loss adds some degree of exploration to game playouts, as it forces move selection down paths that MCTS may not naturally be inclined to visit, but we never measured any detrimental (or beneficial) effect due to its use.

By the way, I was wondering: given that your implementation can batch requests across simulations, why also batch requests within each simulation rather than just going full sequential MCTS? There should be enough games to simulate during each iteration to saturate the capacity of your GPU without looking for other batching opportunities.

2. Especially on limited computing resources, it is worth trading network size and iteration size for a larger number of shorter iterations.

This is actually something I wanted to try and the SAI reference you are linking to seems to provide further evidence of this.

Related question: how many generations of self-play data are you storing in your memory buffer? The Oracle series suggests 20, which always seemed a bit much to me.
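For concreteness, one common way to implement such a buffer is a bounded queue of per-generation sample lists, where pushing a new generation evicts the oldest once the limit is reached. This is a minimal Python sketch; the class and method names are made up for illustration, and the generation count is the knob being discussed here.

```python
# Minimal sketch of a replay buffer that keeps only the last K generations
# of self-play data (hypothetical design, not either project's actual code).
from collections import deque

class ReplayBuffer:
    def __init__(self, max_generations):
        # deque with maxlen silently drops the oldest generation when full
        self.generations = deque(maxlen=max_generations)

    def push_generation(self, samples):
        self.generations.append(list(samples))

    def all_samples(self):
        # flatten all retained generations into one training set
        return [s for gen in self.generations for s in gen]
```

A small `max_generations` makes training data track the latest policy closely (faster adaptation, more overfitting along current lines of play); a large one, like Oracle's 20, keeps stale data around for regularization at the cost of slower improvement.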

Some remaining questions:

Still, it is surprising and counter-intuitive to me that an AlphaZero agent with a significant mistake rate can still play a perfect game against an optimal adversary that never deviates from perfect play.

I am not fully buying these explanations. They would make a lot of sense in a supervised setting (where you can easily overfit the perfect strategy around the training data and still be pretty bad outside it), but I don't think they apply equally well in an RL setting.

  • Indeed, AlphaZero is learning from scratch. Therefore, in order to overfit against an optimal adversary, it would have to discover this adversary first, which can only be done through exploration. Smart exploration requires good generalization capability from the network.
  • Said differently: although I agree that during each iteration at least, AlphaZero tends to overfit along a few lines of play (which is another reason why it is important to use training data from different generations), it would be a great coincidence that one of the lines of play it overfits against is the one of a perfect adversary, especially after relatively little exploration.

I can imagine a scenario where your explanation holds though:

  • Suppose the perfect strategy against a perfect adversary (or at least a very good approximation of it) can be expressed very concisely by the chosen network architecture. (Said differently, the network architecture provides a really good inductive bias for this strategy.)
  • Then, with relatively little exploration, the network may converge to it pretty quickly. Having a small network may even help here by adding some form of regularization.
  • However, if the network is too small, it is likely that subsequent iterations will only overfit around this strategy, as the network does not have enough capacity to escape the local minimum it fell into.

One way to test this hypothesis would be to benchmark your agent against a randomized adversary so as to measure its robustness. If this is true, I think it tells us something interesting about the game of connect four itself.

All of this is pretty speculative but I think these are very interesting questions!
