Announcing AlphaZero.jl

I haven’t checked the code yet and it has been some time since I read the AlphaZero paper, so it’s possible that this question has a trivial answer. How easy is it to use it for games with imperfect information or more than two players?

@GunnarFarneback Right now, AlphaZero.jl only supports two-player, symmetric, zero-sum games with perfect information and discrete state spaces. However, my focus for the next release (hopefully in about a month) will be on adding support for more kinds of games. I would also like to demonstrate it on unusual problems such as automated theorem proving or safe robot planning.

Regarding using AlphaZero.jl on games of imperfect information, adapting the interface to allow it should not be too hard. The real question is: how well would AlphaZero perform for those games and how can it be improved? As far as I know, this is pretty much an open research question.

If you have an application in mind, I am happy to think about it with you.

Either the Julia language blog, the Julia Computing blog, or a self-hosted site.

@viralbshah may be able to help

I’ve been thinking about trying some kind of learning for the TransAmerica board game. Behind the train theme it’s all about placing edges on a 1-2 weighted graph so that your terminal nodes get connected before the opponents. The exact location of the opponents’ terminals is the missing information.

I’ve actually made a basic TransAmerica MCTS implementation in Python this fall: https://github.com/oscardssmith/Trans-America. My solution was to treat it as a 2-player perfect-information game and run MCTS on that. It still beat the obvious greedy strategies, so it kind of worked.

I just looked at the rules of the game so please tell me if I’m saying nonsense.

If you only have two players and you give the AI perfect information, then I would expect AlphaZero to perform very well, which is confirmed by @Oscar_Smith’s experience.

If you introduce imperfect information (we still only have two players), things get more complicated. In theory, I think you can still use AlphaZero and hope that the network somewhat learns to infer a set of likely values for the hidden information from observed board states and uses it when estimating values and action priors. This introduces a few additional subtleties. For example:

  • You don’t have access to concrete rewards at the leaves of the game tree anymore when running MCTS simulations and must also use the value network there.
  • You will probably have to include some history information in the game state as building good estimates of the hidden information may require looking at prior events.

I would be very curious to know how good such an agent can be and how it would compare to an agent that maintains beliefs about the hidden state explicitly.

Finally, if you introduce more than two players, then things start going beyond my pay-grade. :slight_smile: Game theory becomes very messy with more than two players and there are many problems appearing with the possibility of different players building coalitions. (If you’re a game theory expert and you’re reading this, please step in!)

@Oscar_Smith I really appreciated you sharing some of your valuable experience with Lc0. I was wondering if you also had some thoughts on a few algorithmic details that made me pause when writing this package.

Randomization during network evaluation

When evaluating a network against another, there needs to be randomization so that the same game is not played repeatedly. However, I found the AlphaZero paper to be surprisingly ambiguous on how this is achieved, especially since they claim to be using \tau \to 0 during evaluation. In my current connect four experiment, I achieve randomization by using a small but positive Dirichlet noise and move selection temperature. I also flip the board randomly about its central vertical axis at every turn with some fixed probability. Would you mind sharing what you are doing with Lc0?
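
For readers following along, here is a minimal Python sketch of the two randomization tricks mentioned above: mixing Dirichlet noise into the prior probabilities and mirroring the board. The function names, the mixing formula (1 - \epsilon) p + \epsilon \cdot noise, and the board representation (a list of rows) are assumptions made for illustration; this is not the AlphaZero.jl or Lc0 implementation.

```python
import random

def dirichlet(alpha, k, rng):
    """Sample a k-dimensional symmetric Dirichlet(alpha) vector
    by normalizing independent Gamma(alpha, 1) draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    if total == 0.0:
        # Extremely unlikely numerical corner case: fall back to uniform.
        return [1.0 / k] * k
    return [d / total for d in draws]

def noisy_priors(priors, epsilon, alpha, rng):
    """Mix prior probabilities with Dirichlet noise (assumed mixing rule)."""
    noise = dirichlet(alpha, len(priors), rng)
    return [(1 - epsilon) * p + epsilon * n for p, n in zip(priors, noise)]

def mirror_columns(board):
    """Flip a board (list of rows) about its central vertical axis."""
    return [list(reversed(row)) for row in board]

rng = random.Random(0)
mixed = noisy_priors([0.7, 0.2, 0.1], epsilon=0.25, alpha=1.0, rng=rng)
flipped = mirror_columns([[1, 0, 0], [2, 1, 0]])
```

Since both the priors and the noise sum to one, the mixed vector is still a probability distribution, and setting epsilon to 0 recovers the original priors.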

Should we keep updating “bad” networks?

After a checkpoint evaluation, the current network gets to replace the one that is used to generate data if it achieves a sufficiently high win rate against it. In case it performs worse, though, there are two natural things one could possibly do:

  1. Do not use the current network for generating data but keep it as the target of gradient updates.
  2. Throw away the current network and resume optimization on a previous version.

Combinations of those two options could also be imagined where there would be checkpoints of both kinds (e.g. of type 1 every 1K batches and of type 2 every 10K).

I could see advantages to both options. An obvious problem with (2) is that the process may get stuck, but I can imagine a combination of (1) and (2) resulting in faster training and in a reduced risk of over-fitting. From the AlphaZero paper, it seems that DeepMind is doing (1). I am wondering if you are doing the same with Lc0 and if you have run any experiments with (2).
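
To spell out the difference between the two options, here is a small Python sketch of the decision taken after a checkpoint evaluation. The function name `next_networks`, its arguments, and the use of a single win-rate threshold are hypothetical and only meant to illustrate the two behaviours, not to describe how AlphaZero.jl or Lc0 is structured.

```python
def next_networks(win_rate, threshold, current, best, revert_on_failure):
    """Return (data_generator, training_target) after a checkpoint evaluation.

    current: the network being optimized
    best:    the network currently used to generate self-play data
    """
    if win_rate >= threshold:
        # The current network is promoted and also keeps being trained.
        return current, current
    if revert_on_failure:
        # Option 2: throw the current network away and resume from `best`.
        return best, best
    # Option 1: keep generating data with `best`, keep training `current`.
    return best, current

option_1 = next_networks(0.40, 0.55, "current", "best", revert_on_failure=False)
option_2 = next_networks(0.40, 0.55, "current", "best", revert_on_failure=True)
promoted = next_networks(0.60, 0.55, "current", "best", revert_on_failure=False)
```

A mix of both options would amount to setting `revert_on_failure` to true only at some of the checkpoints.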

Distinguishing two different uses of the move selection temperature

When you think about it, the move selection temperature impacts two different things:

  1. Obviously, it directly impacts how much exploration happens during self-play.
  2. It also impacts the entropy of the data that is presented to the network, as the network is updated based on \pi_{s,a} \propto N_{s,a}^{1/\tau} and not based on p_{s, a} \propto N_{s, a} (if I read DeepMind’s paper correctly).

It is not obvious to me why the same parameter should be used in both cases, especially in an off-policy algorithm such as AlphaZero. For example, I understand why one may want to set \tau \to 0 after a fixed number of moves at each game (so as to get good value estimates once enough exploration is guaranteed). However, I am unsure why one would want to throw away the additional uncertainty information that is contained in p_{s, a} when creating training samples for the network.

Lc0 has a really active Discord, and I’m not the best person to answer some of these questions, so I encourage you to ask around a bit there: https://discord.gg/bZvDNk.

For your first question, the typical solution in the chess testing community is to use opening books so that all your games start from different positions, and then not use any noise (other than the fact that multi-threaded engines aren’t quite deterministic).

For the second, we actually diverge a bit from AlphaZero. We don’t do much testing to make sure that new nets are better than old ones. We do a bit of testing to make sure they aren’t more than about 100 Elo worse, but proving a new net better takes enough games that we’ve decided to trust that on average they usually will be. That said, once we started using 20x256 nets and bigger, two things became important for not regressing often: gradient clipping and not starting from a completely random network. We start by ramping up the LR from zero to the first LR, as this tends to produce nets that are less likely to randomly drop a couple hundred Elo.
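
As an illustration of the two stabilization tricks mentioned above, here is a minimal Python sketch of a linear learning-rate ramp from zero and of gradient clipping by global norm. The linear shape of the ramp, the function names, and all numeric values are assumptions for illustration; the actual Lc0 training schedule may differ.

```python
import math

def warmup_lr(step, warmup_steps, base_lr):
    """Ramp the learning rate linearly from 0 to base_lr over warmup_steps."""
    if warmup_steps <= 0 or step >= warmup_steps:
        return base_lr
    return base_lr * step / warmup_steps

def clip_by_global_norm(grads, max_norm):
    """Rescale a flat list of gradients so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]

lr_start = warmup_lr(0, warmup_steps=100, base_lr=0.1)
lr_mid = warmup_lr(50, warmup_steps=100, base_lr=0.1)
lr_end = warmup_lr(500, warmup_steps=100, base_lr=0.1)
clipped = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
```

Clipping preserves the direction of the update and only bounds its size, which limits the damage a single bad batch can do.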

I just had an interesting conversation on the Lc0 discord, which answered my question about distinguishing two different uses of the move selection temperature. For people following this thread, I am summarizing my conclusions here.

  • After running N MCTS iterations to plan a move, let’s write N_a for the number of times action a is explored. We have N = \sum_a N_a.
  • The resulting game policy is to play action a with probability \pi_a \propto (N_a/N)^{1/\tau}, where \tau is the move selection temperature. However, the policy target that should be used to update the neural network is (N_a/N)_a and not \pi.
  • I think the AlphaGo Zero paper is misleading here as it uses notation \pi to denote both the policy to follow during self-play and the target update, suggesting these should be the same.
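
To make the distinction in the list above concrete, here is a minimal Python sketch that computes both quantities from the same visit counts. The function names are made up for illustration, and treating \tau = 0 as a greedy choice with ties split evenly is an assumption.

```python
def policy_target(visit_counts):
    """Training target for the network: normalized visit counts N_a / N."""
    total = sum(visit_counts)
    return [n / total for n in visit_counts]

def play_policy(visit_counts, tau):
    """Move selection distribution, proportional to (N_a / N)^(1 / tau)."""
    if tau == 0:
        # Limit tau -> 0: play a most-visited action (ties split evenly).
        best = max(visit_counts)
        winners = sum(1 for n in visit_counts if n == best)
        return [1.0 / winners if n == best else 0.0 for n in visit_counts]
    total = sum(visit_counts)
    weights = [(n / total) ** (1.0 / tau) for n in visit_counts]
    norm = sum(weights)
    return [w / norm for w in weights]

counts = [60, 30, 10]
target = policy_target(counts)          # used to update the network
sharp = play_policy(counts, tau=0.5)    # used to pick the move in self-play
greedy = play_policy(counts, tau=0)
```

With \tau < 1 the play policy is more concentrated on the most visited action than the training target, and with \tau = 1 the two coincide.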

Also:

  • In Lc0, two temperature parameters are introduced. The first one is the move selection temperature, which corresponds to the \tau parameter described above. The second one (which does not appear in the AlphaGo Zero paper) is called the policy temperature and it is applied to the softmax output of the neural network to form the prior probabilities used by MCTS.
  • Typically, the policy temperature should be greater than 1 and the move selection temperature should be less than 1.
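
As a rough illustration of the policy temperature, here is a Python sketch that assumes it amounts to dividing the network logits by a temperature T before the softmax (equivalently, raising the output probabilities to the power 1/T and renormalizing). The exact formulation used in Lc0 may differ, and the function name is hypothetical.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Softmax of logits / temperature, computed in a numerically stable way."""
    m = max(logits)
    exps = [math.exp((x - m) / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]
plain = softmax_with_temperature(logits, 1.0)
flattened = softmax_with_temperature(logits, 2.0)
```

Under this assumption, a policy temperature greater than 1 flattens the priors, so MCTS spends more simulations on moves the network considers unlikely.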

I am going to update AlphaZero.jl accordingly. I expect it should result in a significant improvement of the connect four agent.

Update: version 0.2.0 released

As discussed in a previous post, this version introduces two distinct temperature parameters for MCTS: a move selection temperature and a prior temperature.

This resulted in a significant improvement of the connect four agent, which now achieves a win rate of 96% against the minmax baseline after 50 learning iterations (versus 80% after 80 iterations before). Don’t hesitate to try it. :slight_smile:

I can’t believe this didn’t make a much much bigger noise in the community. This looks sick

Thanks for your kind message!

The next few releases should bring many exciting features, including:

  • MDP support
  • Integration with POMDPs.jl, ReinforcementLearning.jl and OpenSpiel.jl
  • Support for batching inference requests across simulations
  • A simpler and more accessible codebase.

Congrats on releasing a cool package @jonathan-laurent

Update: version 0.3.0 released

This version introduces many exciting features:

  • A generalized and simplified game interface. Stochastic games and games with intermediate rewards are now supported. Also, this refactoring lays the groundwork for supporting OpenSpiel.jl and CommonRLInterface.jl in a future release.
  • A test suite to check that any given game implementation satisfies all expected invariants.
  • Many simplifications across the code base, in particular in the MCTS module.

Also, simplifying the codebase uncovered a significant MCTS bug. Without any tuning, the connect four agent can now score a 100% win rate against both baselines after a few hours of training. Don’t hesitate to try it out!
