I just had an interesting conversation on the Lc0 discord, which answered my question about distinguishing two different uses of the move selection temperature. For people following this thread, I am summarizing my conclusions here.
- After running N MCTS iterations to plan a move, let N_a denote the number of times action a is explored. We have N = \sum_a N_a.
- The resulting game policy is to play action a with probability \pi_a \propto (N_a/N)^{1/\tau}, with \tau the move selection temperature. However, the policy target that should be used to update the neural network is the raw visit-count distribution (N_a/N)_a and not \pi (see the sketch after this list).
- I think the AlphaGo Zero paper is misleading here, as it uses the notation \pi to denote both the policy to follow during self-play and the training target, suggesting these should be the same.
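To make the distinction concrete, here is a minimal NumPy sketch (the function names are mine, not actual AlphaZero.jl or Lc0 code): the move actually played is sampled from the temperature-adjusted distribution, while the training target is the plain visit-count distribution.

```python
import numpy as np

def move_selection_policy(visit_counts, tau):
    # Distribution used to pick the move during self-play:
    # pi_a proportional to (N_a / N)^(1 / tau).
    # tau < 1 sharpens the distribution; tau -> 0 approaches the argmax.
    freqs = np.asarray(visit_counts, dtype=np.float64)
    freqs /= freqs.sum()
    probs = freqs ** (1.0 / tau)
    return probs / probs.sum()

def policy_target(visit_counts):
    # Distribution used as the training target for the network:
    # just (N_a / N), with no temperature applied.
    counts = np.asarray(visit_counts, dtype=np.float64)
    return counts / counts.sum()

# Example with N = 50 MCTS iterations spread over 4 actions.
visits = [5, 40, 3, 2]
pi_play = move_selection_policy(visits, tau=0.3)  # sharpened, used to sample the move
target = policy_target(visits)                    # raw frequencies, used for training
action = np.random.choice(len(visits), p=pi_play)
```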
Also:
- In Lc0, two temperature parameters are introduced. The first one is the move selection temperature, which corresponds to the \tau parameter described above. The second one (which does not appear in the AlphaGo Zero paper) is called the policy temperature and it is applied to the softmax output of the neural network to form the prior probabilities used by MCTS.
- Typically, the policy temperature should be greater than 1 and the move selection temperature should be less than 1 (a sketch follows this list).
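And a matching sketch for the policy temperature (again, the function name is mine; I am assuming the common convention of dividing the logits by T, which is equivalent to raising the softmax probabilities to the power 1/T and renormalizing):

```python
import numpy as np

def mcts_priors(policy_logits, policy_temperature):
    # Prior probabilities fed to MCTS: softmax of the network's policy head
    # with the policy temperature applied. T > 1 flattens the prior,
    # which encourages broader exploration of the tree.
    logits = np.asarray(policy_logits, dtype=np.float64) / policy_temperature
    logits -= logits.max()              # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum()

# Example: raw policy-head logits for 4 moves.
logits = [2.0, 0.5, -1.0, 0.1]
print(mcts_priors(logits, policy_temperature=1.0))  # plain softmax
print(mcts_priors(logits, policy_temperature=2.0))  # flatter prior (T > 1)
```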
I am going to update AlphaZero.jl accordingly. I expect this to result in a significant improvement of the connect four agent.