AlphaZero MCTS formula question

aaowens · October 31, 2021, 12:05am

I’m trying to understand the MCTS used in AlphaZero, and I’m a bit confused by the upper confidence bound formula, which I’ve seen in most places as

Q(s,a) + c * P(s,a) * \frac{\sqrt{\sum_b{ N(s,b)}}}{1 + N(s,a)}

and appears here in the AlphaZero.jl code https://github.com/jonathan-laurent/AlphaZero.jl/blob/373852801dbf81f5fef0ed7feba69e82aecf0d6d/src/mcts.jl#L179 .

As I understand it, the idea here is to discount the prior P(s,a) if action a has been taken relatively often, so the ratio of the (square root of) the number of times the parent node has been visited to the number of times the child node N(s,a) has been visited would be a small number.

I would expect that the first time we visit a node after expanding it, we would just be picking the action with the highest prior probability, since Q(s,a) is 0 for all a at this point. This would be the case if the scaling term was equal to 1.

However, if this is the first time we’re choosing an action at parent node s, then \sum_b{ N(s,b)} would equal 0, so the whole scaling term equals 0 for all a. In that case, this whole objective function equals 0 for all a, so the arg max is not defined.

I feel like there should also be a 1 in the numerator. Then 1 + \sum_i{ N(s,b)} is the number of times the parent node has been visited, which is one more than the sum of the number of visits to the child nodes, since you have to visit the parent node once before moving down the tree. This would make the formula do what I expect it to.

Am I missing something?

aaowens · November 27, 2022, 10:12pm

I decided to go with my interpretation of the formula since it makes more sense to me. This exploration turned into a blog post about MCTS which I’ve put here https://aaowens.github.io/blog_mcts.html .

Topic		Replies	Views
New Alpha zero replica Machine Learning announcement	6	1314	June 19, 2020
Sumation inside a for loop General Usage	8	481	February 27, 2020
Rand does not return expected confidence interval Optimization (Mathematical)	2	65	November 25, 2024
Numerical Problems General Usage question	12	1045	April 13, 2020
NaNs during MCMC sampling? Statistics	2	244	March 15, 2024

AlphaZero MCTS formula question

Related topics