I’m trying to understand the MCTS used in AlphaZero, and I’m a bit confused by the upper confidence bound formula, which I’ve seen in most places as

Q(s,a) + c * P(s,a) * \frac{\sqrt{\sum_b{ N(s,b)}}}{1 + N(s,a)}

and appears here in the AlphaZero.jl code https://github.com/jonathan-laurent/AlphaZero.jl/blob/373852801dbf81f5fef0ed7feba69e82aecf0d6d/src/mcts.jl#L179 .

As I understand it, the idea here is to discount the prior P(s,a) if action a has been taken relatively often, so the ratio of the (square root of) the number of times the parent node has been visited to the number of times the child node N(s,a) has been visited would be a small number.

I would expect that the first time we visit a node after expanding it, we would just be picking the action with the highest prior probability, since Q(s,a) is 0 for all a at this point. This would be the case if the scaling term was equal to 1.

However, if this is the first time we’re choosing an action at parent node s, then \sum_b{ N(s,b)} would equal 0, so the whole scaling term equals 0 for all a. In that case, this whole objective function equals 0 for all a, so the arg max is not defined.

I feel like there should also be a 1 in the numerator. Then 1 + \sum_i{ N(s,b)} is the number of times the parent node has been visited, which is one more than the sum of the number of visits to the child nodes, since you have to visit the parent node once before moving down the tree. This would make the formula do what I expect it to.

Am I missing something?