Help: policy iteration

I am struggling with the implementation of an (toy) optimization problem using dynamic programming. Although this is not a pure Julia question, I hope someone can help me with this.

Being new to dynamic programming / reinforcement learning I tried to implement Jack’s car rental problem (p. 81 in Sutton/Barto) using policy iteration for my own educational benefit.

The problem is formulated as follows:

Jack’s Car Rental Jack manages two locations for a nationwide car rental company. Each day, some number of customers arrive at each location to rent cars. If Jack has a car available, he rents it out and is credited $10 by the national company. If he is out of cars at that location, then the business is lost. Cars become available for renting the day after they are returned. To help ensure that cars are available where they are needed, Jack can move them between the two locations overnight, at a cost of $2 per car moved. We assume that the number of cars requested and returned at each location are Poisson random variables, meaning that the probability that the number is n is λ^n * e^(−λ) / n!, where λ is the expected number. Suppose λ is 3 and 4 for rental requests at the first and second locations and 3 and 2 for returns. To simplify the problem slightly, we assume that there can be no more than 20 cars at each location (any additional cars are returned to the nationwide company, and thus disappear from the problem) and a maximum of five cars can be moved from one location to the other in one night. We take the discount rate to be γ = 0.9 and formulate this as a continuing finite MDP, where the time steps are days, the state is the number of cars at each location at the end of the day, and the actions are the net numbers of cars moved between the two locations overnight.

Having read the relevant chapters of the wonderful (draft) book from Kochenderfer/Wheeler/Wray I tried to implement the problem using their proposed algorithms for policy iteration. (Note: the book is a wonderful example of how to express algorithms in code using Julia!)

My problem is as follows: I am able to solve the problem in Julia in general (using some very hacky for loops etc.). However, I am not able to solve the problem using the algorithm from Kochenderfer et al.

My main struggle is to correctly specify the probability for moving to state s´ when being in state s and taking action a (Note: this probability is called T(s, a, s`) in Kochenderfer et al and P(s´, a, s) in Sutton/Barto).

I know that my function T(s, a, s´) must be incorrect because if I apply it for all possible s´ it does not sum up to 1. However, maybe there are other issues with my code.

I would greatly appreciate any help / suggestion on my code.

Code is here: here

If this question is inappropriate for this forum, please let me know.

Thanks to anyone willing to help me!