The Final Optimization
I did the final optimization of my setup using Policy Gradient Loss with Reward. Rather than rewarding all moves of a game equally, I implemented discounting: the move that produced the win receives the full reward of 1, the move that produced the loss receives -1, and each move further back from that final move receives a reward discounted by a factor of 0.95 per step.
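Concretely, the scheme looks something like this. This is a minimal sketch in Python, not my actual training code; the function and variable names are just for illustration.

```python
def discounted_rewards(num_moves, outcome, gamma=0.95):
    """Return one reward per move in a game: the final move gets the full
    outcome (+1 for a win, -1 for a loss), and each earlier move gets the
    outcome discounted by gamma for every step it sits before the end."""
    return [outcome * gamma ** (num_moves - 1 - i) for i in range(num_moves)]

# Example: a five-move win yields roughly [0.8145, 0.8574, 0.9025, 0.95, 1.0]
print(discounted_rewards(5, +1))
```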
As I predicted, the results were not night-and-day different. The one improvement was that there was almost no forgetting of how to play against random moves. However, there was some weirdness in model performance that, frankly, I don’t even want to get into here. I could speculate and hypothesize about it, but I’m done with this avenue. I’m convinced that I am approaching the limits of what can be done with this reinforcement approach. The sticking point is performance against an untrained model generating random moves. I have a pretty big model, and getting to a 90% win rate against random moves already takes quite a while. The speed at which it learns beyond that is painfully slow. I don’t believe learning should be that slow against an opponent making random moves. I may be wrong, but it’s time to move on.
So now I plan to begin what I suspect will be a daunting task: implementing in-game simulations using Monte Carlo Tree Search, with the network producing a value output as well as a policy output. Two things make this daunting: (1) some parts of calculating the Upper Confidence Bound are still not clear to me, and (2) I suspect implementing this in Python will be challenging.
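For my own reference while I work through it, here is my current understanding of the selection rule AlphaGo Zero uses during tree search, the PUCT variant of the Upper Confidence Bound. This is only a sketch; the node attributes (prior, visit_count, total_value) and the constant c_puct are placeholder names I’m using for illustration, not anything from my code.

```python
import math

def select_child(children, c_puct=1.5):
    """Pick the child (move) that maximizes Q + U.

    Q is the average value of the move from the simulations run so far.
    U is an exploration bonus that favors moves the policy network rates
    highly (high prior) but that have few visits; it shrinks as the move
    accumulates visits.
    """
    total_visits = sum(child.visit_count for child in children)
    best_child, best_score = None, -float("inf")
    for child in children:
        q = (child.total_value / child.visit_count) if child.visit_count else 0.0
        u = c_puct * child.prior * math.sqrt(total_visits) / (1 + child.visit_count)
        if q + u > best_score:
            best_child, best_score = child, q + u
    return best_child
```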
In preparation, I plan to write out a plain-English narrative of what needs to happen in Python. I suspect that will help me think through how to implement it. I suppose I could just ask ChatGPT to help me write this. ChatGPT has already helped quite a bit with coding. It was especially helpful in building the Checkers UI: it gave me the basic code to draw the board and the 24 pieces in their opening positions in PyGame. That was enough for me to understand how PyGame worked, and I was able to do the rest myself. That, however, was just a means to an end. I had (and have) no interest in coding UIs, but I needed one in order to test my AI work.
By contrast, the AlphaGo Zero algorithms are something I want to understand fully. And the best way to understand something is to work with it directly and think it through myself. I plan to do that. It will be a lot of work.