I’m Vectorizing the Shit Out of Checkers
It’s been a while, but since I last wrote, I managed to implement the AlphaGo Zero gameplay/learning algorithm. I updated my code to do most of the important things that AlphaGo Zero does.
My model now outputs policy probabilities and a value scalar for each board position. The value scalar, a number between 0 and 1, is the likelihood that a given board position will result in a win. The policy probabilities are a distribution over moves that attempts to select the move most likely to improve the position’s value and lead to a win. They are used practically in the game by controlling the next major thing I added: simulations.
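As a rough sketch of what those two outputs look like (assuming PyTorch; the hidden size, move count, and network body here are placeholders, not my actual architecture):

```python
import torch
import torch.nn as nn

# Minimal sketch of the two-headed output, assuming PyTorch; sizes are placeholders.
class PolicyValueHead(nn.Module):
    def __init__(self, hidden: int = 256, num_moves: int = 256):
        super().__init__()
        self.policy = nn.Linear(hidden, num_moves)  # one logit per candidate move
        self.value = nn.Linear(hidden, 1)           # scalar win-likelihood head

    def forward(self, features: torch.Tensor):
        policy = torch.softmax(self.policy(features), dim=-1)  # distribution over moves
        value = torch.sigmoid(self.value(features))            # squashed into (0, 1)
        return policy, value
```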
Up until now, a game of checkers that was seventy-five moves long only ever executed seventy-five moves. The model was trained entirely to look at the board state and predict which move would most likely lead to a win. In fact, that is what the policy network of AlphaGo Zero does as well. The difference is how it learns. In my prior model, all it would get was the board state to make a prediction. It would be reinforced positively if the move it picked led to a win and negatively if it led to a loss. So: one board state, one model prediction for the next move.
But AlphaGo Zero’s policy output is trained differently. Rather than one board state, one model prediction for the next move, each move spawns a game simulation, projecting many moves into the future. It basically traverses different paths from the root out to branches according to an algorithm that calculates the Upper Confidence Bound. This is the formula:
\(UCB = Q + c \cdot P \cdot \frac{\sqrt{\sum_b N_b}}{1 + N}\)
I won’t explain the whole thing here; that’s for a later post. But that formula drives the simulations. The two main mechanics are traversals and depth of traversal. A simulation looks ahead a certain number of moves, perhaps 10, perhaps 100, and it runs repeated traversals (guided by the formula) down to whatever depth is set. I haven’t seen figures on how deep AlphaGo Zero would go, but my understanding is that one move might involve thousands of traversals.
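To show how the formula steers a traversal in practice, here is a small sketch: given each child node’s mean value Q, prior probability P (from the policy head), and visit count N, pick the child with the highest UCB. The function name and the exploration constant c are my placeholders.

```python
import numpy as np

# Sketch of one selection step of a traversal: apply the UCB formula
# across all children and descend into the highest-scoring one.
def select_child(Q: np.ndarray, P: np.ndarray, N: np.ndarray, c: float = 1.5) -> int:
    ucb = Q + c * P * np.sqrt(N.sum()) / (1 + N)
    return int(np.argmax(ucb))
```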
As I built out how those simulations would work, I started to consider how many traversals per move and what depth I’d use in my own simulations. Here are some of the things I considered. Depending on how long the games run, my “game engine” (such as it is) could play about 15 games per second. Getting it there was a major accomplishment years ago: I switched from playing a single game at a time, and by extension doing inference for a single game state at a time, to playing hundreds of games in parallel and doing inference for those hundreds of games on the GPU in one fell swoop. That is what brought my gameplay from 3 games per second to its current 15.
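The batching idea itself is simple; here is a sketch assuming PyTorch, where `model` returns the (policy, value) pair described above and the names and tensor shapes are placeholders rather than my real code:

```python
import torch

# Instead of one forward pass per game, stack every live game's encoded board
# and run a single forward pass on the GPU for all of them at once.
def batched_inference(model: torch.nn.Module, boards: list[torch.Tensor]):
    batch = torch.stack(boards).to("cuda")   # (num_games, channels, 8, 8)
    with torch.no_grad():
        policy, value = model(batch)         # one GPU call for all games
    return policy.cpu(), value.cpu()
```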
This means that under standard conditions, running batches of 500 games, 500 games of checkers would take about 35 seconds. That served me well when the game engine only executed the game moves themselves. But I had decided to run 10 traversals to a depth of 10 for every game move. Leaving aside managing the recursive gameplay tree in Python, that kind of tree coverage means 100 simulation moves for every game move. My rate of gameplay went from 15 games per second to a game every 6 or 7 seconds. It sounds fast from a human perspective, but it means waiting almost an hour for 500 games. And in reality, with all the recursive logic, it was more than that.
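The back-of-the-envelope arithmetic, using the numbers above: \(10 \text{ traversals} \times 10 \text{ deep} = 100\) simulation moves per game move, so \(15 \div 100 = 0.15\) games per second, or one game every \(\approx 6.7\) seconds, and \(500 \times 6.7\,\mathrm{s} \approx 56\) minutes for a batch.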
Unfortunately, 500 games of data for every hour the engine runs is not much in terms of model training. It’s not nearly enough to be able to experiment at all rapidly with different models. I was getting concerned that I was at the end of the road.
I started exploring two optimization methods. First, I toyed with the idea of multi-core CPU parallelization of the gameplay engine. It’s written entirely in Python, which is an interpreted language and, therefore, slow, and there was a ton of branching logic to determine legal moves and move pieces on the board. It’s really the first Python code I wrote in the whole project, and I wasn’t thinking at all about optimization. But CPU parallelization seemed very challenging, possibly involving a ton of refactoring. Second, I toyed with undoing the very lazy approach I had taken of recursively reproducing the entire Game() class for every simulated move of every game. It was during this optimization that I finally got some hard data on how long various things were taking for each move of each game. It was unsurprising to find that the Python legal-move and piece-movement logic was well over 90% of the time; at this point my estimate is closer to 99.8%. It had almost nothing to do with the recursive child creation. It was almost exclusively the legal-move logic.
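The measurement itself doesn’t need to be fancy. Here is a rough sketch of the kind of instrumentation I mean, not my actual code: `game.legal_moves()`, `game.apply_move()`, and `game.over` are stand-ins for whatever the real Game() class exposes.

```python
import random
import time

def legal_move_fraction(game) -> float:
    """Play one game with random moves and return the fraction of wall-clock
    time spent inside legal-move generation."""
    legal_time = 0.0
    start = time.perf_counter()
    while not game.over:
        t0 = time.perf_counter()
        moves = game.legal_moves()
        legal_time += time.perf_counter() - t0
        game.apply_move(random.choice(moves))
    return legal_time / (time.perf_counter() - start)
```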
So for the first time, I started figuring out how to do something ChatGPT had recommended but that I wasn’t sure was possible: get rid of all the Python logic, vectorize the gameplay engine, and use the GPU for almost everything. That’s what I’m working on right now. I’m basically done with what I’ll call Phase 1: creating a legal-move vector across all games in a batch simultaneously. There’s one part I don’t want to implement until I have the movement of pieces vectorized. It’s small and probably not a big deal, but because it filters legal moves during a double jump, it depends on how I implement capturing the jump location. I want to choose my method for capturing the jump location knowing both how that location will be determined and how I plan to use it to mask legal moves. So that will have to wait.
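To give a flavor of what “a legal-move vector across all games” means, here is a minimal sketch for a single move type, a man sliding one square forward-left, assuming a NumPy board encoding that is my own illustration rather than the layout in my code: boards of shape (batch, 8, 8), 0 for empty, 1 for my man, with “forward” meaning increasing row index.

```python
import numpy as np

# Boards: shape (batch, 8, 8); 0 = empty, 1 = my man (other codes omitted).
def forward_left_slide_mask(boards: np.ndarray) -> np.ndarray:
    """Boolean mask of origin squares, shape (batch, 8, 8): True wherever one
    of my men can slide one square forward-left, computed for every game in
    the batch at once with no Python loop over games or squares."""
    mask = np.zeros(boards.shape, dtype=bool)
    origin_has_man = boards[:, :7, 1:] == 1   # rows 0-6, cols 1-7 can move fwd-left
    dest_is_empty = boards[:, 1:, :7] == 0    # destination square (r+1, c-1) is empty
    mask[:, :7, 1:] = origin_has_man & dest_is_empty
    return mask
```

The same pattern of shifted slices combined with boolean logic covers the other directions, kings, and jumps (which check two squares ahead instead of one).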
The other thing I did, because it was probably not the best idea to begin with and also doesn’t vectorize well, is drop having moves expressed as a piece number plus a move relative to that piece (left/right, forward/back, jump/non-jump). Instead, all moves are now expressed as a board location plus the same move types, applied to the board location rather than to a piece.
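For concreteness, here is a sketch of what a board-location-indexed move space can look like; the exact numbering is my illustration, not necessarily the scheme I’ll land on:

```python
# 32 playable squares x 8 move types (4 diagonal directions, slide or jump)
# gives a fixed 256-slot move space that every game in the batch shares.
NUM_SQUARES = 32
MOVE_TYPES = 8   # 0-3: slides in the four diagonals, 4-7: jumps in the same directions

def encode_move(square: int, move_type: int) -> int:
    return square * MOVE_TYPES + move_type

def decode_move(index: int) -> tuple[int, int]:
    return divmod(index, MOVE_TYPES)  # -> (square, move_type)
```

A fixed move space like this is also what would let the policy head output one 256-long vector per game that the legal-move vector can mask directly.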
It’s also a high priority right now to get my gameplay UI refactored for all this. When I created the original gameplay, I had a crude ASCII printout method for ensuring that the self-play was all kosher. I would much prefer a full-on UI like the one I used to play against the model. So I’m updating that UI to do two things: consume the new move structure, and support stepping through self-play (no model, just random moves) to ensure that the checkers logic I’m creating is rock solid. The last thing I want to be doing is training a model on thousands and thousands of possibly invalid games.
In the meantime, I have high hopes for the magnitude of improvement I’ll get from this vectorization. My hope is that with large enough batch sizes, the new gameplay engine will be able to play games with 100-move simulation rollouts faster than the old engine could play bare game moves. That would be amazing, and it would really accelerate my ability to experiment with models.