
Thinking Through Tic Tac Toe – A Neural Net Non-Starter

As I was wrapping up the last week of my course, I stumbled upon a thread in the discussion forums, started by another student in the class, proposing an NN to play tic tac toe that would be trained by playing against itself. The proposal wasn’t particularly well thought out, but it got me thinking, and I decided it would make an interesting first project. As I thought it through, however, it became apparent that an NN to play tic tac toe is overkill. To see the easiest tip-off as to why, let’s consider how we’d set this up.

We’d want a net that took each possible board combination as input. One way to do this would be 27 input nodes: three for each of the nine positions on the board, one for X, one for O, and one for blank. There are other ways to represent this, some of which might be better, but this one is straightforward enough to start. You’d want 9 output nodes representing the next move for any given input board combination. The task for the network would be to learn the best possible next move for any given board. As for the architecture of the hidden layers, I figured I’d experiment, perhaps starting with two hidden layers on a similar scale to a basic character classifier. So this net would be a classifier of sorts, and this was where I landed for a starting point.
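To make that concrete, here’s a minimal sketch of the 27-node encoding. The 9-character board string and the function name are my own choices for illustration, not a settled design:

```python
# A minimal sketch of the input encoding described above: each of the
# nine squares gets three input nodes (X, O, blank), 27 in total.
# The board is a 9-character string; all names here are hypothetical.

def encode_board(board):
    """Encode a board like 'X.O......' as a 27-element 0/1 vector."""
    encoding = {'X': (1, 0, 0), 'O': (0, 1, 0), '.': (0, 0, 1)}
    vector = []
    for square in board:
        vector.extend(encoding[square])
    return vector

print(encode_board("X.O......"))  # 27 values, three per square
```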

However, in retrospect, one of the big tip-offs that an NN might be overkill for this system is the following fact: the tic tac toe board has 3^9 (19,683) possible combinations. And there are significantly fewer combinations possible in an actual game, given that many, many combinations (for instance, a board filled with Xs) can never occur in real play. So the entire universe of inputs to the net is significantly smaller than 20,000. It’s not unrealistic to imagine taking that input data set and simply labeling each actually possible game state with the ideal next move. Anyone with basic experience playing tic tac toe could do this with relative ease. That’s not really in the spirit of this exercise, but it’s a good clue in retrospect.
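As a sanity check on just how small that universe is, a few lines of Python can enumerate all 3^9 boards and throw out those whose piece counts couldn’t occur in a real game (X moves first, so X’s count must equal O’s count or exceed it by one). The filter is deliberately rough, since it still keeps some boards where play continued past a win, but it makes the scale plain:

```python
from itertools import product

# Enumerate every assignment of X, O, or blank to the nine squares,
# then keep only boards whose piece counts a real game could produce.

boards = [''.join(squares) for squares in product('XO.', repeat=9)]
print(len(boards))  # 19683

def counts_plausible(board):
    """X moves first: X count equals O count, or exceeds it by one."""
    x, o = board.count('X'), board.count('O')
    return x == o or x == o + 1

plausible = [b for b in boards if counts_plausible(b)]
print(len(plausible))  # 6046, under a third of the full space
```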

There were two things that the net would need to learn. The first was the rules of the game. I suspect there are ways to “hard code” the rules into the net somehow, but I figured it would be more appropriate to have the net learn the rules itself. Here’s how that might work.

We have a system with 27 input nodes and 9 output nodes. It would learn to play well by playing against itself. At the start, everything would be effectively random, driven by the initial randomized thetas. After a game was won by one side and lost by the other, each move on the losing side would be backpropagated through the net with whatever cost function and gradients we landed on, and similarly each move on the winning side with the appropriate cost. The net would learn to play by gradually figuring out which moves lead to winning outcomes.
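As a rough illustration of that training signal, here’s a hedged sketch that converts one finished game into labeled examples; all names are hypothetical. Each example pairs a board (ready to run through the encoder sketched earlier) with the square that was played and a target of 1.0 for the winner’s moves or 0.0 for the loser’s. The actual gradient step is left to whatever cost function and net we land on:

```python
# A hypothetical layout for self-play training data. Each move in a
# finished game becomes one (board, square, target) example: target
# 1.0 if the side that made the move won, 0.0 if it lost.

def game_to_examples(moves, winner):
    """moves: list of (board_string, square_index, side) in play order.
    winner: 'X' or 'O'."""
    return [(board, square, 1.0 if side == winner else 0.0)
            for board, square, side in moves]

examples = game_to_examples(
    [(".........", 4, 'X'), ("....X....", 0, 'O')], winner='X')
print(examples)
```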

But at the start, the net wouldn’t even know the rules of the game. That is, it wouldn’t know that once a square had been taken, you couldn’t move there. A system to learn this during play would be relatively straightforward: it would be easy to algorithmically label all the illegal moves and use a move-by-move Illegal Move Cost function to stochastically update the system as a game was played. Here’s how it would work.

For every given board state, you’d label the illegal move slots with a 0 and the legal move slots with a 1. Every time the computer made a legal move, you’d stochastically backpropagate a simple cost function to reinforce that it was legal, record the move, update the board position, and continue. If the computer made an illegal move (i.e., an incorrect answer), you’d stochastically backpropagate the same basic cost function, reinforcing that it was illegal, but you wouldn’t update the board position. Instead, you’d have the computer try again. If legal, reinforce and move on; if illegal, reinforce and try again. It seems apparent that the bulk of the initial training would be spent learning legal moves, since the system would make lots of incorrect choices at first. I’m sure there are details here to be worked through, but that’s where I landed on that.
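Here’s a minimal sketch of that labeling and retry loop, with a random chooser standing in for the untrained net; the function names are mine. In the real system, the commented line is where each attempt, legal or illegal, would trigger a stochastic backprop step against the 0/1 labels:

```python
import random

def legal_move_labels(board):
    """Label each of the nine squares: 1 if legal to play, else 0."""
    return [1 if square == '.' else 0 for square in board]

def choose_legal_move(board, propose_move):
    """Keep asking propose_move until it returns a legal square."""
    labels = legal_move_labels(board)
    while True:
        square = propose_move(board)
        # In the real system: backpropagate labels[square] as the
        # target for this attempt, legal or not, before continuing.
        if labels[square] == 1:
            return square

print(choose_legal_move("X.O......", lambda b: random.randrange(9)))
```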

The real challenge seemed to be how to train the net to learn the best moves. After all, what would the cost function be? The idea is to negatively reinforce moves associated with losing games and positively reinforce moves associated with winning games. But here’s where this differs from a basic classifier. A move associated with a losing game is not the same as an illegal move (from above), nor is it the same as getting a wrong answer when classifying characters. A move that was part of a losing game could be a totally legitimate move to make.

The answer seemed to lie in probabilities. In a situation where, for instance, there were six legal next moves, you’d want a cost function that built in, essentially, the win/loss percentage for each possible move. You could do this by accumulating, after each game, for every board position/next move combination, the number of times it was part of a winning game (W) and the number of times it was part of a losing game (L). The value you’d build the cost function around would be W/(W+L). Perhaps with further analysis and study this could be refined (a sigmoid function, perhaps?). But as you’ll see, the point quickly becomes moot.
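Here’s what that bookkeeping might look like, with names of my own invention. Seeding every count at one win and one loss means an unseen move scores 0.5, which lines up with the starting point described next:

```python
from collections import defaultdict

# (board_string, square_index) -> [win count, loss count].
# Seeding at [1, 1] makes every unseen move score 1/(1+1) = 0.5.
tallies = defaultdict(lambda: [1, 1])

def record_game(moves, winner):
    """moves: list of (board_string, square_index, side); winner: 'X' or 'O'."""
    for board, square, side in moves:
        tallies[(board, square)][0 if side == winner else 1] += 1

def move_score(board, square):
    """W / (W + L): starts at 0.5, sharpens as games accumulate."""
    wins, losses = tallies[(board, square)]
    return wins / (wins + losses)
```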

At the start, before any games, everything could be set at 0.5 (that is, 1/(1+1)): with no data, we assume an equal likelihood of a winning game versus a losing game. The computer would be making effectively random guesses at this point, and after some number of games, next moves for board combinations would start to accumulate meaningful probability values. And here’s the thing: as those probabilities accumulate to a level of significance, you’ve suddenly lost the need for a neural net in the first place. Once you had enough samples for reasonable win/loss probabilities, you’d effectively have a completely labeled set of data, and in this case, a completely labeled universe of possible inputs. Furthermore, you’d be able to get to that point by having the system make completely random legal moves, game after game after game. Sure, you could have a neural net optimize toward the highest win-probability move for any given board position. But why bother, with a completely labeled universe of inputs?
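And that’s the punchline in code: with the tallies populated, choosing a move reduces to a lookup and an argmax over legal squares. This sketch reuses the hypothetical legal_move_labels and move_score helpers from above, and no net appears anywhere:

```python
# "Playing well" from a fully labeled universe of inputs: just pick the
# legal square with the highest accumulated W/(W+L). Assumes the
# legal_move_labels and move_score sketches defined earlier.

def best_move(board):
    legal = [i for i, ok in enumerate(legal_move_labels(board)) if ok]
    return max(legal, key=lambda square: move_score(board, square))
```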

Key Takeaways

In thinking through this, I deliberately didn’t research anything related to how to do this sort of thing. With machines like AlphaGo and AlphaGo Zero out there, clearly there’s nothing new to be learned here. For me, this was an opportunity to think through a problem with a new set of skills. Here’s what I got out of it.

With a scope of possible inputs that’s as small as it is, and completely finite, there’s just no point in applying a neural net to the problem. This is one area where having a simple algorithm label the data in place of a human is probably the best approach.

Also, it seems apparent to me that this approach is the wrong one for this kind of problem. The approach I was working through is essentially a classifier: for any given input, there is a correct answer, and the challenge is to find it. In a game-playing situation, especially one more complicated than tic tac toe, the challenge isn’t finding the one correct answer, but making the best possible decision among many, many possible “good” decisions. And for sure, this would involve algorithms significantly more nuanced than win percentages for a given move on a given board position.

Lastly, while I’m fairly confident of most of this, I’m not certain. Several times since I started working with neural nets, I’ve had intuitions that, upon experimentation and gathering of data, turned out not to be accurate. Ultimately, I’d have to build this and try it out to know for certain. I’m in the process of taking my knowledge of the math and the algorithms and learning to implement this stuff in Python. On the one hand, this little set of experiments might be a good trial for those implementation skills. But realistically, regardless of how it actually played out, this particular challenge just doesn’t feel right for a neural net. So I’m thinking I’ll just take an existing Octave/Matlab project and reimplement it; that should be enough to get familiar with doing this sort of thing in Python. Once I’ve got that, I’ll look for a real game project, one with enough variation and complexity to be worthwhile.

Maybe checkers…
