Lots of Action

I really can’t believe how much I’ve managed to do since the last post. Back then I was still talking about dropout, projection layers, and forgetting. My goodness, has a lot happened.

The Old Convolutional Model

My previous model fed the (4, 8, 4) board into two parallel convolutional layers, one 3×3 and one 5×5, each outputting (16, 8, 4). Each output passed through layer norm and then ReLU. I concatenated the two into a single (32, 8, 4) tensor, flattened it, and, along with a 409-element vector encoding the state of each numbered piece, fed it into the feed-forward network, which is basically a slightly larger variant of my original fully-connected network.
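
For reference, the feature-extraction part of that model looks roughly like this in PyTorch (a simplified sketch with illustrative names, not the exact code; the feed-forward head is left out):

import torch
import torch.nn as nn
import torch.nn.functional as F

# Two parallel branches over the (4, 8, 4) board tensor, each conv -> layer norm -> ReLU,
# then concatenated and joined with the piece-state vector for the feed-forward network.
conv3 = nn.Conv2d(4, 16, kernel_size=3, padding=1)   # (N, 4, 8, 4) -> (N, 16, 8, 4)
conv5 = nn.Conv2d(4, 16, kernel_size=5, padding=2)   # (N, 4, 8, 4) -> (N, 16, 8, 4)
norm3 = nn.LayerNorm([16, 8, 4])
norm5 = nn.LayerNorm([16, 8, 4])

def features(board, pieces):
    # board: (N, 4, 8, 4) float tensor, pieces: (N, 409) piece-state vector
    a = F.relu(norm3(conv3(board)))
    b = F.relu(norm5(conv5(board)))
    x = torch.cat([a, b], dim=1)                      # (N, 32, 8, 4)
    return torch.cat([x.flatten(1), pieces], dim=1)   # input to the feed-forward net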

The New, Deeper Model

The new model has four convolutional layers in series: 3×3, 5×5, 7×7, and 9×9, outputting (16, 4, 8), (32, 4, 8), (64, 4, 8), and (128, 4, 8) respectively. Each passes through layer norm and ReLU before feeding the next layer. Finally, the result is flattened, concatenated with the piece-positions vector, and passed to the feed-forward network.
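
In the same sketch form (PyTorch, illustrative names, spatial dims written as 4×8 to match the shapes above), the new stack is just four conv -> norm -> ReLU blocks chained together:

import torch.nn as nn

def conv_block(in_ch, out_ch, k):
    # conv -> layer norm -> ReLU; the padding keeps the 4x8 spatial size fixed
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=k, padding=k // 2),
        nn.LayerNorm([out_ch, 4, 8]),
        nn.ReLU(),
    )

conv_stack = nn.Sequential(
    conv_block(4, 16, 3),     # -> (16, 4, 8)
    conv_block(16, 32, 5),    # -> (32, 4, 8)
    conv_block(32, 64, 7),    # -> (64, 4, 8)
    conv_block(64, 128, 9),   # -> (128, 4, 8)
)
# The (128, 4, 8) output is then flattened (128 * 4 * 8 = 4096 features),
# concatenated with the piece-positions vector, and handed to the feed-forward net.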

I also trained an alternate version of this network with batch norm in place of layer norm; its performance was slightly better.

On my first few runs with this architecture, I was getting some good results. But after letting the model run all night, I woke up to find that training had failed because my 1080 Ti had run out of memory during one of the backprop passes.

The Optimization and the Bug

This happened after several bootstrap versions. One of the things I had begun capturing was the percentage of draws: as the model learned and started playing against a trained version of itself, the number of draw games went up significantly, from just a few per 500-game batch to something like a third of the games. Here is the output I saw on the morning of the failure:

The bottom graph is the draw percentage. On bootstrap version 5, instead of creeping downward, it climbs and climbs right up until training failed from insufficient memory.

This probably wouldn’t have been a huge deal; reducing the batch size to 400 would likely have been an easy fix. But I knew that draw matches are very long, with 400 non-jump moves required before a draw is called. And interestingly, with a reward of 0, draw games have no effect on the cost, so the model learns nothing from them. Sending draw games through backprop, where the memory limit was being hit, is therefore unnecessary.

So I wrote a few lines of code to strip any move that was part of a draw game from training. At first the results of this were weird and suspicious, diverging radically from the performance of the model with the zero-reward samples still going through backprop.
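
The filtering itself is only a few lines. In spirit, it’s something like this sketch (the sample layout here is illustrative; the key point is that draws carry a reward of 0):

def strip_draws(samples):
    # samples: one tuple per move played, e.g. (board, pieces, move_index, reward),
    # with reward +1 from winning games, -1 from losing games, and 0 from draws.
    # Dropping the zero-reward moves keeps them out of backprop entirely.
    return [s for s in samples if s[-1] != 0]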

Those weird results led to a tedious day and a half of figuring out what was going on. It turned out I had screwed up the dimensions in one of the several steps on the way to my cost function. Once I sorted that out, I ran the model.

The Results

There are a few things worth noting here. First, in the red win percentages (both rolling and batch-by-batch) we see some very different learning trajectories depending on which bootstrap version is competing against its ancestor.

Likewise, the draw percentage spikes to about 70% for bootstrap version 2 and stays near that level for the remainder of the epoch. For version 3 it spikes to about 90% before coming down to roughly 55-60% by the end of the epoch. Then, mysteriously, version 4 begins its epoch with almost no draws before creeping back up to 55-60% again.

Finally, after the start of the version 5 epoch, the draw percentage is nearly 100%. This explains the radical swings in the non-rolling win percentage at the beginning of the epoch: each batch of 500 played games was probably feeding only 5-10 decisive games into training, making the win percentages at the start of that epoch almost pure noise. The draw percentage starts to creep down, climbs back up, and then slowly falls again as, near the end of the epoch, the model suddenly seems to find its mojo and its win rate shoots up.

Interestingly, this graph stops after the very last batch before the rolling win percentage triggered a bootstrap update. I was watching while this was happening, and the time it was taking for the first parallel game completion to appear made me wonder whether I might get a 500-game batch with 100% draws. As I considered this, I realized that if that were to happen, the process would fail with an error, because win percentage is calculated like this:

red_win_pct = (red_wins * 100)/(red_wins + black_wins)

If there were no red or black wins, that would be a divide-by-zero error. Sure enough, that’s exactly what happened. That seemed like a sign from God that I had enough bootstrap versions in the can to do an evaluation. Here’s what that looked like:

So bootstrap version 5 is basically a mess, failing, for some reason, to improve on versions 1 and 2.

In general, though, I think I’ve pushed things about as far as makes sense with the current setup. Leaving aside Weird Version 5, the model does a decent job of not losing what it learned against its random-move, untrained version and resoundingly beats prior trained versions. However, even though the win rate against prior trained versions is nearly 100%, a model that still gets beaten 10% of the time by a random-move model is just not that impressive.

What I’m realizing is that this model is probably only capable of learning so much with a simplistic reward of +1 and -1 applied to the log probs of moves made in winning and losing games, respectively.
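
To be concrete about why a zero reward teaches nothing: the heart of this kind of cost is a REINFORCE-style product of reward and log probability, something like the sketch below (my actual cost function has a few more steps, but this is the basic shape):

import torch

def policy_loss(log_probs, rewards):
    # log_probs: log probability the model assigned to each move actually played
    # rewards:   +1 for moves from winning games, -1 for moves from losing games
    # Minimizing this pushes up the probability of moves from wins and pushes
    # down the probability of moves from losses; a reward of 0 (a draw) does nothing.
    return -(rewards * log_probs).mean()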

So, What Next?

Ideally, I would now move to what I’ve been considering these last few weeks: adding both value and policy outputs and using the AlphaGo Zero methods for training each of them, including Monte Carlo tree search as part of game play. That is super complicated and would take quite a lot of work.
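
At a minimum, that means the network needs two output heads instead of one, along the lines of this sketch (feature_dim and n_moves are placeholders):

import torch.nn as nn

class PolicyValueHead(nn.Module):
    # Shared features feed a policy head (move logits) and a value head
    # (a scalar in [-1, 1] estimating the game outcome). This is the output
    # structure AlphaGo Zero trains against search results and final outcomes.
    def __init__(self, feature_dim, n_moves):
        super().__init__()
        self.policy = nn.Linear(feature_dim, n_moves)
        self.value = nn.Sequential(nn.Linear(feature_dim, 1), nn.Tanh())

    def forward(self, features):
        return self.policy(features), self.value(features)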

Before I commit to that, though, there’s one thing I still want to try with the current reward setup and training: discounting the reward the farther a move is from the actual win. I don’t think this will make a night-and-day difference; basically, I don’t think it will be enough to let the model beat me regularly. But I’d like to see the impact regardless, and fortunately I can probably set it up in an hour or two. So it seems worth at least trying before abandoning this training method altogether.
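
The discounting itself should be simple: walk back from the final move and shrink the end-of-game reward geometrically, something like this sketch (the 0.95 discount factor is an arbitrary placeholder):

def discounted_rewards(final_reward, n_moves, gamma=0.95):
    # Moves near the end of the game keep nearly the full +1/-1 reward;
    # earlier moves get geometrically less credit or blame.
    return [final_reward * gamma ** (n_moves - 1 - i) for i in range(n_moves)]

# e.g. discounted_rewards(+1, 4) -> [0.857375, 0.9025, 0.95, 1.0]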
