A Lesson in Over-Fitting
I have updated my model to be a conv net. In addition to the piece vectors, I am now feeding the 4×8×4 board state into parallel 3×3 and 5×5 conv layers, each producing 16 channels at the 8×4 spatial size, for two 16×8×4 outputs. Both are layer normed. I then concatenate them into a single 32×8×4 volume and send that to a 6-layer feed-forward network, which also receives the piece vector input directly.
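For concreteness, here is a rough sketch of that layout, assuming PyTorch. The piece-vector size, hidden width, and single-value output head are placeholders I've filled in (they aren't specified above); only the parallel 3×3/5×5 convs, the layer norms, the 32×8×4 concatenation, and the 6-layer feed-forward trunk come from the description.

```python
import torch
import torch.nn as nn

class BoardNet(nn.Module):
    """Sketch of the parallel-conv architecture described above.

    Assumed details (not from the post): the board tensor is
    (batch, 4, 8, 4), the piece vector has `piece_dim` features,
    and the head outputs a single value.
    """

    def __init__(self, piece_dim: int = 16, hidden: int = 128):
        super().__init__()
        # Parallel conv branches; padding preserves the 8x4 spatial
        # shape, so each branch yields a (16, 8, 4) volume.
        self.conv3 = nn.Conv2d(4, 16, kernel_size=3, padding=1)
        self.conv5 = nn.Conv2d(4, 16, kernel_size=5, padding=2)
        # Layer norm over each branch's (channels, height, width).
        self.norm3 = nn.LayerNorm([16, 8, 4])
        self.norm5 = nn.LayerNorm([16, 8, 4])

        # 6-layer feed-forward trunk sees the flattened 32x8x4 volume
        # plus the raw piece vector.
        conv_features = 32 * 8 * 4
        layers = []
        in_dim = conv_features + piece_dim
        for _ in range(6):
            layers += [nn.Linear(in_dim, hidden), nn.ReLU()]
            in_dim = hidden
        self.trunk = nn.Sequential(*layers)
        self.head = nn.Linear(hidden, 1)

    def forward(self, board: torch.Tensor, pieces: torch.Tensor) -> torch.Tensor:
        a = self.norm3(self.conv3(board))
        b = self.norm5(self.conv5(board))
        x = torch.cat([a, b], dim=1)               # (batch, 32, 8, 4)
        x = torch.cat([x.flatten(1), pieces], 1)   # append piece vector
        return self.head(self.trunk(x))
```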
After some tweaking, it is apparent that in the bootstrap framework, the model is overfitting to its current opponent. Here are the current plots for a bootstrap training run that is underway right now:
I’ve speculated a lot about what causes the sudden apparent jump in how quickly the win rate climbs. The answer is in the evaluation. In the evaluation, I start with temperature = 1, which is the temperature at which training occurs. With temperature = 1, here is the evaluation:
We can see that 1 vs 0 performs slightly better than 2 vs 0. Likewise, 2 vs 1 is also slightly better than 2 vs 0. Neither of those should be the case: version 2 should be the best model in all cases. With temperature = 1, the values are still close enough that it could just be a statistical anomaly. However, if we lower the temperature to 0.25, making the model much more greedy, we see that it is not a statistical anomaly:
Here we see that 1 vs 0 performs significantly better than 2 vs 0. And version 2 has near-perfect performance vs 1, but only average performance vs 0. The 2 vs 1 training was that very steep, quick learning curve between 300,000 and 350,000 games. Clearly version 2 had exploited some learned behavior in version 1 that was not as prevalent in version 0 (which plays at near-random chance).
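For reference, the temperature here just scales the model's move scores before sampling: temperature = 1 samples moves from the same distribution used during training, while temperature = 0.25 sharpens it so play is much closer to greedy. A minimal sketch, assuming the model produces per-move logits (the actual sampling code isn't shown here):

```python
import numpy as np

def sample_move(logits: np.ndarray, temperature: float = 1.0) -> int:
    """Pick a move index from per-move logits at a given temperature.

    temperature = 1.0 reproduces the training-time distribution;
    temperature = 0.25 sharpens it so the top-scoring move dominates.
    """
    scaled = logits / temperature
    scaled = scaled - scaled.max()     # subtract max for numerical stability
    probs = np.exp(scaled)
    probs = probs / probs.sum()
    return int(np.random.choice(len(probs), p=probs))
```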
So it is apparent the model is overfitting to its latest opponent. The question now is how to deal with this. I have a lot of ideas in my head, but the one I’m going to start with is an idea that has been on the back burner for a while: dropout layers.
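As a first pass, the change would just be inserting dropout between the feed-forward layers. A sketch of what that might look like (the dropout probability and layer width are placeholders, not settled choices):

```python
import torch.nn as nn

def make_trunk(in_dim: int, hidden: int = 128, n_layers: int = 6,
               p_drop: float = 0.2) -> nn.Sequential:
    """Feed-forward trunk with dropout after each hidden activation."""
    layers = []
    for _ in range(n_layers):
        layers += [nn.Linear(in_dim, hidden), nn.ReLU(), nn.Dropout(p_drop)]
        in_dim = hidden
    return nn.Sequential(*layers)
```

One thing to keep in mind: dropout only fires when the model is in training mode, so evaluation games run under model.eval() still see the full network.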