Thoughts on Checkers
A while ago, when I was (ludicrously, it seems) considering tic tac toe as a viable candidate for an AI project, it occurred to me that Checkers was probably a better candidate. I don’t actually play games like Go and Chess, so I don’t even have the domain expertise to tackle such a problem on a purely logistical level. While I don’t doubt that checkers is also a solved problem in the world of AI, for me it still serves as a challenging starting point.
So here begins my thought process around that problem. This is not my documentation of a completed thought process, but rather me thinking this through on the fly. It really is just notes and thoughts, so I don’t know how coherent any of this will end up being, and thus, whether I will ever publish it. But it’s worth a shot.
As I mentioned in earlier pieces, the real challenge for me is that this is a fundamentally different problem from the classification problems I’ve dealt with to date. Unlike classification, where every training example has a definitive right or wrong answer that can be reduced to a simple cost function, here each individual decision has no such label. This has led me to a paper published by DeepMind on sparse-reward reinforcement learning. The paper deals specifically with robotics, but it seems clear to me that it applies to learning to play a game from scratch. I’m still reading and digesting it, but the critical element seems to be a system that encourages certain short-term, immediate actions that, while not necessarily optimal for the ultimate task, help the AI along a path toward it.
The actual article concerns itself with the task of a robot learning to open a box and place a block inside it. It’s a complex task to learn, especially from scratch. The challenge seems to be how to properly reward the near-infinite number of things the robot could do that do not ultimately succeed in getting the box open or the block inside. Early on, they describe a system for rewarding certain types of behaviors that, as I see it, would be good “habits” to get into, even if they don’t deliver the ultimate result. They call these “auxiliary rewards.” The one they describe early on is providing an auxiliary reward whenever the robot manages to move an object. In itself, moving an object does not necessarily get the box open or the block any closer to being in the box. In fact, moving the block might actually move it farther from the box. But actions that move an object in space seem to be more valuable, in general, for figuring out how to solve the overall problem than actions that don’t.
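To make that concrete for myself, here’s a toy sketch of the two reward signals side by side. The state fields (block_in_box, object_pos) are my own stand-ins, not anything from the paper’s actual environment:

```python
import numpy as np

def main_reward(state):
    # Sparse task reward: nothing until the block actually ends up inside
    # the opened box. ("block_in_box" is my own stand-in field, not the
    # paper's environment.)
    return 1.0 if state["block_in_box"] else 0.0

def auxiliary_move_reward(prev_state, state, threshold=1e-3):
    # Auxiliary reward: fires whenever a tracked object moved between
    # timesteps. Moving something doesn't necessarily bring the task closer
    # to done, but it's the kind of "habit" the robot needs along the way.
    displacement = np.linalg.norm(
        np.asarray(state["object_pos"]) - np.asarray(prev_state["object_pos"])
    )
    return 1.0 if displacement > threshold else 0.0
```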
At least that’s how I’m understanding it now…
One thing that seems very important here is “on-policy” vs. “off-policy” learning. In problem domains such as the one I’m grappling with, this seems like something I’m going to need to get on top of.
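To keep the distinction straight in my own head, here’s the standard tabular sketch, not anything from the paper: SARSA (on-policy) updates toward whatever action the current policy actually takes next, while Q-learning (off-policy) updates toward the best action available, whether or not the policy would actually take it.

```python
# Q is a table of action values, Q[state][action]; alpha is the learning
# rate and gamma the discount factor.

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    # On-policy (SARSA): the target uses the action the current policy
    # actually chose in the next state.
    target = r + gamma * Q[s_next][a_next]
    Q[s][a] += alpha * (target - Q[s][a])

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    # Off-policy (Q-learning): the target uses the best available action
    # in the next state, regardless of what the behavior policy does next.
    target = r + gamma * max(Q[s_next].values())
    Q[s][a] += alpha * (target - Q[s][a])
```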
The reason this seems relevant to Checkers is that in order to teach the AI to play from scratch, it would be necessary to have some method for rewarding good habits. For instance, if one available move simply advances a piece, but another jumps and captures an opponent’s piece, then absent a clear strategic advantage to just moving, it would seem to make sense to train the AI to make the jump. At some point, there needs to be a way to combine that short-term reinforcement reward (“internal reward” seems to be their term for it) with the reward for the overall goal of winning (the “external” reward).
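Just to pin the idea down for myself, the crudest possible way to combine the two is a single shaped reward. This is my own simplification for illustration, not necessarily how the paper combines them, and the capture_bonus weight is a pure guess:

```python
def checkers_reward(captured_piece, game_over, won, capture_bonus=0.1):
    # Internal/auxiliary reward: a small bonus for capturing a piece,
    # the "good habit" part.
    reward = capture_bonus if captured_piece else 0.0
    # External reward: the win/loss signal, which only shows up at the
    # very end of the game.
    if game_over:
        reward += 1.0 if won else -1.0
    return reward
```

Presumably getting that weighting right is a big part of the problem: too large and the AI just learns to chase captures, too small and it does nothing to ease the sparse-reward issue.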
Lots, lots to learn here, for sure.