
Andrej Karpathy’s Pong from Pixels has been my exemplar for implementing reinforcement learning. In general I’m following along with his approach, and conceptually it’s making sense. But as I set out to actually create my own analogous and more complex version for checkers, I’m working through his code to see exactly how he implements those concepts, as a primer for doing so myself. The first thing I discovered was that, despite what I had inferred from his original post, my intuition was correct: rewards do get discounted as you move back in time from the reward-worthy event.
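For my own notes, here’s a minimal numpy sketch of that backward discounting. The function name, the \(\gamma=0.99\) value, and the reset at nonzero rewards (a Pong-specific game boundary) are my reading of the idea rather than a claim about his exact code:

```python
import numpy as np

def discount_rewards(rewards, gamma=0.99):
    """Walk backward through the episode, discounting each reward as we
    move back in time from the reward-worthy event."""
    discounted = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        if rewards[t] != 0:
            running = 0.0          # a point was scored: start a fresh running sum
        running = running * gamma + rewards[t]
        discounted[t] = running
    return discounted

# a reward of +1 at the final step bleeds backward as gamma**k
print(discount_rewards([0, 0, 0, 1]))   # gamma**3, gamma**2, gamma, 1
```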

This morning I was reviewing how he implements his gradient, and came across something that I think I had partially inferred from what he wrote but wasn’t really sure about. The way he sets up his network, he has a single output node that generates a probability \(a\) between 0 and 1 that the paddle will go up; if it doesn’t go up, it goes down. Further, the way he implements the reward is to create a fake label \(y\) for every state which corresponds to the action actually taken by the agent (1 for up, 0 for down). For every state, then, he calculates \(y-a\). When I saw this, I assumed incorrectly that it was a cost function \(L=y-a\) (even though that’s not a very useful cost function). In that case it would have to be the starting point for backpropagation. Thus, to get to \(w^{[2]}\), you’d have to put the following together (I’m skipping the sums to keep things simple here; a code sketch of this setup follows the equations):

$$\large\frac{∂ L}{∂ a^{[2]}}=-1$$

$$\large\frac{∂ a^{[2]}}{∂ z^{[2]}}=\sigma'(z^{[2]})=a^{[2]}(1-a^{[2]})$$

$$\large\frac{∂ z^{[2]}}{∂ w^{[2]}}=a^{[1]}$$

And thus:

$$\large\frac{∂ L}{∂ w^{[2]}}=-1*a^{[2]}(1-a^{[2]})*a^{[1]}$$
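To make that setup concrete for myself, here’s a minimal numpy sketch of the forward pass and the fake label. The layer sizes, the ReLU hidden layer, and all of the names are my own stand-ins, not necessarily his exact choices:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# stand-in sizes: 6400 inputs (an 80x80 frame), 200 hidden units, 1 output node
W1 = rng.standard_normal((200, 6400)) / np.sqrt(6400)
W2 = rng.standard_normal(200) / np.sqrt(200)

def policy_forward(x):
    a1 = np.maximum(W1 @ x, 0)   # a^{[1]}: hidden activations (ReLU)
    z2 = W2 @ a1                 # z^{[2]}: pre-activation of the single output node
    a2 = sigmoid(z2)             # a^{[2]}: probability the paddle goes up
    return a1, a2

x = rng.standard_normal(6400)    # stand-in for a preprocessed game state
a1, a2 = policy_forward(x)

went_up = rng.uniform() < a2     # sample the action from that probability
y = 1.0 if went_up else 0.0      # fake label: the action actually taken
signal = y - a2                  # the per-state quantity he computes
```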

But as I mentioned, this all assumes \(L=y-a\), and of course that wasn’t his intention. He skips right past deriving a cost function, or taking the derivative of the sigmoid, or anything of the sort. Instead, in the slot where the chain rule would normally produce \(\frac{∂ L}{∂ z^{[2]}}\) from the \(\sigma'(z^{[2]})=a^{[2]}(1-a^{[2]})\) term, he just plugs \(y-a\) right in to get (a numpy sketch follows these equations):

$$\large\frac{∂ L}{∂ z^{[2]}}=y-a^{[2]}$$

$$\large\frac{∂ L}{∂ w^{[2]}}=(y-a^{[2]})*a^{[1]}$$
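In code, that output-layer gradient is just the per-state signal multiplied by the hidden activations and summed over the episode. Here’s a minimal numpy sketch, with my own hypothetical names, a ReLU hidden layer assumed, and the signal already scaled by the discounted reward (which, as I understand it, is how the reward actually enters the gradient):

```python
import numpy as np

def policy_backward(ep_x, ep_a1, ep_dz2, W2):
    """Episode-level gradients for a two-layer policy network.

    ep_x   : (T, n_inputs) stacked inputs
    ep_a1  : (T, n_hidden) stacked hidden activations a^{[1]}
    ep_dz2 : (T,) the per-state signal standing in for dL/dz^{[2]},
             i.e. (y - a^{[2]}) scaled by that state's discounted reward
    W2     : (n_hidden,) output-layer weights
    """
    dW2 = ep_a1.T @ ep_dz2        # (y - a^{[2]}) * a^{[1]}, summed over the episode
    da1 = np.outer(ep_dz2, W2)    # push the signal back to the hidden layer
    da1[ep_a1 <= 0] = 0           # gradient through the ReLU
    dW1 = da1.T @ ep_x            # gradient for the first-layer weights
    return dW1, dW2
```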

So what he has actually done is different from what I was originally thinking (and certainly from what I would have done). Here’s the part I couldn’t figure out at first. If I were to go straight from a cross-entropy cost function and derive back (worked out below), then \(\frac{∂ L}{∂ z^{[2]}}=a^{[2]}-y\), which is exactly the negative of what he’s plugged in there (with no explicit cost function). And yet, the logic behind what he’s done is sound. If, for instance, \(a^{[2]}=0.7\), our fake label is \(y=1\), and it’s a win, we want to encourage (increase the probability of) \(y=1\), which means pushing \(a^{[2]}\) higher.
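For reference, here’s that cross-entropy derivation spelled out (standard binary cross-entropy with a sigmoid output; this isn’t in his post, just my own check):

$$\large L=-\left[y\ln a^{[2]}+(1-y)\ln\left(1-a^{[2]}\right)\right]$$

$$\large\frac{∂ L}{∂ a^{[2]}}=-\frac{y}{a^{[2]}}+\frac{1-y}{1-a^{[2]}}$$

$$\large\frac{∂ L}{∂ z^{[2]}}=\frac{∂ L}{∂ a^{[2]}}\,a^{[2]}\left(1-a^{[2]}\right)=a^{[2]}-y$$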

Turns out, it all makes sense. With the algorithms we’ve learned, we subtract the gradient from the weights. Meaning: the gradient we’ve always calculated across the dimensions of the \(W\) vector points in the direction of increasing cost. Since you’re trying to go down (to minimize cost), you subtract the gradient vector, which is equivalent to moving in the opposite direction. Andrej’s “gradient” vector is already the downhill direction, which is why it’s the negative of the derivative you’d ordinarily get starting from cross-entropy, and why he just adds it directly.
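A quick numerical sanity check of that sign argument, with made-up numbers (nothing here comes from his code):

```python
import numpy as np

a1 = np.array([0.2, 0.5, 0.1])    # hidden activations a^{[1]}
a2, y = 0.7, 1.0                  # output probability and fake label (a win we want to encourage)
lr = 0.01
w2 = np.array([0.3, -0.4, 0.8])

# textbook route: cross-entropy gradient dL/dz = a - y, then subtract it
w2_descent = w2 - lr * (a2 - y) * a1

# his route: plug in y - a directly, then add it
w2_added = w2 + lr * (y - a2) * a1

print(np.allclose(w2_descent, w2_added))   # True: the two updates are identical
```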

He never bothers to compute a cost function or derive explicitly from one, because in this case it would just be a meaningless number generated from fake labels; it matters for reinforcement but not as an actual metric. Still, a cost function is clearly implicit in what he’s designed. More importantly, it confirms my intuition about the approach I’m looking to take (which, of course, I haven’t explained yet). One thing I’ve yet to analyze is exactly how he implements updating the weights with the gradients; he seems to be using a method I’m not familiar with. I’m not too worried though…
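For when I do dig into it: his published code reads to me like an RMSProp-style update, so here’s a minimal sketch of that idea under that assumption (my own naming and defaults, not a claim about his exact implementation):

```python
import numpy as np

def rmsprop_step(w, grad, cache, lr=1e-3, decay=0.99, eps=1e-5):
    """One RMSProp-style update: scale each weight's step by a running average
    of its squared gradient. Note the addition: grad here is already the
    'downhill' (y - a)-style vector discussed above."""
    cache = decay * cache + (1 - decay) * grad ** 2
    w = w + lr * grad / (np.sqrt(cache) + eps)
    return w, cache

# hypothetical usage over a few identical gradients
w, cache = np.zeros(3), np.zeros(3)
for _ in range(5):
    w, cache = rmsprop_step(w, np.array([0.3, -0.1, 0.2]), cache)
print(w)
```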

 
