Checking in on Udacity’s Reinforcement Learning Nano-degree

I started the nano-degree over a year ago. The company I was working for at the time was in dire straits and I had a lot going on, so I dropped it back then. The company ended up doing OK, but I ended up starting a new job which was, and continues to be, very demanding of my time and energy. This year, though, I decided to give it another shot. Fortunately the work I had already done back then still counted (even though I had to pay for the entire course over again).

Having picked up where I left off, so far I’m not impressed with the second half of the course. Most of the first half of the course culminates in Q Learning, which they go through in exhaustive detail. Having already started to learn at least some of the more modern successors to Q Learning, I found Q Learning to be inelegant, and felt that the intermediate step of having to estimate value functions was an unnecessary convolution compared to what seemed to be the more straightforward and elegant approach of figuring out the policy directly. Nonetheless, the class covers Q Learning at great depth (but not always well).

I’m finally done with those lessons, though, and have finally moved on to policy-based methods. So far, it’s been disappointing. I plan to go into more detail on this once I’m through, but suffice it to say that the reviewers of this course who complained that the instruction and detail in the later lessons are lower quality than in the earlier ones appear to be correct. The creators of this course have spent far too much time on the trivially easy stuff, but then leave the student largely to fend for themselves on the complex topics that really form the core of working policy gradient algorithms.

The Simple Stuff

The videos they use to introduce policy-based methods in general, and policy gradients in particular, spend a fair amount of time trying to explain simple concepts with metaphors. Gradient ascent is climbing up a mountain! Gradient descent is climbing down a mountain! See? Get it?

On the one hand, I find it hard to believe that there are people taking this course—people who presumably already have a background in the fundamentals of AI, where gradient descent is a foundational principle—who would need to have something like gradient ascent explained. However, if those people are part of the target audience for this course, I suppose there’s nothing wrong with that. If one wasn’t already familiar with that stuff, it’s a good way to start developing intuition about what is actually going on with these algorithms. And videos like that are easy to skip.
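
For anyone who does need the refresher, the whole idea fits in a few lines. Here’s my own toy sketch of gradient ascent on a one-dimensional objective (nothing from the course, just an illustration):

# Toy example: find the theta that maximizes f(theta) = -(theta - 3)^2.
# The "mountain" peaks at theta = 3; we climb it by stepping along the gradient.
theta = 0.0
learning_rate = 0.1
for step in range(100):
    grad = -2.0 * (theta - 3.0)    # derivative of f with respect to theta
    theta += learning_rate * grad  # ascent adds the gradient; descent would subtract it
print(theta)  # converges to ~3.0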

There are also videos explaining the mechanics of policy gradient methods as they relate to supervised learning. This part is genuinely useful.
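
To make that connection concrete, here is roughly how I think of it in code (my own sketch, not the course’s notebook): supervised learning pushes up the log probability of the correct label, while a policy gradient method pushes up the log probability of the actions that were actually taken, weighted by how much reward the episode earned.

import torch

# Hypothetical setup for one short episode, purely for illustration:
logits = torch.randn(5, 2, requires_grad=True)   # policy network outputs for 5 steps, 2 actions
actions = torch.tensor([0, 1, 1, 0, 1])           # actions actually taken
episode_return = 1.0                              # R(tau): total reward collected

log_probs = torch.log_softmax(logits, dim=1)      # log pi(a | s) for every action
taken = log_probs[torch.arange(5), actions]       # log pi(a_t | s_t) for the taken actions

# Supervised learning: treat `actions` as ground-truth labels and minimize
# the negative log likelihood (cross-entropy).
supervised_loss = -taken.mean()

# Policy gradient: the same quantity, but weighted by the return, so the taken
# actions are only reinforced to the extent the episode actually went well.
policy_loss = -(taken * episode_return).mean()
policy_loss.backward()  # gradients now nudge the policy toward higher expected return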

The Complex Stuff

However, the videos then move very quickly on to the mathematical expressions that capture the actual workings of the algorithm. This is the big one:

\(\nabla_{\theta} U(\theta) \approx \hat{g} = \frac{1}{m}\sum_{i=1}^{m}\sum_{t=0}^{H} \nabla_{\theta} \log \pi_{\theta}\left(a_t^{(i)} \mid s_t^{(i)}\right) R\left(\tau^{(i)}\right)\)

This is where things fall down. It’s quite an equation, and there are several new concepts in there, including log probabilities, trajectories, and gradient estimates, among others. The video includes a few sentences describing each. That’s sorta helpful, but here I can’t help but compare the few sentences of explanation for this gigantic, complex equation to the several videos that go to great lengths to illustrate the simpler concepts with clear, animated illustrations. Is there anyone who truly needed a metaphorical explanation of gradient ascent who is even remotely able to comprehend the many concepts involved in this equation from just a few sentences of explanation?
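
For what it’s worth, here is my own rough translation of that estimator into PyTorch, just to show how the two sums and the log-probability weighting line up with code (none of these names come from the course material; it’s a sketch, not the course’s implementation):

import torch

def reinforce_loss(log_probs_per_trajectory, returns_per_trajectory):
    # log_probs_per_trajectory: list of m tensors; each holds log pi_theta(a_t | s_t)
    #   for t = 0..H of one sampled trajectory tau^(i)
    # returns_per_trajectory: list of m floats; each is the total reward R(tau^(i))
    per_trajectory = []
    for log_probs, R in zip(log_probs_per_trajectory, returns_per_trajectory):
        per_trajectory.append(log_probs.sum() * R)   # inner sum over t, weighted by R(tau)
    # Outer average over the m trajectories. The minus sign is there because
    # optimizers minimize, and we want gradient ASCENT on U(theta).
    return -torch.stack(per_trajectory).mean()

# Typical usage (policy network and optimizer defined elsewhere):
#   loss = reinforce_loss(log_probs_per_trajectory, returns_per_trajectory)
#   loss.backward()
#   optimizer.step()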

For my own part, I had read and studied Andrej Karpathy’s seminal blog post on this subject as well as the code he provides, and implemented parts of it long before I took this course. It’s something the course material directly references several times, including lifting diagrams directly from that post as supporting material. And yet I have been unable to connect that giant equation and the videos explaining it directly to what I’ve learned from Karpathy’s post. At least not without a LOT of independent work. Which is another thing: the main video explaining the equation closes by urging the student to review the video over and over again until it makes sense. It seems like they’re telling us: we know you paid good money for this course, but we couldn’t be bothered to do the hard work of creating a clear, understandable explanation of these difficult concepts. So instead, please do the hard work for us by just watching the video over and over again until it makes sense.

That feels like a cop-out and not what I paid my money for.

The capper for all of this, though, and the thing that really inspired me to write this post, is the next section, which they seem to think deserves the title “Coding Exercise.” The purpose of this exercise is beyond me. When I see a coding exercise, I expect either or both of the following: to understand how certain mathematical concepts can be executed using actual code, and, where possible, to have to figure out much of it myself. This coding exercise is neither.

First, there is no code for me to actually write. It’s a pre-written Jupyter notebook that I am told I should just run. OK. That’s fine. Maybe it’s too complex for us beginners to be expected to come up with right out of the gate, especially given the half-hearted attempt to explain it in the first place. So that leaves using the pre-written code as an exemplar. But here’s the thing: there is not a single useful comment in the entire notebook, nor even any explanation in the notebook markdown apart from a header for each block. There are only two actual Python comments, and these are they:

gym.logger.set_level(40) # suppress warnings (please remove if gives error)
...
torch.manual_seed(0) # set random seed

Perhaps the purpose of this is the same as the purpose of the video: just keep reviewing it until you understand it. But that is not helpful. By way of contrast, the code that forms the substance of Karpathy’s post is very well commented. We’ll see how things progress, but at this point I’ve learned more about policy gradient methods from Karpathy and his code than I have from this course, and I got that for free.
