Deep Reinforcement Learning So Far – Udacity

I’ve taken quite a few ML/AI courses on Coursera, and so far, Udacity is a very different experience. All of the Coursera material is covered by Andrew Ng himself in fairly bare-bones videos with relatively low production values. In contrast, Udacity uses much more highly polished videos featuring what appears to be an actor, plus a lot of animations to support the material. However, I don’t find that gives me a significant advantage in understanding and learning the material. Despite the low production values of the videos, I always found Professor Ng’s explanations easy to understand. His method of moving through a set of slide states, augmented by his virtual whiteboard software and combined with excellent conceptual explanations, always left me able to digest the material without much difficulty. I suspect that much of this came from being intrinsically comfortable with the concepts and the math.

By contrast, the Udacity content goes to greater lengths to make things easy to understand. Even though I don’t have much trouble with the concepts and the math, for the most part I appreciate the effort. Some of the animations and graphics probably help a lot in reducing the cognitive load of new material. However, in at least one instance, that urge to simplify resulted in an oversimplification that actually confused things a bit. This was the case in the attempt to explain the relationship between action-value functions and policies.

In the Optimal Policies video, the instructor says, “for now, let’s ignore how the agent uses its experience to estimate the value function.” Instead, she says, let’s assume the agent already knows the optimal action-value function but doesn’t know the corresponding optimal policy. In support of this, she uses the action-value function mapped out below for the gridworld example:
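
For reference, the operation she’s describing is just greedy action selection over the known action-value function. In the standard notation (my own rendering, not the course’s figure):

\[
\pi_*(s) = \arg\max_{a} q_*(s, a)
\]

Any policy that, in every state, picks an action with the highest optimal action value is an optimal policy.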

However, the only way the agent could have determined any action-value function in the first place was by applying some policy, and in the case above, that deterministic policy is in fact the optimal policy. The action values above could only have been produced by applying the deterministic (optimal) policy in each grid block, as shown by the solid lines.
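
That’s the crux: an action-value function is always defined relative to some policy. In the usual textbook notation (not specific to the course’s example):

\[
q_\pi(s, a) = \mathbb{E}_\pi\!\left[\, G_t \mid S_t = s,\, A_t = a \,\right]
            = \sum_{s',\, r} p(s', r \mid s, a)\Big[\, r + \gamma \sum_{a'} \pi(a' \mid s')\, q_\pi(s', a') \,\Big]
\]

The expectation is taken under \(\pi\) itself, so changing the policy changes the values in the table.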

But I could create a gridworld with a completely different action-value function simply by choosing some other arbitrary deterministic policy.
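
To make that concrete, here’s a minimal sketch (a toy 2×2 gridworld of my own, not the one from the course) that evaluates two different deterministic policies and shows they produce different action-value functions:

```python
import numpy as np

# A toy 2x2 gridworld (my own example, not the course's): states 0..3 laid out
# row-major, state 3 is terminal. Entering the terminal state yields +1,
# every other step yields -1.
N_STATES, N_ACTIONS = 4, 4      # actions: 0 = up, 1 = right, 2 = down, 3 = left
GAMMA = 1.0
TERMINAL = 3

def step(s, a):
    """Deterministic transition: returns (next_state, reward)."""
    if s == TERMINAL:
        return s, 0.0
    row, col = divmod(s, 2)
    if a == 0:
        row = max(row - 1, 0)
    elif a == 1:
        col = min(col + 1, 1)
    elif a == 2:
        row = min(row + 1, 1)
    elif a == 3:
        col = max(col - 1, 0)
    s_next = 2 * row + col
    return s_next, (1.0 if s_next == TERMINAL else -1.0)

def q_for_policy(policy, n_sweeps=100):
    """Iterative policy evaluation of q_pi for a deterministic policy {state: action}."""
    q = np.zeros((N_STATES, N_ACTIONS))
    for _ in range(n_sweeps):
        for s in range(N_STATES):
            if s == TERMINAL:
                continue
            for a in range(N_ACTIONS):
                s_next, r = step(s, a)
                # Bellman expectation backup: the follow-up action comes from the policy.
                q[s, a] = r + GAMMA * q[s_next, policy[s_next]]
    return q

# Two different deterministic policies over the very same gridworld.
policy_a = {0: 1, 1: 2, 2: 1, 3: 0}   # right / down / right: shortest routes to the goal
policy_b = {0: 2, 1: 3, 2: 1, 3: 0}   # state 1 heads left and takes the long way around

q_a = q_for_policy(policy_a)
q_b = q_for_policy(policy_b)
print("q under policy A, state 0:", q_a[0])
print("q under policy B, state 0:", q_b[0])
```

The two tables disagree; for example, the value of “right” in state 0 comes out as 0 under the first policy but -2 under the second, purely because of what each policy does after that first step.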

So ultimately, it was confusing to talk about figuring out the optimal policy from the action-value function when, in fact, the only way to have an optimal action-value function is by starting with an optimal policy.

I suspect this way of explaining the material might help someone who has not closely examined the relationship see that there is a relationship. However, it is ultimately a misleading way to describe that relationship. It’s a speed bump that didn’t really slow me down, but if it were me, I would have explained things differently.
