Why Log Probabilities? Now I Know.
Back when I first read Andrej Karpathy’s revelatory Deep Reinforcement Learning: Pong from Pixels and wrote about it in this post, I noticed that he kept referring to the log probability of any given movement rather than the actual probability. The various operations and equations were always on the log probability. I understood mathematically what a log probability is, but it was never clear to me why you’d want to use it as your basic operational unit rather than the probability itself. It now makes some sense.
I’m midway through the final week of my Sequence Models course, and Professor Ng is discussing beam search. We’re dealing with evaluating the probabilities of sentence-length sequences of words. The probability of even a single word from a very small 10,000-word vocabulary can be quite close to zero, and getting the probability of a string of these words means multiplying those probabilities together, which pushes the result even closer to zero. Doing things this way introduces two problems: rounding errors with numbers that small, and numerical underflow, where the number is too small to be represented accurately in computer memory. The solution is to use the log of the product of probabilities instead of the product itself. The log of a product is the sum of the logs of its factors, so this becomes the sum of the log probabilities. And because log is a monotonic function, maximizing the sum of log probabilities also maximizes the product of the probabilities.
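To make that concrete, here’s a minimal Python sketch. The per-word probabilities are made up for illustration (they aren’t from the course), but they show the raw product collapsing toward zero while the sum of log probabilities stays a perfectly ordinary number:

```python
import math

# Hypothetical per-word probabilities for a ten-word sentence drawn from a
# large vocabulary -- each one is small, and they only get multiplied together.
word_probs = [1e-5, 3e-4, 2e-5, 7e-4, 1e-5, 5e-4, 2e-5, 1e-4, 3e-5, 2e-4]

# Naive approach: multiply the raw probabilities.
product = 1.0
for p in word_probs:
    product *= p
print(product)  # ~2.5e-42 -- already tiny

# Log-space approach: sum the log probabilities instead.
log_sum = sum(math.log(p) for p in word_probs)
print(log_sum)  # ~ -95.8, a perfectly ordinary float

# With a longer sequence, the raw product literally underflows to zero,
# while the log-space sum is still easy to represent:
long_probs = [1e-6] * 60
product_long = 1.0
for p in long_probs:
    product_long *= p
print(product_long)                          # 0.0 -- underflowed
print(sum(math.log(p) for p in long_probs))  # ~ -828.9, still fine

# Because log is monotonic, ranking candidate sentences by their summed
# log probabilities gives the same ordering as ranking by the raw products.
```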
Thus, we’re effectively representing both the aggregate probability and the individual probabilities in log space, which lets very, very tiny probabilities be represented in memory just as accurately as larger ones. In the Pong from Pixels example, numerical underflow doesn’t seem to be a problem, but I’m assuming that someone well versed in working with probabilities would use log probabilities as a best practice regardless.