(WIP) Deep Learning: Second Pass Notes
Intro
I really enjoyed reading Deep Learning by Goodfellow; it was one of the more helpful references when I was learning machine learning concepts. While I have read through the book once, there are parts that I glossed over or did not fully internalize. I believe that an intuitive understanding of all the key concepts is a solid lower bound to aim for, so I am writing up and sharing here the pieces from my second read that I had not internalized, in order to learn them by heart. While this is more of a collection of notes, hopefully there are some useful ideas for the reader as well.
Linear Algebra
Def: A matrix $A$ is an orthogonal matrix if all of its rows are mutually orthonormal and all of its columns are mutually orthonormal, i.e. $A^\top A = A A^\top = I$, equivalently $A^{-1} = A^\top$.
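A quick numerical sanity check of these identities, assuming NumPy (the orthogonal matrix here is made up by taking the QR decomposition of a random matrix):

```python
import numpy as np

# Build an orthogonal matrix Q via QR decomposition of an arbitrary random matrix.
Q, _ = np.linalg.qr(np.random.default_rng(0).normal(size=(4, 4)))

I = np.eye(4)
print(np.allclose(Q.T @ Q, I), np.allclose(Q @ Q.T, I))  # rows and columns are mutually orthonormal
print(np.allclose(np.linalg.inv(Q), Q.T))                # hence the inverse is just the transpose
```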
Eigendecomposition
Every real symmetric matrix $A$ can be eigendecomposed with orthogonal matrices, i.e. $A = Q \Lambda Q^\top$, where $Q$ is an orthogonal matrix whose columns are eigenvectors of $A$ and $\Lambda$ is a diagonal matrix of the corresponding eigenvalues.
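A minimal sketch of this with NumPy (the symmetric matrix below is made up for illustration; `np.linalg.eigh` is the routine specialized for symmetric/Hermitian matrices):

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.normal(size=(4, 4))
A = (B + B.T) / 2                       # symmetrize to get a real symmetric matrix

eigvals, Q = np.linalg.eigh(A)          # columns of Q are eigenvectors, eigvals are eigenvalues
Lam = np.diag(eigvals)

print(np.allclose(A, Q @ Lam @ Q.T))    # A = Q Lambda Q^T
print(np.allclose(Q.T @ Q, np.eye(4)))  # Q is orthogonal
```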
Information Theory
The three guiding principles for a measure of information:
- Events with probability 1 should have zero info.
- Less likely events should have higher info.
- Independent events have additive info.
We define the information of an event $x$ as $I(x) = -\log P(x)$.
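A tiny sketch of this definition in Python, assuming NumPy (the probabilities below and the name `self_information` are just illustrative):

```python
import numpy as np

def self_information(p):
    return -np.log(p)  # in nats; use log base 2 for bits

print(self_information(1.0))   # 0.0: an event with probability 1 carries no information
print(self_information(0.5))   # less likely events carry more information
print(self_information(0.01))
# Independence: P(x, y) = P(x)P(y) gives I(x, y) = I(x) + I(y), since the log turns products into sums.
```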
Def: The Shannon entropy is the expected information of a distribution, i.e. $H(P) = \mathbb{E}_{x \sim P}[I(x)] = -\mathbb{E}_{x \sim P}[\log P(x)]$.
We can see that distributions that have a high chance of taking a particular value (or, more generally, whose probability mass collapses onto a small region) have low Shannon entropy, i.e. the distribution carries little information on average. On the other hand, if the distribution is quite random (say, uniform across its domain), then we have high unexpectedness and thus high Shannon entropy.
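A small numerical comparison of the two cases, assuming NumPy (the two distributions are chosen purely for illustration):

```python
import numpy as np

def shannon_entropy(p):
    p = np.asarray(p)
    p = p[p > 0]                  # treat 0 * log 0 as 0
    return -np.sum(p * np.log(p))

peaked = [0.97, 0.01, 0.01, 0.01]   # mass collapses onto one value -> low entropy
uniform = [0.25, 0.25, 0.25, 0.25]  # maximally "random" -> high entropy

print(shannon_entropy(peaked), shannon_entropy(uniform))
```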
Def: The KL divergence measures the "distance" between two distributions, defined by $D_{\mathrm{KL}}(P \| Q) = \mathbb{E}_{x \sim P}\left[\log \frac{P(x)}{Q(x)}\right]$.
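A minimal sketch of this definition for discrete distributions, assuming NumPy (the two distributions are made up):

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.4, 0.4, 0.2])

# D_KL(P || Q) = E_{x ~ P}[log P(x) - log Q(x)]
kl_pq = np.sum(p * np.log(p / q))
print(kl_pq)
```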
Asymmetry of KL divergence
KL divergence is not a true metric; in particular, symmetry does not necessarily hold, i.e. $D_{\mathrm{KL}}(P \| Q) \neq D_{\mathrm{KL}}(Q \| P)$ for some distributions $P$ and $Q$.
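A quick numerical check of the asymmetry on the same kind of toy distributions, this time using `scipy.stats.entropy`, which computes the KL divergence when given two arguments:

```python
import numpy as np
from scipy.stats import entropy

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.4, 0.4, 0.2])

# entropy(pk, qk) computes sum(pk * log(pk / qk)) = D_KL(pk || qk).
print(entropy(p, q), entropy(q, p))  # the two orderings give different values
```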
How to understand KL divergence from hypothesis testing?
We have the definition $D_{\mathrm{KL}}(P \| Q) = \mathbb{E}_{x \sim P}[\log P(x) - \log Q(x)]$.
We can see a log-likelihood ratio $\log \frac{P(x)}{Q(x)}$ inside the expectation. In fact, let us consider the hypothesis test where $H_0$: $P$ is the distribution of the data and $H_1$: $Q$ is the distribution of the data. We see that $D_{\mathrm{KL}}(P \| Q)$ simply measures the expected log-likelihood ratio under the null hypothesis. So if this value is very large, we might be induced to not reject the null.
We can then consider $D_{\mathrm{KL}}(Q \| P)$ and the hypothesis test where $H_0$: $Q$ is the distribution of the data and $H_1$: $P$ is the distribution of the data. What we see is that this value can differ from the one before (such as in the case of comparing a bimodal mixture of Gaussians and a normal distribution), which is precisely why the KL divergence can be asymmetric.
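Here is a rough Monte Carlo sketch of both orderings, assuming NumPy/SciPy; the bimodal mixture $P$ and the single Gaussian $Q$ below are made-up stand-ins for that kind of comparison:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def log_p(x):
    # log density of the mixture 0.5*N(-3, 1) + 0.5*N(3, 1)
    return np.logaddexp(norm.logpdf(x, -3, 1), norm.logpdf(x, 3, 1)) + np.log(0.5)

def log_q(x):
    # log density of a single broad Gaussian N(0, 3)
    return norm.logpdf(x, 0, 3)

# Sample from P: pick a mixture component, then draw from it.
comps = rng.choice([-3.0, 3.0], size=100_000)
x_p = rng.normal(comps, 1.0)
# Sample from Q directly.
x_q = rng.normal(0.0, 3.0, size=100_000)

# D_KL(P || Q): expected log-likelihood ratio when the data really comes from P.
kl_pq = np.mean(log_p(x_p) - log_q(x_p))
# D_KL(Q || P): expected log-likelihood ratio when the data really comes from Q.
kl_qp = np.mean(log_q(x_q) - log_p(x_q))

print(kl_pq, kl_qp)  # the two estimates differ, illustrating the asymmetry
```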
Note to self: I still don't quite understand the connection well.