Information Theory
Entropy
The entropy of a probability distribution can be interpreted as a measure of uncertainty, or lack of predictability, associated with a random variable drawn from that distribution.
Example
Suppose we observe a sequence of symbols $X_n \sim p$ generated from distribution $p$. If $p$ has high entropy, it will be hard to predict the value of each observation $X_n$.
Entropy for discrete random variables
For a discrete random variable $X$ with distribution $p$ over $K$ states, the entropy is defined as
$$\mathbb{H}(X) \triangleq -\sum_{k=1}^{K} p(X=k) \log_2 p(X=k) = -\mathbb{E}\left[\log p(X)\right]$$
Here we use log base 2, in which case the units are called bits; if we use log base $e$, the units are called nats. The uniform distribution has maximum entropy, and a degenerate (deterministic) distribution has zero entropy.
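To make this concrete, here is a minimal numpy sketch (the helper name `entropy` is ours, not a library function) that computes the entropy of a discrete distribution in bits:

```python
import numpy as np

def entropy(p, base=2):
    """Entropy of a discrete distribution p (an array of probabilities)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # 0 log 0 is taken to be 0
    return -np.sum(p * np.log(p)) / np.log(base)

print(entropy([0.25, 0.25, 0.25, 0.25]))  # uniform over 4 states -> 2.0 bits
print(entropy([1.0, 0.0, 0.0, 0.0]))      # deterministic -> zero entropy (prints -0.0)
```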
Application: DNA sequence logos
One application of entropy is measuring how conserved each position is in a set of aligned DNA sequences. In a sequence logo, the total height of the letters {A, C, G, T} at position $t$ is the information content $R_t = \log_2 4 - \mathbb{H}_t = 2 - \mathbb{H}_t$ bits, where $\mathbb{H}_t$ is the entropy of the empirical distribution over letters at that position; highly conserved (low entropy) positions are therefore drawn tall.
Cross entropy
The cross entropy between distribution $p$ and distribution $q$ is defined as
$$\mathbb{H}(p, q) \triangleq -\sum_{k=1}^{K} p_k \log q_k$$
It can be interpreted as the expected number of bits needed to compress samples drawn from $p$ using a code designed for $q$; it is minimized, with value $\mathbb{H}(p)$, when $q = p$.
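A short sketch of the same flavor, illustrating that coding $p$ with a code based on $q \neq p$ costs extra bits (again, `cross_entropy` is our own helper):

```python
import numpy as np

def cross_entropy(p, q, base=2):
    """Cross entropy H(p, q) = -sum_k p_k log q_k."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return -np.sum(p[mask] * np.log(q[mask])) / np.log(base)

p = np.array([0.5, 0.25, 0.25])
q = np.array([1/3, 1/3, 1/3])
print(cross_entropy(p, p))  # = H(p) = 1.5 bits
print(cross_entropy(p, q))  # ~1.585 bits >= H(p): coding p with q's code wastes bits
```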
Joint entropy
The joint entropy of two random variables $X$ and $Y$ is defined as
$$\mathbb{H}(X, Y) \triangleq -\sum_{x, y} p(x, y) \log_2 p(x, y)$$
It satisfies $\max(\mathbb{H}(X), \mathbb{H}(Y)) \leq \mathbb{H}(X, Y) \leq \mathbb{H}(X) + \mathbb{H}(Y)$, with the upper bound attained when $X$ and $Y$ are independent.
Conditional entropy
The conditional entropy of $Y$ given $X$ is the uncertainty we have in $Y$ after seeing $X$, averaged over values of $X$:
$$\mathbb{H}(Y|X) \triangleq \mathbb{E}_{p(X)}\left[\mathbb{H}(p(Y|X))\right] = \mathbb{H}(X, Y) - \mathbb{H}(X)$$
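The following toy example (our own sketch) computes the joint and conditional entropy from a small joint table, using the identity $\mathbb{H}(Y|X) = \mathbb{H}(X,Y) - \mathbb{H}(X)$:

```python
import numpy as np

def H(p, base=2):
    """Entropy in bits of a (possibly multi-dimensional) probability table."""
    p = np.asarray(p, float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log(p)) / np.log(base)

# A toy joint distribution p(x, y): rows index x, columns index y.
pxy = np.array([[0.25, 0.25],
                [0.00, 0.50]])
px = pxy.sum(axis=1)
Hxy = H(pxy)                 # joint entropy H(X, Y) = 1.5 bits
HyGx = Hxy - H(px)           # conditional entropy H(Y|X) = 0.5 bits
print(Hxy, HyGx)
```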
Perplexity
The perplexity of a discrete probability distribution $p$ is defined as
$$\text{perplexity}(p) \triangleq 2^{\mathbb{H}(p)}$$
This is often interpreted as a measure of predictability: it is the effective number of equally likely outcomes, so a uniform distribution over $K$ states has perplexity $K$, while a nearly deterministic distribution has perplexity close to 1. In language modeling, the exponentiated cross entropy on held-out text (the perplexity) is a standard evaluation metric.
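A tiny sketch of this "effective number of states" interpretation (the `perplexity` helper is ours):

```python
import numpy as np

def perplexity(p):
    p = np.asarray(p, float)
    p = p[p > 0]
    H = -np.sum(p * np.log2(p))   # entropy in bits
    return 2.0 ** H

print(perplexity(np.ones(100) / 100))   # uniform over 100 states -> 100.0
print(perplexity([0.98, 0.01, 0.01]))   # concentrated -> ~1.12, close to 1
```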
Differential entropy for continuous random variables
If $X$ is a continuous random variable with pdf $p(x)$, we define the differential entropy as
$$h(X) \triangleq -\int_{\mathcal{X}} p(x) \log p(x)\, dx$$
assuming this integral exists. For example, suppose $X \sim U(0, a)$; then $h(X) = -\int_0^a \frac{1}{a} \log \frac{1}{a}\, dx = \log a$. Note that, unlike the discrete case, differential entropy can be negative, e.g., when $a < 1$.
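Since the differential entropy is an expectation, it can be estimated by Monte Carlo as $-\frac{1}{N}\sum_n \log p(x_n)$ for samples $x_n \sim p$. A small sketch under that interpretation (the setup below is ours), working in nats:

```python
import numpy as np

rng = np.random.default_rng(0)

# Uniform U(0, a): h = log a, which is negative when a < 1.
for a in [4.0, 0.5]:
    x = rng.uniform(0, a, size=100_000)
    print(a, -np.mean(np.full_like(x, np.log(1 / a))), np.log(a))

# Gaussian N(0, s^2): h = 0.5 * log(2 * pi * e * s^2).
s = 2.0
x = rng.normal(0, s, size=100_000)
logp = -0.5 * np.log(2 * np.pi * s**2) - x**2 / (2 * s**2)
print(-np.mean(logp), 0.5 * np.log(2 * np.pi * np.e * s**2))
```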
Relative entropy (KL divergence) *
Given two distributions $p$ and $q$, it is often useful to define a distance metric to measure how "close" or "similar" they are. In fact, we will be more general and consider a divergence measure $D(p, q)$ which quantifies how far $q$ is from $p$, without requiring that $D$ be a metric.
The most common divergence is the Kullback-Leibler (KL) divergence, defined as
$$D_{\mathbb{KL}}(p \| q) \triangleq \sum_{k=1}^{K} p_k \log \frac{p_k}{q_k}$$
It is also known as the information gain or relative entropy between two distributions $p$ and $q$. It satisfies $D_{\mathbb{KL}}(p \| q) \geq 0$ with equality iff $p = q$, but it is not a metric: it is asymmetric and does not satisfy the triangle inequality.
Interpretation: expanding the log, we get
$$D_{\mathbb{KL}}(p \| q) = \sum_k p_k \log p_k - \sum_k p_k \log q_k = -\mathbb{H}(p) + \mathbb{H}(p, q)$$
We recognize the first term as the negative entropy and the second term as the cross entropy. Thus the KL divergence is the expected number of extra bits needed to encode the data using a code based on $q$ instead of the optimal code based on $p$.
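Numerically, the decomposition $D_{\mathbb{KL}}(p\|q) = \mathbb{H}(p,q) - \mathbb{H}(p)$ is easy to verify; here is a small sketch (the helper `kl` is ours):

```python
import numpy as np

def kl(p, q):
    """KL divergence D(p || q) in bits, for discrete p and q."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

p = np.array([0.5, 0.25, 0.25])
q = np.array([1/3, 1/3, 1/3])
Hp = -np.sum(p * np.log2(p))    # entropy H(p)
Hpq = -np.sum(p * np.log2(q))   # cross entropy H(p, q)
print(kl(p, q), Hpq - Hp)       # both ~0.085 bits
print(kl(p, p))                 # 0: KL is zero iff the distributions match
```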
Example: KL divergence between two Gaussians
The KL divergence between two multivariate Gaussian distributions is given by
$$D_{\mathbb{KL}}(\mathcal{N}(\mu_1, \Sigma_1) \| \mathcal{N}(\mu_2, \Sigma_2)) = \frac{1}{2}\left[\operatorname{tr}(\Sigma_2^{-1} \Sigma_1) + (\mu_2 - \mu_1)^\top \Sigma_2^{-1} (\mu_2 - \mu_1) - D + \log \frac{\det \Sigma_2}{\det \Sigma_1}\right]$$
where $D$ is the dimensionality. In the scalar case, this becomes
$$D_{\mathbb{KL}}(\mathcal{N}(\mu_1, \sigma_1^2) \| \mathcal{N}(\mu_2, \sigma_2^2)) = \log \frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2 \sigma_2^2} - \frac{1}{2}$$
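As a sanity check, here is a sketch comparing the scalar closed form against a Monte Carlo estimate of $\mathbb{E}_{p}[\log p(x) - \log q(x)]$ (function and variable names are ours):

```python
import numpy as np

def kl_gauss(mu1, s1, mu2, s2):
    """Closed-form KL(N(mu1, s1^2) || N(mu2, s2^2)) in nats."""
    return np.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5

# Monte Carlo check: D_KL(p || q) = E_{x~p}[log p(x) - log q(x)].
rng = np.random.default_rng(0)
mu1, s1, mu2, s2 = 0.0, 1.0, 1.0, 2.0
x = rng.normal(mu1, s1, size=200_000)
logp = -0.5 * np.log(2 * np.pi * s1**2) - (x - mu1)**2 / (2 * s1**2)
logq = -0.5 * np.log(2 * np.pi * s2**2) - (x - mu2)**2 / (2 * s2**2)
print(kl_gauss(mu1, s1, mu2, s2), np.mean(logp - logq))  # both ~0.443 nats
```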
KL divergence and MLE
Suppose we want to find the distribution $q_\theta$ that is as close as possible to the empirical distribution $p_{\mathcal{D}}$ of the data, in the KL sense. Since $D_{\mathbb{KL}}(p_{\mathcal{D}} \| q_\theta) = -\mathbb{H}(p_{\mathcal{D}}) - \mathbb{E}_{p_{\mathcal{D}}}[\log q_\theta(x)]$ and the entropy term does not depend on $\theta$, minimizing the KL divergence is equivalent to maximizing the average log likelihood: the MLE is the KL projection of the empirical distribution onto the model family.
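A minimal illustration of this equivalence, assuming a Bernoulli model fit by grid search (all names below are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.binomial(1, 0.3, size=1000)            # samples from Ber(0.3)
p_emp = np.array([1 - data.mean(), data.mean()])  # empirical distribution

thetas = np.linspace(0.001, 0.999, 999)
# Cross entropy H(p_emp, q_theta); the -H(p_emp) term is constant in theta,
# so minimizing it is the same as minimizing the KL divergence.
ce = -(p_emp[0] * np.log(1 - thetas) + p_emp[1] * np.log(thetas))
print(thetas[np.argmin(ce)], data.mean())         # both ~0.3: KL minimum = MLE
```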
Forward vs reverse KL
The forwards KL (also called the inclusive KL) is
$$D_{\mathbb{KL}}(p \| q) = \int p(x) \log \frac{p(x)}{q(x)}\, dx$$
Minimizing this wrt $q$ requires $q > 0$ wherever $p > 0$ (otherwise the objective blows up), so the optimal $q$ is zero-avoiding or mode-covering, and will typically over-estimate the support of $p$. The reverse KL (also called the exclusive KL) is
$$D_{\mathbb{KL}}(q \| p) = \int q(x) \log \frac{q(x)}{p(x)}\, dx$$
Minimizing this wrt $q$ requires $q = 0$ wherever $p = 0$, so the optimal $q$ is zero-forcing or mode-seeking, and will typically under-estimate the support of $p$, e.g., by locking onto a single mode of a multimodal $p$.
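The mode-covering vs mode-seeking behavior is easy to see numerically by fitting a single Gaussian to a discretized bimodal target under each objective; a sketch under those assumptions (the grid, target, and helper names are all ours):

```python
import numpy as np

x = np.linspace(-10, 10, 2001)
dx = x[1] - x[0]

def normal(x, mu, s):
    return np.exp(-(x - mu)**2 / (2 * s**2)) / (s * np.sqrt(2 * np.pi))

p = 0.5 * normal(x, -3, 0.5) + 0.5 * normal(x, 3, 0.5)   # bimodal target

def kl(a, b):                      # discretized KL(a || b) on the grid
    a, b = np.maximum(a, 1e-300), np.maximum(b, 1e-300)
    return np.sum(a * np.log(a / b)) * dx

best_f = best_r = (np.inf, None, None)
for mu in np.linspace(-4, 4, 81):
    for s in np.linspace(0.3, 5, 48):
        q = normal(x, mu, s)
        best_f = min(best_f, (kl(p, q), mu, s))
        best_r = min(best_r, (kl(q, p), mu, s))

print("forward KL :", best_f)   # wide q near mu=0, covering both modes
print("reverse KL :", best_r)   # narrow q locked onto a single mode
```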
Mutual information
The mutual information between rv's $X$ and $Y$ is defined as
$$\mathbb{I}(X; Y) \triangleq D_{\mathbb{KL}}(p(x, y) \| p(x) p(y)) = \sum_{y} \sum_{x} p(x, y) \log \frac{p(x, y)}{p(x) p(y)}$$
It is non-negative, and zero iff $X$ and $Y$ are independent. It can also be written as
$$\mathbb{I}(X; Y) = \mathbb{H}(X) - \mathbb{H}(X|Y) = \mathbb{H}(Y) - \mathbb{H}(Y|X)$$
i.e., the reduction in uncertainty about $X$ after observing $Y$ (and vice versa).
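A short sketch (our `mutual_information` helper) computing MI from a joint probability table:

```python
import numpy as np

def mutual_information(pxy):
    """I(X;Y) in bits from a joint probability table pxy[i, j]."""
    pxy = np.asarray(pxy, float)
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    mask = pxy > 0
    return np.sum(pxy[mask] * np.log2((pxy / (px * py))[mask]))

indep = np.outer([0.5, 0.5], [0.3, 0.7])   # p(x, y) = p(x) p(y)
copy  = np.array([[0.5, 0.0],
                  [0.0, 0.5]])             # Y = X, a fair coin flip
print(mutual_information(indep))  # 0.0: independence means zero MI
print(mutual_information(copy))   # 1.0 bit: Y reveals X completely
```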
Conditional mutual information
We can define the conditional mutual information as
$$\mathbb{I}(X; Y | Z) \triangleq \mathbb{E}_{p(Z)}\left[\mathbb{I}(X; Y) | Z\right] = \mathbb{H}(X|Z) + \mathbb{H}(Y|Z) - \mathbb{H}(X, Y|Z)$$
This gives the chain rule for mutual information:
$$\mathbb{I}(Z_1, Z_2; X) = \mathbb{I}(Z_1; X) + \mathbb{I}(Z_2; X | Z_1)$$
and more generally $\mathbb{I}(Z_1, \ldots, Z_N; X) = \sum_{n=1}^{N} \mathbb{I}(Z_n; X | Z_1, \ldots, Z_{n-1})$.
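The chain rule can be verified numerically on a random joint table by expressing each term through entropies (e.g., $\mathbb{I}(A;B) = \mathbb{H}(A) + \mathbb{H}(B) - \mathbb{H}(A,B)$); a sketch with names of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(0)
p = rng.random((2, 3, 3)); p /= p.sum()    # random joint p(x, z1, z2)

def H(t):                                   # entropy in bits of any joint table
    q = t.ravel(); q = q[q > 0]
    return -np.sum(q * np.log2(q))

Hx, Hz1 = H(p.sum(axis=(1, 2))), H(p.sum(axis=(0, 2)))
I_joint = H(p.sum(axis=0)) + Hx - H(p)                            # I(Z1,Z2; X)
I_z1 = Hz1 + Hx - H(p.sum(axis=2))                                # I(Z1; X)
I_z2_given_z1 = H(p.sum(axis=0)) + H(p.sum(axis=2)) - Hz1 - H(p)  # I(Z2; X | Z1)
print(I_joint, I_z1 + I_z2_given_z1)   # equal (up to float error): chain rule
```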
MI as a generalized correlation coefficient
For a bivariate Gaussian with correlation coefficient $\rho$, one can show that $\mathbb{I}(X; Y) = -\frac{1}{2} \log(1 - \rho^2)$, which is $0$ when $\rho = 0$ and tends to $\infty$ as $|\rho| \to 1$. Unlike the correlation coefficient, however, MI also captures nonlinear dependencies, so it can be viewed as a generalized correlation coefficient.
Normalized mutual information
For discrete variables, MI can be normalized to lie in $[0, 1]$, e.g., via
$$NMI(X, Y) = \frac{\mathbb{I}(X; Y)}{\min(\mathbb{H}(X), \mathbb{H}(Y))}$$
which is $0$ for independent variables and $1$ when one variable is a deterministic function of the other.
Maximal information coefficient
The maximal information coefficient (MIC) is defined as
$$\text{MIC}(X, Y) = \max_{G} \frac{\mathbb{I}((X, Y)|_G)}{\log \|G\|}$$
where $G$ ranges over a set of 2d grids, $(X, Y)|_G$ denotes the discretization of the variables onto the grid, and $\|G\|$ is the minimum of the number of rows and columns of the grid. MIC lies in $[0, 1]$, where $0$ indicates no relationship between the variables and $1$ a noise-free relationship of any form, not just linear.
Data processing inequality
Suppose we have an unknown variable $X$, and we observe a noisy function of it, $Y$. If we process $Y$ in some way (deterministically or stochastically) to produce a new variable $Z$, so that $X \to Y \to Z$ forms a Markov chain ($Z \perp X \mid Y$), then the data processing inequality states that
$$\mathbb{I}(X; Y) \geq \mathbb{I}(X; Z)$$
In other words, processing the observations cannot increase the amount of information we have about $X$.
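A quick numerical illustration: build a Markov chain $X \to Y \to Z$ from a prior and two channel matrices (both invented for this example) and check that $\mathbb{I}(X; Z)$ does not exceed $\mathbb{I}(X; Y)$:

```python
import numpy as np

def mi(pab):   # I(A;B) in bits from a 2d joint table
    pa = pab.sum(axis=1, keepdims=True); pb = pab.sum(axis=0, keepdims=True)
    mask = pab > 0
    return np.sum(pab[mask] * np.log2((pab / (pa * pb))[mask]))

px = np.array([0.4, 0.6])
T1 = np.array([[0.9, 0.1], [0.2, 0.8]])   # observation channel p(y|x)
T2 = np.array([[0.7, 0.3], [0.3, 0.7]])   # post-processing p(z|y)

pxyz = px[:, None, None] * T1[:, :, None] * T2[None, :, :]  # chain X -> Y -> Z
pxy = pxyz.sum(axis=2)
pxz = pxyz.sum(axis=1)
print(mi(pxy), mi(pxz))   # I(X;Y) >= I(X;Z): processing cannot add information
```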
Sufficient statistics
An important consequence of the DPI is the following. Suppose we have the chain $\theta \to \mathcal{D} \to s(\mathcal{D})$, where $s(\mathcal{D})$ is some statistic computed from the data $\mathcal{D}$. Then $\mathbb{I}(\theta; s(\mathcal{D})) \leq \mathbb{I}(\theta; \mathcal{D})$. If this holds with equality, then we say that $s(\mathcal{D})$ is a sufficient statistic of the data for inferring $\theta$: it captures all the information in $\mathcal{D}$ that is relevant to $\theta$, which happens iff $\theta \to s(\mathcal{D}) \to \mathcal{D}$ is also a Markov chain.
Fano’s inequality
A common method for feature selection is to pick input features that have high mutual information with the response variable. This is justified by Fano's inequality, which bounds the probability of misclassification, for any estimator, in terms of the mutual information between the features and the label: if $\hat{X} = f(Y)$ is any estimate of $X$ based on $Y$, and $P_e = P(\hat{X} \neq X)$, then
$$\mathbb{H}(P_e) + P_e \log(|\mathcal{X}| - 1) \geq \mathbb{H}(X | Y) = \mathbb{H}(X) - \mathbb{I}(X; Y)$$
so a higher $\mathbb{I}(X; Y)$ permits a lower error rate. A weaker but simpler consequence (using logs base 2) is $P_e \geq \frac{\mathbb{H}(X|Y) - 1}{\log_2 |\mathcal{X}|}$.
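As a sketch, we can evaluate the weaker form of the bound for a symmetric noisy channel, where the true Bayes error rate is known to be $\epsilon$ (the setup below is our own toy example):

```python
import numpy as np

K, eps = 8, 0.2
# Symmetric channel: Y = X w.p. 1 - eps, else uniform over the other K-1 symbols.
T = np.full((K, K), eps / (K - 1)); np.fill_diagonal(T, 1 - eps)
pxy = T / K                                # joint p(x, y), with X uniform

def H(t):
    q = t.ravel(); q = q[q > 0]
    return -np.sum(q * np.log2(q))

H_x_given_y = H(pxy) - H(pxy.sum(axis=0))  # H(X|Y) = H(X,Y) - H(Y)
fano = (H_x_given_y - 1) / np.log2(K)      # P(err) >= (H(X|Y) - 1) / log2 K
print(fano, eps)   # lower bound (~0.09) vs the actual Bayes error rate (0.2)
```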