Information Theory

Entropy

The entropy of a probability distribution can be interpreted as a measure of the uncertainty, or lack of predictability, of a random variable drawn from that distribution.

Example

Suppose we observe a sequence of symbols X_n \sim p generated from distribution p. If p has high entropy, it will be hard to predict the value of each observation X_n.

Entropy for discrete random variables

The entropy of a discrete random variable X with distribution p over K states is defined by

H(X) = - \sum_{k=1}^{K} p(X = k) \log_2 p(X = k)

Here H(X) denotes the entropy of the rv X with distribution p. The discrete distribution with maximum entropy is the uniform distribution. Hence for a K-ary random variable, the entropy is maximized if p(x = k) = 1/K, in which case H(X) = \log_2 K.
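A quick numerical check, as a minimal sketch in Python (the helper name `entropy` is ours):

```python
import math

def entropy(p, base=2):
    """H(X) = -sum_k p_k log p_k for a discrete distribution p (bits for base 2)."""
    return -sum(pk * math.log(pk, base) for pk in p if pk > 0)

# The uniform distribution attains the maximum: log2(4) = 2 bits.
print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0
# A skewed distribution is more predictable, so its entropy is lower.
print(entropy([0.7, 0.1, 0.1, 0.1]))
```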

Application : DNA sequence logos

Cross entropy

The cross entropy between distributions p and q is defined by

H(p, q) = - \sum_{k=1}^{K} p_k \log_2 q_k
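A small sketch (helper name ours) showing that H(p, q) >= H(p), i.e. encoding samples from p with the wrong distribution q costs extra bits:

```python
import math

def cross_entropy(p, q):
    """H(p, q) = -sum_k p_k log2 q_k, in bits."""
    return -sum(pk * math.log2(qk) for pk, qk in zip(p, q) if pk > 0)

p = [0.5, 0.25, 0.25]
q = [1/3, 1/3, 1/3]
print(cross_entropy(p, p))  # 1.5, which is just H(p)
print(cross_entropy(p, q))  # larger than H(p), since q != p
```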

Joint entropy

The joint entropy of two random variables X and Y is defined as

H(X, Y) = - \sum_{x, y} p(x, y) \log_2 p(x, y)

Conditional entropy

The conditional entropy of Y given X is the uncertainty we have in Y after seeing X, averaged over possible values of X:

H(Y | X) = \sum_x p(x) H(Y | X = x) = - \sum_{x, y} p(x, y) \log_2 p(y | x)
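The chain rule H(X, Y) = H(X) + H(Y | X) can be checked numerically on a small joint table; the pmf below is a hypothetical example of ours:

```python
import math

def H(dist):
    """Entropy in bits of a pmf given as {outcome: prob}."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# Hypothetical joint pmf p(x, y)
pxy = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.4, (1, 1): 0.1}

# Marginal p(x)
px = {}
for (x, y), p in pxy.items():
    px[x] = px.get(x, 0.0) + p

H_xy = H(pxy)             # joint entropy H(X, Y)
H_x = H(px)               # marginal entropy H(X)
H_y_given_x = H_xy - H_x  # conditional entropy via the chain rule
print(H_xy, H_x, H_y_given_x)
```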

Perplexity

The perplexity of a discrete probability distribution p is defined as

perplexity(p) = 2^{H(p)}

This is often interpreted as a measure of predictability.
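For instance (helper name ours), perplexity can be read as the effective number of equally likely outcomes:

```python
import math

def perplexity(p):
    """2 ** H(p): the effective number of equally likely outcomes."""
    H = -sum(pk * math.log2(pk) for pk in p if pk > 0)
    return 2 ** H

print(perplexity([0.25] * 4))  # 4.0, as unpredictable as a fair 4-sided die
print(perplexity([1.0, 0.0]))  # 1.0, a deterministic outcome
```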

Differential entropy for continuous random variables

If X is a continuous random variable with pdf p(x), we define the differential entropy as

h(X) = - \int_{\mathcal{X}} p(x) \log_2 p(x) \, dx

assuming this integral exists. For example, suppose X \sim \mathcal{N}(\mu, \sigma^2). Then

h(X) = \frac{1}{2} \log_2 (2 \pi e \sigma^2)
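The Gaussian closed form can be checked against a direct Riemann-sum approximation of the integral (all names ours; entropies in bits):

```python
import math

sigma = 1.5
# Closed form: h(X) = 0.5 * log2(2 * pi * e * sigma^2)
h_exact = 0.5 * math.log2(2 * math.pi * math.e * sigma**2)

# Numerical check: -integral of p(x) log2 p(x) dx on a fine grid
pdf = lambda x: math.exp(-x**2 / (2 * sigma**2)) / (sigma * math.sqrt(2 * math.pi))
dx = 0.001
h_numeric = -sum(pdf(i * dx) * math.log2(pdf(i * dx)) * dx
                 for i in range(-10000, 10000))
print(h_exact, h_numeric)  # both roughly 2.63 bits
```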


Relative entropy (KL divergence) *

Given two distributions p and q, it is often useful to define a distance metric to measure how “close” or “similar” they are. In fact, we will be more general and consider a divergence measure D(p, q) which quantifies how far q is from p, without requiring that D be a metric. Below we focus on the Kullback–Leibler divergence, also known as the information gain or relative entropy between p and q. For discrete distributions, the KL divergence is defined as follows:

KL(p \| q) = \sum_{k=1}^{K} p_k \log_2 \frac{p_k}{q_k}

Interpretation :

Expanding the logarithm gives

KL(p \| q) = \sum_k p_k \log_2 p_k - \sum_k p_k \log_2 q_k = -H(p) + H(p, q)

We recognize the first term as the negative entropy and the second term as the cross entropy.
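This decomposition of KL into cross entropy minus entropy is easy to verify numerically (helper names ours):

```python
import math

def kl(p, q):
    """KL(p || q) = sum_k p_k log2 (p_k / q_k), in bits."""
    return sum(pk * math.log2(pk / qk) for pk, qk in zip(p, q) if pk > 0)

p = [0.5, 0.25, 0.25]
q = [0.25, 0.25, 0.5]
H_p = -sum(pk * math.log2(pk) for pk in p)               # entropy H(p)
H_pq = -sum(pk * math.log2(qk) for pk, qk in zip(p, q))  # cross entropy H(p, q)
print(kl(p, q))    # 0.25
print(H_pq - H_p)  # the same value: KL = cross entropy - entropy
```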

Example : KL divergence between two Gaussians

The KL divergence between two multivariate Gaussian distributions is given by

KL(\mathcal{N}(\mu_1, \Sigma_1) \| \mathcal{N}(\mu_2, \Sigma_2)) = \frac{1}{2} \left[ \mathrm{tr}(\Sigma_2^{-1} \Sigma_1) + (\mu_2 - \mu_1)^\top \Sigma_2^{-1} (\mu_2 - \mu_1) - D + \log \frac{\det(\Sigma_2)}{\det(\Sigma_1)} \right]

where D is the dimensionality.

In the scalar case, this becomes

KL(\mathcal{N}(\mu_1, \sigma_1^2) \| \mathcal{N}(\mu_2, \sigma_2^2)) = \log \frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2 \sigma_2^2} - \frac{1}{2}
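A sketch of the scalar formula (function name ours), using natural log so the result is in nats:

```python
import math

def kl_gauss(mu1, s1, mu2, s2):
    """KL(N(mu1, s1^2) || N(mu2, s2^2)) in nats, scalar closed form."""
    return math.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5

print(kl_gauss(0, 1, 0, 1))  # 0.0: the divergence vanishes iff the Gaussians match
print(kl_gauss(0, 1, 2, 1))  # 2.0: shifting the mean by 2 costs (2^2)/2 nats
```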

KL divergence and MLE

Forward vs reverse KL

Forwards KL (inclusive KL)

KL(p \| q) = \sum_x p(x) \log \frac{p(x)}{q(x)}

Minimizing this wrt q is known as an M-projection or moment projection.

Reverse KL (exclusive KL)

KL(q \| p) = \sum_x q(x) \log \frac{q(x)}{p(x)}

Minimizing this wrt q is known as an I-projection or information projection.
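The qualitative difference can be seen by fitting a single Gaussian q = N(mu, 1) to a bimodal target p on a grid and minimizing each divergence by brute force (a sketch under our own toy setup):

```python
import math

def normal(x, mu, s):
    return math.exp(-(x - mu)**2 / (2 * s**2)) / (s * math.sqrt(2 * math.pi))

# Target p: symmetric mixture with modes at -2 and +2
xs = [i * 0.01 for i in range(-600, 601)]
p = [0.5 * normal(x, -2, 0.5) + 0.5 * normal(x, 2, 0.5) for x in xs]

def kl(a, b):
    """Grid approximation of KL(a || b) in nats."""
    return sum(ai * math.log(ai / bi) * 0.01 for ai, bi in zip(a, b) if ai > 1e-300)

mus = [i * 0.1 for i in range(-30, 31)]
fwd = min(mus, key=lambda m: kl(p, [normal(x, m, 1) for x in xs]))  # M-projection
rev = min(mus, key=lambda m: kl([normal(x, m, 1) for x in xs], p))  # I-projection
print(fwd, rev)  # forward KL straddles both modes (mu near 0); reverse KL locks onto one
```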


Mutual Information

The mutual information between rv's X and Y is defined as follows:

I(X; Y) = KL(p(x, y) \| p(x) p(y)) = \sum_x \sum_y p(x, y) \log \frac{p(x, y)}{p(x) p(y)}
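On a small joint table, I(X; Y) can be computed directly from the definition (the pmf is a hypothetical example of ours):

```python
import math

# Hypothetical joint pmf over (X, Y); independence would give I(X; Y) = 0
pxy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
px = {0: 0.5, 1: 0.5}
py = {0: 0.5, 1: 0.5}

# I(X; Y) = sum_{x,y} p(x,y) log2 [ p(x,y) / (p(x) p(y)) ]
I = sum(p * math.log2(p / (px[x] * py[y])) for (x, y), p in pxy.items())
print(I)  # about 0.28 bits: X and Y are dependent
```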

Conditional Mutual information

We can define the conditional mutual information as

I(X; Y | Z) = E_{p(Z)} [ I(X; Y) | Z ] = H(X | Z) + H(Y | Z) - H(X, Y | Z)

Chain rule for mutual information:

I(Z_1, \ldots, Z_N ; X) = \sum_{n=1}^{N} I(Z_n ; X | Z_1, \ldots, Z_{n-1})

MI as a generalized correlation coefficient
Normalized mutual information
Maximal information coefficient

The maximal information coefficient is defined as

MIC(X, Y) = \max_{G \in \mathcal{G}} \frac{I((X, Y)|_G)}{\log_2 \|G\|}

where \mathcal{G} is the set of 2d grids, (X, Y)|_G represents a discretization of the variables onto this grid, and \|G\| is \min(G_x, G_y), where G_x is the number of grid cells in the x direction and G_y is the number of grid cells in the y direction.

Data processing inequality

Suppose we have an unknown variable X, and we observe a noisy function of it, call it Y. If we process the noisy observations in some way to create a new variable Z, it should be intuitively obvious that we cannot increase the amount of information we have about the unknown quantity X. This is known as the data processing inequality: if X \to Y \to Z forms a Markov chain, then

I(X; Y) \geq I(X; Z)
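A sketch of the DPI on a concrete Markov chain X -> Y -> Z built from two noisy binary channels (the flip probabilities are our own toy choices):

```python
import math

def mi(joint):
    """Mutual information in bits from a joint pmf given as {(a, b): prob}."""
    pa, pb = {}, {}
    for (a, b), p in joint.items():
        pa[a] = pa.get(a, 0.0) + p
        pb[b] = pb.get(b, 0.0) + p
    return sum(p * math.log2(p / (pa[a] * pb[b]))
               for (a, b), p in joint.items() if p > 0)

eps1, eps2 = 0.1, 0.2  # flip probabilities of the two channels
pxy = {(x, y): 0.5 * ((1 - eps1) if x == y else eps1)
       for x in (0, 1) for y in (0, 1)}
pxz = {}
for (x, y), p in pxy.items():
    for z in (0, 1):
        pxz[(x, z)] = pxz.get((x, z), 0.0) + p * ((1 - eps2) if z == y else eps2)

print(mi(pxy), mi(pxz))  # processing can only lose information: I(X;Z) <= I(X;Y)
```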

Sufficient Statistics

An important consequence of the DPI is the following. Suppose we have the chain \theta \to X \to s(X). Then

I(\theta ; s(X)) \leq I(\theta ; X)

If this holds with equality, then we say that s(X) is a sufficient statistic of the data X for the purposes of inferring \theta.

Fano’s Inequality

A common method for feature selection is to pick input features X which have high mutual information with the response variable Y. Below we justify why this is a reasonable thing to do. In particular, we state a result, known as Fano's inequality, which bounds the probability of misclassification (for any method) in terms of the mutual information between the features X and the class label Y. In a commonly used weakened form, it states that for any estimator \hat{Y} = f(X) of a K-ary label Y,

P(\hat{Y} \neq Y) \geq \frac{H(Y | X) - 1}{\log_2 K} = \frac{H(Y) - I(X; Y) - 1}{\log_2 K}

so the larger the mutual information I(X; Y), the smaller the lower bound on the misclassification error.
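A minimal sketch (function name ours) of the weakened bound P(error) >= (H(Y|X) - 1) / log2(K):

```python
import math

def fano_lower_bound(H_Y_given_X, K):
    """Lower bound on the error of ANY classifier of a K-ary label Y given X."""
    return max(0.0, (H_Y_given_X - 1.0) / math.log2(K))

# If the features leave 2 bits of uncertainty about a 10-class label,
# no classifier can achieve error below about 30%.
print(fano_lower_bound(2.0, 10))
```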