Univariate Models

Introduction

Probability

Pierre Laplace

Probability Theory is nothing but common sense reduced to calculation.

There are two different interpretations of probability.

  1. Frequentist interpretation - In this view, probabilities represent long-run frequencies of events that can happen multiple times.
  2. Bayesian interpretation - In this view, probability is used to quantify our uncertainty about something; it is related to information rather than repeated trials.

Types of uncertainty

  • Epistemic uncertainty - Uncertainty due to our ignorance of the underlying hidden causes or mechanisms generating our data. A simpler term for this type of uncertainty is model uncertainty.
  • Aleatoric uncertainty - Uncertainty that arises from intrinsic variability, which we cannot reduce even if we collect more data. A simpler term for this is data uncertainty.

Probability as an extension of logic

Probability of an event

We define an event, denoted by the binary variable A, as some state of the world that either holds or does not hold. For example, A might be the event “it will rain tomorrow”, or “it rained yesterday”, or “the label is y = 1”, or “the parameter θ is between 1.5 and 2.0”, etc. The expression Pr(A) denotes the probability with which you believe event A is true (or the long-run fraction of times that A will occur). We require that 0 ≤ Pr(A) ≤ 1, where Pr(A) = 0 means the event definitely will not happen, and Pr(A) = 1 means the event definitely will happen.

We write Pr(¬A) to denote the probability of event A not happening; this is defined to be Pr(¬A) = 1 − Pr(A).

Probability of a conjunction of two events

We define the joint probability of events A and B both happening as follows:

Pr(A ∧ B) = Pr(A, B)

If A and B are independent events, we have

Pr(A ∧ B) = Pr(A) Pr(B)

Example

Suppose X and Y are each chosen uniformly at random from the set {1, 2, 3, 4}. Let A be the event that X ∈ {1, 2}, and B be the event that Y ∈ {3}. Then we have Pr(A ∧ B) = Pr(A) Pr(B) = 1/2 × 1/4 = 1/8.
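The factorization above can be checked by brute-force enumeration. A minimal sketch in Python, assuming for illustration that X and Y are drawn uniformly and independently from {1, 2, 3, 4} (the specific set is hypothetical):

```python
from fractions import Fraction
from itertools import product

# Hypothetical illustration: X and Y drawn uniformly and independently from {1,2,3,4}.
outcomes = list(product(range(1, 5), repeat=2))
prob = Fraction(1, len(outcomes))  # each (x, y) pair is equally likely

# Event A: X is 1 or 2; event B: Y is 3.
p_A = sum(prob for (x, y) in outcomes if x in (1, 2))
p_B = sum(prob for (x, y) in outcomes if y == 3)
p_AB = sum(prob for (x, y) in outcomes if x in (1, 2) and y == 3)

# Independence: the joint probability factorizes into the product of the marginals.
assert p_AB == p_A * p_B
print(p_AB)  # 1/8
```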

Probability of a union of two events

The probability of event A or B happening is given by

Pr(A ∨ B) = Pr(A) + Pr(B) − Pr(A ∧ B)

If the events are mutually exclusive, we get

Pr(A ∨ B) = Pr(A) + Pr(B)

Conditional probability of one event given another

We define the conditional probability of event B happening given that A has occurred as follows:

Pr(B | A) ≜ Pr(A ∧ B) / Pr(A)

This is not defined if Pr(A) = 0, since we cannot condition on an impossible event.

Independence of events

We say that event A is independent of event B if

Pr(A ∧ B) = Pr(A) Pr(B)

Conditional Independence of events

We say that events A and B are conditionally independent given event C if

Pr(A ∧ B | C) = Pr(A | C) Pr(B | C)

This is written as A ⊥ B | C.

Random variables

If the value of X is unknown and/or could change, we call it a random variable or rv. The set of possible values, denoted 𝒳, is known as the sample space or state space. An event is a set of outcomes from a given sample space.

Discrete random variables

If the sample space 𝒳 is finite or countably infinite, then X is called a discrete random variable. We denote the probability of the event that X has value x by Pr(X = x). We define the probability mass function or pmf as a function which computes the probability of events which correspond to setting the rv to each possible value:

p(x) ≜ Pr(X = x)

The pmf satisfies the properties 0 ≤ p(x) ≤ 1 and Σ_{x∈𝒳} p(x) = 1.

Continuous random variables

If X is a real-valued quantity, it is called a continuous random variable.

Cumulative distribution function (cdf)

Define the events A = (X ≤ a), B = (X ≤ b), and C = (a < X ≤ b), where a < b. We have B = A ∨ C, and since A and C are mutually exclusive, the sum rule gives

Pr(B) = Pr(A) + Pr(C)

and hence the probability of being in the interval (a, b] is given by

Pr(C) = Pr(B) − Pr(A)

In general, we define the cumulative distribution function or cdf of the rv X as follows:

P(x) ≜ Pr(X ≤ x)

Using this, we can compute the probability of being in any interval as follows:

Pr(a < X ≤ b) = P(b) − P(a)

Cdf’s are monotonically non-decreasing functions.

Probability Density function (pdf)

We define the probability density function or pdf as the derivative of the cdf:

p(x) ≜ (d/dx) P(x)

Given a pdf, we can compute the probability of a continuous variable being in a finite interval as follows:

Pr(a < X ≤ b) = ∫_a^b p(x) dx = P(b) − P(a)

As the size of the interval gets smaller, we can write

Pr(x < X ≤ x + dx) ≈ p(x) dx

Intuitively, this says the probability of X being in a small interval around x is the density at x times the width of the interval.

Quantiles

If the cdf is strictly monotonically increasing, it has an inverse, called the inverse cdf or percent point function (ppf) or quantile function.

If P is the cdf of X, then P⁻¹(q) is the value x_q such that Pr(X ≤ x_q) = q; this is called the q’th quantile of P. The value P⁻¹(0.5) is the median of the distribution, with half of the probability mass on the left, and half on the right. The values P⁻¹(0.25) and P⁻¹(0.75) are the lower and upper quartiles.
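Since a strictly increasing cdf can be inverted numerically, the quantile function can be computed by bisection. A minimal sketch for the standard normal (the `norm_cdf`/`norm_ppf` names are just for this example):

```python
import math

def norm_cdf(x):
    # cdf of the standard normal, expressed via the error function
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def norm_ppf(q, lo=-10.0, hi=10.0, tol=1e-12):
    # Invert the cdf by bisection; valid because the cdf is strictly increasing.
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if norm_cdf(mid) < q:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

median = norm_ppf(0.5)    # half the mass on each side of this point
lower_q = norm_ppf(0.25)  # lower quartile
upper_q = norm_ppf(0.75)  # upper quartile
assert abs(median) < 1e-9                 # the standard normal's median is 0
assert abs(lower_q + upper_q) < 1e-9      # quartiles are symmetric about 0
```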

Given a joint distribution, we define the marginal distribution of an rv as follows:

p(X = x) = Σ_y p(X = x, Y = y)

This is sometimes called the sum rule or the rule of total probability.

We define the conditional distribution of an rv using

p(Y = y | X = x) = p(X = x, Y = y) / p(X = x)

We can rearrange this equation to get

p(x, y) = p(x) p(y | x)

This is called the product rule. By extending the product rule to D variables, we get the chain rule of probability:

p(x_{1:D}) = p(x₁) p(x₂ | x₁) p(x₃ | x₁, x₂) ⋯ p(x_D | x_{1:D−1})
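The sum rule, conditioning, and product rule can be verified mechanically on a small joint table. A sketch with a hypothetical joint distribution over two binary variables:

```python
from fractions import Fraction

# Hypothetical joint distribution over X in {0,1} and Y in {0,1}
joint = {
    (0, 0): Fraction(1, 8), (0, 1): Fraction(3, 8),
    (1, 0): Fraction(2, 8), (1, 1): Fraction(2, 8),
}

# Sum rule: marginalize out Y
p_x = {x: sum(p for (xx, y), p in joint.items() if xx == x) for x in (0, 1)}

# Conditioning: p(y | x) = p(x, y) / p(x)
p_y_given_x = {(x, y): joint[(x, y)] / p_x[x] for (x, y) in joint}

# Product rule: p(x, y) = p(x) p(y | x), exactly, for every cell
for (x, y), p in joint.items():
    assert p == p_x[x] * p_y_given_x[(x, y)]
```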

Independence and conditional independence

We say X and Y are unconditionally independent or marginally independent, denoted X ⊥ Y, if we can represent the joint as the product of the two marginals:

p(X, Y) = p(X) p(Y)

We say a set of variables X₁, …, Xₙ is (mutually) independent if the joint can be written as a product of marginals for all subsets {X₁, …, Xₘ} ⊆ {X₁, …, Xₙ}:

p(X₁, …, Xₘ) = Π_{i=1}^m p(X_i)

We say X and Y are conditionally independent given Z, denoted X ⊥ Y | Z, iff the conditional joint can be written as a product of conditional marginals:

p(X, Y | Z) = p(X | Z) p(Y | Z)

Moments of a distribution
Mean of a distribution

The most familiar property of a distribution is its mean, or expected value, often denoted by μ. For continuous rv’s, the mean is defined as follows:

E[X] ≜ ∫_𝒳 x p(x) dx

If the integral is not finite, the mean is not defined. For discrete rv’s, the mean is defined as follows:

E[X] ≜ Σ_{x∈𝒳} x p(x)

However, this is only meaningful if the values of x are ordered in some way. Since expectation is a linear operator, we have

E[aX + b] = a E[X] + b

This is called the linearity of expectation.

Variance of a distribution

The variance is a measure of the spread of a distribution, often denoted by σ². It is defined as:

V[X] ≜ E[(X − μ)²] = E[X²] − μ²

from which we derive the useful result

E[X²] = σ² + μ²

The standard deviation is defined as

std[X] ≜ √V[X] = σ
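The identity V[X] = E[X²] − μ² can be checked exactly with rational arithmetic. A sketch using a fair six-sided die as the (illustrative) distribution:

```python
from fractions import Fraction

# Hypothetical pmf: a fair six-sided die
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

mean = sum(x * p for x, p in pmf.items())                # E[X]
e_x2 = sum(x * x * p for x, p in pmf.items())            # E[X^2]
var = sum((x - mean) ** 2 * p for x, p in pmf.items())   # E[(X - mu)^2]

# V[X] = E[X^2] - mu^2, equivalently E[X^2] = sigma^2 + mu^2
assert var == e_x2 - mean ** 2
print(mean, var)  # 7/2 35/12
```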

Mode of a distribution

The mode of a distribution is the value with the highest probability mass or probability density:

x* = argmax_x p(x)

Conditional Moments

When we have two or more dependent random variables, we can compute the moments of one given knowledge of the other.

  • Law of iterated expectations (law of total expectation): E[X] = E_Y[E[X | Y]]
  • Law of total variance (conditional variance formula): V[X] = E_Y[V[X | Y]] + V_Y[E[X | Y]]
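Both laws can be verified exactly on a small discrete example. A sketch with a hypothetical two-stage experiment (flip a fair coin to pick a 4- or 6-sided die, then roll it):

```python
from fractions import Fraction

# Hypothetical hierarchy: Y picks a die, X is the roll of that die
p_y = {0: Fraction(1, 2), 1: Fraction(1, 2)}
p_x_given_y = {
    0: {x: Fraction(1, 4) for x in range(1, 5)},  # 4-sided die if Y = 0
    1: {x: Fraction(1, 6) for x in range(1, 7)},  # 6-sided die if Y = 1
}

def mean(pmf):
    return sum(x * p for x, p in pmf.items())

def var(pmf):
    m = mean(pmf)
    return sum((x - m) ** 2 * p for x, p in pmf.items())

# Marginal pmf of X, by the sum rule
p_x = {}
for y, py in p_y.items():
    for x, px in p_x_given_y[y].items():
        p_x[x] = p_x.get(x, Fraction(0)) + py * px

# Law of iterated expectations: E[X] = E_Y[E[X|Y]]
assert mean(p_x) == sum(p_y[y] * mean(p_x_given_y[y]) for y in p_y)

# Law of total variance: V[X] = E_Y[V[X|Y]] + V_Y[E[X|Y]]
e_var = sum(p_y[y] * var(p_x_given_y[y]) for y in p_y)
v_mean = sum(p_y[y] * (mean(p_x_given_y[y]) - mean(p_x)) ** 2 for y in p_y)
assert var(p_x) == e_var + v_mean
```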

Limitation of summary Statistics

Although it is common to summarize a probability distribution using simple statistics such as mean and variance, this can lose a lot of information. A striking example of this is known as Anscombe’s quartet.

Bayes’ rule

Sir Harold Jeffreys, 1973

Bayes’ theorem is to the theory of probability what Pythagoras’s theorem is to geometry.

Inference means “the act of passing from sample data to generalization, usually with calculated degrees of certainty”. And the term “Bayesian” is used to refer to inference methods that represent “degree of certainty” using probability theory.

Bayes’ rule itself is very simple; it is just a formula for computing the probability distribution over possible values of an unknown quantity H given some observed data Y = y:

p(H = h | Y = y) = p(H = h) p(Y = y | H = h) / p(Y = y)

This follows automatically from the identity

p(h | y) p(y) = p(h) p(y | h) = p(h, y)

In this equation, the term p(H = h) represents the prior distribution. The term p(Y = y | H = h) represents the observation distribution; when evaluated at the observed y, the function h ↦ p(Y = y | H = h) is called the likelihood. p(Y = y) is known as the marginal likelihood. p(H = h | Y = y) is the posterior distribution.

We can summarize Bayes’ rule in words as follows:

posterior ∝ prior × likelihood

Using Bayes’ rule to update a distribution over unknown values of some quantity of interest, given relevant observed data, is called Bayesian inference, or posterior inference. It can also just be called probabilistic inference.
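A classic numeric illustration of Bayes’ rule is a diagnostic test; the numbers below are hypothetical:

```python
# Hypothetical diagnostic-test illustration of Bayes' rule
prior = 0.01           # p(H = sick)
sensitivity = 0.9      # p(Y = positive | H = sick): the likelihood of the data
false_positive = 0.05  # p(Y = positive | H = healthy)

# Marginal likelihood p(Y = positive), by the sum rule
evidence = sensitivity * prior + false_positive * (1 - prior)

# Posterior p(H = sick | Y = positive) = prior * likelihood / evidence
posterior = sensitivity * prior / evidence

# Despite the accurate test, the low prior keeps the posterior modest (~15%)
assert 0.15 < posterior < 0.16
```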

Inverse problems

Probability theory is concerned with predicting a distribution over outcomes y given knowledge (or assumptions) about the state of the world, h. By contrast, inverse probability is concerned with inferring the state of the world from observations of outcomes. We can think of this as inverting the h → y mapping.

Bernoulli and binomial distributions

Definition

Consider tossing a coin, where the probability of the event that it lands heads is given by 0 ≤ θ ≤ 1. Let Y = 1 denote this event, and let Y = 0 denote the event that the coin lands tails. Thus we are assuming that Pr(Y = 1) = θ and Pr(Y = 0) = 1 − θ. This is called the Bernoulli distribution, and can be written as follows:

Y ∼ Ber(θ)

where the symbol ∼ means “is sampled from” or “is distributed as”, and Ber refers to Bernoulli. The probability mass function of this distribution is defined as follows:

Ber(y | θ) = 1 − θ if y = 0, and θ if y = 1

We can write this in a more concise manner as follows:

Ber(y | θ) ≜ θ^y (1 − θ)^(1−y)

Sigmoid (logistic) function

When we want to predict a binary variable y ∈ {0, 1} given some inputs x, we need to use a conditional probability distribution of the form

p(y | x, θ) = Ber(y | f(x; θ))

where f(x; θ) is some function that predicts the mean parameter of the output distribution. To ensure 0 ≤ f(x; θ) ≤ 1, we pass the output through the sigmoid (logistic) function:

σ(a) ≜ 1 / (1 + e^(−a))

where a = f(x; θ). Sigmoid means “s-shaped”.

Binary logistic regression

In this case, we use a conditional Bernoulli model, where we use a linear predictor of the form f(x; θ) = wᵀx + b. Thus the model has the form

p(y | x; θ) = Ber(y | σ(wᵀx + b))

In other words,

p(y = 1 | x; θ) = σ(wᵀx + b) = 1 / (1 + e^(−(wᵀx + b)))

This is called logistic regression.
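A minimal sketch of the prediction step of binary logistic regression (the weights below are hypothetical; fitting them is a separate matter):

```python
import math

def sigmoid(a):
    # maps any real number into (0, 1)
    return 1.0 / (1.0 + math.exp(-a))

# Hypothetical weights and bias for a 2-feature binary logistic regression
w = [1.5, -2.0]
b = 0.5

def predict_proba(x):
    # p(y = 1 | x; theta) = sigmoid(w^T x + b)
    a = sum(wi * xi for wi, xi in zip(w, x)) + b
    return sigmoid(a)

p = predict_proba([1.0, 1.0])   # logit = 1.5 - 2.0 + 0.5 = 0.0
assert abs(p - 0.5) < 1e-12     # sigmoid(0) = 0.5
```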

Categorical and multinomial distributions

To represent a distribution over a finite set of labels y ∈ {1, …, C}, we can use the categorical distribution, which generalizes the Bernoulli to C > 2 values. The categorical distribution is a discrete probability distribution with one parameter per class:

Cat(y | θ) ≜ Π_{c=1}^C θ_c^{𝟙(y=c)}

where 0 ≤ θ_c ≤ 1 and Σ_{c=1}^C θ_c = 1.

Softmax function
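The softmax function generalizes the sigmoid to C classes, turning a vector of real-valued scores into a valid categorical distribution. A minimal sketch (subtracting the max before exponentiating is a standard numerical-stability trick):

```python
import math

def softmax(a):
    # softmax(a)_c = exp(a_c) / sum_c' exp(a_c')
    m = max(a)  # subtract the max for numerical stability; result is unchanged
    exps = [math.exp(ac - m) for ac in a]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
assert abs(sum(probs) - 1.0) < 1e-12   # a valid categorical distribution
assert probs[0] > probs[1] > probs[2]  # the ordering of the scores is preserved
```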
Multiclass logistic regression

This is known as multinomial logistic regression.

Univariate Gaussian distribution

The most widely used distribution of real-valued random variables is the Gaussian Distribution, also called the normal distribution.

Cumulative distribution function

P represents the cdf. Using this, we can compute the probability of being in any interval as follows:

Pr(a < X ≤ b) = P(b) − P(a)

Cdf’s are monotonically non-decreasing functions. The cdf of the Gaussian is defined by

Φ(y; μ, σ²) ≜ ∫_{−∞}^y N(z | μ, σ²) dz

Probability Density function

We define the probability density function or pdf as the derivative of the cdf:

p(x) ≜ (d/dx) P(x)

The pdf of the Gaussian is given by

N(y | μ, σ²) ≜ (1 / √(2πσ²)) exp(−(y − μ)² / (2σ²))

where √(2πσ²) is the normalization constant needed to ensure the density integrates to 1.
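A sketch of the Gaussian pdf, with a crude Riemann-sum check that the normalization constant makes the density integrate to roughly 1:

```python
import math

def gauss_pdf(x, mu, sigma2):
    # N(x | mu, sigma^2) = (1 / sqrt(2*pi*sigma^2)) * exp(-(x - mu)^2 / (2*sigma^2))
    z = math.sqrt(2.0 * math.pi * sigma2)  # normalization constant
    return math.exp(-((x - mu) ** 2) / (2.0 * sigma2)) / z

# Crude Riemann sum over [-10, 10] (about 10 standard deviations each side)
dx = 0.001
total = sum(gauss_pdf(-10.0 + i * dx, 0.0, 1.0) * dx for i in range(20000))
assert abs(total - 1.0) < 1e-5   # the density integrates to (approximately) 1
```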

Regression
Dirac delta function as a limiting case

As the variance of a Gaussian goes to 0, the distribution approaches an infinitely narrow, but infinitely tall, “spike” at the mean. We can write this as follows:

lim_{σ→0} N(y | μ, σ²) → δ(y − μ)

where δ is the Dirac delta function, defined by

δ(x) = ∞ if x = 0, and 0 if x ≠ 0

where ∫_{−∞}^{∞} δ(x) dx = 1.

Some other common univariate distributions

Student t distribution

The Gaussian distribution is quite sensitive to outliers. A robust alternative to the Gaussian is the Student t-distribution, which we shall call the Student distribution for short. Its pdf is as follows:

T(y | μ, σ², ν) ∝ [1 + (1/ν)((y − μ)/σ)²]^(−(ν+1)/2)

where μ is the mean, σ > 0 is the scale parameter, and ν > 0 is called the degrees of freedom (degree of normality).

Cauchy distribution

If ν = 1, the Student distribution is known as the Cauchy or Lorentz distribution. Its pdf is defined as

C(x | μ, γ) = (1/(γπ)) [1 + ((x − μ)/γ)²]^(−1)

Laplace distribution

Another distribution with heavy tails is the Laplace distribution, also known as the double-sided exponential distribution. This has the following pdf:

Laplace(y | μ, b) ≜ (1/(2b)) exp(−|y − μ|/b)

Here μ is the location parameter and b > 0 is a scale parameter.

Beta Distribution

The beta distribution has support over the interval [0, 1] and is defined as follows:

Beta(x | a, b) = (1/B(a, b)) x^(a−1) (1 − x)^(b−1)

where B(a, b) is the beta function, defined by

B(a, b) ≜ Γ(a)Γ(b) / Γ(a + b)

where Γ(a) is the Gamma function, defined by

Γ(a) ≜ ∫_0^∞ x^(a−1) e^(−x) dx

Gamma distribution

The gamma distribution is a flexible distribution for positive real-valued rv’s, x > 0. It is defined in terms of two parameters, called the shape a > 0 and the rate b > 0:

Ga(x | shape = a, rate = b) ≜ (b^a / Γ(a)) x^(a−1) e^(−xb)

Special cases of Gamma
  • Exponential distribution
  • Chi-squared distribution
  • Inverse Gamma distribution
Empirical distribution

Given a set of N samples D = {x₁, …, x_N}, we can approximate the pdf using a set of delta functions centered on these samples:

p̂(x) = (1/N) Σ_{n=1}^N δ_{x_n}(x)

This is called the empirical distribution of the dataset D.

Transformation of random variables

Suppose x ∼ p() is some random variable, and y = f(x) is some deterministic transformation of it.

Discrete case

If X is a discrete rv, we can derive the pmf for y by simply summing up the probability mass for all the x’s such that f(x) = y:

p_y(y) ≜ Σ_{x: f(x)=y} p_x(x)

Continuous case

If X is continuous, we work with cdf’s as follows:

P_y(y) ≜ Pr(Y ≤ y) = Pr(f(X) ≤ y) = Pr(X ∈ {x | f(x) ≤ y})

If f is invertible, we can derive the pdf of y by differentiating the cdf. If f is not invertible, we can use numerical integration, or a Monte Carlo approximation.

Invertible transformation (Bijections)
Change of scalars

Suppose x ∼ Unif(0, 1) and y = f(x) = 2x + 1. This function stretches and shifts the probability distribution.

Change of variables
Moments of a linear transformation
The convolution theorem

Let y = x₁ + x₂, where x₁ and x₂ are independent rv’s. If these are discrete random variables, we can compute the pmf for the sum as follows:

p(y = j) = Σ_k p(x₁ = k) p(x₂ = j − k)

for all j. If x₁ and x₂ have pdf’s p₁(x₁) and p₂(x₂), what is the distribution of y? The cdf for y is given by

P_y(y*) = Pr(y ≤ y*) = ∫ p₁(x₁) [ ∫_{−∞}^{y* − x₁} p₂(x₂) dx₂ ] dx₁

where we integrate over the region defined by x₁ + x₂ ≤ y*. Thus the pdf for y is

p(y) = (d/dy*) P_y(y*) |_{y*=y} = ∫ p₁(x₁) p₂(y − x₁) dx₁

where we used the rule of differentiating under the integral sign:

(d/dx) ∫_{−∞}^x f(t) dt = f(x)

We can write this as

p = p₁ ⊛ p₂

where ⊛ represents the convolution operator, and this theorem is called the convolution theorem.
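For discrete rv’s the convolution is a finite sum. A sketch computing the pmf of the sum of two fair dice (an illustrative choice):

```python
from fractions import Fraction

# pmf of a single fair die
die = {k: Fraction(1, 6) for k in range(1, 7)}

# Discrete convolution: p(y = j) = sum_k p(x1 = k) p(x2 = j - k)
p_sum = {}
for j in range(2, 13):
    p_sum[j] = sum(die.get(k, 0) * die.get(j - k, 0) for k in range(1, 7))

assert p_sum[7] == Fraction(1, 6)   # 7 is the most likely total of two dice
assert sum(p_sum.values()) == 1     # the convolution is a valid pmf
```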

Central limit theorem

Consider N random variables with pdf’s (not necessarily Gaussian), each with mean μ and variance σ². We assume each variable is independent and identically distributed, which means the X_n are independent samples from the same distribution. Let S_N = Σ_{n=1}^N X_n be the sum of the rv’s. As N increases, the distribution of this sum approaches

p(S_N = s) = N(s | Nμ, Nσ²)

Hence the distribution of the quantity

Z_N ≜ (S_N − Nμ) / (σ√N) = √N (X̄ − μ) / σ

converges to the standard normal, where X̄ = (1/N) Σ_{n=1}^N X_n is the sample mean. This is called the central limit theorem.
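The theorem can be illustrated by simulation. A sketch using Unif(0, 1) variables (mean 1/2, variance 1/12) as a deliberately non-Gaussian starting distribution:

```python
import random

random.seed(0)
N, trials = 30, 20000

# Each X_n ~ Unif(0, 1): mean 1/2, variance 1/12 (clearly non-Gaussian)
mu, sigma = 0.5, (1.0 / 12.0) ** 0.5

def standardized_sum():
    # Z_N = (S_N - N*mu) / (sigma * sqrt(N))
    s = sum(random.random() for _ in range(N))
    return (s - N * mu) / (sigma * N ** 0.5)

zs = [standardized_sum() for _ in range(trials)]
z_mean = sum(zs) / trials
z_var = sum((z - z_mean) ** 2 for z in zs) / trials

# The standardized sum should look approximately standard normal
assert abs(z_mean) < 0.05
assert abs(z_var - 1.0) < 0.05
```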

Monte Carlo approximation

Suppose x is a random variable, and y = f(x) is some function of x. It is often difficult to compute the induced distribution p(y) analytically. One simple but powerful alternative is to draw a large number of samples from the distribution of x, and then to use these samples (instead of the distribution) to approximate p(y). For example, suppose x ∼ Unif(−1, 1) and y = f(x) = x². We can approximate p(y) by drawing many samples from p(x) (using a uniform random number generator), squaring them, and computing the resulting empirical distribution, which is given by

p_S(y) ≜ (1/N_s) Σ_{s=1}^{N_s} δ(y − y_s)

This approach is called a Monte Carlo approximation.
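A sketch of this Monte Carlo approximation, assuming for illustration x ∼ Unif(−1, 1) and f(x) = x², for which E[y] = 1/3 exactly:

```python
import random

random.seed(0)

# x ~ Unif(-1, 1); y = f(x) = x^2 (illustrative choice of distribution and function)
samples = [random.uniform(-1.0, 1.0) ** 2 for _ in range(100000)]

# Monte Carlo estimate of E[y], using the empirical distribution of the samples
estimate = sum(samples) / len(samples)
exact = 1.0 / 3.0   # E[x^2] = integral of x^2 / 2 over [-1, 1]
assert abs(estimate - exact) < 0.01
```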


Multivariate Models

Joint distributions for multiple random variables

Covariance

The covariance between two rv’s X and Y measures the degree to which X and Y are (linearly) related:

Cov[X, Y] ≜ E[(X − E[X])(Y − E[Y])] = E[XY] − E[X] E[Y]

If x is a D-dimensional random vector, its covariance matrix is defined to be the following symmetric, positive semi-definite matrix:

Cov[x] ≜ E[(x − E[x])(x − E[x])ᵀ] = Σ

The cross-covariance between two random vectors is defined as

Cov[x, y] ≜ E[(x − E[x])(y − E[y])ᵀ]

Correlation

Covariances can be between negative and positive infinity. Sometimes it is more convenient to work with a normalized measure, with a finite lower and upper bound. The (Pearson) correlation coefficient between X and Y is defined as

ρ ≜ corr[X, Y] ≜ Cov[X, Y] / √(V[X] V[Y])

In the case of a vector x of related random variables, the correlation matrix is given by the matrix whose (i, j) entry is corr[X_i, X_j].

This can be written more compactly as

corr(x) = (diag(K_xx))^(−1/2) K_xx (diag(K_xx))^(−1/2)

where K_xx is the auto-covariance matrix and R_xx ≜ E[x xᵀ] is the autocorrelation matrix.

  • Uncorrelated does not imply independent
  • Correlation does not imply causation

The multivariate Gaussian distribution

The MVN density is defined as

N(y | μ, Σ) ≜ (1 / ((2π)^(D/2) |Σ|^(1/2))) exp(−(1/2)(y − μ)ᵀ Σ^(−1) (y − μ))

where μ = E[y] is the mean vector, and Σ = Cov[y] is the D × D covariance matrix, defined as follows:

Σ ≜ E[(y − E[y])(y − E[y])ᵀ]

Linear Gaussian systems

Let z ∈ ℝᴸ be an unknown vector of values, and y ∈ ℝᴰ be some noisy measurement of z. We assume these variables are related by the following joint distribution:

p(z) = N(z | μ_z, Σ_z)
p(y | z) = N(y | Wz + b, Σ_y)

where W is a matrix of size D × L. This is an example of a linear Gaussian system.

Bayes Rule for Gaussians

The posterior over the latent z is given by

p(z | y) = N(z | μ_{z|y}, Σ_{z|y})
Σ_{z|y}^(−1) = Σ_z^(−1) + Wᵀ Σ_y^(−1) W
μ_{z|y} = Σ_{z|y} [ Wᵀ Σ_y^(−1) (y − b) + Σ_z^(−1) μ_z ]

This is known as Bayes’ rule for Gaussians.
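A scalar sketch (L = D = 1, W = 1, b = 0, with hypothetical numbers): the posterior precision is the sum of the prior and measurement precisions, and the posterior mean is a precision-weighted average of the prior mean and the observation.

```python
# Scalar linear Gaussian system: p(z) = N(mu0, tau0^2), p(y | z) = N(z, sigma^2)
mu0, tau0 = 0.0, 2.0   # hypothetical prior mean and standard deviation
sigma = 1.0            # measurement noise standard deviation
y = 3.0                # observed measurement

# Bayes' rule for Gaussians with W = 1, b = 0: precisions add,
# and the posterior mean is a precision-weighted average.
post_prec = 1.0 / tau0**2 + 1.0 / sigma**2
post_var = 1.0 / post_prec
post_mean = post_var * (mu0 / tau0**2 + y / sigma**2)

assert post_var < min(tau0**2, sigma**2)  # observing data reduces uncertainty
assert mu0 < post_mean < y                # posterior mean lies between prior and data
```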

The exponential family

We define the exponential family, which includes many common probability distributions, parameterized by the natural parameters η ∈ ℝᴷ, with fixed support over 𝒴 ⊆ ℝᴰ. We say that the distribution p(y | η) is in the exponential family if its density can be written in the following way:

p(y | η) ≜ (1/Z(η)) h(y) exp(ηᵀ T(y)) = h(y) exp(ηᵀ T(y) − A(η))

where h(y) is a scaling constant (base measure), T(y) ∈ ℝᴷ are the sufficient statistics, η are the natural parameters or canonical parameters, Z(η) is a normalization constant known as the partition function, and A(η) = log Z(η) is the log partition function.

Mixture models

One way to create more complex probability models is to take a convex combination of simple distributions. This is called a mixture model and has the form

p(y | θ) = Σ_{k=1}^K π_k p_k(y)

where p_k is the kth mixture component and π_k are the mixture weights, which satisfy 0 ≤ π_k ≤ 1 and Σ_{k=1}^K π_k = 1.

Gaussian mixture models

A Gaussian mixture model, also called a mixture of Gaussians, is defined as follows:

p(y | θ) = Σ_{k=1}^K π_k N(y | μ_k, Σ_k)

Bernoulli mixture models

If the data is binary-valued, we can use a Bernoulli mixture model, where each mixture component has the following form:

p(y | z = k, θ) = Π_{d=1}^D Ber(y_d | μ_{dk}) = Π_{d=1}^D μ_{dk}^{y_d} (1 − μ_{dk})^(1 − y_d)

Here μ_{dk} is the probability that bit d turns on in cluster k.

Probabilistic graphical models

Representation

A probabilistic graphical model (PGM) is a joint probability distribution that uses a graph structure to encode conditional independence assumptions. When the graph is a directed acyclic graph (DAG), the model is sometimes called a Bayesian network.

The basic idea in PGMs is that each node in the graph represents a random variable, and each edge represents a direct dependency. More precisely, each lack of edge represents a conditional independency. In the DAG case, we can number the nodes in topological order (parents before children), and then we connect them such that each node is conditionally independent of all its predecessors given its parents:

X_i ⊥ X_{pred(i) ∖ pa(i)} | X_{pa(i)}

where pa(i) are the parents of node i, and pred(i) are the predecessors of node i in the ordering. (This is called the ordered Markov property.) Consequently, we can represent the joint distribution as follows:

p(x_{1:V}) = Π_{i=1}^V p(x_i | x_{pa(i)})

where V is the number of nodes in the graph.
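The factorized joint can be built directly from the conditional distributions. A sketch for a hypothetical chain-structured DAG X1 → X2 → X3 over binary variables:

```python
from fractions import Fraction

# Hypothetical chain DAG X1 -> X2 -> X3 over binary variables.
# Each node stores p(x_i | parent value); X1 has no parents.
p_x1 = {0: Fraction(3, 4), 1: Fraction(1, 4)}
p_x2_given_x1 = {0: {0: Fraction(1, 2), 1: Fraction(1, 2)},
                 1: {0: Fraction(1, 3), 1: Fraction(2, 3)}}
p_x3_given_x2 = {0: {0: Fraction(9, 10), 1: Fraction(1, 10)},
                 1: {0: Fraction(1, 5), 1: Fraction(4, 5)}}

def joint(x1, x2, x3):
    # Ordered Markov property: p(x1, x2, x3) = p(x1) p(x2 | x1) p(x3 | x2)
    return p_x1[x1] * p_x2_given_x1[x1][x2] * p_x3_given_x2[x2][x3]

# The factorized joint is a valid distribution: it sums to 1
total = sum(joint(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1))
assert total == 1
```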

Inference

A PGM defines a joint probability distribution. We can therefore use the rules of marginalization and conditioning to compute p(U | V = v) for any disjoint sets of variables U and V.

Learning

If the parameters of the CPDs are unknown, we can view them as additional random variables, add them as nodes to the graph, and then treat them as hidden variables to be inferred.

More precisely, the model encodes the following “generative story” about the data:

θ ∼ p(θ)
y_n ∼ p(y | θ), for n = 1 : N

where p(θ) is some prior over the parameters, and p(y | θ) is some specified likelihood function.