Univariate Models
Introduction
Probability
Pierre-Simon Laplace
“Probability theory is nothing but common sense reduced to calculation.”
There are two different interpretations of probability:
- Frequentist interpretation - In this view, probabilities represent long-run frequencies of events that can happen multiple times.
- Bayesian interpretation - In this view, probability is used to quantify our uncertainty about something; it is related to information rather than repeated trials.
Types of uncertainty
- Epistemic uncertainty - Uncertainty due to our ignorance of the underlying hidden causes or mechanisms generating the data. A simpler term for this type of uncertainty is model uncertainty.
- Aleatoric uncertainty - Uncertainty arising from intrinsic variability, which we cannot reduce even if we collect more data. A simpler term for this is data uncertainty.
Probability as an extension of logic
Probability of an event
We define an event, denoted by the binary variable A, as some state of the world that either holds or does not hold.
We write Pr(A) for the probability that event A is true, where 0 ≤ Pr(A) ≤ 1, and Pr(Ā) = 1 − Pr(A) for the probability that A does not hold.
Probability of a conjunction of two events
We define the joint probability of events A and B both happening as Pr(A ∧ B) = Pr(A, B).
If A and B are independent, then Pr(A, B) = Pr(A) Pr(B).
Example
Suppose X and Y are chosen uniformly at random from the set {1, 2, 3, 4}. Let A be the event that X ∈ {1, 2}, and B be the event that Y = 3. Then we have Pr(A, B) = Pr(A) Pr(B) = 1/2 · 1/4 = 1/8.
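This can be checked by brute-force enumeration. The sketch below assumes the standard version of this example: X and Y are each drawn uniformly from {1, 2, 3, 4}, A is the event X ∈ {1, 2}, and B is the event Y = 3 (these specifics are assumptions if the original values differed).

```python
import itertools
from fractions import Fraction

# Enumerate the sample space: X and Y each drawn uniformly from {1, 2, 3, 4}.
outcomes = list(itertools.product(range(1, 5), range(1, 5)))

def prob(event):
    """Probability of an event under the uniform joint distribution."""
    hits = sum(1 for xy in outcomes if event(xy))
    return Fraction(hits, len(outcomes))

A = lambda xy: xy[0] in {1, 2}   # event A: X is 1 or 2
B = lambda xy: xy[1] == 3        # event B: Y equals 3

p_joint = prob(lambda xy: A(xy) and B(xy))
print(p_joint)                    # 1/8
print(prob(A) * prob(B))          # also 1/8, since A and B are independent
```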
Probability of a union of two events
The probability of event A or B happening is given by Pr(A ∨ B) = Pr(A) + Pr(B) − Pr(A ∧ B).
If the events are mutually exclusive, we get Pr(A ∨ B) = Pr(A) + Pr(B).
Conditional probability of one event given another
We define the conditional probability of event B happening given that A has occurred as Pr(B|A) ≜ Pr(A, B) / Pr(A).
This is not defined if Pr(A) = 0, since we cannot condition on an impossible event.
Independence of events
We say that event A is independent of event B if Pr(A, B) = Pr(A) Pr(B).
Conditional Independence of events
We say that events A and B are conditionally independent given event C if Pr(A, B|C) = Pr(A|C) Pr(B|C).
This is written as A ⊥ B | C.
Random variables
If the value of some quantity X is unknown and/or could change, we call it a random variable (rv). The set of possible values it can take on is called the sample space or state space.
Discrete random variables
If the sample space is finite or countably infinite, X is called a discrete random variable. We denote the probability of the event that X has value x by the probability mass function (pmf) p(x) ≜ Pr(X = x).
The pmf satisfies the properties 0 ≤ p(x) ≤ 1 and Σₓ p(x) = 1.
Continuous random variables
If X ∈ ℝ is a real-valued quantity, it is called a continuous random variable. We can no longer enumerate a countable set of distinct values it can take on; instead, we work with the probability that X lies in an interval.
Cumulative distribution function (cdf)
Define the events A = (X ≤ a), B = (X ≤ b) and C = (a < X ≤ b), where a < b. We have B = A ∨ C, and since A and C are mutually exclusive, Pr(B) = Pr(A) + Pr(C),
and hence the probability of being in the interval (a, b] is Pr(C) = Pr(B) − Pr(A).
In general, we define the cumulative distribution function or cdf of the rv X as follows: P(x) ≜ Pr(X ≤ x).
Using this, we can compute the probability of being in any interval as follows: Pr(a < X ≤ b) = P(b) − P(a).
Cdf’s are monotonically non-decreasing functions.
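As a small sketch of the interval formula Pr(a < X ≤ b) = P(b) − P(a), we can use Python's standard-library `statistics.NormalDist`, whose `cdf` method gives the cdf of a Gaussian (used here purely as a convenient example cdf):

```python
from statistics import NormalDist

# Pr(a < X <= b) = P(b) - P(a), illustrated with a standard normal cdf.
P = NormalDist(mu=0.0, sigma=1.0).cdf

a, b = -1.0, 1.0
p_interval = P(b) - P(a)
print(round(p_interval, 4))   # ~0.6827: the classic "68% within one sigma"
```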
Probability Density function (pdf)
We define the probability density function or pdf as the derivative of the cdf: p(x) ≜ (d/dx) P(x).
Given a pdf, we can compute the probability of a continuous variable being in a finite interval as follows: Pr(a < X ≤ b) = ∫ₐᵇ p(x) dx = P(b) − P(a).
As the size of the interval gets smaller, we can write Pr(x < X ≤ x + dx) ≈ p(x) dx.
Intuitively, this says the probability of X being in a small interval around x is the density at x times the width of the interval.
Quantiles
If the cdf P is strictly monotonically increasing, it has an inverse, called the inverse cdf, the percent point function (ppf), or the quantile function.
If P is the cdf of X, then P⁻¹(q) is the value x_q such that Pr(X ≤ x_q) = q; this is called the q'th quantile of P. The value P⁻¹(0.5) is the median of the distribution.
Sets of related random variables
Given a joint distribution, we define the marginal distribution of an rv as follows: p(X = x) = Σ_y p(X = x, Y = y).
This is sometimes called the sum rule or the rule of total probability.
We define the conditional distribution of an rv using p(Y = y | X = x) ≜ p(X = x, Y = y) / p(X = x).
We can rearrange this equation to get p(x, y) = p(x) p(y|x).
This is called the product rule.
By extending the product rule to D variables, we get the chain rule of probability: p(x₁:D) = p(x₁) p(x₂|x₁) p(x₃|x₁, x₂) ⋯ p(x_D | x₁:D−1).
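The sum rule and product rule can be checked concretely on a small joint probability table (the numbers below are arbitrary illustrative values):

```python
import numpy as np

# A small joint distribution p(x, y) over X in {0,1,2} and Y in {0,1}.
joint = np.array([[0.10, 0.20],
                  [0.25, 0.15],
                  [0.05, 0.25]])

# Sum rule: marginalize out Y to get p(x).
p_x = joint.sum(axis=1)

# Conditioning: p(y | x) = p(x, y) / p(x).
p_y_given_x = joint / p_x[:, None]

# Product rule: p(x) p(y | x) recovers the joint.
reconstructed = p_x[:, None] * p_y_given_x
print(np.allclose(reconstructed, joint))   # True
```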
Independence and conditional independence
We say X and Y are unconditionally independent or marginally independent, denoted X ⊥ Y, if p(X, Y) = p(X) p(Y).
We say a set of variables X₁, …, X_n is (mutually) independent if the joint can be written as a product of marginals: p(X₁, …, X_n) = Π_i p(X_i).
We write X ⊥ Y | Z if X and Y are conditionally independent given Z, i.e. p(X, Y | Z) = p(X | Z) p(Y | Z).
Moments of a distribution
Mean of a distribution
The most familiar property of a distribution is its mean, or expected value, often denoted by μ. For continuous rv's it is defined as E[X] ≜ ∫ x p(x) dx.
If the integral is not finite, the mean is not defined. For discrete rv's, the mean is defined as E[X] ≜ Σₓ x p(x).
However, this is only meaningful if the values of x are ordered in some way (e.g., if they represent integer counts).
Since the mean is a linear operator, we have E[aX + b] = a E[X] + b. This is called the linearity of expectation.
Variance of a distribution
The variance is a measure of the spread of a distribution, often denoted by σ². It is defined as V[X] ≜ E[(X − μ)²] = E[X²] − μ²,
from which we derive the useful result E[X²] = σ² + μ².
The standard deviation is defined as std[X] ≜ √V[X] = σ.
Mode of a distribution
The mode of a distribution is the value with the highest probability mass or probability density: x* = argmaxₓ p(x). If the distribution is multimodal, the mode may not be unique.
Conditional Moments
When we have two or more dependent random variables, we can compute the moments of one given knowledge of the other.
- Law of iterated expectations (law of total expectation): E[X] = E_Y[ E[X|Y] ].
- Law of total variance (conditional variance formula): V[X] = E_Y[ V[X|Y] ] + V_Y[ E[X|Y] ].
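Both laws can be verified numerically in a simple two-component setting, where Y picks a component and the conditional moments of X given Y are known (the moments below are arbitrary illustrative values):

```python
import numpy as np

# A two-component setting: Y picks a component, X | Y has known moments.
pi = np.array([0.3, 0.7])        # p(Y = k)
mu = np.array([1.0, 4.0])        # E[X | Y = k]
var = np.array([2.0, 0.5])       # V[X | Y = k]

# Law of iterated expectations: E[X] = E_Y[ E[X|Y] ]
E_X = np.sum(pi * mu)

# Law of total variance: V[X] = E_Y[ V[X|Y] ] + V_Y[ E[X|Y] ]
V_X = np.sum(pi * var) + np.sum(pi * (mu - E_X) ** 2)

print(E_X, V_X)   # 3.1 and 2.84 for these numbers
```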
Limitations of summary statistics
Although it is common to summarize a probability distribution using simple statistics such as mean and variance, this can lose a lot of information. A striking example of this is known as Anscombe’s quartet.
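The point can be made without Anscombe's actual data: the synthetic sketch below builds two datasets with very different shapes (one unimodal, one bimodal) that nevertheless share the same mean and variance.

```python
import numpy as np

# Two very different datasets engineered to share mean and variance,
# in the spirit of Anscombe's quartet (values are synthetic, not
# Anscombe's actual data).
rng = np.random.default_rng(0)
a = rng.normal(loc=5.0, scale=2.0, size=1000)          # unimodal
b = np.concatenate([rng.normal(3, 0.1, 500),
                    rng.normal(7, 0.1, 500)])          # bimodal

# Standardize b to match a's mean and variance exactly.
b = (b - b.mean()) / b.std() * a.std() + a.mean()

print(np.allclose([a.mean(), a.var()], [b.mean(), b.var()]))  # True
# Yet the shapes differ: a has one mode, b has two.
```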
Bayes’ rule
Sir Harold Jeffreys, 1973
“Bayes' theorem is to the theory of probability what Pythagoras's theorem is to geometry.”
Inference means “the act of passing from sample data to generalization, usually with calculated degrees of certainty”. And the term “Bayesian” is used to refer to inference methods that represent “degree of certainty” using probability theory.
Bayes' rule itself is very simple; it is just a formula for computing the probability distribution over possible values of an unknown quantity H, given some observed data Y: p(H = h | Y = y) = p(H = h) p(Y = y | H = h) / p(Y = y).
This follows automatically from the identity p(h|y) p(y) = p(h) p(y|h) = p(h, y).
In this equation, the term p(H) is the prior distribution, p(Y|H) is the likelihood, p(Y) is the marginal likelihood, and p(H|Y) is the posterior distribution.
We can summarize Bayes' rule in words as follows: posterior ∝ prior × likelihood.
Using Bayes rule to update a distribution over unknown values of some quantity of interest, given relevant observed data, is called Bayesian inference, or posterior inference. It can also just be called probabilistic inference.
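A classic worked example of posterior inference is a hypothetical diagnostic test (all numbers below are illustrative assumptions, not real clinical figures):

```python
# Hypothetical diagnostic-test example of Bayes' rule:
# prior p(H=1), likelihood p(Y=1 | H), posterior p(H=1 | Y=1).
prior = 0.01          # p(disease): 1% base rate (assumed)
sens = 0.95           # p(test positive | disease), the likelihood
fpr = 0.05            # p(test positive | no disease)

evidence = sens * prior + fpr * (1 - prior)   # p(Y=1), marginal likelihood
posterior = sens * prior / evidence           # Bayes' rule

print(round(posterior, 3))   # ~0.161: a positive test still leaves ~84% chance of no disease
```

The low prior dominates: even a fairly accurate test yields a modest posterior when the condition is rare.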
Inverse problems
Probability theory is concerned with predicting a distribution over outcomes y given knowledge (or assumptions) about the state of the world, h. By contrast, inverse probability is concerned with inferring the state of the world from observations of outcomes; solving such inverse problems requires Bayes' rule.
Bernoulli and binomial distributions
Definition
Consider tossing a coin, where the probability of the event that it lands heads is given by 0 ≤ θ ≤ 1. We write Y ~ Ber(θ) to say that Y follows a Bernoulli distribution, meaning p(Y = 1) = θ and p(Y = 0) = 1 − θ,
where the symbol ~ means “is sampled from” or “is distributed as”.
We can write this in a more concise manner as follows: Ber(y|θ) ≜ θ^y (1 − θ)^(1−y). If instead we observe the number of heads s in N trials, s follows a binomial distribution: Bin(s|N, θ) ≜ C(N, s) θ^s (1 − θ)^(N−s).
Sigmoid (logistic) function
When we want to predict a binary variable y ∈ {0, 1} given some inputs x, we need to use a conditional probability distribution of the form p(y|x, θ) = Ber(y | f(x; θ)),
where f must output a value in [0, 1]. To avoid this constraint, we can let f output an arbitrary real value and pass it through the sigmoid (logistic) function: p(y|x, θ) = Ber(y | σ(f(x; θ))),
where σ(a) ≜ 1 / (1 + e^(−a)) maps the whole real line to [0, 1].
Binary logistic regression
In this, we use a conditional Bernoulli model, where we use a linear predictor of the form f(x; θ) = wᵀx + b.
In other words, p(y = 1 | x; θ) = σ(wᵀx + b).
This is called logistic regression.
Categorical and multinomial distributions
To represent a distribution over a finite set of labels y ∈ {1, …, C}, we can use the categorical distribution, which generalizes the Bernoulli to C > 2 values: Cat(y|θ) ≜ Π_c θ_c^{I(y = c)}, where 0 ≤ θ_c ≤ 1 and Σ_c θ_c = 1.
Softmax function
To convert a vector of real-valued scores a into the parameters of a categorical distribution, we use the softmax function: softmax(a)_c ≜ e^{a_c} / Σ_{c'} e^{a_{c'}}.
Multiclass logistic regression
If we combine the softmax with a linear predictor f(x; θ) = Wx + b, we get p(y|x; θ) = Cat(y | softmax(Wx + b)). This is known as multinomial logistic regression.
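A minimal sketch of the forward pass of a multinomial logistic regression model, with arbitrary (unfitted) parameters W and b; the softmax subtracts the max score first, a standard trick for numerical stability:

```python
import numpy as np

def softmax(a):
    """Numerically stable softmax: subtract the max before exponentiating."""
    a = a - a.max()
    e = np.exp(a)
    return e / e.sum()

# Class scores are linear in x; softmax turns them into a pmf over C labels.
# W and b are arbitrary illustrative parameters, not fitted values.
W = np.array([[0.5, -0.2], [0.1, 0.8], [-0.3, 0.4]])   # (C=3, D=2)
b = np.array([0.0, 0.1, -0.1])
x = np.array([1.0, 2.0])

p = softmax(W @ x + b)
print(p, p.sum())   # a valid pmf over the 3 classes
```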
Univariate Gaussian distribution
The most widely used distribution of real-valued random variables is the Gaussian distribution, also called the normal distribution.
Cumulative distribution function
Cdf's are monotonically non-decreasing functions. The cdf of the Gaussian is defined by Φ(y; μ, σ²) ≜ ∫₋∞^y N(z|μ, σ²) dz.
Probability Density function
We define the probability density function or pdf as the derivative of the cdf: p(y) ≜ (d/dy) P(y).
The pdf of the Gaussian is given by N(y|μ, σ²) ≜ (1/√(2πσ²)) exp(−(y − μ)² / (2σ²)),
where √(2πσ²) is the normalization constant needed to ensure the density integrates to 1.
Regression
We can use a Gaussian as the output distribution for regression by making the mean depend on the input: p(y|x; θ) = N(y | f(x; θ), σ²).
Dirac delta function as a limiting case
As the variance of a Gaussian goes to 0, the distribution approaches an infinitely narrow, but infinitely tall, “spike” at the mean. We can write this as follows: lim_{σ→0} N(y|μ, σ²) → δ(y − μ),
where δ is the Dirac delta function, defined by δ(x) = ∞ if x = 0 and δ(x) = 0 otherwise,
where ∫₋∞^∞ δ(x) dx = 1.
Some other common univariate distributions
Student t distribution
The Gaussian distribution is quite sensitive to outliers. A robust alternative to the Gaussian is the Student t-distribution, which we shall call the Student distribution for short. Its pdf is as follows: T(y|μ, σ², ν) ∝ [1 + (1/ν)((y − μ)/σ)²]^{−(ν+1)/2},
where μ is the mean, σ > 0 is the scale parameter, and ν > 0 is called the degrees of freedom.
Cauchy distribution
If ν = 1, the Student distribution is known as the Cauchy or Lorentz distribution. Its pdf is C(x|μ, γ) ≜ (1/(γπ)) [1 + ((x − μ)/γ)²]^{−1}. Its tails are so heavy that the integral defining the mean does not converge.
Laplace distribution
Another distribution with heavy tails is the Laplace distribution, also known as the double-sided exponential distribution. This has the following pdf: Laplace(y|μ, b) ≜ (1/(2b)) exp(−|y − μ|/b).
Here μ is a location parameter and b > 0 is a scale parameter.
Beta Distribution
The beta distribution has support over the interval [0, 1] and is defined as follows: Beta(x|a, b) = (1/B(a, b)) x^{a−1} (1 − x)^{b−1},
where B(a, b) is the beta function, B(a, b) ≜ Γ(a)Γ(b)/Γ(a + b),
where Γ(a) is the gamma function, Γ(a) ≜ ∫₀^∞ x^{a−1} e^{−x} dx. We require a > 0 and b > 0 to ensure the distribution is integrable.
Gamma distribution
The gamma distribution is a flexible distribution for positive real-valued rv's, x > 0. It is defined in terms of two parameters, called the shape a > 0 and the rate b > 0: Ga(x|shape = a, rate = b) ≜ (b^a / Γ(a)) x^{a−1} e^{−xb}.
Special cases of Gamma
- Exponential distribution: Expon(x|λ) ≜ Ga(x|shape = 1, rate = λ). This describes the times between events in a Poisson process.
- Chi-squared distribution: χ²_ν(x) ≜ Ga(x|shape = ν/2, rate = 1/2). This is the distribution of a sum of ν squared standard normal rv's.
- Inverse gamma distribution: if x ~ Ga(shape = a, rate = b), then 1/x follows the inverse gamma distribution, IG(x|shape = a, scale = b).
Empirical distribution
Given N samples x₁, …, x_N, we can approximate the underlying distribution by a set of equally weighted delta functions, one per sample: p̂(x) = (1/N) Σₙ δ_{xₙ}(x). This is called the empirical distribution of the dataset.
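The cdf of the empirical distribution (the empirical cdf) is just the fraction of samples at or below a point, which makes it easy to sketch:

```python
import numpy as np

def ecdf(data, x):
    """Empirical cdf: the fraction of samples <= x, i.e. the cdf of the
    empirical distribution (an equal-weight mixture of delta functions)."""
    data = np.asarray(data)
    return np.mean(data <= x)

samples = [1.0, 2.0, 2.0, 3.0, 5.0]
print(ecdf(samples, 2.0))   # 3/5 of the samples are <= 2.0
print(ecdf(samples, 0.0))   # 0.0
print(ecdf(samples, 9.0))   # 1.0
```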
Transformation of random variables
Suppose x ~ p() is some random variable, and y = f(x) is a deterministic transformation of it. In this section, we discuss how to compute the distribution of y.
Discrete case
If X is a discrete rv, we can derive the pmf of Y by summing the probability mass over all x's such that f(x) = y: p_y(y) = Σ_{x : f(x) = y} p_x(x).
Continuous case
If X is continuous, we cannot sum over values of x; instead we work with the cdf: P_y(y) ≜ Pr(Y ≤ y) = Pr(f(X) ≤ y).
If f is invertible, we can derive the pdf of y by differentiating the cdf, as we show below.
Invertible transformation (Bijections)
Change of scalars
Suppose x ~ Unif(0, 1) and y = f(x) = 2x + 1. This function stretches and shifts the distribution; since probability mass must be conserved, the density drops by the stretch factor, giving p_y(y) = 1/2 on [1, 3]. In general, for a monotone scalar transform, p_y(y) = p_x(f⁻¹(y)) |(d/dy) f⁻¹(y)|.
Change of variables
For the multivariate case, with x = f⁻¹(y), this generalizes to p_y(y) = p_x(f⁻¹(y)) |det J(y)|, where J(y) = ∂f⁻¹(y)/∂y is the Jacobian of the inverse mapping. This is called the change-of-variables formula.
Moments of a linear transformation
If y = Ax + b is a linear transformation of x, the mean and covariance transform as E[y] = Aμ + b and Cov[y] = A Σ Aᵀ, where μ = E[x] and Σ = Cov[x].
The convolution theorem
Let y = x₁ + x₂, where x₁ and x₂ are independent rv's. If these are discrete, the pmf of the sum is p(y = j) = Σ_k p(x₁ = k) p(x₂ = j − k)
for all j.
If x₁ and x₂ are continuous, we first compute the cdf, P_y(y*) = Pr(y ≤ y*) = ∬_R p(x₁) p(x₂) dx₁ dx₂,
where we integrate over the region R = {(x₁, x₂) : x₁ + x₂ ≤ y*}. Differentiating the cdf gives the pdf, p(y) = ∫ p₁(x₁) p₂(y − x₁) dx₁,
where we used the rule of differentiating under the integral sign.
We can write this as p = p₁ ⊛ p₂,
where ⊛ represents the convolution operator; hence this result is known as the convolution theorem.
Central limit theorem
Consider N random variables, each independent and identically distributed (iid) with mean μ and variance σ² (but not necessarily Gaussian), and let S_N = Σₙ xₙ be their sum.
Hence the distribution of the quantity Z_N ≜ (S_N − Nμ) / (σ√N)
converges to the standard normal as N grows, where S_N/N is the sample mean. This is called the central limit theorem.
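The theorem can be observed empirically: standardized sums of iid uniform rv's (chosen here just because they are clearly non-Gaussian) already look standard normal at modest N.

```python
import numpy as np

# CLT check: standardized sums of iid Uniform(0, 1) rv's look standard normal.
rng = np.random.default_rng(42)
N = 30                                   # terms per sum
mu, sigma = 0.5, np.sqrt(1.0 / 12.0)     # mean / std of Uniform(0, 1)

x = rng.random((100_000, N))             # 100k independent sums
S = x.sum(axis=1)
Z = (S - N * mu) / (sigma * np.sqrt(N))  # standardized sums

print(Z.mean(), Z.std())                 # close to 0 and 1
print(np.mean(np.abs(Z) < 1))            # close to 0.6827, as for N(0, 1)
```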
Monte Carlo approximation
Suppose x is a random variable and y = f(x) is some function of x. Instead of computing p(y) analytically, it is often simpler to draw a large number of samples from p(x) and use the empirical distribution of {f(xₙ)} as an approximation; for example, E[f(x)] ≈ (1/N) Σₙ f(xₙ).
This approach is called a Monte Carlo approximation.
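A minimal sketch: estimating E[x²] for x ~ Uniform(−1, 1) by sampling, where the exact answer ∫ x²/2 dx over [−1, 1] = 1/3 gives us something to check against.

```python
import numpy as np

# Monte Carlo approximation of E[f(x)] for x ~ Uniform(-1, 1), f(x) = x^2.
# Exact answer: E[x^2] = integral of x^2 / 2 over [-1, 1] = 1/3.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=1_000_000)
estimate = np.mean(x ** 2)
print(estimate)   # close to 1/3
```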
Multivariate Models
Joint distributions for multiple random variables
Covariance
The covariance between two rv's X and Y measures the degree to which X and Y are (linearly) related: Cov[X, Y] ≜ E[(X − E[X])(Y − E[Y])] = E[XY] − E[X] E[Y].
If x is a D-dimensional random vector, its covariance matrix is defined as Cov[x] ≜ E[(x − E[x])(x − E[x])ᵀ] = Σ, a symmetric, positive semi-definite matrix.
The cross-covariance between two random vectors is defined as Cov[x, y] ≜ E[(x − E[x])(y − E[y])ᵀ].
Correlation
Covariances can be between negative and positive infinity. Sometimes it is more convenient to work with a normalized measure, with a finite lower and upper bound. The (Pearson) correlation coefficient between X and Y is defined as ρ ≜ Cov[X, Y] / √(V[X] V[Y]), and satisfies −1 ≤ ρ ≤ 1.
In the case of a vector x of related random variables, the correlation matrix contains all the pairwise correlations.
This can be written more compactly as corr(x) = diag(Σ)^{−1/2} Σ diag(Σ)^{−1/2},
where diag(Σ) is a diagonal matrix containing the variances Σ_{ii}.
- Uncorrelated does not imply independent
- Correlation does not imply causation
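The first bullet can be demonstrated concretely: with X ~ Uniform(−1, 1) and Y = X², the variables are perfectly dependent yet (essentially) uncorrelated, because correlation only measures linear relationships.

```python
import numpy as np

# X ~ Uniform(-1, 1) and Y = X^2 are perfectly dependent, yet their
# covariance (hence correlation) is ~0: Cov[X, X^2] = E[X^3] = 0 by symmetry.
rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, size=1_000_000)
y = x ** 2

corr = np.corrcoef(x, y)[0, 1]
print(corr)   # close to 0, even though Y is a deterministic function of X
```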
The multivariate Gaussian distribution
The MVN density is defined as N(y|μ, Σ) ≜ (1 / ((2π)^{D/2} |Σ|^{1/2})) exp(−½ (y − μ)ᵀ Σ⁻¹ (y − μ)),
where μ = E[y] is the mean vector, Σ = Cov[y] is the D × D covariance matrix, and the term in front is the normalization constant.
Linear Gaussian systems
Let z ∈ ℝ^L be an unknown vector of values, and y ∈ ℝ^D be some noisy measurement of z. We assume these are related via the joint distribution p(z) = N(z|μ_z, Σ_z), p(y|z) = N(y|Wz + b, Σ_y),
where W is a D × L matrix. This is an example of a linear Gaussian system.
Bayes Rule for Gaussians
The posterior over the latent is given by p(z|y) = N(z|μ_{z|y}, Σ_{z|y}), where Σ_{z|y}⁻¹ = Σ_z⁻¹ + Wᵀ Σ_y⁻¹ W and μ_{z|y} = Σ_{z|y} [Wᵀ Σ_y⁻¹ (y − b) + Σ_z⁻¹ μ_z].
This is known as Bayes rule for Gaussians.
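A minimal scalar sketch (L = D = 1, W = 1, b = 0): precisions add, and the posterior mean is a precision-weighted average of the prior mean and the observation.

```python
import numpy as np

# Scalar linear-Gaussian system: z ~ N(mu0, tau0^2), y | z ~ N(z, s^2).
mu0, tau0 = 0.0, 2.0     # prior mean and std over the latent z (assumed values)
s = 1.0                  # observation noise std
y = 3.0                  # one observed measurement

post_prec = 1 / tau0**2 + 1 / s**2          # posterior precision: precisions add
post_var = 1 / post_prec
post_mean = post_var * (mu0 / tau0**2 + y / s**2)

print(post_mean, np.sqrt(post_var))   # 2.4: pulled from the prior toward y
```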
The exponential family
We define the exponential family, which includes many common probability distributions, as the set of distributions parameterized by natural parameters η with the form p(y|η) = h(y) exp(ηᵀ T(y) − A(η)),
where h(y) is a base measure, T(y) is a vector of sufficient statistics, and A(η) is the log partition function, which ensures the density normalizes to 1.
Mixture models
One way to create more complex probability models is to take a convex combination of simple distributions. This is called a mixture model and has the form p(y|θ) = Σ_{k=1}^{K} π_k p_k(y),
where p_k is the k'th mixture component and the π_k are the mixture weights, which satisfy 0 ≤ π_k ≤ 1 and Σ_k π_k = 1.
Gaussian mixture models
A Gaussian mixture model, also called a mixture of Gaussians, is defined as follows: p(y|θ) = Σ_k π_k N(y|μ_k, Σ_k).
Bernoulli mixture models
If the data is binary-valued, we can use a Bernoulli mixture model, where each mixture component has the following form: p(y|z = k, θ) = Π_d Ber(y_d | μ_{dk}).
Here μ_{dk} is the probability that bit d turns on in cluster k.
Probabilistic graphical models
Representation
A probabilistic graphical model (PGM) is a joint probability distribution that uses a graph structure to encode conditional independence assumptions. When the graph is a directed acyclic graph (DAG), the model is sometimes called a Bayesian network.
The basic idea in PGMs is that each node in the graph represents a random variable, and each edge represents a direct dependency. More precisely, each lack of edge represents a conditional independency. In the DAG case, we can number the nodes in topological order (parents before children), and then we connect them such that each node is conditionally independent of all its predecessors given its parents:
p(x_{1:V}) = Π_{i=1}^{V} p(x_i | x_{pa(i)}),
where pa(i) are the parents of node i, and each term p(x_i | x_{pa(i)}) is the conditional probability distribution (CPD) for node i.
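The factorization can be sketched on a toy three-node chain A → B → C with binary variables, where the CPD numbers are arbitrary illustrative values; summing the product of CPDs over all assignments confirms the factorization defines a valid joint distribution.

```python
import itertools

# Toy chain A -> B -> C with binary variables: the DAG encodes
# p(a, b, c) = p(a) p(b | a) p(c | b), i.e. C is independent of A given B.
# All CPD numbers below are arbitrary illustrative values.
p_a = {0: 0.6, 1: 0.4}
p_b_given_a = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}   # p_b_given_a[a][b]
p_c_given_b = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.5, 1: 0.5}}   # p_c_given_b[b][c]

def joint(a, b, c):
    """Joint probability from the DAG factorization."""
    return p_a[a] * p_b_given_a[a][b] * p_c_given_b[b][c]

total = sum(joint(a, b, c)
            for a, b, c in itertools.product([0, 1], repeat=3))
print(total)   # 1.0 -- the product of CPDs is a valid distribution
```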
Inference
A PGM defines a joint probability distribution. We can therefore use the rules of marginalization and conditioning to compute the distribution over any query subset of variables, given observed values of some other (evidence) variables.
Learning
If the parameters of the CPDs are unknown, we can view them as additional random variables, add them as nodes to the graph, and then treat them as hidden variables to be inferred.
More precisely, the model encodes the following “generative story” about the data: first sample the parameters from the prior, θ ~ p(θ), and then sample each data point independently, xₙ ~ p(x|θ),
where p(θ) is the prior over the parameters, so the joint is p(x_{1:N}, θ) = p(θ) Πₙ p(xₙ|θ).