Fundamental Algorithms

Linear Regression

Linear regression is a popular regression learning algorithm that learns a model which is a linear combination of features of the input example.

Problem Statement

We have a collection of labeled examples $\{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$, where $N$ is the size of the collection, $\mathbf{x}_i$ is the D-dimensional feature vector of example $i = 1, \ldots, N$, $y_i$ is a real-valued target, and every feature $x_i^{(j)}$, $j = 1, \ldots, D$, is also a real number. We want to build a model $f_{\mathbf{w},b}(\mathbf{x})$ as a linear combination of features of example $\mathbf{x}$:

$$f_{\mathbf{w},b}(\mathbf{x}) = \mathbf{w}\mathbf{x} + b,$$

where $\mathbf{w}$ is a D-dimensional vector of parameters and $b$ is a real number. The notation $f_{\mathbf{w},b}$ means that the model is parametrized by two values: $\mathbf{w}$ and $b$. We will use the model to predict the unknown $y$ for a given $\mathbf{x}$ like this: $y \leftarrow f_{\mathbf{w},b}(\mathbf{x})$. Two models parametrized by two different pairs $(\mathbf{w}, b)$ will likely produce two different predictions when applied to the same example. We want to find the optimal values $(\mathbf{w}^*, b^*)$.

Solution

The optimization procedure we use to find the optimal values for $\mathbf{w}^*$ and $b^*$ tries to minimize the following expression:

$$\frac{1}{N}\sum_{i=1}^{N} \left(f_{\mathbf{w},b}(\mathbf{x}_i) - y_i\right)^2.$$

In mathematics, the expression we minimize or maximize is called an objective function, or simply an objective. The expression $(f_{\mathbf{w},b}(\mathbf{x}_i) - y_i)^2$ in the above objective is called the loss function. It's a measure of penalty for a wrong prediction on example $i$. This particular choice of loss function is called squared error loss. All model-based learning algorithms have a loss function, and what we do to find the best model is try to minimize the objective, known as the cost function. In linear regression, the cost function is given by the average loss, also called the empirical risk. The average loss, or empirical risk, for a model is the average of all penalties obtained by applying the model to the training data.

Overfitting is the property of a model such that the model predicts the labels of the examples used during training very well but makes frequent errors when applied to examples that weren't seen by the learning algorithm during training. Linear models rarely overfit.

In 1805, the French mathematician Adrien-Marie Legendre, who first published the sum of squares method for gauging the quality of a model, stated that squaring the error before summing is convenient. Why did he say that? The absolute value is not convenient, because it doesn't have a continuous derivative, which makes the function not smooth. Functions that are not smooth create unnecessary difficulties when employing linear algebra to find closed-form solutions to optimization problems. Closed-form solutions to finding an optimum of a function are simple algebraic expressions and are often preferable to numerical optimization methods such as gradient descent.
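The closed-form solution mentioned above can be sketched with NumPy: appending a column of ones to the feature matrix lets the bias $b$ be learned as one more weight, and a least-squares solver then minimizes the average squared error directly. The data below is illustrative, not from the text.

```python
import numpy as np

# Toy data: N examples with D features (illustrative values, not from the text)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                       # N x D feature matrix
true_w, true_b = np.array([2.0, -1.0, 0.5]), 3.0
y = X @ true_w + true_b + rng.normal(scale=0.1, size=100)

# Closed-form fit: append a column of ones so b is learned as an extra weight,
# then solve the least-squares problem min ||Xa @ theta - y||^2.
Xa = np.hstack([X, np.ones((X.shape[0], 1))])
theta, *_ = np.linalg.lstsq(Xa, y, rcond=None)
w, b = theta[:-1], theta[-1]

def predict(x):
    # The learned model f_{w,b}(x) = w·x + b
    return x @ w + b
```

Because the noise is small, the recovered `w` and `b` land close to the values used to generate the data.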


Logistic Regression

The first thing to say is that logistic regression is not a regression, but a classification learning algorithm. The name comes from statistics and is due to the fact that the mathematical formulation of logistic regression is similar to that of linear regression.

Problem statement

In logistic regression, we still want to model $y_i$ as a linear function of $\mathbf{x}_i$; however, with a binary $y_i$ this is not straightforward. A linear combination of features such as $\mathbf{w}\mathbf{x}_i + b$ is a function that spans from minus infinity to plus infinity, while $y_i$ has only two possible values. If we define the negative label as 0 and the positive label as 1, we would just need to find a simple continuous function whose codomain is $(0, 1)$. If the value returned by the model is close to 0, we assign a negative label to $\mathbf{x}$; otherwise, a positive one. One function that is simple and continuous is the standard logistic function (also known as the sigmoid function):

$$f(x) = \frac{1}{1 + e^{-x}},$$

where $e$ is the base of the natural logarithm (also called Euler's number).
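A minimal sketch of the standard logistic function, showing that it squashes any real input into the open interval (0, 1):

```python
import math

def sigmoid(x):
    # Standard logistic function: maps any real x into the open interval (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

# sigmoid(0) is exactly 0.5; large negative inputs approach 0, large positive approach 1
```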

Solution

In logistic regression, instead of using squared loss and trying to minimize the empirical risk, we maximize the likelihood of our training set according to the model. The optimization criterion in logistic regression is called maximum likelihood.
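There is no closed-form solution here, so the likelihood is typically maximized numerically. The sketch below uses plain gradient ascent on the average log-likelihood; the learning rate, epoch count, and toy data are illustrative assumptions, not prescribed by the text.

```python
import numpy as np

def sigmoid(z):
    # Standard logistic function applied elementwise
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, epochs=1000):
    """Maximize the average log-likelihood of the training set by gradient
    ascent. X: N x D features, y: binary labels in {0, 1}. A sketch only."""
    N, D = X.shape
    w, b = np.zeros(D), 0.0
    for _ in range(epochs):
        p = sigmoid(X @ w + b)          # predicted P(y = 1 | x)
        # Gradient of the average log-likelihood w.r.t. w and b
        grad_w = X.T @ (y - p) / N
        grad_b = np.mean(y - p)
        w += lr * grad_w
        b += lr * grad_b
    return w, b

# Tiny separable example: the label is 1 when the single feature is positive
X = np.array([[-2.0], [-1.0], [1.0], [2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
w, b = fit_logistic(X, y)
```

After training, the model's output can be thresholded at 0.5 to produce a class label, as described in the problem statement.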


Decision Tree Learning

A decision tree is an acyclic graph that can be used to make decisions. In each branching node of the graph, a specific feature $j$ of the feature vector is examined. If the value of the feature is below a specific threshold, then the left branch is followed; otherwise, the right branch is followed. When a leaf node is reached, the decision is made about the class to which the example belongs.

Problem Statement

Make a simple decision tree classifier.

Solution

There are various formulations of the decision tree learning algorithm. One of them is ID3. The optimization criterion, in this case, is the average log-likelihood:

$$\frac{1}{N}\sum_{i=1}^{N} \left[ y_i \ln f_{ID3}(\mathbf{x}_i) + (1 - y_i) \ln\left(1 - f_{ID3}(\mathbf{x}_i)\right) \right].$$
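In practice, ID3-style algorithms grow the tree greedily, at each node choosing the (feature, threshold) split that makes the resulting subsets as pure as possible, commonly measured by entropy. The sketch below shows that single greedy step for binary labels; the toy data and the left/right split convention are illustrative assumptions.

```python
import numpy as np

def entropy(y):
    # Entropy of a set of binary labels; 0 when the set is pure.
    if len(y) == 0:
        return 0.0
    p = np.mean(y)
    if p == 0.0 or p == 1.0:
        return 0.0
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def best_split(X, y):
    """Pick the (feature, threshold) pair whose split minimizes the weighted
    entropy of the two resulting subsets (one greedy ID3-style step)."""
    best = (None, None, entropy(y))
    N, D = X.shape
    for j in range(D):
        for t in np.unique(X[:, j]):
            left, right = y[X[:, j] < t], y[X[:, j] >= t]
            score = (len(left) * entropy(left) + len(right) * entropy(right)) / N
            if score < best[2]:
                best = (j, t, score)
    return best

# Toy 1-D data: examples below 8.0 are class 0, the rest class 1
X = np.array([[1.0], [2.0], [8.0], [9.0]])
y = np.array([0, 0, 1, 1])
j, t, score = best_split(X, y)
```

Recursively applying this step to each subset, and stopping when a subset is pure (or some depth limit is hit), yields the branching structure described above.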

Support Vector Machine

K-Nearest Neighbors

K-Nearest Neighbors (KNN) is a non-parametric learning algorithm. Contrary to other learning algorithms that allow discarding the training data after the model is built, KNN keeps all training examples in memory. Once a new, previously unseen example $\mathbf{x}$ comes in, the KNN algorithm finds $k$ training examples closest to $\mathbf{x}$ and returns the majority label (in the case of classification) or the average label (in the case of regression). The closeness of two examples is given by a distance function. For example, Euclidean distance is frequently used in practice:

$$d(\mathbf{x}_i, \mathbf{x}_k) = \sqrt{\sum_{j=1}^{D} \left(x_i^{(j)} - x_k^{(j)}\right)^2}.$$

Another popular choice of the distance function is the negative cosine similarity. Cosine similarity, defined as

$$s(\mathbf{x}_i, \mathbf{x}_k) = \frac{\sum_{j=1}^{D} x_i^{(j)} x_k^{(j)}}{\sqrt{\sum_{j=1}^{D} \left(x_i^{(j)}\right)^2} \sqrt{\sum_{j=1}^{D} \left(x_k^{(j)}\right)^2}},$$

is a measure of similarity of the directions of two vectors. If the angle between two vectors is 0 degrees, the two vectors point in the same direction, and the cosine similarity is equal to 1. If the vectors are orthogonal, the cosine similarity is 0. For vectors pointing in opposite directions, the cosine similarity is -1. If we want to use cosine similarity as a distance metric, we need to multiply it by -1. Other popular distance metrics include the Chebyshev distance, the Mahalanobis distance, and the Hamming distance. The choice of the distance metric, as well as the value for $k$, are choices the analyst makes before running the algorithm, so they are hyperparameters.
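The classification case can be sketched in a few lines: compute the distance from the query point to every stored training example, take the $k$ closest, and vote. Euclidean distance is used here; the toy data is illustrative.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote among its k nearest training examples,
    using Euclidean distance. A minimal sketch; k and the distance metric
    are hyperparameters chosen before running the algorithm."""
    dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distance to each example
    nearest = np.argsort(dists)[:k]               # indices of the k closest points
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Two well-separated clusters of training examples
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
y_train = np.array([0, 0, 1, 1])
```

Note that all of `X_train` stays in memory at prediction time, which is exactly the non-parametric property described above.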