Notes from MIT OCW's 6.036 (Introduction to Machine Learning) Lectures by Leslie Kaelbling:
Supervised learning is the machine learning setup that involves a dataset organized into pairs of \(x\) and \(y\) values: \(D_n = \{(x^{(1)}, y^{(1)}), ... , (x^{(n)}, y^{(n)})\}\). Some mapping must be learned between the \(x\) and \(y\) values so that, given a new \(x\) encountered in the future, a computer can accurately predict the corresponding \(y\). For example, \(x\) values might be a patient's vital signs and \(y\) might be whether or not that patient is having a heart attack.
\(x\) values are vectors in \(d\) dimensions: \(x^{(i)} \in \mathbb{R}^d\). The set that \(y\) values belong to changes depending on the problem being approached. However, when the goal is to classify \(x\) values (e.g. whether or not a patient's vital signs indicate a heart attack), the set \(y\) belongs to should contain discrete values. In particular, if there are only two categories \(x\) can be classified into, then one is working with binary classification: \(y^{(i)} \in \{+1, -1\}\).
The relationship between \(x\) and \(y\) values that a computer learns is called a hypothesis. It is some function that takes in an \(x\) and returns a \(y\). Hypotheses also have parameters \(\Theta\), which will be elaborated on later. A hypothesis is written as \(y = h(x; \Theta)\).
How does one know that the prediction a hypothesis makes is accurate though? Loss functions are the answer. They are written as \(L(g,a)\). The loss function takes in a hypothesis's guess \(g\). \(g\) is basically just the value the hypothesis predicts given some \(x\) it encounters. \(a\) is the correct value that the hypothesis should return. The loss function returns how sad one should be that \(g\) was guessed when \(a\) was the actual answer. Because being sad is not fun, lower loss is better. A plethora of different loss functions exist.
Now, how can one measure how well a hypothesis works? To begin, the hypothesis should work well on the data it was trained on. The training set error measures the average loss of the hypothesis on the training data: \(E_n(h) = {1\over{n}}\sum^{n}_{i=1}{L(h(x^{(i)}),y^{(i)})}\). The hypothesis can be likened to a student in a calculus class, though. Just memorizing homework answers does not guarantee the student full marks on the final exam. Think of the training set like a bunch of homework problems. The student should be able to generalize from practice and apply skills to test questions on final exam day. To really understand how well a hypothesis will perform on new data, one should hold out a portion of the data and call it the test set. The test error is the average of the hypothesis's losses on the \(n'\) points of the test set: \(E(h) = {1\over{n'}}\sum^{n+n'}_{i=n+1}L(h(x^{(i)}),y^{(i)})\).
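Both error measures are the same computation applied to different splits of the data. A minimal sketch in Python, using zero-one loss and a toy hypothesis (all names here are hypothetical stand-ins, not anything from the lectures):

```python
# Average loss of a hypothesis over a dataset (training or test split).
def average_error(h, L, data):
    """E(h) = (1/n) * sum of L(h(x), y) over all (x, y) pairs."""
    return sum(L(h(x), y) for x, y in data) / len(data)

# Zero-one loss: 0 when the guess matches the answer, 1 otherwise.
def zero_one_loss(g, a):
    return 0 if g == a else 1

# A toy hypothesis: classify by the sign of the first coordinate.
h = lambda x: 1 if x[0] > 0 else -1

train = [([2.0, 1.0], 1), ([-1.0, 3.0], -1), ([0.5, -2.0], -1)]
print(average_error(h, zero_one_loss, train))
```

The same `average_error` function computes the test error when handed the held-out split instead of `train`.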
The computer learns hypotheses from datasets by using learning algorithms. However, learning algorithms will be explained more extensively later.
A linear classifier is a type of hypothesis for the supervised learning setup. To visualize how linear classifiers work, one must plot all \(x\) values in \(D_n\) onto the \(\mathbb{R}^d\) space. Each axis of the \(\mathbb{R}^d\) space (i.e. \(x_1, x_2, ... , x_d\)) corresponds to a dimension of the vector \(x\). The first dimension of \(x\) is its first entry, the second dimension is its second entry, and so on. Some of the plotted points are assigned a \(y\) value of \(+1\) whereas others are assigned \(-1\).
Once the dataset is visualized on a graph, how to write a classifier becomes much more obvious! In a two-dimensional space, one just needs to draw a line that separates \(\mathbb{R}^2\) into a \(+1\) subspace and a \(-1\) subspace. In other words, one just needs to find a line such that all points that were assigned \(+1\) in the dataset sit on one side while all points that were assigned \(-1\) sit on the other side. The separator that does the job is called a linear classifier.
For a dataset with two-dimensional \(x\) vectors, linear classifiers, as described earlier, are just lines. For a dataset with three-dimensional \(x\) vectors, linear classifiers are planes. When the dataset contains \(d\)-dimensional \(x\) vectors, the general term used to describe the separator that classifies them is a hyperplane. In the \(\mathbb{R}^d\) space, a hyperplane is a subspace with \(d-1\) dimensions. A hyperplane has a normal vector that points in the direction of the \(+1\) subspace.
A linear classifier may be written formally as: \(h(x; \theta, \theta_0) = \text{sign}(\theta^T{x} + \theta_0) = \begin{cases} +1, & \theta^T x + \theta_0 > 0 \\ -1, & \text{otherwise} \end{cases}\) . Visibly, it has two parameters: \( \theta \in \mathbb{R}^d\) and \( \theta_0\in\mathbb{R}\). Learning algorithms try to find the parameters that construct a separator that accurately separates \(+1\) points from \(-1\) points.
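The formula above can be sketched directly in Python with plain lists standing in for vectors (a toy illustration, not library code):

```python
# h(x; theta, theta_0) = sign(theta^T x + theta_0), with sign(0) mapped
# to -1 per the "otherwise" branch of the definition above.
def linear_classifier(x, theta, theta_0):
    dot = sum(t * xi for t, xi in zip(theta, x))
    return 1 if dot + theta_0 > 0 else -1

# A separator in R^2: theta = [1, -1], theta_0 = 0 labels points with
# x_1 > x_2 as +1 and the rest as -1.
print(linear_classifier([3.0, 1.0], [1.0, -1.0], 0.0))  # 3 - 1 > 0, so +1
print(linear_classifier([1.0, 3.0], [1.0, -1.0], 0.0))  # 1 - 3 < 0, so -1
```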
The simplest learning algorithm that one can begin with is the random linear classifier algorithm. The algorithm works by producing random parameters \(k\) times. At the end, it returns the parameters with the lowest training set error. \(k\) is called a hyperparameter. It is not a parameter of the hypothesis. Rather, it is a parameter of the learning algorithm. It impacts how training occurs.
Algorithm: Random Linear Classifier
Input: \(D_n\), \(k\)
for \(j = 1\) to \(k\):
\( \theta^{(j)} \gets \text{random}(\mathbb{R}^d) \)
\( \theta_0^{(j)} \gets \text{random}(\mathbb{R}) \)
end for
\( j^* \gets \arg\min_{j \in \{1, ..., k\}} {E_n(h(\cdot\,; \theta^{(j)}, \theta_0^{(j)}))} \)
return \( \theta^{(j^*)}, \theta_0^{(j^*)} \)
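The pseudocode above can be sketched in Python; the dataset, dimension \(d\), and helper names below are hypothetical stand-ins:

```python
import random

def random_linear_classifier(data, k, d):
    """Try k random (theta, theta_0) pairs; return the pair with the
    lowest training error. data is a list of (x, y) with y in {+1, -1}."""
    def predict(x, theta, theta_0):
        return 1 if sum(t * xi for t, xi in zip(theta, x)) + theta_0 > 0 else -1

    def training_error(theta, theta_0):
        return sum(predict(x, theta, theta_0) != y for x, y in data) / len(data)

    best = None
    for _ in range(k):
        theta = [random.uniform(-1, 1) for _ in range(d)]
        theta_0 = random.uniform(-1, 1)
        err = training_error(theta, theta_0)
        if best is None or err < best[0]:
            best = (err, theta, theta_0)
    return best[1], best[2]

data = [([1.0, 1.0], 1), ([2.0, 0.5], 1), ([-1.0, -1.0], -1), ([-2.0, 0.0], -1)]
theta, theta_0 = random_linear_classifier(data, k=100, d=2)
```

Larger \(k\) makes a low-error separator more likely on separable data, at the cost of more evaluations of \(E_n\).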
Algorithm: Perceptron
Input: \(D_n\), \(T\)
\(\theta = \overline{0}\)
\(\theta_0 = 0\)
For \(t = 1\) to \(T\)
For \(i=1\) to \(n\)
If \(y^{(i)}(\theta^Tx^{(i)}+\theta_0)\leq0\)
\(\theta = \theta + y^{(i)}x^{(i)}\)
\(\theta_0 = \theta_0+y^{(i)}\)
Return \(\theta, \theta_0\)
The algorithm iterates through the dataset \(D_n\) \(T\) times (\(T\) is a hyperparameter). At each point in the dataset, it checks if \(y^{(i)}(\theta^Tx^{(i)}+\theta_0)\leq0\). The if statement basically determines whether or not the current \(\theta\) classified the point at index \(i\) correctly. If \((x^{(i)}, y^{(i)})\) were correctly classified, then \(y^{(i)}(\theta^Tx^{(i)}+\theta_0)\) would be positive. Recall that \(\theta^Tx^{(i)}+\theta_0\) is the input to the sign function in the linear classifier hypothesis. Thus, if it is negative and \(y^{(i)}\) is also negative, their product should be positive. If both are positive, their product is positive too. If they are of different signs (an incorrect classification occurred), then the if statement would trigger some modifications. In particular, it modifies \(\theta\) to \(\theta+y^{(i)}x^{(i)}\) and \(\theta_0\) to \(\theta_0+y^{(i)}\).
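The pseudocode and update rule above translate to a short Python sketch (the toy dataset is hypothetical):

```python
def perceptron(data, T):
    """Perceptron as in the pseudocode above: data is a list of (x, y)
    pairs with y in {+1, -1}; T is the number of passes over the data."""
    d = len(data[0][0])
    theta = [0.0] * d
    theta_0 = 0.0
    for _ in range(T):
        for x, y in data:
            # Mistake (or zero margin): nudge the parameters toward the point.
            if y * (sum(t * xi for t, xi in zip(theta, x)) + theta_0) <= 0:
                theta = [t + y * xi for t, xi in zip(theta, x)]
                theta_0 += y
    return theta, theta_0

# A linearly separable toy set; after training, every point satisfies
# y * (theta^T x + theta_0) > 0.
data = [([1.0, 2.0], 1), ([2.0, 3.0], 1), ([-1.0, -1.5], -1), ([-2.0, -1.0], -1)]
theta, theta_0 = perceptron(data, T=10)
```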
The modifications that perceptron makes to the parameters \(\theta\) and \(\theta_0\) probably are not the obvious “right move.” One is probably left wondering why Rosenblatt, the inventor of perceptron, chose to set \(\theta=\theta+y^{(i)}x^{(i)}\) and \(\theta_0=\theta_0+y^{(i)}\). This brings up another interesting discussion point. Whereas the inner workings of most algorithms have been built intuitively around the problem to be solved, perceptron was simply introduced and left for scholars to analyze over the decades. Years of papers have determined that Rosenblatt's modifications are quite functional.
To better analyze how perceptron works and even introduce a theorem about it, exploring a simpler version of the algorithm is useful. Think of perceptron-through-origin as perceptron without any offset (\(\theta_0\)) parameter. Some playing with dimensions later on will show that what applies to perceptron-through-origin applies to perceptron with an offset.
Linear separability is a property of a dataset \(D_n\). \(D_n\) is linearly separable when there exists some \(\theta\) (no \(\theta_0\) because the current discussion is about perceptron-through-origin) such that \(y^{(i)}(\theta^Tx^{(i)})>0\) for all \(i\). In other words, all points in the dataset \(D_n\) are correctly classified.
The margin of a data point \((x,y)\) with respect to a linear separator (a hyperplane) is \(y{{\theta^Tx}\over{||\theta||}}\). \({{\theta^Tx}\over{||\theta||}}\) is the signed distance from the point to the separator, and \(y\) is the target label (either \(+1\) or \(-1\)). If \(y\) is \(-1\) and \((x,y)\) is correctly classified, then \({{\theta^Tx}\over{||\theta||}}\) should be negative. Therefore, a correctly classified point should have a positive margin, and an incorrectly classified point should have a negative margin. A higher margin is better because the farther a point is away from the classifier, the better classified it is (not a close call).
The margin of an entire dataset with respect to a linear separator is equal to the margin of the point it contains that has the lowest margin: \(\min_{i}{y^{(i)}{{\theta^Tx^{(i)}}\over{||\theta||}}}\).
The perceptron convergence theorem states that if (a) there exists some \(\theta^*\) such that \(y^{(i)}{{\theta^{*T}x^{(i)}}\over{||\theta^*||}}\geq\gamma>0\) for all \(i\) (in other words, the margin of the dataset with respect to \(\theta^*\) is at least some constant \(\gamma\)) and (b) \(||x^{(i)}||\leq{R}\) for all \(i\) (in other words, when graphed, the data points are contained in a hypersphere of radius \(R\)), then perceptron-through-origin will make at most \(\left({{R}\over{\gamma}}\right)^2\) modifications during training.
Say that \(\theta^{(k)}\) is the hypothesis produced after \(k\) modifications during training and \(\theta^*\) is the parameter such that \(y^{(i)}{{\theta^{*T}x^{(i)}}\over{||\theta^*||}}\geq\gamma>0\) for all \(i\). The angle between \(\theta^{(k)}\) and \(\theta^*\) is \(\alpha\). To show that \(\theta^{(k)}\) will converge to become \(\theta^*\) as \(k\) increases, one must show that the angle \(\alpha\) becomes smaller and smaller. As the angle becomes smaller, its cosine should become greater (i.e. \(\cos{\alpha}\) should increase).
As promised earlier, the problem of perceptron-not-through-origin can be reduced to the problem of perceptron-through-origin. The key lies in transforming the dataset used from the \( \mathbb{R}^d \) space to an \( \mathbb{R}^D \) space where \( D > d \).
Recall that the parameters of a separator not through the origin are \( \theta=[\theta_1, ... \theta_d] \) and \( \theta_0 \), a scalar. Say that these two parameters are taken to create a single one called \( \theta_{new} \): \( \theta_{new} = [\theta_1, ... \theta_d, \theta_0] \). \( \theta_{new} \) is basically just \( \theta \) with \( \theta_0 \) put in as the last entry. Now, each \( x \) value vector in the dataset must also be modified. Say that \( x_{new} = [x_1, ...,x_d,1] \). \( x_{new} \) is basically just \( x \) with \( 1 \) put in as the last entry. \( \theta^T_{new}x_{new} = \theta_1x_1 + ... + \theta_dx_d+(1)\theta_0=\theta_1x_1+...+\theta_dx_d+\theta_0=\theta^Tx+\theta_0 \). A classifier-not-through-origin in \( d \) dimensions can be turned into a classifier-through-origin in \( d+1 \) dimensions. Thus, even though the perceptron convergence theorem from before was only proved for perceptron through origin, transformation shows that the theorem also applies to perceptron-not-through-origin.
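The equality above is easy to check numerically; the helper names below are hypothetical:

```python
# Reducing a classifier with offset to one through the origin: append
# theta_0 to theta and a constant 1 to each x. The dot product in d+1
# dimensions then equals theta^T x + theta_0 in d dimensions.
def augment(x):
    return x + [1.0]

def fold_offset(theta, theta_0):
    return theta + [theta_0]

theta, theta_0 = [2.0, -1.0], 0.5
x = [3.0, 4.0]

original = sum(t * xi for t, xi in zip(theta, x)) + theta_0
through_origin = sum(t * xi for t, xi in zip(fold_offset(theta, theta_0), augment(x)))
```

Here `original` and `through_origin` both equal \(2.5\), confirming \(\theta^T_{new}x_{new} = \theta^Tx + \theta_0\).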
The same concept can be applied to a dataset that is not linearly separable in \( d \) dimensions. By moving the dataset into \( d+1 \), \( d+2 \), or more generally, \( d+n \) dimensions, one can make it linearly separable.
A systematic way of transforming data into higher dimensions exists. It is called the polynomial basis.
| order | in general |
| --- | --- |
| 0 | \([1]\) |
| 1 | \([1, x_1, ... x_d]\) |
| 2 | \([1, x_1, ... x_d, x_1^2, x_1x_2, ... \text{all two way products}]\) |
| 3 | \([1, x_1, ... x_d, x_1^2, x_1x_2, ... \text{all two way products}, x_1^3, ... \text{all three way products}]\) |
| ... | ... |
For an example, take \([x_1, x_2]\). Using polynomial basis to transform it to the second degree yields \([1, x_1, x_2, x_1^2, x_1x_2, x_2^2]\).
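The order-2 transformation can be sketched in Python (the function name is a hypothetical stand-in):

```python
# Order-2 polynomial basis for a d-dimensional input: a constant 1, each
# x_i, and every two-way product x_i * x_j with i <= j.
def poly_basis_order2(x):
    features = [1.0] + list(x)
    d = len(x)
    for i in range(d):
        for j in range(i, d):
            features.append(x[i] * x[j])
    return features

# [x_1, x_2] = [2, 3] maps to [1, x_1, x_2, x_1^2, x_1 x_2, x_2^2].
print(poly_basis_order2([2.0, 3.0]))
```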
Another type of learning algorithm is one based on optimization. In minimizing some function \(J\), called an objective function, the learning algorithm should in effect minimize the training set error on \(D_n\) of the classifier it produces.
Objective function \(J\) should be written as \(J(\Theta) = {1\over{n}}{\sum^n_{i=1}}{L(h(x^{(i)}, \Theta),y^{(i)})+\lambda{R(\Theta)}}\). The first addend is the training set error of hypothesis \(h\). The second addend is a new concept. \(R(\Theta)\) is a regularizer. It prevents the learning algorithm from being too honed in on minimizing training set error. Sometimes, learning algorithms that are too focused on minimizing training set error produce a classifier that works so specifically well on the training data that it is out of touch with the real-life data it will be thrown in the future. \(\lambda\) is a positive constant (a hyperparameter) that determines to what degree the regularizer will be considered. Also, one must know that \(\Theta\) is used to generally represent all parameters of \(h\). For the scope of logistic regression, \(\Theta\) encompasses \(\theta\) and \(\theta_0\).
The type of hypothesis produced in logistic regression is different from a standard linear classifier. The hypothesis in logistic regression is written as \(h(x; \theta, \theta_0)=\sigma(\theta^Tx+\theta_0)\) where \(\sigma\) is the sigmoid function, defined as \(\sigma(z)={{1}\over{1+e^{-z}}}\). Why can't one use a standard linear classifier and the zero-one loss function in the objective function to be optimized? Zero-one loss only says whether a separator classifies a point correctly or incorrectly. On the other hand, because \(\sigma(z)\in(0,1)\) and \(\sigma(0)=0.5\), the sigmoid function says how correctly or incorrectly a separator classifies a point. These semantics are what optimization takes advantage of in working towards the optimal set of parameters.
To actually classify points using \(h(x; \theta, \theta_0)=\sigma(\theta^Tx+\theta_0)\) though, one needs to interpret the output of \(\sigma\) against a threshold. Because the bounds of the sigmoid function's range are 0 and 1, the labels in the dataset are now \(y^{(i)}\in\{0,1\}\) rather than \(\{+1,-1\}\). The most common setup is a threshold of 0.5: all outputs of \(\sigma\) greater than 0.5 are classified as positive whereas all outputs less than or equal to 0.5 are classified as negative. In that case, the set of points where \(\sigma(\theta^Tx+\theta_0)=0.5\), i.e. where \(\theta^Tx+\theta_0=0\), is the linear separator.
To visualize this phenomenon for two-dimensional inputs, imagine a surface curved like the sigmoid function rising along a third axis. The separator is where that surface crosses 0.5 on the sigmoid axis.
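The sigmoid and the thresholding rule can be sketched in Python (the parameters below are arbitrary examples):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def logistic_classify(x, theta, theta_0, threshold=0.5):
    """Predict 1 when sigma(theta^T x + theta_0) exceeds the threshold, else 0."""
    z = sum(t * xi for t, xi in zip(theta, x)) + theta_0
    return 1 if sigmoid(z) > threshold else 0

print(sigmoid(0.0))                                      # 0.5, exactly on the separator
print(logistic_classify([2.0, 1.0], [1.0, 1.0], -1.0))   # sigma(2) > 0.5, so 1
print(logistic_classify([-2.0, 0.0], [1.0, 1.0], -1.0))  # sigma(-3) < 0.5, so 0
```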
The loss function that goes with the hypothesis logistic regression produces is called negative log likelihood loss. Deriving the function will allow one to make good sense of it.
Negative log likelihood loss is also called cross entropy loss or log loss.
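Though the derivation is not reproduced here, the resulting formula is standard and worth recording: for a guess \(g\in(0,1)\) and an actual label \(a\in\{0,1\}\), \(L_{nll}(g, a) = -(a\log{g} + (1-a)\log{(1-g)})\). When \(a=1\) only the first term survives, so the loss is small when \(g\) is near 1; when \(a=0\) only the second term survives, so the loss is small when \(g\) is near 0.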
A common regularizer is \(R(\Theta)=||\Theta||^2\). It tries to keep the norms of the parameters small during training. As a result, the learning algorithm will not try too hard to produce a classifier that does well on outlier points. The regularizer also prevents the classifier from overfitting the provided training data.
Gradient descent is a common optimization learning algorithm for logistic regression. To understand how gradient descent works, one can begin by observing it in one dimension.
Say there is some function \(f\), some starting position called \(x_{init}\), and some step size \(\eta\). The algorithm obtains the derivative of \(f\) at \(x_{init}\). It then moves in the negative direction of the derivative. The size of the step it takes is determined by \(\eta\). Eventually, after enough moving around, gradient descent should find a relative minimum of \(f\).
The point \(x\) will continue moving downhill (possibly bouncing back and forth around a minimum) until its movements are less than some tolerance constant \(\epsilon\).
Examine the multivariable function \(f(x,y) = x^2y\). The partial derivative \(\frac{\partial{f}}{\partial{x}}\) is the derivative of \(x^2y\) when \(y\) is treated like a constant. The same is true for \(\frac{\partial{f}}{\partial{y}}\), except \(x\) is treated like a constant. Neither partial derivative tells the entire rate of change of \(f\), but they do give insight into how \(f\) changes as one of its variables does. The gradient of \(f\), written as \(\nabla{f}\), is the vector of all partial derivatives of \(f\): \(\begin{bmatrix} \frac{\partial{f}}{\partial{x}} \\ \frac{\partial{f}}{\partial{y}} \end{bmatrix}\).
Algorithm: Gradient Descent in Multiple Dimensions
Input: \( \theta_{init} \), \( f \), \( \nabla_\theta{f} \), \( \epsilon \), \( \eta \)
\( \theta^{(0)} = \theta_{init} \)
\( t = 0 \)
Repeat:
\( t = t + 1 \)
\( \theta^{(t)} = \theta^{(t-1)} - \eta \nabla{f(\theta^{(t-1)})} \)
Until \( |f(\theta^{(t)}) - f(\theta^{(t-1)})| < \epsilon \)
Return \( \theta^{(t)} \)
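The multi-dimensional procedure can be sketched in Python with plain lists; the quadratic test function and `max_iter` safety cap are additions for illustration:

```python
# Gradient descent, stopping when the change in f between iterations
# drops below epsilon (with a cap on iterations as a safety net).
def gradient_descent(theta_init, f, grad_f, epsilon, eta, max_iter=10000):
    theta = list(theta_init)
    f_prev = f(theta)
    for _ in range(max_iter):
        g = grad_f(theta)
        theta = [t - eta * gi for t, gi in zip(theta, g)]
        f_curr = f(theta)
        if abs(f_prev - f_curr) < epsilon:
            break
        f_prev = f_curr
    return theta

# Minimize f(x, y) = (x - 1)^2 + (y + 2)^2, whose minimum is at (1, -2).
f = lambda th: (th[0] - 1) ** 2 + (th[1] + 2) ** 2
grad_f = lambda th: [2 * (th[0] - 1), 2 * (th[1] + 2)]
theta_star = gradient_descent([0.0, 0.0], f, grad_f, epsilon=1e-10, eta=0.1)
```

With \(\eta = 0.1\) on this bowl-shaped function, `theta_star` lands very close to \((1, -2)\).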
Gradient descent for logistic regression uses the listed formulas.
\(J(\theta, \theta_0) = {1\over{n}}{\sum_{i=1}^n{L(\sigma(\theta^Tx^{(i)}+\theta_0),y^{(i)})}}+\frac{\lambda}{2}{||\theta||^2}\)
\(\nabla_\theta{J(\theta, \theta_0)}=\frac{1}{n}\sum_{i=1}^{n}{(\sigma(\theta^Tx^{(i)}+\theta_0)-y^{(i)})x^{(i)}}+\lambda\theta\)
\(\frac{\partial J(\theta, \theta_0)}{\partial \theta_0}={1\over{n}}\sum^n_{i=1}({\sigma(\theta^Tx^{(i)}+\theta_0)}-y^{(i)})\)
The algorithm is the same as gradient descent above; the only edits are that, inside the loop, the gradient and partial derivative above (scaled by \(\eta\)) are subtracted from \(\theta^{(t-1)}\) and \(\theta_0^{(t-1)}\) to produce \(\theta^{(t)}\) and \(\theta_0^{(t)}\).
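The gradient formulas can be written out in plain Python (with labels \(y^{(i)}\in\{0,1\}\); the function names are hypothetical stand-ins):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lr_gradients(data, theta, theta_0, lam):
    """Gradient of J wrt theta and partial wrt theta_0, averaged over data,
    matching the formulas above (the lambda/2 regularizer contributes
    lam * theta to the gradient and nothing to the theta_0 partial)."""
    n = len(data)
    d = len(theta)
    grad_theta = [lam * t for t in theta]
    grad_theta_0 = 0.0
    for x, y in data:
        err = sigmoid(sum(t * xi for t, xi in zip(theta, x)) + theta_0) - y
        for j in range(d):
            grad_theta[j] += err * x[j] / n
        grad_theta_0 += err / n
    return grad_theta, grad_theta_0
```

These two values are exactly what a gradient descent step subtracts (scaled by \(\eta\)) from \(\theta\) and \(\theta_0\).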
Like classification, regression is a type of supervised learning setup. Thus, its dataset is still \(D_n = \{(x^{(1)}, y^{(1)}), ... ,(x^{(n)}, y^{(n)})\}\). \(x^{(i)}\in\mathbb{R}^d\), but the set \(y^{(i)}\) belongs to will change: \(y^{(i)}\in\mathbb{R}\). Whereas the \(y\) values in classification were discrete categories, the \(y\) values in regression are continuous values. For example, one might use regression to predict a company's stock price or the temperature in two days. Neither stock price nor temperature should be predicted using categories like \(\{+1, -1\}\); they are continuous values.
The hypothesis for linear regression (finding a trend line) is written as \(h(x; \theta, \theta_0) = \theta^Tx+\theta_0\) where \(\theta\in\mathbb{R}^d\) and \(\theta_0\in\mathbb{R}\).
The standard loss function that is conventionally used for linear regression is called squared loss: \(L(g, a) = (g-a)^2\).
The regression setup involving the linear regression hypothesis and squared loss function is so common that it has earned a name: ordinary least squares.
Quite naturally, upon seeing \(h(x;\theta,\theta_0)=\theta^Tx+\theta_0\), one should want to use optimization to learn the best hypothesis for the dataset \(D_n\). Thus, an objective function must be written. To highlight how important regularizers are, \(J\) will first be written without \(\lambda{R(\Theta)}\): \(J(\theta, \theta_0)=\frac{1}{n}\sum^n_{i=1}{(\theta^Tx^{(i)}+\theta_0-y^{(i)})^2}\). The optimal parameters \(\theta^*\) and \(\theta_0^*\) are \(\arg\min_{\theta, \theta_0}{J(\theta, \theta_0)}\). The calculus way of finding a minimum will work: find the derivative of the objective function, set it to 0, solve for critical points, and prove that the critical points are minima. One could compute the derivative of \(J\) with respect to each entry of \(\theta\) to produce a system of equations that can be solved for the optimal parameters, but using the matrix form of the system is more elegant. Say that \(W=
\begin{bmatrix}
x^{(1)}_1 & ... & x^{(1)}_d & 1\\
\vdots{} & \ddots & & \vdots{} \\
x^{(n)}_1 & ... & x^{(n)}_d & 1
\end{bmatrix}\). Each \(x\) value in \(D_n\) is turned into a row of \(W\). Each row has a \(1\) added to its end in order to account for \(\theta_0\) (recall that a separator-not-through-origin in \(d\) dimensions can be turned into a separator-through-origin in \(d+1\) dimensions). Also, \(T=\begin{bmatrix}
y^{(1)}\\
\vdots \\
y^{(n)}
\end{bmatrix}\).
Now, one can use linear algebra to find the parameter array \(\theta_{array}\) (which is \(\theta\) with \(\theta_0\) appended as its last entry) that minimizes a rewritten objective function \(J(\theta_{array})\) instead of finding the best \(\theta\) vector and \(\theta_0\) scalar for the original \(J(\theta, \theta_0)\): \(J(\theta,\theta_0)=J(\theta_{array})=\frac{1}{n}(W\theta_{array}-T)^T(W\theta_{array}-T)\). One needs to find the gradient (vector of partial derivatives) of \(J(\theta_{array})\), set it to \(0\), and solve for \(\theta_{array}\). \(\nabla_{\theta_{array}}J=\frac{2}{n}W^T(W\theta_{array}-T)\). Setting it to zero: \(0=\frac{2}{n}W^T(W\theta_{array}-T)\), so \(0=W^TW\theta_{array}-W^TT\), so \(W^TW\theta_{array}=W^TT\). Multiplying both sides by \((W^TW)^{-1}\) gives \(\theta_{array}^*=(W^TW)^{-1}W^TT\).
This is an example of a rare closed form solution. One does not need a fancy computer or algorithm to use a closed form solution; it simply describes how to solve for \(\theta_{array}^*\) with pencil and paper. The problem with \(\theta_{array}^*=(W^TW)^{-1}W^TT\) is that \(W^TW\) might not be invertible, in which case \((W^TW)^{-1}\) cannot be computed. The way of overcoming this problem is to add the regularizer \(||\theta||^2\) to \(J\) to create an objective function called the ridge regression objective: \(J_{ridge}(\theta, \theta_0)=[\frac{1}{n}{\sum_{i=1}^n{(\theta^Tx^{(i)}+\theta_0-y^{(i)})^2}}]+\lambda{||\theta||^2}\). (Strictly, the regularizer should not impact \(\theta_0\), because offsets do not have to be forced toward \(0\) to prevent overcomplicated parameters; the matrix form below penalizes it anyway as a simplification.) \(\nabla_{\theta_{array}}J_{ridge}=\frac{2}{n}W^T(W\theta_{array}-T)+2\lambda\theta_{array}\). Setting the gradient to \(0\) and solving yields \(\theta_{array}^*=(W^TW+n\lambda{I})^{-1}W^TT\). \(I\) is the identity matrix (the values along its diagonal are all set to 1 while the rest of the matrix is set to 0).
Its diagonal of 1s being multiplied by \(n\lambda>0\) adds weight to the diagonal (the "ridge") of \(W^TW\), making it invertible.
Inverting \(W^TW+n\lambda{I}\) takes \(O(d^3)\) time (and forming \(W^TW\) takes \(O(nd^2)\)), so for high-dimension datasets, gradient descent or stochastic gradient descent will perform better than the analytic method described above. For gradient descent or stochastic gradient descent, one must know that \(\nabla_{\theta}J_{ridge}=\frac{2}{n}\sum_{i=1}^n{(\theta^Tx^{(i)}+\theta_0-y^{(i)})x^{(i)}}+2\lambda{\theta}\) and \(\frac{\partial{J_{ridge}}}{\partial\theta_0}=\frac{2}{n}{\sum_{i=1}^{n}}(\theta^Tx^{(i)}+\theta_0-y^{(i)})\).
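The closed-form ridge solution \(\theta_{array}^*=(W^TW+n\lambda{I})^{-1}W^TT\) can be sketched with NumPy. Note one simplification flagged in the comments: this matrix form also shrinks the offset entry, which strictly the regularizer should leave alone. The data and function name are hypothetical:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge: theta* = (W^T W + n*lam*I)^{-1} W^T T, where W is
    the data matrix with a trailing column of 1s (so the last entry of the
    result plays the role of theta_0, and is also shrunk -- a simplification)."""
    n = len(X)
    W = np.hstack([np.asarray(X, dtype=float), np.ones((n, 1))])
    T = np.asarray(y, dtype=float).reshape(-1, 1)
    d1 = W.shape[1]
    # Solve the linear system rather than forming an explicit inverse.
    theta = np.linalg.solve(W.T @ W + n * lam * np.eye(d1), W.T @ T)
    return theta.ravel()

# Data generated from y = 2x + 1; with a tiny lambda the fit recovers it.
X = [[0.0], [1.0], [2.0], [3.0]]
y = [1.0, 3.0, 5.0, 7.0]
theta = ridge_fit(X, y, lam=1e-8)
```

Using `np.linalg.solve` instead of an explicit inverse is both faster and numerically safer, though the asymptotic cost is still cubic in the dimension.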