Machine learning, notes 2: Regression

Regression analysis refers to a set of techniques for estimating relationships among variables. This post introduces linear regression augmented by basis functions to enable non-linear adaptation, which lies at the heart of supervised learning, as will be apparent when we turn to classification. Thus, a thorough understanding of this model will be hugely beneficial. We’ll go through 2 derivations of the optimal parameters namely the method of ordinary least squares (OLS), which we briefly looked at in notes 1a, and maximum likelihood estimation (MLE). We’ll also dabble with some Python throughout the post.

Setup and objective

Given a training dataset of $N$ input variables $\mathbf{x} \in \mathbb{R}^D$ with corresponding target variables $t \in \mathbb{R}$, the objective of regression is to construct a function $h(\mathbf{x})$ that yields prediction values of $t$ for new values of $\mathbf{x}$.

The simplest linear model for regression is just known as linear regression, where the predictions are generated by

\[h\left(\mathbf{x},\mathbf{w} \right) = w_0 + w_1 x_1 + \dots + w_D x_D = w_0 + \sum_{i=1}^D w_i x_i. \quad \quad (1)\]

The first term $w_0$ is commonly called the intercept or bias parameter, and allows $h$ to adapt to a fixed offset in the dataWe’ll show exactly what this means later in this post.. If we introduce a $1$ as the first element of each input variable $\mathbf{x}$, we can rewrite $(1)$ with vector notation, i.e. if we define $\mathbf{x} = \left( 1, x_1, \dots , x_D \right)^\intercal$, we can rewrite $(1)$ as

\[h(\mathbf{x},\mathbf{w}) = \mathbf{w}^\intercal \mathbf{x},\]

where $\mathbf{w} = \left(w_0, \dots, w_D \right)^\intercal$.

In notes 1a we went over polynomial regression with a $1$-dimensional input variable $x \in \mathbb{R}$. Now we’re just doing linear regression (not polynomial), but we’re allowing our input variable to be $D$-dimensional $\mathbf{x} \in \mathbb{R}^D$, hence it becomes a vector instead of a scalar. For the sake of visualization, however, let’s stick to the same example dataset:

import numpy as np

x = np.array([-1, -0.8, -0.6, -0.4, -0.2, 0, 0.2, 0.4, 0.6, 0.8, 1])
t = np.array([-4.9, -3.5, -2.8, 0.8, 0.3, -1.6, -1.3, 0.5, 2.1, 2.9, 5.6])

N = len(x)

Since our input variables are $1$-dimensional, we’ll have 2 parameters: $w_1$ and $w_0$, but to make sure we also find the bias parameter, we have to introduce a column of $1$s in x, like we defined $\mathbf{x} = \left( 1, x_1, \dots , x_D \right)^\intercal$. We can do this with the following piece of code

X = np.column_stack([np.ones(N), x])

Derivation and training

So, how do we train the model? We’ll look at 2 different approaches of deriving the method of training this model. Recall that training (or learning) refers to the process of estimating the parameters of our model, so when we ask how to train the model, it’s the same as asking how to estimate the values of $\mathbf{w}$.

Ordinary least squares

Like we did in notes 1a, we defined an objective function that calculated a measure of the performance of our model in terms of an error, and then we minimized this error with respect to our parameters. This means we would find the parameters that would result in the least error. We’ll use the same objective function as in notes 1a, the sum of squared errors (SSE), defined as

\[\begin{aligned} E(\mathbf{w}) &= \sum_{n=1}^N \left( t_n - h \left( \mathbf{x}_n, \mathbf{w} \right) \right)^2 \\ &= \sum_{n=1}^N \left( t_n - \mathbf{w}^\intercal \mathbf{x}_n \right)^2, \quad \quad (2) \end{aligned}\]

and we want to find values for $\mathbf{w}$ that minimizes $E$

To recap: we want to find the parameter values $\hat{\mathbf{w}}$ that minimize our objective function, which is defined as the sum of the squared differences between $h(\mathbf{x}_i,\mathbf{w})$ and $t_i$.

\[\begin{aligned} \hat{\mathbf{w}} = \underset{\mathbf{w}}{\arg\min} E(\mathbf{w}). \end{aligned}\]

Note that $(2)$ is a quadratic function of the parameters $\mathbf{w}$. Its partial derivatives with respect to $\mathbf{w}$ will therefore be linear in $\mathbf{w}$, which means there is a unique minimum. This is a property of convex functions.

If we evaluate the SSE for values of $w_0$ and $w_1$ in a grid, then we can illustrate that our objective function has a unique minimum with a contour plot. This is shown below.

The cross is located at the minimum of our objective function - this coordinate corresponds to the values of $w_0$ (x-axis) and $w_1$ (y-axis) that produces the smallest SSE for our dataset. The contour lines show some boundaries for the SSE; we can see that the optimal values of $w_0$ and $w_1$ lie within a boundary of value $19$, meaning that the minimum value of SSE is less than $19$.

So, how do we find the minimum of $E$? Recall from the notes about extrema that we find the minimum of a function by taking the derivative, setting the derivative equal to 0, and solving for the function variable. In our case, we have to take all the partial derivatives of $E$ with respect to $w_0, \dots, w_{D}$ and set it equal to 0. Remember that all the partial derivatives of $E$ gives us the gradient $\nabla E$. To ease notation, let all our input variables be denoted by $\mathbf{X}$ with $N$ rows (one for each input variable) and $D+1$ columns (one for each feature plus one for the bias) defined as

\[\mathbf{X} = \begin{pmatrix} 1 & X_{11} & \cdots & X_{1D} \\ 1 & X_{21} & \cdots & X_{2D} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & X_{N1} & \cdots & X_{ND} \end{pmatrix},\]

and let $\textbf{\textsf{t}} = \left( t_1, \dots, t_N \right)^\intercal$ denote all our target variables. We can now rewrite $(2)$ as

\[\begin{aligned} E(\mathbf{w}) &= \sum_{n=1}^N \left( t_n - \mathbf{w}^\intercal \mathbf{x}_n \right)^2 \\ &= (t_1 - \mathbf{w}^\intercal \mathbf{x}_1) (t_1 - \mathbf{w}^\intercal \mathbf{x}_1) + \cdots + (t_N - \mathbf{w}^\intercal \mathbf{x}_N) (t_N - \mathbf{w}^\intercal \mathbf{x}_N) \\ &= \left( \textbf{\textsf{t}} - \mathbf{X}\mathbf{w} \right)^\intercal \left( \textbf{\textsf{t}} - \mathbf{X}\mathbf{w} \right) \\ &= \left( \textbf{\textsf{t}}^\intercal - \left( \mathbf{X}\mathbf{w} \right)^\intercal \right) \left( \textbf{\textsf{t}} - \mathbf{X}\mathbf{w} \right) \\ &= \textbf{\textsf{t}}^\intercal \textbf{\textsf{t}} - \textbf{\textsf{t}}^\intercal \mathbf{X}\mathbf{w} - \left( \mathbf{X}\mathbf{w} \right)^\intercal \textbf{\textsf{t}} + \left( \mathbf{X}\mathbf{w} \right)^\intercal \mathbf{X}\mathbf{w} \\ &= \left( \mathbf{X}\mathbf{w} \right)^\intercal \mathbf{X}\mathbf{w} - 2 \left( \mathbf{X}\mathbf{w} \right)^\intercal \textbf{\textsf{t}} + \textbf{\textsf{t}}^\intercal \textbf{\textsf{t}}. \end{aligned}\]

If we now take the derivative with respect to $\mathbf{w}$, we get

\[\begin{aligned} \nabla E(\mathbf{w}) &= \frac{\partial}{\partial \mathbf{w}} \left( \left( \mathbf{X}\mathbf{w} \right)^\intercal \mathbf{X}\mathbf{w} - 2 \left( \mathbf{X}\mathbf{w} \right)^\intercal \textbf{\textsf{t}} + \textbf{\textsf{t}}^\intercal \textbf{\textsf{t}} \right) \\ &= 2 \mathbf{X}^\intercal \mathbf{X}\mathbf{w} - 2 \mathbf{X}^\intercal \textbf{\textsf{t}}, \end{aligned}\]

and setting this result equal to 0 lets us solve for $\mathbf{w}$

\[\begin{aligned} 2 \mathbf{X}^\intercal \mathbf{X}\mathbf{w} - 2 \mathbf{X}^\intercal \textbf{\textsf{t}} &= 0 \\ \mathbf{X}^\intercal \mathbf{X} \mathbf{w} &= \mathbf{X}^\intercal \textbf{\textsf{t}}\\ \mathbf{w} &= \left( \mathbf{X}^\intercal \mathbf{X} \right)^{-1} \mathbf{X}^\intercal \textbf{\textsf{t}}, \end{aligned}\]

which are our estimated values for the parameters

To recap once again: $\left( \mathbf{X}^\intercal \mathbf{X} \right)^{-1} \mathbf{X}^\intercal \textbf{\textsf{t}}$ are the values of $\mathbf{w}$ that minimizes our SSE objective function defined in $(2)$.

\[\hat{\mathbf{w}} = \underset{\mathbf{w}}{\arg\min} E(\mathbf{w}) = \left( \mathbf{X}^\intercal \mathbf{X} \right)^{-1} \mathbf{X}^\intercal \textbf{\textsf{t}}.\]

Maximum likelihood

Choosing the SSE as the objective function might seem a bit arbitrary though - for example why not just go with the sum of the errors? Why do we have to square them? To show why this is a good choice, and why the solution makes sense, we are going to derive the same solution from a probabilistic perspective using maximum likelihood estimation (MLE). To do this, we assume that the target variable $t$ is given by our function $h(\mathbf{x}, \mathbf{w})$ with a bit of noise added:

\[t = h \left( \mathbf{x},\mathbf{w} \right) + \epsilon,\]

where $\epsilon \sim \mathcal{N} \left( 0,\alpha \right)$, i.e. $\epsilon$ is a Gaussian random variable with mean 0 and variance $\alpha$. This lets us say that given an input variable $\mathbf{x}$, the corresponding target value $t$ is normally distributed with mean $h(\mathbf{x}, \mathbf{w})$ and variance $\alpha$, i.e.

\[\text{Pr}(t | \mathbf{x}, \mathbf{w}, \alpha) = \mathcal{N} \left( t | h(\mathbf{x},\mathbf{w}), \alpha \right). \quad \quad (3)\]

Let’s take a moment to understand exactly, what we’re doing. The image below illustrates what $(3)$ tells us. We are estimating parameters $\mathbf{w}$ such that our target variables $t$ are normally distributed around the output values of $h$.

We can now use the entire dataset, $\mathbf{X}$ and $\textbf{\textsf{t}}$, to write up the likelihood function by making the assumption that our data points are drawn independently from $(3)$. The likelihood function then becomes the product of $(3)$ for all our input and target variable pairs, and is a function of $\mathbf{w}$ and $\alpha$:

\[\text{Pr}(\textbf{\textsf{t}} | \mathbf{X}, \mathbf{w}, \alpha) = \prod_{i=1}^N \mathcal{N} \left( t_i | h(\mathbf{x}_i,\mathbf{w}), \alpha \right). \quad \quad (4)\]

Now we want to maximize the likelihood, which means we want to determine the values of our parameters $\mathbf{w}$ and $\alpha$ that maximizes $(4)$. This might seem dauntingly difficult, but we can make it simpler with a handy trick.Taking the log of the likelihood not only simplifies the math, but it helps computationally as well, since the product of many probabilities usually causes underflow, whereas the sum of logs doesn’t. Since the logarithm is a monotonically increasing function, maximizing the log-likelihood is equivalent to maximizing the likelihood. Taking the log of the likelihood gives us

\[\begin{aligned} \ln \text{Pr}(\textbf{\textsf{t}} | \mathbf{X}, \mathbf{w}, \alpha) &= \ln \left( \prod_{i=1}^N \mathcal{N} \left( t_i | h(\mathbf{x}_i,\mathbf{w}), \alpha \right) \right) \\ &= \sum_{i=1}^N \ln \left( \frac{1}{\sqrt{2 \pi \alpha}} \exp \left( -\frac{(t_n - \mathbf{w}^\intercal \mathbf{x}_i)^2}{2 \alpha} \right) \right) \\ &= N \ln \frac{1}{\sqrt{2 \pi \alpha}} - \sum_{i=1}^N \frac{(t_n - \mathbf{w}^\intercal \mathbf{x}_i)^2}{2 \alpha} \\ &= - \frac{N}{2} \ln 2 \pi \alpha - \frac{1}{2 \alpha} \underbrace{\sum_{i=1}^N (t_n - \mathbf{w}^\intercal \mathbf{x}_i)^2}_{\text{SSE}}. \quad \quad (5) \end{aligned}\]

Now it becomes evident why the SSE objective function is a good choice; the last term of $(5)$ is the only part dependent on $\mathbf{w}$ and is the same as SSE. Since the first term does not depend on $\mathbf{w}$, we can omit it, and since the maximum of the likelihood function with respect to $\mathbf{w}$ does not change by scaling with the positive constant $\frac{1}{2 \alpha}$, then we see maximizing the likelihood with respect to $\mathbf{w}$ is equivalent to minimizing the SSE objective function. The maximum likelihood solution for $\mathbf{w}$ is therefore the same as in our previous derivation

\[\mathbf{w}_{\text{ML}} = \left( \mathbf{X}^\intercal \mathbf{X} \right)^{-1} \mathbf{X}^\intercal \textbf{\textsf{t}}.\]

We can use the result of the maximum likelihood solution for $\mathbf{w}$ to find the value of the noise variance $\alpha$. If we insert the maximum likelihood solution for $\mathbf{w}$ in the log-likelihood, take the derivative, and set it equal to 0, then we can solve for $\alpha$

Note that we are in fact jointly maximizing the likelihood with respect to both $\mathbf{w}$ and $\alpha$, but because the maximization with respect to $\mathbf{w}$ is independent of $\alpha$, we start by finding the maximum likelihood solution for $\mathbf{w}$, and then use that result to find $\alpha$.

\[\begin{aligned} \frac{\partial}{\partial \alpha} \ln \text{Pr}(\textbf{\textsf{t}} | \mathbf{X}, \mathbf{w}_{\text{ML}}, \alpha) &= 0 \\ \frac{\partial}{\partial \alpha} \left( - \frac{N}{2} \ln 2 \pi \alpha - \frac{1}{2 \alpha} \sum_{i=1}^N (t_n - \mathbf{w}_{\text{ML}}^\intercal \mathbf{x}_i)^2 \right) &= 0 \\ -\frac{N}{2\alpha} + \frac{1}{2 \alpha^2} \sum_{i=1}^N (t_n - \mathbf{w}_{\text{ML}}^\intercal \mathbf{x}_i)^2 &= 0 \\ \frac{1}{\alpha} \sum_{i=1}^N (t_n - \mathbf{w}_{\text{ML}}^\intercal \mathbf{x}_i)^2 &= N \\ \alpha_{\text{ML}} &= \frac{1}{N} \sum_{i=1}^N (t_n - \mathbf{w}_{\text{ML}}^\intercal \mathbf{x}_i)^2. \end{aligned}\]

Python implementation

In notes 1a we implemented the OLS solution, but since we have a probabilistic model now, we make predictions that are probability distributions over $t$ instead of just point estimates. This is done by substituting the maximum likelihood solutions for $\mathbf{w}$ and $\alpha$ into $(3)$

\[\text{Pr}(t | \mathbf{x}, \mathbf{w}_{\text{ML}}, \alpha_{\text{ML}}) = \mathcal{N} \left( t | h(\mathbf{x},\mathbf{w}_{\text{ML}}), \alpha_{\text{ML}} \right).\]

We can find $\mathbf{w}$ and $\alpha$ with the following code snippet

w = np.linalg.inv(X.T @ X) @ X.T @ t
alpha = sum((t - X @ w)**2) / len(t)

Plot of the line $w_0+w_1 x$ with our estimated values for $\mathbf{w}$ along with the uncertainty $\alpha$.

Model selection

You might be wondering what linear regression is so good for considering the image above, since it’s not doing well, but now we’re going to shine light on that by looking at ways to improve the simple linear regression model.

Basis functions

We call a model linear if it’s linear in the parameters not in the input variables. However, $(1)$ is linear in both the parameters and the input variables, which limits it from adapting to nonlinear relationships. We can augment the model by replacing the input variables with nonlinear basis functions of the input variables

\[\begin{aligned} h(\mathbf{x},\mathbf{w}) &= w_0 \phi_0(\mathbf{x}) + \cdots + w_{M-1} \phi_{M-1}(\mathbf{x}) \\ &= \sum_{m=0}^{M-1} w_m \phi_m(\mathbf{x}) \\ &= \mathbf{w}^\intercal \bm{\phi} (\mathbf{x}), \end{aligned}\]

Note that we had $D+1$ parameters in the simple linear regression model, but by augmenting it with basis functions, we now have $M$ parameters, which can be larger than $D$ if need be. where we define $\bm{\phi}(\mathbf{x}) = \left( \phi_0 (\mathbf{x}), \dots, \phi_{M-1}(\mathbf{x}) \right)^\intercal$ and $\phi_0 ( {\mathbf{x}} ) = 1$ to keep the intercept $w_0$. By using nonlinear basis functions it is possible for $h$ to adapt to nonlinear relationships of $\mathbf{x}$, which we will see shortly - we call these models linear basis function models.

We already looked at one example of basis functions in notes 1a, where we augmented the simple linear regression model with basis functions of powers of $x$, i.e. $\phi_i (x) = x^i$. Another common basis function is the Gaussian

\[\phi_i (\mathbf{x}) = \exp \left( - \gamma_i \| \bm{\mu}_i -\mathbf{x} \|_2^2 \right).\]

Following the same derivation as before, we find the maximum likelihood solutions to be

\[\mathbf{w}_{\text{ML}} = \left( \mathbf{\Phi}^\intercal \mathbf{\Phi} \right)^{-1} \mathbf{\Phi}^\intercal \textbf{\textsf{t}} \quad \text{and} \quad \alpha_{\text{ML}} = \frac{1}{N} \sum_{i=1}^N (t_n - \mathbf{w}_{\text{ML}}^\intercal \bm{\phi}(\mathbf{x}_i))^2,\]

where

\[\mathbf{\Phi} = \begin{pmatrix} \phi_0 (\mathbf{x}_1) & \phi_1 (\mathbf{x}_1) & \cdots & \phi_{M-1} (\mathbf{x}_1) \\ \phi_0 (\mathbf{x}_2) & \phi_1 (\mathbf{x}_2) & \cdots & \phi_{M-1} (\mathbf{x}_2) \\ \vdots & \vdots & \ddots & \vdots\\ \phi_0 (\mathbf{x}_N) & \phi_1 (\mathbf{x}_N) & \cdots & \phi_{M-1} (\mathbf{x}_N) \end{pmatrix}.\]

Illustration of the effect of using $M-1$ Gaussian basis functions plus the intercept.

The Gaussian basis function for the plot above was implemented as below, where $\mu_i=\frac{i}{M}$ and $\gamma_i = 1$ for $i = 1, \dots, M-1$.

def gaussian_basis(x, mu, gamma=1):
    return np.ex\text{Pr}(-gamma * np.linalg.norm(mu-x)**2)

Regularization

We briefly ran into the concept of regularization in the previous notes, which we described as a technique of preventing overfitting. If we look back at the objective function we defined earlier, augmented with basis functions, we can introduce a regularization term

\[E(\mathbf{w}) = \sum_{i=1}^N (t_n - \mathbf{w}^\intercal \bm{\phi}(\mathbf{x}_i))^2 + \underbrace{\lambda \sum_{j=0}^{M-1} | w_j |^q}_{\text{regularization}},\]

where $q > 0$ denotes the type of regularization, and $\lambda$ controls the extent of regularization, i.e. how much do we care about the error from the data in relation to the regularization. The most common values of $q$ are $1$ and $2$, which are called $L_1$ regularization and $L_2$ regularization respectively. We call it lasso regression when we use $L_1$ regularization, and ridge regression when we use $L_2$ regularization.

The objective function of ridge regression

\[\begin{aligned} E(\mathbf{w}) &= \sum_{i=1}^N (t_n - \mathbf{w}^\intercal \bm{\phi}(\mathbf{x}_i))^2 + \lambda \sum_{j=0}^{M-1} | w_j |^2\\ &= \sum_{i=1}^N (t_n - \mathbf{w}^\intercal \bm{\phi}(\mathbf{x}_i))^2 + \lambda \mathbf{w}^\intercal \mathbf{w} \end{aligned}\]

is especially convenient as it is a quadratic function of $\mathbf{w}$ and therefore has a unique global minimum. The solution to which is

\[\hat{\mathbf{w}} = \left( \lambda \mathbf{I} + \mathbf{\Phi}^\intercal \mathbf{\Phi} \right)^{-1} \mathbf{\Phi}^\intercal \textbf{\textsf{t}},\]

where $\alpha$ stays the same as without regularization, since the regularization term has no influence on it.

When we introduce regularization, the process of model selection goes from finding the appropriate number of basis functions to finding the appropriate value for the regularization parameter $\lambda$.

Illustration of changing the value of the regularization parameter $\lambda$, while keeping the number of basis functions $M=8$ constant. Even though we overfitted earlier when $M=8$, our effective complexity is now controlled by the regularization instead, and the model will not overfit if $\lambda$ is large enough. Note also that as the regularization parameter is increased, the uncertainty increases as well.

Summary

We can find the parameters for a linear regression through ordinary least squares or maximum likelihood estimation.
Usually in linear regression we have a scalar parameter that is not multiplied by the input called the intercept or bias denoted $w_0$.
The process of estimating the values of the parameters is called the training or learning process.
Since the logarithm is a monotonically increasing function, maximizing the likelihood function is the same as maximizing the log-likelihood function.
What makes a model linear is that it’s linear in the parameters not the inputs.
We can augment linear regression with basis functions yielding linear basis function models.
Polynomial regression is a linear basis function model.
Regularization is a technique of preventing overfitting.
There are different kinds of regularization in linear regression such as $L_1$ and $L_2$ regularization.