Jakub Arnold's Blog


Linear Regression - Least Squares Without Orthogonal Projection

There are multiple ways one can arrive at the least squares solution to linear regression. I’ve always seen the one using orthogonality, but there is another way which I’d say is even simpler, especially if you’ve done any calculus. Let’s define the problem first.

Given an \(N \times M\) matrix \(X\) of inputs and a vector \(y\) of length \(N\) containing the outputs, the goal is to find a weight vector \(w\) of length \(M\) such that:

$$ X w \approx y $$

The reason we’re using \(\approx\) instead of \(=\) is that we’re not expecting to fit the line exactly through our training examples, as real-world data will contain some form of noise.

To find the best possible fit we’ll define a loss function which tells us how well our line fits the data, and then minimize that loss. A common choice for regression is the sum of squared errors loss (denoted \(L\)), which is defined as:

$$ L = \sum_{i = 1}^{N} \left( X_iw - y_i \right)^2 $$

We can also write this in vector notation using a squared L2 norm:

$$ L = || Xw - y ||^2 $$
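To make these definitions concrete, here is a quick numerical sanity check (a minimal sketch in NumPy, with made-up data) showing that the sum-of-squares form and the squared L2 norm form give the same number:

```python
import numpy as np

# Made-up example data: N = 4 samples, M = 2 features.
X = np.array([[1.0, 2.0],
              [1.0, 3.0],
              [1.0, 5.0],
              [1.0, 7.0]])
y = np.array([3.0, 5.0, 9.0, 12.0])
w = np.array([0.5, 1.5])  # an arbitrary candidate weight vector

# Sum of squared errors written as an explicit sum over rows ...
loss_sum = sum((X[i] @ w - y[i]) ** 2 for i in range(len(y)))

# ... and as the squared L2 norm of the residual vector.
loss_norm = np.linalg.norm(X @ w - y) ** 2

print(loss_sum, loss_norm)  # both print the same value
```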

Now here comes the fun part. Because our loss \(L\) is convex, any point where its derivative is zero is a global minimum, so we can solve for the minimum analytically by taking the derivative with respect to \(w\) and setting it equal to zero. Before we get into that, let’s rewrite the loss \(L\) into a form which is more suitable for differentiation:

$$ \begin{align} L = || Xw - y ||^2 &= (Xw - y)^T (Xw - y) \\\\ &= (w^T X^T - y^T) (Xw - y) \\\\ &= w^T X^T X w - w^T X^T y - y^T X w + y^T y \\\\ &= w^T X^T X w - (y^T X w)^T - y^T X w + y^T y & \text{transpose of a scalar is a scalar} \\\\ &= w^T X^T X w - y^T X w - y^T X w + y^T y \\\\ &= w^T X^T X w - 2 y^T X w + y^T y \\\\ \end{align} $$
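As a sanity check of the algebra (again just a sketch with arbitrary NumPy data), the expanded quadratic form evaluates to the same number as the squared norm:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))  # arbitrary N = 6, M = 3
y = rng.normal(size=6)
w = rng.normal(size=3)

# Squared norm form of the loss ...
lhs = np.linalg.norm(X @ w - y) ** 2
# ... and the expanded form w^T X^T X w - 2 y^T X w + y^T y.
rhs = w @ X.T @ X @ w - 2 * y @ X @ w + y @ y

print(np.isclose(lhs, rhs))  # True
```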

Before moving any further, let us derive a few vector derivative rules (no pun intended). First, the \(i\)-th element of the vector \(Ax\) is defined as follows:

$$ \left( Ax \right)_i = \sum_{j=1}^{M} A_{ij} x_j = A_{i1} x_1 + A_{i2} x_2 + \dots + A_{iM} x_M $$

Now if we take a derivative with respect to \(x_j\) we’d get:

$$ \begin{align} \frac{\partial}{\partial x_j} \left( Ax \right)_i &= \frac{\partial}{\partial x_j} \left(A_{i1} x_1 + A_{i2} x_2 + \dots + A_{iM} x_M \right) \\\\ &= \frac{\partial}{\partial x_j} \left(A_{i1} x_1 + A_{i2} x_2 + \dots + A_{ij} x_j + \dots + A_{iM} x_M \right) \\\\ &= \frac{\partial}{\partial x_j} A_{i1} x_1 + \frac{\partial}{\partial x_j} A_{i2} x_2 + \dots + \frac{\partial}{\partial x_j} A_{ij} x_j + \dots + \frac{\partial}{\partial x_j} A_{iM} x_M \\\\ &= 0 + 0 + \dots + \frac{\partial}{\partial x_j} A_{ij} x_j + \dots + 0 \\\\ &= A_{ij} \end{align} $$

So this means that if we differentiate the \(i\)-th element of \(Ax\) with respect to the \(j\)-th element of \(x\), we get back \(A_{ij}\). Collecting these partial derivatives into a matrix, we get a nice, simple equation:

$$ \frac{d}{dx} Ax = A $$
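This rule is easy to verify numerically: nudging \(x_j\) changes \((Ax)_i\) at a rate of exactly \(A_{ij}\), so a finite-difference Jacobian should reproduce \(A\). A minimal sketch (arbitrary shapes, NumPy):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(4, 3))
x = rng.normal(size=3)
eps = 1e-6

# Finite-difference Jacobian of f(x) = A x: J[i, j] = d(Ax)_i / dx_j.
J = np.zeros((4, 3))
for j in range(3):
    dx = np.zeros(3)
    dx[j] = eps
    J[:, j] = (A @ (x + dx) - A @ x) / eps

print(np.allclose(J, A))  # True: d/dx (Ax) = A
```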

While nice, this doesn’t get us very far. We also need to figure out what happens when the vector appears on the left as a row vector, as in \(x^T A\):

$$ (x^T A)_i = \sum_{j = 1}^{M} x_{j} A_{ji} $$

Giving us the following partial derivative:

$$ \frac{\partial}{\partial x_j} (x^T A)_i = A_{ji} $$

which, collected into a matrix, gives:

$$ \frac{d}{dx} x^T A = A^T $$

And finally the interesting part, the derivative of the quadratic form \(x^T B x\):

$$ \begin{align} \frac{\partial x^T B x}{\partial x_i} &= \frac{\partial}{\partial x_i} \left( \sum_{j,k} x_j B_{jk} x_k \right) \\\\ &= \sum_{j} x_j B_{ji} + \sum_{k} B_{ik} x_k \\\\ &= \sum_{k} B^T_{ik} x_k + \sum_{k} B_{ik} x_k \\\\ &= \sum_{k} (B^T_{ik} + B_{ik}) x_k \\\\ &= \sum_{k} (B^T + B)_{ik} x_k \\\\ \end{align} $$

Giving us the final rule:

$$ \frac{d}{dx} x^T B x = (B^T + B) x $$
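The same kind of finite-difference check works for the quadratic form rule; note that \(B\) does not need to be symmetric for it to hold (again a sketch with arbitrary data):

```python
import numpy as np

rng = np.random.default_rng(2)
B = rng.normal(size=(3, 3))  # deliberately not symmetric
x = rng.normal(size=3)
eps = 1e-6

f = lambda v: v @ B @ v  # the quadratic form x^T B x

# Finite-difference gradient, one coordinate at a time.
grad = np.zeros(3)
for i in range(3):
    dx = np.zeros(3)
    dx[i] = eps
    grad[i] = (f(x + dx) - f(x)) / eps

print(np.allclose(grad, (B.T + B) @ x, atol=1e-4))  # True
```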

Which means we can now take the derivative of our loss function with respect to \(w\). We’ll write every term as a row vector, so for the quadratic term we use the transposed form of the last rule, \(\frac{d}{dx} x^T B x = x^T (B^T + B)\):

$$ \begin{align} \frac{d}{dw} L &= \frac{d}{dw} (w^T X^T X w - 2 y^T X w + y^T y) \\\\ &= w^T ((X^T X)^T + X^T X) - 2 y^T X + 0 & X^T X \ \text{is symmetric}\\\\ &= w^T (X^T X + X^T X) - 2 y^T X \\\\ &= 2 w^T X^T X - 2 y^T X \\\\ \end{align} $$
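To check that this derivative is right before solving for \(w\), here is a small comparison of the analytic gradient against a finite-difference gradient of the loss (a sketch with made-up data):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(8, 3))
y = rng.normal(size=8)
w = rng.normal(size=3)
eps = 1e-6

loss = lambda v: np.linalg.norm(X @ v - y) ** 2

# Analytic gradient from the derivation: dL/dw = 2 w^T X^T X - 2 y^T X.
analytic = 2 * w @ X.T @ X - 2 * y @ X

# Finite-difference gradient for comparison.
numeric = np.array([(loss(w + eps * e) - loss(w)) / eps for e in np.eye(3)])

print(np.allclose(analytic, numeric, atol=1e-3))  # True
```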

Now we want this to be equal to \(0\) to find the minimum, which gives us the following equation:

$$ \begin{align} 0 &= 2 w^T X^T X - 2 y^T X \\\\ 2 w^T X^T X &= 2 y^T X \\\\ w^T &= y^T X (X^T X)^{-1} \\\\ w &= (X^T X)^{-1} X^T y & \text{taking the transpose, since } (X^T X)^{-1} \text{ is symmetric} \\\\ \end{align} $$
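In code, the closed-form solution matches what a standard least-squares solver returns. A minimal sketch with NumPy and synthetic noisy data (this assumes \(X^T X\) is invertible, and explicitly inverting it is only advisable for small, well-conditioned problems; `np.linalg.lstsq` is the numerically safer route):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(50, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=50)  # noisy linear data

# Normal equation: w = (X^T X)^{-1} X^T y.
w_normal = np.linalg.inv(X.T @ X) @ X.T @ y

# The same fit via NumPy's least-squares solver.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(w_normal, w_lstsq))  # True
```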

And there we go, it was a bit of work, but we managed to derive the normal equation (assuming \(X^T X\) is invertible) without using orthogonal projection.