Linear Regression - Least Squares Without Orthogonal Projection

5 min read • Published: April 24, 2018

There are multiple ways one can arrive at the least squares solution to linear regression. I’ve always seen the one using orthogonality, but there is another way which I’d say is even simpler, especially if you’ve done any calculus. Let’s define the problem first.

Given an $N \times M$ matrix $X$ of inputs, and a vector $y$ of length $N$ containing the outputs, the goal is to find a weight vector $w$ of length $M$ such that:

$$Xw \approx y$$

The reason we’re using $\approx$ instead of $=$ is that we’re not expecting to fit the line exactly through our training examples, as real-world data will contain some form of noise.

To find the best possible fit, we’ll create a loss function which tells us how well our line fits the data, and then try to minimize that loss. A common choice for regression is the sum of squared errors loss (denoted $L$), which is defined as:

$$L = \sum_{i = 1}^{N} \left( X_i w - y_i \right)^2$$

We can also write this in vector notation using the squared L2 norm:

$$L = \lVert Xw - y \rVert^2$$
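As a quick sanity check that the two forms of the loss really are the same thing, we can compute both with NumPy (the data here is made up purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))  # N = 10 examples, M = 3 features
y = rng.normal(size=10)       # outputs
w = rng.normal(size=3)        # an arbitrary weight vector

# Sum-of-squared-errors form: sum_i (X_i w - y_i)^2
loss_sum = sum((X[i] @ w - y[i]) ** 2 for i in range(10))

# Squared L2 norm form: ||Xw - y||^2
loss_norm = np.linalg.norm(X @ w - y) ** 2

print(np.isclose(loss_sum, loss_norm))
```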

Now here comes the fun part. Because our loss $L$ is a convex function, it has a single global minimum, which we can find analytically by simply taking the derivative with respect to $w$ and setting it equal to zero. Before we get into that, let’s rewrite the loss $L$ into a form which is more suitable for differentiation:

$$\begin{aligned} L = \lVert Xw - y \rVert^2 &= (Xw - y)^T (Xw - y) \\ &= (w^T X^T - y^T)(Xw - y) \\ &= w^T X^T X w - w^T X^T y - y^T X w + y^T y \\ &= w^T X^T X w - (y^T X w)^T - y^T X w + y^T y & \text{transpose of a scalar is a scalar} \\ &= w^T X^T X w - y^T X w - y^T X w + y^T y \\ &= w^T X^T X w - 2 y^T X w + y^T y \\ \end{aligned}$$
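It is easy to make a sign mistake when expanding quadratic forms, so as a numerical sanity check (again with made-up data), the expanded form $w^T X^T X w - 2 y^T X w + y^T y$ matches the squared norm:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(10, 3))
y = rng.normal(size=10)
w = rng.normal(size=3)

# Direct squared-norm form: ||Xw - y||^2
direct = np.linalg.norm(X @ w - y) ** 2

# Expanded form: w^T X^T X w - 2 y^T X w + y^T y
expanded = w @ X.T @ X @ w - 2 * (y @ X @ w) + y @ y

print(np.isclose(direct, expanded))
```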

Before moving any further, let us derive a few vector derivative rules (no pun intended). First, the $i$-th element of $Ax$ is defined as follows:

$$\left( Ax \right)_i = \sum_{j=1}^{M} A_{ij} x_j = A_{i1} x_1 + A_{i2} x_2 + \dots + A_{iM} x_M$$

Now if we take the derivative with respect to $x_j$, we get:

$$\begin{aligned} \frac{\partial}{\partial x_j} \left( Ax \right)_i &= \frac{\partial}{\partial x_j} \left( A_{i1} x_1 + A_{i2} x_2 + \dots + A_{iM} x_M \right) \\ &= \frac{\partial}{\partial x_j} \left( A_{i1} x_1 + A_{i2} x_2 + \dots + A_{ij} x_j + \dots + A_{iM} x_M \right) \\ &= \frac{\partial}{\partial x_j} A_{i1} x_1 + \frac{\partial}{\partial x_j} A_{i2} x_2 + \dots + \frac{\partial}{\partial x_j} A_{ij} x_j + \dots + \frac{\partial}{\partial x_j} A_{iM} x_M \\ &= 0 + 0 + \dots + \frac{\partial}{\partial x_j} A_{ij} x_j + \dots + 0 \\ &= A_{ij} \end{aligned}$$

So this means if we take the $i$-th element of $Ax$ and differentiate it with respect to the $j$-th element of $x$, we get back $A_{ij}$. As a result, we arrive at a nice, simple equation:

$$\frac{d}{dx} Ax = A$$
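We can double-check this rule numerically: approximating the Jacobian of $f(x) = Ax$ with central finite differences should recover $A$ itself (the matrix and point below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(4, 3))
x = rng.normal(size=3)

# Numerical Jacobian of f(x) = A x via central differences
eps = 1e-6
jac = np.zeros((4, 3))
for j in range(3):
    dx = np.zeros(3)
    dx[j] = eps
    jac[:, j] = (A @ (x + dx) - A @ (x - dx)) / (2 * eps)

print(np.allclose(jac, A))
```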

While nice, this doesn’t get us very far. We also need to figure out what happens when the vector appears on the left as a row vector, as in $x^T A$:

$$(x^T A)_i = \sum_{j = 1}^{M} x_j A_{ji}$$

Giving us the following partial derivative:

$$\frac{\partial}{\partial x_j} (x^T A)_i = A_{ji} = \left( A^T \right)_{ij}$$

and therefore:

$$\frac{d}{dx} x^T A = A^T$$

And finally the interesting part:

$$\begin{aligned} \frac{\partial}{\partial x_i} x^T B x &= \frac{\partial}{\partial x_i} \left( \sum_{j,k} x_j B_{jk} x_k \right) \\ &= \sum_{j} x_j B_{ji} + \sum_{k} B_{ik} x_k \\ &= \sum_{k} B^T_{ik} x_k + \sum_{k} B_{ik} x_k \\ &= \sum_{k} \left( B^T_{ik} + B_{ik} \right) x_k \\ &= \sum_{k} \left( B^T + B \right)_{ik} x_k \\ \end{aligned}$$

Giving us the final rule:

$$\frac{d}{dx} x^T B x = (B^T + B) x$$
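This rule can also be verified with finite differences: the numerical gradient of the scalar $x^T B x$ should match $(B^T + B)x$, even for a non-symmetric $B$ (the values below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
B = rng.normal(size=(3, 3))  # deliberately not symmetric
x = rng.normal(size=3)

f = lambda v: v @ B @ v  # the quadratic form x^T B x

# Numerical gradient via central differences
eps = 1e-6
grad = np.zeros(3)
for i in range(3):
    dv = np.zeros(3)
    dv[i] = eps
    grad[i] = (f(x + dv) - f(x - dv)) / (2 * eps)

print(np.allclose(grad, (B.T + B) @ x))
```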

This means we can now take our loss function and differentiate it with respect to $w$:

$$\begin{aligned} \frac{d}{dw} L &= \frac{d}{dw} \left( w^T X^T X w - 2 y^T X w + y^T y \right) \\ &= w^T \left( (X^T X)^T + X^T X \right) - 2 y^T X + 0 & X^T X \ \text{is symmetric} \\ &= w^T \left( X^T X + X^T X \right) - 2 y^T X \\ &= 2 w^T X^T X - 2 y^T X \\ \end{aligned}$$

Now we set this equal to $0$ to find the minimum, which gives us the following equation:

$$\begin{aligned} 0 &= 2 w^T X^T X - 2 y^T X \\ 2 w^T X^T X &= 2 y^T X \\ w^T &= y^T X \left( X^T X \right)^{-1} \\ w &= \left( X^T X \right)^{-1} X^T y & (X^T X)^{-1} \ \text{is symmetric} \\ \end{aligned}$$
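As a final check, the normal equation can be evaluated directly in NumPy and compared against NumPy’s own least-squares solver, here on synthetic data generated from known weights plus noise:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(50, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=50)  # noisy outputs

# Normal equation: w = (X^T X)^{-1} X^T y
w_normal = np.linalg.inv(X.T @ X) @ X.T @ y

# Reference solution from NumPy's least-squares solver
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(w_normal, w_lstsq))
```

In practice you’d avoid forming the inverse explicitly and use `np.linalg.solve(X.T @ X, X.T @ y)` or `np.linalg.lstsq` instead, as they are numerically more stable.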

And there we go: it was a bit of work, but we managed to derive the normal equation without the use of orthogonal projection.



If you have any questions, feedback, or suggestions, please do share them in the comments! I'll try to answer each and every one. If something in the article wasn't clear don't be afraid to mention it. The goal of these articles is to be as informative as possible.
