There are multiple ways one can arrive at the least squares solution to linear regression. I’ve always seen the one using orthogonality, but there is another way which I’d say is even simpler, especially if you’ve done any calculus. Let’s define the problem first.
Given an $n \times m$ matrix $X$ of inputs and a vector $y$ of length $n$ containing the outputs, the goal is to find a weight vector $w$ of length $m$ such that:

$$Xw \approx y$$
The reason we're using $\approx$ instead of $=$ is that we're not expecting to fit the line exactly through our training examples, as real-world data will contain some form of noise.
To find the best possible fit we'll create a loss function which tells us how well our line fits the data, and then try to minimize that loss. A common choice for regression is the sum of squared errors loss (denoted $L$), which is defined as:

$$L(w) = \sum_{i=1}^{n} \left( y_i - \sum_{j=1}^{m} X_{ij} w_j \right)^2$$
We can also write this in vector notation using a squared L2 norm:

$$L(w) = \| y - Xw \|_2^2$$
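As a quick sanity check, we can evaluate both forms of the loss numerically. This is just a sketch using NumPy with made-up random data, not part of the derivation:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))   # 10 examples, 3 features
y = rng.normal(size=10)
w = rng.normal(size=3)

# Sum of squared errors, written element by element.
loss_sum = sum((y[i] - X[i] @ w) ** 2 for i in range(len(y)))

# The same loss as a squared L2 norm of the residual vector.
loss_norm = np.linalg.norm(y - X @ w) ** 2

print(np.isclose(loss_sum, loss_norm))  # the two forms agree
```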
Now here comes the fun part. Because our loss is a convex function, it has a single global minimum, which we can solve for analytically by simply taking the derivative with respect to $w$ and setting it equal to zero. Before we get into that, let's re-write the loss into a form which is more suitable for differentiation:

$$L(w) = (y - Xw)^T (y - Xw) = y^T y - y^T X w - w^T X^T y + w^T X^T X w$$
Before moving any further, let us derive a few vector derivative rules (no pun intended). First, the $i$-th row of $Xw$ is defined as follows:

$$(Xw)_i = \sum_{j=1}^{m} X_{ij} w_j$$
Now if we take the derivative with respect to $w_j$, we get:

$$\frac{\partial (Xw)_i}{\partial w_j} = X_{ij}$$
So this means if we take the $i$-th row of the matrix $Xw$ and differentiate it with respect to the $j$-th element in $w$, we get back $X_{ij}$. As a result, we get to a nice simple equation:

$$\frac{\partial Xw}{\partial w} = X$$
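We can also verify the rule $\partial (Xw) / \partial w = X$ numerically with a finite-difference check. A small sketch assuming NumPy, with randomly generated data:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 3))
w = rng.normal(size=3)
eps = 1e-6

# Build the Jacobian of f(w) = Xw column by column via central differences.
jac = np.zeros_like(X)
for j in range(len(w)):
    e = np.zeros_like(w)
    e[j] = eps
    jac[:, j] = (X @ (w + e) - X @ (w - e)) / (2 * eps)

print(np.allclose(jac, X))  # the Jacobian of Xw is X itself
```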
While nice, this doesn't get us very far. We also need to figure out what happens in the case when the vector is on the left as a row vector, as in $w^T X^T$, where the $i$-th element is:

$$(w^T X^T)_i = \sum_{j=1}^{m} w_j X_{ij}$$

Giving us the following partial derivative:

$$\frac{\partial w^T X^T}{\partial w} = X^T$$
And finally the interesting part, the quadratic term. Writing $A = X^T X$ (note that $A$ is symmetric) and expanding element-wise:

$$w^T A w = \sum_{i=1}^{m} \sum_{j=1}^{m} w_i A_{ij} w_j$$

Taking the derivative with respect to $w_k$, the index $k$ appears once in each sum, and the two resulting sums are equal by the symmetry of $A$:

$$\frac{\partial}{\partial w_k} w^T A w = \sum_{j=1}^{m} A_{kj} w_j + \sum_{i=1}^{m} w_i A_{ik} = 2 \sum_{j=1}^{m} A_{kj} w_j = 2 (A w)_k$$
Giving us the final rule:

$$\frac{\partial w^T X^T X w}{\partial w} = 2 X^T X w$$
Which means we can now take the derivative of our loss function with respect to $w$:

$$\frac{\partial L}{\partial w} = \frac{\partial}{\partial w} \left( y^T y - y^T X w - w^T X^T y + w^T X^T X w \right) = -X^T y - X^T y + 2 X^T X w = -2 X^T y + 2 X^T X w$$
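To gain some confidence in the result, here's a small numerical check (a sketch assuming NumPy, with random data) comparing the analytic gradient $-2 X^T y + 2 X^T X w$ against central finite differences of the loss:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(8, 3))
y = rng.normal(size=8)
w = rng.normal(size=3)

def loss(w):
    # Sum of squared errors, L(w) = ||y - Xw||^2.
    return np.sum((y - X @ w) ** 2)

# Analytic gradient from the derivation above.
grad = -2 * X.T @ y + 2 * X.T @ X @ w

# Central finite differences, one coordinate at a time.
eps = 1e-6
fd = np.array([
    (loss(w + eps * e) - loss(w - eps * e)) / (2 * eps)
    for e in np.eye(len(w))
])

print(np.allclose(grad, fd))  # analytic and numerical gradients match
```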
Now we want this to be equal to $0$ to find the minimum, which gives us the following equation:

$$-2 X^T y + 2 X^T X w = 0$$

$$X^T X w = X^T y$$

Solving for $w$ gives us:

$$w = (X^T X)^{-1} X^T y$$
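Putting it all together, a minimal sketch (assuming NumPy, with synthetic data) that solves the normal equation and checks the result against NumPy's built-in least-squares solver:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(20, 4))                 # 20 examples, 4 features
true_w = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ true_w + 0.1 * rng.normal(size=20)   # noisy outputs

# Normal equation: X^T X w = X^T y. Using np.linalg.solve is preferable
# to forming the inverse (X^T X)^{-1} explicitly.
w = np.linalg.solve(X.T @ X, X.T @ y)

# Reference solution from NumPy's least-squares routine.
w_ref, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(w, w_ref))  # both approaches agree
```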
And there we go, it was a bit of work but we managed to derive the normal equation without the use of orthogonal projection.
If you have any questions, feedback, or suggestions, please do share them in the comments! I'll try to answer each and every one. If something in the article wasn't clear, don't be afraid to mention it. The goal of these articles is to be as informative as possible.