Linear Regression - least squares with orthogonal projection

4 min read • Published: July 01, 2018

Compared to the previous article, where we simply used vector derivatives, we'll now derive the formula for least squares purely from the properties of linear transformations and the four fundamental subspaces of linear algebra. These are:

  • Kernel $Ker(A)$: the set of all solutions to $Ax = 0$, sometimes called the nullspace $N(A)$.
  • Image $Im(A)$: the set of all right-hand sides $b$ for which $Ax = b$ has a solution. We'll show that this is equal to the column space $C(A)$, which is the span of the column vectors of $A$.
  • Row space $R(A)$: the span of the row vectors of $A$, sometimes also referred to as $Im(A^T)$ (the image of $A^T$). We can also write it as $C(A^T)$, because since $Im(A) = C(A)$, we have $Im(A^T) = C(A^T) = R(A)$.
  • Left kernel $Ker(A^T)$ (or left nullspace): the set of all solutions to $A^T x = 0$. The name comes from left-multiplying by $x$, specifically it is the set of solutions to $x^T A = 0^T$.

For this derivation we assume that $Ker(A) \perp R(A)$ and $Im(A) \perp Ker(A^T)$.

When $A$ is not invertible (it could even be rectangular), there is in general no exact solution to $Ax = b$, because $b$ has a component in $Ker(A^T)$, which is outside the range of $A$ (literally). We can write $b = b_i + b_n$, where $b_i$ is the orthogonal projection of $b$ onto $Im(A)$, and $b_n$ is the orthogonal projection of $b$ onto $Ker(A^T)$. In other words, $b_i \perp b_n$.

The above is valid because we assume $Im(A) \perp Ker(A^T)$, and that $span(Im(A) \cup Ker(A^T))$ is the whole codomain of our linear mapping $A$ (the space in which $b$ lives), in other words $Im(A)$ and $Ker(A^T)$ together generate every possible right-hand side. Now just using basic algebra:
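Before the algebra, the decomposition itself is easy to check numerically. Here is a minimal sketch using NumPy: we project $b$ onto $Im(A)$ with the projector $A A^+$ (built from the pseudoinverse), and verify that the remainder lies in $Ker(A^T)$ and is orthogonal to the projection. The matrix sizes and random seed are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# A tall matrix with (generically) independent columns, and an arbitrary b.
A = rng.standard_normal((5, 2))
b = rng.standard_normal(5)

# Orthogonal projector onto Im(A) via the pseudoinverse: P = A A^+.
P = A @ np.linalg.pinv(A)
b_i = P @ b    # component of b in Im(A)
b_n = b - b_i  # component of b in Ker(A^T)

# The two components recover b and are orthogonal to each other.
assert np.allclose(b_i + b_n, b)
assert np.isclose(b_i @ b_n, 0)
# b_n really lies in Ker(A^T): A^T b_n = 0.
assert np.allclose(A.T @ b_n, 0)
```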

$$\begin{aligned} b &= b_i + b_n \\ b &= Ax + b_n && \text{left multiply by $A^T$} \\ A^T b &= A^T A x + A^T b_n && \text{since $b_n \in Ker(A^T)$, we know $A^T b_n = 0$} \\ A^T b &= A^T A x && \text{$A^T A$ is invertible, see note below} \\ (A^T A)^{-1} A^T b &= x && \text{finally, we get the normal equation} \end{aligned}$$
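As a sanity check, the normal equation can be solved directly and compared against NumPy's built-in least-squares routine (the shapes and random data here are arbitrary illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 3))  # generically has independent columns
b = rng.standard_normal(6)

# Normal equation: solve (A^T A) x = A^T b instead of forming the inverse.
x_normal = np.linalg.solve(A.T @ A, A.T @ b)

# Reference solution from NumPy's least-squares solver.
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)

assert np.allclose(x_normal, x_lstsq)
# The residual b - Ax is orthogonal to every column of A.
assert np.allclose(A.T @ (b - A @ x_normal), 0)
```

Note that in practice one solves the linear system rather than computing $(A^T A)^{-1}$ explicitly, since forming the inverse is slower and less numerically stable.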

Here we used the fact that $A^T A$ is always a symmetric positive semi-definite matrix, and in case $A$ has linearly independent columns, it is actually positive-definite, which means it is also invertible. This is easy to show.

First we show that $A^T A$ is symmetric. This is easy to see, because $(A^T A)_{ij}$ is just the dot product of the $i$-th row of $A^T$ with the $j$-th column of $A$. Note that the $i$-th row of $A^T$ is actually the $i$-th column of $A$. From this we see that $(A^T A)_{ij} = (A^T A)_{ji}$, because the dot product is symmetric.

Now we show that $A^T A$ is positive semi-definite. For an arbitrary square matrix $M$, we say that $M$ is positive semi-definite if and only if $x^T M x \geq 0$ for all $x \in \mathbb{R}^n$. We can directly substitute $A^T A$ for $M$ and use the same trick as in the lemma below:

$$x^T A^T A x = (Ax)^T Ax = ||Ax||^2 \geq 0$$

Since $A^T A$ satisfies the definition directly, it is positive semi-definite. $\square$
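Both properties are easy to observe numerically. A minimal sketch, using an arbitrary random matrix: symmetry is checked against the transpose, and semi-definiteness both via the identity $x^T (A^T A) x = ||Ax||^2$ and via the eigenvalues, which are non-negative for a positive semi-definite matrix.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((4, 3))
M = A.T @ A

# Symmetry: M equals its transpose.
assert np.allclose(M, M.T)

# Semi-definiteness: x^T M x = ||Ax||^2 >= 0 for any x.
for _ in range(100):
    x = rng.standard_normal(3)
    assert x @ M @ x >= 0
    assert np.isclose(x @ M @ x, np.linalg.norm(A @ x) ** 2)

# Equivalently, all eigenvalues of M are non-negative (up to roundoff).
assert np.all(np.linalg.eigvalsh(M) >= -1e-12)
```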

There is also another very nice way to show that $A^T A$ is invertible, without showing that it is positive semi-definite.

**Lemma** $Ker(A^T A) = Ker(A)$.

Starting with $Ker(A^T A) \supseteq Ker(A)$: this follows immediately from $Ax = 0 \implies A^T (Ax) = 0$.

Next, $Ker(A^T A) \subseteq Ker(A)$: take $A^T A x = 0$, left multiply by $x^T$, and we get:

$$0 = x^T A^T A x = (Ax)^T Ax = ||Ax||^2.$$

Since the $L_2$ norm is zero only for the zero vector, any $x$ with $A^T A x = 0$ also satisfies $||Ax||^2 = 0$, which can only be true when $Ax = 0$, and hence $x \in Ker(A)$. $\square$

Because $Ker(A^T A) = Ker(A)$, we also know that $rank(A^T A) = rank(A)$ (by rank-nullity, since both matrices have the same number of columns). So if $A$ has linearly independent columns, $A^T A$ has full rank, and since $A^T A$ is square with as many rows/columns as $A$ has columns, it is invertible.
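The rank equality holds whether or not the columns are independent, which is easy to see with a small example (random matrices, arbitrary sizes, chosen just for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)

# Generic random columns are independent: A^T A has full rank 3.
A = rng.standard_normal((5, 3))
assert np.linalg.matrix_rank(A) == np.linalg.matrix_rank(A.T @ A) == 3

# Make the third column a sum of the first two: rank drops to 2
# for both B and B^T B, so B^T B is singular.
B = np.column_stack([A[:, 0], A[:, 1], A[:, 0] + A[:, 1]])
assert np.linalg.matrix_rank(B) == np.linalg.matrix_rank(B.T @ B) == 2
```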



If you have any questions, feedback, or suggestions, please do share them in the comments! I'll try to answer each and every one. If something in the article wasn't clear don't be afraid to mention it. The goal of these articles is to be as informative as possible.
