NLA MT25, Useful miscellany


Flashcards

What is

\[\begin{bmatrix} I _ m & X \\ 0 & I _ n \end{bmatrix}^{-1}\]

?

\[\begin{bmatrix} I _ m & -X \\ 0 & I _ n \end{bmatrix}\]

Suppose that $ \vert \vert X \vert \vert < 1$ in a submultiplicative norm. How can you rewrite $(I - X)^{-1}$?

\[(I - X)^{-1} = I + X + X^2 + X^3 + \cdots\]

(submultiplicativity is useful so you can say $ \vert \vert X^k \vert \vert \le \vert \vert X \vert \vert ^k \to 0$).

@exam~

Suppose that $Q \in \mathbb R^{m \times n}$ is orthogonal (note that $Q$ might be rectangular!). What does this mean, and what does this not mean?

\[Q^\top Q = I _ n\]

This does not mean that

\[Q Q^\top = I _ n\]

(although this is true for square matrices, and more generally $A^{-1} A = I _ n$ implies $A A^{-1} = I _ n$, it is not true in general for rectangular matrices).

@State a useful identity about the spectral norm of $I - P$ where $P$ is a projector ($P^2 = P$).

Suppose $P \in \mathbb R^{n \times n}$ is a projector $(P^2 = P)$ with $P \ne 0, I$. Then

\[\|I-P\| _ 2 = \|P\| _ 2.\]

What are the eigenvalues of the $n \times n$ all-ones matrix $J _ n = \mathbf 1 \mathbf 1^\top$ (every entry equal to $1$)?

$J _ n$ is symmetric and rank-$1$, so it has a single nonzero eigenvalue. The eigenvalues are:

  • $n$ (multiplicity $1$), with eigenvector $\mathbf 1 = (1, \ldots, 1)^\top$, since $J _ n \mathbf 1 = n \mathbf 1$.
  • $0$ (multiplicity $n - 1$), with eigenspace $\mathbf 1^\perp = {x : \sum _ i x _ i = 0}$.

Rank-$1$ forces a single nonzero eigenvalue, and $\mathrm{Trace}(J _ n) = n$ fixes its value as $n$.

When is $A \in \mathbb F^{n \times n}$ diagonalisable over $\mathbb F$? State the equivalent conditions.

Diagonalisable over $\mathbb F$ means $A = S \Lambda S^{-1}$ for some invertible $S$ and diagonal $\Lambda$ with entries in $\mathbb F$. The following are equivalent:

  • $\mathbb F^n$ has a basis of eigenvectors of $A$, i.e. $\bigoplus _ \lambda \ker(A - \lambda I) = \mathbb F^n$.
  • The characteristic polynomial splits over $\mathbb F$, and for every eigenvalue the geometric multiplicity equals the algebraic multiplicity.
  • The minimal polynomial $m _ A$ is a product of distinct linear factors in $\mathbb F[t]$ (squarefree and split). This is the most usable criterion.
  • Every Jordan block of $A$ has size $1$.

Field caveat: over $\mathbb C$ the minimal-polynomial criterion is just “$m _ A$ squarefree”, since splitting is automatic. Over $\mathbb R$ you additionally need all eigenvalues real: a rotation has $m _ A = t^2 + 1$, squarefree but not split over $\mathbb R$, so it is $\mathbb C$-diagonalisable but not $\mathbb R$-diagonalisable. Having $n$ distinct eigenvalues is sufficient but not necessary. Normal (resp. Hermitian/symmetric) matrices satisfy the stronger property of unitary (resp. orthogonal) diagonalisability.


See also other cards on this topic:

@State the push-through identity, give a one-line proof, and name its most common application.

For $A \in \mathbb R^{m \times n}$ and $B \in \mathbb R^{n \times m}$ (with the relevant inverses existing):

\[(I _ m + AB)^{-1} A = A (I _ n + BA)^{-1}.\]

Proof (one line): from $A(I _ n + BA) = A + ABA = (I _ m + AB)A$, left-multiply by $(I _ m + AB)^{-1}$ and right-multiply by $(I _ n + BA)^{-1}$.

Most common application, the ridge-regression dual: with $A = X^\top$, $B = X$ and $\lambda > 0$, a short rescaling gives

\[(X^\top X + \lambda I _ D)^{-1} X^\top = X^\top (X X^\top + \lambda I _ N)^{-1}.\]

For data $X \in \mathbb R^{N \times D}$ the LHS solves a $D \times D$ system and the RHS an $N \times N$ one, so you pick whichever dimension is smaller. The RHS also exposes the inner-product matrix $X X^\top$, which opens the door for kernel methods (∆muffin-worry-push).

Source: standard NLA identity; ridge instance used in Notes - Machine Learning MT23, Linear regressionU.

@Define the Wirtinger derivatives and explain why they are the right tool for optimising a real-valued function of a complex variable.

Problem: we want to minimise $J(W)$ where $J : \mathbb C \to \mathbb R$, for example $J(W) = \vert 1 - aW \vert ^2$. The ordinary complex derivative $\partial J / \partial W$ does not exist because a non-constant real-valued function of a complex variable cannot satisfy the Cauchy-Riemann equations.

Workaround (Wirtinger): write $W = u + iv$ with $u, v \in \mathbb R$, then formally treat $W = u + iv$ and $W^\ast = u - iv$ as independent variables. Define

\[\frac{\partial}{\partial W} = \frac{1}{2}\left( \frac{\partial}{\partial u} - i \frac{\partial}{\partial v} \right), \qquad \frac{\partial}{\partial W^\ast} = \frac{1}{2}\left( \frac{\partial}{\partial u} + i \frac{\partial}{\partial v} \right).\]

Under this convention the usual chain and product rules apply, with $W$ and $W^\ast$ behaving as independent symbols: $\partial W / \partial W = 1$, $\partial W^\ast / \partial W = 0$, $\partial W / \partial W^\ast = 0$, $\partial W^\ast / \partial W^\ast = 1$.

Why it works for optimisation: for real-valued $J$, a critical point in $\mathbb C \cong \mathbb R^2$ requires $\partial J / \partial u = \partial J / \partial v = 0$. These two real conditions combine into the single complex condition

\[\frac{\partial J}{\partial W^\ast} = 0.\]

The conjugate condition $\partial J / \partial W = 0$ is equivalent and gives the same critical points. The convention is to use $W^\ast$, matching the fact that $\partial J / \partial W^\ast$ points in the direction of steepest ascent of $J$ in the complex plane.

Practical rule: to differentiate $J(W, W^\ast)$ with respect to $W^\ast$, treat $W$ as constant. The two examples used in the Wiener-filter derivation (∆wiener-derivation):

\[\frac{\partial}{\partial W^\ast} \vert W \vert ^2 = \frac{\partial}{\partial W^\ast}(W W^\ast) = W, \qquad \frac{\partial}{\partial W^\ast} \vert 1 - aW \vert ^2 = \frac{\partial}{\partial W^\ast}((1 - aW)(1 - a^\ast W^\ast)) = -a^\ast(1 - aW).\]

Source: standard signal-processing tool (Wirtinger 1927). Not in the NLA lecture notes but referenced for Wiener filtering and complex-valued optimisation.

Bite-sized

Transpose of a product: $(AB)^\top = $ $B^\top A^\top$. Inverse of a product (when both are invertible): $(AB)^{-1} = $ $B^{-1} A^{-1}$.

Source: Lecture 1, Some useful results slide and §1.7 of the lecture notes.

@bite~

For square matrices, $AB = I \implies $ $BA = I$. Non-trivial: square invertibility is two-sided, but only because the matrices are square (and finite-dimensional).

Source: Lecture 1, Some useful results slide and §1.7 of the lecture notes.

@bite~

Trace identities: $\mathrm{Trace}(XY) = $ $\mathrm{Trace}(YX)$ whenever both products are square, and for any $B \in \mathbb R^{m \times n}$, $\ \vert B\ \vert _ F^2 = $ $\mathrm{Trace}(B^\top B)$.

Source: Lecture 1, Some useful results slide and §1.7 of the lecture notes.

@bite~

For a diagonalisable $A = X \Lambda X^{-1}$ and a polynomial $p$ (e.g. $p(z) = 1 - \alpha z$, so $p(A) = I - \alpha A$), what bounds $|p(A)| _ 2$ in terms of the eigenvalues, and when is it an equality?

Since $p(A) = X\, p(\Lambda)\, X^{-1}$ with $p(\Lambda) = \text{diag}(p(\lambda _ 1), \ldots, p(\lambda _ n))$, submultiplicativity gives

\[\|p(A)\| _ 2 \le \|X\| _ 2\, \|X^{-1}\| _ 2\, \|p(\Lambda)\| _ 2 = \kappa _ 2(X)\, \max _ i \vert p(\lambda _ i) \vert .\]

In particular $|I - \alpha A| _ 2 \le \kappa _ 2(X)\, \max _ i \vert 1 - \alpha\lambda _ i \vert $.

Equality (the $\kappa _ 2(X)$ factor becomes $1$) when $A$ is normal — in particular symmetric or Hermitian — since then $A = U \Lambda U^\ast$ with $U$ unitary and $\kappa _ 2(U) = 1$:

\[\|p(A)\| _ 2 = \max _ i \vert p(\lambda _ i) \vert , \qquad \|I - \alpha A\| _ 2 = \max _ i \vert 1 - \alpha\lambda _ i \vert .\]

This is the standard move reducing a matrix contraction factor to a scalar “polynomial small on the spectrum” problem (stationary iterations, GMRES, CG).

Source: Lecture 12, §12.1 of the lecture notes (GMRES convergence, $|p(A)| _ 2 \le \kappa _ 2(X)\max _ {z \in \lambda(A)} \vert p(z) \vert $) and Lecture 13, §13.3 (symmetric case gives equality).

@bite~




Related posts