NLA MT25, Useful miscellany
Flashcards
What is
\[\begin{bmatrix}
I _ m & X \\ 0 & I _ n
\end{bmatrix}^{-1}\]
?
Suppose that $ \vert \vert X \vert \vert < 1$ in a submultiplicative norm. How can you rewrite $(I - X)^{-1}$?
(submultiplicativity is useful so you can say $ \vert \vert X^k \vert \vert \le \vert \vert X \vert \vert ^k \to 0$).
Suppose that $Q \in \mathbb R^{m \times n}$ is orthogonal (note that $Q$ might be rectangular!). What does this mean, and what does this not mean?
This does not mean that
\[Q Q^\top = I _ n\](although this is true for square matrices, and more generally $A^{-1} A = I _ n$ implies $A A^{-1} = I _ n$, it is not true in general for rectangular matrices).
@State a useful identity about the spectral norm of $I - P$ where $P$ is a projector ($P^2 = P$).
Suppose $P \in \mathbb R^{n \times n}$ is a projector $(P^2 = P)$ with $P \ne 0, I$. Then
\[\|I-P\| _ 2 = \|P\| _ 2.\]What are the eigenvalues of the $n \times n$ all-ones matrix $J _ n = \mathbf 1 \mathbf 1^\top$ (every entry equal to $1$)?
$J _ n$ is symmetric and rank-$1$, so it has a single nonzero eigenvalue. The eigenvalues are:
- $n$ (multiplicity $1$), with eigenvector $\mathbf 1 = (1, \ldots, 1)^\top$, since $J _ n \mathbf 1 = n \mathbf 1$.
- $0$ (multiplicity $n - 1$), with eigenspace $\mathbf 1^\perp = {x : \sum _ i x _ i = 0}$.
Rank-$1$ forces a single nonzero eigenvalue, and $\mathrm{Trace}(J _ n) = n$ fixes its value as $n$.
When is $A \in \mathbb F^{n \times n}$ diagonalisable over $\mathbb F$? State the equivalent conditions.
Diagonalisable over $\mathbb F$ means $A = S \Lambda S^{-1}$ for some invertible $S$ and diagonal $\Lambda$ with entries in $\mathbb F$. The following are equivalent:
- $\mathbb F^n$ has a basis of eigenvectors of $A$, i.e. $\bigoplus _ \lambda \ker(A - \lambda I) = \mathbb F^n$.
- The characteristic polynomial splits over $\mathbb F$, and for every eigenvalue the geometric multiplicity equals the algebraic multiplicity.
- The minimal polynomial $m _ A$ is a product of distinct linear factors in $\mathbb F[t]$ (squarefree and split). This is the most usable criterion.
- Every Jordan block of $A$ has size $1$.
Field caveat: over $\mathbb C$ the minimal-polynomial criterion is just “$m _ A$ squarefree”, since splitting is automatic. Over $\mathbb R$ you additionally need all eigenvalues real: a rotation has $m _ A = t^2 + 1$, squarefree but not split over $\mathbb R$, so it is $\mathbb C$-diagonalisable but not $\mathbb R$-diagonalisable. Having $n$ distinct eigenvalues is sufficient but not necessary. Normal (resp. Hermitian/symmetric) matrices satisfy the stronger property of unitary (resp. orthogonal) diagonalisability.
See also other cards on this topic:
- Definition (eigenbasis): Notes - Linear Algebra II HT23, Eigenvectors and eigenvaluesU, “what it means for $T$ to be diagonalisable”.
- Minimal-polynomial criterion and its proof: Notes - Linear Algebra MT23, Primary decomposition theoremU, “deduce from $m _ T(x)$ whether $T$ is diagonalisable”.
- Jordan-form criterion: Notes - Linear Algebra MT23, Jordan normal formU, “tell from the JNF whether diagonalisable” (and the algebraic/geometric multiplicity card).
- Unitary/orthogonal variants: ∆normal-iff-unitarily-diagonalisable, ∆symmetric-iff-orthogonally-diagonalisable.
@State the push-through identity, give a one-line proof, and name its most common application.
For $A \in \mathbb R^{m \times n}$ and $B \in \mathbb R^{n \times m}$ (with the relevant inverses existing):
\[(I _ m + AB)^{-1} A = A (I _ n + BA)^{-1}.\]Proof (one line): from $A(I _ n + BA) = A + ABA = (I _ m + AB)A$, left-multiply by $(I _ m + AB)^{-1}$ and right-multiply by $(I _ n + BA)^{-1}$.
Most common application, the ridge-regression dual: with $A = X^\top$, $B = X$ and $\lambda > 0$, a short rescaling gives
\[(X^\top X + \lambda I _ D)^{-1} X^\top = X^\top (X X^\top + \lambda I _ N)^{-1}.\]For data $X \in \mathbb R^{N \times D}$ the LHS solves a $D \times D$ system and the RHS an $N \times N$ one, so you pick whichever dimension is smaller. The RHS also exposes the inner-product matrix $X X^\top$, which opens the door for kernel methods (∆muffin-worry-push).
@Define the Wirtinger derivatives and explain why they are the right tool for optimising a real-valued function of a complex variable.
Problem: we want to minimise $J(W)$ where $J : \mathbb C \to \mathbb R$, for example $J(W) = \vert 1 - aW \vert ^2$. The ordinary complex derivative $\partial J / \partial W$ does not exist because a non-constant real-valued function of a complex variable cannot satisfy the Cauchy-Riemann equations.
Workaround (Wirtinger): write $W = u + iv$ with $u, v \in \mathbb R$, then formally treat $W = u + iv$ and $W^\ast = u - iv$ as independent variables. Define
\[\frac{\partial}{\partial W} = \frac{1}{2}\left( \frac{\partial}{\partial u} - i \frac{\partial}{\partial v} \right), \qquad \frac{\partial}{\partial W^\ast} = \frac{1}{2}\left( \frac{\partial}{\partial u} + i \frac{\partial}{\partial v} \right).\]Under this convention the usual chain and product rules apply, with $W$ and $W^\ast$ behaving as independent symbols: $\partial W / \partial W = 1$, $\partial W^\ast / \partial W = 0$, $\partial W / \partial W^\ast = 0$, $\partial W^\ast / \partial W^\ast = 1$.
Why it works for optimisation: for real-valued $J$, a critical point in $\mathbb C \cong \mathbb R^2$ requires $\partial J / \partial u = \partial J / \partial v = 0$. These two real conditions combine into the single complex condition
\[\frac{\partial J}{\partial W^\ast} = 0.\]The conjugate condition $\partial J / \partial W = 0$ is equivalent and gives the same critical points. The convention is to use $W^\ast$, matching the fact that $\partial J / \partial W^\ast$ points in the direction of steepest ascent of $J$ in the complex plane.
Practical rule: to differentiate $J(W, W^\ast)$ with respect to $W^\ast$, treat $W$ as constant. The two examples used in the Wiener-filter derivation (∆wiener-derivation):
\[\frac{\partial}{\partial W^\ast} \vert W \vert ^2 = \frac{\partial}{\partial W^\ast}(W W^\ast) = W, \qquad \frac{\partial}{\partial W^\ast} \vert 1 - aW \vert ^2 = \frac{\partial}{\partial W^\ast}((1 - aW)(1 - a^\ast W^\ast)) = -a^\ast(1 - aW).\]Bite-sized
Transpose of a product: $(AB)^\top = $ $B^\top A^\top$. Inverse of a product (when both are invertible): $(AB)^{-1} = $ $B^{-1} A^{-1}$.
For square matrices, $AB = I \implies $ $BA = I$. Non-trivial: square invertibility is two-sided, but only because the matrices are square (and finite-dimensional).
Trace identities: $\mathrm{Trace}(XY) = $ $\mathrm{Trace}(YX)$ whenever both products are square, and for any $B \in \mathbb R^{m \times n}$, $\ \vert B\ \vert _ F^2 = $ $\mathrm{Trace}(B^\top B)$.
For a diagonalisable $A = X \Lambda X^{-1}$ and a polynomial $p$ (e.g. $p(z) = 1 - \alpha z$, so $p(A) = I - \alpha A$), what bounds $|p(A)| _ 2$ in terms of the eigenvalues, and when is it an equality?
Since $p(A) = X\, p(\Lambda)\, X^{-1}$ with $p(\Lambda) = \text{diag}(p(\lambda _ 1), \ldots, p(\lambda _ n))$, submultiplicativity gives
\[\|p(A)\| _ 2 \le \|X\| _ 2\, \|X^{-1}\| _ 2\, \|p(\Lambda)\| _ 2 = \kappa _ 2(X)\, \max _ i \vert p(\lambda _ i) \vert .\]In particular $|I - \alpha A| _ 2 \le \kappa _ 2(X)\, \max _ i \vert 1 - \alpha\lambda _ i \vert $.
Equality (the $\kappa _ 2(X)$ factor becomes $1$) when $A$ is normal — in particular symmetric or Hermitian — since then $A = U \Lambda U^\ast$ with $U$ unitary and $\kappa _ 2(U) = 1$:
\[\|p(A)\| _ 2 = \max _ i \vert p(\lambda _ i) \vert , \qquad \|I - \alpha A\| _ 2 = \max _ i \vert 1 - \alpha\lambda _ i \vert .\]This is the standard move reducing a matrix contraction factor to a scalar “polynomial small on the spectrum” problem (stationary iterations, GMRES, CG).