This is a map of every main result and method in the course. Computer Vision MT25 has two halves welded together by one idea. The first half is classical: an image is a sampled 2D function, and everything (filtering, restoration, features, geometry) is calculus, linear algebra and Fourier analysis on that function. The second half is learned: the hand-crafted pipeline (SIFT descriptor, HOG+SVM detector, intensity-based flow) is replaced end-to-end by a CNN or transformer, but the tasks (classification, detection, segmentation, correspondence, generation) and the evaluation machinery (IoU, AP, FID) are unchanged. The recurring slogan is that old ideas keep coming back: backward-warp interpolation reappears in RoIAlign, the SVD null-space trick solves homographies / the fundamental matrix / camera calibration identically, brightness constancy underpins Lucas–Kanade and optical flow more broadly, and the contrastive/InfoNCE loss is just softmax cross-entropy with similarities as logits.
Images as functions
Image representation
An image is two things at once: a discrete array (H, W, C) indexed in matrix order ([0,0] top-left, first axis = row = $y$), and samples from a continuous 2D function $f(x, y)$. The array is what we implement; the function is what we reason about: it is what lets us write filtering, geometric transformations, derivatives and Fourier analysis as calculus, and it is the representation Lectures 2-3 are built on.
- Two representations∆image-representations-array-vs-function, ∆bite-images-as-continuous-functions-rationale — array vs samples-from-$f(x,y)$; why the functional view is the one we reason with.
- Array conventions (bite)∆bite-image-array-shape-convention, ∆bite-image-indexing-row-column-convention, ∆bite-grayscale-vs-colour-channels, ∆bite-pixel-intensity-range — (H,W,C); BGR/RGB; row-major indexing; $[0,255]$ vs normalised $[0,1]$.
Source: Lecture 2, Images as Pixels / Images as Functions slides.
Sampling, aliasing and the Nyquist–Shannon theorem
Sampling records $f$ at discrete points; reconstruction guesses between them. Undersampling causes aliasing — high-frequency content “travels in disguise” as a lower frequency (the rotating-wheel / moiré / disintegrating-checkerboard phenomenon). The Nyquist–Shannon theorem is the precise cutoff: sample at $\ge 2 f _ {\max}$ for perfect reconstruction. The fix is to low-pass filter before subsampling, which is the single justification behind blur-then-downsample and the entire anti-aliasing story.
- Definitions∆sampling-definition, ∆reconstruction-definition, ∆aliasing-definition, ∆bite-aliasing-masquerade — sampling/reconstruction; aliasing as signal “in disguise”.
- Nyquist–Shannon (+ rate factor)∆nyquist-shannon-theorem, ∆bite-nyquist-rate-factor — $\ge 2 f _ {\max}$; equivalently sample period $\le 1/(2 f _ {\max})$.
- Anti-aliasing∆aliasing-prevention, ∆bite-anti-aliasing-low-pass-rationale, ∆undersampling-visualisation — sample more / low-pass first; why Gaussian is the standard pre-filter.
- Real-world artefacts / interpolation list∆bite-aliasing-real-world-examples, ∆bite-interpolation-methods-list — moiré / false colour / disintegrating texture; NN $\subset$ bilinear $\subset$ bicubic.
Source: Lecture 2, Nyquist–Shannon Sampling Theorem / Aliasing in the Wild slides.
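A minimal sketch of the "travels in disguise" effect, assuming a pure sine and illustrative numbers (a 9 Hz signal sampled at 10 Hz; neither value comes from the lecture). The helper names `sample_sine` and `alias_frequency` are hypothetical.

```python
import math

def sample_sine(freq_hz, rate_hz, n_samples):
    """Samples of sin(2*pi*freq*t) taken at t = k / rate_hz."""
    return [math.sin(2 * math.pi * freq_hz * k / rate_hz) for k in range(n_samples)]

def alias_frequency(freq_hz, rate_hz):
    """Apparent frequency, folded into [0, rate_hz / 2], after sampling."""
    folded = freq_hz % rate_hz
    return min(folded, rate_hz - folded)

# A 9 Hz sine needs a rate of at least 18 Hz; at 10 Hz it is undersampled.
high = sample_sine(9.0, 10.0, 20)
low = sample_sine(1.0, 10.0, 20)

# The 9 Hz samples coincide with a sign-flipped 1 Hz sine, so from the
# samples alone the two signals cannot be told apart.
max_gap = max(abs(h + l) for h, l in zip(high, low))
print(max_gap, alias_frequency(9.0, 10.0))
```

Low-pass filtering before sampling removes the 9 Hz component, so nothing is left to fold down onto 1 Hz.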
Subsampling and interpolation
Naive subsampling (keep only the pixels whose index is a multiple of $2^n$) aliases in high-frequency regions; blur first. Upsampling is interpolation: the bilinear weight identity (midpoint = average of two corners, centre = $(A+B+C+D)/4$, well-defined in either order) generalises to the area-weighted formula $f(x,y)=\sum _ {i,j} w _ {ij}\,f(\text{corner} _ {ij})$, where each weight $w _ {ij}$ is the area of the sub-rectangle opposite that corner. The choice between nearest-neighbour (preserves discrete labels), bilinear (no out-of-range overshoot, cheap), and bicubic ($C^1$, smooth) is task-driven and reappears verbatim in segmentation upsampling and RoIAlign.
- Subsample = blur-then-drop∆naive-subsampling-aliasing-problem, ∆blur-then-subsample-trick, ∆bite-subsample-blur-first — aliasing in high-freq regions; low-pass fix.
- Bilinear interpolation (+ general formula)∆bilinear-cross-derivation, ∆generalised-bilinear-interpolation, ∆bite-bilinear-midpoint-and-centre — corner-average derivation; area-weighted $w _ {ij}$.
- Interpolation properties / method choice∆nearest-neighbour-interpolation, ∆nearest-neighbour-no-new-values, ∆linear-interpolation-range-property, ∆cubic-interpolation-differentiable, ∆bite-bilinear-vs-bicubic, ∆bite-interpolation-method-choice — which scheme for labels / photo upscaling / rotation.
Source: Lecture 2, Subsampling / Bilinear Interpolation / Useful Interpolation Properties slides; Problem Sheet I Q2.
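The corner-average identities can be checked with a small sketch, assuming an image stored as a list of rows and a query point inside the pixel grid. The function name `bilinear` and the 2 x 2 test cell (corners $A,B,C,D = 1,2,3,4$) are illustrative choices, not from the slides.

```python
def bilinear(img, x, y):
    """Bilinearly interpolate img (a list of rows) at column x, row y.

    Assumes 0 <= x <= width - 1, 0 <= y <= height - 1, and an image of at
    least 2 x 2 pixels.
    """
    height, width = len(img), len(img[0])
    x0 = min(int(x), width - 2)    # left column of the enclosing cell
    y0 = min(int(y), height - 2)   # top row of the enclosing cell
    ax, ay = x - x0, y - y0        # fractional offsets in [0, 1]
    # Interpolate along x on the top and bottom edges, then along y.
    top = (1 - ax) * img[y0][x0] + ax * img[y0][x0 + 1]
    bottom = (1 - ax) * img[y0 + 1][x0] + ax * img[y0 + 1][x0 + 1]
    return (1 - ay) * top + ay * bottom

cell = [[1.0, 2.0],   # A B
        [3.0, 4.0]]   # C D
print(bilinear(cell, 0.5, 0.0))  # midpoint of the top edge: (A + B) / 2
print(bilinear(cell, 0.5, 0.5))  # centre: (A + B + C + D) / 4
```

Because every output is a convex combination of the four corners, the result never leaves the range of the input values, which is the no-overshoot property.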
Image transformations: the three-category taxonomy
Every image operation is one of three: pointwise $g=t(f)$ (negation, contrast $af+b$, gamma), which acts on the range; geometric $g(x,y)=f(T(x,y))$ (translation/scale/rotation/shear/homography), which acts on the domain; filtering $g=F(N(x,y))$, which acts on a neighbourhood. This taxonomy organises Lectures 2-3 and is a recurring problem-sheet question. Geometric transforms are applied by a backward warp (iterate over target pixels, look up the source location, interpolate) to avoid the holes/pile-ups of forward warps; in $g(x,y)=f(T(x,y))$ the map $T$ already sends target coordinates to source coordinates, so it is the inverse of the forward motion of the pixels.
- Three categories∆point-geom-filter-distinction, ∆bite-three-image-transformation-categories — pointwise / geometric / filtering; range / domain / neighbourhood.
- Pointwise examples∆gamma-correction-filter, ∆bite-image-negation, ∆bite-contrast-adjustment-affine — $f^\gamma$; $1-f$; $af+b$.
- Warps∆forward-warp-problem, ∆backward-warp-definition, ∆bite-backward-warp-interpolation, ∆bite-forward-vs-backward-warp — holes vs inverse-and-interpolate.
- Optical-axis motions∆optical-axis-rotation-translation, ∆bite-affine-vs-perspective-camera-motions — rotation/zoom about the optical axis are affine; off-axis tilt is a homography.
Source: Lecture 2, Image Transformations / Forward / Backward Warps slides; Lecture 5, Optical Axis slide.
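A minimal sketch of a backward warp, assuming nearest-neighbour sampling for brevity (bilinear lookup would slot in at the same place) and a pixel-centre convention for the 2x scale. `backward_warp` and `inverse_map` are hypothetical names.

```python
import math

def backward_warp(src, inverse_map, out_h, out_w, fill=0):
    """Fill every target pixel by looking up its source location.

    inverse_map(x, y) returns the source (column, row) for target (x, y).
    """
    src_h, src_w = len(src), len(src[0])
    out = [[fill] * out_w for _ in range(out_h)]
    for y in range(out_h):
        for x in range(out_w):
            sx, sy = inverse_map(x, y)
            # Nearest-neighbour lookup: round to the closest source pixel.
            col, row = math.floor(sx + 0.5), math.floor(sy + 0.5)
            if 0 <= row < src_h and 0 <= col < src_w:
                out[y][x] = src[row][col]
    return out

src = [[1, 2, 3],
       [4, 5, 6]]

# Shift right by one pixel: the source of target x is x - 1.
shifted = backward_warp(src, lambda x, y: (x - 1, y), 2, 3)

# Enlarge 2x: every target pixel receives a value, so no holes appear.
enlarged = backward_warp(
    src, lambda x, y: ((x + 0.5) / 2 - 0.5, (y + 0.5) / 2 - 0.5), 4, 6)
```

A forward warp of the same 2x scale would write only every second target pixel; iterating over the target grid instead is what guarantees full coverage.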
Affine transforms and homogeneous coordinates
Embedding $(x,y)\mapsto(x,y,1)$ lets translation, scaling, rotation and shear all become a single $3\times3$ matrix multiply, so transforms compose by matrix product (non-commutative — order matters). Affine maps (6 DoF) preserve collinearity, parallelism and convexity; a homography (8 DoF, defined up to scale) drops the bottom row’s $(0,0,1)$, gaining a perspective term that maps points-at-infinity to finite vanishing points so parallelism is not preserved. Point–line duality via the cross product ($\ell=\mathbf p _ 1\times\mathbf p _ 2$, intersection $=\ell _ 1\times\ell _ 2$) and the homogeneous-system SVD trick recur in SIFT, MVG and camera calibration.
- Homogeneous coords / affine matrices∆homogeneous-coordinates-representation, ∆translation-affine-matrix, ∆uniform-scaling-affine-matrix, ∆rotation-affine-matrix, ∆horizontal-shearing-affine-matrix, ∆vertical-shearing-affine-matrix, ∆bite-affine-transformation-matrices, ∆bite-2d-rotation-matrix — the explicit $3\times3$ forms.
- Preserved properties / DoF∆affine-transformation-preservation, ∆bite-affine-preserved-properties, ∆bite-homography-vs-affine-dof, ∆bite-affine-vs-homography-parallelism — collinearity/parallelism/convexity; 6 vs 8 DoF; why parallelism fails under a homography.
- Composition order / points at infinity∆bite-affine-composition-order, ∆bite-points-at-infinity, ∆bite-homogeneous-to-euclidean — $RT\ne TR$; $z=0$ ideal points; divide by last entry.
- Homography solve∆homography-definition, ∆homography-min-correspondences, ∆homography-linear-system-solution, ∆homography-image-overlay-example, ∆backward-warp-via-inverse — 4 correspondences (no 3 collinear); stack-and-SVD null space.
- Cross product / point–line duality∆cross-product, ∆line-in-homogeneous-coordinates, ∆homogeneous-equivalence-notation, ∆bite-projective-point-line-cross-product, ∆bite-svd-trick-for-homogeneous-systems — $[a] _ \times$; $\ell^\top x=0$; the universal $A\pmb p=0$, $\Vert \pmb p\Vert =1$ recipe.
Source: Lecture 2, Affine Transformation Examples / Combining Transformations slides; Lecture 5, Homography slides; Problem Sheet II Q1.
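The stack-and-SVD recipe for a homography can be sketched as follows, assuming NumPy and exact (noise-free) correspondences. The function names, the test matrix `H_true` and the unit-square points are all illustrative; the final division by `H[2, 2]` assumes that entry is non-zero.

```python
import numpy as np

def fit_homography(src, dst):
    """Estimate H (3x3, up to scale) with dst ~ H @ src from N >= 4 point pairs."""
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        # Each correspondence gives two linear equations in the 9 entries of H.
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    A = np.asarray(rows, dtype=float)
    # The unit vector minimising |A h| is the last right singular vector.
    _, _, vt = np.linalg.svd(A)
    H = vt[-1].reshape(3, 3)
    return H / H[2, 2]  # fix the free scale (assumes H[2, 2] != 0)

def apply_homography(H, pts):
    """Map (N, 2) points through H and divide by the last coordinate."""
    pts_h = np.hstack([np.asarray(pts, dtype=float), np.ones((len(pts), 1))])
    mapped = pts_h @ H.T
    return mapped[:, :2] / mapped[:, 2:3]

H_true = np.array([[1.2, 0.1, 0.5],
                   [-0.05, 0.9, -0.3],
                   [0.1, 0.2, 1.0]])
src = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
dst = apply_homography(H_true, src)
H_est = fit_homography(src, dst)
```

The same pattern (build $A$, take the last right singular vector) applies to the other homogeneous systems in the course; only the rows of $A$ change.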
Convolution
Discrete convolution $(f\ast g)[x]=\sum _ u f[u]\,g[x-u]$ is the implementation of every filter; with finite support a centring shift $\tfrac{M-1}{2}$ aligns the kernel on the output pixel — which forces odd kernel sizes ($3,5,7$) and symmetric padding, the convention CNNs inherit. It is commutative (hence the right-to-left flip), costs $O(N^2 M^2)$ in 2D naively, and is reduced to $O(N^2\log N)$ by the FFT (next section).
- 1D / finite-support / 2D definitions∆1d-convolution-definition, ∆finite-support-1d-convolution, ∆2d-convolution-definition, ∆convolution-example-computation, ∆discrete-function-bracket-notation — the sums; worked example.
- Centring / odd kernels / commutativity∆bite-convolution-centring-shift, ∆bite-odd-kernel-size-justification, ∆bite-convolution-commutative, ∆bite-convolution-kernel-vs-input-naming — why the $\tfrac{M-1}{2}$ shift and odd $M$.
- Boundary / cost∆bite-convolution-boundary-padding, ∆bite-convolution-naive-cost — zero-pad/reflect/wrap; $O(NM)$ (1D), $O(N^2M^2)$ (2D).
Source: Lecture 2, Discrete Convolution (with Finite Support) slides; Problem Sheet I Q3.
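A minimal sketch of the finite-support sum with the $\tfrac{M-1}{2}$ centring shift, assuming zero padding at the boundary and an odd-length kernel. `conv1d_same` is a hypothetical helper name.

```python
def conv1d_same(f, g):
    """Centred 1D convolution of signal f with an odd-length kernel g.

    Computes out[x] = sum_k g[k] * f[x + c - k] with c = (M - 1) / 2,
    treating samples outside f as zero (zero padding).
    """
    M = len(g)
    if M % 2 == 0:
        raise ValueError("kernel length must be odd so it has a centre tap")
    c = (M - 1) // 2
    out = []
    for x in range(len(f)):
        acc = 0
        for k in range(M):
            u = x + c - k          # index into f paired with kernel tap k
            if 0 <= u < len(f):    # zero padding: skip out-of-range samples
                acc += f[u] * g[k]
        out.append(acc)
    return out

# An impulse reproduces the kernel, centred on the impulse and unflipped;
# correlation would instead give the mirrored [0, 3, 2, 1, 0].
impulse_response = conv1d_same([0, 0, 1, 0, 0], [1, 2, 3])

# An unnormalised 3-tap box sum; the borders are smaller due to zero padding.
box_sum = conv1d_same([1, 2, 3, 4, 5], [1, 1, 1])
```

The double loop makes the $O(NM)$ cost of the 1D case explicit.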
Filtering: blur, Gaussian, bilateral, median
A filter replaces a pixel by a function of its neighbourhood. The box (mean) blur is energy-preserving but blocky (sinc spectrum, ringing); the Gaussian blur weights by distance, is separable ($O(k)$ rather than $O(k^2)$ work per pixel for a $k\times k$ kernel), and has a sidelobe-free Gaussian spectrum, which is why it is the default low-pass. The bilateral filter multiplies the spatial Gaussian by an intensity Gaussian to get edge-preserving smoothing; the median filter is non-linear (not a convolution) and the standard salt-and-pepper / outlier remover.
- Blur / Gaussian / box∆simple-blur-formula, ∆gaussian-blur-formula, ∆box-blur-kernel, ∆bite-gaussian-vs-box-blur, ∆bite-mean-filter-normalisation, ∆bite-gaussian-separability — why Gaussian beats box; separability; $\sum f=1$.
- Bilateral∆bilateral-filter-formula, ∆bilateral-vs-gaussian-visualisation, ∆bite-bilateral-filter-sigma-parameters — $w _ g w _ s$; $\sigma _ g$ extent vs $\sigma _ s$ edge-sensitivity.
- Median∆median-filtering-outlier-removal, ∆simple-vs-gaussian-blur-visualisation, ∆bite-median-filter-nonlinear-outliers — non-linear; impulse-noise removal.
Source: Lecture 2, Averaging / Gaussian Blur / Bilateral Filter / Median Filtering slides.
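A small 1D sketch of why the median is the outlier remover and the mean is not, assuming a window that is simply truncated at the borders. The signal values and the helper `sliding_filter` are made up for illustration.

```python
from statistics import median

def sliding_filter(signal, size, reduce_fn):
    """Apply reduce_fn to each window of width `size` (truncated at borders)."""
    half = size // 2
    out = []
    for i in range(len(signal)):
        lo, hi = max(0, i - half), min(len(signal), i + half + 1)
        out.append(reduce_fn(signal[lo:hi]))
    return out

def mean(values):
    return sum(values) / len(values)

noisy = [10, 10, 10, 255, 10, 10, 10]   # one salt-noise impulse

median_out = sliding_filter(noisy, 3, median)   # impulse removed entirely
mean_out = sliding_filter(noisy, 3, mean)       # impulse smeared over 3 samples
print(median_out)
print(mean_out)
```

The median discards the outlier because it never becomes the middle value of any window, whereas the linear filter spreads its energy over the neighbours.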
Frequency domain
The Fourier transform (1D, DFT, 2D)
$F(u)=\langle f,\psi _ u\rangle$ writes any signal as a weighted combination of the orthonormal complex-exponential basis $\psi _ u(t)=e^{i2\pi ut}$; $ \vert F(u) \vert $ is the amplitude, $\arg F(u)$ the phase. The DFT is the matrix multiply $F=Uf$ (periodic output); the 2D basis $e^{i2\pi(ux+vy)}$ gives the magnitude plots where centre = low frequency, radius = high frequency, and a vertical image edge appears as a horizontal spectral streak. Translation leaves $ \vert F \vert $ unchanged (only phase), rotation rotates it, scaling inverse-scales it.
- 1D FT / basis / amplitude–phase∆fourier-transform-1d-definition, ∆fourier-basis-orthonormality, ∆fourier-as-weighted-basis, ∆fourier-amplitude-and-phase, ∆fourier-real-function-symmetries, ∆bite-eulers-formula — $\langle f,\psi _ u\rangle$; $A= \vert F \vert $, $\phi=\arg F$; real-signal conjugate symmetry.
- DFT∆discrete-fourier-transform-definition, ∆dft-as-matrix-multiplication, ∆dft-periodicity — $F=Uf$, $f=\tfrac1N U^\ast F$; periodic.
- 2D FT / interpreting magnitude∆2d-fourier-basis-function, ∆2d-fourier-basis-visualisation, ∆2d-fourier-magnitude-plot, ∆fourier-example-rectangle, ∆fourier-example-grid-pattern, ∆bite-fourier-spectrum-frequency-vs-radius, ∆bite-fourier-perpendicular-orientation, ∆bite-fourier-dc-component, ∆bite-natural-vs-manmade-power-spectra — radius = frequency; perpendicular orientation; DC = mean; $1/f^2$ natural spectra.
- Transform behaviour∆bite-fourier-transform-translation-rotation-scaling, ∆bite-fourier-of-delta-translation — translation = phase only; rotation rotates; scaling inverse-scales.
Source: Lecture 3, The Fourier Transform / DFT / 2D Basis Functions / Interpreting DFTs slides.
Convolution theorem and frequency-domain filtering
$\mathcal F\{f\ast g\}=\mathcal F\{f\}\,\mathcal F\{g\}$: spatial convolution = pointwise frequency multiplication, so a large-kernel convolution drops from $O(N^2)$ to $O(N\log N)$ via the FFT (CUDA picks the cheaper route automatically). It explains box-vs-Gaussian ringing (sinc sidelobes vs clean Gaussian envelope), enables low/high-pass filtering by masking the spectrum, periodic-noise removal by notching out spectral peaks (the 1966 lunar-orbiter scan-lines), and the hybrid-image construction.
- Theorem / FFT speed-up∆convolution-theorem, ∆convolution-theorem-fft-speedup, ∆fft-time-complexity, ∆bite-fft-convolution-asymptotic-speedup — $O(N^2)\to O(N\log N)$.
- Box vs Gaussian ringing∆bite-box-vs-gaussian-ringing-fourier — sinc sidelobes cause Gibbs ringing.
- Low/high-pass, notch, hybrid∆low-high-pass-filter-via-fft, ∆periodic-pattern-removal-fft, ∆bite-periodic-noise-removal-recipe, ∆bite-hybrid-images-oliva-torralba — mask the spectrum; notch periodic peaks; Einstein/Marilyn.
Source: Lecture 3, Convolution Theorem / Removing Periodic Patterns / Hybrid Images slides.
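The theorem can be checked numerically with a short sketch, assuming NumPy and 1D signals. Note that for the DFT the product of spectra corresponds to circular (wrap-around) convolution; zero-padding both inputs is the usual way to obtain linear convolution. The function names are hypothetical.

```python
import numpy as np

def circular_convolve_direct(f, g):
    """O(N^2) circular convolution: sum_u f[u] * g[(x - u) mod N]."""
    n = len(f)
    return np.array([sum(f[u] * g[(x - u) % n] for u in range(n))
                     for x in range(n)])

def circular_convolve_fft(f, g):
    """O(N log N) route: multiply the spectra pointwise, then invert."""
    return np.real(np.fft.ifft(np.fft.fft(f) * np.fft.fft(g)))

rng = np.random.default_rng(0)
f = rng.normal(size=16)
g = rng.normal(size=16)

direct = circular_convolve_direct(f, g)
via_fft = circular_convolve_fft(f, g)
print(np.max(np.abs(direct - via_fft)))
```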
Image restoration
Degradation model and naive deconvolution
Restoration (unlike enhancement) models the degradation: $g = d\ast f + n$ with $d$ the point-spread function. Naive Fourier-domain deconvolution $\hat F = G/D$ catastrophically amplifies noise because $D$ is tiny at high frequencies — the canonical “what went wrong” failure (the deblurred image is pure high-frequency grid noise).
- Setup / PSF / naive failure∆restoration-vs-enhancement, ∆image-degradation-model, ∆naive-deconvolution-via-fft, ∆naive-deconvolution-noise-amplification, ∆bite-point-spread-function-terminology, ∆bite-naive-deconvolution-noise-amplification-why, ∆bite-degradation-types — $g=d\ast f+n$; division-by-tiny-$D$ blows up noise.
Source: Lecture 4, Modelling Degradation / What went wrong? slides.
The Wiener filter
The Wiener filter $W=\dfrac{D^\ast S}{ \vert D \vert ^2 S + K}$, with $S$ and $K$ the signal and noise power spectra, is the MMSE linear estimator $\min _ W\mathbb E\Vert f-\hat f\Vert ^2$; in its inverse-filter-with-shrinkage form $W=\dfrac1D\cdot\dfrac{ \vert D \vert ^2}{ \vert D \vert ^2+K/S}$ it acts like $1/D$ where SNR is high and shrinks toward zero where noise dominates, exactly de-emphasising the high frequencies that wreck naive deconvolution (natural images’ $1/f^2$ spectrum makes this an automatic win). Motion blur is a line-segment PSF removed by rotate-to-horizontal → estimate length → build PSF → Wiener.
- Definition / SNR form / derivation∆wiener-filter-definition, ∆wiener-filter-snr-form, ∆wiener-derivation, ∆bite-wiener-shrinkage-interpretation, ∆bite-wiener-versus-naive-natural-images — MMSE objective; inverse-filter × shrinkage.
- Limitations / motion blur∆wiener-filter-deblurring-visualisation, ∆motion-blur-line-segment-filter, ∆motion-blur-visualisation, ∆motion-blur-removal-algorithm, ∆bite-wiener-filter-limitations, ∆bite-motion-blur-removal-rotate-trick — boundary artefacts, unknown PSF; the rotate trick.
Source: Lecture 4, The Wiener Filter / Motion Blur and WFs slides; Problem Sheet I Q4.
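A 1D sketch contrasting naive division with Wiener shrinkage, assuming NumPy, a circular Gaussian PSF, white noise, and the common simplification of a constant noise-to-signal ratio `nsr`, i.e. $W = D^\ast/(\vert D\vert^2 + \text{nsr})$. The lecture's parametrisation may differ; the signal, `sigma`, noise level and `nsr` values are illustrative.

```python
import numpy as np

def gaussian_psf(n, sigma):
    """Circularly wrapped, unit-sum Gaussian PSF of length n centred on index 0."""
    t = np.arange(n)
    t = np.minimum(t, n - t)            # wrap-around distance from index 0
    d = np.exp(-0.5 * (t / sigma) ** 2)
    return d / d.sum()

def naive_deconvolve(g, D):
    """Divide by the blur spectrum: amplifies noise wherever |D| is tiny."""
    return np.real(np.fft.ifft(np.fft.fft(g) / D))

def wiener_deconvolve(g, D, nsr):
    """Inverse filter with shrinkage, using a constant noise-to-signal ratio."""
    W = np.conj(D) / (np.abs(D) ** 2 + nsr)
    return np.real(np.fft.ifft(np.fft.fft(g) * W))

rng = np.random.default_rng(0)
n = 128
f = np.zeros(n)
f[32:96] = 1.0                                   # clean step signal
D = np.fft.fft(gaussian_psf(n, sigma=2.0))       # blur transfer function
g = np.real(np.fft.ifft(np.fft.fft(f) * D)) + rng.normal(0.0, 0.01, n)

mse_naive = np.mean((naive_deconvolve(g, D) - f) ** 2)
mse_wiener = np.mean((wiener_deconvolve(g, D, nsr=1e-2) - f) ** 2)
print(mse_naive, mse_wiener)
```

Even with a noise standard deviation of only 0.01, the naive estimate is dominated by amplified high-frequency noise, while the Wiener estimate stays close to the step.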
Generative (inverse-problem) degradation model
Generalising to $g = Af + n$, restoration becomes $\hat f=\arg\min _ f\Vert g-Af\Vert ^2+\lambda p(f)$ with a prior $p$; the TV ($L _ 1$ gradient) prior preserves edges where Tikhonov ($L _ 2$) would blur them. The same template gives super-resolution (multiple sub-pixel-shifted images densify the sampling, a Nyquist argument; the Mars-lander example) and blind deblurring (joint $\min _ {f,h}$ over the image and the blur kernel $h$).
- Inverse problem / TV prior∆generative-degradation-model, ∆generative-degradation-recovery, ∆tv-smoothness-regulariser, ∆bite-tv-vs-tikhonov-regulariser — $\min\Vert g-Af\Vert ^2+\lambda p(f)$; $L _ 1$ edges vs $L _ 2$ blur.
- Super-resolution / blind deblurring∆estimating-degradation-matrix-A, ∆bite-super-resolution-shannon-rationale, ∆bite-blind-deblurring-joint-optimisation — multi-image Nyquist densification; joint $\min _ {f,h}$.
Source: Lecture 4, Inverse Problem / Super Resolution / Blind Deblurring slides.
Features, correspondences and SIFT
The correspondence problem
“The three most important problems in computer vision: correspondence, correspondence, correspondence” (Kanade). Direct image subtraction fails under viewpoint change; matching needs local descriptors robust to small geometric/photometric changes, then geometric verification (RANSAC) to keep only matches consistent with one global transform. Optical flow, stereo, tracking, retrieval and multi-view 3D are all correspondence problems sharing this machinery.
- Definition / why subtraction fails / verification∆correspondence-problem-definition, ∆bite-kanade-correspondence-quote, ∆bite-image-subtraction-fails-for-correspondences, ∆bite-brute-force-feature-matching, ∆bite-geometric-verification-ransac, ∆bite-correspondence-applications-list — brute-force NN matching + RANSAC; the shared-problem list.
Source: Lecture 5, Finding Correspondences / Feature Matching slides.
SIFT: detection, description, matching
Three stages: (1) detect keypoints as scale-space extrema of the Laplacian-of-Gaussian (a blob detector) over $(x,y,\sigma)$; (2) describe each by rotating to its dominant orientation then concatenating $4\times4\times8=128$ edge-orientation histograms — making it rotation/scale/translation/illumination-robust; (3) match by nearest-descriptor. A good keypoint is a blob/corner (the Christ Church A/B/C/D worked example); SIFT (Lowe 2004, ~77k cites) is the archetypal hand-crafted representation that LIFT later learns.
- Pipeline / keypoint quality∆sift-three-stages, ∆keypoint-stability-properties, ∆sift-planar-assumption, ∆bite-sift-citation-and-impact, ∆bite-good-keypoint-properties — three stages; what makes a stable keypoint.
- Detection (LoG / scale)∆keypoint-scale-definition, ∆sift-keypoint-energy-minima, ∆sift-energy-from-log-convolution, ∆laplacian-of-gaussian-definition, ∆log-1d-visualisation, ∆bite-log-as-blob-detector, ∆bite-characteristic-scale-via-log-extremum — LoG blob detector; $3\times3\times3$ extremum in $(x,y,\sigma)$.
- Descriptor∆sift-descriptor-construction, ∆bite-sift-descriptor-dimension, ∆bite-sift-rotation-invariance — 128-D orientation histogram; rotate-to-dominant for invariance.
- Matching∆brute-force-sift-matching — nearest-Euclidean, top-$N$.
Source: Lecture 5, Scale Invariant Feature Transform / Keypoints / SIFT Descriptor slides.
Visual words and bag-of-words retrieval
Cluster the SIFT descriptors of a corpus with $k$-means into $K$ “visual words”; each image becomes a sparse $K$-bin histogram. Brute-force feature matching of $10^{10}$ images is $\sim10^{18}$ ops (infeasible); BoW reduces image comparison to a sparse $O(K)$ dot product, making web-scale retrieval (precompute vocabulary + histograms, query → histogram → cosine top-$M$ → optional geometric re-rank) feasible.
- Visual words / retrieval pipeline∆visual-words-technique, ∆visual-words-image-retrieval, ∆bite-bag-of-visual-words-retrieval-pipeline, ∆bite-brute-force-matching-cost-blow-up — $k$-means vocabulary; the offline/online pipeline; the $10^{18}$-ops motivation.
Source: Lecture 5, K-means Clustering / Visual Words / Image Retrieval slides; cf. Notes - Machine Learning MT23, k-means clustering.
Classification and classical ML
Image classification: embeddings, $k$-NN, the two-step pipeline
Classical recognition is two steps: compute an embedding $\phi:\mathbb R^{H\times W\times3}\to\mathbb R^d$ (FFT / BoW / HOG) then learn a classifier (SVM / kernel SVM / random forest) on it. Deep learning fuses the two steps and learns them jointly, which is the 2012 AlexNet ImageNet jump. $k$-NN is the strong no-training baseline (naturally multi-class, single hyperparameter $k$, improves with data); accuracy is misleading under class imbalance, motivating precision/recall.
- Two-step pipeline / embeddings∆feature-extractor-definition, ∆image-embeddings-and-classifiers-examples, ∆bite-two-step-image-classification, ∆bite-imagenet-2012-deep-learning-takeover, ∆bite-cifar10-spec — embed-then-classify; AlexNet 2012; CIFAR-10.
- $k$-NN∆knn-classification-algorithm, ∆knn-k-accuracy-tradeoff, ∆bite-knn-strengths-as-baseline — majority vote; the $k$ trade-off (cf. Notes - Machine Learning MT23, k-nearest neighbours).
- Accuracy / when it misleads∆accuracy-and-expected-accuracy, ∆bite-accuracy-misleading-imbalanced-classes — the 90%-majority trap.
Source: Lecture 6, Image Classification in 2 Steps / Nearest Neighbour / Accuracy slides.
Linear SVM, softmax and cross-entropy
The maximum-margin SVM places the boundary as far from the data as possible; the hinge loss $\max(0,1-y f)$ is a convex surrogate for the 0-1 loss, and together with an $L _ 2$ regulariser it gives the standard formulation, whose solution depends only on a sparse set of support vectors. Softmax (with temperature $\tau$: $\tau\to0$ argmax, $\tau\to\infty$ uniform) turns scores into a normalised probability distribution; cross-entropy collapses to $-f _ {GT}+\log\sum _ j e^{f _ j}$ for one-hot targets, the loss-design backbone of every classifier in the course.
- Max-margin SVM / hinge∆bite-svm-max-margin-and-support-vectors, ∆bite-svm-hinge-loss-and-l2, ∆bite-hinge-loss-svm-surrogate, ∆bite-multiclass-linear-classifier-vector-form — margin $2/\Vert w\Vert $; hinge as convex surrogate (full theory in Notes - Machine Learning MT23, Support vector machines).
- Softmax / temperature / cross-entropy∆bite-softmax-with-temperature, ∆softmax-with-temperature-properties, ∆bite-cross-entropy-simplification, ∆bite-cross-entropy-numerical-stability, ∆one-vs-all-softmax-combination — $\tau$ limits; one-hot collapse; log-sum-exp stability; 1-vs-all calibration caveat.
Source: Lecture 6, Maximum Margin / Soft-max / Cross-Entropy Loss slides; Problem Sheet II Q2-Q3; cf. Notes - Machine Learning MT23, Cross-entropy loss.
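A minimal sketch of temperature softmax and the one-hot cross-entropy collapse $-f _ {GT}+\log\sum _ j e^{f _ j}$, using the usual subtract-the-max trick for stability. The function names and example scores are illustrative.

```python
import math

def softmax(scores, tau=1.0):
    """Temperature softmax; subtracting the max keeps exp() from overflowing."""
    scaled = [s / tau for s in scores]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(scores, gt):
    """One-hot cross-entropy: -f_GT + log sum_j exp(f_j), via stable log-sum-exp."""
    m = max(scores)
    log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scores))
    return -scores[gt] + log_sum_exp

scores = [2.0, 1.0, 0.1]
sharp = softmax(scores, tau=0.01)   # approaches one-hot argmax
flat = softmax(scores, tau=1e6)     # approaches the uniform distribution
loss = cross_entropy(scores, gt=0)  # equals -log(softmax(scores)[0])
big = cross_entropy([1000.0, 0.0], gt=0)  # no overflow despite exp(1000)
```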
Precision, recall and average precision
Precision $=\mathrm{TP}/(\mathrm{TP}+\mathrm{FP})$, recall $=\mathrm{TP}/(\mathrm{TP}+\mathrm{FN})$ trade off as the confidence threshold sweeps, tracing the PR curve; AP is the area under it, and mAP the triple average over recall levels, IoU thresholds and classes. AP ignores true negatives, so it survives the heavy class imbalance of detection where accuracy collapses — the evaluation engine for detection and segmentation.
- Definitions / confusion matrix∆precision-recall-definition, ∆confusion-matrix-error-rates, ∆bite-tp-fp-fn-tn-stakeholder-framing — TP/FP/FN/TN; FPR/FNR/error-rate.
- PR curve / AP / mAP triple average∆precision-recall-curve-explanation, ∆ap-metric-definition, ∆ap-triple-average-definition, ∆bite-precision-recall-tradeoff, ∆bite-pr-curve-sweep-parameter, ∆bite-ap-vs-accuracy-for-detection, ∆bite-ap-innermost-average-meaning, ∆bite-map-triple-average-formula — average over recall × IoU × class; why AP beats accuracy.
Source: Lecture 6, Precision and Recall slide; Lecture 10, Average Precision slides.
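A sketch of the threshold sweep and the area under the PR curve for a single class. AP has several conventions (11-point, all-point interpolated); this version assumes the plain, non-interpolated sum of precision times recall increment, which may not be the exact variant used in the lectures. The scores and labels are made up.

```python
def precision_recall_curve(scores, labels):
    """Sweep the threshold down the ranked list; return (precision, recall) pairs."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    n_pos = sum(labels)
    tp = fp = 0
    curve = []
    for i in order:
        if labels[i]:
            tp += 1
        else:
            fp += 1
        curve.append((tp / (tp + fp), tp / n_pos))
    return curve

def average_precision(scores, labels):
    """Area under the PR curve as sum of precision * (increase in recall)."""
    ap, prev_recall = 0.0, 0.0
    for precision, recall in precision_recall_curve(scores, labels):
        ap += precision * (recall - prev_recall)
        prev_recall = recall
    return ap

scores = [0.9, 0.8, 0.7, 0.6]
ap_mixed = average_precision(scores, [1, 0, 1, 0])    # (1 + 2/3) / 2 = 5/6
ap_perfect = average_precision(scores, [1, 1, 0, 0])  # all positives ranked first
```

True negatives never enter either quantity, which is why the metric is insensitive to the number of easy background examples.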
Deep networks and training
Neural-network building blocks
Three primitives recur in every modern architecture: residual connections $y=f(x)+x$ (learn the change, gradient flows through the identity), normalisation (BatchNorm rescales by batch statistics — train/test asymmetry is a notorious bug source; LayerNorm/Instance/Group differ only in which axes they average, and transformers use LayerNorm for batch-independence), and transfer learning (pretrain on a large dataset, fine-tune with a lower LR / frozen backbone — justified by the universality of first-layer filters).
- Residual connections∆residual-connections-rationale, ∆bite-residual-connection-purpose — learn $f(x)=y-x$; $\partial y/\partial x=I+\partial f/\partial x$.
- BatchNorm∆batch-normalisation, ∆batchnorm-placement, ∆batchnorm-test-time-statistics, ∆batchnorm-bug-source, ∆batchnorm-relu-zero-activations, ∆bite-batchnorm-formula, ∆bite-batchnorm-train-vs-test, ∆bite-batchnorm-ema-at-test, ∆bite-batchnorm-fused-at-test, ∆bite-batchnorm-needs-large-batches, ∆bite-batchnorm-can-learn-identity, ∆bite-relu-half-zero-after-bn — formula, EMA at test, fuse-into-conv, can learn identity.
- Normalisation-layer family∆normalisation-layers-comparison, ∆bite-normalisation-layer-axes, ∆bite-layernorm-vs-batchnorm-in-transformers — which axes Batch/Layer/Instance/Group average; why transformers use LayerNorm.
- Transfer learning∆transfer-learning-definition, ∆bite-finetuning-lower-lr, ∆bite-first-layer-filters-task-independent, ∆bite-transfer-learning-recipe — lower LR / freeze backbone; Gabor-like first layer is task-independent.
Source: Lecture 8, Residual Connections / Batch-Normalisation / Transfer Learning slides; cf. Notes - Machine Learning MT23, Neural networks.
Optimisers and loss design
SGD-with-momentum $\Delta w _ t=\rho\Delta w _ {t-1}-\lambda g _ t$ accelerates along consistent gradients and damps oscillations; Adam adds per-coordinate scaling via bias-corrected first/second moments. Loss design: $L _ 1$ (robust, constant gradient) vs $L _ 2$ (outlier-sensitive); smooth-$L _ 1$ glues the $L _ 2$ centre to the $L _ 1$ tail (the default box-regression loss); multi-task losses sum task losses with a balancing $\lambda$ (the R-CNN classification + regression template).
- SGD-momentum / Adam∆sgd-with-momentum, ∆adam-update-rule, ∆bite-sgd-momentum-update, ∆bite-adam-update-rules — momentum; Adam’s $m _ t,v _ t$, bias correction, per-coordinate LR.
- Regression / smooth-$L _ 1$ / multi-task∆smooth-l1-loss-definition, ∆multi-task-loss-r-cnn, ∆bite-l1-vs-l2-regression-loss, ∆bite-smooth-l1-rationale, ∆bite-multi-task-loss-structure — $L _ 1$ vs $L _ 2$; smooth-$L _ 1$ continuity; $\mathcal L _ A+\lambda\mathcal L _ B$.
Source: Lecture 7, Optimization Methods / Loss Function — Types slides; Lecture 10, Multi-task Loss slide.
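A sketch of the momentum update $\Delta w _ t=\rho\Delta w _ {t-1}-\lambda g _ t$ and of smooth-$L _ 1$. The `beta` transition point follows a common parametrisation (as in PyTorch's `SmoothL1Loss`) and is an assumption here, as are the toy loss $w^2$ and the hyperparameter values.

```python
def smooth_l1(x, beta=1.0):
    """Quadratic for |x| < beta, linear beyond; the two pieces meet continuously."""
    ax = abs(x)
    if ax < beta:
        return 0.5 * x * x / beta
    return ax - 0.5 * beta

def sgd_momentum_step(w, velocity, grad, lr=0.1, rho=0.9):
    """One update: velocity = rho * velocity - lr * grad; w = w + velocity."""
    velocity = rho * velocity - lr * grad
    return w + velocity, velocity

# Minimise the toy loss w**2 (gradient 2 * w) starting from w = 1.
w, v = 1.0, 0.0
for _ in range(300):
    w, v = sgd_momentum_step(w, v, grad=2.0 * w)

print(smooth_l1(0.5), smooth_l1(2.0), w)
```

For large residuals the gradient of `smooth_l1` has magnitude 1, so a single outlier box cannot dominate the update the way it would under $L _ 2$.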
Regularisation, learning curves and the bias–variance frame
Dropout zeros neurons with $p\approx0.5$ at train time (rescaled, off at test) preventing co-adaptation — an ensemble-of-subnetworks view. $L _ 1$/$L _ 2$ penalties and early stopping are the other regularisers. Learning curves are read in the bias–variance frame: the human↔train gap is bias (fix: bigger model / longer), train↔val is variance (fix: more data / augmentation / regularisation); a separate trainval set diagnoses distribution shift; characteristic LR-vs-loss shapes (explode / plateau / stuck / good) are the standard diagnostic.
- Dropout∆dropout-definition-and-purpose, ∆bite-dropout-probability, ∆bite-dropout-placement, ∆bite-dropout-train-vs-test, ∆bite-dropout-co-adaptation, ∆bite-dropout-citation — $p=0.5$; train/test asymmetry; co-adaptation/ensemble.
- $L _ 1$/$L _ 2$ / early stopping∆bite-l1-l2-regularisation, ∆bite-early-stopping-as-regulariser, ∆bite-manual-lr-annealing-on-plateau — $\tfrac12\alpha w^2$ vs $\alpha \vert w \vert $; stop on val-loss upturn.
- Bias–variance / curve diagnosis∆bite-bias-variance-train-val-gap-framework, ∆overfitting-diagnosis, ∆variance-gap-fixes, ∆bias-gap-fixes, ∆trainval-vs-val-purpose, ∆bite-trainval-distribution-shift-diagnosis, ∆bite-high-variance-mitigations, ∆exploding-gradients-diagnosis, ∆low-learning-rate-diagnosis, ∆high-lr-stuck-in-local-min-diagnosis, ∆bite-learning-rate-curves-diagnosis, ∆bite-exploding-gradient-curve — human/train = bias, train/val = variance; LR-curve shapes.
Source: Lecture 7, Dropout / Regularization / Validation vs Training / Learning Rate Selection slides.
Convolutional networks
CNN architecture and the receptive field
CNNs build invariance to shift/scale/small distortion from three properties: local connections, weight sharing, spatial subsampling. The output-size formula $h _ \text{out}=\frac{h _ \text{in}-h+2p}{s}+1$, $K$ filters → $K$ output channels and $K$ biases, $h\cdot w\cdot d$ weights per filter, and the receptive-field recurrence $r _ \ell=r _ {\ell-1}+(k _ \ell-1)j _ {\ell-1}$, $j _ \ell=j _ {\ell-1}s _ \ell$ are the examinable bookkeeping. Pooling adds parameter-free shift-invariance; a non-linearity between convs is essential or the stack collapses to one linear map.
- Invariance mechanisms / architecture∆cnn-invariance-motivation, ∆cnn-invariance-mechanisms, ∆cnn-architecture-composition, ∆fc-as-convolution — local + shared + subsample; FC as full-size conv.
- Conv layer sizing∆conv-output-dim-formula, ∆conv-output-dim-example, ∆conv-zero-padding-size, ∆conv-layer-hyperparameters, ∆conv-bias-per-filter, ∆bite-conv-layer-parameter-count — the size formula; padding; per-filter bias; parameter count.
- Pooling / non-linearity∆pooling-output-dim-formula, ∆pooling-overlap-condition, ∆bite-pooling-shift-invariance, ∆bite-why-nonlinearity-between-convs, ∆bite-relu-formula, ∆bite-relu-vs-sigmoid-tanh, ∆bite-activation-function-family — shift-invariance; ReLU non-saturation; the activation zoo.
- Receptive field∆receptive-field-recurrence — $r _ \ell=r _ {\ell-1}+(k _ \ell-1)j _ {\ell-1}$, $j _ \ell=j _ {\ell-1}s _ \ell$ (Problem Sheet III Q1).
Source: Lecture 7, Convolutional Layer / Pooling / Activation Function slides; cf. Notes - Machine Learning MT23, Convolutional neural networks.
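The bookkeeping formulas can be sketched as three small helpers. The function names are hypothetical, integer (floor) division is assumed when the stride does not divide evenly, and the AlexNet-like numbers in the usage lines (227-pixel input, 11 x 11 x 3 kernels, stride 4, 96 filters) are illustrative rather than taken from the slides.

```python
def conv_output_size(h_in, k, p=0, s=1):
    """h_out = (h_in - k + 2p) / s + 1, with floor division."""
    return (h_in - k + 2 * p) // s + 1

def conv_param_count(k_h, k_w, depth, n_filters):
    """Each filter has k_h * k_w * depth weights plus one bias."""
    return n_filters * (k_h * k_w * depth + 1)

def receptive_field(layers):
    """Receptive field after a stack of (kernel, stride) layers.

    Uses r_l = r_{l-1} + (k_l - 1) * j_{l-1} and j_l = j_{l-1} * s_l,
    starting from r = 1, j = 1 at the input.
    """
    r, j = 1, 1
    for k, s in layers:
        r = r + (k - 1) * j
        j = j * s
    return r

print(conv_output_size(227, 11, p=0, s=4))    # 55
print(conv_param_count(11, 11, 3, 96))        # 34944
print(receptive_field([(3, 1)] * 3))          # 7: three 3x3 convs see 7x7
```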
Named CNN architectures
The historical chain: Neocognitron (Fukushima 1980) → LeNet-5 (LeCun 1998, first successful CNN) → AlexNet (Krizhevsky 2012, ReLU+GPU+dropout, won ImageNet, interpretable 96 first-layer filters) → VGG (smaller convs, deeper) → ResNet (R18/50/101/152, residual blocks). Each step’s idea recurs downstream.
- The architecture chain∆neocognitron-fukushima, ∆lenet-filter-size-limitation, ∆vgg-vs-lenet, ∆alexnet, ∆resnet, ∆bite-lenet-historical-attribution, ∆bite-alexnet-first-layer-filters, ∆bite-resnet-variants — Neocognitron → LeNet → AlexNet → VGG → ResNet.
Source: Lecture 7, Neocognitron / LeNet-5 / VGG / ResNet slides.
Attention and transformers
Scaled dot-product and multi-head attention
$\mathrm{Attn}(Q,K,V)=\mathrm{softmax}(QK^\top/\sqrt d)V$: queries score keys, the $\sqrt d$ keeps the score variance at 1 (since $q^\top k$ has variance $d$ when the components of $q$ and $k$ are independent with zero mean and unit variance) so softmax stays unsaturated. Multi-head runs $H$ projections in parallel + concat + linear, analogous to multiple conv filters. Self-attention is permutation-equivariant (hence positional encodings) and $O(N^2)$ in tokens, the cost that motivates ViViT factorisation.
- Attention formula / $\sqrt d$∆scaled-dot-product-attention, ∆scaled-dot-product-sqrt-d-argument, ∆attention-input-output-tokens, ∆bite-self-attention-complexity — $\mathrm{softmax}(QK^\top/\sqrt d)V$; variance argument; $O(N^2)$.
- Multi-head / cross-attention∆multi-head-attention, ∆bite-cross-attention-q-k-v-sources — $H$ heads + concat + projection; $Q$ from decoder, $K,V$ from encoder.
- Permutation equivariance / positional encoding∆transformer-permutation-equivariance, ∆transformer-positional-encoding, ∆bite-vit-learned-positional-encoding — add $P(i)$; ViT learns it.
Source: Lecture 8, Attention / Self-Attention / Positional Encoding slides; Paper - Attention Is All You Need (2017).
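A single-head NumPy sketch of scaled dot-product attention, with two numerical checks: permuting the input tokens permutes the output rows in the same way, and the raw dot product of unit-variance vectors has variance close to $d$. No learned projections are included; the shapes and random inputs are illustrative.

```python
import numpy as np

def softmax_rows(x):
    """Row-wise softmax with the max subtracted for numerical stability."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d)) V; returns the output and the N x N weights."""
    d = Q.shape[-1]
    weights = softmax_rows(Q @ K.T / np.sqrt(d))
    return weights @ V, weights

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))          # 5 tokens of dimension 8
perm = rng.permutation(5)

out, weights = attention(X, X, X)                       # self-attention
out_perm, _ = attention(X[perm], X[perm], X[perm])      # shuffled tokens

# Variance of q.k for d = 64 with independent unit-variance components.
q = rng.normal(size=(20000, 64))
k = rng.normal(size=(20000, 64))
score_var = np.var((q * k).sum(axis=1))
print(score_var)
```

The `weights` matrix has one entry per token pair, which is where the $O(N^2)$ cost comes from.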
Transformer blocks, ViT and architecture variants
The encoder block is Self-Attn → Add+Norm → FFN → Add+Norm; the decoder adds masked self-attention and encoder–decoder cross-attention. ViT (Dosovitskiy 2021) tokenises an image into $16\times16$ patches + CLS token + learned positional encoding and runs a pure encoder, beating CNNs only past ~300M pretraining images. The variant taxonomy (encoder-only / encoder-decoder / decoder-only / dual-encoder / hybrid) places ViT, BERT, GPT, T5, DETR, CLIP, Stable Diffusion and Flamingo on one map.
- Encoder / decoder blocks∆transformer-encoder-block, ∆transformer-decoder-block, ∆bite-residual-connection-purpose — MHA→Add+Norm→FFN→Add+Norm; masked + cross-attention.
- ViT∆vision-transformer-architecture, ∆bite-vit-patch-size, ∆bite-vit-jft-data-scale — patch+CLS+pos-enc; $16\times16$; ~300M-image crossover.
- Variant taxonomy∆transformer-architecture-variants — encoder-only / enc-dec / decoder-only / dual-encoder / hybrid.
Source: Lecture 8, The Encoder-Decoder Transformer / Vision Transformers slides.
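The ViT tokenisation step can be sketched with plain reshapes. The $224\times224$ input size, the embedding width `D` and the random weights below are illustrative assumptions; only the $16\times16$ patch size, the CLS token and the learned positional encoding come from the notes.

```python
import numpy as np

def patchify(image, patch=16):
    """Split an (H, W, C) image into flattened, non-overlapping patch tokens."""
    H, W, C = image.shape
    assert H % patch == 0 and W % patch == 0
    gh, gw = H // patch, W // patch
    x = image.reshape(gh, patch, gw, patch, C)
    x = x.transpose(0, 2, 1, 3, 4)                 # (gh, gw, patch, patch, C)
    return x.reshape(gh * gw, patch * patch * C)

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))                  # assumed input resolution
tokens = patchify(image)                           # 14 * 14 = 196 tokens of length 768

D = 32                                             # hypothetical embedding width
W_embed = rng.standard_normal((tokens.shape[1], D)) * 0.02
cls_token = np.zeros((1, D))                       # stands in for the learned CLS token
pos = rng.standard_normal((tokens.shape[0] + 1, D)) * 0.02  # learned in a real model

sequence = np.concatenate([cls_token, tokens @ W_embed], axis=0) + pos
```

The resulting `sequence` of 197 tokens is what a pure encoder stack would consume; with $N=197$ the $O(N^2)$ attention cost stays manageable.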
Interpreting vision models
Explanation taxonomy and the Clever Hans warning
Three explanation families — post-hoc analysis (no performance cost, but local), transparent models (semantic by construction, task-specific), learned explanations (semantic, but need meta-explanation) — each with pro/con trade-offs. The Clever Hans effect (the horse-with-copyright-notice classifier hitting 90% by reading spurious cues) is the central warning that test accuracy alone does not validate a model; GDPR Art. 13.2(f) gives the legal “right to explanation”.
- Taxonomy / pros-cons / Clever Hans / GDPR∆explanation-aspects-recipient-content-purpose, ∆explainability-three-approaches, ∆post-hoc-pros-cons, ∆transparent-models-pros-cons, ∆learned-explanations-pros-cons, ∆bite-clever-hans-horse-copyright, ∆bite-gdpr-right-to-explanation, ∆bite-pascal-voc-spec — the three approaches; spurious-correlation warning.
Source: Lecture 9, Clever Hans / Explainability / Taxonomy of Approaches slides.
Visualising weights, activations and attributions
Visualise weights (first-layer filters are interpretable Gabors; deeper not) or activations (PCA / t-SNE — with the caveat t-SNE is stochastic and distance-distorting). Attribution is black-box (occlusion — slow, fill-value-dependent) or white-box (gradient $ \vert \nabla _ x f \vert _ 1$). Input reconstruction by gradient ascent yields adversarial noise unless regularised. ROAR (remove-and-retrain) shows vanilla gradient can be worse than random; sanity checks (randomise weights / labels) must change the saliency map.
- Weights / activations∆resnet-last-layer-pca-visualisation, ∆activations-pca-visualisation, ∆tsne-definition, ∆bite-first-layer-filters-interpretable, ∆bite-tsne-caveats — first-layer Gabors; PCA/t-SNE caveats.
- Attribution∆occlusion-method, ∆black-square-occlusion-problem, ∆gradient-method-interpretability, ∆input-reconstruction-for-interpretability, ∆bite-blackbox-vs-whitebox-attribution, ∆bite-input-reconstruction-adversarial-noise, ∆bite-attention-visualisation-confirmation-bias — occlusion vs gradient; gradient-ascent → adversarial noise; attention-map confirmation bias.
- ROAR / sanity checks∆roar-attribution-benchmark, ∆saliency-sanity-checks, ∆bite-roar-shows-vanilla-gradient-weak, ∆bite-saliency-randomise-weights-sanity-check — remove-and-retrain; randomise-weights test.
Source: Lecture 9, Visualising Weights/Activations / Attribution / ROAR / Sanity Checks slides.
Object detection
Sliding window, HOG+SVM, IoU
Detection = classification + localisation as $(x,y,w,h,c)$. The pre-deep pipeline is a sliding-window binary classifier; Dalal–Triggs (CVPR 2005) used HOG features + linear SVM for pedestrians (the SVM weights visualise as a person silhouette — linear template matching, and HOG is reproducible as a CNN). IoU $=\frac{ \vert GT\cap P \vert }{ \vert GT\cup P \vert }$ jointly scores centre/size/aspect (biased to large objects) and is the match criterion feeding AP.
- Detection / sliding window / HOG+SVM∆object-detection-vs-classification, ∆sliding-window-approach, ∆sliding-window-problems, ∆dalal-triggs-pedestrian-hog-svm, ∆hog-as-cnn-implementation, ∆svm-weight-visualisation-hog, ∆bite-dalal-triggs-citation, ∆bite-bounding-box-parameterisation — $(x,y,w,h,c)$; HOG+SVM template.
- IoU∆iou-formula, ∆iou-bias-large-objects, ∆false-positive-better-fit-prediction, ∆ap-evaluation-algorithm, ∆bite-iou-vs-centre-distance — area-overlap metric; the AP-evaluation algorithm.
Source: Lecture 10, Pedestrian Detection / HOG / Intersection over Union slides.
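A short sketch of the IoU formula for boxes given as $(x,y,w,h)$. Treating $(x,y)$ as the top-left corner is an assumption made here; a centre-based convention would need a conversion first. The two example calls illustrate one reading of the large-object bias: the same 1-pixel shift costs a small box much more IoU than a large one.

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes (x, y, w, h), with (x, y) the top-left corner."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

# Same 1-pixel horizontal shift, two object sizes.
small = iou((0, 0, 10, 10), (1, 0, 10, 10))        # 90 / 110
large = iou((0, 0, 100, 100), (1, 0, 100, 100))    # 9900 / 10100
```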
Speed-ups: NMS, hard-negative mining, cascades, proposals
Detection is class-imbalanced (mostly background), so: NMS removes duplicate detections; hard-negative mining (bootstrapping) feeds false positives back as negatives; cascaded classifiers (Viola–Jones) reject easy negatives early; object proposals (Selective Search, ~2000 regions, >95% recall) replace exhaustive sliding windows. ImageNet’s lack of a background class is why naive proposal-classification floods false positives.
- NMS / mining / cascade / proposals∆nms-definition, ∆hard-negative-mining, ∆cascaded-classifiers, ∆object-proposals-definition, ∆selective-search-visualisation, ∆imagenet-no-background-class-problem, ∆bite-viola-jones-cascade, ∆bite-selective-search-citation — the four speed-up ideas.
Source: Lecture 10, Non-Maximum Suppression / Bootstrapping / Cascaded Classifiers / Selective Search slides.
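Greedy NMS, the first of the four ideas above, fits in a few lines. This is a sketch of the common greedy variant, assuming corner-format boxes $(x_1,y_1,x_2,y_2)$ with positive area and an IoU threshold of 0.5; neither choice is specified in the notes.

```python
import numpy as np

def box_iou(box, boxes):
    """IoU between one box and an (M, 4) array of boxes, all as (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def nms(boxes, scores, iou_threshold=0.5):
    """Keep the highest-scoring box, drop its near-duplicates, repeat."""
    order = np.argsort(-scores)                  # indices sorted by descending score
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        overlaps = box_iou(boxes[best], boxes[rest])
        order = rest[overlaps <= iou_threshold]  # survivors of this round
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
kept = nms(boxes, scores)                        # second box overlaps the first (IoU 81/119)
```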
The R-CNN family, YOLO and DETR
R-CNN (Selective Search → warp → CNN → SVM, slow: one CNN/proposal) → Fast R-CNN (one CNN pass, RoIPool, FC head) → Faster R-CNN (learned RPN with multi-scale anchors + IoU assignment + offset regression replaces Selective Search). One-stage YOLO predicts boxes on a grid in a single pass with its five-term loss ($\sqrt{w},\sqrt h$ to balance box sizes; $\lambda _ \text{noobj}$ for imbalance). DETR (Carion 2020) reframes detection as set prediction with object queries + bipartite matching — fully end-to-end, no NMS.
- R-CNN family∆r-cnn-architecture, ∆fast-r-cnn-insight, ∆fast-r-cnn-architecture, ∆roi-pooling-purpose, ∆roipool-transform-details, ∆rpn-definition, ∆rpn-and-faster-r-cnn-structure, ∆r-cnn-family, ∆r-cnn-pros-cons, ∆bounding-box-regression-definition, ∆bounding-box-offset-prediction, ∆bite-rcnn-citation, ∆bite-anchor-box-detail, ∆bite-faster-rcnn-is-fast-rcnn-plus-rpn — the three-generation chain; RoIPool; RPN anchors.
- YOLO∆two-stage-vs-one-stage, ∆yolo-architecture, ∆yolo-loss, ∆bite-yolo-sqrt-wh-justification — grid prediction; the five-term loss; $\sqrt{w},\sqrt h$.
- DETR∆detr-architecture, ∆bite-detr-citation — object queries + bipartite matching, NMS-free.
Source: Lecture 10, R-CNN / Fast R-CNN / Faster R-CNN / YOLO / DETR slides.
Segmentation
Semantic segmentation, FCN and upsampling
Classification/detection/segmentation are the same spatial-labelling task at image/region/pixel scale. Per-pixel prediction needs an encoder–decoder (downsample then upsample back to input resolution), since a classification backbone’s final map is too coarse. Upsampling primitives: nearest-neighbour, bed-of-nails, bilinear, max-unpooling (stored indices), and the learnable transposed convolution (grid artefacts fixed by a following conv).
- Granularity / FCN / sliding window∆classification-detection-segmentation-spectrum, ∆semantic-segmentation-definition, ∆segmentation-per-class-iou-evaluation, ∆sliding-window-segmentation, ∆sliding-window-segmentation-drawbacks, ∆fully-convolutional-network, ∆bite-segmentation-needs-upsampling-decoder — one task at three scales; why a decoder.
- Upsampling methods∆unpooling-nearest-neighbour, ∆unpooling-bed-of-nails, ∆unpooling-bilinear-interpolation, ∆max-unpooling, ∆transposed-convolution-example, ∆transposed-convolution-grid-artefacts, ∆transposed-convolution-learnable-weights, ∆bite-upsampling-methods-comparison — the five primitives.
Source: Lecture 11, Fully Convolutional Networks / Upsampling: Unpooling / Transposed Convolution slides.
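Three of the upsampling primitives can be sketched directly in NumPy. The transposed convolution below uses a fixed all-ones kernel purely to expose the uneven overlap pattern; in a network the kernel would be learned. Single-channel inputs and a square kernel are simplifying assumptions.

```python
import numpy as np

def nearest_neighbour(x, s=2):
    """Repeat every pixel s times along both axes."""
    return np.repeat(np.repeat(x, s, axis=0), s, axis=1)

def bed_of_nails(x, s=2):
    """Place each pixel at the top-left of an s x s block of zeros."""
    out = np.zeros((x.shape[0] * s, x.shape[1] * s), dtype=x.dtype)
    out[::s, ::s] = x
    return out

def transposed_conv2d(x, kernel, stride=2):
    """Each input pixel stamps a scaled copy of the kernel; overlapping stamps add."""
    H, W = x.shape
    k = kernel.shape[0]                       # square kernel assumed
    out = np.zeros(((H - 1) * stride + k, (W - 1) * stride + k))
    for i in range(H):
        for j in range(W):
            out[i * stride:i * stride + k, j * stride:j * stride + k] += x[i, j] * kernel
    return out

x = np.array([[1.0, 2.0], [3.0, 4.0]])
nn_up = nearest_neighbour(x)
nails_up = bed_of_nails(x)

# Constant input, constant kernel: the output is not constant, because a 3x3
# kernel with stride 2 covers some output pixels once, some twice, some four times.
coverage = transposed_conv2d(np.ones((2, 2)), np.ones((3, 3)), stride=2)
```

The non-uniform `coverage` map is the grid artefact mentioned above.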
U-Net, instance/panoptic segmentation and SAM
U-Net’s signature is concatenation skip connections (not residual addition) carrying low-level detail into the decoder. Instance segmentation = detect + per-region mask: Mask R-CNN adds a mask branch, with RoIAlign (bilinear, no quantisation) replacing RoIPool because half-pixel misalignment that washes out for box classification ruins masks. MaskFormer reframes segmentation as $N$ mask-query classification; SAM (Kirillov 2023, 1B masks) adds a promptable interface; keypoint/pose prediction reuses the mask head as heatmaps + differentiable soft-argmax.
- U-Net∆unet-architecture, ∆bite-unet-concatenation-vs-addition — concatenation skips; detail + semantics.
- Instance / Mask R-CNN / RoIAlign∆instance-segmentation-definition, ∆semantic-vs-instance-visualisation, ∆instance-segmentation-approach, ∆mask-r-cnn-architecture, ∆roialign-vs-roipool, ∆thing-definition, ∆stuff-definition, ∆panoptic-segmentation-definition, ∆bite-roialign-fixes-misalignment-for-masks, ∆bite-mask-rcnn-mask-head-binary-ce, ∆bite-segmentation-task-taxonomy — things vs stuff; RoIAlign fixes mask misalignment.
- MaskFormer / SAM∆maskformer-architecture, ∆sam-architecture, ∆sam-humans-in-the-loop, ∆bite-maskformer-citation, ∆bite-sam-citation-and-scale — mask-query classification; promptable + humans-in-the-loop.
- Keypoints / dense captioning∆keypoint-vs-instance-segmentation, ∆keypoint-masks-pose-estimation, ∆heatmap-to-keypoint-weighted-sum, ∆dense-captioning-task, ∆bite-differentiable-keypoint-regression, ∆bite-human-pose-17-keypoints — heatmap → soft-argmax; 17 COCO keypoints.
Source: Lecture 11, U-Net / Mask R-CNN / RoIAlign / MaskFormer / SAM / Keypoints slides.
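The heatmap-to-keypoint step can be sketched as a softmax-weighted average of pixel coordinates. The `temperature` parameter is an illustrative addition, not something taken from the lecture.

```python
import numpy as np

def soft_argmax(heatmap, temperature=1.0):
    """Differentiable keypoint estimate: expected (x, y) under softmax(heatmap)."""
    H, W = heatmap.shape
    logits = heatmap.ravel() / temperature
    logits = logits - logits.max()             # numerical stability
    p = np.exp(logits)
    p = (p / p.sum()).reshape(H, W)
    ys, xs = np.mgrid[0:H, 0:W]
    return float((p * xs).sum()), float((p * ys).sum())

peaked = np.zeros((8, 8))
peaked[3, 5] = 50.0                            # strong response at row 3, column 5
kp_peaked = soft_argmax(peaked)                # approximately (5.0, 3.0)
kp_flat = soft_argmax(np.zeros((8, 8)))        # uniform heatmap: the grid centre
```

Unlike a hard argmax, every heatmap value contributes to the output, so gradients flow back to the whole map and sub-pixel locations are representable.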
Video, optical flow and tracking
Video processing: fusion, 3D conv, two-stream, ViViT
Video is too big to process whole (~11 GB/min HD), so models train on short low-res clips and, at test time, average predictions over overlapping clips. The temporal fusion strategies are per-frame / late / early / slow (via 3D conv); single-frame is a surprisingly strong baseline because action recognition is object-biased. Two-stream nets fuse RGB + stacked optical flow. ViViT (Arnab 2021) tokenises spatio-temporal patches; full attention is $O((THW)^2)$, so the three factorisation strategies (factorised encoder / self-attention / dot-product) split space and time, dropping cost to $O(T^2HW+T(HW)^2)$.
- Windowed processing / fusion∆video-training-clip-strategy, ∆per-frame-classification, ∆object-bias-of-action-recognition, ∆late-fusion-visualisation, ∆late-fusion-mechanisms, ∆late-fusion-motion-loss, ∆early-fusion-visualisation, ∆slow-fusion-with-3d-convs, ∆bite-video-data-volume, ∆bite-video-windowed-processing, ∆bite-fusion-strategies-taxonomy, ∆bite-video-fusion-results, ∆bite-3d-convolution-mechanics — clip strategy; single-frame baseline; 3D conv.
- Two-stream∆two-stream-optical-flow-architecture, ∆two-stream-input-shapes, ∆bite-two-stream-citation-and-results — RGB + flow streams; UCF-101 numbers.
- ViViT / factorised attention∆transformer-cnn-video-architecture, ∆spatio-temporal-tokens-pure-transformer, ∆vivit-factorised-attention-strategies, ∆temporal-attention, ∆spatial-attention, ∆factorised-attention-complexity, ∆tubelets-definition, ∆bite-vivit-citation, ∆bite-factorised-attention-cost-savings, ∆bite-tubelets-definition, ∆bite-autoad-multi-modal-video — three factorisations; $O(T^2HW+T(HW)^2)$.
Source: Lecture 12, Windowed Video Processing / Fusion Approaches / Two-Stream / ViViT slides.
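To put numbers on the two complexity expressions, the sketch below counts query-key pairs for full and factorised attention. The token grid ($T=8$, $H=W=14$) is a hypothetical example, not a configuration quoted in the lecture.

```python
def full_attention_pairs(T, H, W):
    """Every spatio-temporal token attends to every other: (THW)^2 pairs."""
    n = T * H * W
    return n * n

def factorised_attention_pairs(T, H, W):
    """Temporal attention per spatial location plus spatial attention per frame."""
    temporal = (H * W) * T * T          # HW sequences of length T
    spatial = T * (H * W) ** 2          # T sequences of length HW
    return temporal + spatial

T, H, W = 8, 14, 14                     # hypothetical token grid
full = full_attention_pairs(T, H, W)
factorised = factorised_attention_pairs(T, H, W)
saving = full / factorised
```

For this grid the factorised form needs roughly 7.7 times fewer pairs, and the gap widens as $T$ grows.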
Optical flow
Brightness constancy + first-order Taylor + divide-by-$\Delta t$ gives the motion-constraint equation $\nabla I^\top\pmb\mu=-\partial _ t I$ — one equation, two unknowns (the aperture problem), resolved by a smoothness regulariser (Horn–Schunck 1981) or a local patch (Lucas–Kanade). Modern learned flow (FlowNet 2015, RAFT 2020: correlation volume + GRU iterative refinement) is trained on synthetic data because real ground truth is impossible — sim2real works because flow is low-level.
- Motion constraint / derivation / aperture∆optical-flow-definition, ∆motion-constraint-equation, ∆motion-constraint-derivation, ∆horn-schunck-objective, ∆optical-flow-smoothness-effect, ∆intensity-based-optical-flow-problems, ∆bite-brightness-constancy-assumption, ∆bite-aperture-problem, ∆bite-motion-constraint-derivation-strategy, ∆bite-horn-schunck-uses-gradient, ∆bite-horn-schunck-citation — Taylor + brightness constancy; aperture problem; smoothness term.
- Learned flow∆bite-flownet-citation, ∆bite-raft-citation, ∆bite-optical-flow-synthetic-datasets, ∆bite-optical-flow-vs-point-tracking-vs-stereo — FlowNet/RAFT; synthetic ground truth; flow vs point-tracking vs stereo.
Source: Lecture 13, Optical Flow — The Beginnings / Regularisation slides; Lecture 1 history slides.
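The three-step derivation of the motion-constraint equation can be written out in full. This is a sketch under the stated assumptions (brightness constancy, small motion); the component names $u, v$ for $\pmb\mu$ and the symbol $\mu _ n$ for the normal component are introduced here for illustration.

$$
\begin{aligned}
I(x+u\,\Delta t,\;y+v\,\Delta t,\;t+\Delta t) &= I(x,y,t) && \text{(brightness constancy)}\\
I+\partial _ x I\,u\,\Delta t+\partial _ y I\,v\,\Delta t+\partial _ t I\,\Delta t+O(\Delta t^2) &= I && \text{(first-order Taylor)}\\
\partial _ x I\,u+\partial _ y I\,v+\partial _ t I &= 0 && \text{(divide by }\Delta t,\ \Delta t\to 0)\\
\nabla I^\top\pmb\mu &= -\partial _ t I, && \pmb\mu=(u,v)^\top
\end{aligned}
$$

Only the component of $\pmb\mu$ along $\nabla I$ is determined, $\mu _ n=-\partial _ t I/\Vert\nabla I\Vert$; the component along the edge is free, which is the aperture problem.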
Template tracking and Lucas–Kanade
Naive template matching is too slow ($O(\#\text{image}\times\#\text{template})$); LK (1981) reformulates it as iterative Gauss–Newton: linearise the warped residual, solve the normal equations $\pmb M\Delta\pmb p=\pmb b$, update $\pmb p\leftarrow\pmb p+\Delta\pmb p$. It works for any differentiable warp $W(\pmb x,\pmb p)$ (translation needs a $2\times2$ system = the structure tensor — invertible only if there is 2D gradient content, the aperture problem again). Drift (template picks up background) is mitigated by a fixed template / segmentation mask / re-detection (TLD); modern point trackers (PIPs/RAFT/CoTracker) replace pixel with feature matching.
- Template energy / LK derivation∆template-tracking-energy, ∆template-tracking-slow, ∆lk-energy-reformulation, ∆lk-general-problem-setup, ∆lk-energy-taylor-expansion, ∆lk-update-rule-statement, ∆lk-update-rule-derivation, ∆bite-lk-historical-attribution, ∆bite-lk-vs-exhaustive-search, ∆bite-lk-update-derivation-strategy, ∆bite-lk-translation-2x2-system — Gauss–Newton; $\pmb M\Delta\pmb p=\pmb b$; structure tensor.
- General warps / drift∆tracker-drift-problem, ∆global-group-tracking-fix, ∆bite-lk-any-differentiable-warp, ∆bite-lk-jacobian-translation-scaling-example, ∆bite-lk-inverse-compositional, ∆bite-lk-drift-and-mitigations — Jacobian per warp; inverse-compositional speed-up; drift fixes.
- Modern / 2D-vs-3D tracking∆bite-2d-vs-3d-tracking, ∆bite-tracking-template-options, ∆bite-modern-point-trackers, ∆bite-cotracker-family, ∆bite-siamfc-citation, ∆bite-tld-framework — SiamFC/PIPs/CoTracker; tracking-by-detection.
Source: Lecture 13, Recap: Template Tracking / Generalised LK Tracking / LK Tracker Insights slides; Problem Sheet I.
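A minimal sketch of the Gauss–Newton loop for the pure-translation warp $W(\pmb x,\pmb p)=\pmb x+\pmb p$. To keep it self-contained the "image" is an analytic function (a hypothetical choice), so warping needs no interpolation and gradients are exact; a real tracker would sample a pixel array with bilinear interpolation and finite-difference gradients.

```python
import numpy as np

def image(x, y):
    """Synthetic smooth image with gradient content in both directions."""
    return np.sin(0.2 * x) + np.cos(0.15 * y) + 0.5 * np.sin(0.1 * (x + y))

def image_grad(x, y):
    ix = 0.2 * np.cos(0.2 * x) + 0.05 * np.cos(0.1 * (x + y))
    iy = -0.15 * np.sin(0.15 * y) + 0.05 * np.cos(0.1 * (x + y))
    return ix, iy

def lk_translation(template, xs, ys, p0, n_iters=20):
    """Estimate p such that image(x + p) matches the template over the patch."""
    p = np.array(p0, dtype=float)
    for _ in range(n_iters):
        wx, wy = xs + p[0], ys + p[1]               # warped sample positions
        r = (template - image(wx, wy)).ravel()      # residual at the current p
        ix, iy = image_grad(wx, wy)
        J = np.stack([ix.ravel(), iy.ravel()], axis=1)
        M = J.T @ J                                  # 2x2 structure tensor
        b = J.T @ r
        p += np.linalg.solve(M, b)                   # solve M dp = b, then update p
    return p

ys, xs = np.mgrid[0:20, 0:20].astype(float)          # 20x20 patch
p_true = np.array([0.7, -0.4])
template = image(xs + p_true[0], ys + p_true[1])
p_est = lk_translation(template, xs, ys, p0=(0.0, 0.0))
```

If the patch had gradients in only one direction, `M` would be singular and `np.linalg.solve` would fail, which is the aperture problem showing up as a rank deficiency.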
Multiple-view geometry
Camera models and calibration
A pinhole projects $(x,y,z)\mapsto(fx/z,fy/z)$ (perspective), with orthographic / weak-perspective the $f\to\infty$ limit. The full projection is $\pmb x\cong \mathbf K[\mathbf R\mid\pmb t]\mathbf X=\mathbf P\mathbf X$ ($\mathbf K$ intrinsic: focal/principal-point/pixel-scaling; $[\mathbf R\mid\pmb t]$ extrinsic). $\mathbf P$ has 11 DoF and each correspondence gives 2 equations, so 6 correspondences give a linear solution via the same stack-and-SVD null-space trick; the linear solve returns $\mathbf P$ as a whole, not its factors $\mathbf K$, $\mathbf R$, $\pmb t$. Non-linear calibration minimises reprojection error and can model radial distortion.
- Pinhole / orthographic / single-view ambiguity∆single-view-ambiguity, ∆camera-coordinate-system, ∆pinhole-projection-formula, ∆orthographic-as-perspective-special-case, ∆orthographic-projection-matrix, ∆normalised-coordinate-system, ∆bite-pinhole-image-inversion, ∆bite-vanishing-points, ∆bite-sphere-projects-to-ellipse, ∆bite-weak-perspective-camera — perspective vs orthographic; vanishing points.
- Calibration / projection matrix∆camera-calibration-decomposition, ∆principal-point-definition, ∆camera-projection-matrix-entries, ∆linear-calibration-derivation, ∆linear-calibration-limitation, ∆nonlinear-vs-linear-calibration, ∆bite-projection-matrix-dof, ∆bite-linear-calibration-minimum-correspondences, ∆bite-linear-calibration-svd-solution, ∆bite-linear-calibration-strategy, ∆bite-reprojection-error-formula, ∆bite-nonlinear-calibration-extras, ∆bite-pixel-scaling-factors-units — $\mathbf P=\mathbf K[\mathbf R\mid\pmb t]$; 11 DoF, 6 correspondences; reprojection error.
Source: Lecture 14, Perspective Projection / Camera Calibration slides; Problem Sheet IV Q2-Q3.
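The stack-and-SVD recipe for linear calibration can be sketched on synthetic, noise-free data. The intrinsics, pose and 3D points below are made-up values used only to generate correspondences; coordinate normalisation and noise handling are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ground-truth camera, used only to synthesise correspondences.
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
theta = 0.1
R = np.array([[np.cos(theta), 0.0, np.sin(theta)],
              [0.0, 1.0, 0.0],
              [-np.sin(theta), 0.0, np.cos(theta)]])
t = np.array([0.1, -0.2, 5.0])
P_true = K @ np.hstack([R, t[:, None]])

def project(P, X):
    """Apply x ~ P X to (N, 3) world points and dehomogenise to pixels."""
    Xh = np.hstack([X, np.ones((len(X), 1))])
    x = Xh @ P.T
    return x[:, :2] / x[:, 2:3]

def calibrate_dlt(X, x):
    """Two linear equations per correspondence; P is the null vector of the stack."""
    rows = []
    for Xw, (u, v) in zip(X, x):
        Xh = np.append(Xw, 1.0)
        rows.append(np.concatenate([Xh, np.zeros(4), -u * Xh]))
        rows.append(np.concatenate([np.zeros(4), Xh, -v * Xh]))
    A = np.array(rows)
    _, _, Vt = np.linalg.svd(A)
    return Vt[-1].reshape(3, 4)          # right singular vector of the smallest value

X = rng.uniform(-1.0, 1.0, size=(8, 3))  # 8 non-coplanar points (at least 6 needed)
x = project(P_true, X)
P_est = calibrate_dlt(X, x)
reproj_error = np.abs(project(P_est, X) - x).max()
```

`P_est` matches `P_true` only up to scale and sign, and it comes back as a single $3\times4$ matrix; recovering $\mathbf K$, $\mathbf R$, $\pmb t$ would need a further decomposition step.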
Epipolar geometry: essential and fundamental matrices
Coplanarity of $(\pmb x',R\pmb x,\pmb t)$ gives $\pmb x'^\top E\pmb x=0$ with $E=[\pmb t] _ \times R$ (calibrated, rank 2, 5 DoF); when $\mathbf K$ is unknown the same constraint in pixel coordinates is $\pmb x'^\top F\pmb x=0$ with $F=\mathbf K'^{-\top}E\mathbf K^{-1}$ (rank 2, 7 DoF). $E\pmb x$ is the epipolar line of $\pmb x$ in the other image, and $E\pmb e=0$ defines the epipole. The constraint is necessary but not sufficient: it reduces a 2D search to the 1D epipolar line without picking the match on that line. The eight-point algorithm estimates $F$ by, once more, stack-and-SVD, then enforces rank 2 by zeroing the smallest singular value (Eckart–Young).
- Setup / epipoles / lines∆mvg-problem-definition, ∆epipolar-baseline-definition, ∆epipoles-and-special-case-parallel, ∆epipolar-plane-and-lines, ∆matching-point-on-epipolar-line, ∆epipolar-lines-visualisation, ∆epipolar-constraint-statement, ∆bite-epipolar-geometry-three-cases, ∆bite-epipolar-constraint-necessary-not-sufficient — baseline/epipole/epipolar line; three special cases.
- Essential / fundamental matrix∆two-camera-projection-matrices, ∆normalised-image-coordinates-conversion, ∆essential-matrix-four-derivations, ∆essential-matrix-epipolar-lines, ∆essential-matrix-epipoles, ∆essential-matrix-rank-dof, ∆fundamental-matrix-from-uncalibrated, ∆fundamental-matrix-epipolar-lines-and-epipoles, ∆fundamental-matrix-rank-dof, ∆bite-essential-matrix-historical-attribution, ∆bite-essential-matrix-rank2-justification, ∆bite-fundamental-from-essential — $E=[\pmb t] _ \times R$, rank 2 / 5 DoF; $F$ rank 2 / 7 DoF.
- Eight-point algorithm∆eight-point-algorithm-derivation, ∆bite-eight-point-minimum-correspondences, ∆bite-eight-point-rank2-enforcement — stack-and-SVD + rank-2 truncation.
Source: Lecture 15, Epipolar Geometry / The Essential Matrix / The Fundamental Matrix / Eight Point Algorithm slides; Problem Sheet IV Q1.
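A sketch of the eight-point algorithm on a synthetic two-view setup, followed by the rank-2 projection. The camera parameters are hypothetical and deliberately chosen so image coordinates are $O(1)$, which lets this sketch skip the coordinate normalisation that a practical implementation would normally apply first; the data are noise-free.

```python
import numpy as np

rng = np.random.default_rng(1)

def skew(t):
    """Cross-product matrix [t]_x."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

# Hypothetical shared intrinsics and relative pose (camera 2 sees X2 = R X1 + t).
K = np.array([[1.2, 0.0, 0.1],
              [0.0, 1.2, -0.05],
              [0.0, 0.0, 1.0]])
theta = 0.2
R = np.array([[np.cos(theta), 0.0, np.sin(theta)],
              [0.0, 1.0, 0.0],
              [-np.sin(theta), 0.0, np.cos(theta)]])
t = np.array([1.0, 0.1, 0.2])

X1 = rng.uniform(-1.0, 1.0, size=(12, 3)) + np.array([0.0, 0.0, 5.0])
X2 = X1 @ R.T + t

def to_image(K, Xc):
    x = Xc @ K.T
    return x / x[:, 2:3]                      # homogeneous (u, v, 1)

x1, x2 = to_image(K, X1), to_image(K, X2)

def eight_point(x1, x2):
    # Each row is kron(x2, x1), so row . vec(F) = x2^T F x1 (row-major vec).
    A = np.array([np.kron(b, a) for a, b in zip(x1, x2)])
    _, _, Vt = np.linalg.svd(A)
    F = Vt[-1].reshape(3, 3)
    U, S, Vt = np.linalg.svd(F)
    S[2] = 0.0                                # nearest rank-2 matrix (Eckart–Young)
    return U @ np.diag(S) @ Vt

F_est = eight_point(x1, x2)
K_inv = np.linalg.inv(K)
F_true = K_inv.T @ skew(t) @ R @ K_inv
residuals = np.abs(np.einsum('ni,ij,nj->n', x2, F_est, x1))
cosine = abs(np.sum(F_est * F_true)) / (np.linalg.norm(F_est) * np.linalg.norm(F_true))
```

`cosine` compares the two matrices up to scale and sign, which is the only sense in which $F$ is defined.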
Neural rendering: rendering equation, BRDF and NeRF
The rendering equation is $L _ o=L _ e+\int _ \Omega f _ r\,L _ i(\omega _ i\cdot\pmb n)\,d\omega _ i$, where the BRDF $f _ r$ must be physically plausible (positivity / reciprocity / energy conservation). NeRF (Mildenhall 2020) is an MLP $F _ \Omega:(x,y,z,\theta,\phi)\to(r,g,b,\sigma)$ trained by analysis-by-synthesis: differentiable volume rendering $C=\sum T _ i\alpha _ i c _ i$ ($\alpha _ i=1-e^{-\sigma _ i\Delta t}$, $T _ i=\prod _ {j<i}(1-\alpha _ j)$) against multi-view images. 3D Gaussian Splatting is the real-time alternative; a 2D-diffusion prior + textual inversion gives single-image 3D (with the Janus front-face problem).
- Rendering equation / BRDF∆rendering-equation-statement, ∆brdf-definition-and-properties, ∆bite-brdf-three-properties-physical-justification, ∆bite-photosculpture-1850 — $L _ o=L _ e+L _ r$; the three BRDF properties.
- NeRF / volume rendering∆nerf-problem-setup, ∆nerf-loss-via-volume-rendering, ∆volume-rendering-colour-formula, ∆bite-nerf-citation-and-input-dim, ∆bite-nerf-three-components, ∆bite-volume-rendering-opacity-formula, ∆bite-transmittance-product-formula, ∆bite-nerf-training-ray-pipeline, ∆bite-3d-gaussian-splatting — 5D MLP; $C=\sum T _ i\alpha _ i c _ i$; Gaussian splatting.
- Diffusion-prior 3D∆textual-inversion, ∆diffusion-as-prior-for-3d, ∆janus-problem — learn $\langle e\rangle$; SDS-style 3D; the Janus bias.
Source: Lecture 15, The Rendering Equation / Neural Radiance Fields / Diffusion as a Prior slides; Lecture 1 history slides.
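The volume-rendering sum $C=\sum T _ i\alpha _ i c _ i$ along a single ray can be sketched as follows. The densities, colours and step size are made-up sample values; in NeRF they would come from querying the MLP at points along the ray.

```python
import numpy as np

def render_ray(sigma, colour, delta):
    """Composite per-sample densities and colours into one pixel colour."""
    alpha = 1.0 - np.exp(-sigma * delta)                           # opacity of each sample
    trans = np.concatenate([[1.0], np.cumprod(1.0 - alpha)[:-1]])  # T_i = prod_{j<i}(1 - alpha_j)
    weights = trans * alpha
    return weights @ colour, weights

colour = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0],
                   [0.0, 0.0, 1.0],
                   [1.0, 1.0, 1.0]])
delta = 0.1                                        # assumed constant step along the ray

sigma = np.array([0.0, 0.5, 50.0, 2.0])
C, w = render_ray(sigma, colour, delta)

# An effectively opaque second sample hides everything behind it.
C_opaque, w_opaque = render_ray(np.array([0.0, 1e6, 5.0, 5.0]), colour, delta)
```

The weights sum to $1-\prod _ i(1-\alpha _ i)$, the total opacity of the ray; every operation is differentiable in `sigma` and `colour`, which is what makes analysis-by-synthesis trainable.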
Generative models
Taxonomy, autoregressive models and VQ-VAE
Generative = learn $p(\pmb x)$ to sample from; the taxonomy splits explicit (tractable: autoregressive; approximate: VAE) vs implicit (direct: GAN, diffusion). Autoregressive factorises $p(\pmb x)=\prod p(x _ i\mid x _ {<i})$ with masked convolutions (PixelCNN mask A/B) to prevent peeking; VQ-VAE quantises to a discrete codebook (straight-through gradient), enabling DALL-E’s token-space autoregression.
- Setup / taxonomy∆generative-model-setup, ∆discriminative-vs-generative-vs-conditional, ∆bayes-rule-for-generative-models, ∆generative-explicit-vs-implicit, ∆generative-models-taxonomy, ∆bite-gan-taxonomy-entry, ∆bite-implicit-vs-explicit-evaluation — explicit vs implicit; why implicit is hard to evaluate.
- Autoregressive / VQ-VAE / DALL-E∆autoregressive-image-generation-idea, ∆autoregressive-image-factorisation, ∆autoregressive-architecture-visualisation, ∆autoregressive-mask-need, ∆autoregressive-cnn-masks, ∆vq-vae-architecture, ∆dalle-token-space-autoregressive, ∆flow-based-generative-model, ∆bite-vae-vs-vq-vae, ∆bite-dalle-token-space-and-scale — $\prod p(x _ i\mid x _ {<i})$; masking; discrete codebook.
Source: Lecture 16, Taxonomy / Autoregressive / VQ-VAE / DALL-E slides; cf. Notes - Machine Learning MT23, Generative models.
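The PixelCNN masks can be built explicitly. This sketch covers a single-channel kernel in raster-scan order and assumes the usual convention that mask A (used in the first layer) hides the current pixel while mask B (later layers) allows it; colour-channel ordering is left out.

```python
import numpy as np

def pixelcnn_mask(kernel_size, mask_type):
    """Binary mask that zeroes kernel weights on not-yet-generated pixels."""
    assert mask_type in ("A", "B") and kernel_size % 2 == 1
    c = kernel_size // 2
    mask = np.zeros((kernel_size, kernel_size))
    mask[:c, :] = 1.0          # all rows above the current pixel
    mask[c, :c] = 1.0          # pixels to the left in the current row
    if mask_type == "B":
        mask[c, c] = 1.0       # later layers may see the current position's features
    return mask

mask_a = pixelcnn_mask(3, "A")
mask_b = pixelcnn_mask(3, "B")
```

Multiplying the convolution weights by the mask before each forward pass is what enforces $p(x _ i\mid x _ {<i})$: no output can depend on pixels that come later in the scan.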
Diffusion models and Stable Diffusion
The forward process is a fixed Gaussian noising chain; only the reverse denoiser is learned ($\Vert f(x _ t,t)-\epsilon\Vert ^2$). The closed form $x _ t=\sqrt{\bar\alpha _ t}x _ 0+\sqrt{1-\bar\alpha _ t}\,\epsilon$ with $\bar\alpha _ t=\prod\alpha _ i$ enables one-step noising; the DDPM sampling step iterates $T\approx1000$ times (predicting noise beats predicting $x _ 0$). Latent diffusion (Stable Diffusion) runs the chain in a VQ-VAE latent (~$8\times$ compression), conditions every step via cross-attention to the prompt, and injects $t$ via a Fourier-basis embedding. Evaluation uses FID = Fréchet distance between Inception-feature Gaussians.
- Diffusion mechanics∆diffusion-model-overview, ∆diffusion-forward-reverse-visualisation, ∆diffusion-model-aim, ∆diffusion-noise-schedule-q, ∆diffusion-joint-markov-product, ∆x-in-terms-of-noise, ∆ddpm-sampling-step, ∆diffusion-noise-vs-x0-objectives, ∆bite-diffusion-fixed-forward-learned-reverse, ∆bite-diffusion-cumulative-alpha, ∆bite-ddpm-typical-step-count — fixed forward / learned reverse; closed form; DDPM step; noise vs $x _ 0$.
- Latent / Stable Diffusion∆latent-diffusion-idea, ∆stable-diffusion-latent-diffusion, ∆stable-diffusion-overview, ∆stable-diffusion-cross-attention, ∆stable-diffusion-time-dependency, ∆bite-latent-diffusion-compression-factor, ∆bite-stable-diffusion-training-scale, ∆bite-silu-activation — VQ-VAE latent + cross-attention + Fourier time embedding.
- Evaluation (FID)∆fid-formula, ∆bite-frechet-distance-dog-and-owner, ∆bite-fid-uses-inception-features — Wasserstein-2 between Inception-feature Gaussians.
Source: Lecture 16, Diffusion Models / Latent Diffusion / Stable Diffusion / FID slides.
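The closed-form noising step and the noise-prediction loss can be sketched as below. The linear $\beta$ schedule from $10^{-4}$ to $0.02$ is an assumption (a common DDPM default, not stated in the notes), and `oracle_model` is a hypothetical stand-in that inverts the closed form exactly instead of a trained network.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)       # assumed linear noise schedule
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)           # cumulative product of alphas

def q_sample(x0, t, eps):
    """Jump straight from x_0 to x_t (t is a 0-based index into the schedule)."""
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

def noise_prediction_loss(model, x0, t, eps):
    """Mean squared error between the predicted and the true noise."""
    return np.mean((model(q_sample(x0, t, eps), t) - eps) ** 2)

# Running the chain one step at a time accumulates the same noise variance
# as the closed form: v_t = alpha_t * v_{t-1} + beta_t = 1 - alpha_bar_t.
chain_var = 0.0
for a, b in zip(alphas, betas):
    chain_var = a * chain_var + b

rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 4))         # toy "image"
eps = rng.standard_normal((4, 4))

def oracle_model(xt, t):
    """Hypothetical perfect denoiser that knows x0; for checking the loss only."""
    return (xt - np.sqrt(alpha_bar[t]) * x0) / np.sqrt(1.0 - alpha_bar[t])

oracle_loss = noise_prediction_loss(oracle_model, x0, 500, eps)
```

By the last step `alpha_bar` is close to zero, so $x _ T$ is almost pure Gaussian noise, which is why sampling can start from $\mathcal N(0, I)$.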
Representation and unsupervised learning
Representation-learning losses
Learn a general-purpose embedding, not a task output. Cosine-similarity loss forces absolute targets (collapses without negatives); triplet loss enforces the weaker relative margin $\mathcal S(\phi,\phi^+)>\mathcal S(\phi,\phi^-)+\epsilon$; the contrastive (InfoNCE) loss is exactly softmax cross-entropy with similarities as logits and the other batch elements as implicit negatives. CLIP (400M image–text pairs) trains a dual encoder with symmetric contrastive loss; SigLIP swaps the softmax for a per-pair sigmoid for multi-GPU scaling.
- Loss zoo∆representation-vs-task-learning, ∆representation-supervised-vs-unsupervised, ∆cosine-similarity-loss-formula, ∆cosine-similarity-loss-problem, ∆triplet-loss-definition, ∆triplet-loss-speedup-batch, ∆contrastive-loss, ∆contrastive-loss-as-cross-entropy, ∆supervised-rep-learning-data, ∆bite-cosine-similarity-formula, ∆bite-negatives-prevent-collapse, ∆bite-triplet-vs-cosine-similarity-flexibility, ∆bite-contrastive-loss-batch-implicit-negatives, ∆bite-infonce-name-clarification — cosine / triplet / InfoNCE; negatives prevent collapse; InfoNCE = softmax-CE.
- Learned descriptors / CLIP / SigLIP∆handcrafted-representation-limitations, ∆sift-as-handcrafted-example, ∆learn-keypoint-descriptor-recipe, ∆patch-matching-three-architectures, ∆sift-vs-lift, ∆learned-vs-handcrafted-descriptors, ∆clip-loss-visualisation, ∆clip-pseudocode, ∆siglip-pseudocode, ∆siglip-vs-clip, ∆ranking-loss-definition, ∆representation-learning-for-retrieval, ∆bite-clip-citation-and-scale, ∆bite-clip-apple-ipod-adversarial, ∆bite-siglip-sigmoid-loss, ∆bite-multi-modal-recipe, ∆bite-selavi-audio-visual, ∆bite-representation-learning-retrieval-rationale — LIFT vs SIFT; CLIP/SigLIP pseudocode; retrieval rationale.
Source: Lecture 17, Representation Learning Losses / Contrastive Loss / CLIP / SigLIP slides.
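The identity "InfoNCE = softmax cross-entropy with similarities as logits" can be made concrete in a few lines. This is a sketch of a symmetric, CLIP-style batch loss; the temperature value and the toy embeddings are illustrative assumptions.

```python
import numpy as np

def log_softmax(z, axis):
    z = z - z.max(axis=axis, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))

def info_nce(za, zb, temperature=0.1):
    """Row i of za and row i of zb are a positive pair; other rows act as negatives."""
    za = za / np.linalg.norm(za, axis=1, keepdims=True)
    zb = zb / np.linalg.norm(zb, axis=1, keepdims=True)
    logits = za @ zb.T / temperature                      # cosine similarities as logits
    idx = np.arange(len(za))                              # the "class label" of row i is i
    loss_ab = -log_softmax(logits, axis=1)[idx, idx].mean()   # a -> b direction
    loss_ba = -log_softmax(logits, axis=0)[idx, idx].mean()   # b -> a direction
    return 0.5 * (loss_ab + loss_ba)

aligned = info_nce(np.eye(4), np.eye(4))                        # each pair matches perfectly
misaligned = info_nce(np.eye(4), np.roll(np.eye(4), 1, axis=0)) # pairs shifted by one
```

With four orthonormal embeddings and temperature 0.1, the aligned loss is $\log(1+3e^{-10})$: the positive logit is 10 and the three in-batch negatives sit at 0.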
Unsupervised learning: pretext tasks, SimCLR, DINO
The seven learning-signal techniques (recovery / bottleneck / dataset / invariance / equivariance / transformation-estimation / generative) organise self-supervision. Pretext tasks: rotation prediction (Gidaris 2018, with the global-average-pool collapse trap), jigsaw, context prediction, inpainting, colourisation. SimCLR maximises agreement between two augmented views via NT-Xent with a non-linear projection head (absorbs the invariance pressure so $h$ stays rich). DINO is self-distillation with a teacher EMA, preventing collapse by centring + sharpening. Unsupervised classification needs Hungarian matching (cluster→label, $O(N^3)$); SCAN and self-labelling-by-clustering (optimal transport, equal-size constraint) are the named methods.
- Setup / seven signals / pretext tasks∆unsupervised-learning-setup, ∆seven-learning-signal-techniques, ∆image-corruption-recovery-tasks, ∆bottleneck-restrictions-options, ∆context-prediction-task, ∆rotation-prediction-task, ∆weak-supervision-definition, ∆bite-rotation-prediction-citation, ∆bite-rotation-prediction-gap-collapse, ∆bite-classical-pretext-tasks, ∆bite-invariance-vs-equivariance, ∆bite-unsupervised-vs-self-supervised — the seven techniques; rotation/jigsaw/context/inpaint/colourise.
- SimCLR / DINO∆simclr-framework, ∆dino-architecture, ∆dino-architecture-why-random-transformation-or-crop, ∆dino-centring-and-sharpening, ∆dino-features-usefulness, ∆bite-simclr-citation-and-best-augmentation, ∆bite-simclr-projection-head-rationale, ∆bite-dino-centring-sharpening-roles — NT-Xent + projection head; teacher-EMA + centring/sharpening.
- Unsupervised classification∆hungarian-matching, ∆self-labelling-by-clustering, ∆bite-hungarian-matching-complexity, ∆bite-scan-three-step, ∆bite-self-labelling-citation-and-balance — $\min _ P\mathrm{Tr}(PC)$; SCAN; self-labelling optimal transport.
Source: Lecture 18, Learning Signals / Transformation Estimation / SimCLR / DINO / SCAN / Self-labelling slides.
Vision and language
Tokenisation, LMs and alignment
Sub-word BPE tokenisation = the optimisation “assign strings to a budget of $N$ tokens minimising total token count”. Alignment: RLHF three steps (SFT → reward model from rankings → PPO with a KL penalty) and the simpler DPO (a single supervised loss with its $\sigma(\hat r _ l-\hat r _ w)$-weighted gradient, no reward model, no RL). Both are vulnerable to jailbreaks (grandma role-play; visual adversarial perturbations).
- Tokenisation / RLHF / DPO / jailbreaks∆tokenisation-optimisation-task, ∆rlhf-reward-model-architecture, ∆bite-bpe-tokenisation, ∆bite-rlhf-three-step, ∆bite-dpo-loss-and-gradient, ∆bite-jailbreaks-examples, ∆bite-visual-adversarial-jailbreak — BPE objective; RLHF 3-step; DPO loss/gradient; jailbreaks.
Source: Lecture 19, Words to Tokens / ChatGPT Training / DPO / Jailbreaks slides.
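The DPO loss and its gradient weight can be sketched for a single preference pair. The implicit reward $\hat r=\beta(\log\pi-\log\pi _ \text{ref})$ and the parameter `beta` follow the standard DPO formulation and are assumptions here, since the notes quote only the gradient weight; the log-probabilities are made-up numbers.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Loss and gradient weight for one (preferred w, rejected l) response pair."""
    r_w = beta * (logp_w - ref_logp_w)      # implicit reward of the preferred response
    r_l = beta * (logp_l - ref_logp_l)      # implicit reward of the rejected response
    loss = -np.log(sigmoid(r_w - r_l))
    grad_weight = sigmoid(r_l - r_w)        # large when the model ranks the pair wrongly
    return loss, grad_weight

# Policy identical to the reference: no preference expressed yet.
loss_init, weight_init = dpo_loss(-5.0, -7.0, -5.0, -7.0)

# Policy already strongly prefers the winner: tiny loss, tiny gradient weight.
loss_good, weight_good = dpo_loss(-5.0, -10.0, -10.0, -5.0, beta=1.0)
```

At initialisation the loss is $\log 2$ with weight $0.5$; as the policy learns the ranking, the weight decays, so already-correct pairs stop driving the update.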
Grounding, VQA and vision-language models
The three V+L tasks are referring expressions, VQA and visual grounding. Pre-transformer MAttNet decomposes into subject/location/relationship modules; MDETR is image–text transformer + bipartite matching; UNITER pretrains on MLM + MRM + WRA/ITM; GLIP does phrase grounding via word–region alignment. CLIP’s three-step zero-shot (contrastive pretrain → “A photo of a {class}” text classifier → similarity argmax) matches a supervised ResNet-50 on ImageNet without training on any ImageNet labels. Flamingo = frozen vision encoder + Perceiver Resampler + GATED XATTN-DENSE into a frozen LM; Tsimpoukelli’s frozen-LM few-shot maps images into the LM’s text-embedding space.
- Tasks∆referring-expressions-task, ∆visual-question-answering-task, ∆visual-grounding-definition — the three V+L tasks.
- V+L models∆bite-mattnet-three-modules, ∆bite-mdetr-architecture, ∆bite-uniter-three-objectives, ∆bite-glip-phrase-grounding, ∆bite-clip-three-step-zero-shot, ∆bite-flamingo-architecture, ∆bite-frozen-lm-few-shot — MAttNet/MDETR/UNITER/GLIP; CLIP zero-shot; Flamingo / frozen-LM.
Source: Lecture 19, Referring Expressions / VQA / UNITER / CLIP / Flamingo slides.
Ethics, bias and privacy
Formalising fairness
Two fairness criteria: independence ($R\perp A$) and separation ($R\perp A\mid Y$, = error-rate parity). The impossibility result: for binary $Y$ with $A,R$ both dependent on $Y$, independence and separation cannot both hold — “there is no single mathematical definition of fairness, and you cannot satisfy all of them” (Narayanan’s 21 definitions). “No fairness through unawareness”: dropping a sensitive feature fails because proxies correlate.
- Fairness criteria / impossibility∆no-fairness-through-unawareness, ∆fairness-independence-definition, ∆fairness-separation-definition, ∆separation-implies-error-rate-parity, ∆independence-separation-incompatibility, ∆bite-error-rate-fpr-fnr-formulas, ∆bite-fpr-fnr-stakeholder-tradeoff, ∆bite-many-fairness-definitions-and-politics — independence vs separation; the two-criterion impossibility.
Source: Lecture 20, Formalizing Fairness slides.
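A toy example makes the tension between the two criteria concrete. The data below are invented: a perfect classifier on two groups with different base rates satisfies separation (equal error rates) while violating independence (unequal positive rates).

```python
import numpy as np

def group_rates(y_true, y_pred, group):
    """Per-group positive rate, false-positive rate and false-negative rate."""
    rates = {}
    for g in np.unique(group):
        m = group == g
        y, r = y_true[m], y_pred[m]
        rates[int(g)] = {
            "positive_rate": float(r.mean()),       # P(R=1 | A=g), compared by independence
            "fpr": float(r[y == 0].mean()),         # P(R=1 | Y=0, A=g), compared by separation
            "fnr": float(1.0 - r[y == 1].mean()),   # P(R=0 | Y=1, A=g), compared by separation
        }
    return rates

group = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_true = np.array([1, 1, 1, 0, 1, 0, 0, 0])   # base rate 0.75 in group 0, 0.25 in group 1
y_pred = y_true.copy()                        # a perfect classifier

rates = group_rates(y_true, y_pred, group)
```

Forcing the positive rates to match here would require introducing errors in at least one group, which breaks separation; that is the two-criterion impossibility in miniature.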
Harms, case studies and dataset accountability
Allocative (unfair resource allocation) vs representational harms (denigration / stereotype / recognition / under-representation / ex-nomination). Case studies with the actual numbers: COMPAS (Black FPR ~44.9% vs White ~23.5%, even though race is not an input), Gender Shades (dark-female error ~20-35% vs light-male ~0-1%), bias amplification (Zhao 2017, CNN predictions more skewed than training data). Accountability: Datasheets for Datasets, Model Cards (the CLIP “any deployed use is out of scope” example), CelebA subjective-attribute critique, PASS (1.4M human-free images), red-circle visual prompt engineering.
- Harm taxonomy∆allocative-harms-definition, ∆representational-harms-types, ∆representational-harms-classification-example, ∆gender-bias-definition, ∆bias-amplification-problem, ∆bite-zhao-2017-bias-amplification — allocative vs the five representational harms; bias amplification.
- Case studies∆bite-compas-fpr-fnr-disparity, ∆bite-gender-shades-disparity — COMPAS; Gender Shades intersectionality.
- Accountability∆bite-datasheets-and-model-cards, ∆bite-clip-model-card-out-of-scope, ∆bite-celeba-attribute-critique, ∆bite-pass-dataset-no-humans, ∆bite-laion-5b-csam-incident, ∆bite-red-circle-visual-prompt-engineering — datasheets/model cards; CelebA critique; PASS; LAION-5B; red circles.
Source: Lecture 20, Allocative/Representational Harms / COMPAS / Gender Shades / Datasheets / Model Cards slides.
Methods
Classical image pipeline
Image as $f(x,y)$ → pointwise/geometric/filter operations; geometric transforms by backward warp + interpolation; large convolutions via the FFT; restoration by Wiener filter or $\min\Vert g-Af\Vert ^2+\lambda p(f)$.
- Filter / warp / restore∆bite-three-image-transformation-categories, ∆bite-forward-vs-backward-warp, ∆convolution-theorem-fft-speedup, ∆wiener-filter-definition, ∆generative-degradation-recovery — the classical toolbox.
Source: Lectures 2-4.
Feature matching and retrieval
SIFT (LoG detect → 128-D describe → NN match) + RANSAC verification; bag-of-visual-words ($k$-means vocabulary → sparse histogram → cosine retrieval).
- SIFT + BoW∆sift-three-stages, ∆bite-geometric-verification-ransac, ∆bite-bag-of-visual-words-retrieval-pipeline — detect/describe/match + RANSAC + BoW retrieval.
Source: Lecture 5.
Deep recognition pipeline
Embed-then-classify fused end-to-end: CNN/ViT backbone + task head, trained with SGD-momentum/Adam, cross-entropy / smooth-$L _ 1$ / multi-task losses, BatchNorm/LayerNorm + residual connections + dropout, transfer learning from a large pretrained backbone.
- Train a deep net∆bite-two-step-image-classification, ∆adam-update-rule, ∆bite-transfer-learning-recipe, ∆bite-multi-task-loss-structure — the modern recognition recipe.
Source: Lectures 6-8.
Detection and segmentation pipelines
Two-stage (Faster R-CNN: backbone + RPN + RoIPool, or RoIAlign as in Mask R-CNN, + heads) or one-stage (YOLO grid + five-term loss); DETR set prediction; segmentation by encoder–decoder / U-Net / Mask R-CNN / MaskFormer / SAM; evaluate with AP/mAP and per-class IoU.
- Detect / segment∆r-cnn-family, ∆yolo-loss, ∆detr-architecture, ∆unet-architecture, ∆mask-r-cnn-architecture, ∆ap-triple-average-definition — the detector/segmenter families + AP.
Source: Lectures 10-11.
Motion and geometry
Optical flow (motion constraint + smoothness, or RAFT) and Lucas–Kanade Gauss–Newton tracking; camera calibration / fundamental-matrix estimation by the universal stack-and-SVD-then-rank-2 recipe; NeRF analysis-by-synthesis via differentiable volume rendering.
- Flow / track / reconstruct∆motion-constraint-equation, ∆lk-update-rule-statement, ∆eight-point-algorithm-derivation, ∆linear-calibration-derivation, ∆bite-nerf-three-components — the geometry/motion algorithms.
Source: Lectures 13-15.
Generation and self-supervision
Diffusion sampling (fixed forward, learned reverse, DDPM step, latent + cross-attention); contrastive/InfoNCE representation learning; SimCLR/DINO self-supervision; CLIP zero-shot transfer.
- Generate / self-supervise∆ddpm-sampling-step, ∆stable-diffusion-overview, ∆contrastive-loss-as-cross-entropy, ∆simclr-framework, ∆bite-clip-three-step-zero-shot — diffusion + contrastive + zero-shot.
Source: Lectures 16-19.