Computer Vision MT25, Overview of results and methods


This is a map of every main result and method in the course. Computer Vision MT25 has two halves welded together by one idea. The first half is classical: an image is a sampled 2D function, and everything (filtering, restoration, features, geometry) is calculus, linear algebra and Fourier analysis on that function. The second half is learned: the hand-crafted pipeline (SIFT descriptor, HOG+SVM detector, intensity-based flow) is replaced end-to-end by a CNN or transformer, but the tasks (classification, detection, segmentation, correspondence, generation) and the evaluation machinery (IoU, AP, FID) are unchanged. The recurring slogan is that old ideas keep coming back: backward-warp interpolation reappears in RoIAlign, the SVD null-space trick solves homographies / the fundamental matrix / camera calibration identically, brightness-constancy powers both Lucas–Kanade and optical flow, and the contrastive/InfoNCE loss is just softmax cross-entropy with similarities as logits.

Images as functions

Image representation

An image is two things at once: a discrete array (H, W, C) indexed matrix-major ([0,0] top-left, first axis = row = $y$), and samples from a continuous 2D function $f(x, y)$. The array is what we implement; the function is what we reason about — it is what lets us write filtering, geometric transformations, derivatives and Fourier analysis as calculus, and it is the representation Lectures 2-3 are built on.

Source: Lecture 2, Images as Pixels / Images as Functions slides.

Sampling, aliasing and the Nyquist–Shannon theorem

Sampling records $f$ at discrete points; reconstruction guesses between them. Undersampling causes aliasing — high-frequency content “travels in disguise” as a lower frequency (the rotating-wheel / moiré / disintegrating-checkerboard phenomenon). The Nyquist–Shannon theorem is the precise cutoff: sample at $\ge 2 f _ {\max}$ for perfect reconstruction. The fix is to low-pass filter before subsampling, which is the single justification behind blur-then-downsample and the entire anti-aliasing story.

Source: Lecture 2, Nyquist–Shannon Sampling Theorem / Aliasing in the Wild slides.
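
A minimal NumPy sketch of aliasing, under assumed illustrative numbers (a 10 Hz sampling rate and a 9 Hz tone, neither taken from the slides): sampled below its Nyquist rate, the 9 Hz sine produces exactly the same samples as a sign-flipped 1 Hz sine.

```python
import numpy as np

fs = 10.0                     # sampling rate in Hz (illustrative assumption)
n = np.arange(20)             # sample indices
t = n / fs                    # sample times in seconds

high = np.sin(2 * np.pi * 9.0 * t)    # 9 Hz tone, above fs / 2 = 5 Hz
alias = -np.sin(2 * np.pi * 1.0 * t)  # 1 Hz tone with flipped sign

# At these sample times the two signals cannot be told apart.
max_gap = np.max(np.abs(high - alias))
```

Sampling the same tone at 20 Hz or more would keep it distinguishable, which is the $2 f _ {\max}$ cutoff in action.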

Subsampling and interpolation

Naive subsampling (keep only every $2^n$-th pixel) aliases in high-frequency regions, so blur first. Upsampling is interpolation: the bilinear weight identity (midpoint = average of two corners, centre = $(A+B+C+D)/4$, the same result whichever axis is interpolated first) generalises to the area-weighted formula $f(x,y)=\sum _ {i,j} w _ {ij}\,f _ {ij}$, where each corner value $f _ {ij}$ is weighted by the area of the sub-rectangle opposite it. The choice between nearest-neighbour (preserves discrete labels), bilinear (no out-of-range overshoot, cheap), and bicubic ($C^1$, smooth) is task-driven and reappears verbatim in segmentation upsampling and RoIAlign.

Source: Lecture 2, Subsampling / Bilinear Interpolation / Useful Interpolation Properties slides; Problem Sheet I Q2.
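
The bilinear identities can be checked with a small sketch. The function name `bilinear` and the (x = column, y = row) argument order are assumptions made for illustration, not the course's code.

```python
import numpy as np

def bilinear(img, x, y):
    """Sample a 2D array at real-valued (x, y); x indexes columns, y rows."""
    h, w = img.shape
    x0 = int(np.clip(np.floor(x), 0, w - 2))
    y0 = int(np.clip(np.floor(y), 0, h - 2))
    ax, ay = x - x0, y - y0
    # Interpolate along x on the two rows, then along y between them.
    top = (1 - ax) * img[y0, x0] + ax * img[y0, x0 + 1]
    bottom = (1 - ax) * img[y0 + 1, x0] + ax * img[y0 + 1, x0 + 1]
    return (1 - ay) * top + ay * bottom

corners = np.array([[1.0, 3.0],
                    [5.0, 11.0]])
centre = bilinear(corners, 0.5, 0.5)     # (1 + 3 + 5 + 11) / 4
midpoint = bilinear(corners, 0.5, 0.0)   # (1 + 3) / 2
```

Because each weight lies in $[0,1]$ and the four weights sum to 1, the result never leaves the range of the corner values, which is the no-overshoot property.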

Image transformations: the three-category taxonomy

Every image operation is one of three: pointwise $g=t(f)$ (negation, contrast $af+b$, gamma) — acts on the range; geometric $g(x,y)=f(T(x,y))$ (translation/scale/rotation/shear/homography) — acts on the domain; filtering $g=F(N(x,y))$ — acts on a neighbourhood. This taxonomy organises Lectures 2-3 and is a recurring problem-sheet question. Geometric transforms are applied by a backward warp (iterate over target pixels, look up the source via $T^{-1}$, interpolate) to avoid the holes/pile-ups of forward warps.

Source: Lecture 2, Image Transformations / Forward / Backward Warps slides; Lecture 5, Optical Axis slide.
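
A hedged sketch of a backward warp follows. It uses nearest-neighbour lookup for brevity (bilinear would be the usual choice), and `backward_warp` and its arguments are hypothetical names. Here `T` denotes the forward map from source to target, so the loop looks up the source through its inverse.

```python
import numpy as np

def backward_warp(src, T_inv, out_shape):
    """Fill each target pixel from src at T_inv @ (x, y, 1), nearest-neighbour."""
    h, w = out_shape
    out = np.zeros(out_shape, dtype=src.dtype)
    for y in range(h):
        for x in range(w):
            sx, sy, sw = T_inv @ np.array([x, y, 1.0])
            sx, sy = int(round(sx / sw)), int(round(sy / sw))
            if 0 <= sy < src.shape[0] and 0 <= sx < src.shape[1]:
                out[y, x] = src[sy, sx]   # every target pixel is visited once: no holes
    return out

src = np.arange(16, dtype=float).reshape(4, 4)
# Forward map: shift content one pixel to the right.
T = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
shifted = backward_warp(src, np.linalg.inv(T), src.shape)
```

Target pixels whose source falls outside the image are left at zero, which is one of several possible border policies.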

Affine transforms and homogeneous coordinates

Embedding $(x,y)\mapsto(x,y,1)$ lets translation, scaling, rotation and shear all become a single $3\times3$ matrix multiply, so transforms compose by matrix product (non-commutative — order matters). Affine maps (6 DoF) preserve collinearity, parallelism and convexity; a homography (8 DoF, defined up to scale) drops the bottom row’s $(0,0,1)$, gaining a perspective term that maps points-at-infinity to finite vanishing points so parallelism is not preserved. Point–line duality via the cross product ($\ell=\mathbf p _ 1\times\mathbf p _ 2$, intersection $=\ell _ 1\times\ell _ 2$) and the homogeneous-system SVD trick recur in SIFT, MVG and camera calibration.

Source: Lecture 2, Affine Transformation Examples / Combining Transformations slides; Lecture 5, Homography slides; Problem Sheet II Q1.
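
As an illustration (helper names `translation` and `rotation` are assumptions), the sketch below composes two $3\times3$ transforms in both orders to show non-commutativity, then uses the cross product for point-line duality.

```python
import numpy as np

def translation(tx, ty):
    return np.array([[1.0, 0.0, tx], [0.0, 1.0, ty], [0.0, 0.0, 1.0]])

def rotation(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

T, R = translation(2.0, 0.0), rotation(np.pi / 2)
p = np.array([1.0, 0.0, 1.0])            # the point (1, 0)

rotate_then_translate = T @ R @ p        # (1,0) -> (0,1) -> (2,1)
translate_then_rotate = R @ T @ p        # (1,0) -> (3,0) -> (0,3)

# Point-line duality: the line through two points, and the meet of two lines.
line_a = np.cross([0.0, 0.0, 1.0], [1.0, 1.0, 1.0])   # y = x
line_b = np.cross([0.0, 2.0, 1.0], [2.0, 0.0, 1.0])   # x + y = 2
meet = np.cross(line_a, line_b)
meet = meet / meet[2]                    # dehomogenise to (1, 1)
```

The rightmost matrix acts first, so `T @ R` means "rotate, then translate".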

Convolution

Discrete convolution $(f\ast g)[x]=\sum _ u f[u]\,g[x-u]$ is the implementation of every filter; with finite support a centring shift $\tfrac{M-1}{2}$ aligns the kernel on the output pixel — which forces odd kernel sizes ($3,5,7$) and symmetric padding, the convention CNNs inherit. It is commutative (hence the right-to-left flip), costs $O(N^2 M^2)$ in 2D naively, and is reduced to $O(N^2\log N)$ by the FFT (next section).

Source: Lecture 2, Discrete Convolution (with Finite Support) slides; Problem Sheet I Q3.
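
The definition can be written out directly. This is a deliberately naive sketch (the name `conv1d` is an assumption) that mirrors the sum term by term and then applies the centring shift $\tfrac{M-1}{2}$ to obtain a same-size output.

```python
import numpy as np

def conv1d(f, g):
    """Full discrete convolution: out[x] = sum_u f[u] * g[x - u]."""
    out = np.zeros(len(f) + len(g) - 1)
    for x in range(len(out)):
        for u in range(len(f)):
            if 0 <= x - u < len(g):
                out[x] += f[u] * g[x - u]
    return out

f = np.array([1.0, 2.0, 3.0, 4.0])
g = np.array([1.0, 0.0, -1.0])           # odd-length kernel, centre at index (M-1)/2 = 1

full = conv1d(f, g)
shift = (len(g) - 1) // 2
same = full[shift:shift + len(f)]        # centred output, same length as f
```

Swapping the arguments gives the same `full` array, which is the commutativity property; in practice one would call `np.convolve` rather than loop.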

Filtering: blur, Gaussian, bilateral, median

A filter replaces a pixel by a function of its neighbourhood. The box (mean) blur is energy-preserving but blocky (sinc spectrum, ringing); the Gaussian blur weights by distance, is separable ($O(k)$ not $O(k^2)$), and has a sidelobe-free Gaussian spectrum — the reason it is the default low-pass. The bilateral filter multiplies the spatial Gaussian by an intensity Gaussian to get edge-preserving smoothing; the median filter is non-linear (not a convolution) and the standard salt-and-pepper / outlier remover.

Source: Lecture 2, Averaging / Gaussian Blur / Bilateral Filter / Median Filtering slides.

Frequency domain

The Fourier transform (1D, DFT, 2D)

$F(u)=\langle f,\psi _ u\rangle$ writes any signal as a weighted combination of the orthonormal complex-exponential basis $\psi _ u(t)=e^{i2\pi ut}$; $ \vert F(u) \vert $ is the amplitude, $\arg F(u)$ the phase. The DFT is the matrix multiply $F=Uf$ (periodic output); the 2D basis $e^{i2\pi(ux+vy)}$ gives the magnitude plots where centre = low frequency, radius = high frequency, and a vertical image edge appears as a horizontal spectral streak. Translation leaves $ \vert F \vert $ unchanged (only phase), rotation rotates it, scaling inverse-scales it.

Source: Lecture 3, The Fourier Transform / DFT / 2D Basis Functions / Interpreting DFTs slides.
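
A small sketch of the DFT as a matrix multiply, assuming NumPy's sign and normalisation convention (the conjugate in the inner product gives the negative exponent; no $1/\sqrt N$ factor). It also checks that a circular translation changes only the phase.

```python
import numpy as np

N = 8
k = np.arange(N)
U = np.exp(-2j * np.pi * np.outer(k, k) / N)   # U[u, t] = e^{-i 2 pi u t / N}

rng = np.random.default_rng(0)
f = rng.standard_normal(N)

F = U @ f                                  # DFT as a matrix multiply
F_shifted = U @ np.roll(f, 3)              # circular translation by 3 samples

amplitude = np.abs(F)
phase = np.angle(F)
```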

Convolution theorem and frequency-domain filtering

$\mathcal F\{f\ast g\}=\mathcal F\{f\}\,\mathcal F\{g\}$: spatial convolution = pointwise frequency multiplication, so convolving a length-$N$ signal with a comparably long kernel drops from $O(N^2)$ to $O(N\log N)$ via the FFT (in 2D, from $O(N^2M^2)$ to $O(N^2\log N)$; CUDA picks the cheaper route automatically). It explains box-vs-Gaussian ringing (sinc sidelobes vs clean Gaussian envelope), and it enables low/high-pass filtering by masking the spectrum, periodic-noise removal by notching out spectral peaks (the 1966 lunar-orbiter scan-lines), and the hybrid-image construction.

Source: Lecture 3, Convolution Theorem / Removing Periodic Patterns / Hybrid Images slides.
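
The theorem can be checked numerically. The sketch assumes circular (periodic) convolution, which is what the DFT's periodicity implies; `circular_conv` is an illustrative helper, not a library function.

```python
import numpy as np

def circular_conv(f, g):
    """Direct O(N^2) circular convolution with wrap-around indexing."""
    N = len(f)
    return np.array([sum(f[u] * g[(x - u) % N] for u in range(N)) for x in range(N)])

rng = np.random.default_rng(0)
f = rng.standard_normal(16)
g = rng.standard_normal(16)

direct = circular_conv(f, g)
via_fft = np.real(np.fft.ifft(np.fft.fft(f) * np.fft.fft(g)))
```

For a linear (non-wrapping) convolution both signals would first be zero-padded to length at least `len(f) + len(g) - 1`.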

Image restoration

Degradation model and naive deconvolution

Restoration (unlike enhancement) models the degradation: $g = d\ast f + n$ with $d$ the point-spread function. Naive Fourier-domain deconvolution $\hat F = G/D$ catastrophically amplifies noise because $D$ is tiny at high frequencies — the canonical “what went wrong” failure (the deblurred image is pure high-frequency grid noise).

Source: Lecture 4, Modelling Degradation / What went wrong? slides.

The Wiener filter

The Wiener filter $W=\dfrac{D^\ast S}{ \vert D \vert ^2 S + K}$ (with $S$ the signal and $K$ the noise power spectrum) is the MMSE linear estimator $\min _ W\mathbb E\lVert f-\hat f\rVert^2$; in its inverse-filter-with-shrinkage form $W=\dfrac{1}{D}\cdot\dfrac{ \vert D \vert ^2}{ \vert D \vert ^2+K/S}$ it acts like $1/D$ where SNR is high and shrinks toward zero where noise dominates, de-emphasising exactly the high frequencies that wreck naive deconvolution (natural images’ $1/f^2$ spectrum makes this an automatic win). Motion blur is a line-segment PSF removed by rotate-to-horizontal → estimate length → build PSF → Wiener.

Source: Lecture 4, The Wiener Filter / Motion Blur and WFs slides; Problem Sheet I Q4.
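
A 1D sketch contrasting naive deconvolution with a Wiener-style filter. Everything specific here is an assumption for illustration: the box signal, the Gaussian PSF with $\sigma=2$, the noise level, and especially the constant noise-to-signal ratio `nsr` standing in for the true spectra.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 256
x = np.arange(N)

f = ((x > 80) & (x < 160)).astype(float)           # clean 1D "image": a box
d = np.exp(-0.5 * ((x - N // 2) / 2.0) ** 2)       # Gaussian PSF, sigma = 2 (assumed)
d = np.fft.ifftshift(d / d.sum())                  # unit sum, centred at index 0

D = np.fft.fft(d)
g = np.real(np.fft.ifft(D * np.fft.fft(f))) + 0.01 * rng.standard_normal(N)
G = np.fft.fft(g)

nsr = 1e-3                                         # assumed constant noise-to-signal ratio
naive = np.real(np.fft.ifft(G / D))                # divides by tiny |D| at high frequency
wiener = np.real(np.fft.ifft(G * np.conj(D) / (np.abs(D) ** 2 + nsr)))

err_naive = np.sqrt(np.mean((naive - f) ** 2))     # RMS error of each estimate
err_wiener = np.sqrt(np.mean((wiener - f) ** 2))
```

With these settings the naive estimate is swamped by amplified noise while the Wiener-style estimate stays close to the box; the exact error values depend on the assumed `nsr`.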

Generative (inverse-problem) degradation model

Generalising to $g = Af + n$, restoration becomes $\hat f=\arg\min _ f\lVert g-Af\rVert^2+\lambda p(f)$ with a prior $p$: the TV ($L _ 1$ gradient) prior preserves edges where Tikhonov ($L _ 2$) would blur them. The same template gives super-resolution (multiple sub-pixel-shifted images densify the sampling, a Nyquist argument; the Mars-lander example) and blind deblurring (joint $\min _ {f,h}$).

Source: Lecture 4, Inverse Problem / Super Resolution / Blind Deblurring slides.

Features, correspondences and SIFT

The correspondence problem

“The three most important problems in computer vision: correspondence, correspondence, correspondence” (Kanade). Direct image subtraction fails under viewpoint change; matching needs local descriptors robust to small geometric/photometric changes, then geometric verification (RANSAC) to keep only matches consistent with one global transform. Optical flow, stereo, tracking, retrieval and multi-view 3D are all correspondence problems sharing this machinery.

Source: Lecture 5, Finding Correspondences / Feature Matching slides.

SIFT: detection, description, matching

Three stages: (1) detect keypoints as scale-space extrema of the Laplacian-of-Gaussian (a blob detector) over $(x,y,\sigma)$; (2) describe each by rotating to its dominant orientation then concatenating $4\times4\times8=128$ edge-orientation histograms — making it rotation/scale/translation/illumination-robust; (3) match by nearest-descriptor. A good keypoint is a blob/corner (the Christ Church A/B/C/D worked example); SIFT (Lowe 2004, ~77k cites) is the archetypal hand-crafted representation that LIFT later learns.

Source: Lecture 5, Scale Invariant Feature Transform / Keypoints / SIFT Descriptor slides.

Visual words and bag-of-words retrieval

$k$-means the SIFT descriptors of a corpus into $K$ “visual words”; each image becomes a sparse $K$-histogram. Brute-force feature matching of $10^{10}$ images is $\sim10^{18}$ ops (infeasible); BoW reduces image comparison to a sparse $O(K)$ dot product, making web-scale retrieval (precompute vocabulary + histograms, query → histogram → cosine top-$M$ → optional geometric re-rank) feasible.

Source: Lecture 5, K-means Clustering / Visual Words / Image Retrieval slides; cf. Notes - Machine Learning MT23, k-means clustering.
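
A toy sketch of the histogram-and-compare step. The vocabulary is fixed by hand rather than learned with $k$-means, and the 2D "descriptors" are made up (real SIFT descriptors are 128-dimensional); `bow_histogram` and `cosine` are illustrative names.

```python
import numpy as np

def bow_histogram(descriptors, vocabulary):
    """Assign each descriptor to its nearest visual word and count occurrences."""
    d2 = ((descriptors[:, None, :] - vocabulary[None, :, :]) ** 2).sum(axis=2)
    words = d2.argmin(axis=1)
    return np.bincount(words, minlength=len(vocabulary)).astype(float)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hand-picked K = 3 vocabulary and three tiny "images".
vocabulary = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
image_a = np.array([[0.1, 0.2], [9.8, 0.1], [10.2, -0.3]])
image_b = np.array([[0.3, -0.1], [9.9, 0.4], [10.1, 0.2]])
image_c = np.array([[0.2, 9.7], [-0.1, 10.3], [0.4, 10.1]])

h_a, h_b, h_c = (bow_histogram(d, vocabulary) for d in (image_a, image_b, image_c))
sim_ab = cosine(h_a, h_b)   # same words used: similarity 1
sim_ac = cosine(h_a, h_c)   # no shared words: similarity 0
```

Comparing two images costs one $K$-dimensional dot product regardless of how many descriptors each image had.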

Classification and classical ML

Image classification: embeddings, $k$-NN, the two-step pipeline

Classical recognition is two steps: compute an embedding $\phi:\mathbb R^{H\times W\times3}\to\mathbb R^d$ (FFT / BoW / HOG) then learn a classifier (SVM / kernel SVM / random forest) on it — deep learning fuses and learns both steps, the 2012 AlexNet ImageNet jump. $k$-NN is the strong no-training baseline (naturally multi-class, single hyperparameter $k$, improves with data); accuracy is misleading under class imbalance, motivating precision/recall.

Source: Lecture 6, Image Classification in 2 Steps / Nearest Neighbour / Accuracy slides.

Linear SVM, softmax and cross-entropy

The maximum-margin SVM places the boundary as far from the data as possible; the hinge loss $\max(0,1-y f)$ is a convex 0-1 surrogate whose sparse-support-vector property and $L _ 2$ regulariser give the standard formulation. Softmax (with temperature $\tau$: $\tau\to0$ argmax, $\tau\to\infty$ uniform) turns scores into a normalised probability distribution; cross-entropy collapses to $-f _ {GT}+\log\sum _ j e^{f _ j}$ for one-hot targets, the loss-design backbone of every classifier in the course.

Source: Lecture 6, Maximum Margin / Soft-max / Cross-Entropy Loss slides; Problem Sheet II Q2-Q3; cf. Notes - Machine Learning MT23, Cross-entropy loss.
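
A short sketch of temperature softmax and the one-hot cross-entropy shortcut. Function names and the example scores are assumptions; the max-subtraction is a standard numerical-stability step and does not change the result.

```python
import numpy as np

def softmax(scores, tau=1.0):
    z = scores / tau
    z = z - z.max()                      # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(scores, gt):
    """-log softmax(scores)[gt], written as -f_GT + log sum_j exp(f_j)."""
    m = scores.max()
    return -scores[gt] + m + np.log(np.exp(scores - m).sum())

scores = np.array([2.0, 1.0, -1.0])
sharp = softmax(scores, tau=0.01)        # close to one-hot on the argmax
flat = softmax(scores, tau=100.0)        # close to uniform
loss = cross_entropy(scores, gt=0)
```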

Precision, recall and average precision

Precision $=\mathrm{TP}/(\mathrm{TP}+\mathrm{FP})$, recall $=\mathrm{TP}/(\mathrm{TP}+\mathrm{FN})$ trade off as the confidence threshold sweeps, tracing the PR curve; AP is the area under it, and mAP the triple average over recall levels, IoU thresholds and classes. AP ignores true negatives, so it survives the heavy class imbalance of detection where accuracy collapses — the evaluation engine for detection and segmentation.

Source: Lecture 6, Precision and Recall slide; Lecture 10, Average Precision slides.
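
One common way to compute AP is sketched below: the mean of the precision values at the rank of each true positive. This is only one convention (interpolated variants exist), and the sketch assumes every ground-truth positive appears somewhere in the ranked list; `average_precision` is an illustrative name.

```python
import numpy as np

def average_precision(scores, labels):
    """Mean of the precision values at the rank of each true positive."""
    order = np.argsort(-np.asarray(scores))          # highest confidence first
    hits = np.asarray(labels)[order]
    tp = np.cumsum(hits)                             # true positives so far
    precision = tp / np.arange(1, len(hits) + 1)     # TP / (TP + FP) at each rank
    return float((precision * hits).sum() / hits.sum())

scores = [0.9, 0.8, 0.7, 0.6, 0.5]
labels = [1, 0, 1, 1, 0]                 # 1 = correct detection, 0 = false positive
ap = average_precision(scores, labels)   # (1/1 + 2/3 + 3/4) / 3
```

True negatives never enter the computation, which is why the metric is insensitive to the large background class.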

Deep networks and training

Neural-network building blocks

Three primitives recur in every modern architecture: residual connections $y=f(x)+x$ (learn the change, gradient flows through the identity), normalisation (BatchNorm rescales by batch statistics — train/test asymmetry is a notorious bug source; LayerNorm/Instance/Group differ only in which axes they average, and transformers use LayerNorm for batch-independence), and transfer learning (pretrain on a large dataset, fine-tune with a lower LR / frozen backbone — justified by the universality of first-layer filters).

Source: Lecture 8, Residual Connections / Batch-Normalisation / Transfer Learning slides; cf. Notes - Machine Learning MT23, Neural networks.

Optimisers and loss design

SGD-with-momentum $\Delta w _ t=\rho\Delta w _ {t-1}-\lambda g _ t$ accelerates along consistent gradients and damps oscillations; Adam adds per-coordinate scaling via bias-corrected first/second moments. Loss design: $L _ 1$ (robust, constant gradient) vs $L _ 2$ (outlier-sensitive); smooth-$L _ 1$ glues the $L _ 2$ centre to the $L _ 1$ tail (the default box-regression loss); multi-task losses sum task losses with a balancing $\lambda$ (the R-CNN classification + regression template).

Source: Lecture 7, Optimization Methods / Loss Function — Types slides; Lecture 10, Multi-task Loss slide.
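
A sketch of the momentum update and the smooth-$L _ 1$ loss. The toy quadratic objective, the hyperparameter values, and the transition point `beta = 1` are assumptions chosen for illustration.

```python
import numpy as np

def smooth_l1(r, beta=1.0):
    """Quadratic for |r| < beta, linear beyond; beta = 1 is an assumed default."""
    a = np.abs(r)
    return np.where(a < beta, 0.5 * a ** 2 / beta, a - 0.5 * beta)

def sgd_momentum(grad, w0, lr=0.1, rho=0.9, steps=200):
    """Iterate delta_t = rho * delta_{t-1} - lr * grad(w_t); w_{t+1} = w_t + delta_t."""
    w, delta = float(w0), 0.0
    for _ in range(steps):
        delta = rho * delta - lr * grad(w)
        w = w + delta
    return w

# Minimise the toy objective 0.5 * (w - 3)^2, whose gradient is w - 3.
w_final = sgd_momentum(lambda w: w - 3.0, w0=0.0)
losses = smooth_l1(np.array([0.5, 2.0]))   # quadratic branch, then linear branch
```

The two branches of `smooth_l1` meet with matching value and slope at `|r| = beta`, which is what "glues" the centre to the tail.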

Regularisation, learning curves and the bias–variance frame

Dropout zeros neurons with $p\approx0.5$ at train time (rescaled, off at test) preventing co-adaptation — an ensemble-of-subnetworks view. $L _ 1$/$L _ 2$ penalties and early stopping are the other regularisers. Learning curves are read in the bias–variance frame: the human↔train gap is bias (fix: bigger model / longer), train↔val is variance (fix: more data / augmentation / regularisation); a separate trainval set diagnoses distribution shift; characteristic LR-vs-loss shapes (explode / plateau / stuck / good) are the standard diagnostic.

Source: Lecture 7, Dropout / Regularization / Validation vs Training / Learning Rate Selection slides.

Convolutional networks

CNN architecture and the receptive field

CNNs build invariance to shift/scale/small distortion from three properties: local connections, weight sharing, spatial subsampling. The output-size formula $h _ \text{out}=\left\lfloor\frac{h _ \text{in}-h+2p}{s}\right\rfloor+1$ (kernel height $h$, padding $p$, stride $s$), $K$ filters → $K$ output channels and $K$ biases, $h\cdot w\cdot d$ weights per filter, and the receptive-field recurrence $r _ \ell=r _ {\ell-1}+(k _ \ell-1)j _ {\ell-1}$, $j _ \ell=j _ {\ell-1}s _ \ell$ are the examinable bookkeeping. Pooling adds parameter-free shift-invariance; a non-linearity between convs is essential or the stack collapses to one linear map.

Source: Lecture 7, Convolutional Layer / Pooling / Activation Function slides; cf. Notes - Machine Learning MT23, Convolutional neural networks.
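
The bookkeeping formulas translate directly into code. The helper names are assumptions, and the layer stacks in the usage lines are made-up examples rather than any named architecture.

```python
def conv_output_size(h_in, k, p=0, s=1):
    """Spatial output size; floor division covers strides that do not divide evenly."""
    return (h_in - k + 2 * p) // s + 1

def receptive_field(layers):
    """layers is a list of (kernel, stride) pairs, input side first."""
    r, j = 1, 1                           # receptive field and jump of the input itself
    for k, s in layers:
        r = r + (k - 1) * j               # r_l = r_{l-1} + (k_l - 1) * j_{l-1}
        j = j * s                         # j_l = j_{l-1} * s_l
    return r

def conv_params(k_h, k_w, d_in, n_filters):
    """h * w * d weights per filter plus one bias per filter."""
    return n_filters * (k_h * k_w * d_in + 1)

size = conv_output_size(32, 5)                         # 28
rf_plain = receptive_field([(3, 1), (3, 1), (3, 1)])   # three 3x3 convs: 7
rf_pooled = receptive_field([(3, 1), (2, 2), (3, 1)])  # conv, 2x2 stride-2 pool, conv: 8
n_params = conv_params(5, 5, 3, 6)                     # 6 * (75 + 1) = 456
```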

Named CNN architectures

The historical chain: Neocognitron (Fukushima 1980) → LeNet-5 (LeCun 1998, first successful CNN) → AlexNet (Krizhevsky 2012, ReLU+GPU+dropout, won ImageNet, interpretable 96 first-layer filters) → VGG (smaller convs, deeper) → ResNet (R18/50/101/152, residual blocks). Each step’s idea recurs downstream.

Source: Lecture 7, Neocognitron / LeNet-5 / VGG / ResNet slides.

Attention and transformers

Scaled dot-product and multi-head attention

$\mathrm{Attn}(Q,K,V)=\mathrm{softmax}(QK^\top/\sqrt d)V$: queries score keys, the $\sqrt d$ keeps the score variance at 1 (since $q^\top k$ has variance $d$) so softmax stays unsaturated. Multi-head runs $H$ projections in parallel + concat + linear, analogous to multiple conv filters. Attention is permutation-equivariant (hence positional encodings) and $O(N^2)$ in tokens — the cost that motivates ViViT factorisation.

Source: Lecture 8, Attention / Self-Attention / Positional Encoding slides; Paper - Attention Is All You Need (2017).
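
A single-head NumPy sketch of scaled dot-product attention, without the learned $Q/K/V$ projections (an assumption made to keep it short). It also checks permutation equivariance: shuffling the input tokens shuffles the outputs the same way.

```python
import numpy as np

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d)) V with a row-wise softmax."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 8))           # 5 tokens, 8 dimensions
out, weights = attention(X, X, X)         # self-attention on raw tokens

perm = rng.permutation(5)
out_perm, _ = attention(X[perm], X[perm], X[perm])
```

The `weights` matrix is $N\times N$, which is where the quadratic cost in the number of tokens comes from.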

Transformer blocks, ViT and architecture variants

The encoder block is Self-Attn → Add+Norm → FFN → Add+Norm; the decoder adds masked self-attention and encoder–decoder cross-attention. ViT (Dosovitskiy 2021) tokenises an image into $16\times16$ patches + CLS token + learned 2D positional encoding and runs a pure encoder — beating CNNs only past ~300M pretraining images. The variant taxonomy (encoder-only / encoder-decoder / decoder-only / dual-encoder / hybrid) places ViT, BERT, GPT, T5, DETR, CLIP, Stable Diffusion and Flamingo on one map.

Source: Lecture 8, The Encoder-Decoder Transformer / Vision Transformers slides.

Interpreting vision models

Explanation taxonomy and the Clever Hans warning

Three explanation families — post-hoc analysis (no performance cost, but local), transparent models (semantic by construction, task-specific), learned explanations (semantic, but need meta-explanation) — each with pro/con trade-offs. The Clever Hans effect (the horse-with-copyright-notice classifier hitting 90% by reading spurious cues) is the central warning that test accuracy alone does not validate a model; GDPR Art. 13.2(f) gives the legal “right to explanation”.

Source: Lecture 9, Clever Hans / Explainability / Taxonomy of Approaches slides.

Visualising weights, activations and attributions

Visualise weights (first-layer filters are interpretable Gabors; deeper not) or activations (PCA / t-SNE — with the caveat t-SNE is stochastic and distance-distorting). Attribution is black-box (occlusion — slow, fill-value-dependent) or white-box (gradient $ \vert \nabla _ x f \vert _ 1$). Input reconstruction by gradient ascent yields adversarial noise unless regularised. ROAR (remove-and-retrain) shows vanilla gradient can be worse than random; sanity checks (randomise weights / labels) must change the saliency map.

Source: Lecture 9, Visualising Weights/Activations / Attribution / ROAR / Sanity Checks slides.

Object detection

Sliding window, HOG+SVM, IoU

Detection = classification + localisation as $(x,y,w,h,c)$. The pre-deep pipeline is a sliding-window binary classifier; Dalal–Triggs (CVPR 2005) used HOG features + linear SVM for pedestrians (the SVM weights visualise as a person silhouette — linear template matching, and HOG is reproducible as a CNN). IoU $=\frac{ \vert GT\cap P \vert }{ \vert GT\cup P \vert }$ jointly scores centre/size/aspect (biased to large objects) and is the match criterion feeding AP.

Source: Lecture 10, Pedestrian Detection / HOG / Intersection over Union slides.
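
IoU for axis-aligned boxes can be sketched as follows. The `(x1, y1, x2, y2)` corner format is an assumed convention (the text above uses centre plus width and height), and `iou` is an illustrative name.

```python
def iou(a, b):
    """Boxes as (x1, y1, x2, y2) corner coordinates, an assumed convention."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))   # overlap width, 0 if disjoint
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))   # overlap height
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union

overlap = iou((0, 0, 2, 2), (1, 1, 3, 3))     # intersection 1, union 7
disjoint = iou((0, 0, 1, 1), (2, 2, 3, 3))
```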

Speed-ups: NMS, hard-negative mining, cascades, proposals

Detection is class-imbalanced (mostly background), so: NMS removes duplicate detections; hard-negative mining (bootstrapping) feeds false positives back as negatives; cascaded classifiers (Viola–Jones) reject easy negatives early; object proposals (Selective Search, ~2000 regions, >95% recall) replace exhaustive sliding windows. ImageNet’s lack of a background class is why naive proposal-classification floods false positives.

Source: Lecture 10, Non-Maximum Suppression / Bootstrapping / Cascaded Classifiers / Selective Search slides.
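
A greedy NMS sketch, assuming corner-format boxes and an IoU threshold of 0.5 (both illustrative choices); `box_iou` and `nms` are hypothetical helper names.

```python
import numpy as np

def box_iou(a, b):
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the best box, drop boxes overlapping it, repeat."""
    order = list(np.argsort(-np.asarray(scores)))
    keep = []
    while order:
        best = order.pop(0)
        keep.append(int(best))
        order = [i for i in order if box_iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)     # the second box overlaps the first (IoU 81/119) and is dropped
```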

The R-CNN family, YOLO and DETR

R-CNN (Selective Search → warp → CNN → SVM, slow: one CNN/proposal) → Fast R-CNN (one CNN pass, RoIPool, FC head) → Faster R-CNN (learned RPN with multi-scale anchors + IoU assignment + offset regression replaces Selective Search). One-stage YOLO predicts boxes on a grid in a single pass with its five-term loss ($\sqrt{w},\sqrt h$ to balance box sizes; $\lambda _ \text{noobj}$ for imbalance). DETR (Carion 2020) reframes detection as set prediction with object queries + bipartite matching — fully end-to-end, no NMS.

Source: Lecture 10, R-CNN / Fast R-CNN / Faster R-CNN / YOLO / DETR slides.

Segmentation

Semantic segmentation, FCN and upsampling

Classification/detection/segmentation are the same spatial-labelling task at image/region/pixel scale. Per-pixel prediction needs an encoder–decoder (downsample then upsample back to input resolution), since a classification backbone’s final map is too coarse. Upsampling primitives: nearest-neighbour, bed-of-nails, bilinear, max-unpooling (stored indices), and the learnable transposed convolution (grid artefacts fixed by a following conv).

Source: Lecture 11, Fully Convolutional Networks / Upsampling: Unpooling / Transposed Convolution slides.

U-Net, instance/panoptic segmentation and SAM

U-Net’s signature is concatenation skip connections (not residual addition) carrying low-level detail into the decoder. Instance segmentation = detect + per-region mask: Mask R-CNN adds a mask branch, with RoIAlign (bilinear, no quantisation) replacing RoIPool because half-pixel misalignment that washes out for box classification ruins masks. MaskFormer reframes segmentation as $N$ mask-query classification; SAM (Kirillov 2023, 1B masks) adds a promptable interface; keypoint/pose prediction reuses the mask head as heatmaps + differentiable soft-argmax.

Source: Lecture 11, U-Net / Mask R-CNN / RoIAlign / MaskFormer / SAM / Keypoints slides.

Video, optical flow and tracking

Video processing: fusion, 3D conv, two-stream, ViViT

Raw video is too large to process whole (~11 GB/min HD), so models train on short low-resolution clips and, at test time, average predictions over overlapping clips. Temporal fusion comes in four flavours (per-frame / late / early / slow via 3D conv); single-frame is a surprisingly strong baseline because action recognition is object-biased. Two-stream nets fuse RGB + stacked optical flow. ViViT (Arnab 2021) tokenises spatio-temporal patches; full attention is $O((THW)^2)$, so the three factorisation strategies (factorised encoder / self-attention / dot-product) split space and time, dropping cost to $O(T^2HW+T(HW)^2)$.

Source: Lecture 12, Windowed Video Processing / Fusion Approaches / Two-Stream / ViViT slides.

Optical flow

Brightness constancy + first-order Taylor + divide-by-$\Delta t$ gives the motion-constraint equation $\nabla I^\top\pmb\mu=-\partial _ t I$ — one equation, two unknowns (the aperture problem), resolved by a smoothness regulariser (Horn–Schunck 1981) or a local patch (Lucas–Kanade). Modern learned flow (FlowNet 2015, RAFT 2020: correlation volume + GRU iterative refinement) is trained on synthetic data because real ground truth is impossible — sim2real works because flow is low-level.

Source: Lecture 13, Optical Flow — The Beginnings / Regularisation slides; Lecture 1 history slides.

Template tracking and Lucas–Kanade

Naive template matching is too slow ($O(\#\text{image}\times\#\text{template})$); LK (1981) reformulates it as iterative Gauss–Newton: linearise the warped residual, solve the normal equations $\pmb M\Delta\pmb p=\pmb b$, update $\pmb p\leftarrow\pmb p+\Delta\pmb p$. It works for any differentiable warp $W(\pmb x,\pmb p)$ (translation needs a $2\times2$ system = the structure tensor — invertible only if there is 2D gradient content, the aperture problem again). Drift (template picks up background) is mitigated by a fixed template / segmentation mask / re-detection (TLD); modern point trackers (PIPs/RAFT/CoTracker) replace pixel with feature matching.

Source: Lecture 13, Recap: Template Tracking / Generalised LK Tracking / LK Tracker Insights slides; Problem Sheet I.
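
A sketch of a single Gauss–Newton step of translation-only Lucas–Kanade on a synthetic image pair. The analytic `pattern`, the sub-pixel shift, and the use of `np.gradient` for derivatives are all assumptions for illustration; a real tracker would iterate and re-warp.

```python
import numpy as np

def pattern(x, y):
    """Smooth synthetic image with gradient content in both directions."""
    return np.sin(0.2 * x) + np.cos(0.15 * y) + np.sin(0.1 * (x + y))

y, x = np.mgrid[0:64, 0:64].astype(float)
true_flow = np.array([0.3, -0.2])                       # (u, v) in pixels
I0 = pattern(x, y)
I1 = pattern(x - true_flow[0], y - true_flow[1])        # content moved by (u, v)

Iy, Ix = np.gradient(I0)                                # axis 0 is y, axis 1 is x
It = I1 - I0                                            # temporal difference
inner = (slice(2, -2), slice(2, -2))                    # skip one-sided border gradients
Ix, Iy, It = Ix[inner].ravel(), Iy[inner].ravel(), It[inner].ravel()

M = np.array([[Ix @ Ix, Ix @ Iy],
              [Ix @ Iy, Iy @ Iy]])                      # 2x2 structure tensor
b = -np.array([Ix @ It, Iy @ It])
flow = np.linalg.solve(M, b)                            # least-squares (u, v)
```

If the patch had gradients in only one direction, `M` would be singular and the solve would fail, which is the aperture problem in matrix form.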

Multiple-view geometry

Camera models and calibration

A pinhole projects $(x,y,z)\mapsto(fx/z,fy/z)$ — perspective, with orthographic / weak-perspective the $f\to\infty$ limit. The full projection is $\pmb x\cong \mathbf K[\mathbf R\mid\pmb t]\mathbf X=\mathbf P\mathbf X$ ($\mathbf K$ intrinsic: focal/principal-point/pixel-scaling; $[\mathbf R\mid\pmb t]$ extrinsic). $\mathbf P$ has 11 DoF, so 6 correspondences give a linear solution via the same stack-and-SVD null-space trick (no factored $\mathbf K[\mathbf R\mid\pmb t]$); non-linear calibration minimises reprojection error and can model radial distortion.

Source: Lecture 14, Perspective Projection / Camera Calibration slides; Problem Sheet IV Q2-Q3.
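
The stack-and-SVD recipe for $\mathbf P$ can be sketched on noise-free synthetic data. The ground-truth camera below is hypothetical, the points are random, and no coordinate normalisation is applied (which a careful implementation on noisy data would add).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ground-truth camera K [R | t], with R = identity for brevity.
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
Rt = np.hstack([np.eye(3), np.array([[0.1], [-0.2], [5.0]])])
P_true = K @ Rt

X = np.hstack([rng.uniform(-1, 1, (8, 3)), np.ones((8, 1))])   # 8 homogeneous 3D points
proj = X @ P_true.T
uv = proj[:, :2] / proj[:, 2:3]                                # pixel coordinates

rows = []
for Xi, (u, v) in zip(X, uv):
    rows.append(np.concatenate([Xi, np.zeros(4), -u * Xi]))    # row1.X - u * row3.X = 0
    rows.append(np.concatenate([np.zeros(4), Xi, -v * Xi]))    # row2.X - v * row3.X = 0
A = np.array(rows)                                             # 16 x 12 stacked system

_, _, Vt = np.linalg.svd(A)
P_est = Vt[-1].reshape(3, 4)                                   # null vector, up to scale
# Fix the free scale (and sign) against P_true so the two can be compared.
P_est = P_est * (P_true.ravel() @ P_est.ravel()) / (P_est.ravel() @ P_est.ravel())

reproj = X @ P_est.T
uv_est = reproj[:, :2] / reproj[:, 2:3]
```

Each correspondence contributes two rows, so six points already give the 12 rows needed to pin down the 11 degrees of freedom.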

Epipolar geometry: essential and fundamental matrices

Coplanarity of $(\pmb x',R\pmb x,\pmb t)$ gives $\pmb x'^\top E\pmb x=0$ with $E=[\pmb t] _ \times R$ (calibrated, rank 2, 5 DoF); dropping known $\mathbf K$ gives $\pmb x'^\top F\pmb x=0$ with $F=\mathbf K'^{-\top}E\mathbf K^{-1}$ (rank 2, 7 DoF). $E\pmb x$ is the epipolar line, $Ee=0$ the epipole. The constraint is necessary not sufficient (reduces a 2D search to the 1D epipolar line). The eight-point algorithm estimates $F$ by — once more — stack-and-SVD, then rank-2 enforcement by truncating the smallest singular value (Eckart–Young).

Source: Lecture 15, Epipolar Geometry / The Essential Matrix / The Fundamental Matrix / Eight Point Algorithm slides; Problem Sheet IV Q1.
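
A sketch that builds $E=[\pmb t] _ \times R$ for a hypothetical relative pose, checks the epipolar constraint on synthetic points, and shows the rank-2 enforcement step on a perturbed matrix. The pose, the points, and the perturbation size are all made-up values.

```python
import numpy as np

def skew(t):
    """Matrix [t]_x such that skew(t) @ v == np.cross(t, v)."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

theta = 0.1                                   # hypothetical rotation about the y axis
R = np.array([[np.cos(theta), 0.0, np.sin(theta)],
              [0.0, 1.0, 0.0],
              [-np.sin(theta), 0.0, np.cos(theta)]])
t = np.array([1.0, 0.2, 0.1])
E = skew(t) @ R

rng = np.random.default_rng(0)
X1 = rng.uniform(-1, 1, (10, 3)) + np.array([0.0, 0.0, 5.0])   # points in camera 1
X2 = X1 @ R.T + t                                              # same points in camera 2
x1 = X1 / X1[:, 2:3]                                           # normalised image coords
x2 = X2 / X2[:, 2:3]
residuals = np.einsum("ni,ij,nj->n", x2, E, x1)                # x'^T E x per match

# Rank-2 enforcement on a perturbed estimate: zero the smallest singular value.
F_noisy = E + 1e-3 * rng.standard_normal((3, 3))
U, S, Vt = np.linalg.svd(F_noisy)
F_rank2 = U @ np.diag([S[0], S[1], 0.0]) @ Vt
```

By Eckart–Young, `F_rank2` is the closest rank-2 matrix to `F_noisy` in Frobenius norm.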

Neural rendering: rendering equation, BRDF and NeRF

The rendering equation $L _ o=L _ e+\int _ \Omega f _ r\,L _ i(\omega _ i\cdot\pmb n)\,d\omega _ i$ with the physically-plausible BRDF (positivity / reciprocity / energy conservation). NeRF (Mildenhall 2020) is an MLP $F _ \Omega:(x,y,z,\theta,\phi)\to(r,g,b,\sigma)$ trained by analysis-by-synthesis: differentiable volume rendering $C=\sum T _ i\alpha _ i c _ i$ ($\alpha _ i=1-e^{-\sigma _ i\Delta t}$, $T _ i=\prod _ {j<i}(1-\alpha _ j)$) against multi-view images. 3D Gaussian Splatting is the real-time alternative; a 2D-diffusion prior + textual inversion gives single-image 3D (with the Janus front-face problem).

Source: Lecture 15, The Rendering Equation / Neural Radiance Fields / Diffusion as a Prior slides; Lecture 1 history slides.
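
The volume-rendering sum for a single ray can be sketched as below. The densities, colours, and sample spacing are made-up values, and `render_ray` is an illustrative name; in NeRF these would come from querying the MLP along the ray.

```python
import numpy as np

def render_ray(sigma, colour, delta):
    """Composite samples along one ray: C = sum_i T_i * alpha_i * c_i."""
    alpha = 1.0 - np.exp(-sigma * delta)
    # T_i = prod_{j<i} (1 - alpha_j): transmittance reaching sample i.
    trans = np.concatenate([[1.0], np.cumprod(1.0 - alpha)[:-1]])
    weights = trans * alpha
    return weights @ colour, weights

sigma = np.array([0.0, 0.5, 50.0, 2.0])        # densities at 4 samples (made-up values)
colour = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0],
                   [0.0, 0.0, 1.0],
                   [1.0, 1.0, 1.0]])
delta = np.full(4, 0.5)                        # spacing between samples
pixel, weights = render_ray(sigma, colour, delta)
```

The empty first sample gets zero weight, and the near-opaque third sample blocks almost everything behind it. Every operation is differentiable in `sigma` and `colour`, which is what allows training by analysis-by-synthesis.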

Generative models

Taxonomy, autoregressive models and VQ-VAE

Generative = learn $p(\pmb x)$ to sample from; the taxonomy splits explicit (tractable: autoregressive; approximate: VAE) vs implicit (direct: GAN, diffusion). Autoregressive factorises $p(\pmb x)=\prod p(x _ i\mid x _ {<i})$ with masked convolutions (PixelCNN mask A/B) to prevent peeking; VQ-VAE quantises to a discrete codebook (straight-through gradient), enabling DALL-E’s token-space autoregression.

Source: Lecture 16, Taxonomy / Autoregressive / VQ-VAE / DALL-E slides; cf. Notes - Machine Learning MT23, Generative models.

Diffusion models and Stable Diffusion

The forward process is a fixed Gaussian noising chain; only the reverse denoiser is learned (loss $\lVert f(x _ t,t)-\epsilon\rVert^2$). The closed form $x _ t=\sqrt{\bar\alpha _ t}x _ 0+\sqrt{1-\bar\alpha _ t}\,\epsilon$ with $\bar\alpha _ t=\prod\alpha _ i$ enables one-step noising; the DDPM sampling step iterates $T\approx1000$ times (predicting noise beats predicting $x _ 0$). Latent diffusion (Stable Diffusion) runs the chain in a VQ-VAE latent (~$8\times$ compression), conditions every step via cross-attention to the prompt, and injects $t$ via a Fourier-basis embedding. Evaluation uses FID = Fréchet distance between Inception-feature Gaussians.

Source: Lecture 16, Diffusion Models / Latent Diffusion / Stable Diffusion / FID slides.
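
A sketch of one-step noising with the closed form. The linear $\beta$ schedule from $10^{-4}$ to $0.02$ is an assumed choice, the "image" is random data, and `q_sample` is an illustrative name; no denoiser is trained here.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)             # linear schedule, an assumed choice
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)                 # running product of the alphas

def q_sample(x0, t, eps):
    """Jump straight to step t: x_t = sqrt(abar_t) x0 + sqrt(1 - abar_t) eps."""
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

rng = np.random.default_rng(0)
x0 = rng.uniform(-1, 1, size=(8, 8))           # stand-in for an image
eps = rng.standard_normal(x0.shape)

x_early = q_sample(x0, 10, eps)                # mostly signal
x_late = q_sample(x0, T - 1, eps)              # almost pure noise

# Training target: a denoiser f(x_t, t) would be fitted to eps with a squared error.
```

Because any step can be sampled directly, training draws a random `t` per example instead of simulating the whole chain.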

Representation and unsupervised learning

Representation-learning losses

Learn a general-purpose embedding, not a task output. Cosine-similarity loss forces absolute targets (collapses without negatives); triplet loss enforces the weaker relative margin $\mathcal S(\phi,\phi^+)>\mathcal S(\phi,\phi^-)+\epsilon$; the contrastive (InfoNCE) loss is exactly softmax cross-entropy with similarities as logits and the other batch elements as implicit negatives. CLIP (400M image–text pairs) trains a dual encoder with symmetric contrastive loss; SigLIP swaps the softmax for a per-pair sigmoid for multi-GPU scaling.

Source: Lecture 17, Representation Learning Losses / Contrastive Loss / CLIP / SigLIP slides.
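
The "cross-entropy with similarities as logits" reading can be made concrete. The sketch below is one direction only (a symmetric version would average both directions), with an assumed temperature of 0.1 and random embeddings; `info_nce` is an illustrative name.

```python
import numpy as np

def info_nce(za, zb, tau=0.1):
    """Cross-entropy over cosine-similarity logits; row i's positive is column i."""
    za = za / np.linalg.norm(za, axis=1, keepdims=True)
    zb = zb / np.linalg.norm(zb, axis=1, keepdims=True)
    logits = za @ zb.T / tau                               # (batch, batch) similarities
    logits = logits - logits.max(axis=1, keepdims=True)    # numerical stability
    log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_softmax)))           # target class = matching index

rng = np.random.default_rng(0)
za = rng.standard_normal((6, 16))
aligned = info_nce(za, za + 0.01 * rng.standard_normal((6, 16)))   # true pairs match
shuffled = info_nce(za, np.roll(za, 1, axis=0))                    # pairs misaligned
```

The off-diagonal entries of `logits` are the implicit negatives supplied by the rest of the batch.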

Unsupervised learning: pretext tasks, SimCLR, DINO

The seven learning-signal techniques (recovery / bottleneck / dataset / invariance / equivariance / transformation-estimation / generative) organise self-supervision. Pretext tasks: rotation prediction (Gidaris 2018, with the global-average-pool collapse trap), jigsaw, context prediction, inpainting, colourisation. SimCLR maximises agreement between two augmented views via NT-Xent with a non-linear projection head (absorbs the invariance pressure so $h$ stays rich). DINO is self-distillation with a teacher EMA, preventing collapse by centring + sharpening. Unsupervised classification needs Hungarian matching (cluster→label, $O(N^3)$); SCAN and self-labelling-by-clustering (optimal transport, equal-size constraint) are the named methods.

Source: Lecture 18, Learning Signals / Transformation Estimation / SimCLR / DINO / SCAN / Self-labelling slides.

Vision and language

Tokenisation, LMs and alignment

Sub-word BPE tokenisation = the optimisation “assign strings to a budget of $N$ tokens minimising total token count”. Alignment: RLHF three steps (SFT → reward model from rankings → PPO with a KL penalty) and the simpler DPO (a single supervised loss with its $\sigma(\hat r _ l-\hat r _ w)$-weighted gradient, no reward model, no RL). Both vulnerable to jailbreaks (grandma role-play; visual adversarial perturbations).

Source: Lecture 19, Words to Tokens / ChatGPT Training / DPO / Jailbreaks slides.

Grounding, VQA and vision-language models

Referring expressions / VQA / visual grounding tasks. Pre-transformer MAttNet decomposes into subject/location/relationship modules; MDETR is image–text transformer + bipartite matching; UNITER pretrains on MLM + MRM + WRA/ITM; GLIP does phrase grounding via word–region alignment. CLIP’s three-step zero-shot (contrastive pretrain → “A photo of a {class}” text classifier → similarity argmax) matches supervised ResNet-50 on ImageNet without ImageNet. Flamingo = frozen vision encoder + Perceiver Resampler + GATED XATTN-DENSE into a frozen LM; Tsimpoukelli’s frozen-LM few-shot maps images into the LM’s text-embedding space.

Source: Lecture 19, Referring Expressions / VQA / UNITER / CLIP / Flamingo slides.

Ethics, bias and privacy

Formalising fairness

Two fairness criteria: independence ($R\perp A$) and separation ($R\perp A\mid Y$, = error-rate parity). The impossibility result: for binary $Y$ with $A,R$ both dependent on $Y$, independence and separation cannot both hold — “there is no single mathematical definition of fairness, and you cannot satisfy all of them” (Narayanan’s 21 definitions). “No fairness through unawareness”: dropping a sensitive feature fails because proxies correlate.

Source: Lecture 20, Formalizing Fairness slides.

Harms, case studies and dataset accountability

Allocative (unfair resource allocation) vs representational harms (denigration / stereotype / recognition / under-representation / ex-nomination). Case studies with the actual numbers: COMPAS (Black FPR ~44.9% vs White ~23.5%, even though race is not an input), Gender Shades (dark-female error ~20-35% vs light-male ~0-1%), bias amplification (Zhao 2017, CNN predictions more skewed than training data). Accountability: Datasheets for Datasets, Model Cards (the CLIP “any deployed use is out of scope” example), CelebA subjective-attribute critique, PASS (1.4M human-free images), red-circle visual prompt engineering.

Source: Lecture 20, Allocative/Representational Harms / COMPAS / Gender Shades / Datasheets / Model Cards slides.

Methods

Motion and geometry

Optical flow (motion constraint + smoothness, or RAFT) and Lucas–Kanade Gauss–Newton tracking; camera calibration / fundamental-matrix estimation by the universal stack-and-SVD-then-rank-2 recipe; NeRF analysis-by-synthesis via differentiable volume rendering.

Source: Lectures 13-15.