# Notes - Computer Vision MT25, Loss function design

> Source: https://ollybritton.com/notes/uni/part-c/mt25/computer-vision/notes/loss-function-design/ · Updated: 2025-11-11 · Tags: uni, notes


### Flashcards
@Define the smooth $L_1$ loss.::

$$
\text{smooth}_{L_1}(x) = \begin{cases}
0.5x^2 &\text{if }|x| < 1 \\
|x| - 0.5 &\text{otherwise}
\end{cases}
$$

The multi-task loss used in the R-CNN family (introduced with Fast R-CNN) looks something like the following:
$$
L(y, \hat P, b, \hat b) = -\log \hat P(y) + \lambda \mathbb I[y \ge 1] L_\text{reg}(b, \hat b)
$$
where:

- $y$ is the ground truth class of an object in ground truth bounding box $b$, with $y = 0$ reserved for background
- $\hat P$ are the predicted class probabilities and $\hat b$ is the predicted box
- $L_\text{reg}$ is a loss for the bounding box regression

In what sense is this a multi-task loss?::

You are combining two separate loss functions, one for the class prediction task and one for the bounding box regression task.

### Bite-sized

The $L_1$ regression loss is $\mathcal L_1(\pmb y, \pmb y^\ast) = \tfrac{1}{n}\sum_i |y_i - y_i^\ast|$. The $L_2$ regression loss is $\mathcal L_2(\pmb y, \pmb y^\ast) = \tfrac{1}{n}\sum_i (y_i - y_i^\ast)^2$. Key difference: the gradient of $L_1$ with respect to each residual has constant magnitude (the sign of the residual, scaled by $\tfrac{1}{n}$) everywhere except at 0, where it is undefined, making it *robust to outliers*; the gradient of $L_2$ is proportional to the residual, so individual outliers can dominate both the loss and its gradient.
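
To make the outlier behaviour concrete, here is a minimal sketch in plain Python. The function names and the toy data (three small residuals plus one outlier of size 10) are assumptions chosen for illustration, not something from the lecture.

```python
def l1_loss(pred, target):
    """Mean absolute error over paired sequences."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def l2_loss(pred, target):
    """Mean squared error over paired sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def l1_grad(pred, target):
    """Per-component (sub)gradient of l1_loss w.r.t. pred; 0 is used at a zero residual."""
    n = len(pred)
    return [((p > t) - (p < t)) / n for p, t in zip(pred, target)]


def l2_grad(pred, target):
    """Per-component gradient of l2_loss w.r.t. pred."""
    n = len(pred)
    return [2 * (p - t) / n for p, t in zip(pred, target)]


# Hypothetical data: three small residuals and one outlier (residual 10).
target = [0.0, 0.0, 0.0, 0.0]
pred = [0.5, -0.5, 0.5, 10.0]

l1_value = l1_loss(pred, target)  # 2.875
l2_value = l2_loss(pred, target)  # 25.1875
l1_g = l1_grad(pred, target)      # [0.25, -0.25, 0.25, 0.25]
l2_g = l2_grad(pred, target)      # [0.25, -0.25, 0.25, 5.0]
```

Under $L_1$ the outlier's gradient component has the same magnitude as every other component, whereas under $L_2$ it is 20 times larger than the rest.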

**Source**: Lecture 7, **Loss Function -- Types** slide.

@bite~

@Justify the design of the smooth $L_1$ loss
$$\mathrm{smooth}_{L_1}(x) = \begin{cases} 0.5 x^2 & |x| < 1 \\ |x| - 0.5 & \text{otherwise} \end{cases}$$
in terms of combining the best behaviours of $L_1$ and $L_2$.

::

- For *small residuals* ($|x| < 1$): behaves quadratically like $L_2$ — gradient $\to 0$ as $x \to 0$, giving smooth convergence to a minimum and avoiding the gradient discontinuity of $L_1$ at zero.
- For *large residuals* ($|x| \ge 1$): behaves linearly like $L_1$ — gradient is bounded at $\pm 1$, so outliers cannot dominate training.

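The constant offset $-0.5$ ensures the function is continuous at the changeover point $|x| = 1$: both branches evaluate to $0.5$ there. The derivatives of the two branches also agree at that point (both equal $\pm 1$), so the gradient has no jump either.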

This is the bounding-box regression loss used in Fast R-CNN, and it is reused by later detectors such as Faster R-CNN.
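
The piecewise definition and its derivative can be sketched in a few lines of Python. This is a scalar illustration only; the function names are assumptions, and real detection code would apply the same rule elementwise to tensors of box offsets.

```python
def smooth_l1(x):
    """Smooth L1 loss of a scalar residual x, with the changeover at |x| = 1."""
    if abs(x) < 1:
        return 0.5 * x * x
    return abs(x) - 0.5


def smooth_l1_grad(x):
    """Derivative of smooth_l1: x in the quadratic region, sign(x) in the linear region."""
    if abs(x) < 1:
        return x
    return 1.0 if x > 0 else -1.0


# Quadratic branch for a small residual, linear branch for a large one.
small = smooth_l1(0.5)    # 0.125
at_knee = smooth_l1(1.0)  # 0.5
large = smooth_l1(-3.0)   # 2.5
```

The gradient shrinks with the residual inside $|x| < 1$ and stays at $\pm 1$ outside it, which is the combination of behaviours described above.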

**Source**: Lecture 10, **Multi-task Loss** slide; Girshick, ICCV 2015.

@bite~

@Describe the typical structure of a *multi-task loss* in computer vision, and why a balancing hyperparameter $\lambda$ is needed.

::

A multi-task loss is the weighted sum of two or more task-specific losses, applied to a model that outputs predictions for each task. The canonical form is

$$\mathcal L = \mathcal L_\text{task A} + \lambda \cdot \mathcal L_\text{task B}.$$

Example (R-CNN family): $\mathcal L = \mathcal L_\text{cls}(\hat y, y) + \lambda \mathcal L_\text{reg}(\hat b, b)$, combining a classification cross-entropy and a bounding-box regression smooth-$L_1$.

Why $\lambda$ matters:

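- The two losses are typically on different scales: cross-entropy is measured in nats (about $\log K$ for an untrained $K$-class classifier), whereas the size of the smooth-$L_1$ term depends on how the box targets are parameterised and normalised.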
- Their gradients should make balanced contributions during training; otherwise one task dominates and the other is essentially ignored.
- $\lambda$ is a hyperparameter set per task / dataset, often by grid search over the validation set.

Optional refinement: gate the regression loss with an indicator $\mathbb I[y \ge 1]$ so it only applies when the object is foreground (background, class $y = 0$, has no ground truth bounding box to regress to).
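
A minimal sketch of such a gated multi-task loss for a single region is given below. It assumes class 0 is background, that the box loss is a sum of smooth-$L_1$ terms over the box coordinates, and that probabilities and boxes arrive as plain Python lists; the names `multi_task_loss` and `lam` are hypothetical.

```python
import math


def smooth_l1(x):
    """Smooth L1 loss of a scalar residual x."""
    if abs(x) < 1:
        return 0.5 * x * x
    return abs(x) - 0.5


def multi_task_loss(class_probs, true_class, pred_box, true_box, lam=1.0):
    """Classification NLL plus lam * box regression, gated to foreground (true_class >= 1)."""
    cls_loss = -math.log(class_probs[true_class])
    if true_class >= 1:
        # Assumption: the box loss sums smooth L1 over the box coordinates.
        reg_loss = sum(smooth_l1(p - t) for p, t in zip(pred_box, true_box))
    else:
        reg_loss = 0.0  # background: there is no ground truth box to regress to
    return cls_loss + lam * reg_loss


# Hypothetical example: three classes (0 = background) and a four-number box.
probs = [0.5, 0.25, 0.25]
foreground = multi_task_loss(probs, 1, [0.5, 0.0, 0.0, 2.0], [0.0, 0.0, 0.0, 0.0], lam=2.0)
background = multi_task_loss(probs, 0, [0.5, 0.0, 0.0, 2.0], [0.0, 0.0, 0.0, 0.0], lam=2.0)
```

For the background region the box prediction has no effect on the loss, so only the classification term produces gradients.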

**Source**: Lecture 10, **Multi-task Loss** slide.

@bite~

@Justify why the SVM *hinge loss* $\max(0, 1 - y f(\pmb x))$ is a good surrogate for the 0-1 classification loss.

::

- *Convex upper bound*: hinge is a convex function that lies above the 0-1 loss everywhere, so minimising hinge is a convex surrogate for the (non-convex, non-differentiable) 0-1 loss.
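- *Zero gradient when classified correctly with margin*: for $y f(\pmb x) > 1$, the loss and its gradient are both zero, so points classified correctly with margin don't influence training. This induces *sparsity*: only points that lie on or inside the margin, or are misclassified (the support vectors), affect the decision boundary.
- *Non-zero, constant gradient when misclassified or within margin*: for a linear model $f(\pmb x) = \pmb w^\top \pmb x + b$, the gradient with respect to $\pmb w$ is $-y \pmb x$ regardless of how large the violation is, so a descent step moves $\pmb w$ along $+y \pmb x$ and increases the margin $y f(\pmb x)$ of the violating point.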
- *Single discontinuity in derivative at $y f(\pmb x) = 1$*: not differentiable there, but subgradient methods handle this fine.

These properties make the hinge loss the canonical SVM surrogate.
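
The properties above can be checked with a small sketch. It assumes a linear score $f(\pmb x) = \pmb w^\top \pmb x + b$ and labels $y \in \{-1, +1\}$; the function names and the example weights are illustrative choices, not from the lecture.

```python
def hinge_loss(w, b, x, y):
    """Hinge loss max(0, 1 - y * f(x)) for a linear score f(x) = w . x + b, y in {-1, +1}."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return max(0.0, 1.0 - y * score)


def hinge_subgradient_w(w, b, x, y):
    """A subgradient w.r.t. w: -y * x if the margin is violated, else the zero vector."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    if y * score < 1:
        return [-y * xi for xi in x]
    return [0.0 for _ in x]


# Hypothetical linear classifier and three labelled points.
w, b = [1.0, 0.0], 0.0
outside_margin = ([2.0, 0.0], 1)   # y * f(x) = 2: correct with margin
inside_margin = ([0.5, 1.0], 1)    # y * f(x) = 0.5: correct but within the margin
misclassified = ([2.0, 1.0], -1)   # y * f(x) = -2: wrong side

losses = [hinge_loss(w, b, x, y) for x, y in (outside_margin, inside_margin, misclassified)]
# losses == [0.0, 0.5, 3.0]
```

Only the second and third points return a non-zero subgradient, which is the sparsity property: the first point contributes nothing to an update of $\pmb w$.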

**Source**: Lecture 6, **Training SVMs** slide.

@bite~

Cross-entropy + softmax is the standard classification loss. For a one-hot ground truth at class $C_{GT}$, it simplifies to $-\log p(C_{GT} \mid \pmb x) = -f_{GT}(\pmb x) + \log \sum_j \exp f_j(\pmb x)$. The second term is the *log-sum-exp* of the logits, i.e. the log of the softmax normaliser. Evaluated naively it can overflow if individual logits are large; this is usually handled by subtracting $m = \max_j f_j(\pmb x)$ inside the exponent and adding it back outside: $\log \sum_j \exp f_j(\pmb x) = m + \log \sum_j \exp(f_j(\pmb x) - m)$.
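
The max-subtraction trick can be sketched as follows, using only the standard `math` module. The function names are assumptions made for illustration; libraries typically ship their own fused versions of this computation.

```python
import math


def log_sum_exp(logits):
    """log(sum_j exp(logits[j])), computed by factoring out the largest logit."""
    m = max(logits)
    # Every exponent is <= 0 after the shift, so exp() cannot overflow.
    return m + math.log(sum(math.exp(f - m) for f in logits))


def cross_entropy_from_logits(logits, true_class):
    """-log softmax(logits)[true_class], written as -f_GT + logsumexp(f)."""
    return -logits[true_class] + log_sum_exp(logits)


# Logits this large make a direct math.exp(1000.0) raise OverflowError,
# but the shifted form evaluates without trouble.
loss_large = cross_entropy_from_logits([1000.0, 1000.0], 0)
loss_uniform = cross_entropy_from_logits([0.0, 0.0, 0.0], 2)
```

Both calls return the log of the number of classes (about $\log 2$ and $\log 3$ respectively), since equal logits give a uniform softmax.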

**Source**: Lecture 6, **Cross-Entropy Loss -- Simplified** slide.

@bite~ @exam~

---
Olly Britton — https://ollybritton.com. Machine-readable index: https://ollybritton.com/llms.txt