Computer Vision MT25, Loss function design
Flashcards
@Define the smooth $L _ 1$ loss.
The loss for detectors in the R-CNN family (e.g. Fast R-CNN) looks something like the following:
\[L(y, \hat P, b, \hat b) = -\log \hat P(y) + \lambda \mathbb I[y \ge 1] L _ \text{reg}(b, \hat b)\]
where:
- $y$ is the ground truth class of an object in ground truth bounding box $b$, with $y = 0$ denoting background
- $\hat P$ are the predicted class probabilities and $\hat b$ is the predicted box
- $L _ \text{reg}$ is a loss for the bounding box regression
In what sense is this a multi-task loss?
You are combining two separate loss functions, one for the class prediction task and one for the bounding box regression task.
Bite-sized
The $L _ 1$ regression loss is $\mathcal L _ 1(\pmb y, \pmb y^\ast) = \tfrac{1}{n}\sum _ i \vert y _ i - y _ i^\ast \vert $. The $L _ 2$ regression loss is $\mathcal L _ 2(\pmb y, \pmb y^\ast) = \tfrac{1}{n}\sum _ i (y _ i - y _ i^\ast)^2$. Key difference: the gradient of $L _ 1$ with respect to each residual has constant magnitude ($\pm 1$, up to the $\tfrac{1}{n}$ factor) everywhere except at 0, so an outlier pulls no harder than any other point, which makes $L _ 1$ robust to outliers; the gradient of $L _ 2$ is proportional to the residual, so individual outliers can dominate the loss.
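A minimal sketch of the two losses above in plain Python, to make the outlier behaviour concrete. The function names and the toy residuals are illustrative assumptions, not part of the course material.

```python
def l1_loss(y, y_star):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(a - b) for a, b in zip(y, y_star)) / len(y)


def l2_loss(y, y_star):
    """Mean squared error between two equal-length sequences."""
    return sum((a - b) ** 2 for a, b in zip(y, y_star)) / len(y)


# Toy data (assumed for illustration): four predictions against zero targets.
targets = [0.0, 0.0, 0.0, 0.0]
clean = [0.5, -0.5, 0.5, -0.5]    # every residual has magnitude 0.5
outlier = [0.5, -0.5, 0.5, 10.0]  # one residual replaced by an outlier

l1_clean, l1_out = l1_loss(clean, targets), l1_loss(outlier, targets)
l2_clean, l2_out = l2_loss(clean, targets), l2_loss(outlier, targets)

# L1 goes from 0.5 to 2.875 (5.75x); L2 goes from 0.25 to 25.1875 (100.75x).
print(l1_clean, l1_out, l2_clean, l2_out)
```

On this toy data a single outlier inflates $L _ 2$ far more than $L _ 1$, which is the sense in which outliers can dominate the squared loss.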
@Justify the design of the smooth $L _ 1$ loss
\[\mathrm{smooth} _ {L _ 1}(x) = \begin{cases} 0.5 x^2 & \vert x \vert < 1 \\ \vert x \vert - 0.5 & \text{otherwise} \end{cases}\]
in terms of combining the best behaviours of $L _ 1$ and $L _ 2$.
- For small residuals ($ \vert x \vert < 1$): behaves quadratically like $L _ 2$ — gradient $\to 0$ as $x \to 0$, giving smooth convergence to a minimum and avoiding the gradient discontinuity of $L _ 1$ at zero.
- For large residuals ($ \vert x \vert \ge 1$): behaves linearly like $L _ 1$ — gradient is bounded at $\pm 1$, so outliers cannot dominate training.
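The constant offset $-0.5$ ensures the function is continuous at the changeover point $ \vert x \vert = 1$: both branches evaluate to $0.5$ there. The gradient is continuous there too, since the quadratic branch has derivative $x = \pm 1$, matching the slope of the linear branch.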
This is the default bounding-box regression loss in Fast R-CNN and remains widely used in later object detection architectures.
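The piecewise definition above can be sketched directly for a scalar residual. This is a minimal illustration with the changeover fixed at $ \vert x \vert = 1$ as in the formula; the function names are assumptions.

```python
def smooth_l1(x):
    """Smooth L1 loss of a scalar residual x, with changeover at |x| = 1."""
    if abs(x) < 1.0:
        return 0.5 * x * x   # quadratic zone, behaves like L2
    return abs(x) - 0.5      # linear zone, behaves like L1


def smooth_l1_grad(x):
    """Derivative of smooth_l1: x in the quadratic zone, sign(x) outside it."""
    if abs(x) < 1.0:
        return x
    return 1.0 if x > 0 else -1.0


# Both branches agree at the changeover point, and the gradient is bounded.
print(smooth_l1(1.0), smooth_l1(0.5), smooth_l1(-3.0))
print(smooth_l1_grad(0.25), smooth_l1_grad(10.0), smooth_l1_grad(-10.0))
```

The gradient shrinks towards zero for small residuals and saturates at $\pm 1$ for large ones, matching the two bullet points above.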
@Describe the typical structure of a multi-task loss in computer vision, and why a balancing hyperparameter $\lambda$ is needed.
A multi-task loss is the weighted sum of two or more task-specific losses, applied to a model that outputs predictions for each task. The canonical form is
\[\mathcal L = \mathcal L _ \text{task A} + \lambda \cdot \mathcal L _ \text{task B}.\]
Example (R-CNN family): $\mathcal L = \mathcal L _ \text{cls}(\hat y, y) + \lambda \mathcal L _ \text{reg}(\hat b, b)$, combining a classification cross-entropy and a bounding-box regression smooth-$L _ 1$.
Why $\lambda$ matters:
- The two losses are typically on different scales (e.g. cross-entropy is roughly $\log K$ for an untrained $K$-class classifier, while smooth-$L _ 1$ may only be a fraction of a unit).
- Their gradients should make balanced contributions during training; otherwise one task dominates and the other is essentially ignored.
- $\lambda$ is a hyperparameter set per task / dataset, often by grid search, with each candidate value evaluated on a validation set.
Optional refinement: gate the regression loss with an indicator $\mathbb I[y \ge 1]$ (taking class 0 to be background) so it only applies when the object is foreground, since background has no bounding box to regress.
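A minimal sketch of this gated multi-task loss for a single example. It assumes that class 0 is background, that the regression loss sums smooth-$L _ 1$ over the box coordinates, and that $\lambda = 1$ by default; all function names are hypothetical.

```python
import math


def cross_entropy(probs, y):
    """Negative log-probability assigned to the true class y."""
    return -math.log(probs[y])


def smooth_l1(x):
    """Smooth L1 of a scalar residual, changeover at |x| = 1."""
    return 0.5 * x * x if abs(x) < 1.0 else abs(x) - 0.5


def box_regression_loss(b, b_hat):
    """Assumption: sum of smooth L1 over the box coordinates."""
    return sum(smooth_l1(p - t) for t, p in zip(b, b_hat))


def multi_task_loss(y, probs, b, b_hat, lam=1.0):
    """Classification loss plus lam times regression loss, gated on foreground.

    Assumes class 0 is background, so the indicator is y >= 1.
    """
    loss = cross_entropy(probs, y)
    if y >= 1:  # background examples contribute no box loss
        loss += lam * box_regression_loss(b, b_hat)
    return loss


# Foreground example: box residuals are 0.5, 0, 0, 2 so the box loss is 1.625.
fg = multi_task_loss(1, [0.2, 0.8], [0.0, 0.0, 1.0, 1.0], [0.5, 0.0, 1.0, 3.0])
# Background example: only the classification term is counted.
bg = multi_task_loss(0, [0.5, 0.5], None, None)
print(fg, bg)
```

Changing `lam` rescales only the regression term, which is how the relative contribution of the two tasks is tuned.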
@Justify why the SVM hinge loss $\max(0, 1 - y f(\pmb x))$ is a good surrogate for the 0-1 classification loss.
- Convex upper bound: the hinge loss is convex and lies on or above the 0-1 loss everywhere, so it is a tractable surrogate for the (non-convex, discontinuous) 0-1 loss, and driving it down also drives down an upper bound on the number of misclassifications.
- Zero gradient when classified correctly with margin: for $y f(\pmb x) > 1$, the loss is zero and the gradient is zero, so points classified correctly with margin don’t influence training. This induces sparsity: only points on or inside the margin, including misclassified ones (the support vectors), affect the decision boundary.
- Non-zero, constant gradient when misclassified or within margin ($y f(\pmb x) < 1$): for a linear model $f(\pmb x) = \pmb w^\top \pmb x + b$, the gradient with respect to $\pmb w$ is $-y \pmb x$, so a gradient descent step increases $y f(\pmb x)$, pushing the decision boundary away from the violating point.
- Single discontinuity in derivative at $y f(\pmb x) = 1$: not differentiable there, but subgradient methods handle this fine.
These properties make the hinge loss the canonical SVM surrogate.
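A minimal sketch of the hinge loss and one valid subgradient, assuming a linear model $f(\pmb x) = \pmb w^\top \pmb x + b$ and labels $y \in \{-1, +1\}$. The function names are hypothetical, and choosing the zero subgradient at the kink is one valid choice among several.

```python
def score(w, b, x):
    """Linear score f(x) = w . x + b (assumed model)."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def hinge_loss(w, b, x, y):
    """Hinge loss max(0, 1 - y f(x)) for a label y in {-1, +1}."""
    return max(0.0, 1.0 - y * score(w, b, x))


def hinge_subgradient(w, b, x, y):
    """A subgradient of the hinge loss with respect to (w, b).

    Picks the zero subgradient at the kink y f(x) = 1.
    """
    if y * score(w, b, x) >= 1.0:
        return [0.0] * len(w), 0.0          # outside the margin: no update
    return [-y * xi for xi in x], -float(y)  # violating point: constant gradient


w, b = [1.0, 0.0], 0.0
safe = ([2.0, 0.0], 1)       # y f(x) = 2: correct with margin
inside = ([0.5, 1.0], 1)     # y f(x) = 0.5: correct but inside the margin
wrong = ([-1.0, 0.0], 1)     # y f(x) = -1: misclassified

for x, y in (safe, inside, wrong):
    print(hinge_loss(w, b, x, y), hinge_subgradient(w, b, x, y))
```

Only the second and third points produce a non-zero subgradient, which illustrates the sparsity property: points beyond the margin do not move the boundary.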
Cross-entropy + softmax is the standard classification loss. For a one-hot ground truth at class $C _ {GT}$, it simplifies to $-\log p(C _ {GT} \mid \pmb x) = -f _ {GT}(\pmb x) + \log \sum _ j \exp f _ j(\pmb x)$. The second term is the log-sum-exp of the logits (the log of the softmax normaliser). Computed naively, it overflows if individual logits are large; this is usually handled by subtracting $m = \max _ j f _ j(\pmb x)$ inside the exponent and adding it back outside the log: $\log \sum _ j \exp f _ j(\pmb x) = m + \log \sum _ j \exp(f _ j(\pmb x) - m)$.
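A minimal sketch of the max-subtraction trick in plain Python. The function names are hypothetical, and the logits of 1000 are chosen only to force an overflow in the naive computation.

```python
import math


def log_sum_exp(logits):
    """log(sum_j exp(f_j)), computed by subtracting the max logit first."""
    m = max(logits)
    return m + math.log(sum(math.exp(f - m) for f in logits))


def softmax_cross_entropy(logits, gt):
    """-log softmax(logits)[gt] = -f_gt + log-sum-exp(logits)."""
    return -logits[gt] + log_sum_exp(logits)


big = [1000.0, 1000.0]

# The naive version overflows: math.exp raises OverflowError for large inputs.
try:
    math.log(sum(math.exp(f) for f in big))
    naive_overflows = False
except OverflowError:
    naive_overflows = True

# The stable version gives log 2 for two equal logits, however large they are.
stable = softmax_cross_entropy(big, 0)
print(naive_overflows, stable)
```

After subtracting the maximum, every exponent is at most zero, so no term can overflow, and the largest term is exactly $\exp(0) = 1$, so the sum cannot underflow to zero.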