Computer Vision MT25, Interpreting vision models
Flashcards
@State and @define the three aspects that need to be considered for the explanation of a deep learning model.
- Recipient: Explanations need to adapt to the recipient of the information.
- Content: Explanations provide different types of information.
- Purpose: Explanations differ based on use-cases.
@State and @define three different approaches to explainable models.
- Post-hoc analysis: Explanations are derived from a fixed, pre-trained model via analysis.
- Transparent models: The model is specifically constructed such that some mechanism has semantic meaning.
- Learned explanations: The model is trained to deliver explanations together with predictions.
What are the pros and cons of the post-hoc analysis approach to explaining models?
- Pro: There is no impact on performance.
- Con: Typically very difficult.
- Con: Explanations are often local around predictions.
What are the pros and cons of the “transparent models” approach to explainable AI?
- Pro: Does not require difficult post-hoc analysis.
- Con: Requires a task-specific architecture.
- Con: Can affect performance, since there is a trade-off between explainability and capability.
What are the pros and cons of the “learned explanations” approach to explainable models?
- Pro: Explanations can be very semantic.
- Con: Might need meta-explanations; you are always unsure if the explanation is a true explanation.
- Con: Can affect performance.
The last layer of ResNet18 has dimension $512 \times 1000$, corresponding to $1000$ classes each of dimension $512$. What would be one way to @visualise these weights?
Use PCA to reduce to $2 \times 1000$ and plot.
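A minimal sketch of this projection, assuming NumPy and using a random matrix as a stand-in for the real ResNet18 weights (the helper name `pca_project` is hypothetical). Each class is treated as one 512-dimensional point, so the result is stored as $1000 \times 2$, the transpose of the $2 \times 1000$ layout above.

```python
import numpy as np

def pca_project(rows, n_components=2):
    """Project each row of `rows` onto its top principal components."""
    centred = rows - rows.mean(axis=0, keepdims=True)
    # Rows of vt are the principal directions, sorted by decreasing variance.
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    return centred @ vt[:n_components].T

# Stand-in for the real classifier weights: 512 features x 1000 classes.
rng = np.random.default_rng(0)
weights = rng.normal(size=(512, 1000))

# Treat each class as one 512-dimensional point, so transpose first.
coords = pca_project(weights.T)  # shape (1000, 2): one 2-D point per class
```

The two columns of `coords` can then be scatter-plotted and annotated with class names. The same helper applies to a matrix of activations, with one row per input.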

One way to understand a classification model is to visualise its learned weights e.g. with PCA. How could you instead visualise what the model does with its inputs?
Compute the activations, use PCA, then visualise the embedding with class labels.

What is t-SNE?
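t-distributed stochastic neighbour embedding: a non-linear dimensionality-reduction technique that, like PCA, is used to plot high-dimensional data in 2-D or 3-D, although its axes are less interpretable than those of PCA.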
@State two sanity checks for interpretability techniques.
- See if it makes a randomly initialised network appear interpretable.
- Train another model on the same data but random labels, and see if this model also seems interpretable.
What is the technique of input reconstruction for interpretability?
Search for inputs that maximise class probabilities.
What is the occlusion method for interpreting vision models?
A black-box interpretability method where you occlude a part of the image and measure the change in response; the bigger the change, the more important the region.
Why might it not be a good idea to occlude patches of an image by replacing them with a black square?
If the image is already black there, occluding the region changes nothing, so the region appears unimportant however much the model relies on it.
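A minimal sketch of occlusion sensitivity, assuming NumPy and a greyscale image. The `toy_model` below is a hypothetical stand-in for a real classifier score, and filling patches with the image mean rather than black is one possible choice, not necessarily the one used in the lecture.

```python
import numpy as np

def occlusion_map(model, image, patch=4, stride=4, fill=None):
    """Score drop when each patch is occluded; a larger drop means more important."""
    h, w = image.shape[:2]
    # Assumption: occlude with the image mean unless a fill value is given.
    fill = image.mean() if fill is None else fill
    baseline = model(image)
    rows = (h - patch) // stride + 1
    cols = (w - patch) // stride + 1
    heat = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            occluded = image.copy()
            top, left = i * stride, j * stride
            occluded[top:top + patch, left:left + patch] = fill
            heat[i, j] = baseline - model(occluded)  # one forward pass per patch
    return heat

# Hypothetical toy "model": responds only to brightness in the top-left 4x4 corner.
def toy_model(img):
    return float(img[:4, :4].sum())

rng = np.random.default_rng(0)
image = rng.uniform(0.5, 1.0, size=(16, 16))
image[:4, :4] = 1.0  # make the region the toy model looks at fully bright

heat = occlusion_map(toy_model, image)
```

Only the top-left entry of `heat` is non-zero here, because that is the only region the toy model reads. The loop also shows why the method is slow: it needs one forward pass per patch position.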
What is the gradient method for interpreting vision models? @Visualise an example.
Plot the gradient magnitude $ \vert \nabla _ x f(x) \vert _ 1$, i.e. see in which direction the input needs to change in order to affect the output the most.
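A minimal sketch of a gradient saliency map, assuming NumPy and a hypothetical two-layer toy network in place of a real vision model. The gradient is written out by hand here; with a real network one would normally obtain it from the framework's automatic differentiation.

```python
import numpy as np

rng = np.random.default_rng(0)
H, W, C, HIDDEN = 8, 8, 3, 16
# Hypothetical toy network standing in for a real classifier f.
W1 = rng.normal(scale=0.1, size=(HIDDEN, H * W * C))
w2 = rng.normal(size=HIDDEN)

def f(x):
    """Scalar class score: w2 . tanh(W1 x)."""
    return float(w2 @ np.tanh(W1 @ x.ravel()))

def grad_f(x):
    """Analytic gradient of f with respect to the input image."""
    hidden = np.tanh(W1 @ x.ravel())
    # Chain rule: d tanh(u)/du = 1 - tanh(u)^2.
    return (W1.T @ (w2 * (1.0 - hidden ** 2))).reshape(x.shape)

def gradient_saliency(x):
    """L1 norm of the gradient over colour channels: one value per pixel."""
    return np.abs(grad_f(x)).sum(axis=-1)

x = rng.uniform(size=(H, W, C))
saliency = gradient_saliency(x)  # shape (8, 8), ready to plot as a heatmap
```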

What is the ROAR method for benchmarking attribution (also called saliency) methods?
- RemOve And Retrain
- Run your attribution method and delete the X% of pixels it ranks as most important
- Retrain your network on the modified data and see how much performance changes; a good attribution method should cause a large drop
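The steps above can be sketched on synthetic data, assuming NumPy. Everything here is a simplified stand-in: a least-squares linear classifier replaces the network, 50 features replace pixels, and the importance ranking is global rather than per-image. For a linear model the gradient is just the weight vector, so this toy is a favourable case for gradient attribution.

```python
import numpy as np

def train_linear(X, y):
    """Least-squares linear classifier for labels in {0, 1} (stand-in for a network)."""
    Xb = np.hstack([X, np.ones((len(X), 1))])  # append a bias column
    w, *_ = np.linalg.lstsq(Xb, 2.0 * y - 1.0, rcond=None)
    return w

def accuracy(w, X, y):
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return float(((Xb @ w > 0) == (y == 1)).mean())

def remove_features(X, ranking, fraction, fill):
    """Replace the top `fraction` of features in `ranking` with an uninformative value."""
    k = int(round(fraction * X.shape[1]))
    X = X.copy()
    X[:, ranking[:k]] = fill
    return X

def roar_accuracy(X_tr, y_tr, X_te, y_te, ranking, fraction):
    """Remove features from train and test data, retrain, and report test accuracy."""
    fill = X_tr.mean()
    w = train_linear(remove_features(X_tr, ranking, fraction, fill), y_tr)
    return accuracy(w, remove_features(X_te, ranking, fraction, fill), y_te)

# Synthetic data: 50 "pixels", of which only the first 5 carry class signal.
rng = np.random.default_rng(0)
n, d, informative = 2000, 50, 5
y = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, d))
X[:, :informative] += (2 * y - 1)[:, None]
X_tr, y_tr, X_te, y_te = X[:1000], y[:1000], X[1000:], y[1000:]

# Attribution for a linear model: the gradient magnitude is |w| (bias excluded).
w_full = train_linear(X_tr, y_tr)
by_attribution = np.argsort(-np.abs(w_full[:-1]))
by_chance = rng.permutation(d)  # random-deletion baseline

acc_full = accuracy(w_full, X_te, y_te)
acc_attr = roar_accuracy(X_tr, y_tr, X_te, y_te, by_attribution, fraction=0.1)
acc_rand = roar_accuracy(X_tr, y_tr, X_te, y_te, by_chance, fraction=0.1)
```

Comparing `acc_attr` with `acc_rand` is the core of the benchmark: an attribution method is only informative if deleting its top-ranked features hurts the retrained model more than deleting random ones.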
Bite-sized
@Describe the Clever Hans effect and its standard computer-vision instance from Lecture 9.
Clever Hans was a German horse (1895-1916) that appeared to do arithmetic by tapping its hoof. A formal investigation revealed the horse was actually reading subtle cues from its trainer’s body language. The trainer was unaware of providing such cues.
The CV analogue: a classifier appears to perform a task well, but is actually exploiting a spurious correlation in the training data rather than the intended visual content.
Lecture 9’s worked example: a horse classifier on PASCAL VOC achieves ~90% accuracy. Visualising the heatmap reveals that it is not looking at the horse; it is keying on the copyright notice that appears in the lower part of many horse photographs. Averaged across the dataset, the most “activated” region of horse images is exactly where the copyright notices sit.
So the model has “learned” not the visual concept of horse, but the visual concept of copyright-notice-likely-on-horse-photo. This is why test-set accuracy alone is insufficient to validate a model.
GDPR Article 13(2)(f) gives data subjects the right to explanation of automated decisions made about them: specifically, “meaningful information about the logic involved, as well as the significance and the envisaged consequences” of automated decision-making and profiling. This is the legal underpinning of the field of explainable AI in Europe.
@Describe the distinction between black-box and white-box attribution methods.
- Black-box: only the inputs and outputs of the model are observed. The attribution must be inferred from how the output changes when the input is perturbed. Example: occlusion, where a sliding patch of the image is blocked and the change in target-class probability is recorded. Slow (many forward passes) and depends on the form of perturbation.
- White-box: the internal weights and computations of the model are accessible. Attribution leverages this access directly. Example: gradient methods, which take $ \vert \nabla _ {\pmb x} f(\pmb x) \vert _ 1$ to find the input directions that change the output most. Fast (one backward pass) and continuous, but sensitive to ReLU/pooling choices.
The lecture argues white-box methods are typically harder to make robust (their maps can be largely independent of the network weights, in which case they fail sanity checks), whereas black-box methods are easier to interpret but slower.
A standard sanity check for an attribution / saliency method is to randomly reinitialise the model’s weights and check that the saliency map also changes (becomes random). If a saliency method produces the same map for both a trained and a randomly-initialised network, it is not actually visualising what the network has learned — it is just visualising input edge structure or similar.
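A minimal sketch of this weight-randomisation check, assuming NumPy and a linear model $f(\pmb x) = \sum _ i w _ i x _ i$. Both saliency functions and the use of a rank correlation as the similarity score are illustrative assumptions, not the lecture's implementation.

```python
import numpy as np

def rank_correlation(a, b):
    """Spearman-style rank correlation between two saliency maps (no tie correction)."""
    ra = np.argsort(np.argsort(a.ravel()))
    rb = np.argsort(np.argsort(b.ravel()))
    return float(np.corrcoef(ra, rb)[0, 1])

def randomisation_check(saliency_fn, weights, x, rng):
    """Similarity between the map for the given weights and for freshly drawn random ones."""
    random_weights = rng.normal(size=weights.shape)
    return rank_correlation(saliency_fn(weights, x), saliency_fn(random_weights, x))

# Two hypothetical saliency methods for the linear model f(x) = sum(weights * x).
def gradient_saliency(weights, x):
    return np.abs(weights)  # |df/dx|: depends on the weights

def edge_saliency(weights, x):
    # Ignores the model entirely: horizontal intensity differences of the input.
    return np.abs(np.diff(x, axis=1, append=x[:, -1:]))

rng = np.random.default_rng(0)
x = rng.uniform(size=(16, 16))
weights = rng.normal(size=(16, 16))  # stand-in for "trained" weights

score_gradient = randomisation_check(gradient_saliency, weights, x, rng)
score_edges = randomisation_check(edge_saliency, weights, x, rng)
```

A score near 1 means the map did not change when the weights were randomised, so the method fails the check. That is the case for `edge_saliency`, whose two maps are identical by construction, whereas `gradient_saliency` should give a score near 0.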
ROAR (RemOve And Retrain) results show that simple gradient-magnitude attribution can perform worse than randomly deleting pixels, i.e. it is not actually identifying the most important pixels for the model’s decision. Ensemble methods that aggregate gradients over many noisy copies of the input (e.g. SmoothGrad-Squared, VarGrad) substantially outperform the vanilla gradient on ROAR.
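The ensemble idea can be sketched as follows, assuming NumPy and a hypothetical toy network with a hand-written gradient. The sketch shows both the plain average of noisy gradients (SmoothGrad) and the average of their squares (SmoothGrad-Squared); the noise level `sigma` and the sample count are arbitrary illustrative values.

```python
import numpy as np

rng = np.random.default_rng(0)
D, HIDDEN = 64, 16
# Hypothetical toy network f(x) = w2 . tanh(W1 x), standing in for a classifier.
W1 = rng.normal(scale=0.2, size=(HIDDEN, D))
w2 = rng.normal(size=HIDDEN)

def grad_f(x):
    """Analytic gradient of the toy network with respect to its input."""
    hidden = np.tanh(W1 @ x)
    return W1.T @ (w2 * (1.0 - hidden ** 2))

def noisy_gradients(x, n_samples=50, sigma=0.1, seed=0):
    """Gradients evaluated at n_samples Gaussian-perturbed copies of x."""
    noise_rng = np.random.default_rng(seed)
    return np.stack([grad_f(x + noise_rng.normal(scale=sigma, size=x.shape))
                     for _ in range(n_samples)])

def smoothgrad(x, **kwargs):
    """Magnitude of the average gradient over the noisy copies."""
    return np.abs(noisy_gradients(x, **kwargs).mean(axis=0))

def smoothgrad_squared(x, **kwargs):
    """Average of the squared gradients over the noisy copies."""
    return (noisy_gradients(x, **kwargs) ** 2).mean(axis=0)

x = rng.uniform(size=D)
sg = smoothgrad(x)
sg2 = smoothgrad_squared(x)
```

With `sigma=0.0` both functions reduce to the vanilla gradient (its magnitude and its square respectively), which makes the ensemble step easy to isolate when comparing methods.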
The PASCAL Visual Object Classes (VOC) benchmark covers 20 classes (person, bird, cat, cow, dog, horse, sheep, aeroplane, bicycle, boat, bus, car, motorbike, train, bottle, chair, dining table, potted plant, sofa, tv/monitor) across $\sim 10{,}000$ images with $\sim 25{,}000$ annotated objects. It ran 2005-2012 and was the standard pre-ImageNet detection/classification benchmark.
@Justify why t-SNE plots should be interpreted with caution.
- Stochasticity: t-SNE is initialised randomly and optimised via gradient descent; each run produces a different embedding for the same data. Different seeds can produce visually quite different cluster shapes, so any particular t-SNE figure must not be over-interpreted as canonical.
- Distances are not preserved: t-SNE optimises for local neighbourhood preservation only. Inter-cluster distances and absolute scale are essentially meaningless — two clusters that look far apart in t-SNE may actually be close in the original space, and vice versa.
- Cluster sizes are not informative: t-SNE tends to equalise cluster densities, expanding dense clusters and contracting sparse ones, so the visual size of a cluster does not reflect its spread in the original space.
- Non-linear: unlike PCA, there is no analytic interpretation of the t-SNE coordinates.
So t-SNE is good for exploratory hints about clustering structure, but bad for quantitative claims about distances or proportions.
@Justify why naive input-reconstruction by gradient ascent (find the input that maximises class probability) produces noise-like images rather than canonical examples of a class.
The gradient-ascent procedure $\pmb x \leftarrow \pmb x + \eta \nabla _ {\pmb x} p(\text{class} \mid \pmb x)$ has no constraint that $\pmb x$ should look like a natural image. So the optimisation finds an adversarial example: a high-frequency, low-amplitude perturbation pattern that happens to maximally activate the target class. These look like random-coloured noise to a human.
To get class-canonical visualisations, one must add regularisers that push $\pmb x$ toward the natural-image manifold:
- Smoothness regulariser (total variation $L _ 1$ on image gradients) reduces high-frequency noise.
- Image jittering: randomly shift $\pmb x$ by a few pixels each iteration, forcing the optimum to be robust to small spatial perturbations.
- More sophisticated regularisers (Olah et al., Distill.pub) give significantly better visualisations.
This same observation underlies adversarial examples generally: neural network output landscapes are extremely non-smooth in input space.
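A minimal sketch of the naive and the regularised procedure, assuming NumPy. The linear-softmax classifier with random weights is a hypothetical stand-in for a trained network, the objective is the log-probability rather than the probability itself, and the step size and regularisation weight are arbitrary illustrative values.

```python
import numpy as np

rng = np.random.default_rng(0)
H, W, N_CLASSES = 16, 16, 5
# Hypothetical linear-softmax classifier standing in for a trained network.
weights = rng.normal(scale=0.1, size=(N_CLASSES, H, W))

def log_prob_and_grad(x, target):
    """log p(target | x) and its gradient with respect to the image x."""
    logits = (weights * x).sum(axis=(1, 2))
    logits = logits - logits.max()  # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    # d log p_target / dx = W_target - sum_k p_k W_k
    grad = weights[target] - np.tensordot(probs, weights, axes=1)
    return float(np.log(probs[target])), grad

def tv_and_subgrad(x):
    """Anisotropic total variation (L1 on neighbour differences) and a subgradient."""
    dx, dy = np.diff(x, axis=1), np.diff(x, axis=0)
    g = np.zeros_like(x)
    g[:, 1:] += np.sign(dx)
    g[:, :-1] -= np.sign(dx)
    g[1:, :] += np.sign(dy)
    g[:-1, :] -= np.sign(dy)
    return float(np.abs(dx).sum() + np.abs(dy).sum()), g

def reconstruct(target, steps=200, lr=0.05, tv_weight=0.0, jitter=0, seed=0):
    """Gradient ascent on log p(target | x), optionally with TV penalty and jitter."""
    step_rng = np.random.default_rng(seed)
    x = np.zeros((H, W))
    for _ in range(steps):
        shift = (0, 0)
        if jitter:
            shift = tuple(int(s) for s in step_rng.integers(-jitter, jitter + 1, size=2))
        _, grad = log_prob_and_grad(np.roll(x, shift, axis=(0, 1)), target)
        grad = np.roll(grad, (-shift[0], -shift[1]), axis=(0, 1))  # undo the jitter
        _, tv_grad = tv_and_subgrad(x)
        x = x + lr * (grad - tv_weight * tv_grad)
    return x

naive = reconstruct(target=2)
regularised = reconstruct(target=2, tv_weight=0.02, jitter=1)
```

The naive result raises the class log-probability but inherits the noise-like structure of the weights. The regularised run trades some of that score for a much lower total variation.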
@Describe the confirmation-bias trap in interpreting attention-map visualisations.
ViT-style attention maps offer many choices for what to visualise:
- which layer (out of $\sim 12$),
- which token’s attention pattern,
- which attention head (out of $\sim 12$),
- whether to look at a single head or aggregate.
With $\sim 12$ layers and $\sim 12$ heads, that is already on the order of $144$ single-head maps per token for each input, before any choice of aggregation. With this much freedom, it is almost always possible to find some configuration where the attention map appears to “highlight the right thing”, irrespective of whether the model is genuinely attending to that region for its decision.
So a single pretty attention-map figure can be cherry-picked, and on its own provides weak evidence about what the model is doing. Robust interpretability claims need cross-layer/head aggregation, sanity checks, and quantitative metrics like ROAR rather than single-image visual storytelling.
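One possible form of cross-layer/head aggregation is attention rollout, sketched below assuming NumPy. This is an illustrative choice rather than a method named in the lecture, and the attention tensor is random stand-in data with a small assumed token count (one class token plus $7 \times 7$ patches).

```python
import numpy as np

def attention_rollout(attn):
    """Aggregate attention of shape (layers, heads, tokens, tokens) into (tokens, tokens).

    Each attention row is assumed to sum to 1.
    """
    n_tokens = attn.shape[-1]
    rollout = np.eye(n_tokens)
    for layer_attn in attn:
        a = layer_attn.mean(axis=0)             # average over heads
        a = 0.5 * a + 0.5 * np.eye(n_tokens)    # account for the residual connection
        a = a / a.sum(axis=-1, keepdims=True)   # keep rows normalised
        rollout = a @ rollout                   # compose with earlier layers
    return rollout

rng = np.random.default_rng(0)
layers, heads, tokens = 12, 12, 1 + 7 * 7  # assumed sizes: class token + 7x7 patches
raw = rng.uniform(size=(layers, heads, tokens, tokens))
attn = raw / raw.sum(axis=-1, keepdims=True)  # random stand-in for real attention

rollout = attention_rollout(attn)
cls_map = rollout[0, 1:].reshape(7, 7)  # class token's aggregated attention over patches

# Number of single-head, single-token maps one could have cherry-picked from instead.
n_single_head_maps = layers * heads * tokens
```

Reporting one aggregated map per input removes the freedom to choose a flattering layer, head, and token, although the result still needs sanity checks and quantitative evaluation.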
The first-layer filters of a CNN are visually interpretable as oriented edges, blobs, and colour patterns — much like SIFT descriptors and the receptive fields of biological V1 cells. Deeper layers are not directly visualisable because they operate on the features computed by earlier layers, not on raw pixels.