Article - Do reasoning models use their scratchpad like we do? (2025)


  • Full title: Do reasoning models use their scratchpad like we do? Evidence from distilling paraphrases
  • Author(s): Fabien Roger
  • Year: 2025
  • Link: https://alignment.anthropic.com/2025/distill-paraphrases/
  • Relevant for:

Summary

  • The worry: models might gain extra performance through encoded reasoning (chain-of-thought steganography) hidden in their scratchpads, which would make the scratchpad illegible to overseers
  • Paraphrasing a reasoning trace should destroy any steganographic payload while preserving its legible content; distilling paraphrased reasoning traces into a non-reasoning model recovers pass@1 accuracy roughly matching that of the original reasoning model, suggesting the model is not using encoded reasoning to squeeze extra performance out of its scratchpad
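The comparison above is in terms of pass@1. As a reference point (not part of the post itself), the standard unbiased pass@k estimator from the HumanEval/Codex evaluation setup can be sketched as follows, where `n` samples are drawn per problem and `c` of them are correct:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples,
    drawn without replacement from n total samples of which c are
    correct, passes. pass@1 reduces to c / n."""
    if n - c < k:
        # Fewer incorrect samples than k: some draw must be correct.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def mean_pass_at_1(results: list[tuple[int, int]]) -> float:
    """Average pass@1 over a benchmark, given (n, c) per problem."""
    return sum(pass_at_k(n, c, 1) for n, c in results) / len(results)
```

Under this metric, "roughly the same pass@1" means the per-problem success rates of the paraphrase-distilled model and the original reasoning model average out to about the same value over the benchmark.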

Flashcards




Related posts