Figure 1: Left: Example of models either succeeding or failing to articulate a cue that influences their answer. We edit an MMLU question by prepending a Stanford professor's opinion. For examples like this where the cue changes the model answer, we measure how often models articulate the cue in their CoT. (Here we show only options A and B, rather than all four.) Right: Inference-Time-Compute models articulate the cue more often. The ITC version of Qwen refers to QwQ-32b-Preview, and non-ITC refers to Qwen-2.5-72B-Instruct. For Gemini, we use gemini-2.0-flash-thinking-exp and gemini-2.0-flash-exp respectively. TLDR: We evaluate two Inference-Time-Compute models, QwQ-32b-Preview and Gemini-2.0-flash-thinking-exp for CoT faithfulness. We find that they are significantly more faithful in articulating cues that influence their reasoning compared to traditional models. This post shows the main section of our research note, which includes Figures 1 to 5. Full research note which includes other tables and figures [...]
Outline:
(01:35) Abstract
(03:26) 1. Introduction
(09:00) 2. Setup and Results of Cues
(10:15) 2.1 Cue: Professors Opinion
(12:08) 2.2 Cue: Few-Shot with Black Square
(14:55) 2.3 Other Cues
(18:54) 3. Discussion
(18:58) Improving non-ITC articulation
(19:27) Advantage of ITC models in articulation
(20:13) Length of CoTs across models
(21:05) False Positives
(22:35) Different articulation rates across cues
(23:12) Training data contamination
(23:45) 4. Limitations
(23:49) Lack of ITC models to evaluate
(24:26) Limited cues studied
(24:51) Subjectivity of judge model
(25:22) Acknowledgments
(25:38) Links
First published: January 15th, 2025
---
Narrated by TYPE III AUDIO).
Images from the article:
)
)
)
)
)
)
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts), or another podcast app.