We're sunsetting PodQuest on 2025-07-28. Thank you for your support!

“Inference-Time-Compute: More Faithful? A Research Note” by James Chua, Owain_Evans

2025/1/15

Figure 1: Left: Example of models either succeeding or failing to articulate a cue that influences their answer. We edit an MMLU question by prepending a Stanford professor's opinion. For examples like this where the cue changes the model answer, we measure how often models articulate the cue in their CoT. (Here we show only options A and B, rather than all four.) Right: Inference-Time-Compute models articulate the cue more often. The ITC version of Qwen refers to QwQ-32b-Preview, and non-ITC refers to Qwen-2.5-72B-Instruct. For Gemini, we use gemini-2.0-flash-thinking-exp and gemini-2.0-flash-exp respectively. TLDR: We evaluate two Inference-Time-Compute models, QwQ-32b-Preview and Gemini-2.0-flash-thinking-exp for CoT faithfulness. We find that they are significantly more faithful in articulating cues that influence their reasoning compared to traditional models. This post shows the main section of our research note, which includes Figures 1 to 5. Full research note which includes other tables and figures [...]

Outline:

(01:35) Abstract

(03:26) 1. Introduction

(09:00) 2. Setup and Results of Cues

(10:15) 2.1 Cue: Professors Opinion

(12:08) 2.2 Cue: Few-Shot with Black Square

(14:55) 2.3 Other Cues

(18:54) 3. Discussion

(18:58) Improving non-ITC articulation

(19:27) Advantage of ITC models in articulation

(20:13) Length of CoTs across models

(21:05) False Positives

(22:35) Different articulation rates across cues

(23:12) Training data contamination

(23:45) 4. Limitations

(23:49) Lack of ITC models to evaluate

(24:26) Limited cues studied

(24:51) Subjectivity of judge model

(25:22) Acknowledgments

(25:38) Links

First published: January 15th, 2025

Source: https://www.lesswrong.com/posts/C8HAa2mf5kcBrpjkX/inference-time-compute-more-faithful-a-research-note)

---

Narrated by TYPE III AUDIO).

Images from the article: undefined )))))) Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts), or another podcast app.

“Inference-Time-Compute: More Faithful? A Research Note” by James Chua, Owain_Evans

LessWrong (30+ Karma)

Abstract: What's the Core of This Research?

Introduction: Setting the Stage

Setup and Results of Cues: How Do Models Respond to Influential Cues?

Cue: Professor's Opinion: Does It Matter?

Cue: Few-Shot with Black Square: A Unique Test

Other Cues: Exploring Further Influences

Discussion: What Do the Results Mean?

Improving Non-ITC Articulation: Can We Bridge the Gap?

Advantage of ITC Models in Articulation: Why Are They Better?

Length of CoTs Across Models: Does Size Matter?

False Positives: What Are the Pitfalls?

Different Articulation Rates Across Cues: Why the Variance?

Training Data Contamination: A Hidden Issue?

Limitations: What Are the Challenges?

Lack of ITC Models to Evaluate: A Limiting Factor?

Limited Cues Studied: Is the Scope Too Narrow?

Subjectivity of Judge Model: How Reliable?

Acknowledgments: Who Contributed?

Links: Where to Find More?

Shownotes Transcript

“Inference-Time-Compute: More Faithful? A Research Note” by James Chua, Owain_Evans 26:09 Share

LessWrong (30+ Karma)

Abstract: What's the Core of This Research?

Introduction: Setting the Stage

Setup and Results of Cues: How Do Models Respond to Influential Cues?

Cue: Professor's Opinion: Does It Matter?

Cue: Few-Shot with Black Square: A Unique Test

Other Cues: Exploring Further Influences

Discussion: What Do the Results Mean?

Improving Non-ITC Articulation: Can We Bridge the Gap?

Advantage of ITC Models in Articulation: Why Are They Better?

Length of CoTs Across Models: Does Size Matter?

False Positives: What Are the Pitfalls?

Different Articulation Rates Across Cues: Why the Variance?

Training Data Contamination: A Hidden Issue?

Limitations: What Are the Challenges?

Lack of ITC Models to Evaluate: A Limiting Factor?

Limited Cues Studied: Is the Scope Too Narrow?

Subjectivity of Judge Model: How Reliable?

Acknowledgments: Who Contributed?

Links: Where to Find More?

Shownotes Transcript

“Inference-Time-Compute: More Faithful? A Research Note” by James Chua, Owain_Evans