
#201 - GPT 4.5, Sonnet 3.7, Grok 3, Phi 4

2025/3/5

Last Week in AI


People

Andrey Kurenkov

Sharon Zhou

Topics

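Andrey Kurenkov: I think GPT-4.5, released by OpenAI, is a very large model, possibly an order of magnitude larger than other large language models. Although it scores well on benchmarks, in practice it is very slow and expensive ($75 per million input tokens). OpenAI emphasizes that GPT-4.5 improves on emotional intelligence and more pleasant conversation rather than delivering a significant jump in intelligence. I think this suggests that simply scaling up large language models may be running into diminishing returns. OpenAI also seems to be deliberately positioning GPT-4.5 as a consumer assistant geared toward writing rather than a coding assistant. This contrasts with Anthropic's Claude Sonnet 3.7, which performs well on coding benchmarks and launched alongside a coding tool called Claude Code. Overall, the release of GPT-4.5 did not make the splash people expected, which may reflect attention shifting from pure scaling toward reasoning capabilities and more efficient training methods.

Sharon Zhou: Claude Sonnet 3.7, released by Anthropic, is a hybrid model that combines reasoning and non-reasoning capabilities, intended to simplify the user experience so that users do not have to switch between models. Although it is expensive ($3 per million input tokens and $15 per million output tokens), it performs well on programming and code-writing benchmarks. Claude Sonnet 3.7 is also integrated with a coding tool called Claude Code, which lets users run tasks directly from the terminal. In addition, Claude Sonnet 3.7 has improved in reliability, reducing unnecessary confusion. Many users are excited about its performance in agentic mode, believing it can generate complete applications or websites within a few hours. However, I personally have not found it noticeably different from version 3.5 on basic software engineering tasks.

Grok 3, released by xAI, ranks at the top of LLM leaderboards. It combines image analysis and reasoning capabilities and shows its reasoning process in detail. Grok 3 used a large number of GPUs (about 200,000), with more than ten times the compute of its predecessor, Grok 2. Although there is some controversy around Grok 3, for example that it may reflect Elon Musk's views, its performance on benchmarks and in real use is excellent, on par with models from OpenAI and Anthropic.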

Deep Dive

Chapters

This section covers OpenAI's GPT-4.5 release and how it compares with Anthropic's Claude Sonnet 3.7. The discussion analyzes the new capabilities, the costs, and how these models stand out in the current AI landscape.

GPT-4.5 is released by OpenAI, emphasizing emotional intelligence over reasoning.
The model is significantly larger and costlier, priced at $75 per million input tokens.
Claude Sonnet 3.7 is a hybrid model integrating reasoning, priced at $3 per million input tokens.
Anthropic's model excels in coding benchmarks, indicating a focus on code automation.
OpenAI's GPT-4.5 focuses more on writing and consumer assistance rather than programming.
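
To put the quoted prices in perspective, here is a rough sketch of the arithmetic in Python. It uses only the per-million-token prices mentioned in the summary; the model keys, the `cost_usd` helper, and the 200,000-token workload are hypothetical names and numbers chosen for illustration. GPT-4.5's output price is not given here, so only its input cost is modeled.

```python
# Prices in USD per million tokens, as quoted in the episode summary.
# Assumption: the dictionary keys are illustrative labels, not official model IDs.
PRICE_PER_MILLION = {
    "gpt-4.5": {"input": 75.0},  # output price not stated in the summary
    "claude-sonnet-3.7": {"input": 3.0, "output": 15.0},
}


def cost_usd(model, input_tokens, output_tokens=0):
    """Return the estimated cost in USD for a given token count (hypothetical helper)."""
    prices = PRICE_PER_MILLION[model]
    cost = input_tokens * prices["input"] / 1_000_000
    if output_tokens:
        # Raises KeyError when no output price is known for the model.
        cost += output_tokens * prices["output"] / 1_000_000
    return cost


# Hypothetical workload: 200,000 input tokens sent to each model.
gpt_input_cost = cost_usd("gpt-4.5", 200_000)
claude_input_cost = cost_usd("claude-sonnet-3.7", 200_000)

print(f"GPT-4.5 input cost:           ${gpt_input_cost:.2f}")
print(f"Claude Sonnet 3.7 input cost: ${claude_input_cost:.2f}")
print(f"Input price ratio:            {gpt_input_cost / claude_input_cost:.0f}x")
```

On input tokens alone, the quoted prices differ by a factor of 25 ($75 versus $3 per million), which is the gap the hosts point to when calling GPT-4.5 expensive.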

Shownotes

Our 201st episode with a summary and discussion of last week's big AI news! Recorded on 03/02/2025

Join our brand new Discord here: https://discord.gg/nTyezGSKwP

Hosted by Andrey Kurenkov and guest host Sharon Zhou. Feel free to email us your questions and feedback at [email protected] and/or [email protected]

Read our text newsletter and comment on the podcast at https://lastweekin.ai/.

In this episode:

The release of GPT-4.5 from OpenAI, Anthropic's Claude 3.7, and Grok 3 from xAI, comparing their features, costs, and capabilities.
Discussion on new tools and applications including Sesame's new voice assistant and Google's AI coding assistant, Gemini Code Assist, highlighting their unique benefits.
OpenAI's continued user growth despite competition, pricing models for Google's text-to-video platform, and HP acquiring and shutting down Humane's AI pin.
Insights into new research on alignment and specification gaming in LLMs, including papers on fine-tuning causing broad misalignment and Google's multi-agent system for scientific collaboration.

Timestamps + Links:

(00:00:00) Intro / Banter

(00:01:36) News Preview

Tools & Apps

(00:02:33) OpenAI announces GPT-4.5, warns it’s not a frontier AI model

(00:07:22) Anthropic launches a new AI model that ‘thinks’ as long as you want

(00:11:14) New Grok 3 release tops LLM leaderboards

(00:16:43) Sesame is the first voice assistant I’ve ever wanted to talk to more than once

(00:18:30) Google launches a free AI coding assistant with very high usage caps

(00:20:45) Rabbit shows off the AI agent it should have launched with

(00:22:23) Mistral’s Le Chat tops 1M downloads in just 14 days

Applications & Business

(00:24:06) OpenAI Tops 400 Million Users Despite DeepSeek’s Emergence

(00:27:37) Google’s new AI video model Veo 2 will cost 50 cents per second

(00:29:52) HP is buying Humane and shutting down the AI Pin

Projects & Open Source

(00:31:44) Microsoft launches next-gen Phi AI models

(00:33:47) OpenAI introduces SWE-Lancer: A Benchmark for Evaluating Model Performance on Real-World Freelance Software Engineering Work

(00:37:12) SWE-Bench+: Enhanced Coding Benchmark for LLMs

Research & Advancements

(00:40:00) Towards an AI co-scientist

(00:42:52) Magma: A Foundation Model for Multimodal AI Agents

Policy & Safety

(00:47:32) Demonstrating specification gaming in reasoning models

(00:51:03) Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs

Episode length: 58:37
