O3 is OpenAI's second-generation reasoning model. The company skipped O2 to avoid an intellectual property dispute with a large British telecom company.
O3 outperformed O1 by nearly 23 percentage points on a standard coding benchmark and surpassed OpenAI's chief scientist on Codeforces, ranking among the top 200 in the world.
O3 achieved a near-perfect score on the AIME math exam, missing only one question.
O3 crossed the 85% human-performance threshold on the ARC-AGI test, roughly tripling O1's score. This test measures a model's ability to handle novel problems that are difficult to pre-train, focusing on reasoning capabilities.
Chollet acknowledged O3 as a significant breakthrough in AI's ability to adapt to novel tasks but noted that it is not yet AGI, as there are still easy tasks it cannot solve.
O3's coding abilities suggest it could outperform 99.95% of programmers on competitive coding platforms, raising concerns about job displacement in the coding industry.
While O3 excels in competitive coding challenges, it may not be as effective in real-world programming tasks that require broader problem-solving and collaboration skills.
Deedy Das noted that O3 achieved a 25% success rate on a highly challenging math benchmark created by math professors, a feat no other model has come close to.
Harry Law argued that at $3,000 per task, O3 is already more cost-effective than hiring McKinsey, highlighting its potential as a labor-replacement tool despite its high compute costs.
Mollick argues that societal and organizational change will be slower than technological advancements due to human inertia, giving society time to adapt to AI's capabilities.
Today on the AI Daily Brief, did we just get AGI for Christmas? The AI Daily Brief is a daily podcast and video about the most important news and discussions in AI. To join the conversation, follow the Discord link in our show notes. Hello, friends. For our last regular AI Daily Brief episode of the year, we are skipping the headlines. It's mostly small things, a couple of new AI appointments to the White House, stuff like that. Instead, we are going to spend all of our time on the big discussion from the last three days, which is whether OpenAI just gave us AGI. What's going on, guys? It has been a very interesting 12 Days of Shipmas from OpenAI. We kicked it off with the full version of O1. Maybe the biggest announcement was Sora. But then by the end of last week, it seemed like just maybe we were actually going to get an entirely new model. If you listened to my Friday episode, you heard all of the evidence that we were going to get O3. And indeed, that is what happened.
Specifically, on Friday, OpenAI announced their second generation of reasoning models, O3 and O3 Mini. Now, if you're wondering what the hell the name is about, the company skipped O2 in order to avoid an intellectual property dispute with the large British telco of the same name. Sam Altman said that the company was simply upholding its tradition of being truly bad at names.
And just to cut to the chase, while the announcement itself was relatively muted, the conversation that has followed has been all about whether this actually represents something close to AGI. So today we're going to explore all of those arguments and what we should actually think about this. Now, certainly from the numbers they shared, the model seems very good. On the standard coding benchmark SWE-bench Verified, O3 bettered O1 by almost 23 percentage points.
It also bested the company's chief scientist on the competitive coding platform Codeforces. In fact, right now, there are fewer than 200 people in the world with a better score on Codeforces, 174 to be exact.
In somewhat understated fashion, then, Altman said that the model is, quote, incredible at coding. The model also achieved a near-perfect score on the AIME math exam, missing only a single question. It achieved an 87.7% on the expert-level science benchmark GPQA Diamond, far exceeding top human performance. Still, while those benchmark results are practical and important, a significant amount of focus, at least in the announcement, was on how O3 performed on the ARC-AGI test.
That test attempts to measure a model's ability to deal with novel problems that are difficult to pre-train. It's viewed as testing reasoning capability at the bare minimum and is one plausible benchmark for when AGI has been achieved.
O3 crossed the 85% human performance threshold for AGI, tripling the score achieved by O1. This year's ARC Prize winner scored 53.5% using a fine-tuned model of a novel design, and only a handful of attempts have managed to score higher than 30%, just to give a sense of how far the bar was raised. One of the interesting things about the test is that it's relatively easy for humans to solve using basic logic and reasoning, but has so far stumped AI models.
You might have seen these grids of red and blue boxes over the past few days, and this is one of the hardest problems on the test.
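To make the format concrete, here is a minimal, hypothetical Python sketch of an ARC-style task. The grids and the mirror rule are invented for illustration and are far simpler than real test items, though the train/test structure loosely follows the JSON format of the public ARC dataset.

```python
# Toy illustration of an ARC-style task (not an actual test item).
# Each task provides a few input/output grid pairs; the solver must infer
# the transformation rule and apply it to a new input. Grids are small
# arrays of color codes.

task = {
    "train": [
        {"input": [[1, 0], [0, 0]], "output": [[0, 1], [0, 0]]},
        {"input": [[0, 0], [2, 0]], "output": [[0, 0], [0, 2]]},
    ],
    "test": [{"input": [[3, 0], [1, 0]]}],
}

def mirror_horizontally(grid):
    """The hidden rule for this toy task: flip each row left to right."""
    return [list(reversed(row)) for row in grid]

# A human spots the rule from two examples. The point of ARC-AGI is that
# every task encodes a novel rule, so it can't simply be memorized during
# pre-training.
for pair in task["train"]:
    assert mirror_horizontally(pair["input"]) == pair["output"]

print(mirror_horizontally(task["test"][0]["input"]))  # [[0, 3], [0, 1]]
```

A human infers the rule from a couple of examples in seconds; until O3, models consistently failed at exactly this kind of on-the-fly rule induction.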
François Chollet, a legend in machine learning and the creator of the test, wrote: Today, OpenAI announced O3, its next-gen reasoning model. We've worked with OpenAI to test it on ARC-AGI, and we believe it represents a significant breakthrough in getting AI to adapt to novel tasks. It scores 75.7% on the semi-private eval in low-compute mode, for $20 per task in compute, and 87.5% in high-compute mode, which costs thousands of dollars per task.
It's very expensive, but it's not just brute force. These capabilities are new territory, and they demand serious scientific attention. Now, when it comes to the question of AGI, OpenAI is not claiming that title, but they are using big language. OpenAI co-founder and president Greg Brockman wrote, O3 is a breakthrough, with a step function improvement on our hardest benchmarks. But that's different, of course, than claiming AGI.
One of the first noteworthy opinions on whether this is AGI came from Chollet himself. In the thread announcing the test results, he commented: So is this AGI?
While the new model is very impressive and represents a big milestone on the way towards AGI, I don't believe this is AGI. There's still a fair number of very easy ARC-AGI-1 tasks that O3 can't solve, and we have early indicators that ARC-AGI-2 will remain extremely challenging for O3. This shows that it's still feasible to create unsaturated, interesting benchmarks that are easy for humans yet impossible for AI, without involving specialist knowledge. We will have AGI when creating such evals becomes outright impossible.
As a total aside on the ARC Prize itself, that test is run on a fully private set of questions and must be completed using just 10 cents of compute per task. The team is committed to keeping those parameters until someone releases an open-source model that can achieve an 85% score. Chollet believes version one of this test is now saturated and no longer a useful benchmark, but expects version two to present a much greater challenge.
He added, The ARC Prize 2025 leaderboards will be the best place to monitor reproduction attempts.
Vanta automates compliance for ISO 27001, SOC 2, GDPR, and leading AI frameworks like ISO 42001 and the NIST AI Risk Management Framework, saving you time and money while helping you build customer trust. Plus, you can streamline security reviews by automating questionnaires and demonstrating your security posture with a customer-facing trust center, all powered by Vanta AI.
Over 8,000 global companies like Langchain, Leela AI, and Factory AI use Vanta to demonstrate AI trust and prove security in real time. Learn more at vanta.com slash nlw. That's vanta.com slash nlw.
If there is one thing that's clear about AI in 2025, it's that the agents are coming. Vertical agents by industry, horizontal agent platforms, agents per function. If you are running a large enterprise, you will be experimenting with agents next year. And given how new this is, all of us are going to be back in pilot mode.
That's why Superintelligent is offering a new product for the beginning of this year. It's an agent readiness and opportunity audit. Over the course of a couple quick weeks, we dig in with your team to understand what type of agents make sense for you to test, what type of infrastructure support you need to be ready, and to ultimately come away with a set of actionable recommendations that get you prepared to figure out how agents can transform your business.
If you are interested in the agent readiness and opportunity audit, reach out directly to me, nlw at bsuper.ai. Put the word agent in the subject line so I know what you're talking about. And let's have you be a leader in the most dynamic part of the AI market. Now, of course, the question of whether a thing is or isn't AGI is ultimately less relevant than how good it is at doing the things that people currently do, and what that's going to mean for jobs, the economy, and society.
The big place where this conversation was taking place was around developers. Florian Mai writes, "O3 is better than 99.95% of programmers. The public needs to wake up to what's happening so we can act responsibly. For that to happen, we first need the scientific community to acknowledge the evidence. This is the most important problem of our time." Entrepreneur Sully writes, "Yeah, it's over for coding with O3. This is mind-boggling. Looks like the first big jump since GPT-4, because these numbers make zero sense."
Still, some are pointing out that coding competitions don't necessarily translate to real-life problems. Machine learning instructor Santiago wrote, O3 is better than 99.95% of programmers at solving Codeforces problems. 99.99% of professional programmers don't need to solve Codeforces problems to make a living. There's absolutely no proof that O3 is capable of doing what those professional programmers do to make money. He continued, I'm not downplaying how much the world is changing. My argument is about what exactly performing well on software engineering benchmarks tells us, and how it relates to the current work of software engineers.
The other benchmarks were no less impressive or paradigm-shifting. Deedy Das, a VC at Menlo Ventures, tried to describe just how wild the math benchmark is, commenting, 99.99% of people cannot comprehend how insane FrontierMath is. The problems are created by math professors and are not in any training data. Math legend Terry Tao said these are extremely challenging; I think they will resist AIs for several years at least. OpenAI's O3 did 25% on this. At this stage, no other model has completed more than a single question.
And this is where we started to see some big-think, implication-type conversations. Stability AI co-founder Emad Mostaque wrote, my take on O3: the global economy is cooked. We need a new economic and societal framework. Any work that can be done on the other side of a computer screen, AI will be able to do at a fraction of the price. Harry Law, a Google DeepMind and Cambridge University alumnus, wrote, at $3,000 per task, O3 is already a more cost-effective solution than hiring McKinsey.
And while I think the analogy is not even close to perfect, there is an important point here: a lot of these numbers only seem expensive when they're placed in the context of software pricing, not so much when they're framed as labor replacement. Nick Cammarata writes, I set my AI expectations to unrealistically high bonkers AI world and I still underestimated recent progress.
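To put rough numbers on that software-versus-labor framing, here is a back-of-the-envelope sketch. The hourly rate and hours-per-task figures are assumptions invented purely for illustration, not sourced data.

```python
# Back-of-the-envelope cost framing. All labor figures below are
# assumptions for illustration only, not sourced numbers.
O3_HIGH_COMPUTE_COST = 3_000   # dollars per task, per the discussion above
ASSUMED_HOURLY_RATE = 300      # dollars/hour for senior knowledge work (assumption)
ASSUMED_HOURS_PER_TASK = 40    # human hours for a comparable task (assumption)

human_cost = ASSUMED_HOURLY_RATE * ASSUMED_HOURS_PER_TASK
print(f"Assumed human cost per task: ${human_cost:,}")        # $12,000
print(f"O3 high-compute cost per task: ${O3_HIGH_COMPUTE_COST:,}")

# Priced against software (cents per API call), $3,000 per task looks
# absurd; priced against knowledge-work labor, it can already be the
# cheaper line item.
```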
And while I don't want to go deep into the debates around what this means for the singularity and hard takeoff and all those sorts of theories, at least not in this particular episode, what's important relative to O3 is that they're part of the conversation. One thing that was notable was how big a gap there is between the insider AI conversation and what's being reported in the mainstream press.
Adam D'Angelo writes, wild that the O3 results are public and yet the market still isn't pricing in AGI. Bloomberg reported it as just another leg of the race between OpenAI and Google. The Wall Street Journal ran a feature story about delays to GPT-5 with the headline, The Next Great Leap in AI Is Behind Schedule and Crazy Expensive. And yet for all the discussion around how cooked people are and all this sort of stuff, I think it's really important to have some perspective here as well. Replit CEO Amjad Masad said, the idea that O3 will automate software engineers is silly.
Matt Griswold pointed out that the replacement of developers is progressing at a much slower pace than the advancement of the technology itself.
Professor Ethan Mollick makes a point that I make all the time. The reason everything will not change quickly, he writes, even if AI generally exceeds human capabilities across fields, is in large part the nature of systems. Organizational and societal change is much slower than technological change, even when the incentives to change quickly are there. Human social and organizational inertia is going to be a slowdown force that helps us have time to adapt.
Julia McCoy flips it around and says, hype about O3 misses the plot. This isn't about AI getting smarter. It's about humans getting freer. No more data entry, no more mundane tasks, no more trading time for money. Cushy has a similar point. If you view the O3 launch as anything less than irrefutable evidence that this is the most exciting time to be alive, you may need to take a deep breath and rekindle your optimism.
And for those trying to figure out where to spend their time now that this exists, Bojan Tunguz writes, I've been telling you for a while not even to try to compete with machines at being a better machine. Instead, try competing with humans at being a better human.
Look, call me optimistic, but at the end of the day, I just think that the entire history of human experience points towards the output of this explosion of intelligence being a massive increase in human creation. We're going to make more stuff. We're going to make more code. We're going to make more products. We're going to make more entertainment. None of which is to say that the disruptions along the way won't be painful and we do need to deal with them.
But I continue to think that the future is going to be even more exciting than the present. And that seems to me to be a pretty good way to close out 2024. Now, this will be the last regular AI Daily Brief episode of the year. From here on, I've got a number of end of year episodes, which I'm really excited about. We've got the 15 most important AI products of the year, 25 predictions for agents, and a bunch more.
For now, though, can't tell you how much I appreciate you guys watching, listening, hanging out with me here every day. I hope that you are headed into a wonderful holiday season. Until next time, peace.