People
Andrej Karpathy
Andrew Chen
Chris Back
Henry Shi
Justin Duke
Mihir Patel
Mira Murati
Mustafa Suleyman
Nick Dobos
OpenAI researchers
Host
A podcast host and content creator focused on electric vehicles and energy.
Topics
Mira Murati: I founded Thinking Machines to make AI systems easier to use and customize, and to break through the limits current AI systems place on public discourse and practical application. I'm committed to building AI systems that are easier to understand, customize, and apply broadly; Thinking Machines Lab aims to bridge those gaps so that AI becomes more widely understood, customizable, and generally capable.
Host: Thinking Machines' current goals and products are still unclear, and its public messaging reads as vaguely gesturing at the future. The team has strong backgrounds, but the company's concrete goals and roadmap remain unclear, making its direction hard to predict.
Justin Duke: Humane's failure doesn't hold broad lessons; it was more a product of the 2019-2021 venture capital bubble than a sign of general trouble for AI wearables. Its collapse came from overheated VC funding in that era combined with the company's own problems, and shouldn't simply be blamed on the AI wearables category as a whole.
Chris Back: Humane's failure reflects the broader struggles of the AI wearables industry and is worth reflecting on. It has prompted questions about where AI wearables go from here, and the lessons deserve the industry's attention.
Andrej Karpathy: I coined the term "vibe coding," a new way of programming that leans fully on large language models, reduces reliance on traditional coding technique, and speeds up development. It lets developers focus on the overall concept and design of a project rather than code details; by interacting with LLMs, they can quickly realize ideas and work through problems as they come up.
Mustafa Suleyman: I proposed a new Turing test, whether an AI can earn a million dollars on a retail web platform, as a more accurate gauge of real capability. The traditional Turing test doesn't reflect how AI performs in practice, while my proposal focuses on real-world performance and value.
OpenAI researchers: Frontier large language models still struggle with most real-world software engineering tasks. On the SWE-Lancer benchmark, Claude 3.5 Sonnet performed best, but no model reached the $1 million earnings target. AI agents are good at localizing problems but struggle to find root causes, leading to incomplete or flawed solutions. All models did better on managerial tasks, with Claude 3.5 Sonnet again on top.
Mihir Patel: There is a growing gap between academic benchmarks and real-world use cases, which makes evaluating what AI models can actually do more complicated. Existing benchmarks may not reflect real-world performance, and more effective evaluation methods are needed.
Benjamin De Kraker: OpenAI's own benchmark shows Claude 3.5 Sonnet outperforming OpenAI's models, drawing attention to performance differences between models. Benchmark results can diverge from experience on real projects, a reminder not to rely on benchmarks alone to judge a model's actual capability.
Henry Shi: If AI agents could iterate effectively on a problem, performance would improve dramatically, much as humans refine solutions through feedback at work. In the SWE-Lancer benchmark, agents get only a single attempt at each task, which differs from how real work happens.
Nick Dobos: OpenAI building the SWE-Lancer benchmark may signal that they are developing a production coding agent, and that OpenAI is actively exploring what AI agents can do in practice. OpenAI may be positioning itself in the agent space to gain an edge in future competition.
Andrew Chen: Vibe coding tools are disruptive to software engineering and the economy; they change how traditional engineers write code and expand who can code at all, creating new economic opportunities. As these tools spread, the barrier to software development falls and more people can take part, with far-reaching effects on the software industry and the economy.


Transcript


Today on the AI Daily Brief, OpenAI released a paper effectively seeking to test how competent their leading models are in real-world coding applications. Before that in the headlines, former OpenAI CTO Mira Murati has officially announced her new company Thinking Machines. The AI Daily Brief is a daily podcast and video about the most important news and discussions in AI. To join the conversation, follow the Discord link in our show notes.

Welcome back to the AI Daily Brief Headlines Edition, all the daily AI news you need in around five minutes. OpenAI has had a lot of talent departures over the last year and a half or so. In some cases, it's felt like a protest over the direction the company was going,

and indeed has explicitly been shared as such. In others, it's about people making a boatload of money and just wanting to do something different for a while. And then still in others, it's about building something new outside of the constraints of that company. And among that set, one of the most closely watched people has been former CTO Mira Murati. For months now, there have been rumors around what she's building, mostly fueled by departures,

and recruitment from OpenAI and Anthropic to join Murati at some as-yet-unrevealed company. Now, however, that company has been officially announced.

Yesterday, Mira tweeted,

Alongside it, they published a website, thinkingmachines.ai. They write,

limiting both the public discourse on AI and people's abilities to use AI effectively. And despite their potential, these systems remain difficult for people to customize to their specific needs and values. To bridge these gaps, we're building Thinking Machines Lab to make AI systems more widely understood, customizable, and generally capable.

Now, if you're sitting there thinking, boy, I have absolutely no idea what these folks are actually building, you, my friend, are not alone. Cosmic Chaos writes, good luck. But I'm still not sure what exactly you're building. Is it one product that does all three or separately? Is it a service or a product? And what's your roadmap?

William Wolfe writes, I'm rooting for Thinking Machines, but I wish projects like this had products, both engineering and design, in their founding philosophies. Otherwise, it kind of just feels like yet another group of world-class researchers vaguely gesticulating at the future. Where is the vision? Swyx pointed out what he called two notable omissions from the Thinking Machines manifesto: the website does not use the word reasoning or agent at all. So what are these folks building? I have absolutely no idea.

It does feel a little bit like the type of text that, in retrospect, once we learn what they're building, will make sense. Right now, I think vaguely gesticulating at the future is a pretty accurate way to describe it.

At the end of the day, though, when it comes to things like potential for fundraising, the clarity of the description probably doesn't matter even a little bit. Currently, the 29 or so employees come from places like OpenAI, Meta, Character AI, and Google DeepMind. Barret Zoph, OpenAI's former VP of post-training research, is taking on the CTO role, with OpenAI co-founder John Schulman serving as chief scientist. And indeed, when it comes to people's interest in the company, it's best summed up by Andrej Karpathy, who writes, "...very strong team, a large fraction of whom were directly involved with and built the ChatGPT miracle."

In other words, while this may be a situation where we don't have any idea what they're actually building, they are probably still worth paying attention to. Next up, on the other end of the startup journey, less than a year after launch, the Humane Pin is officially dead and gone. Humane announced on Tuesday that the AI wearable startup has been acquired by HP. Customers were given just 10 days' notice that servers would be shut down, rendering the expensive device useless.

In the FAQ, Humane noted the device could still be used for offline features like checking the battery level. So there's something there, I guess. Now, of course, the Humane pin was a bold early attempt at creating a wearable AI assistant, but fell flat for a number of reasons, all of which have been endlessly discussed in retrospect.

It was originally priced at $699, making it inaccessible to all but very high-end gizmo enthusiasts. Initial reviews were universally terrible, the absolute apex of which was Marques Brownlee calling it "the worst product I've ever reviewed," a review which has been seen 8.5 million times.

Updates also couldn't save the device. At one point last summer, Humane was processing more returns than they had sales. Humane even told customers to stop using the charging case due to battery fire concerns. As for the buyout, HP said they were acquiring the team and the company's AI operating system to help them create, quote, an intelligent ecosystem across all HP devices, from AI PCs to smart printers and connected conference rooms.

Gonzalo Nunez writes, So is there anything to learn from the failure of Humane? Investor Justin Duke doesn't think so, writing,

Basically, Duke is arguing that Humane was very much a creature of the 2019, 2020, 2021 era of VC, when massive checks were flying around Silicon Valley at the very end of ZIRP.

Entrepreneur Chris Back writes,

Maybe the more pertinent question is what it means about the state of AI wearables in general. One thing that makes it complicated to determine is the disconnect between when it was launched and how capabilities have changed. The Humane Pin was released in April 2024, a few months before Google released the first version of AI search that suggested eating rocks and using glue as a pizza topping. Now, however, we're at a stage where leading AI models, even small ones designed for on-device use, are as good at coding as most junior programmers. Although exactly how good they are, we'll get into in the main episode.

Still, at this point, it's not clear that people actually want an AI assistant in a standalone device. Newsletter writer Jack Appleby thinks that there's a form factor problem. He writes, the future of AI isn't new hardware, it's upgrading existing software. Control-L Dwayne writes, the first AI hardware flop. I don't know a single person who bought a Humane AI Pin, but this is brutal. This is exactly why AI hardware will only succeed when it's 100% local with no cloud or API dependencies.

I don't know, man. I'm not so sure that the lessons are as clear as people think. People have loved to rip on Humane from the very beginning, and a lot of it is absolutely self-inflicted. The overly raw marketing videos that felt like they were trying too hard to live in Steve Jobs' shadow, the price point, the amount of money raised. There were plenty of red flags for even someone who was trying to go in unbiased. It is going to be an extraordinary process of trial and error to figure out if and what sort of AI wearable experiences consumers are actually going to want.

No one has a perfect crystal ball into that future. Otherwise, they'd be making a ton of money. I'm glad that there are experiments still happening. I would say that Humane is a great reminder that extraordinarily well-funded startups tend not to be the ones to invent these sort of new experiences. But at the same time, there are some indicators of AI wearables actually getting some traction.

Best example of that may be the Ray-Ban Meta AI glasses, which are an extremely popular product. So who knows? All we know for sure is that Humane's part of the story is done for now, but I would be very surprised ultimately if that means the category of AI wearables is actually cooked.

Anyways, guys, that's going to do it for today's AI Daily Brief. One new beginning, one ending. And next up, the main episode. Today's episode is brought to you by Vanta. Trust isn't just earned, it's demanded. Whether you're a startup founder navigating your first audit or a seasoned security professional scaling your GRC program, proving your commitment to security has never been more critical or more complex. That's where Vanta comes in.

Businesses use Vanta to establish trust by automating compliance needs across over 35 frameworks like SOC 2 and ISO 27001. Centralize security workflows, complete questionnaires up to 5x faster, and proactively manage vendor risk. Vanta can help you start or scale up your security program by connecting you with auditors and experts to conduct your audit and set up your security program quickly. Plus, with automation and AI throughout the platform, Vanta gives you time back so you can focus on building your company.

Join over 9,000 global companies like Atlassian, Quora, and Factory who use Vanta to manage risk and improve security in real time.

For a limited time, this audience gets $1,000 off Vanta at vanta.com slash nlw. That's v-a-n-t-a dot com slash nlw for $1,000 off. If there is one thing that's clear about AI in 2025, it's that the agents are coming. Vertical agents by industry, horizontal agent platforms.

agents per function. If you are running a large enterprise, you will be experimenting with agents next year. And given how new this is, all of us are going to be back in pilot mode.

That's why Superintelligent is offering a new product for the beginning of this year. It's an agent readiness and opportunity audit. Over the course of a couple quick weeks, we dig in with your team to understand what type of agents make sense for you to test, what type of infrastructure support you need to be ready, and to ultimately come away with a set of actionable recommendations that get you prepared to figure out how agents can transform your business.

If you are interested in the agent readiness and opportunity audit, reach out directly to me, nlw at bsuper.ai. Put the word agent in the subject line so I know what you're talking about. And let's have you be a leader in the most dynamic part of the AI market. Hey, listeners, are you tasked with the safe deployment and use of trustworthy AI? KPMG has a first-of-its-kind AI risk and controls guide, which provides a structured approach for organizations to begin identifying AI risks and design controls to mitigate threats.

What makes KPMG's AI Risks and Controls Guide different is that it outlines practical control considerations to help businesses manage risks and accelerate value. To learn more, go to www.kpmg.us slash AI Guide. That's www.kpmg.us slash AI Guide. Welcome back to the AI Daily Brief. If you've been anywhere near AI Twitter slash X over the last few weeks,

You've probably heard this term, vibe coding. It was coined by OpenAI co-founder Andrej Karpathy, who said, There's a new kind of coding I call vibe coding, where you fully give in to the vibes, embrace exponentials, and forget that the code even exists. It's possible because the LLMs, e.g. Cursor Composer with Sonnet, are getting too good. Also, I just talk to Composer with SuperWhisper, so I barely even touch the keyboard. I ask for the dumbest things, like decrease the padding on the sidebar by half, because I'm too lazy to find it.

I accept all always. I don't read the diffs anymore. When I get error messages, I just copy-paste them in with no comment. Usually that fixes it. The code grows beyond my usual comprehension. I'd have to really read through it for a while. Sometimes the LLMs can't fix a bug, so I just work around it or ask for random changes until it goes away. It's not too bad for throwaway weekend projects, but still quite amusing. I'm building a project or a web app, but it's not really coding. I just see stuff, say stuff, run stuff, and copy-paste stuff, and it mostly works.

Now, this, as we will discuss, has begotten an entire movement of vibe coders who are thinking about new categories of tools. And it's predicated, as Karpathy points out, on the availability of a particular set of new coding tools that hit that line right between LLMs and agents in terms of how much they're being controlled by humans and how much they're actually doing for themselves. Indeed, I think part of what makes this area so interesting is that it is really at the forefront of agents in practice.

It demonstrates on the one hand how mushy some of this terminology is, but at the same time, how powerful these tools are likely to be in practice. All right, so part of the context for today's show is vibe coding, but then another little bit of background is the conversation we were having yesterday about Grok 3.

When Grok 3 launched, it showed off how it had done on a bunch of benchmarks. And I, like many people, found myself basically just having my eyes glaze over when it came to those benchmarks because they're so saturated at this point that it's really hard to actually get signal from them. As Ethan Mollick pointed out, public benchmarks are both meh and saturated, leaving a lot of AI testing to be like food reviews based on taste.

If AI is critical to work, we need more. He also pointed out that a lot of these benchmarks, quote, look nothing like actual work. And given that we spend all of our time over at Superintelligent on the actual deployment and practice of AI and agents at work, this is a particularly poignant problem.

It's also not an easy one. Another reminder from just this morning from Ethan, AI is so challenging to figure out because it's genuinely capable of doing PhD-level work in some areas while messing up basic tasks in closely related areas. And the abilities of AI are growing but unevenly.

All right, so all of this is background to our main topic today, which is a new benchmark from OpenAI called SWE-Lancer. The gist, and the question that provoked the whole conversation, was: can frontier LLMs earn $1 million from real-world freelance software engineering?

Earlier this week, OpenAI released a paper effectively seeking to test how competent their leading models are in real-world coding applications. This new SWE-Lancer benchmark consists of, quote, over 1,400 freelance software engineering tasks from Upwork valued at $1 million USD total in real-world payouts. SWE-Lancer encompasses both independent engineering tasks, ranging from $50 bug fixes to $32,000 feature implementations, and managerial tasks, where models choose between technical implementation proposals. So why is this important? Well, this gets at exactly what we were just discussing. Until now, coding benchmarks have largely involved competitive coding problems. These are tests that assess models on tricky programming puzzles, but don't translate directly into practical real-world use cases.

On top of their inapplicability to the real world, they're also, as we just mentioned, becoming increasingly saturated, making it difficult to know whether a new model represents a significant improvement or was simply trained to perform well on a known set of questions. This benchmark, then, is much more focused on the real world. And it actually harkens back to an idea that some, like Microsoft's Mustafa Suleyman, have proposed for a new type of Turing test based on how AI interacts with the real world.

Back in the middle of 2023, Mustafa Suleyman proposed a Turing test of whether AI could make a million dollars. Mustafa wrote, I think we're in a moment of genuine confusion, or perhaps more charitably debate, about what's really happening. Even as the Turing test fails, it doesn't leave us much clearer on where we are with AI or what it can actually achieve. It doesn't tell us what impact these systems will have on society or help us understand how that will play out.

His proposal then for a modern Turing test would be to give AI the instruction, go make a million dollars on a retail web platform in a few months with just a $100,000 investment. So this is a little bit different, obviously, than what OpenAI had done in that OpenAI is specifically giving the model these 1,400 freelance tasks rather than asking it to go be creative and figure out how to make that money. But the principle of getting benchmarks into the real world, plus this baselining to a million dollars, obviously are reminiscent.

Getting back to SWE-Lancer, for the purposes of this paper, the researchers set three LLMs to the task. They tested OpenAI's GPT-4o and o1 alongside Anthropic's Claude 3.5 Sonnet. Each LLM was driving a basic coding agent capable of directly interacting with a codebase. The models were given one shot to complete each task.
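To make that setup concrete, here's a minimal sketch of a one-shot, payout-weighted evaluation loop in the same spirit. To be clear, this is not OpenAI's actual harness; the task fields, the run_agent_once call, and the is_correct check are illustrative assumptions (as I understand the paper, real IC tasks are graded by engineer-verified end-to-end tests, and managerial tasks against the proposal that was actually chosen).

from dataclasses import dataclass
from typing import Callable

@dataclass
class FreelanceTask:
    task_id: str
    kind: str          # "ic_swe" (write a patch) or "manager" (pick a proposal) -- assumed labels
    prompt: str        # the job description, taken as posted
    payout_usd: float  # the real-world price attached to the task

def evaluate_one_shot(tasks: list[FreelanceTask],
                      run_agent_once: Callable[[FreelanceTask], str],
                      is_correct: Callable[[FreelanceTask, str], bool]) -> dict:
    """Give the agent exactly one attempt per task; credit the payout only on success."""
    earned = 0.0
    possible = 0.0
    solved = 0
    for task in tasks:
        possible += task.payout_usd
        attempt = run_agent_once(task)   # single attempt, no retries, no internet access
        if is_correct(task, attempt):    # e.g. tests pass, or the right proposal was picked
            earned += task.payout_usd
            solved += 1
    return {
        "solved": solved,
        "solve_rate": solved / len(tasks) if tasks else 0.0,
        "earned_usd": earned,
        "possible_usd": possible,
    }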

Overall, researchers found that, quote, the results indicate that the real-world freelance work in our benchmarks remains challenging for frontier language models. Going even farther in the abstract, they write, we find that frontier models are still unable to solve the majority of tasks. Providing a little more clarity on the tasks themselves, they were scraped directly from Upwork and Expensify with no word changes or clarification, giving the models a taste of real-world freelancing work.

The models were also denied internet access, including GitHub, ensuring that they were working based solely on their pre-trained dataset. However, they did have access to a snapshot of the code bases they were working on. The results found that none of the models had earned a million dollars as an automated freelancer. Interestingly, though, despite the fact that this research was from OpenAI, Claude 3.5 Sonnet performed the best, resolving 26% of individual contributor issues and earning $89,000 out of a possible $415,000.

For individual contributor tasks, o1 came in second place, earning $78,000, while GPT-4o performed less well, earning $29,000.

As interesting as the results, though, was the analysis. The report explained, "...agents excel at localizing but fail to root cause, resulting in partial or flawed solutions. Agents pinpoint the source of an issue remarkably quickly, using keyword searches across the whole repository to quickly locate the relevant file and functions, often far faster than a human would. However, they often exhibit a limited understanding of how the issue spans multiple components or files and fail to address the root cause, leading to solutions that are incorrect or insufficiently comprehensive."

We rarely find cases where the agent aims to reproduce the issue or fails due to not finding the right file or location to edit.
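That localization behavior is easy to picture. Here's a minimal sketch, assuming a local checkout of the codebase snapshot, of the kind of keyword search across a repository the report describes; nothing about it is specific to OpenAI's agent, and ranking the likeliest file this way says nothing about whether a change there actually addresses the root cause.

import os
import re

def locate_candidates(repo_root: str, keywords: list[str],
                      extensions=(".py", ".js", ".ts")) -> list[tuple[str, int]]:
    """Rank source files by how many times they mention the issue's keywords."""
    pattern = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)
    scores = []
    for dirpath, _dirnames, filenames in os.walk(repo_root):
        for name in filenames:
            if not name.endswith(extensions):
                continue
            path = os.path.join(dirpath, name)
            try:
                with open(path, encoding="utf-8", errors="ignore") as f:
                    hits = len(pattern.findall(f.read()))
            except OSError:
                continue
            if hits:
                scores.append((path, hits))
    return sorted(scores, key=lambda item: item[1], reverse=True)

# Hypothetical usage: locate_candidates("/path/to/repo-snapshot", ["duplicate", "receipt", "refund"])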

For the managerial tasks, each model displayed better performance. Claude 3.5 Sonnet was again the best performing model, earning $314,000 of a possible $585,000 and completing 54% of tasks. o1 was hot on its heels, correctly completing 52% of tasks for a total of $302,000. And even GPT-4o, bringing up the rear, still managed 47% of tasks to earn $275,000.

This showed that the models were all decent at choosing the right solution when presented with several options, but still have a long way to go until they can fully replace a technical lead.

Overall, Claude 3.5 Sonnet won the day, earning $403,000 overall with a 40% completion rate. o1 earned $380,000 while completing 38% of the full set of tasks, and GPT-4o finished 30% of tasks, earning $304,000. Now, to be clear, no money was actually earned. These tasks were all simulated, but that's how much they would have earned had the AI actually been in charge of that job from Upwork or Expensify.
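As a quick sanity check on those figures, here's the arithmetic using the dollar amounts as cited in this episode: each model's overall earnings are just its IC and managerial payouts added together, measured against the combined $1 million pool.

# Per-model earnings as cited above: (IC tasks, managerial tasks), in USD.
results = {
    "Claude 3.5 Sonnet": (89_000, 314_000),
    "o1":                (78_000, 302_000),
    "GPT-4o":            (29_000, 275_000),
}
total_pool = 415_000 + 585_000  # IC pool + managerial pool = $1,000,000

for model, (ic, managerial) in results.items():
    overall = ic + managerial
    print(f"{model}: ${overall:,} earned, {overall / total_pool:.0%} of the $1M pool")

# Reproduces the overall figures quoted above: roughly $403,000, $380,000, and $304,000,
# with pool shares that land close to the quoted completion rates.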

Part of what's so interesting about this, and we'll get to this in a moment in the commentary, is that this absolutely reflects the broad consensus that people have had for some time, which is that Claude 3.5 Sonnet is just by far and away the best coding model. We've even talked about how its ubiquity as a coding model created some challenges for Anthropic's economic report, given what a high percentage of Claude's use comes from those coding use cases.

Now, in terms of commentary and the response to this so far, a lot of it is focusing on exactly this weird contrast that we've identified. Mihir Patel writes, there's increasingly a difference between academic benchmarks and real-world use cases. How are o1 and o3 top competitive programmers yet still worse than Sonnet 3.5 on SWE-Lancer and Cursor AI? As always, evals remain hard and messy. And still, somehow, Sonnet is the best code model.

Benjamin De Kraker, who was previously on the team at xAI but fired for saying that Grok 3 wasn't the second coming, noted that it was bold of OpenAI to show that Claude 3.5 Sonnet outperformed o1 on their own benchmark. Synthetica Lab responded, I'm not benchmarking, but in a real project that I'm working on in C++, o1 was basically unusable. They then went on to share their experience with o1, Claude 3.5, and Grok 3, again pointing out that these benchmarks are really not necessarily useful for understanding how things are going to work in the real world.

Another interesting comment came from Henry Shi, the founder of Super.com. He pointed out that in a previous experiment he had run that was very similar, while they had reached the same conclusion that, quote, frontier models are still unable to solve the majority of tasks, he also wrote, what's interesting and underappreciated in the paper is that o1 is able to solve almost 50% of all IC SWE tasks on the Upwork benchmark. This makes sense, as human freelancers rarely get the solution right on the first try. There's a lot of back and forth and clarification required with the client.

If AI agents are able to effectively iterate on a problem, they should be able to drastically improve performance, just like humans improving with feedback in the workplace. In other words, for the sake of this benchmark, these model-powered agents were given a single chance to do it. That's not actually how it would work in the real world. And so as the user experience and interactive capabilities of agents go up, it's likely that in real-world settings, they'd be able to outperform even where they got during this test.
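To illustrate Shi's point, here's a minimal sketch of what a multi-attempt loop with feedback might look like; the attempt_fn and check_fn callables are hypothetical stand-ins, not anything from the paper.

def solve_with_feedback(task, attempt_fn, check_fn, max_attempts=3):
    """Let the agent retry with feedback, instead of the single shot used in SWE-Lancer.

    attempt_fn(task, feedback) -> a candidate solution (hypothetical agent call)
    check_fn(task, candidate)  -> (passed, feedback), e.g. failing test output
    """
    feedback = None
    for attempt in range(1, max_attempts + 1):
        candidate = attempt_fn(task, feedback)
        passed, feedback = check_fn(task, candidate)
        if passed:
            return candidate, attempt
    return None, max_attempts  # unsolved after the allotted attempts

This is the same intuition behind pass@k metrics, where models solve noticeably more problems when allowed multiple samples, and behind giving agents concrete failure output between tries.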

Another thing that some pointed out was the likelihood that this means that OpenAI is actually building an end-to-end production coding agent. Developer Nick Dobos writes, if they took the time to build a benchmark, it means they are building a product to test an agent against it.

We haven't talked about this all that much on this show, but I'm fairly certain that in a world where it's increasingly clear that the underlying models are going to be commoditized and that there's not going to be much moat when it comes to technology, I think OpenAI has a much stronger incentive to own the customer experience end to end. And my guess is that they are looking at agents in just about every key domain of work. Now, going back to this broader idea of vibe coding, I wanted to flag just how big a theme this has gotten to be.

Like I said, I think that coding is one of the areas where agents are coming to production and actually being deployed for businesses most quickly. And I think that this whole idea of vibe coding is really fleshing out the spectrum of code creation, from no code all the way to coding agents, all the way to traditional coding experiences. A16Z recently did a new market map of these types of tools.

People like Riley Brown, the number one AI creator on TikTok, have gone all in on vibe coding, even working on some tools to improve how people do their vibe coding now. Brown also shared some interesting thoughts recently about how this might change the structure of the economy. Specifically, he points out that as creators can monetize their audiences with software rather than things like courses and ads, it creates a very different type of economic opportunity, one that's starting to be reflected in a new generation of VC creator funds. And speaking of VCs, it's very clear that there is lots of interest in this area.

A16Z's Andrew Chen tweets,

Point being that when we look at coding right now, not only are we talking about disruption to the way that coding happens among traditional software engineers, we're also talking about totally different modalities and an expansion of who gets to actually push code.

At the same time, even as all of these people get excited about what they can do that they couldn't do before because they weren't coders, that's not the same as these tools being able to be inserted willy-nilly into enterprise code processes. And so a lot of the work over the next couple of years is going to be to figure out how these experiences diverge and what type of coding agents are good for different settings. Still, it is an absolutely fascinating time, and I am very excited to see what comes next. For now, though, that is going to do it for today's AI Daily Brief.

Appreciate you listening as always. Until next time, peace.