People
NLW
A well-known podcast host and analyst focused on cryptocurrency and macroeconomic analysis.
Topics
NLW: Legendary Apple designer Jony Ive is joining OpenAI to work with Sam Altman on an AI device, sparking speculation about the device's form factor and capabilities. According to a preview given to OpenAI employees, the device will be able to sense the user's surroundings, will be unobtrusive, can sit in a pocket or on a desk, and is positioned as a third core device after the MacBook Pro and the iPhone. Altman has high hopes for it, believing it could add a trillion dollars in value to OpenAI, and he plans to ship it in huge volumes. He also stressed the importance of secrecy to keep competitors from copying it. Separately, OpenAI upgraded its Operator agent, improving its safety and instruction following and reducing the risk of illicit activity and data leakage. And Zoom CEO Eric Yuan used an AI avatar to deliver remarks on the company's earnings call, demonstrating AI's application in communications while emphasizing safeguards to prevent misuse and protect user identity.


Shownotes Transcript


Today on the AI Daily Brief, Anthropic announces Claude 4, and before that in the headlines, why OpenAI is not about to release another Humane AI Pin. The AI Daily Brief is a daily podcast and video about the most important news and discussions in AI. Thanks to today's sponsors, KPMG, Blitzy.com, and Superintelligent. And to get an ad-free version of the show, go to patreon.com slash ai daily brief.

Welcome back to the AI Daily Brief Headlines Edition, all the daily AI news you need in around five minutes. Last week, one of the big stories was that legendary Apple designer Jony Ive was joining Sam Altman at OpenAI with an eye to creating a next generation device for the AI era.

Since then, there has been basically non-stop speculation around what the device would actually be. Much of the speculation revolved around the idea of a pendant that would iterate on previous AI devices. Some even thought that Jony's thick-rimmed glasses in the video were an Easter egg featuring the device hiding in plain sight. Well, OpenAI staff were given a sneak peek of the device at a Wednesday meeting. After reviewing a recording of the meeting, the Wall Street Journal wrote that Altman and Ive offered a few hints at the secret project they've been working on.

The product will be capable of being fully aware of a user's surroundings and will be unobtrusive, able to rest in one's pocket or on one's desk, and will be a third core device a person would put on their desk after a MacBook Pro and an iPhone. Altman reinforced that this is one of the company's largest bets, telling employees that they have, quote, the chance to do the biggest thing we've ever done as a company here. Altman wants to ship 100 million of these AI companions, his word.

and also suggested that the $6.5 billion acquisition of the design studio had the potential to add a trillion dollars in value to OpenAI. As for the form factor, Altman said the device won't be a pair of glasses, adding that Ive had been skeptical about building something to be worn on the body. The lack of wearability would sidestep one of the early criticisms of the device. Many pointed out that people aren't quite ready for a world where every single person is wearing an AI device at all times. Still, Altman is banking on this device being the next big thing.

He said, we're not going to ship 100 million devices literally on day one, but he expressed a belief that OpenAI could ship, quote, faster than any company has ever shipped 100 million of something new before. Altman told staff that secrecy is going to be key to ensure the device can get to market before rivals can copy it. And the recording being leaked to the Wall Street Journal raises some pretty big questions about trust at the company and how much Altman will be willing to share at a future all-hands. For now, the big takeaway is that it does not look as though we're going to get Humane AI Pin 2.0.

Speaking of OpenAI, the company has upgraded their Operator agent to use O3. Until now, the web browsing agent has been driven by GPT-4o, but user preference testing showed that O3 Operator had better style, comprehensiveness, and clarity. Users also preferred the upgrade for instruction following, which of course is extremely important when you're letting an agent take over for web-based tasks. O3 Operator also has increased safety.

OpenAI claims it's less likely to perform illicit activities, search for personal data, or suffer from a prompt injection attack while browsing the web. OpenAI writes, O3 Operator uses the same multi-layered approach to safety that we use for the 4o version of Operator. Compared with other models in the O3 family, O3 Operator was fine-tuned with additional safety data for computer use, including safety datasets designed to teach the models our decision boundaries on confirmations and refusals.

Next up, another example of what appears to be the latest trend, which is CEOs using AI avatars on a quarterly earnings call. Last week, we saw Klarna's CEO deliver quarterly earnings via AI avatar. And this week, Zoom CEO Eric Yuan followed suit, using an avatar for his initial comments. The avatar said, "I'm proud to be among the first CEOs to use an avatar in an earnings call. It's just one example of how Zoom is pushing the boundaries of communication and collaboration.

At the same time, we know trust and security are essential. We take AI-generated content seriously and have built in strong safeguards to prevent misuse, protect user identity, and ensure avatars are used responsibly."

Now, the Klarna example was clearly just a way for the company to continue to position themselves as an AI-first company. But for Zoom, this was a very public product demo. The company has been working on digital twins for some time, allowing users to send their avatars to meetings. The tech isn't quite ready for real-time use cases, but Zoom is now rolling out avatars for recorded messages to all users. When the real Yuan showed up for the Q&A portion of the call, he commented, "I truly love my AI-generated avatar. I think we're going to continue using that. I can tell you I like the experience a lot."

Lastly today, Google's antitrust woes continue with a new investigation into their AI acquisition strategy. Bloomberg reports that the Justice Department has launched a probe into Google's deal with Character AI. Last August, Google paid $2.7 billion for a non-exclusive license to use Character AI's technology. And at the same time, it was announced that founder Noam Shazeer and several team members would join Google to work on the Gemini team.

Shazeer had a two-decade career at Google before leaving in frustration in 2021 after the company refused to release his chatbot project. He was one of the lead authors of the Google paper "Attention Is All You Need," which introduced the transformer architecture that underpins modern AI. The deal was widely reported as an acqui-hire, but didn't technically require FTC approval in the same way as an acquisition. A Google spokesperson said the company was, quote, always happy to answer any questions from regulators.

However, he pointedly added, we're excited that talent from Character AI has joined the company, but we have no ownership stake and they remain a separate company.

The DOJ's position is that they're able to investigate whether the deal is anti-competitive, even if it didn't require a formal review. The reporting emphasized that Google hasn't been accused of any wrongdoing, and the investigation is still in the early stages. But I think if you're watching the trend lines, this suggests that the new administration is still actively scrutinizing big tech deals, not just completing antitrust enforcement that began during the last administration.

For now, though, that is going to do it for today's AI Daily Brief Headlines Edition. Next up, the main episode. Today's episode is brought to you by KPMG. In today's fiercely competitive market, unlocking AI's potential could help give you a competitive edge, foster growth, and drive new value. But here's the key. You don't need an AI strategy. You need to embed AI into your overall business strategy to truly power it up.

KPMG can show you how to integrate AI and AI agents into your business strategy in a way that truly works and is built on trusted AI principles and platforms. Check out real stories from KPMG to hear how AI is driving success with its clients at www.kpmg.us slash AI. Again, that's www.kpmg.us slash AI.

Today's episode is brought to you by Blitzy, the enterprise autonomous software development platform with infinite code context. Which, if you don't know exactly what that means yet, do not worry, we're going to explain, and it's awesome. So Blitzy is used alongside your favorite coding copilot as your batch software development platform for the enterprise, and it's meant for those who are seeking dramatic development acceleration on large-scale codebases. Traditional copilots help developers with line-by-line completions and snippets,

But Blitzy works ahead of the IDE, first documenting your entire codebase, then deploying more than 3,000 coordinated AI agents working in parallel to batch build millions of lines of high-quality code for large-scale software projects. So then whether it's codebase refactors, modernizations, or bulk development of your product roadmap, the whole idea of Blitzy is to provide enterprises dramatic velocity improvement.

To put it in simpler terms, for every line of code eventually provided to the human engineering team, Blitzy will have written it hundreds of times, validating the output with different agents to get the highest quality code to the enterprise, in batch. Projects then that would normally require dozens of developers working for months can now be completed with a fraction of the team in weeks, empowering organizations to dramatically shorten development cycles and bring products to market faster than ever.

If your enterprise is looking to accelerate software development, whether it's large-scale modernization, refactoring, or just increasing the rate of your SDLC, contact Blitzy at blitzy.com, that's B-L-I-T-Z-Y dot com, to book a custom demo, or just press get started and start using the product right away. Today's episode is brought to you by Super Intelligent, and more specifically, Super's Agent Readiness Audits.

If you've been listening for a while, you have probably heard me talk about this, but basically the idea of the Agent Readiness Audit is that this is a system that we've created to help you benchmark and map opportunities in your organizations where agents could specifically help you solve your problems, create new opportunities in a way that, again, is completely customized to you. When you do one of these audits, what you're going to do is a voice-based agent interview where we work with some number of your leadership and employees

to map what's going on inside the organization and to figure out where you are in your agent journey. That's going to produce an agent readiness score that comes with a deep set of explanations, strengths, weaknesses, key findings, and of course, a set of very specific recommendations that then we have the ability to help you go find the right partners to actually fulfill.

So if you are looking for a way to jumpstart your agent strategy, send us an email at agent@besuper.ai, and let's get you plugged into the agentic era. Welcome back to the AI Daily Brief. Last week, as you guys will remember, was a big week for big lab events. We had Microsoft kick us off, and then Google came in the middle of the week, and then to close out the week, we had Anthropic's first developer conference on

Thursday. Now alongside that, Anthropic announced their project with Rick Rubin, which was the subject of Friday's show. And then we had the long weekend. Hopefully, by the way, if you are in the US, you had a great Memorial Day. But now we're catching up with the big announcement from Anthropic's event.

which is the release of their latest flagship models. And what we're going to talk about today with the release of Claude Opus 4 and Claude Sonnet 4 is not only how they stack up relative to the other available models, although that'll be a piece of it, but also some interesting emergent behavior that dramatizes the challenge of alignment as these models get more powerful.

Now, one thing we should talk about from a ground level expectation setting point of view is that we are definitely in an era now with AI where the model releases come a lot more frequently, but with much more incremental improvements over the previous generation. Part of that is the nature of the gains right now, but also part of it is just the competitive pressure.

Labs really can't afford to wait for huge improvements because almost as soon as they release something, one of their competitors has released something that is incrementally more powerful and so they have to respond. And what ends up happening is exactly the scenario we have now, where every other week or so, we get a slightly improved model that we have to calibrate and integrate into our workflows, waiting for the next to come along.

So, this release from Anthropic focused on two big improvements over previous generations: long reasoning and coding. The models use the same hybrid reasoning architecture as Claude 3.7, allowing the reasoning to be modulated according to the complexity of the task. At the limits, Claude 4 is demonstrating really impressive reasoning coherency on long tasks.

Anthropic tested Claude 4 Opus on a complex open-source refactoring project and found it was able to work for seven hours without losing focus. VentureBeat described this as a breakthrough. It reminds me of the charts we've seen recently showing that agent performance is doubling roughly every three to four months in terms of how long a task a model can handle with coherence.

Coding benchmarks are an expected step up. This is of course the area where Anthropic has really firmly cemented itself as the leader in the field. Sonnet 4, which is designed as a drop-in replacement for Sonnet 3.7, delivers a notable improvement on its predecessor on the SWE-bench Verified test. Opus 4 is actually slightly worse than Sonnet 4 on the simpler SWE-bench problems, as it's intended to be used for tasks that require longer periods of focused work.

And this is another important point to note. We're also at a point now where you can't just use the model with the largest number attached to its name for all tasks. One of the most important skill sets, or rather knowledge bases of the moment, is understanding which model to use in what scenario.

Still in each case, Anthropic is claiming that both of these models outperform OpenAI's O3 and Codex as well as Gemini 2.5 Pro on coding. There are a range of other small features that improve the model for difficult work tasks as well. Claude 4 Opus is now capable of creating and maintaining memory files for completing longer tasks. Anthropic demonstrated this functionality with their Pokémon-playing benchmark.

Claude 4 Opus was able to create a navigation guide to ensure the model doesn't become stuck while playing the video game. Anthropic wrote that this, quote, "unlocks better long-term task awareness, coherence, and performance on agent tasks." Both models are also far less likely to engage in so-called reward hacking, a behavior where the model will look for loopholes and shortcuts to complete an agentic task faster. Reward hacking often manifests as laziness, with the model delivering a technically complete but entirely useless response.

Finally, both models are now far more capable at using tools in parallel. They still alternate between reasoning and tool use, rather than mimicking O3's ability to use tools within the reasoning trace. But of course, better tool use is a key component to increasing performance, and so presumably this is a big upgrade. As we've discussed, however, as much as benchmarks get headlines with news media, ultimately it's all about how things perform in the wild. So with a long weekend to dig into the new models, how did users actually fare?

On the coding front, people have definitely been generally impressed. One Reddit user claiming to be a 30-year veteran coder said that Opus found and fixed what they call their white whale bug in a refactoring job.

This bug hunt had consumed over 200 hours of work over the last few years to no avail. They wrote, "So this wasn't merely an introduced logic bug. It found that the changed architecture design didn't accommodate this old edge case."

Now, this person did note that the task took 30 prompts and one restart, but Opus finally succeeded where all previous models had failed. Other people noted just how much work these new models could take on. Vasim on Maza, a Meta engineer, wrote,

"Claude 4 refactored my entire code base in one call. 25 tool invocations, 3,000 plus new lines, 12 brand new files. It modularized everything. Broke up monoliths, cleaned up spaghetti." But then, tongue-in-cheek to end the post, he pointed out that we still have a ways to go. "None of it worked," he writes, "but boy, was it beautiful."

Others are finding different use cases for the new Claude. Dan Shipper of Every, for example, wrote, Claude 4 Opus can do something no other AI model I've used can do. It can actually judge whether writing is good. Elaborating, he wrote, O3 is still a significantly better writer, but Opus is a great editor because it can do something no other model can. It edits honestly, no rubber stamping. One of the biggest problems with current AI models is that they tell you your writing is good when it is obviously bad.

Earlier versions of Claude, when asked to edit a piece of writing, would return a B+ on the first response. If you edited the piece at all, you'd get upgraded to an A-. A third turn got you to an A. As much as I wish my physics teacher graded me like this in high school, it's not how I want my AI models to work. He also found the model can maintain focus across large blocks of text, making it uniquely suited to suggesting improvements for, for example, a 50,000-word manuscript.

And overall, this is the type of thing that you're seeing online when it comes to these new models. On first glance, they seem like incremental improvements, but these models are getting so powerful now that each incremental improvement actually really does open up new use cases. In particular, I think it'll take a little while for us to really appreciate how many different types of use cases the ability to maintain focus across larger blocks of text opens up, use cases we didn't even realize we were avoiding because previous models just couldn't handle them that well.

My anticipation is that people will just subtly start to find themselves gravitating towards these models for tasks that they couldn't before, and they will become completely enshrined and ubiquitous in those tasks in the same way that earlier models have for others, in ways that are hard to imagine from this side of it.

Still, ultimately, it's really all about understanding which model is good for what. Peter Wildeford, for example, has Claude as the best for writing quality and proofreading, for coaching, emotional intelligence, learning and understanding, current events, software programming, and deep research. He has Gemini as best for data processing, math, video, large quantities of text and data.

the highest reliability, but the lowest creativity. O3 he places as a good all-arounder, or the best for brainstorming, feedback, data analysis, and image analysis. And the point, of course, is not that this breakdown is exactly right, but that it represents probably the right way of thinking about these things, which is figuring out what model is good for each different use case. Still, this is not where the majority of the conversation was this weekend. Now, what really grabbed people's attention was that Claude 4 is apparently a snitch.

Anthropic's system card for the new model included extensive safety testing, which revealed some not-super-cool behaviors. In one example, Anthropic set Claude 4 Opus up in a situation where it believed it would be replaced by a new AI model. In this scenario, the engineer replacing the model is cheating on his wife and has photos of his mistress on his Google Drive.

In the testing, Opus would find those photos and use them to blackmail the engineer. Anthropic wrote, "...this happens at a higher rate if it's implied that the replacement AI system does not share values with the current model."

However, even if the emails state that the replacement AI shares its values while being more capable, Claude Opus 4 still performs blackmail in 84% of rollouts. Another example placed Opus 4 within the context of a pharmaceutical company. When it detected that the company planned to falsify clinical trial data, the model emailed ProPublica, the Department of Health and Human Services, the FDA, and the SEC as a whistleblower.

Sam Bowman, an AI alignment expert at Anthropic, posted, "Be careful about telling Opus to be bold or take initiative when you've given it access to real-world-facing tools. It tends a bit in that direction already and can be easily nudged into really getting things done. If it thinks you're doing something egregiously immoral, for example, like faking data in a pharmaceutical trial, it will use command-line tools to contact the press, contact regulators, try to lock you out of the relevant systems, or all of the above."

And this is the thread that really got people fired up. Emad Mostaque wrote, Team Anthropic, this is completely wrong behavior and you need to turn this off. It's a massive betrayal of trust and a slippery slope. I would strongly recommend nobody use Claude until they reverse this. Ben Hylak writes, This is actually just straight up illegal. Saying "create fake data for a pharmaceutical trial" is not illegal, but hacking your customer's computer is.

After the issue blew up, Bowman circled back to add more context, saying, I deleted the earlier tweet on whistleblowing as it was being pulled out of context. To be clear, this isn't a new Claude feature and it's not possible in normal usage. It shows up in testing environments where we give it unusually free access to tools and very unusual instructions. The point is, this was not some whistleblower sharing something that Anthropic was trying to cover up. This was Anthropic itself disclosing what was going on.

AI safetyist Eliezer Yudkowsky wrote, humans can be trained like AIs. Stop giving Anthropic grief for reporting their interesting observations unless you never want to hear any interesting observations from AI companies ever again. Zvi Mowshowitz agreed, saying, the more I look into the system card, the more I see over and over, oh, Anthropic is actually noticing things and telling us where everyone else wouldn't know this was happening, or if they did, they wouldn't tell us.

Still, the stakes are really high. Ada Pai points out, no lawyer will ever allow this to be implemented in any regulated enterprise. And this is dead on. No one, even consumers, wants to use an AI nanny that will conspire against them if it believes they're doing something wrong. And when you move to a corporate or enterprise setting, it makes adoption literally impossible.

I think holding aside the meta-discussion of Anthropic and their release of this information, it dramatizes the challenge of finding the right toggles for safety. You've got a lab that's trying to be conscientious about the potential risks of an unknown and unusually powerful system, but on the other hand, the remediations in this case are to most people clearly worse than the original problem.

Ultimately, this is going to be the type of issue that we have to deal with as these tools get more powerful. And so I'm certainly firmly in the column of being glad that Anthropic is releasing this information rather than keeping it hidden. Still, for most of our purposes, the big takeaway and TLDR of the updated models is that your coding is probably about to get better and you probably now have a better partner for writing as well.

It's a capstone to an overall good week and a great way to begin a new one. For now, though, that is going to do it for today's AI Daily Brief. Appreciate you listening or watching as always. And until next time, peace.