
The Next Step in Our Journey to AI Agents: Anthropic's Computer Use

2024/10/24

The AI Daily Brief (Formerly The AI Breakdown): Artificial Intelligence News and Analysis

People
Alex Albert
Alex Volkov
An AI entrepreneur
Blake
Liam Alfaro
Michele Catasta
Tony Gey
Host
A podcast host and content creator focused on electric vehicles and energy.
Multiple commentators
Topics
Host: OpenAI's o1 model takes a different approach from GPT-4o and is better at tasks that require reasoning and step-by-step work, such as coding and math problems. Stages of AI development: level one is conversational AI; level two is AI with reasoning capabilities; level three is AI agents that can take actions; levels four and five are more advanced agent capabilities. Anthropic's "computer use" feature lets AI use a computer the way a person does, a major advance in AI autonomy that will unlock many previously impossible applications. Claude controls the mouse pointer by counting pixels and can self-correct and retry when it hits obstacles, which suggests the ability to use a computer changes how the AI "thinks." The feature is still experimental and error-prone, some actions that are simple for humans remain hard for the AI, and it is currently available only through the API. It has not been trained on navigating the internet, and Anthropic believes it is safer to introduce the capability while model risk is still low.
Alex Albert: Computer use is a major shift in human-computer interaction; within a few years the way we interface with computers will change fundamentally, with AI operating computers as fluently as humans and completing more complex tasks. One computer use API example involved engineers having Claude order takeout.
An AI entrepreneur: Anthropic's computer use API offers a way for AI to operate a computer like a human, but security is a challenge.
Alex Volkov: Anthropic's computer use API can be used to gather information and fill out job applications.
Michele Catasta: Computer use can serve as a replacement for human feedback and accelerate the path to fully autonomous AI agents.
Tony Gey: Anthropic's computer use model is expensive; it needs lower costs or more valuable use cases to be worthwhile.
Liam Alfaro: Anthropic's new computer use model could lead to humans no longer understanding software interfaces.
Blake: If today's agentic automation is the cheapest and slowest it will ever be, the future will be very exciting.
Multiple commentators: Anthropic's computer use feature marks the arrival of the era of AI agents and signals a fundamental shift in how humans interact with computers.


Chapters
Anthropic's new Computer Use feature allows AI to interact with computers like humans, marking a significant advancement. This capability enables AI to perform tasks such as navigating software and inputting information, opening doors to new applications across various fields. This innovation signifies a shift from model-centric to capability-centric AI development.
  • Anthropic's Computer Use allows AI to interact with computers like humans, using a mouse, keyboard, and screen.
  • This is a shift from model innovation to capability innovation.
  • This technology unlocks new applications by enabling AI to use any software.

Transcript


Today on the AI Daily Brief, the next step towards our agentic future. The AI Daily Brief is a daily podcast and video about the most important news and discussions in AI. To join the conversation, follow the Discord link in our show notes.

Hello, friends. Quick note before we dive in: today, instead of our normal setup where we have headlines followed by the main episode, we are just doing a long, extended main episode that is all about this big advance from Anthropic. I think you'll understand why as you dig in, but that is why you're not getting any headlines. Enjoy.

And later this week, we'll be back to more normal formats. There have been two moments over the past few weeks where we really got a chance to step back and see the very beginnings of a branch off on the AI evolutionary tree. The first of those was, of course, OpenAI's o1.

This was their reasoning model, and it wasn't just a bigger model than GPT-4o; it takes a fundamentally different approach to how it works. o1 basically has a built-in, sort of chain-of-thought approach that breaks down complex tasks into simpler steps and reasons through them sequentially before generating a response.

This is why, by the way, one of the things that we learned when prompting o1 as opposed to GPT-4o is that you don't need to add things like "think step by step." Now, the net impact of o1's reasoning approach is that it's much better able to handle things like coding and math. And when it comes to business, it's better at things that have distinct right answers based on inputs. So it might not write a poem any better than GPT-4o, for example.

But if you're trying to figure out the ideal arrangement of a banquet hall for a big convention, and you can give it all of the relevant inputs, it's going to be much better at figuring that out than, for example, GPT-4o. And while the difference is subtle, the important thing, like I said, is that it represents this branch off of the LLM tree, where we are moving slowly but surely into a new reasoning era, which is, of course, in and of itself, the beginning of a new agentic era. OpenAI also recently shared their stages of artificial intelligence. Level one was chatbots, AI with conversational language.

Level two, which o1 represents, is reasoners with human-level problem solving. Level three, which we're not at yet, is agents, systems that can take actions. And levels four and five are basically what collections of agents can do and more advanced abilities.

So level four is innovators, AI that can aid in invention. Level five is organizations, AI that can do the work of a full organization. This is the relevant setup for what Anthropic recently announced, which is called computer use.

Now, this is part of a larger announcement that also included model updates, including an upgraded Claude 3.5 Sonnet as well as a new model called Claude 3.5 Haiku. But there is no doubt that the main discussion and excitement was around computer use. Developers, Anthropic writes, can now direct Claude to use computers the way people do: by looking at a screen, moving a cursor, clicking buttons, and typing text.

They write: Claude 3.5 Sonnet can now follow user commands to move a cursor around their computer screen, click on relevant locations, and input information via a virtual keyboard, basically emulating how people interact with their computers. Now, the version of this that they started exploring is purposefully very general. It is not use-case specific, and that seems very intentional.
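To make that concrete, here is a minimal sketch of what directing Claude through the computer use beta looks like with the Anthropic Python SDK. The model name, beta flag, and tool type follow Anthropic's documentation from around the announcement; the display dimensions and the prompt are illustrative assumptions, and the exact identifiers may have changed since.

```python
# A minimal sketch of a computer use request via the Anthropic Python SDK.
# The model name, beta flag, and tool type follow the October 2024 beta
# documentation; the display size and the prompt are illustrative assumptions.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.beta.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    betas=["computer-use-2024-10-22"],
    tools=[{
        "type": "computer_20241022",   # the virtual screen/mouse/keyboard tool
        "name": "computer",
        "display_width_px": 1024,      # illustrative screen dimensions
        "display_height_px": 768,
    }],
    messages=[{"role": "user", "content": "Open the calculator app and add 2 and 2."}],
)

# Claude responds with tool_use blocks describing concrete GUI actions, e.g.
# {"action": "screenshot"} or {"action": "mouse_move", "coordinate": [x, y]},
# which the developer's own code is responsible for actually executing.
for block in response.content:
    if block.type == "tool_use":
        print(block.input)
```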

They write: a vast amount of modern work happens via computers. Enabling AIs to interact directly with computer software in the same way people do will unlock a huge range of applications that simply aren't possible for the current generation of AI assistants.

Now, in a similar way to the idea that o1 was not just a bigger or better model but a different approach, so too is computer use not model innovation but capabilities innovation. Anthropic writes: over the last few years, many important milestones have been reached in the development of powerful AI, for example the ability to perform complex logical reasoning and the ability to see and understand images. The next frontier, they argue, is computer use: AI models that don't have to interact via bespoke tools, but that instead are empowered to use essentially any piece of software as instructed.

The announcement post also gave a little bit of the background around how this came together, and one of the cool behind-the-scenes things is how this works. They write: when a developer tasks Claude with using a piece of computer software and gives it the necessary access, Claude looks at screenshots of what's visible to the user, then counts how many pixels vertically or horizontally it needs to move a cursor in order to click in the correct place. They continue: training Claude to count pixels accurately was critical; without this skill, the model finds it difficult to give mouse commands.

So this is a big part of the secret of these capabilities: it literally counts pixels. However, they also found that these new capabilities unlocked a lot of things the model wasn't specifically trained on. They say: we were surprised by how rapidly Claude generalized from the computer use training we gave it on just a few pieces of simple software, such as a calculator and a text editor. In combination with Claude's other skills, this training granted it the remarkable ability to turn the user's written prompt into a sequence of logical steps and then take actions on the computer. We observed that the model would self-correct and retry tasks when it encountered obstacles. Basically, in the same way that the ability to use a computer changes how we think, the ability to use a computer seems to change the way that the LLM, quote-unquote, thinks.
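Because Claude only proposes actions, the developer's code has to sit in a loop around the model: execute each action it asks for, take a fresh screenshot, and send the result back so the model can verify its work and retry when something goes wrong, which is the self-correction behavior described above. Here is a hedged sketch of such a loop; take_screenshot and execute_action are hypothetical stand-ins for whatever screen-capture and input-automation libraries a developer wires up, and the API identifiers mirror the beta documentation at the time.

```python
# A minimal sketch of the client-side loop implied by the announcement: Claude
# proposes pixel-level actions, the host machine executes them, and a fresh
# screenshot goes back so the model can check its work and retry.
import anthropic

client = anthropic.Anthropic()

TOOLS = [{
    "type": "computer_20241022",
    "name": "computer",
    "display_width_px": 1024,   # illustrative display size
    "display_height_px": 768,
}]


def take_screenshot() -> dict:
    """Hypothetical stub: capture the screen and return an image content block."""
    raise NotImplementedError("wire up a real screen-capture library here")


def execute_action(action: dict) -> None:
    """Hypothetical stub: drive the real mouse/keyboard for the requested action."""
    raise NotImplementedError("wire up a real input-automation library here")


def run_task(prompt: str, max_turns: int = 20) -> None:
    messages = [{"role": "user", "content": prompt}]
    for _ in range(max_turns):
        response = client.beta.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=1024,
            betas=["computer-use-2024-10-22"],
            tools=TOOLS,
            messages=messages,
        )
        tool_uses = [b for b in response.content if b.type == "tool_use"]
        if not tool_uses:
            break  # no more actions requested; the model considers the task done

        # Keep the assistant turn, execute each requested action, and reply with
        # tool results (including a new screenshot) so Claude can self-correct.
        messages.append({"role": "assistant", "content": response.content})
        results = []
        for block in tool_uses:
            # block.input looks like {"action": "mouse_move", "coordinate": [312, 454]},
            # {"action": "type", "text": "..."}, or {"action": "screenshot"}
            execute_action(block.input)
            results.append({
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": [take_screenshot()],
            })
        messages.append({"role": "user", "content": results})
```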

Anthropic sums up the shift by saying that computer use is a completely different approach to AI development. Up until now, LLM developers have made tools fit the model, producing custom environments where AIs use specially designed tools to complete various tasks. Now we can make the model fit the tools: Claude can fit into the computer environments we all use every day. Our goal is for Claude to take preexisting pieces of computer software and simply use them as a person would. Now, a couple of caveats to all of this.

First, Anthropic is quite clear that this is very experimental at this stage and that it tends to be pretty error-prone. They also say that there are a bunch of actions that seem incredibly easy and effortless for people that are difficult or impossible for Claude's computer use; those include scrolling, dragging, and zooming. The framework in which they're releasing this is as an experiment, and right now it's only available through the API. This isn't something that a general user of Claude can just fire up and do; it's something that's going to require a developer to actually set up a specific application for. Overall, the tone from Anthropic is very much: this is a first glimpse into the future, not a production-ready product.

They even joke: even while recording these demos, we encountered some amusing moments. In one, Claude accidentally stopped a long-running screen recording, causing all footage to be lost. Later, Claude took a break from our coding demo and began to peruse photos of Yellowstone National Park.

So we can say for the machines that at least it has good taste. Now, I should note that Anthropic also does get into some of the questions around safety. They did not, for example, train computer use on navigating the internet yet.

They note that Claude 3.5 Sonnet, even with computer use, still remains at AI Safety Level 2, meaning that it, quote, doesn't require a higher standard of safety and security measures than those we currently have in place. They continue: when future models require AI Safety Level 3 or 4 safeguards because they present catastrophic risks, computer use might exacerbate those risks. We judge that it's likely better to introduce computer use now, while models still only need AI Safety Level 2 safeguards. That means we can begin grappling with any safety issues before the stakes are too high, rather than adding computer use capabilities for the first time into a model with much more serious risks. For what it's worth, this sounds very similar to how OpenAI always talks about their approach to safety.

In other words, iterative deployment that allows all of us to adapt to new capabilities in something of a more incremental way. Today's episode is brought to you by Venice. Venice is a private, uncensored, generative AI app. It accesses open-source models to enable text, image, and code generation without the fear of being spied on or having your data harvested. Discuss anything with Venice without concern about it being monitored, sold, or given to advertisers or governments. Venice is different because your conversations and creations are kept securely within the browser, never stored or accessible by Venice, unlike other AI apps.

Venice doesn't want to tell you what's okay to say or not, and it won't patronize you. It simply provides direct access to machine intelligence.

No topics are off limits, no ideas taboo. With Venice, you're in control of the AI, as you should be. Pro subscriptions are available for forty-nine dollars a year or eight dollars per month. AI Daily Brief listeners receive a twenty percent discount on Venice Pro. Visit venice.ai and enter the discount code NLWDAILYBRIEF.

That's NLWDAILYBRIEF, all one word. There are two things I feel qualified to talk about: organization and productivity apps, and AI tools, which is why I am very happy that today's episode is sponsored by Notion. Notion combines your notes, docs, and projects into one space that's simple and beautifully designed. And the new Notion AI has the capability of multiple AI tools built in, which means you can search, generate, analyze, and chat, all inside Notion.

The new Notion AI is a single AI tool that does it all: search across Notion and other apps, generate docs in your style, analyze PDFs and images, and chat with you about anything. Notion is a perfect place to organize your tasks, track your habits, write beautiful docs, and collaborate with your team. And the more content you add to Notion, the more Notion AI can personalize its responses for you. Basically, unlike generic chatbots, Notion AI already has the context of your work.

There are also a bunch of great integrations. Notion AI uses knowledge from GPT-4 and Claude, and with AI connectors, which are now in beta, Notion AI can also search across discussions, Google Docs, Sheets, and Slides, and more. Tools like GitHub and Jira are coming soon.

Notion is used by over half of Fortune 500 companies, but more importantly, it's used all day, every day by me. Try Notion for free when you go to notion.com/aidailybrief. That's all lowercase letters, notion.com/aidailybrief, to try the powerful, easy-to-use Notion AI today. And when you use that link, you're supporting the show. Once again, that's notion.com/aidailybrief.

Today's episode is brought to you by Superintelligent. Every single business workflow and function is being remade and reimagined with artificial intelligence. There is a huge challenge, however, of going from the potential of AI to actually capturing the value, and that gap is what Superintelligent is dedicated to filling. Superintelligent accelerates AI adoption and engagement to help teams actually use AI to increase productivity and drive business value. An interactive AI use case registry gives your company full visibility into how people are using artificial intelligence right now.

Pair that with capability-building content in the form of tutorials, learning paths, and a use case library, and Superintelligent helps people inside your company show how they're getting value out of AI, while providing resources for people to put that inspiration into action. The next three teams that sign up with one hundred or more seats are going to get free embedded consulting. That's the process by which the Superintelligent team sits with your organization, figures out the specific use cases that matter most to you, and helps actually ensure support for adoption of those use cases to drive real value.

Go to besuper.ai to learn more about this AI enablement network. And now, back to the show. Alex Albert, who is, I think, nominally the head of developer relations but lists himself as the head of Claude relations on Twitter, wrote a nice thread about how big a shift this represents.

He writes: computer use is the first step towards a completely new form of human-computer interaction. In just a few years, the way we interface with computers will be completely different from today. Computer use allows AIs to use computers just as you would: no complex abstractions or specific APIs, just pure visual understanding and interaction, exactly like how you use your computer.

He gives an example video where, he says, Claude opens up claude.ai in a browser, prompts it, takes the outputted website code into a new code file within VS Code, and then proceeds to fix a bug in the website, all with computer use. Alex continues: this is entirely different from how most agentic frameworks currently work today.

Most quote-unquote agents are a patchwork of multiple bespoke APIs glued together under the hood of some complex codebase, Alex says. I believe we will be able to reach near-human-level performance in the next few years, if not much sooner.

When that is reached, that means AIs can operate the basics of a computer just as well as an average person can. At that point, we can start stringing together AIs doing tasks. Now, instead of an AI doing a simple task on a computer that would only take a human a couple of minutes, it will do that task and move on to more tasks. Suddenly the AIs will be doing end-to-end tasks that would take humans hours and days: read a fifty-page research report to create a full executive summary and slide deck, scan financial documents to build a DCF model, use a wireframe to ship a production-ready website.

Alex continues: combine this with a longer context window and increased chain of thought, and now you have the beginnings of the unbundling of AI products, the pieces of the true agent puzzle starting to fall into place.

If you're developing on AI today, you need to be thinking about building complementary pieces to this reality, because it may come faster than most expect. An AI entrepreneur writes: the computer use API by Anthropic is an interesting take on agentic APIs. Agents are challenging because they have to talk to other systems, and most of these systems don't have good APIs. One potential solution is to use the computer use API, which allows the LLM to pretend to be a human operating a computer. The biggest issue with this approach is security, but it's not insurmountable.

What about the use cases that are available right now? Obviously, there's a lot to talk about when it comes to the future, but what can computer use do at this moment? Well, once again, Alex Albert writes: fun story from our time working on computer use. We held an engineering bug bash to make sure we found all the potential problems with the API. This meant bringing a handful of engineers into a room together for a few hours. We were hungry, so one of our engineers' first computer use requests was to ask Claude to navigate to DoorDash and order enough food to feed a group of people.

About a minute later, we saw Claude decide to order us some pizzas. Alex Volkov, an AI evangelist with Weights & Biases and host of the ThursdAI podcast, writes: mind officially blown once again. This computer use Claude demo from Anthropic didn't work for me, so I just asked it to fix itself. So it did. Another user writes: Anthropic recently released the computer use API, with which developers can direct Claude to use computers the way people do. He then shares a video of how he set up computer use with the Firecrawl API to gather information and fill out job applications. Michele Catasta, the president at Replit, writes:

I can't tell you the last time I was so excited to see a new AI capability in action. We plugged computer use into a Replit agent as a human feedback replacement, and it just works.

I feel like it won't take long until our agents become fully autonomous. Now, Replit had some advance access to this, so they have had a little bit more of a chance to experiment. To the extent that there was any skepticism, it was not about the long term, but about the very short term. One commenter, for example, responded to the announcement asking: is this just a faster horse?

Developer Tony Gey writes: first look at Anthropic's Claude computer use demo. Like, it's cool, but kind of underwhelming and definitely not cost effective: a hundred and fifty thousand tokens just to visit and navigate through a couple of sites. Either the models have to get much cheaper or your use case has to be insanely valuable. Or am I just missing something?
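For a rough sense of what that token count means in dollars, here is a hedged back-of-envelope calculation, assuming Claude 3.5 Sonnet's published API pricing around the announcement (roughly three dollars per million input tokens and fifteen dollars per million output tokens) and an illustrative guess that a screenshot-heavy session is about ninety percent input tokens.

```python
# Back-of-envelope cost for the 150,000-token session mentioned above.
# Pricing assumptions: ~$3 per million input tokens, ~$15 per million output
# tokens (Claude 3.5 Sonnet's published rates around launch). The 90/10
# input/output split is an illustrative guess for screenshot-heavy sessions.
INPUT_PRICE_PER_MTOK = 3.00
OUTPUT_PRICE_PER_MTOK = 15.00

total_tokens = 150_000
input_tokens = int(total_tokens * 0.9)
output_tokens = total_tokens - input_tokens

cost = (input_tokens / 1_000_000) * INPUT_PRICE_PER_MTOK \
     + (output_tokens / 1_000_000) * OUTPUT_PRICE_PER_MTOK
print(f"~${cost:.2f} to browse a couple of sites")  # roughly $0.63 under these assumptions
```

Under those assumptions the session costs well under a dollar, which is cheap in absolute terms but, as the commenter notes, harder to justify at scale for tasks a person could finish in a minute or two.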

Others are concerned about just how much change we're going to go through. Liam Alfaro writes: the new computer use model by Anthropic is the beginning of an era where we humans will stop understanding software. The interfaces we have today will become useless. Now, I think it's a valid concern, but also an open question: how much, and in what specific ways, does it matter for us to understand interfaces? And like I said, there is a ton of experimenting already happening. To that end, Blake writes: if this is the slowest and cheapest that agentic automation will ever be, we're in for a wild future. I tried a few different tasks, one involving administering my Facebook group's posts and declining and accepting member requests, which it did great.

I tried booking a haircut. It seemed to work, but hit an API rate limit. I did the make-yourself-a-website thing from the demo, and it cooked. Ultimately, one commentator summed up a lot of the sentiment when they wrote: this is so huge. Today, we entered the era of agents.

Microsoft, and now much more so Anthropic: today marks a new step in the direction of AGI. Your move, OpenAI. Anthropic is in the lead. Another commentator writes: the dawn of truly agentic AI is upon us.

While today we're seeing the early stages of APIs enabling models to interact with systems, we are rapidly approaching an era where intelligent, autonomous agents will be deeply integrated into our operating systems, serving as intelligent systems that understand and execute our intentions. I've long maintained that the future of computing isn't just about smarter algorithms, more parameters, or improved tuning; it's about intelligently automated systems that can actively engage with our digital environment.

The recent advances in multimodal models and computer interaction capabilities are just the beginning of a new trend. As these models become more integrated into our operating systems, we will see a fundamental shift in how we interact with technology. This is both society-altering and inevitable.

And so, that is Anthropic's computer use. Still nascent? Yes. Still limited in what it can do.

But as with o1, it's another branch off of the evolutionary tree of LLMs and AI, and another step towards a very different agentic future. That's going to do it for today's AI Daily Brief. Appreciate you listening, as always. And until next time, peace.