Topics
I introduce OpenAI's newly released Codex model, which is built specifically for developers and could have major implications for billions of dollars' worth of acquisitions. Whether Codex succeeds will help decide whether OpenAI becomes the leader in AI. The AI Box Playground lets users try all of the top AI models for $20 a month, supports talking to different models within the same chat, and can rerun a chat with a different model. It offers a convenient platform to test and use all the AI models in one place, without subscribing to multiple services.

OpenAI has launched a software engineering agent called Codex that runs in the cloud and performs tasks such as writing new features or fixing bugs. Codex is meant to compete with Claude Code, which excels at design and can adapt to an app's overall design style. Codex will open first to Pro, Enterprise, and Team users, then gradually to everyone. Codex is powered by Codex One, a version of OpenAI's o3 optimized for software engineering and trained with reinforcement learning to generate code that mirrors human style. Users can access Codex through the ChatGPT sidebar and assign new coding tasks by typing a prompt and clicking Code. Codex handles each task in a separate, isolated environment, where it can read and edit files and run commands, including test harnesses. Tasks usually take 1 to 30 minutes to complete, which is still faster than using people. After finishing a task, Codex commits its changes to the environment and provides verifiable evidence so users can trace every step. Users can configure the Codex environment to match their real development environment as closely as possible. For now, only ChatGPT Pro users paying $200 can use Codex. Through ChatGPT, users can ask Codex to write a Python script or convert a JavaScript function to TypeScript. Codex can generate code, but the code may crash because a required library is missing. For non-technical people, using these coding tools still takes some technical knowledge and learning. While coding tools help developers a lot, OpenAI has not yet achieved fully no-code development; users still need some programming knowledge.

Codex can be guided by agents.md files, which tell it how to navigate a codebase and which commands to run for testing. Like human developers, Codex agents perform best with configured dev environments, reliable test setups, and clear documentation. Codex One performs strongly on the SWE-bench coding benchmark, even beating o3 high. Codex One beats o3 high, but not by a large margin, and features such as agent integration matter a lot. OpenAI's goal is to build safe and trustworthy agents. Codex is currently a research preview, meaning it is not fully released and may still have unknown bugs or issues. OpenAI prioritized security and transparency in designing Codex so that users can verify its output. OpenAI's main training goal was to align Codex's output closely with human coding preferences and standards. OpenAI also has a section on preventing abuse of Codex, though given the existence of open-source models like Llama, that concern may be smaller. Codex agents run in secure, isolated containers in the cloud with internet access disabled to prevent code leaks; disabling internet access improves security and keeps malicious websites from stealing a codebase. Some companies, such as Temporal, Cisco, and Superhuman, are exploring how to use Codex. OpenAI also released Codex CLI, a lightweight open-source coding agent that runs in the terminal. Codex is a fine-tuned model built on top of other models. For developers using the codex-mini-latest model, input tokens cost $1.50 per million and output tokens $6 per million, with a 75% prompt caching discount. OpenAI has reportedly offered $3 billion to acquire Windsurf, but Windsurf has announced it is building its own AI models, which makes whether the acquisition will go through an interesting question.

Deep Dive

Shownotes Transcript


Today on the podcast, we're talking about the brand-new model from OpenAI called Codex, built specifically for developers. And I think there are big implications here for billions of dollars' worth of acquisitions OpenAI is currently looking at, in a segment where OpenAI is currently getting smoked: tools for coders. We're going to get into all of that.

This is a really key moment for OpenAI. If they can land this, they will be the kings of AI, and if not, some of their competitors will live to fight another day. Before we get into it, I wanted to mention: if you're interested in trying all of the new AI models that keep coming out in rapid succession, and testing them next to each other, you need to try the AI Box Playground. My very own startup has finally launched; we have our very first product.

The AI Box Playground is in beta and essentially allows you to try all of the different top AI models for $20 a month. So you don't need subscriptions to Claude and Gemini and OpenAI and Grok and everything else. You get access to everything, including image and audio models as well, by the way.

And you can chat with all of the models in the same chat. So if you're mid-chat and you're like, "Hey, I'm not loving what ChatGPT is saying here," you can switch to Claude, you can switch to Gemini, you can switch to Grok and get better or different responses. In addition, you can ask it to rerun the chat with a different model. So if you didn't like a response it gave you, you can hit the rerun button, select a different model, and have it regenerate the response with the new model. And you can compare those models side by side with a little compare button.

This is one of the best ways, I think, to test all of the models side by side and use them all in one place, without having to have a hundred subscriptions or a hundred different tabs open. Instead of seven tabs with different tools, you can have them all in one place and get access to everything in a really easy way. So it's AIbox.ai, and there's a link in the description if you're interested in

trying it out. But otherwise, let's get into the rest of the episode here. So we have a tweet from Sam Altman that of course is going viral. And he says, "Today we're introducing Codex. It's a software engineering agent that runs in the cloud and does tasks for you like writing a new feature or fixing a bug. You can run many tasks in parallel." Of course, we have to label this an agent because that's the way anything launches today with AI.

But essentially, this is a really useful tool. And I think it's directly trying to fight back against Claude Code, which has been picking up a lot of steam. And if I'm being 100% honest, over here at AI Box we are exclusively using Claude Code. It plugs straight into our code base, front end and back end, ties it all together, and it understands everything. The thing that has blown me away with Claude Code in particular, which I haven't quite seen from OpenAI yet but I hope Codex starts giving it a run for its money on, is the design. If you ask Claude Code to create a new feature or button or tool, or to fix something, when it redesigns something it's going to draw off the design of your whole actual app. It knows what everything looks like, and it's going to copy some of the same design elements, the same colors and styles, and it looks amazing. So that's what I love about it.

But OpenAI now needs to compete. So who has access to this? I was personally disappointed to find out that it's going to be for Pro, Enterprise, and Team users starting today, and everyone else will hopefully get it soon.

A lot of people are very excited about this, and there's a lot of hype and, I guess, good press coming out on it. This is definitely a needed tool. OpenAI just came out with a new code-specific, fine-tuned model, and now it's going to be plugged into this new Codex.

This is what they said specifically about it on their blog post over at OpenAI: "Codex is powered by codex-1, a version of OpenAI o3 optimized for software engineering. It was trained using reinforcement learning on real-world coding tasks in a variety of environments to generate code that closely mirrors human style and PR preferences, adheres precisely to instructions, and can iteratively run tests until it receives a passing result."

And so they just said who they're going to roll it out to, which is interesting. I think this could be a fantastic tool, definitely better than some of the other things they have out today. As for how this actually works, they said: "Today you can access Codex through the sidebar in ChatGPT and assign it new coding tasks by typing a prompt and clicking Code. If you want to ask Codex a question about your codebase, click Ask. Each task is processed independently in a separate, isolated environment preloaded with your codebase. Codex can read and edit files, as well as run commands including test harnesses,

linters, and type checkers. Task completion typically takes between one and 30 minutes." And I know a lot of people, maybe even if you're not a developer, are like, oh my gosh, 30 minutes to complete a task? Yeah, it can take 30 minutes. And with Claude Code, which has been out for maybe two months now at this point that we've been using it, we've just done insane things. We've really ramped up our development with it.

And yeah, sometimes you ask it a question, you're like, all right, I'm coming back to that in about 15, 20 minutes once it runs through and figures stuff out. But it's still so much faster than using actual people to do these tasks. So one to 30 minutes is absolutely no problem.

OpenAI also said that once Codex completes a task, it commits its changes to its environment. Codex provides verifiable evidence of its actions through citations of terminal logs and test outputs, allowing you to trace each step taken during task completion. You can then review the results,

request further revisions, open a GitHub pull request, or directly integrate the changes into your local environment. In the product, you can configure the Codex environment to match your real development environment as closely as possible. So this is a fantastic tool. But if you're on the $20 version of ChatGPT and you go over there,

you're not going to be able to run it at the moment. It's only available if you're paying $200 for the Pro version, and hopefully us poor $20 people will get access soon. I was recently playing around on ChatGPT, testing out the coding capabilities. I know they've got their new model, so I should

run it with that. But I believe just running on GPT-4o, I asked it, hey, can I use Codex? It said, yeah, just ask me to write some code, and it gave examples of requests I could make. So I said: write a Python script that scrapes headlines from a news site, or convert this JavaScript function to TypeScript.

I put that in as a prompt, which I'm now realizing is actually two different prompts, but it wrote some code. It said, here's a simple script using requests and BeautifulSoup to scrape headlines from the BBC News front page. It gave me some code, and it opened it up inside OpenAI's canvas, which looks great, and there's a button to run the code. If you run it, of course, it crashes.

I'm assuming that's because it references a library, BeautifulSoup, which the environment doesn't actually have access to. So in any case, it's not perfect today. I think for a lot of people like myself, I'm not a technical co-founder and I'm not a technical developer. And so when I try to do vibe coding, yes, I'm sure I could figure this out. I basically understand the issue here. I could go watch some YouTube tutorials, get into the weeds, and try to vibe code some sort of cool app. But at the end of the day, it's a little bit above what a lot of non-developers can understand or pull off instantly without watching a bunch of tutorials first.
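For context, the script it produced looked something like this minimal sketch. To be clear, this is my reconstruction rather than the exact code ChatGPT generated, and the h2 selector is an assumption about the BBC front-page markup, which changes often. It also only runs if the two third-party libraries are installed, which is exactly what the canvas environment apparently hadn't done:

```python
# Minimal sketch of the kind of headline scraper ChatGPT generated.
# Needs the third-party libraries the canvas apparently lacked:
#   pip install requests beautifulsoup4
import requests
from bs4 import BeautifulSoup

# Fetch the BBC News front page (URL and selector are assumptions).
response = requests.get("https://www.bbc.com/news", timeout=10)
response.raise_for_status()

# Parse the HTML and pull the text out of every <h2> tag.
soup = BeautifulSoup(response.text, "html.parser")
headlines = [h2.get_text(strip=True) for h2 in soup.find_all("h2")]

for headline in headlines[:10]:
    print(headline)
```

Run locally after the pip install and the missing-library crash goes away; what headlines you actually get then depends on the page's real markup.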

All I'm saying is these coding tools are fantastic, and they're amazing for developers, but we're still not at a place where OpenAI is making no-code tools where you tell it to do something and it vibe codes it for you. You still kind of have to know what you're doing and watch some videos when you're using OpenAI. Now, there are other tools doing much better at this, like Lovable, and a couple of other players doing more of the no-code thing, but even there you've still got to know a little of what you're actually doing. In any case, I just wanted to bring that up.

Okay. Going back to Codex and what they're doing on the AI agent side of things, I'm quite impressed and very interested. They say Codex can be guided by agents.md files.

They said these can be placed within your repository. They're text files, similar to a README, where you can inform Codex how to navigate your codebase, which commands to run for testing, and how best to adhere to your project's standard practices. Like human developers, Codex agents perform best when provided with configured dev environments, reliable testing setups, and clear documentation.
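To make that concrete, here's a hypothetical sketch of what an agents.md might contain. The section names, paths, and commands below are my own invention for illustration, not an official OpenAI template:

```
# agents.md (hypothetical example, placed at the repository root)

## Navigating the codebase
- The API server lives in server/; the React front end lives in web/.

## Testing
- Run pytest in server/ and npm test in web/ before finishing a task.

## Standard practices
- Match the existing lint config; keep changes small and focused.
```

The idea is basically the same as an onboarding doc for a new human hire: the agent reads it before touching anything.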

They also said that in their coding evaluations and internal benchmarks, Codex One shows strong performance even without agents.md files or custom scaffolding. So that's fairly impressive. As far as SWE-bench goes, which is a software engineering coding benchmark, Codex One is beating o3 high. And o3 high is their high-effort model: they're giving it the most compute, the most intelligence, and Codex One is beating it. Now, the one thing I will say is that it's not completely smoking it. It's not like Codex One is a thousand times better than o3 high, but it is better. So it is the best, and that's important to know. And of course, they've got all these cool things, like the agent integrations, that are important.

They said 23 SWE-bench Verified samples that were not runnable on their internal infrastructure were excluded. So pretty much they're explaining how they actually got these results: Codex One was tested at a maximum context length of 192,000 tokens and medium reasoning effort, which is the setting that will be available in the product today. For details on the o3 evaluations, they have a whole page where you can go see that.

Their goal here, pretty much, is for you to be able to build safe and trustworthy agents. Now, this is technically a research preview, right? They're not saying this is a full release. They kind of do this with everything nowadays; I think their newest 4.5 model is also a research preview. They're saying, yeah, it's a research preview, so if it does anything crazy or goes off the rails, well, it was a research preview, it wasn't fully rolled out yet. They're not going to put the official stamp of, this is our

official, best model that's fully out there until they see whether it has any crazy bugs or quirks. I do think that's really funny, because other than calling it a research preview, there's not a ton of difference, except that it's not bad PR if it goes off the rails. They say they're prioritizing security and transparency when designing Codex so users can verify its outputs, a safeguard that grows increasingly more important as AI models handle more complex coding tasks independently. And really, this is important. You need to know what it's actually doing in your codebase. You don't want it to mess up something big, and then all of a sudden you're like, oh crap, what did it do? They say their primary goal when training this was to align its outputs really closely with human coding preferences and standards.

And it is much better at that. It's really quite impressive. They compare it a lot to o3, and it is better than o3. They also have a whole section about how they're going to prevent abuse of this, so bad actors can't use it. At the end of the day, I think there are enough open-source models out there, like Llama, that

if bad actors want to use AI coding tools, it's going to happen anyway. So I'm less concerned about what they're doing there. But maybe that's just me.

Secure execution. This I do think is important. They said the Codex agent operates entirely within a secure, isolated container in the cloud. During task execution, internet access is disabled, limiting the agent's interaction solely to the code explicitly provided via GitHub repositories and pre-installed dependencies configured by the user via a setup script. The agent cannot access external websites, APIs,

or other services. And pretty much the reason they're doing this, not giving it access to the internet: some people are like, oh, it's not going to be as good at searching for the most up-to-date things, which is true. But the reason they're doing it is that when you put in your code, they don't want some clever, sneaky website to essentially phish the agent into pasting your whole codebase somewhere so people could steal it. That's what they're trying to avoid here. So it makes things much more secure, but it definitely is limiting.

Not that anyone else is doing that, but yeah, that is one way to do it. There are a bunch of really interesting early use cases from people trying it. It's funny, when they did this whole launch, they listed some companies that are using it. They say Temporal uses Codex to accelerate feature development. They say Cisco is exploring how Codex can help their engineering teams. I thought that was funny. It's not like

Cisco's using it. It's like they're exploring how they could possibly use this. Like, yes, I'm sure every company in the world, when a big feature comes out, is exploring how it could possibly be beneficial to them. So I'm just going to ignore their Cisco plug there. Superhuman's using it. Kodiak is using it.

A bunch of other players are too. This is probably a great tool, you know, unless I bag on it too much. They also said that last month they launched Codex CLI, a lightweight open-source coding agent that runs in your terminal.
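If you want to poke at the open-source version yourself, getting started looks roughly like this. A caveat: I'm going from memory on the package name and the sample prompt, so treat both as assumptions and check the repo:

```
# Install the open-source Codex CLI globally (package name as I recall it)
npm install -g @openai/codex

# Then hand it a plain-English task from inside your project directory
codex "explain this codebase to me"
```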

And now they're releasing a smaller version of Codex One, a version of o4-mini designed specifically for use in Codex CLI. It's kind of interesting, right? This Codex isn't some brand-new, out-of-thin-air model. It's a fine-tuned model built on top of models they already have. So yeah, I do think this is very interesting. As far as pricing goes, they say that for developers building with codex-mini-latest, the model is available on the Responses API and priced at $1.50 per million input tokens and $6 per million output tokens, with a 75% prompt caching discount. That's fantastic; the prompt caching actually saves you a lot of money.
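To put those rates in perspective, here's a quick back-of-the-envelope calculation. The per-million rates come straight from the stated pricing; the token counts are made-up numbers purely for illustration:

```python
# Back-of-the-envelope codex-mini-latest cost estimate.
# Rates are the stated pricing; token counts are invented for illustration.
INPUT_RATE = 1.50 / 1_000_000    # dollars per input token
OUTPUT_RATE = 6.00 / 1_000_000   # dollars per output token
CACHE_DISCOUNT = 0.75            # cached prompt tokens are 75% cheaper

fresh_input = 200_000    # input tokens without a cache hit
cached_input = 800_000   # input tokens served from the prompt cache
output = 100_000         # generated tokens

cost = (fresh_input * INPUT_RATE
        + cached_input * INPUT_RATE * (1 - CACHE_DISCOUNT)
        + output * OUTPUT_RATE)
no_cache_cost = (fresh_input + cached_input) * INPUT_RATE + output * OUTPUT_RATE

print(f"with caching: ${cost:.2f}, without: ${no_cache_cost:.2f}")
# -> with caching: $1.20, without: $2.10
```

So on an agent workload where most of the prompt repeats between calls, the caching discount really is most of the savings.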

In any case, this is still in early development, and it's a very cool tool. I'm excited to see it in the wild, with people actually using it, and to see how it stacks up against Claude Code. Now, the drama behind all of this is that OpenAI allegedly just offered $3 billion to buy Windsurf, probably the second-biggest coding tool after Cursor. And now it seems like they're building a direct competitor to Windsurf.

And just yesterday, Windsurf announced that they're building their own AI models; they trained their own model on code. So now it's like Windsurf doesn't need OpenAI, and OpenAI doesn't need Windsurf. So will that acquisition go through? It's a big, interesting question.

And I'll keep you up to date on all the latest drama there. If they actually pull off that acquisition, what happens there? In any case, thanks so much for tuning into the podcast. I hope you learned something new about this crazy world of coding. And if you did, make sure to go check out AIbox.ai if you want to try all of the latest models for $20 a month. You don't have to have subscriptions to everything. You get them all in one place and a ton of great features as well. Thanks so much for tuning in and I'll catch you next time.