Today on the podcast, we're talking about the brand new model from OpenAI called Codex, built specifically for developers. And I think there are huge implications here: billions of dollars' worth of acquisitions that OpenAI is currently looking at, in a massive segment where OpenAI is currently getting smoked, which is coders. We're going to get into all of that. I think this is a really key moment for OpenAI. If they can land this, they will be the kings of AI, and if not, some of their competitors will live to fight another day. So we're going to get into all of that. Before we do, I wanted to mention: if you're interested in trying all of the new AI models that keep coming out in rapid succession, and testing them next to each other, you need to try the AI Box Playground. My very own startup has finally launched, and we have our very first product.
The AI Box Playground is in beta and essentially allows you to try all of the different top AI models for $20 a month. So you don't need subscriptions to Claude and Gemini and OpenAI and Grok and everything else; you get access to everything, including image and audio models as well, by the way.
And you can chat with all of the models in the same chat. So if you're mid-chat and you're like, "Hey, I'm not loving what ChatGPT is saying here," you can switch to Claude, you can switch to Gemini, you can switch to Grok and get better or different responses. In addition, you can rerun the chat with a different model. So if you didn't like a response, you can hit the rerun button, select a different model, and have it regenerate the response with the new model. And you can compare those models side by side with a little compare button.
This is one of the best ways, I think, to test all of the models side by side and access them all in one place without having to have a hundred subscriptions and a hundred different tabs open, right? Instead of seven tabs with different tools, you can have them all in one place and get access to everything in a really easy way. So it's AIbox.ai. There is a link in the description if you're interested in
trying it out. But otherwise, let's get into the rest of the episode here. So we have a tweet from Sam Altman that of course is going viral. And he says, "Today we're introducing Codex. It's a software engineering agent that runs in the cloud and does tasks for you like writing a new feature or fixing a bug. You can run many tasks in parallel." Of course, we have to label this an agent because that's the way anything launches today with AI.
But essentially, this is a really useful tool. And I think it's directly trying to fight back against Claude Code, which has been picking up a lot of steam. And if I'm being 100% honest, over here at AI Box we are exclusively using Claude Code. It plugs straight into our code base, front end and back end, ties it all together, and understands everything. The thing that has blown me away with Claude Code in particular, which I haven't quite seen from OpenAI but I hope Codex starts giving a run for its money, is the design. If you ask Claude Code to create a new feature, button, or tool, or to fix something, when it redesigns something it draws on the design of your whole actual app. It knows what everything looks like, and it will copy some of the same design elements, the same colors and styles, and it looks amazing. That's what I love about it. But
OpenAI now needs to compete. So who has access to this? I was personally disappointed to find out that it's going to be for Pro, Enterprise, and Team users starting today, and then everyone else will hopefully get it soon.
A lot of people are very excited about this, and there's a lot of hype and, I guess, good press coming out on it. This is definitely a needed tool. OpenAI just came out with a new model fine-tuned specifically for code, and now it's going to be plugged into this new Codex.
This is what they said specifically about it on their blog post: "Codex is powered by codex-1, a version of OpenAI o3 optimized for software engineering. It was trained using reinforcement learning on real-world coding tasks in a variety of environments to generate code that closely mirrors human style and PR preferences, adheres precisely to instructions, and can iteratively run tests until it receives a passing result."
And so they just said who they're going to roll it out to, which is interesting. I think this could be a fantastic tool, definitely better than some of the other things they have out today. Here's how it actually works, in their words: "Today you can access Codex through the sidebar in ChatGPT and assign it new coding tasks by typing a prompt and clicking Code. If you want to ask Codex a question about your codebase, click Ask. Each task is processed independently in a separate, isolated environment preloaded with your codebase. Codex can read and edit files, as well as run commands including test harnesses,
linters, and type checkers. Task completion typically takes between 1 and 30 minutes." And I know a lot of people, maybe even if you're not a developer, are like, oh my gosh, 30 minutes to complete a task? Yeah, it takes 30 minutes. And with Claude Code, which has been out for maybe two months now at this point, which is how long we've been using it, we've just done insane things. We've really ramped up our development with it.
And yeah, sometimes you give it a task and you're like, all right, I'm coming back to that in about 15 or 20 minutes once it runs through and figures stuff out. But it's still so much faster than using actual people for these tasks. So 1 to 30 minutes is absolutely no problem.
OpenAI also said that once Codex completes a task, it commits its changes to its environment. "Codex provides verifiable evidence of its actions through citations of terminal logs and test outputs, allowing you to trace each step taken during task completion. You can then review the results, request further revisions, open a GitHub pull request, or directly integrate the changes into your local environment. In the product, you can configure the Codex environment to match your real development environment as closely as possible." So this is a fantastic tool. But if you are on the $20 version of ChatGPT and you go over there,
you're not going to be able to run it at the moment. It's only available if you're paying $200 for the Pro version, and hopefully us poor $20 people will get access to it soon. I recently was playing around on ChatGPT, testing out the coding capabilities. I know they have their new model, so I should run it with that, but I believe I was just running on GPT-4o. I asked it, hey, can I use Codex? And it said, yeah, just ask me to write some code, and it gave examples of requests I could make: "write a Python script that scrapes headlines from a news site" or "convert this JavaScript function to TypeScript."
I put that in as a prompt, which I'm now realizing is actually two different prompts, and it wrote some code. It said, here's a simple script using requests and BeautifulSoup to scrape headlines from the BBC News front page. It gave me some code and opened it up inside OpenAI's Canvas, which looks great, and there's a button to run the code. Of course, it crashes, and I'm assuming that's because it references a library, BeautifulSoup, that the Canvas sandbox doesn't actually have installed.
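For reference, the script it generated was something along these lines. This is my reconstruction, not its exact output, and the assumption that headlines sit in h2 tags on the BBC page is a guess; the real point is that it imports bs4, which won't be there unless someone installed it:

```python
# Sketch of the kind of scraper ChatGPT generated (my reconstruction, not its exact output).
# Needs: pip install requests beautifulsoup4
# The missing beautifulsoup4 package is most likely why it crashed inside Canvas.
import requests
from bs4 import BeautifulSoup

resp = requests.get("https://www.bbc.com/news", timeout=10)
resp.raise_for_status()

soup = BeautifulSoup(resp.text, "html.parser")
# Assumption: headlines live in <h2> tags; the BBC's real markup changes often.
for heading in soup.find_all("h2")[:10]:
    text = heading.get_text(strip=True)
    if text:
        print(text)
```

Run that anywhere pip is available and it would most likely work; the Canvas sandbox just doesn't have the dependency.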
So in any case, it's not perfect today. And for a lot of people like myself (I'm not a technical co-founder, I'm not a technical developer), when I try to go do vibe coding, yes, I'm sure I could figure this out. I basically understand the issue here. I could go watch some YouTube tutorials, get into the weeds, and try to vibe code some sort of cool app. But at the end of the day,
it's a little bit above what, I think, a lot of non-developers understand or can pull off instantly. I'd have to watch a bunch of YouTube tutorials.
All I'm saying is these coding tools are fantastic. They're amazing for developers. But we're still not at a place where OpenAI is making no-code tools, where you tell it to do something and it vibe codes it for you. You still kind of have to know what you're doing and watch some videos when you're using OpenAI. Now, there are other tools doing much better at this; Lovable and a couple of other players are doing more of the no-code approach, though even there you've still got to know a little of what you're actually doing. In any case, I just wanted to bring that up.
Okay. If you go over to Codex to get into what they're doing on the AI agent side of things, I'm quite impressed and very interested. They say Codex can be guided by AGENTS.md files, which can be placed within your repository. These are text files, similar to a README, where you can inform Codex how to navigate your codebase, which commands to run for testing, and how best to adhere to your project's standard practices. Like human developers, Codex agents perform best when provided with configured dev environments, reliable testing setups, and clear documentation.
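To give you a feel for it, a hypothetical AGENTS.md might look something like this. The section names and commands here are made up for illustration; as far as I can tell it's just plain-text instructions, not a fixed schema OpenAI prescribes:

```markdown
<!-- Hypothetical AGENTS.md; sections and commands are illustrative, not a required schema. -->
# Repository guide for Codex

## Navigation
- Frontend code lives in `web/`, the API in `server/`, shared types in `packages/shared/`.

## Testing
- Run `npm test` in `web/` and `pytest` in `server/` before finishing a task.

## Conventions
- Match the existing ESLint/Prettier config.
- Keep changes small and reference the related issue in commit messages.
```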
And on their coding evaluations and internal benchmarks, codex-1 shows strong performance even without AGENTS.md files or custom scaffolding. So this is fairly impressive. On SWE-bench, which is a software engineering benchmark, codex-1 is beating o3-high. For context, o3 is their flagship full-size reasoning model (o4-mini is out as well, but o3 is the big one), and the high setting means they're giving it the most compute, the most intelligence, and codex-1 is still beating it. Now, the one thing I will say is that it's not completely smoking it; it's not like codex-1 is a thousand times better than o3-high. But it is better, so it is the best, and that's important to know. And of course, they've got all these cool things like the agent file integration that are important.
They said 23 SWE-bench Verified samples that were not runnable on their internal infrastructure were excluded. So they're basically explaining how they actually got these results: codex-1 was tested at a maximum context length of 192k tokens and medium reasoning effort, which is the setting that will be available in the product today. For details on the o3 evaluations, they have a whole page where you can go see them.
Their goal here is that you're able to build safe and trustworthy agents, pretty much. Now, this is technically a research preview, right? So they're not saying this is a full release. They kind of do this for everything nowadays; I think their newest 4.5 model is a research preview too. And they're just saying, yeah, it's a research preview, so that if it does anything crazy or goes off the rails, well, it was a research preview; it wasn't fully rolled out yet. They're not going to put the official stamp of
"this is our official best model that's fully out there" on it until, you know, they see whether it has any crazy bugs or quirks. I do think that's really funny, because other than calling it a research preview there's not a ton of difference, except that it's not bad PR if it goes crazy, pretty much. They say they're prioritizing security and transparency when designing Codex so users can verify its outputs, a safeguard that grows increasingly important as AI models handle more complex coding tasks independently. And really, this is important: you need to know what it's actually doing in your codebase. You don't want it to mess up anything big and then all of a sudden you're like, oh crap, what did it do? They say that their primary goal when they were training this was to align the outputs really closely with human coding preferences and standards.
And it's much better at that; it's really quite impressive. They compare it a lot to o3, and it is better than o3. They also have a whole section about how they're going to prevent abuse of this, you know, making sure bad actors aren't going to be using it. At the end of the day, I think there are enough open source models, with Llama and others, that if bad actors are going to use AI coding tools, it's going to happen. So I'm less concerned about what they're doing there. But maybe that's just me.
Secure execution, though, I do think is important. They said the Codex agent operates entirely within a secure, isolated container in the cloud. During task execution, internet access is disabled, limiting the agent's interaction solely to the code explicitly provided via GitHub repositories and pre-installed dependencies configured by the user via a setup script. The agent cannot access external websites, APIs, or other services. And pretty much the reason they're not giving it access to the internet is this: some people are saying, oh, it's not going to be as good at searching for the most up-to-date things, which is true. But the reason they disable it is that when you plug in your code, they don't want some clever, sneaky website to essentially prompt-inject the agent into pasting in your whole code base so people could steal it. That's what they're trying to avoid here. So that makes it much more secure, but it definitely is limiting.
Not that anyone else is doing that, but yeah, that is one thing to note. There are a bunch of really interesting early use cases from people trying it. It's funny: when they did this whole launch, they listed some companies that are using it. They say Temporal uses Codex to accelerate feature development, and that Cisco is exploring how Codex can help their engineering teams. I thought that was funny. It's not like Cisco is using it; they're exploring how they could possibly use it. Yes, I'm sure every company in the world, when a big feature comes out, is exploring how it could possibly be beneficial to them. So I'm just going to ignore their Cisco plug there. Superhuman is using it, Kodiak is using it, and a
bunch of other players. This is probably a great tool, you know, lest I bag on it too much. They said that last month they launched Codex CLI, a lightweight open-source coding agent that runs in your terminal, and now they're releasing a smaller version of codex-1, a version of o4-mini designed specifically for use in Codex CLI. It's kind of interesting, right? This Codex isn't some brand new out-of-thin-air model; it's a fine-tuned model built on top of other models they already have. So yeah, I do think this is very interesting. As far as pricing, they say that for developers building with codex-mini-latest, the model is available on the Responses API and priced at $1.50 per million input tokens and $6 per million output tokens, with a 75% prompt caching discount.
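The Responses API bit means you can hit this model directly from code. A minimal sketch, assuming the openai Python package (v1.x) and an OPENAI_API_KEY in your environment; the prompt itself is made up for illustration:

```python
# Minimal sketch of calling codex-mini-latest through OpenAI's Responses API.
# Assumes: `pip install openai` (v1.x) and OPENAI_API_KEY set in the environment.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.responses.create(
    model="codex-mini-latest",
    input="Write a Python function that reverses the words in a sentence.",  # illustrative prompt
)
print(response.output_text)
```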
That 75% prompt caching discount is fantastic, by the way; it actually saves you a lot of money on repeated prompts.
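To make that concrete, here's some back-of-the-envelope math using the prices above. The assumption that the discount applies to whatever share of the input tokens is served from the cache is mine, so treat the numbers as a sketch:

```python
# Back-of-the-envelope pricing for codex-mini-latest, using the announced numbers.
# Assumption (mine): the 75% discount applies to the cached share of the input tokens.
INPUT_PER_M = 1.50    # dollars per 1M input tokens
OUTPUT_PER_M = 6.00   # dollars per 1M output tokens
CACHE_DISCOUNT = 0.75

def request_cost(input_toks: int, output_toks: int, cached_fraction: float = 0.0) -> float:
    """Estimated dollar cost of a single request."""
    cached = input_toks * cached_fraction
    fresh = input_toks - cached
    input_cost = (fresh + cached * (1 - CACHE_DISCOUNT)) * INPUT_PER_M / 1e6
    output_cost = output_toks * OUTPUT_PER_M / 1e6
    return input_cost + output_cost

# A 100k-token prompt with a 5k-token answer:
print(request_cost(100_000, 5_000))                       # 0.18 (nothing cached)
print(request_cost(100_000, 5_000, cached_fraction=0.9))  # ~0.079 (90% of prompt cached)
```

So a big prompt that hits the cache on 90% of its tokens costs less than half as much, which adds up fast when an agent is re-reading the same codebase on every task.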
That's great. In any case, this is still early in its development. It's a very cool tool, and I'm excited to see it in the wild, with people actually using it, and to see how it stacks up against Claude Code. Now, the drama behind all of this is that OpenAI allegedly just offered $3 billion to buy Windsurf, one of the biggest coding tools, probably the second biggest after Cursor. And now it seems like they're building a direct competitor to Windsurf.
And Windsurf also announced just yesterday that they're building their own actual AI model; they trained their own model on code. So now it's like Windsurf doesn't need OpenAI, and OpenAI doesn't need Windsurf. So will that acquisition go through? It's a big, interesting question to ask.
And I'll keep you up to date on all the latest drama there. If they actually pull off that acquisition, what happens there? In any case, thanks so much for tuning into the podcast. I hope you learned something new about this crazy world of coding. And if you did, make sure to go check out AIbox.ai if you want to try all of the latest models for $20 a month. You don't have to have subscriptions to everything. You get them all in one place and a ton of great features as well. Thanks so much for tuning in and I'll catch you next time.