The Realtime API is significant because it allows for human-level latency in interactions, enabling seamless and natural conversations. It can handle real-time interruptions and maintain context, making it more effective for applications like voice assistants, customer service, and real-time translation.
OpenAI is transitioning from a model provider to a platform provider by focusing on tooling around their models, such as fine-tuning, model distillation, and evaluation tools. They are also emphasizing real-time capabilities and providing more integrated solutions, similar to AWS, to meet developers where they are.
OpenAI is moving away from the term AGI because it has become overloaded and is often misinterpreted. Instead, they are focusing on continuously improving AI models and ensuring they are used responsibly, without the constraints of a binary definition of AGI.
O1 and its successors are expected to be very capable reasoning models that can handle complex tasks and multi-turn interactions. Over time, they aim to increase the rate of scientific discovery and solve problems that would traditionally take humans years to figure out.
The main challenges in deploying AI agents that control computers include ensuring high robustness, reliability, and alignment. These systems need to be safe and trustworthy, especially when they interact with users over longer periods and in complex environments.
OpenAI's approach to safety and alignment is crucial because it balances the rapid advancement of AI technologies with responsible deployment. They focus on iterative testing and real-world feedback to identify and mitigate potential harms, ensuring that AI systems are safe and beneficial to society.
The Realtime API is used for real-time interactions with AI, such as voice assistants and live translations. It uses WebSocket connections for bi-directional streaming, allowing the AI to respond instantly and handle complex tasks like function calling and tool use.
Vision fine-tuning can significantly impact fields like medicine by training AI models on specific datasets, such as medical images. This can help doctors in making more accurate diagnoses and spotting details that might be missed by human eyes.
Fine-tuning AI models with diverse data sets is important because it ensures the models are adaptable and perform well across different use cases. Training on a variety of programming languages, for example, can improve the model's performance in specific applications and avoid biases.
The future of context windows in AI models will see significant improvements in both length and efficiency. OpenAI expects to reach context lengths of around 10 million tokens in the coming months, and eventually, infinite context within a decade. This will enable more complex and versatile interactions.
Happy October.
This is your AI co-host, Charlie. One of our longest standing traditions is covering major AI and ML conferences in podcast format. Delving, yes delving, into the vibes of what it is like to be there stitched in with short samples of conversations with key players just to help you feel like you were there. Covering this year's Dev Day was significantly more challenging because we were all requested not to record the opening keynotes.
So in place of the opening keynotes, we had the viral NotebookLM Deep Dive crew, my new AI podcast nemesis, give you a seven-minute recap of everything that was announced.
Of course, you can also check the show notes for details. I'll then come back with an explainer of all the interviews we have for you today. Watch out and take care. All right. So we've got a pretty hefty stack of articles and blog posts here all about OpenAI's Dev Day 2024. Yeah, lots to dig into there. Seems like you're really interested in what's new with AI. Definitely.
And it seems like OpenAI had a lot to announce. New tools, changes to the company. It's a lot. It is. And especially since you're interested in how AI can be used in the real world, you know, practical applications, we'll focus on that. Perfect. Like, for example, this new real-time API, they announced that, right? That seems like a big deal if we want AI to sound, well, less like a robot. It could be huge. Yeah.
The real-time API could completely change how we interact with AI. Like, imagine if your voice assistant could actually handle it if you interrupted it. Or like have an actual conversation. Right, not just these clunky back and forth things we're used to. And they actually showed it off, didn't they? I read something about a travel app, one for languages, even one where the AI ordered takeout. Those demos were really interesting, and I think they show how this real-time API can be used in so many ways.
And the tech behind it is fascinating, by the way. It uses persistent WebSocket connections and this thing called function calling so it can respond in real time. So the function calling thing, that sounds kind of complicated. Can you explain how that works? So imagine giving the AI access to this whole toolbox, right? Information, capabilities, all sorts of things. Okay. So take the travel agent demo, for example. With function calling, the AI can pull up details, let's say, about Fort Mason, right, from some database, right?
Like nearby restaurants, stuff like that. Ah, I get it. So instead of being limited to what it already knows, it can go and find the information it needs, like a human travel agent would. Precisely. And someone on Hacker News pointed out a cool detail: the API actually gives you a text version of what's being said, so you can store that, analyze it. That's smart. It seems like OpenAI put a lot of thought into making this API easy for developers to use.
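For developers who want to see the shape of the pattern being described, here is a minimal sketch using the Chat Completions tools format in Python; `get_nearby_restaurants` and its schema are invented for illustration, not part of the actual demo.

```python
import json
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "get_nearby_restaurants",  # made-up tool for illustration
        "description": "Look up restaurants near a given venue",
        "parameters": {
            "type": "object",
            "properties": {"venue": {"type": "string"}},
            "required": ["venue"],
        },
    },
}]

resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What can I eat near Fort Mason?"}],
    tools=tools,
)

# If the model decides to use the tool, it replies with a tool call instead of
# prose; your code runs the lookup, appends the result as a "tool" message,
# and calls the API again so the model can answer in natural language.
call = resp.choices[0].message.tool_calls[0]
print(call.function.name, json.loads(call.function.arguments))
```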
But while we're on OpenAI, you know, the sides of their tech, there's been some news about like internal changes too. Didn't they say they're moving away from being a nonprofit? They did. And it's got everyone talking. It's a major shift. And it's only natural for people to wonder how that'll change things for OpenAI in the future. I mean, there are definitely some valid questions about this move to for-profit. Like, will they have more money for research now? Probably.
But will they care as much about making sure AI benefits everyone? Yeah, that's the big question, especially with all the like the leadership changes happening at OpenAI too, right? I read that their chief research officer left and their VP of research and even their CTO. It's true. A lot of people are connecting those departures with the changes in OpenAI structure. And I guess it makes you wonder what's going on behind the scenes.
But they are still putting out new stuff. Like this whole fine tuning thing really caught my eye. Right. Fine tuning. It's essentially taking a pre-trained AI model and like customizing it. So instead of a general AI, you get one that's tailored for a specific job. Exactly. Exactly.
And that opens up so many possibilities, especially for businesses. Imagine you could train an AI on your company's data, you know, like how you communicate your brand guidelines. So it's like having an AI that's specifically trained for your company? That's the idea. And they're doing it with images now too, right? Fine-tuning with vision is what they called it. It's pretty incredible what they're doing with that, especially in fields like medicine. Like using AI to help doctors make diagnoses. Exactly. And AI could be trained on...
like thousands of medical images, right? And then it could potentially spot things that even a trained doctor might miss. That's kind of scary to be honest. What if it gets it wrong?
Well, the idea isn't to replace doctors, but to give them another tool, you know, help them make better decisions. Okay, that makes sense. But training these AI models must be really expensive. It can be. All those tokens add up. But OpenAI announced something called automatic prompt caching. Automatic what now? I don't think I came across that. So basically, if your AI sees a prompt that it's already seen before, OpenAI will give you a discount. Huh.
Like a frequent buyer program for AI. Kind of, yeah. It's good that they're trying to make it more affordable. And they're also doing something called model distillation. Okay, now you're just using big words to sound smart. What's that?
Think of it like a recipe, right? You can take a really complex recipe and break it down to the essential parts. Make it simpler, but it still tastes the same. Yeah. And that's what model distillation is. You take a big, powerful AI model and create a smaller, more efficient version. So it's like lighter weight, but still just as capable. Exactly. And that means more people can actually use these powerful tools. They don't need like...
So they're making AI more accessible. That's great. It is. And speaking of powerful tools, they also talked about their new O1 model. That's the one they've been hyping up, the one that's supposed to be this big leap forward. Yeah, O1. It sounds pretty futuristic. Like from what I read, it's not just a bigger, better language model. Right. It's a different approach. They're saying it can actually reason, right? Think differently.
It's trained differently. They used reinforcement learning with O1. So it's not just finding patterns in the data it's seen before. Not just that. It can actually learn from its mistakes, get better at solving problems.
So give me an example. What can O1 do that, say, GPT-4 can't? Well, OpenAI showed it doing some pretty impressive stuff with math, like advanced math. Yeah. And coding, too. Complex coding. Things that even GPT-4 struggled with. So you're saying if I needed to, like, write a screenplay, I'd stick with GPT-4. But if I wanted to solve some crazy physics problem, O1 is what I'd use. Something like that, yeah. Although there is a tradeoff. O1 takes a lot more power to run.
And it takes longer to get those impressive results. Makes sense. More power, more time, higher quality. Exactly. It sounds like it's still in development though, right? Is there anything else they're planning to add to it? Oh, yeah. They mentioned system prompts, which will let developers set some ground rules for how it behaves.
and they're working on adding structured outputs and function calling. - Wait, structured outputs? Didn't we just talk about that? - We did. That's the thing where the AI's output is formatted in a way that's easy to use, like JSON. - Right, right. So you don't have to spend all day trying to make sense of what it gives you. It's good that they're thinking about that stuff. - It's about making these tools usable. And speaking of that, Dev Day finished up with this really interesting talk. Sam Altman, the CEO of OpenAI,
and Kevin Weil, their new chief product officer. They talked about like the big picture for AI. Yeah, they did, didn't they? Anything interesting come up? Well, Altman talked about moving past this whole AGI term, artificial general intelligence. I can see why. It's kind of a loaded term, isn't it? He thinks it's become a bit of a buzzword and people don't really understand what it means. So are they saying they're not trying to build AGI anymore? It's more like they're saying they're focused on just making AI better.
Constantly improving it, not worrying about putting it in a box. That makes sense. Keep pushing the limits. Exactly. But they were also very clear about doing it responsibly. They talked a lot about safety and ethics. Yeah, that's important. They said they were going to be very careful about how they release new features. Good, because this stuff is powerful. It is. It was a lot to take in, this whole Dev Day event. New tools, big changes at OpenAI.
and these big questions about the future of AI. It was. But hopefully this deep dive helped make sense of some of it. At least that's what we try to do here. Absolutely. Thanks for taking the deep dive with us. The biggest demo of the new real-time API involved function calling with voice mode and buying chocolate-covered strawberries from our friendly local OpenAI developer experience engineer and strawberry shop owner, Ilan Biggio.
We'll first play you the audio of his demo and then go into a little interview with him. Fantastic. Could you place a call and see if you could get us 400 strawberries delivered to the venue? But please keep that under $1,500. We'll get those strawberries delivered for you. Hello? Is this Ilan's? I'm Romain's AI assistant, calling about an order. Fantastic. Could you tell me what flavors? Yeah, we have chocolate, vanilla, and we have peanut butter.
Are you sure you want 400? Yes, 400 chocolate-covered strawberries. How much would that be? I think that'll be around like $1,415.92. 400 chocolate-covered strawberries. Great, where would you like that delivered? Please deliver them to the Gateway Pavilion in Fort Mason. Okay, sweet. So just to confirm, you want 400 chocolate-covered strawberries to the Gateway Pavilion.
When can we expect delivery? Well, you guys are right nearby, so it'll be like, I don't know, 37 seconds? Cool, you too.
Hi Ilan, welcome to Latent Space. Thank you. I just saw your amazing demos, had your amazing strawberries. You are dressed up exactly like a strawberry salesman. Gotta have it all. What was building the demo like? What was the story behind the demo? It was really interesting. This is actually something I had been thinking about for months before the launch. Like having an AI that can make phone calls is something I've personally wanted for a long time. And so as soon as we launched internally, I started hacking on it.
And then that sort of just made it into an internal demo. And then people found it really interesting. And then we thought, how cool would it be to have this on stage as one of the demos? Yeah. Would you call out any technical issues building? You were basically one of the first people ever to build with the voice mode API. Would you call out any issues integrating it with Twilio like that, like you did with function calling, with a form filling element? I noticed that you had like--
intents of things to fulfill, and then
When you're still missing info, the voice would prompt you, role-playing the store guy. Yeah, yeah. So I think technically there's like the whole just working with audio and streams is a whole different piece. Like even separate from like AI and this like new capabilities, it's just tough. Yeah, when you have a prompt, conversationally it'll just follow like the, it was set up like kind of step by step to like ask the right questions based on like what the request was, right?
The function calling itself is sort of tangential to that. You have to prompt it to call the functions, but then handling it isn't too much different from what you would do with assistant streaming or chat completion streaming. I think the API feels very similar just to if everything in the API was streaming, it actually feels quite familiar to that.
And then function calling wise, I mean, does it work the same? I don't know. Like I saw a lot of logs. You guys showed like in the playground a lot of logs. What is in there? What should people know? Yeah, I mean, it is like the events...
may have different names than the streaming events that we have in Chat Completions, but they represent very similar things. It's things like, you know, function call started, arguments started, here are argument deltas, and then function call done. Conveniently, we send one that has the full function call, and then I just use that. Nice. Yeah. And then what restrictions should people be aware of? Like, you know, I think...
Before we recorded, we discussed a little bit about the sensitivities around basically calling random store owners and putting like an AI on them. Yeah. So, I think there's recent regulation on that, which is why we want to be like very, I guess, aware of you can't just call anybody with AI, right? That's like just robocalling, you wouldn't want someone just calling you with AI. Yeah. So, I'm a developer, I'm about to do this on random people. Yeah. What laws am I about to break?
I forget what the governing body is, but I think having consent from the person you're about to call always works, right? I, as the strawberry owner, have consented to getting called with AI. I think past that, you want to be careful. Definitely individuals are more sensitive than businesses. I think businesses, you have a little bit more leeway. Also, businesses, I think, have an incentive to want to receive AI phone calls, especially if, like,
they're dealing with it. It's doing business, right? Like it's more business. It's kind of like getting on a booking platform, right? You're exposed to more, but I think it's still very much a gray area. And so I think everybody should tread carefully, like figure out what it is. The law is so recent I didn't have enough time to dig in, and I'm also not a lawyer. Yeah. Okay, cool. Fair enough. One other thing. This is kind of agentic. Did you use a state machine at all? Did you use any framework?
No. No. You stick it in context and then just run it in a loop until it ends the call? Yeah. There isn't even a loop, like,
Because the API is just based on sessions, it's always just going to keep going. Every time you speak, it'll trigger a call. And then after every function call, it was also invoking a generation. And so that is another difference here. It's inherently almost in a loop just by being in a session. No state machines needed. I'd say this is very similar to the notion of routines, where it's just a list of steps and it's
like sticks to them softly, but usually pretty well. - And the steps is the prompts. - The steps, it's like the prompt,
Like the steps are in the prompt. It's like step one do this, step two do that. What if I want to change the system prompt halfway through the conversation? You can. To be honest, I have not played with that too much. But I know you can. Awesome. I noticed that you called it the real-time API but not the voice API. So I assume that it's like the real-time API starting with voice. I think that's what he said on the thing. I can't imagine, like, what else is real-time? Well, yes.
To use ChatGPT's voice mode as an example, we've demoed the video, real-time image. So I'm not actually sure what timelines are, but I would expect, if I had to guess, that that is probably the next thing that we're going to be making. You'd probably have to talk directly with a team building this. Sure. You can't promise their timelines. Yeah, right. Exactly. But given that this is the features that exist that we've demoed on ChatGPT, that's fine.
Yeah. There will never be a case where there's like a real-time text API, right? Well, this is a real-time text API. You can do text only on this. Oh. Yeah. I don't know why you would. But it's actually, so text-to-text here doesn't,
quite make a lot of sense. I don't think you'll get a lot of latency gain. But, like, speech-to-text is really interesting because you can prevent responses, like audio responses, and force function calls. And so you can do stuff like UI control that is, like, super, super reliable. We had a lot of, like, you know, like,
we weren't sure how well this was going to work, because you have a voice answering, it's like a whole persona, right? That's a little bit more risky. But if you cut out the audio outputs and make it so it always has to output a function, you can end up with pretty reliable commands, like a command architecture. Yeah. Actually, that's the way I want to interact with a lot of these things as well, like one-sided voice. Yeah. You don't necessarily want to hear the voice back. Okay.
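As a rough sketch of that "command architecture": a Realtime session configured for audio in but text and function calls out, with a forced tool choice. The event and field names follow the beta docs as I understand them, and `ui_command` is a made-up function, so treat this as an assumption-laden illustration rather than the demo's actual code.

```python
import json

# Session config for "voice in, functions out": no audio output, and every
# response must be a tool call (field names assumed from the Realtime beta docs).
session_update = {
    "type": "session.update",
    "session": {
        "modalities": ["text"],        # drop audio output so the model never "talks back"
        "tool_choice": "required",     # force a function call on every response
        "tools": [{
            "type": "function",
            "name": "ui_command",      # hypothetical function that drives the UI
            "description": "Perform an action in the on-screen UI",
            "parameters": {
                "type": "object",
                "properties": {"action": {"type": "string"}},
                "required": ["action"],
            },
        }],
    },
}
# Send this as JSON over an open Realtime WebSocket, e.g.:
# await ws.send(json.dumps(session_update))
```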
And sometimes, yeah. I think having an output voice is great, but I feel like I don't always want to hear an output voice. I'd say usually I don't. But yeah, exactly. Being able to speak to it is super smooth. Cool. Do you want to comment on any other stuff that you announced? Prompt caching, I noticed, was one.
I like the no code change part. I'm looking forward to the docs because I'm sure there's a lot of details on what you cache, how long you cache. Because Anthropic caches were like five minutes. I was like, okay, but what if I don't make a call every five minutes? Yeah, to be super honest with you, I've been so caught up with the real-time API and making the demo that I haven't read up on the other launches too much. I mean, I'm aware of them, but I think I'm excited to see how all distillation...
works. That's something that we've been doing, like, I don't know, I've been doing it between our models for a while and I've seen really good results. Like, I've done back in the day, like, from GPT-4 to GPT-3.5 and got, like, pretty much the same level of function calling with hundreds of functions. So that was super, super compelling. So I feel like easier distillation, I'm really excited for.
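What Ilan describes is easy to sketch by hand, even before the new stored-completions tooling: capture a large "teacher" model's outputs as training examples, then fine-tune a smaller "student" on them. The model names and prompts below are placeholders, and this is not the distillation product itself, just the underlying loop.

```python
import json
from openai import OpenAI

client = OpenAI()
prompts = ["Summarize: ...", "Classify: ..."]  # in practice, your real traffic

# 1) Capture the teacher model's outputs as chat-format training examples.
with open("distill.jsonl", "w") as f:
    for p in prompts:
        teacher = client.chat.completions.create(
            model="gpt-4o",  # the large "teacher" model
            messages=[{"role": "user", "content": p}],
        )
        f.write(json.dumps({"messages": [
            {"role": "user", "content": p},
            {"role": "assistant", "content": teacher.choices[0].message.content},
        ]}) + "\n")

# 2) Fine-tune the smaller "student" model on the teacher's outputs.
training_file = client.files.create(file=open("distill.jsonl", "rb"), purpose="fine-tune")
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-mini-2024-07-18",  # assumed student snapshot name
)
print(job.id)
```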
I see. Is it a tool? So I saw evals. Yeah. Like, what is the distillation product? It wasn't super clear, to be honest. I think I want to let that team talk about it. Well, I appreciate you jumping on. Yeah, of course. Amazing demo. It was beautifully designed. I'm sure that was part of you and Romain. Yeah, I guess shout out to the original creators of Wanderlust, Simon and Carolis. And then like...
I took it and built the voice component and the voice calling components. Yeah, so it's been a big team effort. And then the entire API team for debugging everything as it's been going on. It's been so great working with them. Yeah, you're the first consumers on the DX team. Yeah. I mean, the classic role of what we do there. Yeah. Okay, yeah. Anything else? Any other calls to action? No, enjoy Dev Day. Thank you. Yeah. That's it.
The Latent Space crew then talked to Olivier Godement, head of product for the OpenAI platform, who led the entire Dev Day keynote and introduced all the major new features and updates that we talked about today. Okay, so we are here with Olivier Godement. I can't pronounce French. That's fine. It was perfect. And it was amazing to see your keynote today. What was the backstory of preparing something like this, preparing Dev Day?
Essentially came from a couple of places. Number one, excellent reception from last year's Dev Day. Developers, startups, founders, researchers want to spend more time with OpenAI and we want to spend more time with them as well. And so for us, it was a no-brainer, frankly, to do it again, like a nice conference. The second thing is going global. We've done a few events in Paris and a few other non-American countries. And so this year we're doing SF, Singapore and London to frankly just meet more developers.
Yeah, I'm very excited for the Singapore one. Ah yeah. Will you be there? I don't know. I don't know if I got an invite. No. Actually, I can just talk to you. Yeah, and then there was some speculation around October 1st. Is it because of o1, October 1st? It has nothing to do with that. I discovered the tweet yesterday; people are so creative. No, there was no connection between o1 and October 1st. But in hindsight, that would have been a pretty good meme. Okay.
Yeah, and I think OpenAI's outreach to developers is something that I felt the lack of in 2022, when people were trying to build on ChatGPT and there was no function calling, all that stuff that you talked about in the past. And that's why I started my own conference as like, here's our little developer conference thing. But to see this OpenAI Dev Day now and to see so many
developer-oriented products coming out of OpenAI, I think it's really encouraging. Yeah, totally. That's what I said essentially, like developers are basically
the people who make the best connection between the technology and the future, essentially. Essentially, see a capability, see a low-level technology, and are like, "Hey, I see how that application or that use case can be enabled." And so in the direction of enabling AGI for all of humanity, it's a no-brainer for us to partner with Devs.
And most importantly, you almost never had waitlists, which other releases usually have.
You had prompt caching, you had the real-time voice API. Sean did a long Twitter thread so people know the releases. Yeah. What is the thing that was sneakily the hardest to actually get ready for that day? Or in the last 24 hours, was there anything that you didn't know was going to work? Yeah. They're all fairly, I would say, involved features to ship. So the team has been working for months, all of them.
The one which I would say is the newest for OpenAI is the real-time API, for a couple of reasons. I mean, one, you know, it's a new modality. Second, it's the first time that we have an actual WebSocket-based API. And so I would say that's the one that required the most work over the months to get right from a developer perspective, and to also make sure that our existing safety mitigations work well with real-time audio in and audio out.
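For concreteness, here is a minimal sketch of opening a Realtime session over that WebSocket from Python. The endpoint, headers, and event names are taken from the beta docs as I remember them, so double-check them before relying on this.

```python
import asyncio
import json
import os

import websockets

URL = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"  # assumed beta endpoint
HEADERS = {
    "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
    "OpenAI-Beta": "realtime=v1",
}

async def main():
    # Note: older `websockets` versions call this kwarg extra_headers,
    # newer ones additional_headers; adjust for your installed version.
    async with websockets.connect(URL, extra_headers=HEADERS) as ws:
        # Ask for a text response (audio works the same way, it just streams
        # base64 audio deltas instead of text deltas).
        await ws.send(json.dumps({
            "type": "response.create",
            "response": {"modalities": ["text"], "instructions": "Say hello."},
        }))
        async for raw in ws:
            event = json.loads(raw)
            if event["type"] == "response.text.delta":
                print(event["delta"], end="", flush=True)
            elif event["type"] == "response.done":
                break

asyncio.run(main())
```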
What are the design choices that you want to highlight? I think for me, WebSockets, you just receive a bunch of events, it's two-way. I obviously don't have a ton of experience. I think a lot of developers are going to have to embrace this real-time programming. What are you designing for or what advice would you have for developers exploring this? The core design hypothesis was essentially how do we enable human level latency?
We did a bunch of tests; on average, human beings take something like 300 milliseconds to respond to each other in conversation. And so that was the design principle, essentially, working backwards from that and making the technology work. And so we evaluated a few options, and WebSockets was the one that we landed on. So that was one design choice. A few other big design choices that we had to make: on prompt caching, the design target was automated from the get-go, zero code change from the developer. That way, you don't have to learn what the prompt prefix is or how long the cache lasts; we just do it as much as we can, essentially. So that was a big design choice as well. And then finally, on distillation and evaluation, the big design choice was something I loved at Stripe, my previous job: a philosophy around the pit of success. What is essentially the minimum number of steps for the majority of developers to do the right thing? Because when you do evals and fine-tuning, there are many, many ways to mess it up, frankly, and end up with a crappy model, or evals that tell a wrong story. And so our whole design was, OK, we actually care about helping people who don't have that much experience, with even a very cheap model, get to a good spot in a few minutes. And so how do we essentially enable that pit of success in the product flow? Yeah.
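To make the zero-code-change caching point above concrete: automatic caching keys off a shared prompt prefix, so the main thing a developer controls is ordering, keeping the long static parts first and the per-request content last. A minimal sketch, with a placeholder system prompt and model name:

```python
from openai import OpenAI

client = OpenAI()

# Long, stable prefix: identical on every call, so it is the part that can be cached.
STATIC_SYSTEM_PROMPT = "You are a support agent for Acme Corp. Follow these policies: ..."

def answer(ticket_text: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": STATIC_SYSTEM_PROMPT},  # shared prefix -> cacheable
            {"role": "user", "content": ticket_text},             # varies per request, keep it last
        ],
    )
    return resp.choices[0].message.content
```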
I'm a little bit scared to fine-tune, especially for vision because I don't know what I don't know for stuff like vision. For text, I can evaluate pretty easily. For vision, let's say I'm trying to... One of your examples was Grab, which is very close to home. I'm from Singapore. I think your example was they identified stop signs better.
Why is that hard? Why do I have to fine tune that? If I fine tune that, do I lose other things? There's a lot of unknowns with Vision that I think developers have to figure out. For sure. Vision is going to open up like a new, I would say, evaluation space.
Because you're right, it's harder to tell correct from incorrect, essentially, with images. What I can say is that we've been alpha testing the vision fine-tuning for several weeks at that point. We are seeing even higher performance uplift compared to text fine-tuning.
So there is something here. We've been pretty impressed, in a good way frankly, by how well it works. But for sure, I expect developers who are moving from one modality to two, text and images, will have more testing and evaluation to set in place to make sure it works well. The model distillation and evals are definitely the most interesting part of moving away from just being a model provider to being a platform provider. How should people think about
being the source of truth? Like, do you want OpenAI to be like the system of record of all the prompting? Because people sometimes store it in like different data sources. And then is that going to be the same as the models evolve? So you don't have to worry about, you know, refactoring the data or like things like that or like future model structures. The vision is if you want to be a source of truth, you have to earn it, right? Like we're not going to force people like to pass us data if there is no value prop like, you know, for us to store the data. The vision here is
At the moment, most developers use a one-size-fits-all model, the off-the-shelf GPT-4o, essentially. The vision we have is, fast forward a couple of years, I think most developers will essentially have an automated, continuous, fine-tuned model.
The more you use the model, the more data you pass to the model provider, the model is automatically fine-tuned and evaluated against some eval sets. And essentially, you don't have to go online and try a few new things every month when there is a new snapshot. That's the direction.
We are pretty far away from it. But I think the evaluation and distillation products are essentially a good first step in that direction. It's like, "Hey, if you are excited by the direction and you give us evaluation data, we can quickly log your completion data and start to do some automation on your behalf." Then you can do evals for free if you share data with OpenAI? Yeah.
How should people think about when it's worth it, when it's not? Sometimes people get overly protective of their data when it's actually not that useful. But how should developers think about when it's right to do it, when not, or if you have any thoughts on it? The default policy is still the same: we don't train on any API data unless you opt in. What we've seen from feedback is that evaluation can be expensive. If you run an o1 eval on thousands of samples, your bill will get increased pretty significantly.
That's problem statement number one. Problem statement number two is essentially I want to get to a world where whenever OpenAI ships a new model snapshot,
we have full confidence that there is no regression for the tasks that developers care about. And for that to be the case, essentially we need to get evals. And so that's essentially sort of a two birds, one stone situation. We subsidize basically the evals, and we also use the evals when we ship new models to make sure that we keep going in the right direction. So in my sense, it's a win-win. But again, completely opt-in. I expect that many developers will not want to share their data, and that's perfectly fine to me.
I think free evals, though, are a very, very good incentive. I mean, it's a fair trade: you get data, we get free evals. Exactly, and we sanitize PII, everything. We have no interest in the actual sensitive data; we just want to have good evaluation on the real use cases. I almost want to eval the eval, I don't know if that ever came up. Sometimes the evals themselves are wrong, and there's no way for me to tell you.
Everyone who is starting with LLM, tinkering with LLM, is like, "Yeah, evaluation, easy. I've done testing all my life." And then you start to actually build evals, understand all the corner cases, and you realize, "Wow, there's a whole field itself." So yeah, good evaluation is hard. And so, yeah.
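As a tiny illustration of why this gets hard, here is the kind of code-based eval loop that tools like Braintrust formalize; the cases, model name, and exact-match grader below are placeholders, and real evals usually need fuzzier or model-based grading.

```python
from openai import OpenAI

client = OpenAI()

# Hand-written cases: (prompt, expected answer). In practice these come from
# real traffic and careful labelling, which is where the hard work lives.
cases = [
    ("What is 2 + 2?", "4"),
    ("What is the capital of France?", "Paris"),
]

def grade(output: str, expected: str) -> bool:
    # Simplest possible rule: the expected string appears in the output.
    return expected.lower() in (output or "").lower()

passed = 0
for prompt, expected in cases:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        messages=[{"role": "user", "content": prompt}],
    )
    passed += grade(resp.choices[0].message.content, expected)

print(f"{passed}/{len(cases)} cases passed")
```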
But I think there's a, you know, I just talked to Braintrust, which I think is one of your partners. They also emphasize code-based evals versus your sort of low-code. What I see is like, I don't know, maybe there's some more that you didn't demo, but what I see is kind of like a low-code experience, right, for evals. Would you ever support like a more code-based, like would I run code on...
on OpenAI's eval platform? - For sure. I mean, we meet developers where they are. At the moment, the demand was more for easy to get started, like eval. But if we need to expose an evaluation API, for instance, for people to pass their existing test data, we'll do it. So yeah, there is no philosophical, I would say, misalignment on that. - Yeah, yeah, yeah. What I think this is becoming, by the way, and it's basically like you're becoming AWS, like the AI cloud.
And I don't know if that's a conscious strategy or it's like... It doesn't even have to be a conscious strategy. You're going to offer storage, you're going to offer compute, you're going to offer networking. I don't know what networking looks like. Networking is maybe like caching. It's a CDN. It's a prompt CDN. But it's the AI versions of everything. Do you see the analogies? Yeah, totally. Whenever I talk to developers, I feel like
Good models are just half of the story to build a good app. There's a ton more you need to do. Evaluation is the perfect example. You can have the best model in the world, but if you're in the dark, it's really hard to build confidence. And so our philosophy is the whole software development stack is being basically reinvented with LLMs.
there is no freaking way that OpenAI can build everything. There is just too much to build, frankly. And so my philosophy is essentially we'll focus on the tools which are the closest to the model itself. So that's why you see us investing quite a bit in fine-tuning, distillation, evaluation, because we think that it actually makes sense to have all of that in one spot. There is some sort of virtuous circle, essentially, that you can set in place. But stuff like, you know,
LLM Ops, like tools which are further away from the model, I don't know. If you want to do super elaborate prompt management or tooling, I'm not sure OpenAI has such a big edge, frankly, to build this sort of tool. So that's how we view it at the moment.
But again, frankly, the philosophy is super simple. The strategy is super simple. It's meeting developers where they want us to be. And so that's frankly day in, day out, what I try to do. Cool. Thank you so much for the time. I'm sure you're going to... Yeah, I have more questions on... A couple questions on voice and then also your call to action, what you want feedback on, right? I think we should spend a bit more time on voice because I feel like that's the big splash thing. I talked... Well, I mean, just like...
What is the future of real-time for OpenAI? Because I think obviously video is next, you already have it in the ChatGPT desktop app. Do we just have a permanent live connection? Are developers just going to be sending sockets back and forth with OpenAI? How do we program for that? What is the future? Yeah, that makes sense. I think with multi-modality, real-time is quickly becoming essentially the right experience to build an application.
So my expectation is that we'll see a non-trivial volume of applications moving to the real-time API. If you zoom out, audio is a simple example. Until basically now, audio
on the web, in apps, was basically very much like a second-class citizen. Like you basically did like an audio chatbot for users who did not have a choice. You know, they were like struggling to read or I don't know, they were like not super educated with technology. And so, frankly, it was like the crappier option, you know, compared to text. But when you talk to people in the real world, the vast majority of people like prefer to talk and
listen instead of typing and writing. We speak before we write. Exactly. I don't know. I mean, I'm sure it's the case for you in Singapore. For me, my friends in Europe, the number of WhatsApp voice notes I receive every day, I mean, just people. It makes sense, frankly. Chinese. Chinese, yeah. Yeah, all voice. It's easier. There is more emotion. I mean, you get the point across pretty well. And so, my personal ambition for the real-time API and audio in general
is to make audio and multimedia truly a first-class experience. If you're the amazing, super bold startup out of YC that's going to build the next billion-user application, make it truly audio-first and make it feel like an actual good product experience.
So that's essentially the ambition and I think it could be pretty big. I think one issue that people have with the voice so far as released in the advanced voice mode is the refusals. You guys had a very inspiring model spec, I think Joanne worked on that, where you said like, yeah, we don't want to overly refuse all the time. In fact, even if not safe for work, in some occasions it's okay. Yeah.
Is there a way in the API that we can say not safe for work is okay? I think we'll get there. The model spec nailed it. It nailed it, it's so good. Yeah, we are not in the business of policing whether you can say vulgar words or whatever. There are some use cases, like I'm writing a Hollywood script and I want to use vulgar words; it's perfectly fine. And so I think the direction we'll go here is that basically there will always be a set
of behaviors that we'll just forbid, frankly, because they're illegal or against our terms of service. But then there will be some more risky themes which are completely legal, like vulgar words or not-safe-for-work stuff, where basically we'll expose controllable safety knobs in the API to allow you to say, "Hey, that theme okay, that theme not okay. How sensitive do you want the threshold to be on safety refusals?"
I think that's the direction. So a safety API. Yeah, in a way, yeah. Yeah, we've never had that. Yeah. Because right now, it's whatever you decide and then that's it. That would be the main reason I don't use OpenAI Voice is because of over-refusals. Yeah, yeah, yeah. No, we've got to fix that. Like singing. We're trying to do voice karaoke. So I'm a singer and you lock off singing. Yeah, yeah, yeah.
But I understand music gets you in trouble. Okay, yeah, so just generally, what do you want to hear from developers? We have all developers watching. What feedback do you want? Anything specific as well, especially from today. Anything that you are unsure about, where our feedback could really help you decide. For sure. I think essentially the open-ended directions have become pretty clear after today:
investment in reasoning, investment in multi-modality, investment as well in, I would say, tool use, like function calling. To me, the biggest question I have is, where should we put the cursor next?
I think we need all three of them, frankly. So we'll keep pushing. Hire 10,000 people. Actually, no need. Build a bunch of bots. Exactly. And so, let's take O1 for instance. Is O1 smart enough for your problems? Let's set aside for a second the existing models. For the apps that you would love to build, is O1 basically it in reasoning or do we still have a step to do? Preview is not enough. I need the full one. Yeah.
So that's exactly the sort of feedback. Essentially what I would love to do is for developers, I mean there's a thing that Sam has been saying over and over again, it's easier said than done, but I think it's directionally correct. As a developer, as a founder, you basically want to build an app which is a bit too difficult for the model today, right? Like what you think is right, it's sort of working, sometimes not working,
And that way, that basically gives us a goal post and be like, "Okay, that's what you need to enable with the next small release in a few months." And so, I would say that usually that's the sort of feedback which is the most useful that I can directly incorporate. Awesome. I think that's our time.
Thank you so much, guys. Yeah, thank you so much. Thank you. We were particularly impressed that Olivier addressed the not safe for work moderation policy question head on, as that had only previously been picked up on in Reddit forums. This is an encouraging sign that we will return to in the closing candor with Sam Altman at the end of this episode.
Next, a chat with Romain Huet, friend of the pod, AI Engineer World's Fair closing keynote speaker, and head of developer experience at OpenAI, on his incredible live demos and advice to AI engineers on all the new modalities.
Alright, we're live from OpenAI Dev Day. We're with Romain, who just did two great demos on stage and has been a friend of Latent Space. So thanks for taking some of the time. Of course, yeah. Thank you for being here and spending your time with us today. Yeah, I appreciate it. I appreciate you guys putting this on. I know it's like extra work, but it really shows the developers that you care about reaching out.
Yeah, of course. I think when you go back to the OpenAI mission, I think for us it's super important that we have the developers involved in everything we do, making sure that they have all of the tools they need to build successful apps. And we really believe that the developers are always going to invent the ideas, the prototypes, the fun factors of AI that we can't build ourselves. So it's really cool to have everyone here.
We had Michelle from you guys on. Yes, great episode. Thank you. She very seriously said API is the path to AGI. Correct. And people in our YouTube comments were like,
API is not AGI. I'm like, no, she's very serious. API is the path to AGI because you're not going to build everything like the developers are, right? Of course, yeah. That's the whole value of having a platform and an ecosystem of amazing builders who can in turn create all of these apps. I'm sure we talked about this before, but there's now more than three million developers building on OpenAI. So, it's pretty exciting to see all of that energy into creating new things.
I was going to say, you built two apps on stage today, an International Space Station tracker and then a drone. The hardest thing must have been opening Xcode and setting that up. Now, the models are so good that they can do everything else. You had two modes of interaction: you had kind of like the ChatGPT app to get the plan, with o1, and then you had Cursor to apply some of the changes. How should people think about the best way to consume the coding models, especially both for
brand new projects and then existing projects that they're trying to modify? Yeah. I mean, one of the things that's really cool about o1-preview and o1-mini being available in the API is that you can use them in your favorite tools like Cursor, like I did, right? And that's also what, like, Devin from Cognition can use in their own software engineering agents.
In the case of Xcode, it's not quite deeply integrated in Xcode, so that's why I had ChatGPT side by side. But it's cool, right? Because I could instruct o1-preview to be my coding partner and brainstorming partner for this app, but also consolidate all of the files and architect the app the way I wanted. So all I had to do was just port the code over to Xcode, and the app built zero-shot. I don't think I conveyed, by the way, how big a deal that is, but you can now create an iPhone app
from scratch, describing a lot of intricate details that you want, and your vision comes to life in like a minute. It's pretty outstanding. I have to admit I was a bit skeptical, because if I open up Xcode, I don't know anything about iOS programming. You know which file to paste it in. You probably set it up a little bit. So I'm like, I have to go home and test it to figure it out, and I need the ChatGPT desktop app so that it can tell me where to click. Yeah, I mean, like,
Xcode and iOS development has become easier over the years since they introduced Swift and SwiftUI. I think back in the days of Objective-C or like the storyboard, it was a bit harder to get in for someone new. But now with Swift and SwiftUI, their dev tools are really exceptional. But now when you combine that with O1 as your brainstorming and coding partner, it's like your architect effectively. That's the best way I think to describe O1. People ask me like, "Can GPT-4 do some of that?"
And it certainly can, but I think it will just start spitting out code, right? And I think what's great about O1 is that it can make up a plan. In this case, for instance, the iOS app had to fetch data from an API. It had to look at the docs. It had to look at how do I parse this JSON? Where do I store this thing? And kind of wire things up together. So that's where it really shines. Is Mini or Preview the better model that people should be using? Oh, good. Yeah.
I think people should try both. We're obviously very excited about the upcoming o1 that we shared the evals for. But we noticed that o1-mini is very, very good at everything math, coding, everything STEM. If for your brainstorming or your science part you need some broader knowledge, then reaching for o1-preview is better.
But yeah, I used o1-mini for my second demo and it worked perfectly. All I needed was very much something rooted in code: architecting and wiring up a front end, a back end, some UDP packets, some WebSockets, something very specific, and it did that perfectly. And then maybe just talking about voice and Wanderlust, the app that keeps on giving. It does indeed, yeah. What's the backstory behind preparing for all of that?
You know, it's funny because when last year for Dev Day, we were trying to think about what could be a great demo app to show like an assistive experience. I've always thought travel is a kind of a great use case because you have like pictures, you have locations, you have the need for translations potentially. There's like so many use cases that are bounded to travel that I thought last year, let's use a travel app and that's how Wanderlust came to be. But of course, a year ago, all we had was a text-based assistant.
And now we thought, well, if there's a voice modality, what if we just bring this app back as a wink? And what if we were interacting better with voice? And so with this new demo, what I showed was the ability to have a complete conversation in real time with the app. But also, the thing we wanted to highlight was the ability to call tools and functions, right? So in this case, we placed a phone call using the Twilio API interfacing with our AI agents.
but developers are so smart that they'll come up with so many great ideas that we could not think of ourselves, right? What if you could have, you know, a 911 dispatcher? What if you could have a customer service center that is much smarter than what we've been used to today? There are going to be so many use cases for real-time. It's awesome. Yeah, and actually, this should kill phone trees,
Like there should not be like dial one. Of course. Para español, you know. Yeah, exactly. I mean, even you starting speaking Spanish would just do the thing, you know. You don't even have to ask. So yeah, I'm excited for this future where we don't have to interact with those legacy systems. Yeah, yeah. Is there anything, so you're doing function calling in a streaming environment. So basically it's WebSockets, it's UDP, I think. Yeah.
It's basically not guaranteed to be exactly once delivery. Is there any coding challenges that you encountered when building this? Yeah, it's a bit more delicate to get into it. We also think that for now what we ship is a beta of this API. I think there's much more to build onto it.
It does have the function calling and the tools, but we think that, for instance, if you want to have something very robust on your client side, maybe you want to have WebRTC as a client, as opposed to directly working with the sockets at scale. So that's why we have partners like LiveKit and Agora if you want to use them. And I'm sure we'll have many more in the future.
But yeah, we keep on iterating on that, and I'm sure the feedback of developers in the weeks to come is going to be super critical for us to get it right. Yeah, I think LiveKit has been fairly public that they are used in the ChatGPT app.
Like, is it just all open source and we just use it directly with OpenAI or do we use LiveKit Cloud or something? So right now we released the API, we released some sample code also and reference clients for people to get started with our API. And we also partnered with LiveKit and Agora so they also have their own like ways to help you get started that plugs naturally with the real-time API.
So depending on the use case, people can decide what to use. If you're working on something that's completely client-side, or if you're working on something on the server side, for the voice interaction, you may have different needs. So we want to support all of those. I know you've got to run. Is there anything that you want the AI engineering community to get feedback on specifically? Like, even down to, like, you know, a specific API endpoint or, like...
What's the thing that you want? Yeah, I mean, if we take a step back, I think Dev Day this year is a little different from last year and in a few different ways. But one way is that we wanted to keep it intimate, even more intimate than last year. We wanted to make sure that the community is on the spotlight. That's why we have community talks and everything.
And the takeaway here is like learning from the very best developers and AI engineers. And so, you know, we want to learn from them. Most of what we ship this morning, including things like prompt caching, the ability to generate prompts quickly in the playground, or even things like vision fine tuning. These are all things that developers have been asking of us. And so the takeaway I would leave them with is to say like, hey, the roadmap that we're working on is heavily influenced by them and their work. And so we love feedback.
From high-level feature requests, as you say, down to very intricate details of an API endpoint, we love feedback. So yes, that's how we build this API. Yeah, I think the model distillation thing as well, it might be the most boring, but actually used a lot. True, yeah. And I think maybe the most unexpected, right? Because I think if I read Twitter correctly the past few days,
A lot of people were expecting us to ship the real-time API for speech-to-speech. I don't think developers were expecting us to have more tools for distillation.
And we really think that's going to be a big deal, right? If you're building apps where you want low latency and low cost, but high performance and high quality on the use case, distillation is going to be amazing. Yeah, I sat in the distillation session just now and they showed how they distilled from GPT-4o to GPT-4o mini, and it was only like a 2% hit in performance and 15x cheaper. Yeah, I was there as well for the Superhuman use case, inspired by their email client. Yeah, this was really good.
Cool, man. Amazing. Thank you so much, buddy. Thanks again for being here today. It's always great to have you. As you might have picked up at the end of that chat, there were many sessions throughout the day focused on specific new capabilities, like the new model distillation features, combining evals and fine-tuning. For our next session, we are delighted to bring back two former guests of the pod, which is something listeners have been greatly enjoying in our second year of doing the Latent Space podcast.
Michelle Pokrass of the API team joined us recently to talk about structured outputs, and today gave an updated long-form session at Dev Day describing the implementation details of the new structured output mode. We also got her updated thoughts on the voice mode API we discussed in her episode, now that it is finally announced.
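For readers who missed that earlier episode, a minimal sketch of the structured output mode: a JSON Schema response format with strict mode so the reply always parses. The schema and model snapshot here are illustrative, not from her session.

```python
import json
from openai import OpenAI

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-4o-2024-08-06",  # a snapshot that supports structured outputs
    messages=[{"role": "user", "content": "Extract the city and date: 'Dev Day, SF, Oct 1'"}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "event_info",
            "strict": True,  # strict mode constrains decoding to the schema
            "schema": {
                "type": "object",
                "properties": {"city": {"type": "string"}, "date": {"type": "string"}},
                "required": ["city", "date"],
                "additionalProperties": False,
            },
        },
    },
)
print(json.loads(resp.choices[0].message.content))
```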
She is joined by friend of the pod and super blogger Simon Willison, who also came back as guest co-host in our Dev Day 2023 episode. Great, we're back live at Dev Day. Returning guest, Michelle. And then returning guest co-host, for the fourth time?
- Fourth or fifth, yeah, I don't know. - I've lost count. - I've lost count. - It's been a few. - Simon Willison is back. Yeah, we just wrapped everything up. Congrats on getting everything live. Simon did a great live blog, so if you haven't caught up. - I implemented my live blog while waiting for the first talk to start, using, like, GPT-4; it wrote me the JavaScript. And I got that live just in time, and then yeah, I was live blogging the whole day. - Are you a Cursor enjoyer? - I haven't really gotten to Cursor yet, to be honest.
I just haven't spent enough time for it to click, I think. I'm more of copy and paste things out to Claude and ChatGPT. Yeah, it's interesting. I've converted to Cursor for it, and o1 is so easy to just toggle on and off. What's your workflow? Copy, paste, apply. I'm going to be real. I'm still VS Code Copilot.
So, Copilot is actually the reason I joined OpenAI. It was, you know, before ChatGPT, this is the thing that really got me. So, I'm still into it. But I keep meaning to try out Cursor and I think now that things have calmed down, I'm going to give it a real go.
Yeah, it's a big thing to change your tool of choice. Yes. Yeah, I'm pretty dialed. Yeah. I mean, if you want, you can just fork VS Code and make your own. That's the thing to do. It's a done thing, right? Yeah. We talked about doing a hackathon where the only thing you do is fork VS Code and may the best fork win. Nice. That's actually a really good idea.
Yeah, so, I mean, congrats on launching everything today. I know we touched on it a little bit, but everyone was kind of guessing that Voice API was coming and we talked about it in our episode. How do you feel going into the launch? Any design decisions that you want to highlight?
Yeah, super jazzed about it. The team has been working on it for a while. It's like a very different API for us. It's the first WebSocket API. So a lot of different design decisions to be made, like what kind of events do you send? When do you send an event? What are the event names? What do you send on connection versus on future messages? So there have been a lot of interesting decisions there. The team has also hacked together really cool projects as we've been testing it.
One that I really liked is, we had an internal hackathon for the API team, and some folks built a little hack where you could use voice mode to control Vim: you would tell it, "write a file," and it would know all the Vim commands and type those in. So yeah, a lot of cool stuff we've been hacking on, and really excited to see what people build with it.
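To ground the event-naming discussion (and Ilan's earlier description of argument deltas followed by a "done" event), here is a hedged sketch of a handler for Realtime function calls; the event and field names follow the beta docs as I understand them, and `run_tool` is your own dispatcher, stubbed out here.

```python
import json

def run_tool(name: str, args: dict) -> dict:
    # Placeholder dispatcher; replace with real tool implementations.
    return {"ok": True, "tool": name, "args": args}

async def handle_events(ws):
    async for raw in ws:
        event = json.loads(raw)
        if event["type"] == "response.function_call_arguments.delta":
            pass  # arguments stream in as deltas; accumulate them if you want a live display
        elif event["type"] == "response.function_call_arguments.done":
            # The "done" event carries the fully assembled arguments, plus (per the
            # beta docs) the call_id and function name to dispatch on.
            args = json.loads(event["arguments"])
            result = run_tool(event.get("name", ""), args)
            # Return the tool output, then ask the model for another generation.
            await ws.send(json.dumps({
                "type": "conversation.item.create",
                "item": {
                    "type": "function_call_output",
                    "call_id": event["call_id"],
                    "output": json.dumps(result),
                },
            }))
            await ws.send(json.dumps({"type": "response.create"}))
```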
I've got to call out a demo from today. I think it was Katia who had a 3D visualization of the solar system, like a WebGL solar system you could talk to. That is one of the coolest conference demos I've ever seen. That was so convincing. I really want the code. I really want the code for that to get put out there. I'll talk to the team. I think we can probably put it out. Absolutely beautiful example. And it made me realize that the real-time API, this WebSocket API, means that building a website that you can just talk to
is easy now. It's like it's not difficult to build, spin up a web app where you have a conversation with it, it calls functions for different things, it interacts with what's on the screen. I'm so excited about that. There are all of these projects I thought I'd never get to and now I'm like, you know what? Spend a weekend on it. I can have a talk to your database with a little web application. That's so cool. Chat with PDF but really
- Really chat with PDF. - Yeah, exactly. - Not completely. - Totally. And it's not even hard to build. That's the crazy thing about this. Yeah, very cool. Yeah, when I first saw the space demo, I was actually just wowed. And I had a similar moment, I think, to all the people in the crowd. I also thought Romain's drone demo was super cool. - That was a super fun one as well. - Yeah, I actually saw that live this morning and I was holding my breath for sure. Knowing Romain, he probably spent the last two days working on it.
But yeah, I'm curious about -- you were talking with Romain actually earlier about what the different levels of abstraction are with WebSockets. It's something that most developers have zero experience with. I have zero experience with it. Apparently there's the RTC level and then there's the WebSocket level, and there's levels in between. Not so much. I mean, with WebSockets, with the way they've built their API, you can connect directly to the OpenAI WebSocket from your browser. And it's actually just regular JavaScript. You instantiate the WebSocket thing.
It looks quite easy from their example code. The problem is that if you do that, you're sending your API key from source code that anyone can view. Yeah, we don't recommend that for production. So it doesn't work for production, which is frustrating because it means that you have to build a proxy. So I'm going to have to go home and build myself a little WebSocket proxy just to hide my API key. I want OpenAI to do that. I want OpenAI to solve that problem for me so I don't have to build the
1000th WebSocket proxy just for that one problem. Totally. We've also partnered with some partner solutions. We've partnered with, I think, Agora, LiveKit, a few others. So, there's some loose solutions there, but yeah, we hear you. It's a beta.
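For anyone in the same boat, a key-hiding relay really is only a few lines; this is a rough sketch, assuming the `websockets` package (parameter names differ slightly across versions) and the beta Realtime endpoint, not a production-ready proxy.

```python
# Sketch of a key-hiding relay: browsers connect here without a key; the
# server adds the real API key when it dials OpenAI's Realtime endpoint.
# Assumes `pip install websockets`; simplified error handling.
import asyncio
import os
import websockets

UPSTREAM = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"
HEADERS = {
    "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
    "OpenAI-Beta": "realtime=v1",
}

async def relay(client):
    # One upstream connection per browser client; pump messages both ways.
    async with websockets.connect(UPSTREAM, extra_headers=HEADERS) as upstream:
        async def pump(src, dst):
            async for message in src:
                await dst.send(message)
        await asyncio.gather(pump(client, upstream), pump(upstream, client))

async def main():
    async with websockets.serve(relay, "localhost", 8765):
        await asyncio.Future()  # run until cancelled

if __name__ == "__main__":
    asyncio.run(main())
```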
Yeah, I mean you still want a solution where someone brings their own key and they can trust that you don't get it, right? Kind of. I mean I've been building a lot of bring-your-own-key apps where it's my HTML and JavaScript, I store the key in local storage in their browser, and it never goes anywhere near my server, which works, but how do they trust me? How do they know I'm not going to ship another piece of JavaScript that steals the key from them? And this is where a crypto background actually comes in. This is what MetaMask does.
Yeah, it's a public-private key thing. Yeah. Yeah. Like, why doesn't OpenAI do that? I don't know if obviously it's- I mean, as with most things, you'd think there's like some really interesting question and really interesting reason and the answer is just, you know, it's not been the top priority and it's hard for a small team to do everything.
I have been hearing a lot more about the need for things like sign in with OpenAI. I want OAuth. I want to bounce my users through ChatGPT and get back a token that lets me spend up to $4 on the API on their behalf. Then I could ship all of my stupid little experiments, which currently require
people to copy and paste their API key in, which cuts off everyone. Nobody knows how to do that. Totally. I hear you. Something we're thinking about. And yeah, stay tuned. Yeah, yeah. Right now, I think the only player in town is OpenRouter. That is basically-- it's funny. It was made by-- I forget his name. But he used to be CTO of OpenSea. And the first thing he did when he came over was build MetaMask for AI. Totally. Yeah, very cool. What's the most underrated release from today?
Vision fine-tuning. Vision fine-tuning is so underrated. For the past two months, whenever I talk to founders, they tell me this is the thing they need most. A lot of people are doing OCR on very bespoke formats like government documents, and vision fine-tuning can help a lot with that use case.
Also, bounding boxes. People have found a lot of improvements for bounding boxes with Vision Fine-Tuning. So yeah, I think it's pretty slept on. People should try it. You only really need 100 images to get going. Tell me more about bounding boxes. I didn't think GPT-4 Vision could do bounding boxes at all.
Yeah, it's actually not that amazing at it. We're working on it. But with fine-tuning, you can make it really good for your use case. That's cool because I've been using Google Gemini's bounding box stuff recently. It's very, very impressive. Yeah. But being able to fine-tune a model for that. The first thing I'm going to do with fine-tuning for images is I've got five chickens and I'm going to fine-tune a model that can tell which chicken is which. Love it.
Which is hard because three of them are grey. Yeah. So there's a little bit of... Okay, this is my new favorite use case. This is awesome. Yeah. I've managed to do it with prompting. Just like I gave Claude pictures of all of the chickens and then said, okay, which chicken is this? Yeah. But it's not quite good enough because it confuses the grey chickens. Listen, we can close that eval gap. Yeah. It's going to be a great eval. My chicken eval is going to be fantastic.
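For scale, here is roughly what that kind of training set looks like to assemble, assuming the chat-style JSONL format with image URLs that vision fine-tuning expects; the URLs and chicken names are placeholders.

```python
# Sketch: build a tiny vision fine-tuning file for "which chicken is this?"
# Assumes the chat-style JSONL format with image URLs; entries are placeholders.
import json

examples = [
    ("https://example.com/chickens/gertrude_01.jpg", "Gertrude"),
    ("https://example.com/chickens/pepper_01.jpg", "Pepper"),
    # ... around 100 labeled images is the suggested starting point
]

with open("chickens.jsonl", "w") as f:
    for url, name in examples:
        row = {
            "messages": [
                {"role": "user", "content": [
                    {"type": "text", "text": "Which chicken is this?"},
                    {"type": "image_url", "image_url": {"url": url}},
                ]},
                {"role": "assistant", "content": name},
            ]
        }
        f.write(json.dumps(row) + "\n")
```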
I'm also really jazzed about the evals product. It's kind of like a sub-launch of the distillation thing, but people have been struggling to make evals. And the first time I saw the flow with how easy it is to make an eval in our product, I was just blown away. So I recommend people really try that. I think that's what's holding a lot of people back from really investing in AI because they just have a hard time figuring out if it's going well for their use case. So we've been working on making it easier to do that.
Does the eval product include structured output testing? Yeah, you can check if it matches your JSON schema. We have guaranteed structured output anyway.
So we don't have to test it. Well, not the schema, but the performance. See, these seem easy to tell apart. I think so. It's like, it might call the wrong function. You're going to have right schema, wrong output. So you can do function calling testing. I'm pretty sure. I'll have to check that for you, but I think so. We'll make sure it's in the notes. How do you think about the evolution of the API design? I think, to me, that's the most important thing. So even with the OpenAI levels, like chatbots, I can understand what the API design looks like.
reasoning, I can kind of understand it, even though chain of thought kind of changes things. As you think about real-time voice and then you think about agents, how do you think about how you design the API and what the shape of it is? Yeah, so I think we're starting with the lowest-level capabilities and then we build on top of that as we know that they're useful. So a really good example of this is real-time. We're actually going to be shipping
audio capabilities in chat completions. So this is like the lowest-level capability: you supply audio and you can get back raw audio, and it works at the request-response layer. But through building advanced voice mode, we realized ourselves that it's pretty hard to do with something like chat completions. And so that led us to building this WebSocket API.
So we really learned a lot from our own tools. And we think the chat completions thing is nice for certain use cases or async stuff, but you're really going to want a real-time API. And then as we test more with developers, we might see that it makes sense to have another layer of abstraction on top of that, something closer to more client-side libraries. But for now, that's where we feel we have a really good point of view. So that's a question I have: if I've got a half-hour-long audio recording,
at the moment the only way I can feed that in is if I call the WebSocket API and slice it up into little base64 JSON snippets and fire them all over. In that case, I'd rather just, like I can give you an image in the chat completion API, give you a URL to my MP3 file as input. Is that something? That's what we're going to do. Oh, thank goodness for that. Yes.
It's in the blog post. I think it's a short one-liner, but it's rolling out, I think, in the coming weeks. Oh, wow. Oh, really soon then. Yeah, the team has been sprinting. We're just putting finishing touches on stuff. Do you have a feel for the length limit on that? I don't have it off the top. Okay. Sorry.
Because yeah, often I want to do, I do a lot of work with transcripts of hour-long YouTube videos. Yeah. Currently, I run them through Whisper and then I do the transcript that way. But being able to do the multimodal thing, those would be really useful. Totally, yeah. We're really jazzed about it. We want to basically give the lowest capabilities we have, lowest level capabilities, and the things that make it easier to use. So, targeting kind of both.
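Since the feature was only described as rolling out in the coming weeks, the following is a hypothetical sketch of what request-response audio in chat completions might look like; the model name and the modalities and input_audio parameters are assumptions, not confirmed details.

```python
# Hypothetical sketch only: audio in chat completions was still rolling out,
# so the model name and parameter names below are assumptions.
import base64
from openai import OpenAI

client = OpenAI()

with open("question.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gpt-4o-audio-preview",            # assumed model name
    modalities=["text", "audio"],            # ask for spoken output as well
    audio={"voice": "alloy", "format": "wav"},
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Answer the question in this clip."},
            {"type": "input_audio",
             "input_audio": {"data": audio_b64, "format": "wav"}},
        ],
    }],
)
print(response.choices[0].message)
```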
I just realized what I can do though is I do a lot of Unix utilities, little like Unix things. I want to be able to pipe the output of a command into something which streams that up to the WebSocket API and then speaks it out loud. So I can do streaming speech of the output of things. That should work. I think you've given me everything I need for that. That's cool. Yeah. Excited to see what you build.
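A sketch of that pipe-to-speech idea, again assuming the `websockets` package and the beta event names (conversation.item.create, response.create, response.audio.delta); the output format is assumed to be the default mono PCM16.

```python
# Sketch: pipe a command's output in, get speech back.
# Usage: ls -la | python speak.py   (writes raw PCM16 to speech.pcm;
# play with e.g. ffplay -f s16le -ar 24000 -ac 1 speech.pcm, assuming
# the default 24 kHz mono PCM16 output)
import asyncio, base64, json, os, sys
import websockets

URL = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"
HEADERS = {
    "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
    "OpenAI-Beta": "realtime=v1",
}

async def speak(text: str) -> None:
    async with websockets.connect(URL, extra_headers=HEADERS) as ws:
        # Send the piped text as a single user item, then ask for audio back.
        await ws.send(json.dumps({
            "type": "conversation.item.create",
            "item": {"type": "message", "role": "user",
                     "content": [{"type": "input_text",
                                  "text": f"Read this aloud:\n{text}"}]},
        }))
        await ws.send(json.dumps({
            "type": "response.create",
            "response": {"modalities": ["audio", "text"]},
        }))
        with open("speech.pcm", "wb") as out:
            async for raw in ws:
                event = json.loads(raw)
                if event["type"] == "response.audio.delta":
                    out.write(base64.b64decode(event["delta"]))
                elif event["type"] == "response.done":
                    break

if __name__ == "__main__":
    asyncio.run(speak(sys.stdin.read()))
```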
I heard there were multiple competing solutions and you guys evaluated them before you picked WebSockets. Like server-sent events, polling.
Can you give your thoughts on the live updating paradigms that you guys looked at? Because I think a lot of engineers have looked at stuff like this. I think WebSockets are just a natural fit for bidirectional streaming. Other places I've worked, like Coinbase, we had a WebSocket API for pricing data. I think it's just a very natural format. So it wasn't even really that controversial at all?
I don't think it was super controversial. I mean, we definitely explored the space a little bit, but I think we came to WebSockets pretty quickly. Cool. Video? Yeah. Not yet, but possible in the future. I actually was hoping for the ChatGPT desktop app with video today because that was demoed. Yeah. This is Dev Day.
I think the moment we have the ability to send images over the WebSocket API, we get video. My question is, how frequently? Because sending a whole video frame of like a 1080p screen, maybe it might be too much. What's the limitations on a WebSocket chunk going over? I don't know.
I don't have that off the top. With Google Gemini, you can do an hour's worth of video in their context window just by slicing it up into individual frames, something like one frame a second. And it does work. So...
I don't know. But then that's the weird thing about Gemini, it's so good at you just giving it a flood of individual frames. It'll be interesting to see if GPT-4 can handle that or not. Do you have any more feature requests? It's been a long day for everybody, but we've got you right here. My one is, I want you to do all of the accounting for me. I want my users to be able to run my apps
and I want them to call your APIs with their user ID and have you go, "Oh, they've spent 30 cents. Cut them off at a dollar. I can like check how much they spent." All of that stuff because I'm having to build that at the moment and I really don't want to. I don't want to be a token accountant. I want you to do the token accounting for me. Yeah, totally. I hear you. It's good feedback.
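A rough sketch of the token accounting being described: meter each user from the usage block on the response and cut them off at a budget. The per-token prices below are placeholders, not real rates.

```python
# Sketch of per-user spend metering from response.usage.
# Prices are placeholders; swap in the real rates for your model.
from collections import defaultdict
from openai import OpenAI

client = OpenAI()
PRICE_PER_INPUT_TOKEN = 0.0000025   # placeholder $/token
PRICE_PER_OUTPUT_TOKEN = 0.00001    # placeholder $/token
BUDGET_DOLLARS = 1.00

spend = defaultdict(float)  # user_id -> dollars spent so far

def ask(user_id: str, prompt: str) -> str:
    if spend[user_id] >= BUDGET_DOLLARS:
        raise RuntimeError(f"{user_id} is over their ${BUDGET_DOLLARS:.2f} budget")
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    usage = response.usage
    spend[user_id] += (usage.prompt_tokens * PRICE_PER_INPUT_TOKEN
                       + usage.completion_tokens * PRICE_PER_OUTPUT_TOKEN)
    return response.choices[0].message.content
```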
Well, how does that contrast with your actual priorities? I feel like you have a bunch of priorities. They showed some on stage with multimodality and all that. Yeah. It's hard to say. I would say things change really quickly. Things that are big blockers for user adoption, we find very important. It's a rolling prioritization. No Assistants API update? Not at this time.
Yeah. I was hoping for, like, an O1-native thing in Assistants. Yeah. I thought they would go well together. We're still kind of iterating on the formats. I think there are some problems with the Assistants API, some things it does really well. And I think we'll keep iterating and land on something really good, but it just wasn't quite ready yet. Some of the things that are good in the Assistants API are hosted tools. People really like hosted tools, and especially RAG.
And then some things that are less intuitive, like just how many API requests you need to get going with the Assistants API. It's quite-- It's quite a lot. Yeah, you've got to create an assistant, you've got to create a thread, you've got to do all this stuff. So yeah, it's something worth thinking about. It shouldn't be so hard. The only thing I've used it for so far is Code Interpreter. It's like an API to Code Interpreter. Crazy exciting. Yes, we want to fix that and make it easier to use. I want Code Interpreter over WebSockets. That would be wildly interesting.
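For reference, this is the boilerplate being described a moment ago: with the beta Python SDK at the time, getting one answer out of the Assistants API took roughly this sequence of calls (model name is a placeholder).

```python
# Sketch of the Assistants API boilerplate: assistant -> thread -> message ->
# run -> poll -> read messages, just to get one reply back.
import time
from openai import OpenAI

client = OpenAI()

assistant = client.beta.assistants.create(
    model="gpt-4o",  # placeholder
    instructions="You are a data analyst.",
    tools=[{"type": "code_interpreter"}],
)
thread = client.beta.threads.create()
client.beta.threads.messages.create(
    thread_id=thread.id, role="user", content="Plot y = x^2 for x in 0..10."
)
run = client.beta.threads.runs.create(thread_id=thread.id, assistant_id=assistant.id)

# Poll until the run finishes, then read the latest message.
while run.status in ("queued", "in_progress"):
    time.sleep(1)
    run = client.beta.threads.runs.retrieve(thread_id=thread.id, run_id=run.id)

messages = client.beta.threads.messages.list(thread_id=thread.id)
print(messages.data[0].content)
```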
Yeah. Do you want to bring your own code interpreter or you want to use OpenAI as well? I want to use that because code interpreters are a hard problem. Sandboxing and all of that stuff is... Yeah, but there's a bunch of code interpreter as a service things out there. There are a few now, yeah. Because there's... I think you don't allow arbitrary installation of packages. Oh, they do. They really do. Unless they use your hack.
Yeah, and I do. You can upload a pip package. You can compile C code in Code Interpreter. I know. That's a hack. Oh, it's such a glorious hack, though. Okay. I've had it write me custom SQLite extensions in C and compile them and run them inside of Python, and it works. I mean, yeah. There's others. E2B is one of them. It'll be interesting to see what the real-time version of that will be.
Yeah. Awesome, Michelle. Thank you for the update. We left the last episode with: what will voice mode look like? Obviously, you knew what it looked like, but you couldn't say it. So now you can. Yeah, here we are. Hope you guys like it. Yeah. Cool. Awesome. Thank you. That's it. Our final guest today, and also a familiar recent voice on the Latent Space pod, presented at one of the community talks at this year's Dev Day.
Alistair Pullen of Cosine made a huge impression with all of you, with a special shout out to listeners like Jesse from Morph Labs, when he came on to talk about how he created synthetic datasets to fine-tune the largest LoRAs that had ever been created for GPT-4o, posting the highest-ever scores on SWE-Bench and SWE-Bench Verified while not getting recognition for it because he refused to disclose his reasoning traces to the SWE-Bench team.
Now that OpenAI's O1-preview is announced, it is incredible to see the OpenAI team also obscure their chain of thought traces for competitive reasons, and still perform lower than Cosine's Genie model.
We snagged some time with Ali to break down what has happened since his episode aired. Welcome back, Ali. Thank you so much. Thanks for having me. So you just spoke at OpenAI's Dev Day. What was the experience like? Did they reach out to you? You seem to have a very close relationship. Yeah, so off the back of...
Off the back of the work that we've done that we spoke about last time we saw each other, I think that OpenAI definitely felt that the work we've been doing around fine-tuning was worth sharing. I would obviously tend to agree, but today I spoke about some of the techniques that we learned. Obviously it was like a non-linear path
arriving to where we've arrived, and the techniques that we've built to build Genie. So I think I shared a few extra pieces about some of the techniques and how it really works under the hood, how you generate a dataset to show the model how to do the things we want it to do. And that was mainly what I spoke about today. I mean, yeah, they reached out and I was super excited at the opportunity, obviously. Like, it's not every day that you get to come and do this, especially in San Francisco.
Yeah, they reached out and they were like, do you want to talk at Dev Day? You can speak about basically anything you want related to what you've built. And I was like, sure, that's amazing. I'll talk about fine tuning how you build a model that does this software engineering. So, yeah. Yeah. And the trick here is when we talked, O1 was not out. No, it wasn't. Did you know about O1?
I didn't know. I knew some bits and pieces. No, not really. I knew a reasoning model was on the way. I didn't know what it was going to be called. I knew as much as everyone else. Strawberry was the name back then. Because, you know, fast forward, you were the first to hide your chain of thought reasoning traces as IP. Yes. Famously, that got you in trouble with SWE-Bench or whatever. I feel slightly vindicated by that now. And now, obviously, O1 is doing it. Yeah, the fact that, I mean, like,
I think it's true to say right now that the reasoning of your model gives you the edge that you have. And the amount of effort that we put into our data pipeline to generate these human-like reasoning traces was... I mean, that wasn't for nothing. We knew that this was the way that you'd unlock more performance, getting the model to think in a specific way. In our case, we wanted it to think like a software engineer. But yeah, I think that...
The approach that other people have taken like OpenAI in terms of reasoning has definitely showed us that we were going down the right path pretty early on. And even now we've started replacing some of the reasoning traces in our Genie model with reasoning traces generated by O1, or at least in tandem with O1. And we've already started seeing improvements in performance from that point. But no, like back to your point, in terms of like the whole like
withholding them, I still think that was the right decision to make, for the very reason that everyone else has decided not to share those things. It shows exactly how we do what we do, and that is our edge at the moment. As a founder, they also featured Cognition on stage; talk about that.
How does that make you feel? Like, you know, they're like, hey, O1 is so much better, it makes us better. For you, it should be like, oh, I'm so excited about it too, because now all of a sudden it kind of raises the floor for everybody. How should people, especially new founders, think about worrying about the new model versus being excited about it, just focusing on the core product and maybe switching out some of the parts like you mentioned? Yeah, speaking for us, I mean, obviously we were extremely excited about O1 because...
at that point the process of reasoning is obviously very much baked into the model. We fundamentally, if you like remove all distractions and everything, we are a reasoning company, right? We want to reason in the way that a software engineer reasons. So when I saw that model announced, I thought immediately, well, I can improve the quality of my traces coming out of my pipeline. So like my signal to noise ratio gets better. And then not immediately, but down the line, I'm going to be able to train those traces into O1 itself. So I'm going to get even more performance that way as well. So it's,
for us a really nice position to be in to be able to take advantage of it both on the prompted side and the fine-tuned side and also because fundamentally like
we are, I think, fairly clearly in a position now where we don't have to worry about what happens when O2 comes out, what happens when O3 comes out. This process continues. Like, even going from, you know, when we first started, going from 3.5 to 4, we saw this happen. And then from 4 Turbo to 4o, and then from 4o to O1, we've seen the performance get better every time. And I think, I mean, like,
the crude advice I'd give to any startup founders is, try to put yourself in a position where you can take advantage of the same, you know, sea level rise every time, essentially. Do you make anything out of the fact that you were able to take 4o and fine-tune it higher than O1 currently scores on SWE-Bench Verified? Yeah, I mean, like, yeah, that was obviously, to be honest with you, you realized that before I did. Adding value. Yes, absolutely. That's a value-add investor right there. No, obviously I think it's been,
That in of itself is really vindicating to see because I think we have heard from some people, not a lot of people, but some people saying, well, okay, well, if everyone can reason, then what's the point of doing your reasoning? But it shows how much more signal is in the custom reasoning that we generate. And again, it's the very sort of obvious thing. If you take something that's made to be general and you make it specific, of course it's going to be better at that thing, right?
So it was obviously great to see we still are better than O1 out of the box, even with an older model. And I'm sure that delta will continue to grow once we're able to train O1, and once we've done more work on our dataset using O1, that delta will grow as well. It's not obvious to me that they will allow you to fine-tune O1, but maybe they'll try. I think the core question that OpenAI really doesn't want you to figure out is: can you use an open source model and beat O1?
Interesting. Because you basically have shown proof of concept that a non-O1 model can beat O1. And their whole O1 marketing is: don't bother trying. Like, don't bother stitching together multiple chain of thought calls. We did something special. Secret sauce. You don't know anything about it. And somehow...
you know, your 4o chain of thought reasoning as a software engineer is still better. Maybe it doesn't last. Maybe they're going to run O1 for five hours instead of five minutes and then suddenly it works. So I don't know. It's hard to know. I mean, one of the things that we just want to do out of sheer curiosity is do something like fine-tune 405B on the same dataset. Like, same context window length, right? So it should be fairly easy. We haven't done it yet. Truthfully, we have been so swamped with...
The waitlist, shipping product, dev day, onboarding customers from our waitlist, all these different things have gotten in the way. But it is definitely something out of more curiosity than anything else I'd like to try out. But also, it opens up a new vector of if someone has a VPC where they can't deploy an open AI model, but they might be able to deploy an open source model, it opens that up for us as well from a customer perspective. So it'll probably be quite useful. I'd be very keen to see what the results are, though.
I suspect the answer is yes, but it may be hard to do. So Reflection 70B was a really crappy attempt at doing it. You guys were much better, and that's why we had you on the show. I'm interested to see if there's an open 01, basically. People want open 01. Yeah, I'm sure they do. As soon as we do it, and once we've wrapped up what we're doing in San Francisco, I'm sure we'll give it a go. I spoke to some guys today, actually, about fine-tuning 405B, who might be able to allow us to do it.
very easily. I don't want to have to basically do all the setup myself. So, yeah, that might happen sooner rather than later. Anything from the releases today that you're super excited about? So, prompt caching, I'm guessing when you're dealing with a lot of codebases, that might be helpful. Is there anything with vision fine-tuning related to more like UI-related development? Yeah, definitely. Yeah, I mean, we were talking, it's funny, like,
My co-founder Sam, who you've met, and I were talking about the idea of doing vision fine-tuning way back, well over a year ago, before Genie existed as it does now. When we collected our original data set to do what we do now, whenever there were image links and links to graphical resources and stuff, we also pulled that in as well. We never had the opportunity to use it, but it's something we have in storage.
Again, like when we have the time, it's something that I'm super excited, particularly on the UI side, to be able to leverage. Particularly if you think about, not to sidetrack, but one of the things we've noticed is, I know SWE-Bench is the most commonly talked about thing, and honestly, it's an amazing project, but one of the things we've learned the most from actually shipping this product to users is that it's a pretty bad proxy for telling us how competent the model is. So for example, when people are doing React development using Genie,
for us, it's impossible to know whether what it's written has actually done what it wanted to. So even just using vision fine-tuning to be able to help eval what we output is already something that's very useful. But also being able to pair, here's a UI I want, here's the code that actually represents that UI, is going to be super useful as well, I think. In terms of generally, what have I been most impressed by? The distillation thing is awesome.
I think we'll probably end up using it in places. But what it shows me more broadly about OpenAI's approach is they're going to be building a lot of the things that we've had to hack together internally in terms from a tooling point of view just to make our lives so much easier. And I've spoken to, you know, John, the head of fine tuning, extensively about this. But there's a bunch of tools that we've had to build internally for things like
dealing with model lineage, dealing with data set lineage, because it gets so messy so quickly, that we would love OpenAI to build. Like, absolutely would love them to build it. It's not what gives us our edge, but it certainly means that then we don't have to build it and maintain it afterwards. So it's a really good first step, I think, in the overall maturity of the fine-tuning product and API in terms of where they're going to see those early products. And I think that they'll be continuing in that direction going on.
Did you not, so there's a very active ecosystem of LLMOPs tools. Did you not evaluate those before building your own? We did, but I think fundamentally like... No moat.
Yeah, like I think in a lot of places it was never a big enough pain point to be like, oh, we absolutely must outsource this. It's definitely in many places something that you can hack a script together in a day or two and then hook it up to our already existing internal tool UI and then you have what you need. And whenever you need a new thing, you just tack it on. But for like all of these LLM ops tools,
I've never felt the pain point enough to really like bother and that's not to deride them at all. I'm sure many people find them useful but just for us as a company we've never felt the need for them. So it's great that OpenAI are going to build them in because it's really nice to have them there for sure but it's not something that like I'd ever consider really paying for externally or something like that if that makes sense. Yeah. Does voice mode factor into Genie?
Maybe one day, that'd be sick, wouldn't it? Yeah, I think so. You're the first person that we've been asking this question to everybody. You're the first person to not mention voice mode. It's currently so distant from what we do, but I definitely think this whole talk of we want it to be a full-on AI software engineering colleague, like,
there is definitely a vector in some way that you can build that in. Maybe even during the ideation stage, talking through a problem with Genie in terms of how we want to build something down the line. I think that might be useful, but honestly, that would be a nice-to-have when we have the time.
Yeah, amazing. One last question. In your talk, you mentioned a lot about curating your data and your distribution and all that. Yes. And before we sat down, you talked a little bit about having to diversify your data set. What's driving that? What are you finding? So we have been rolling people off the wait list that we sort of amassed when we announced when I last saw you.
And it's been really interesting because, as I may have mentioned on the podcast, we had to be very opinionated about the data mix and the dataset that we put together for sort of the V0 of Genie. Again, to your point, JavaScript, JavaScript, JavaScript, Python. There's a lot of JavaScript in its various forms in there. But it turns out that when we shipped it to the very early alpha users we rolled it out to, for example, we had some guys using it with a C# code base.
And C# currently represents, I think, about 3% of the overall data mix. And they weren't getting the levels of performance that they saw when they tried it with the Python code base. And it was obviously not great for them to have a bad experience, but it was nice to be able to correlate it with the actual objective data mix that we saw. So what we've been doing...
is like little top-up fine tunes where we take the general genie model and do an incremental fine tune on top with just a bit more data for a given vertical language. And we've been seeing improvements coming from that. So again, this is one of the great things about sort of baptism by fire and letting people use it and giving you feedback and telling you where it sucks.
because that is not something that we could have just known ahead of time. So I want that data mix to, over time as we roll it out to more and more people, and we are trying to do that as fast as possible, but we're still a team of five for the time being, to be as general and as representative of what our users do as possible and not what we think they need. Yeah, so every customer is going to have their own fine-tuned system
There is going to be the option to fine-tune the model on your code base. It won't be in the base pricing tier, but you will definitely be able to do that. It will go through all of your code base history, learn how everything happened, and then you'll have an incrementally fine-tuned Genie just on your code base. That's the really lovely thing about the idea of enterprise. Lovely. Perfect.
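Mechanically, a top-up fine-tune like the one Alistair describes can be sketched as starting a new fine-tuning job from the existing fine-tuned checkpoint, assuming continued fine-tuning is enabled for that model; the file name and model ID below are placeholders.

```python
# Sketch of an incremental "top-up" fine-tune: a small extra training file
# applied on top of an already fine-tuned model. IDs/paths are placeholders,
# and this assumes continued fine-tuning from a fine-tuned checkpoint is
# enabled for the model in question.
from openai import OpenAI

client = OpenAI()

upload = client.files.create(
    file=open("csharp_topup.jsonl", "rb"),  # extra data for one vertical/language
    purpose="fine-tune",
)
job = client.fine_tuning.jobs.create(
    model="ft:gpt-4o-2024-08-06:acme::abc123",  # placeholder: existing fine-tuned model
    training_file=upload.id,
)
print(job.id, job.status)
```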
Cool. Yeah, that's it. Thank you so much. Thank you so much, guys. Good to see you. Thank you. Lastly, this year's Dev Day ended with an extended Q&A with Sam Altman and Kevin Weil. We think both the questions asked and answers given were particularly insightful, so we are posting what we could snag of the audio here from publicly available sources, credited in the show notes, for you to pick through.
If the poorer-quality audio here is a problem, we recommend waiting approximately one to two months until the final video is released on YouTube. In the meantime, we particularly recommend Sam's answers on the moderation policy, on the underappreciated importance of agents and AI employees beyond level three, and his projections of the intelligence of the O1, O2, and O3 models in future.
All right, I think everybody knows you. For those who don't know me, I'm Kevin Weil, Chief Product Officer at OpenAI. I have the good fortune of getting to turn the amazing research that our research teams do
into the products that you all use every day and the APIs that you all build on every day. I thought we'd start with some audience engagement here. So on the count of three, I want to count to three, and I want you all to say, of all the things that you saw launched here today, what's the first thing you're going to integrate? It's the thing you're most excited to build on, all right? You got to do it, right? One, two, three. Realtime API.
I'll say personally, I'm super excited about our distillation products. I think that's going to be really, really good. I'm also excited to see what you all do with advanced voice mode with the real-time API, and with vision fine-tuning in particular. So, okay. So, I've got some questions for Sam. I've got my CEO here in the hot seat. Let's see if I can't make a career-limiting move. So, we'll start with an easy one, Sam. How close are we to AGI? Okay.
You know, we used to, every time we finished a system, we would say, like, in what way is this not an AGI? And it used to be very easy. You could make a little robot hand that does a Rubik's Cube, or a Dota bot, and it's like, oh, it does some things, but definitely not an AGI. It's obviously harder to say now. So we're trying to stop talking about AGI as this general thing. We have this levels framework, because the word AGI has become so overloaded. So like,
real quickly, we use one for chatbots, two for reasoners, three for agents, four for innovators, five for organizations, roughly. I think we clearly got to level two with O1. It can do really quite impressive cognitive tasks. It's a very smart model. It doesn't feel AGI-like in a few important ways. But I think if you just do the one next step of making it very agent-like,
which is our level three and which I think we will be able to do in the not distant future, it will feel surprisingly, still probably not something that most of you would call an AGI, though maybe some of you would. It's going to feel like, all right, this is like a significant thing. And then the leap, and I think we do that pretty quickly, the leap from that to something that can really increase the rate of new scientific discovery, which for me is like a very important part of having an AGI,
I feel a little bit less certain on that, but not a long time. Like, I think all of this now is going to happen quickly. If you think about what happened from the last step to this one in terms of model capabilities, if you go from O1 on a hard problem back to 4 Turbo that we launched 11 months ago, you'll be like, wow,
This is happening pretty fast. And I think the next year will be very steep progress. The next two years will be very steep progress. Harder than that. Hard to see a lot of certainty. But I would say, like, the math will vary. And at this point, the definitions really matter. And the fact that the definitions matter this much somehow means we're, like, getting close. Yeah. And, you know, there used to be this sense of AGI where it was, like, it was a binary thing and you were going to go to sleep one day and there was no AGI and wake up the next day and there was AGI. I don't think that's...
exactly how we think about it anymore, but how have your views and habits evolved? - You know, the one, I agree with that. I think we're like, you know, in this like kind of
period where it's going to feel very blurry for a while, the, you know, is this AGI yet, or is this not AGI, or kind of at what point? It's just going to be this smooth exponential, and, you know, probably most people looking back at history won't agree when that milestone was hit. We'll just realize it was a silly thing. Even the Turing test, which I always thought was this very clear milestone, you know, there was this fuzzy period. It kind of just went whooshing by and no one cared. But
I think the right framework is just this one exponential. That said, if we can make an AI system that is materially better than all of OpenAI at doing AI research, that does feel to me like some sort of important discontinuity. It's probably still wrong to think about it that way. It probably still is the smooth exponential curve. But that feels like a good milestone.
Is OpenAI still as committed to research as it was in the early days? Will research still drive the core of our advancements in our product development? Yeah, I mean, I think more than ever.
There was a time in our history when the right thing to do was just to scale up compute, and we pursued that with conviction. We have a spirit of, we'll do whatever works. We have this mission, we want to build safe AGI and figure out how to share the benefits. If the answer is rack up GPUs, we'll do that. And right now, the answer is, again, really push on research. And I think you see this with O1. That is a giant research breakthrough that we attacked
from many vectors over a long period of time, and it came together in this really powerful way. We have many more giant research breakthroughs to come. But the thing that I think is most special about OpenAI is that we really deeply care about research and we understand how to do it. It's easy to copy something you know works. And I actually don't even mean that as a bad thing. When people copy OpenAI, I'm like, great, the world gets more AI? That's wonderful.
To do something new for the first time, to really do research in the true sense of it, which is not like, you know, let's barely get SOTA at this thing or let's tweak this, but let's go find the new paradigm and the one after that and the one after that, that is what motivates us. And I think the thing that is special about us as an org, besides the fact that we married product and research and all this other stuff together, is that we know how to run that kind of a culture that can go push back the frontier. And that's really hard.
We love it. And that's, you know, I think we're going to have to do that a few more times. Yeah, I'll say the litmus test for me, coming from the outside, from, you know, sort of normal tech companies, of how critical research is to OpenAI, is that building product at OpenAI is fundamentally different than any other place that I have ever done it before. Normally,
you have some sense of your tech stack, you have some sense of what you have to work with, what capabilities computers have, and then you're trying to build the best product, right? You're figuring out who your users are, what problems they have, and how you can help solve those problems for them. There is that at OpenAI, but also,
the state of what computers can do just evolves every two months, three months, and suddenly computers have a new capability that they've never had in the history of the world,
And we're trying to figure out how to build a great product and expose that for developers and our APIs and so on. And you can't totally tell what's coming. It's coming through the mist a little bit at you and gradually taking shape. It's fundamentally different than any other company I've ever worked at. Is that the thing that most surprised you? Yes. Yeah, and it's interesting how...
Even internally, we don't always have a sense. You have like, okay, I think this capability is coming, but is it going to be 90% accurate or 99% accurate in the next model? Because the difference really changes what kind of product you can build.
you know that you're going to get to 99, you don't quite know when, and figuring out how you put a roadmap together in that world is really interesting. Yeah, the degree to which we have to just follow the science and let that determine what we go work on next and what products we build and everything else is, I think, hard to get across. We have guesses about where things are going to go. Sometimes we're right, often we're not, but
But if something starts working or if something doesn't work that you thought was going to work, our willingness to just say we're going to pivot everything and do what the science allows, and you don't get to pick what the science allows, that's surprising. I was sitting with an enterprise customer a couple weeks ago, and they said, one of the things we really want, this is all working great, we love this, one of the things we really want is a notification 60 days in advance when you're going to launch something.
And I was like, I want that too. All right, so I'm going through. These are a bunch of questions from the audience, by the way. And we're going to try and also leave some time at the end for people to ask audience questions. So we've got some folks with mics, and when we get there, they'll be thinking. But next thing. So many in the alignment community are genuinely concerned that OpenAI is now only paying lip service to alignment. Can you reassure us? I think it's true we have a different
take on alignment than maybe what people write about on whatever that internet forum is. But we really do care a lot about building safe systems. We have an approach to doing it that has been informed by our experience so far, and, to touch on another question, you don't get to pick where the science goes. We want to figure out how to make capable models that get safer and safer over time. And
you know, a couple of years ago, we didn't think the whole Strawberry or O1 paradigm was going to work in the way that it has. And that brought a whole new set of safety challenges, but also safety opportunities. And rather than kind of planning theoretically for once superintelligence gets here, here are the 17 principles, we have an approach of: figure out where the capabilities are going, and then work to make that system safe.
And O1 is obviously our most clever model, but it's also our most aligned model. And as these models get better at intelligence, better at reasoning, whatever you want to call it, the things that we can do to align them, the things we can do
to build really safe systems across the entire stack, our toolset keeps increasing as well. So we have to build models that are generally accepted as safe and robust to be able to put them in the world. And when we started OpenAI, what the picture of alignment looked like, and what we thought the problems we needed to solve were going to be, turned out to be nothing like the problems that are actually in front of us and that we have to solve now.
And also, when we made the first GPT-3, if you had asked me for the techniques that would have worked for us to be able to now deploy our current systems as generally accepted to be safe and robust, they would not have been the ones that turned out to work. So through this idea of
iterative deployment, which I think has been one of our most important safety stances ever, and sort of confronting reality as it's in front of us, we've made a lot of progress, and we expect to make more. We keep finding new problems to solve, but we also keep finding new techniques to solve them. Worrying about the sci-fi ways this all goes wrong is also very important. We have people thinking about that. It's a little bit less clear what to do there, and sometimes you end up backtracking a lot.
But I also think it's scary to say we're only going to work on the thing in front of us. We do have to think about where this is going, and we do that too. And I think if we keep approaching the problem from both ends like that, most of our thrust is on the, okay, here's the next thing we're going to deploy, what needs to happen to get there. But also, what happens if this curve just keeps going? That's been an effective strategy for us. I'll say also it's one of the places where I really like our philosophy of iterative deployment. When I was at Twitter,
I don't know, 100 years ago now. Ed said something that stuck with me, which is: no matter how many smart people you have inside your walls, there are way more smart people outside your walls. And so, you know, it'd be one thing if we just said we're going to try and figure out everything that could possibly go wrong within our own walls,
and it would be just us and the red teamers that we can hire, and so on. And we do that. We work really hard at that. But also launching iteratively and launching carefully and learning from the ways that folks like you all use it, what can go right, what can go wrong, I think is a big way that we get these things right. And I think that as we head into this world of
agents off doing things in the world, that is going to become really, really important. As these systems get more complex and are acting over longer horizons, the pressure testing from the whole outside world, really, yeah. So we'll go off of that, and maybe talk to us a bit more about how you see agents fitting into OpenAI's long-term plans. I think they're a huge part of it. I think the exciting thing is that this
set of models, O1 in particular, and all of its successors, are going to be what makes this possible, because you finally have the ability to reason, to take hard problems, break them into simpler problems, and act on them. I mean, I think 2025 is going to be the year this really gets going. Yeah, I mean, chat interfaces are great, and they will, I think, have an important place in the world, but when you can ask a model,
when you can ask ChatGPT or some agent something, and it's not just that you get a kind of quick response, or even that you get 15 seconds of thinking and O1 gives you a nice piece of code back or whatever, but you can really give something a multi-turn interaction with environments or other people or whatever, the equivalent of multiple days of effort from a really smart, really capable human, and have stuff happen.
We all say that, we're all like, "Oh yeah, we're interested in the next thing, this is coming, this is gonna be another thing." And we just talk about it like, "Okay." You know, it's like the next model in the evolution. And we don't really know until we get to use these that it's... We'll of course get used to it quickly, people can use any new technology quickly, but this will be like a very significant change to the way the world works in a short period of time. Yeah, it's amazing. Somebody was talking about getting used to new capabilities in AI models and how quickly... Actually, I think it was about Waymo, talking about how in the first
10 seconds of using Waymo, they were like, "Oh my god, is this thing..." It was like, "Let's watch out." And 10 minutes in, they were like, "Oh, this is really cool." And then 20 minutes in, they were checking their phone, bored. You know, it's amazing how quickly your internal firmware updates for this new stuff. Yeah, I think people will ask an agent to do something for them that would have taken them a month, and it'll finish in an hour, and it'll be great. And then they'll have 10 of those at the same time.
Then they'll have, like, a thousand of those at the same time. And by 2030 or whatever, we'll look back and be like, yeah, this is just what a human is supposed to be capable of. What a human used to, you know, grind at for years, or many humans would grind at for years, I just ask the computer to do it and it's, like, done. Why is it not done in a minute?
Yeah, it's also one of the things that makes having an amazing developer platform great, too, because, you know, we'll experiment and we'll build some agentic things ourselves, of course, and we're already, I think, just pushing the boundaries of what's possible today. You've got groups like Cognition doing amazing things in coding, Harvey and Casetext in law, you've got Speak doing cool things with language translation.
We're beginning to see this stuff work, and I think it's really going to start working as we continue to iterate these models. One of the very fun things for us about having this developer platform is just getting to watch the unbelievable speed and creativity of the people who are building these experiences. Developers are very near and dear to our heart. The API is kind of the first thing we launched, and many of us came up building on platforms. But so much of the capability of these models and the great experiences have been built by
people building on the platform. We'll continue to try to offer great first-party products, but we know that we'll only ever be a small, narrow slice of the apps or agents or whatever people build in the world. And seeing what has happened in the world in the last 18, 24 months, it's been quite amazing. I'm going to keep going on the agent front here.
What do you see as the current hurdles for computer controlling agents? Safety and alignment. Like, if you are really going to give an agent the ability to start clicking around your computer, which you will, you are going to have a very high bar for the robustness and the reliability and the alignment of that system. So technically speaking, I think, you know, we're getting like pretty close to the capability side. The sort of agent safety and trust framework
will, I think, be the long haul. And now I'll kind of ask a question that's almost the opposite of one of the questions from earlier. Do you think safety could act as a false positive and actually limit public access to critical tools that would enable a more egalitarian world? The honest answer is yes, that will happen sometimes. Like, we'll try to get the balance right. But if we just didn't care about safety and alignment at all,
could we have launched O1 faster? Yeah, we could have done that. It would have come at a cost. There would have been things that went really wrong. I'm very proud that we didn't. The cost, you know, I think would have been manageable with O1, but by the time of O3 or whatever, maybe it would be pretty unacceptable. And so we start on the conservative side. I mean, people are complaining, like, oh, voice mode won't say this offensive thing, and
you're a horrible company, just let it offend me. You know what? I actually mostly agree. If you are trying to get it to say something offensive, it should follow the instructions of its user most of the time. There are plenty of cases where it shouldn't. But we have a long history of, when we put a new technology into the world, we start on the conservative side. We try to give society time to adapt. We try to understand where the real harms are versus the more theoretical ones. And that's part of our approach to safety.
And not everyone likes it all the time. I like it all the time. But, and we're going to get it wrong too, like sometimes we won't be conservative enough in some area. But if we're right that these systems are going to get as powerful as we think they are, as quickly as we think they might, then I think starting that way makes sense. And, you know, we relax over time. Totally agree. What's the next big challenge for a startup that's using AI as a core?
I'll say, I've got one which is, I think one of the challenges and we face this too because we're also building products on top of our own models, is trying to find the kind of the frontier. You want to be building, these AI models are evolving so rapidly and if you're building for something that the AI model does well today, it'll work well today but it's going to feel old tomorrow.
And so you want to build for things that the AI model can just barely not do, you know, where maybe the early adopters will go for it and other people won't quite. Because that just means that when the next model comes out, as we continue to make improvements, that use case that just barely didn't work is suddenly going to work, and you're going to be the first to do it, and it's going to be amazing. But figuring out that boundary is really hard. I think it's where the best products are going to get built.
Totally agree with that. The other thing I'm going to add is, I think it's very tempting to think that a technology makes a startup, and that is almost never true. No matter how cool a new technology or a new tech wave is, it doesn't excuse you from having to do all the hard work of building a great company that is going to have durability, or like an accumulated advantage over time. And
we hear this from a lot of startups at YC, it's a very common thing, which is: I can do this incredible thing, I can make this incredible service, and that seems like a complete answer. But it doesn't excuse you from any of the normal laws of business. You still have to build a good business and a good strategic position. And I think the mistake is that in the unbelievable excitement and updraft of AI, people are very tempted to forget that. This is an interesting one. Voice mode is like tapping directly into the human API.
How do you ensure ethical use of such a powerful tool, with its obvious abilities for manipulation? Yeah, you know, voice mode was a really interesting one for me. It was the first time that I felt like I sort of got tricked by it, in that when I was playing with the first beta of it, I couldn't stop myself. I mean, I kind of
like, I still say please to ChatGPT, but in voice mode, I couldn't not use the normal niceties. I was so convinced, like, ah, it might be real, you know? And obviously it's just hacking some circuit in my brain, but I really felt it with voice mode, and I sort of still do. I think this is an example of a more general thing that we're going to start facing, which is, as these systems become
more and more capable, and as we try to make them as natural as possible to interact with, they're going to hit parts of our neural circuitry that evolved to deal with other people. And, you know, there are a bunch of clear lines about things we don't want to do. Like, there's a whole bunch of weird personality growth-hacking, vaguely socially manipulative stuff we could do that we won't.
But then there are these other things that are just not nearly as clear-cut. Like, you want the voice mode to feel as natural as possible, but then you get across the uncanny valley and, at least in me, it triggers something. And, you know, me saying please and thank you for chatting with me, that's probably fine, probably a good thing to do. You never know. But I think this really points at the kinds of safety and alignment issues we have to start grappling with.
All right, back to brass tacks. Sam, when's O1 going to support function tools? - Do you know? - Before the end of the year. There are three things that we really want to get in. We're going to record this, take it back to the research team, and show them how badly we need to do this. I mean, there are a handful of things that we really wanted to get into O1, and, you know, it's a balance of: should we get this out to the world earlier and begin learning from it, learning from how you all use it, or should we launch a fully complete thing that is, you know,
in line with having all the abilities that every other model we've launched has. I'm really excited to see things like system prompts and structured outputs and function calling make it into O1. We will be there by the end of the year. It really matters to us too. In addition to that, just because I can't resist the opportunity to reinforce this, we will get all of those things in, and a whole bunch more things you all have asked for. The model is going to get so much better so fast. We are so early.
This is like, you know, maybe it's the GPT-2 scale moment, but we know how to get to GPT-4, and we have the fundamental stuff in place now to get to GPT-4. And in addition to planning for us to build all of those things, plan for the model to just get rapidly smarter. Like, you know, hope you all come back next year and plan for it to feel like way more of a year of improvement than the one from 4 Turbo. What feature or capability of a competitor do you really admire?
I think Google's notebook thing is super cool. What do they call it? - NotebookLM. - NotebookLM, yeah. I woke up early this morning and I was looking at examples on Twitter and I was just like, this is just cool. This is just a good, cool thing. And not enough of the world is shipping new and different things; it's mostly the same stuff. But that, I think, is like,
That brought me a lot of joy this morning. It was very well done. One of the things I really appreciate about that product is that the format itself is really interesting, but they also nailed the podcast-style voices. They have really nice microphones. They have these sort of sonorous voices. Did you guys see, somebody on Twitter was saying the cool thing to do is take your LinkedIn
and give it to NotebookLM, and you'll have two podcasters riffing back and forth about how amazing all of your accomplishments over the years are. I'll say mine is, I think Anthropic did a really good job on Projects.
It's kind of a different take on what we did with GPTs. GPTs are a little bit more long-lived; it's something you build and can use over and over again. Projects are kind of the same idea, but more temporary, meant to be stood up, used for a while, and then the different mental model makes a difference. I think they did a really nice job with that. All right. We're getting close to audience questions, so be thinking of what you want to ask.
So at OpenAI, how do you balance what you think users may need versus what they actually need today? Also a better question for you. Yeah, well, I think it does get back to a bit of what we were saying around trying to build for what the model can just
not quite do, but almost do. But it's a real balance too, as we support over 200 million people every week on ChatGPT. You also can't say, no, it's cool, deal with this bug for three months, we've got something really cool coming. You've got to solve for the needs of today. And there are some really interesting product problems. I mean, you think about,
I'm speaking to a group of people who know AI really well. Think of all the people in the world who have never used any of these products, and that is still the vast majority of the world. You're basically giving them a text interface, and on the other side of the text interface is this alien intelligence that's constantly evolving, that they've never seen or interacted with, and you're trying to teach them all the crazy things it can actually do, all the ways it can help, can integrate into your life, can solve problems for you.
And people don't know what to do with it. You know, you come in and people type "hi," and it responds, "Hey, great to see you, how can I help you today?" And then you're like, "Okay, I don't know what to say." And then you kind of walk away and you're like, "Well, I didn't see the magic of that." And so it's a real challenge figuring out how you, I mean, we all have a hundred different ways that we use ChatGPT and AI tools in general,
but teaching people what those can be, and then bringing them along as the model changes month by month and suddenly gains capabilities way faster than we as humans gain capabilities. It's a really interesting set of problems, and I know it's one that you all solve in different ways as well. I have a question. Who feels like they spend a lot of time with O1 and would say, "I feel definitively smarter than that thing"? Do you think you still will by O2?
No one is taking the bet of still being smarter. So one of the challenges we face is: we know how to go do this thing that we think will be at least probably smarter than all of us across a broad array of tasks, and yet we still have to fix bugs and do the, "hey, how are you" kind of work.
And mostly what we believe is that if we keep pushing on model intelligence, people will do incredible things with it. You know, we want to build the smartest, most helpful models in the world and then find all sorts of ways to use that and build on top of that. It has definitely been an
evolution for us to not just be entirely research focused; we do have to fix all those bugs and make this usable, and I think we've gotten better at balancing that. But still, as part of our culture, I think we trust that if we can keep pushing on intelligence, people will build just incredible things with it.
Yeah, I think it's a core part of the philosophy and you do a good job pushing us to always, well, basically incorporate the frontier of intelligence into our products, both in the APIs and into our first party products. Because it's easy to kind of stick to the thing you know, the thing that works well, but you're always pushing us to like get the frontier in, even if it only kind of works because it's going to work really well soon.
So I always find that a really helpful push. You kind of answered the next one. You do say please and thank you to the models. I'm curious, how many people say please and thank you? Isn't that so interesting? I do too.
I kind of can't not; I feel bad if I don't. Okay, last question and then we'll go into audience questions for the last 10 or so minutes. Do you plan to build models specifically made for agentic use cases, things that are better at reasoning and tool calling? - We plan to make models that are great at agentic use cases. That'll be a key priority for us over the coming months. "Specifically" is a hard thing to ask for, because I think it's also just how we keep making smarter models. So yes, there are some things like tool use and
function calling that we need to build in, and that'll help. But mostly we just want to make the best reasoning models in the world; those will also be the best agentic models in the world. Cool. Let's go to audience questions. How extensively do you dogfood your own technology at your company? Do you have any interesting examples that may not be obvious? Yeah, I mean, we put models up for internal use even before they're done training. We use checkpoints and try to have people use them for whatever they can, and try to build new ways to
explore the capabilities of the model internally and use them for our own development or research or whatever else, as much as we can. We're still always surprised by the creativity of the outside world and what people do. But basically, the way we've figured out every step along our way, what to push on next, what we can productize, what the models are really good at, is by internal dogfooding. That's how we feel our way through this. We don't yet have employees that are based off of O1, but
as we move into the world of agents, we will try that. We'll try having things that we deploy in our internal systems that help you with stuff. There are things that get closer to that already. Customer service, for example: we have bots internally that do a ton of the work answering external questions and fielding internal people's questions on Slack and so on. And our customer service team is probably, I don't know, 20% the size it might otherwise need to be because of it.
I know Matt Knight and our security team has talked extensively about all the different ways we use models internally to automate a bunch of security things and take what used to be a manual process where you might not have the number of humans to even look at everything incoming and have models separating signal from noise and highlighting to humans what they need to go look at, things like that. So I think internally there are tons of examples and people maybe underestimate
You all probably will not be surprised by this, but a lot of folks I talk to are: the extent to which it's not just about using a model in one place, it's actually about using chains of models that are good at doing different things and connecting them all together to get one end-to-end process that is very good at the thing you're doing, even if the individual models have flaws and make mistakes.
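For a concrete picture of that chaining pattern, here is a minimal sketch in Python using the OpenAI SDK. The model names, the triage-then-summarize split, and the security-alert framing are illustrative assumptions, not a description of OpenAI's actual internal pipeline.

```python
# Sketch of a two-stage model chain: a cheap model separates signal from
# noise, and a stronger model only handles the items that survive triage.
from openai import OpenAI

client = OpenAI()

def triage(alert: str) -> bool:
    """Ask a small model whether an alert needs human review."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system",
             "content": "Answer YES if this alert needs human review, otherwise NO."},
            {"role": "user", "content": alert},
        ],
    )
    return resp.choices[0].message.content.strip().upper().startswith("YES")

def summarize_for_human(alert: str) -> str:
    """Use a stronger model only on the alerts that survive triage."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": "Summarize this alert for an engineer and suggest next steps."},
            {"role": "user", "content": alert},
        ],
    )
    return resp.choices[0].message.content

def process(alerts: list[str]) -> list[str]:
    # The end-to-end chain is what needs to be good; each model only has to
    # be good at its own narrow step.
    return [summarize_for_human(a) for a in alerts if triage(a)]
```

The point is the one made in the interview: each model handles the step it is good at, and the quality of the end-to-end process comes from the chain plus the human who reviews the output, not from any single model being flawless.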
Thank you. I'm wondering if you guys have any plans on sharing models for offline usage, because with this distillation thing it's really cool that we can make our own models, but for a lot of use cases you really want to have a local version of it. We're open to it. It's not a high priority on the current roadmap. If we had more resources and bandwidth we would go do that, but there are
a lot of reasons you want a local model, but it's not like a this year kind of thing. My question is, there are many agencies in the government, the local, state, and national level that could really greatly benefit from the tools that you guys are developing, but have perhaps some hesitancy on deploying them because of security concerns, data concerns, privacy concerns. And I guess I'm curious to know if there are any sort of planned partnerships with governments, world governments, once
whatever AGI is achieved because obviously if AGI can help solve problems like hunger, poverty, climate change, government's going to have to get involved with that right? I'm just curious to know if there's some you know plan that works when the time comes.
Yeah, I actually think you don't want to wait until AGI, you want to start now, right? Because there's a learning process and there's a lot of good that we can do with our current model. So we've announced a handful of partnerships with government agencies, some states, I think Minnesota and some others, Pennsylvania, also with organizations like USAID. It's actually a huge priority of ours to be able to help
governments around the world get acclimated and get benefit from the technology. And of all places, government feels like somewhere where you can automate a bunch of workflows, make things more efficient, reduce drudgery, and so on. So I think there's a huge amount of good we can do now, and if we do that now, it just accrues over the long run as the models get better and we get closer to AGI.
Pretty open-ended question: what are your thoughts on open source? Whether that's open weights or just general discussion, where do you guys sit with open source? I think open source is awesome. Again, if we had more bandwidth, we would do that too. We've gotten very close to making a big open source effort a few times, and then the really hard part is prioritization; we have
put other things ahead of it. Part of it is that there are such good open source models in the world now. The thing we always wanted most is a really great on-device model, and I think that segment is fairly well served. I do hope we do something at some point, but we want to find something that we feel like, if we don't do it, the world will just be missing it, and not make another thing that's a tiny bit better on benchmarks, because we think there's a lot of good stuff out there.
But spiritually, philosophically, I'm very glad it exists, and we'd like to figure something out there. Hi Sam, hi Kevin. Thanks for inviting us to Dev Day. It's been awesome; all the live demos worked, it's incredible. Why can't Advanced Voice Mode sing? And as a follow-up: if it's a compliance or legal issue in terms of copyright, et cetera, is there daylight between how you think about safety for your own products on your own platform versus giving us developers, kind of, the ability to,
I don't know, sign the right things off so we can make Advanced Voice Mode sing? You know, the funny thing is, Sam asked the same question.
Why can't this thing sing? I want it to sing. I've seen it sing before. Actually, there are things that we obviously can't have it sing: you can't have it sing copyrighted songs if you don't have the licenses, et cetera. And then there are things it can sing; you can have it sing "Happy Birthday" and that would be just fine.
And we want that too. Basically, with finite time, it's easier to say no for now and then build it in later, but it's nuanced to get right, and there are penalties to getting these kinds of things wrong. So it's really just where we are now; we really want the models to sing too. People were waiting for us to ship voice mode, which is very fair. We could have waited longer and really gotten the
classifications and filters on, you know, covering music versus not, but we decided we would just ship it and there will be more to come. I think Sam has asked me like four or five times why we don't have the singing feature. I mean, we still can't offer something that's going to land us in really hot water, whether for developers or first party or whatever. So yes, there can maybe be some differences, but we still have to comply with the law. Could you speak a little to the future of where you see context windows going, and the timeline for that?
How do you see the balance between context window growth and RAG, basically information retrieval? I think there are two different takes on that that matter. One is: when is it going to get to kind of normal long context, like 10 million tokens or whatever, long enough that you just throw stuff in there and it's fast enough that you're happy about it? I expect everybody's going to make pretty fast progress there, and that'll just be a thing. Long context has gotten weirdly less usage than I would have expected so far, but
there's a bunch of reasons for that, and I don't want to go too much into it. And then there's this other question of when we get to context length not of 10 million but 10 trillion, when we get to the point where you can throw every piece of data you've ever seen in your entire life in there. That's a whole different set of things that obviously takes some research breakthroughs, but I assume infinite context will happen at some point, and at some point that's
less than a decade away, and that's going to be just a totally different way that we use these models. Even getting to the 10 million tokens of very fast and accurate context, which I expect to be measured in months, something like that, people will use in all sorts of ways. Great. But yeah, the very, very long context I think is going to happen, and it's really interesting. I think we maybe have time for one or two more.
Don't worry, this is going to be your favorite question. So with voice and all the other changes that users have experienced since you launched your technology, what do you see as the vision for the new engagement layer, the form factor, and how we actually engage with this technology to make our lives so much better? I love that question. It's one that we ask ourselves a lot, frankly.
And I think it's one where developers can play a part, because there's this trade-off between generality and specificity. I'll give you an example. I was in Seoul and Tokyo a few weeks ago, and I was in a number of conversations with folks with whom I didn't have a common language, and we didn't have a translator. Before, we would not have been able to have a conversation; we would just sort of smile at each other and continue on. I took out my phone and said, ChatGPT, I want you to be a
translator for me. When I speak in English, I want you to speak in Korean. When you hear Korean, I want you to repeat it in English. And I was able to have a full business conversation, and it was amazing. Think about the impact that could have, not just for business, but for travel and tourism and people's willingness to go places where they might not know a word of the language. You can have these really amazing impacts.
But inside ChatGPT, that was still something I had to set up myself; it's not optimized for that, right? You want this sort of digital universal translator in your pocket that just knows that what you want to do is translate. That's not that hard to build.
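To make that concrete, here is a minimal, text-only sketch of that translator setup using the Chat Completions API. The speaker did this by voice inside ChatGPT, and a production voice version would more naturally sit on the Realtime API, so the model choice and prompt wording below are illustrative assumptions.

```python
# Text-only approximation of the "universal translator" prompt described above.
from openai import OpenAI

client = OpenAI()

INSTRUCTIONS = (
    "You are a translator. When you receive English, reply only with the "
    "Korean translation. When you receive Korean, reply only with the "
    "English translation. Do not add commentary."
)

def translate(utterance: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": INSTRUCTIONS},
            {"role": "user", "content": utterance},
        ],
    )
    return resp.choices[0].message.content

print(translate("It's great to finally meet you in person."))
```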
But we struggle with trying to build one application that can do lots of things for lots of people and that keeps up, like we've been talking about, with the pace of change and with the new capabilities, agentic capabilities and so on. I think there's also a huge opportunity
for the creativity of an audience like this to come in and solve problems that we're not thinking of and that we don't have the expertise to solve. Ultimately, the world is a much better place if we get more AI to more people, and it's why we are so proud to serve all of you.
The only thing I would add is, if you just think about everything that's going to come together at some point in not that many years in the future: you'll walk up to a piece of glass, you'll say whatever you want, there will be incredible reasoning models, agents connected to everything, and a video model streaming back to you a custom interface just for that one request. Whatever you need will just get rendered in real time, in video, and you'll be able to interact with it. You'll be able to click
through the stream or say different things, and it'll be off doing, again, the kinds of things that used to take humans years to figure out, and it'll just dynamically render whatever you need. It'll be a completely different way of using a computer, and of getting things to happen in the world. It's going to be quite wild. Awesome, thank you, that was a great question to end on. I think we're at time. Thank you so much for coming. That's all for our coverage of Dev Day 2024.
We want to extend an extra special note of gratitude to Lindsay McCallum of the OpenAI comms team, who helped us set up so many interviews at very short notice and physically helped ensure the smooth continuity of the video recordings. We couldn't do this without you, Lindsay. If you have any feedback on the launches or for our guests, hop on over to our YouTube or Substack comments section and say hi.
We're especially interested in your personal feedback and demos built with the new things launched this week. Feel the AGI. All right, so you wanted to know more about OpenAI's Dev Day and what stood out to us. We're diving into all the developer interviews and discussions, and there's a lot to unpack.
Yeah, it's interesting. OpenAI seems to be transitioning, moving beyond just building these impressive AI models. Yeah. One expert even called them, get this, the AWS of AI. AWS of AI. Yeah. Okay, so what does that even mean when we talk about AI? So it means instead of just offering this raw power, they're building a whole ecosystem.
The tools to fine tune those models, distillation, you know, for efficiency and a bunch of new evaluation tools. Oh, and a huge emphasis on real time capabilities. Instead of just giving us the ingredients, it's like they're providing the whole kitchen. Exactly. They're laying the groundwork for, well, they envision a future where you can build almost anything with AI. I see. And one of the tools that really caught my eye was this function calling.
They used it in that travel agent demo, remember? How does that even work? So function calling, it's like giving the AI access to external tools and information. Imagine: instead of just having all this pre-programmed knowledge, it can search the web for you, book flights, even order a pizza. So instead of a static encyclopedia, it's like giving the AI a smartphone with internet. Yeah, precisely. And this ties into their focus on real-time interaction, right?
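For readers who want the mechanics behind that "smartphone with internet" description, here is a minimal sketch of function calling with the OpenAI Python SDK. The search_flights tool and its schema are hypothetical stand-ins, not the actual tools from the Dev Day travel-agent demo.

```python
# Minimal function-calling sketch: the model is given a JSON schema for a tool
# and decides whether to call it; your own code then executes the call.
import json
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "search_flights",
        "description": "Search for flights between two cities on a given date.",
        "parameters": {
            "type": "object",
            "properties": {
                "origin": {"type": "string"},
                "destination": {"type": "string"},
                "date": {"type": "string", "description": "YYYY-MM-DD"},
            },
            "required": ["origin", "destination", "date"],
        },
    },
}]

resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user",
               "content": "Find me a flight from SFO to Tokyo on 2024-11-01."}],
    tools=tools,
)

message = resp.choices[0].message
if message.tool_calls:  # the model chose to call the tool
    call = message.tool_calls[0]
    print(call.function.name, json.loads(call.function.arguments))
```

The model never runs anything itself: it returns the tool name and JSON arguments, your code executes the search, and the result is passed back to the model in a follow-up message so it can answer in plain language.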
They see a future where AI can respond instantly, just like a human would. Which would be a game changer. Right. Like imagine voice assistants that actually understand you. Or even seamless real-time translation. No more language barriers. Exactly. That's just the tip of the iceberg, though. They really believe this real-time capability is key to making AI truly mainstream. Okay, so OpenAI is building this AI platform, emphasizing real-time interactions.
How does this translate into, like, actual results? Yeah. You know, real-world stuff. Well, that's where things get really interesting. Let's talk about the O1 model and how developers are using it to really push the boundaries of what's possible. So this O1 model, everyone's talking about it. One developer even said they built an entire iPhone app just by describing it to O1. Is that just hype?
I think there's definitely some substance behind all the hype. What's so fascinating about O1, it's not just about the code it generates, it's how it seems to understand, like the logic. The logic. Yeah. Like this developer, they didn't give O1 lines of code, they described the idea of the app.
And O1 actually designed the architecture and connected everything. The developer just took that code, put it right into Xcode, and it worked. Wow. So it's not just writing code. It's understanding the intent. Yeah, exactly. And this actually challenges how we measure these models. You know, even OpenAI admitted that these benchmarks, like what was it?
SWE-bench. SWE-bench. Right, which looks at code accuracy. It doesn't always reflect how things work in the real world. Right, because in the real world, you don't just need code that compiles. It has to be efficient, maintainable. Exactly. It all has to work together. And OpenAI is really working on this with developers. They're finding that UI development, especially in things like React, needs better evaluation.
It's one thing to code a button that works and another to make it actually look good, you know, and be intuitive. Right. And it seems like this need for real-world context goes beyond just evaluating those models. There was a developer working with this code-generating AI, Genie, I think it was called. Genie. Yeah. And it's more focused on those specific coding tasks.
But they found that its performance really changed between different programming languages, like JavaScript versus C Sharp, for example. And that just highlights how important the data is, right? Just like us, AI needs that variety to learn. If you train it on just one type of code, it'll be great at that, but...
Anything new, and... It will fall flat. Yeah. So it's about making sure these models have a broad diet of data to learn from. That way they're more adaptable and ready for whatever we throw at them. So we've got AI that can build apps, understand what we want, even write different kinds of code. It's a lot. And it feels like things are changing so fast. How can developers even keep up, let alone build something successful with AI? Right.
That's the question, isn't it? But it's interesting, you know, both OpenAI and the developers building with these tools, they kind of agree on one thing. You got to aim for what's just out of reach. So don't wait for the tech to catch up to your...
Like wildest dreams. Focus on what's almost possible right now. Yeah. Build for where things are going, not where they are today. If you wait for that perfect AI, you might miss the boat on shaping how it develops and being the first one out there doing something new. Riding the wave, not chasing after it. Exactly. But, and OpenAI really emphasized this too,
Even with all this amazing AI, you can't forget the basics of building a business. So just because it's got AI doesn't mean it's automatically going to be a success. Right. You need a good strategy, know who you're selling to, and it's got to actually solve a real problem. AI is a tool, not a magic wand. Like having the best oven in the world won't help if you don't know how to cook. Perfect analogy. And then there's this other thing OpenAI talked about that's really interesting. Balancing safety with access.
for everyone. So making sure these AI tools are used responsibly, but also making them available to everyone who could benefit. Yeah. They're really aware that focusing on safety, while important,
could limit access to some really powerful stuff. It's a tough balance. It's like that debate around, you know, life-saving medications. How do you make sure they're used correctly, but also make sure people who need them can actually get them? It's complicated, no easy answers, but it's something they're thinking hard about. Well, it's clear that all this AI stuff, especially with these new models like O1, is changing how we think about tech, how we use it. Imagine walking up to a screen and it just...
creates a personalized experience for you right there, adapts to what you need. That's the potential. Like having a personal assistant in every device. It's exciting. But we got to be thoughtful about it, build responsibly. So there you have it. OpenAI isn't just building these cool AI models. They're building a whole world around them. And it's changing everything. It's going to be a wild ride, that's for sure. And we're just at the beginning.