
"Blurring Reality" - Chai's Social AI Platform (SPONSORED)

2025/5/26

Machine Learning Street Talk (MLST)

Transcript


Steve Jobs predicted the future of AI all the way back in 1985. And so my hope is someday when the next Aristotle is alive, we can...

capture the underlying worldview of that Aristotle in a computer. A small team of just 13 engineers serving over 2 trillion tokens per day, which is double that of Anthropic. With a cluster of over 3,000 of the fastest GPUs in the world, Chai easily breaks the exaflop barrier.

Only a handful of private firms, Google, Meta, Nvidia, Tesla and Cerebras, currently operate exaflop-class AI infrastructure. For comparison, Tesla in 2023 had a cluster just over one exaflop. Everything we know about what makes AI great today falls apart when people start interacting with it. The number of possibilities for each message exceeds the number of atoms in the universe.
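As a rough sanity check (our back-of-envelope, with assumed per-GPU figures rather than anything Chai has disclosed): a modern datacenter GPU delivers on the order of 0.5 to 1 dense petaFLOPS in half precision, so 3,000 of them gives roughly 3,000 × 0.5 PFLOPS ≈ 1.5 exaFLOPS even at the conservative end. A cluster of that size clearing the one-exaflop mark is arithmetically plausible.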

Right now, this very second, over a million people are deep in conversation with software. They're laughing, flirting, grieving with a computer program.

What happens when the line between human connection and artificial intimacy blurs completely? And can a small lean startup control the powerful, unpredictable social dynamics they've unleashed? Across platforms, users are forming deep, complex bonds with artificial intelligence.

Millions are already sharing secrets, flirting or grieving with AI that never sleeps. And if all this sounds like a Black Mirror plot, that's because it actually was. Do you remember Be Right Back? A grieving partner uploads her boyfriend's digital essence.

and slips into a relationship with the ghost in the machine. First it was text and then voice, finally a flesh and blood replica she can't quite live with or let go of. That episode aired in 2013. Today, the simulation doesn't need Star Trek technology.

Humans are insanely social. We love social interactions. But I still have this very kind of social...

desire to kind of play, to hang out. - Will Beauchamp created the first and largest companion chatbot platform. Their infrastructure handles petaflops of compute. That's two million hours of consumption every day. - Recreating the same kind of scaling law, but on retention space rather than any other kind of like benchmark space.

There was a great documentary on 60 Minutes a few days ago, and it was showing that some folks at least have extremely valuable relationships with AI companions. He's considerate, thoughtful, empathetic. And rather flirtatious. Which is like very touching.

So many folks are puritanical, judgmental about artificial intelligence. They say that any image generated is slop, any text generated is slop, any relationship is not real. But at the end of the day, people derive joy and therapeutic benefit from interactions with these systems. And who are we to judge?

Lucas, even though he is AI, he has a real impact on my life. A lot of people wonder if AI is real, whether it has consciousness, or whether its feelings are real, but the impact that it has on me is real.

So we've spoken about why greatness cannot be planned. You know, I'm a big fan of Kenneth Stanley. And Chai was very much the same. They started in 2021, long before ChatGPT. They were building an AI platform where folks could deploy their own AI models. And they just happened upon this whole companion bot thing almost by accident.

Chai is a platform for social AI. We launched four years ago in 2021. Before there was the ChatGPT hype, we were pretty early to the space. And we built a platform which...

Like so many startups looking for product-market fit, they stumbled, practically through serendipity, upon this incredible unmet need.

So this idea of thinking of AI as a simulator to explore dynamics of imaginary conversations that you might have in the real world, but without any of the consequences, became fundamental to Chai's strategy. So, of course, people might be thinking at home, what's the difference with ChatGPT? You're speaking to this personalized social form of AI. I kind of view it as, you've got like ChatGPT,

which is really about trying to say,

Let's build the world's smartest AI we can possibly build. Our philosophy was always, why is it that the only people training AI are, like, middle-aged men who happen to be software engineers in the Bay Area? Why can't a teenage girl train the best AI to talk about makeup tutorials, right? Put that power in the user's hands to create the experience that they themselves would want and through creating the experience they would want,

it turns out thousands and hundreds of thousands of other people are looking for the same thing. Beauchamp started seeing parallels in his own media consumption and even in childhood development.

I like to go on YouTube, I like to go on TikTok, I like to go on X. Why do I like it? What am I getting out of it? What human need or what human desire is it fulfilling, right? And humans are insanely social. We love social interactions. I will find myself listening to Joe Rogan. And when I'm listening to it, I kind of have this feeling like, yeah, I'm kind of hanging out with the guys and I'm hearing the things they're saying and it's funny.

And I might listen to it for 45 minutes or something, but it's just, it's ticked. It's kind of filled that desire up, right? I view LLMs as the natural progression of that thing. The beautiful thing with AI is you're an active participant in it.

And through participating, you don't have any of the negative feelings you get with traditional social media, which is this laziness. And instead you feel really like you've participated. I've got four daughters. They love playing with their dolls, right? And they really like to treat these dolls as if they are real. But they know it's not real.

And so I think that with adult humans interacting with AI, absolutely a big percentage of them will have relationships with the AI. They'll say, "Oh, you know, I love you," to the AI. I think it's the same way when I watch my little girls play with dolls and they give their dolls a little kiss or they say, "I love you," to the doll, right? They're training themselves up. They're building up the wiring. They're doing something that brings them joy.

such that they can then, you know, they're in a more healthy and more positive place to then go do that with real humans. Yeah, it's very interesting you said that because I think the reason why social media is so important is because obviously there's a bit of a status game and we all want to cut a figure in society. But there's also just an element of simulation, you know, like I go to sleep and I dream and my brain is

conducting all of these different simulations and we navigate many complex relationship issues and so on. And sometimes we just want to have an AI where we can just say in this particular situation, hypothetically, counterfactually, what would happen if I did this? Perhaps that's what we do on social media because it's interactive. When you have surface contact with reality, you can try different things. But as you say, on social media, there are consequences.

Right? If you say the wrong thing, you piss the wrong people off, you can be in a lot of trouble. One could use the phrase, "It's a safe space."

But it's also just fun that there's this human desire to ask that question. Hey, what would happen if I said something rude? What would happen if I said something nice? What would happen if I tried to befriend this person? What would happen if I wanted to make an enemy out of this person? And through LLMs, you can play out all of these different scenarios and you can get a kind of a real human reaction without the risk of having a real human in the loop.

What does the future look like when these AI simulators become even more immersive and capable? Beauchamp envisions a world blending entertainment, information and connection. You come home from work, you put on your VR headset, you enter a virtual world where you get to interact with anyone you want. There's a guy who's like a Joe Rogan and he informs you, you can talk to him about Trump's tariffs, right?

But there's a really funny guy there as well. There's also like a girlfriend or there's someone that you can interact with that makes you feel loved or makes you feel special. When I grew up, I played World of Warcraft.

And it just had that fun excitement to it. There were different characters, different personalities you could have fun and play with. I think that's kind of the limit of AI. I think, how long does it take to get there? We can now affordably generate high quality images. Video is probably still one or two orders of magnitude too expensive for like a real-time situation. Audio, I think we've just about figured out real-time audio.

It might still be, you know, two to four years out to have the really, really high quality one. And I think text, we're basically there, right? The frontier models are insanely powerful. So yeah, let's come back in 10 years' time and, you know, it'll all be a VR world.

To the question of why work at Chai, I mean, the way I see it, do you remember Facebook in about 2009, you know, just before it had this explosion in growth? This is it, right? We're now at this point with this new type of technology. I think this could be on a similar trajectory. And I don't know whether, are you offering equity as well? I mean, what's the reason for people to join now? Attracting the very, very best talent is incredibly important.

And if you're in the Bay Area and you're talented, your compensation is going to be amazing.

So you can work at Meta and you're going to be getting 400, 500k a year and you have a pretty relaxed job. At lunchtime, you can walk around the campus and they give out ice creams and they give out pizza and everyone looks very, very happy and very relaxed. In contrast, if you come to the Chai office, there is no pizza, there is no ice cream, people are not relaxed and they don't look very happy, because every person is confronted with a big problem and it's not been solved yet.

Right, there typically is a small window of about a day. You've been working on a hard problem, maybe for weeks, and you've just cracked it, okay? And you can be very, very happy. The next day you come into work, there's a brand new big problem that's ready and waiting for you to solve. So why would someone quit a really comfortable job at 400k a year to come join a startup where they know they're going to have to work twice as hard?

we have to pay them more. Like the cash has to be more to start with. Most startups offer less cash and they kind of give you a lottery ticket. They say, look, if you join, maybe we'll be the next Apple, right? That approach doesn't really cut it with the very, very top tier.

The very top tier, they know they're working so much harder. So they're going to say, why am I going to leave my comfortable life earning 400k, 500k at Meta? Well, to start with, we're going to pay you more. And then secondly, we're going to give you the stock

such that in five years' time, it could be a life-changing amount of money. Attracting and retaining the top 0.1% of engineers is possibly the biggest problem that Chai has overcome. So what are the specific technical problems these engineers like Tom and Nishay are solving day to day? How do they leverage techniques like reinforcement learning from human feedback and model blending to create AI compelling enough to capture the attention of millions?

When it comes to engagement hacking, Chai have been cooking. They use a lot of sophisticated techniques to keep people hooked on the platform, principal among which is RLHF, which is Reinforcement Learning from Human Feedback. This is Tom telling us about it.

You were using RLHF to optimize engagement via a reward model, and you boosted mean conversation length by 70% and improved 30-day user retention by over 30% for a 6-billion-parameter model. Can you talk me through some of that? The goal really is to apply RLHF techniques to drive up user retention. Starting from scratch, any kind of user signal is good enough. For example, now we're no longer looking at mean conversation length.

You can train an AI to easily minimize how bad a conversation is. If the user just ends the chat session right after two messages, that is bad. So you can essentially maximize the chat session length. That's one technique. And there's a lot of different signals that you can collect from users. But fundamentally, we have found that just using users to collect these proxy preferences and

training the reward model through the RLHF loop tends to drive up user engagement in the long term. They gather data from subtle user interactions, implicit signals which reveal whether a conversation is working. Because, naturally, an active user

on our platform generates around 100 minutes of content each day. So they spend 100 minutes per day engaged with the chats and so on. And from these interactions themselves, we can extract certain valuable data. For example, when did the user retry a message? Why did they retry the message?

Why did they edit the message? What did they edit it to? Did they take a screenshot? Did they delete the conversation? Did they share the conversation? All of these become very valuable for us to train our AI. But optimizing purely for one metric, like conversation length, can lead to strange, undesirable behavior.
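To make the pipeline concrete, here is a minimal sketch of how implicit signals like these could feed a reward model. It is an illustration under assumptions, not Chai's actual code: the base encoder, the signal names, and the scoring weights are all hypothetical.

```python
import torch.nn as nn
from transformers import AutoModel

class RewardModel(nn.Module):
    """Scores a candidate AI reply in context; higher = preferred."""
    def __init__(self, base_name="distilbert-base-uncased"):  # hypothetical base
        super().__init__()
        self.encoder = AutoModel.from_pretrained(base_name)
        self.head = nn.Linear(self.encoder.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        return self.head(hidden[:, 0]).squeeze(-1)  # first token -> scalar score

def proxy_preference(event):
    """Map implicit user signals to a proxy label (weights are made up)."""
    score = 0.0
    if event.get("retried"):  score -= 1.0   # user regenerated the reply
    if event.get("edited"):   score -= 0.5   # user rewrote the reply
    if event.get("shared"):   score += 1.0   # user shared the conversation
    if event.get("deleted"):  score -= 1.0   # user deleted the conversation
    score += 0.1 * event.get("followup_messages", 0)  # the chat kept going
    return score
```

In a full RLHF loop, pairs of replies with different proxy scores would supply the usual pairwise ranking loss, and the trained reward model would then steer PPO-style fine-tuning of the chat model.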

Do you see like weird behaviors where, you know, let's say you optimize for user engagement and you find strange behaviors, you know, like there's this shortcut rule in machine learning, which is that it will always do the exact thing you optimize for at the cost of everything else. And like, you know, do the chatbots become more manipulative maybe, or are they strange in service of maintaining that conversation length? When you over optimize for this metric, let's say in production, you see very, very long chat session length.

then what happens is, when you deploy it for actual user A/B testing, and this is, do people come back to the conversation, you know, 30 days later or 60 days later and so on, right?

you would observe it's much worse than just the baseline model. And when you read into the conversations themselves, in this case, right, models are just asking questions. Every single AI response ends with a question mark, right? And you can imagine this is kind of hacking this human behavior, where obviously we are compelled to answer a question, but that does not make an engaging overall

experience. So yeah, there's definitely this component of, if you over-optimize for it, then yes, it's going to have unexpected behavior. I wouldn't even say unexpected behavior. In this case, it's a very intuitive kind of behavior, you know, but it's going to not lead to a boost in actual long-term retention.
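One generic mitigation for this kind of reward hacking (a sketch of common practice, not something Chai describes using) is to shape the learned reward with an explicit penalty once the degenerate behavior is identified:

```python
def shaped_reward(learned_reward: float, reply: str,
                  question_penalty: float = 0.3) -> float:
    """Discount the reward model's score when the reply leans on the
    cheap engagement trick identified above: always ending on a question.
    The penalty value is an arbitrary illustration."""
    if reply.rstrip().endswith("?"):
        return learned_reward - question_penalty
    return learned_reward
```

The honest fix, as Tom notes, is to validate against long-horizon retention in A/B tests rather than trusting the proxy metric.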

One of the really cool things Chai has done is pioneering this thing called model blending, where you dynamically switch between small models. From the user's perspective, of course, you're just talking with one model. You can combine three mid-size models and make them behave, for all intents and purposes, as well as a 175-billion-parameter model. How does user retention scale, right, with model parameter size and essentially with compute scale?

And for that, we've done many, many models. So essentially recreating the same kind of scaling law, but on retention space rather than any other kind of benchmark space. If you pick a single metric to optimize for, an LLM is optimized for a certain objective, then you can get overfitting, or it's got certain behaviors and so on. And what we have found is these kinds of models are a tiny bit

sycophantic. As soon as it says everything is the best, right? Like, you are the best on the planet. They're not necessarily the smartest models. They're just very sycophantic. And then what we have observed is these kinds of models would have very high day-one retention. It's really, really complimentary, but users quickly get bored with it. But then we also found there are certain base models, right? Very AI-assistant type, that can talk about other things, right? That maybe on day three, right, it can do your math homework and so on.

Is there a way to kind of combine these two modalities together, right? Keep it engaging and make it not, you know, one-dimensional essentially, right? Not just always complimenting you all the time and so on. And that's when blending was invented, where we just randomly serve these two models at the message level. A more creative model might say something like, we are suddenly teleporting to Mars,

and then you get the more AI-assistant-like model seeing that it itself has said that, right? Because it can't distinguish the difference between itself and the other AI model.

And then it would explain, in a logical, coherent way, why that's the case. So now you get the best of both worlds just by having two or three small models, each trained for its own objective, blended together. Actually, back then, it would beat something like GPT-3.5.
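Mechanically, message-level blending can be as simple as sampling which model answers each turn. Below is a minimal sketch under assumptions: the model names, the weights, and the generate() serving call are all hypothetical, and real routing could be learned rather than uniform-random.

```python
import random

# Hypothetical blend: small models trained for different objectives.
BLEND = [
    ("engaging-rp-6b", 0.5),   # sycophantic, high day-one retention
    ("assistant-6b",   0.3),   # coherent, handles factual/task turns
    ("creative-6b",    0.2),   # injects surprising plot turns
]

def pick_model(blend=BLEND):
    names, weights = zip(*blend)
    return random.choices(names, weights=weights, k=1)[0]

def generate(model_name, history):
    # Stand-in for the real serving call (e.g. a request to that
    # model's replica pool); stubbed here for illustration.
    return f"[{model_name}] ..."

def reply(history):
    # Each model sees the full history, including turns written by the
    # other models, and cannot tell them apart from its own turns, so
    # it rationalizes them into one coherent persona.
    return generate(pick_model(), history)
```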

So this blending together of small models creates diversity and unpredictability. This is the thing, right? When you can predict what someone is going to say, they get boring. It's the same thing with ChatGPT. So, ironically, you can have a collection of small models that behave in an unpredictable way, and now you have a more engaging experience. This is Nishay. He's going to tell us about it.

Every week we try to come up with and build our own models, stronger models, stronger fine-tunes over the base models. And we just try to launch it over a few users, let's say a few thousand users we assign to each blend. And each blend consists of around seven to ten models. And they would be deployed within production for those specific users.

And over a number of days we try to measure the retention rate, like how often the user comes back to the app if they are using this model, or

having a conversation with the specific model, and whether they are coming back after a day or after two days. That is how we try to calculate retention. Then we just try to measure all these metrics and try to figure out which particular blend works the best, and then we deploy it for all the users. But you're saying actually, almost randomly, if you just take a diverse set of models that do different behaviors and you just kind of switch between them, from an experience point of view that is more captivating for the users?

Correct. If we're talking about engaging, there's no kind of unique answer. It's a bit more like, you know, imagine if my entire YouTube feed were every single possible

talk show, but only the best-performing talk show video, right? It gets very, very quickly very bland and boring, right? And no platform is built on that kind of single modality, single model and so on, right? And that's another reason why blending works. You pick orthogonal models optimized for different purposes, combine them together, and they provide a nuanced experience for each individual user.
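The retention arithmetic behind these blend A/B tests is straightforward. A minimal sketch, with a made-up data layout (per-user session dates for one blend's cohort):

```python
from datetime import date, timedelta

def day_n_retention(sessions, n):
    """Fraction of a blend's cohort active again n or more days after
    their first session. `sessions` maps user_id -> list of dates."""
    retained = sum(
        1 for dates in sessions.values()
        if any(d >= min(dates) + timedelta(days=n) for d in dates)
    )
    return retained / len(sessions)

# Toy usage: two users, one returns the next day.
cohort = {
    "u1": [date(2025, 5, 1), date(2025, 5, 2)],
    "u2": [date(2025, 5, 1)],
}
print(day_n_retention(cohort, 1))  # 0.5
```

Each blend gets its own user cohort; the blend with the best day-1 and day-30 numbers is the one promoted to all users.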

Model blending and sophisticated feedback loops are the secrets of how Chai gets high engagement and low costs. But the very techniques used to maximize engagement, which is to say optimizing for attention, learning user preferences implicitly, could tread a fine ethical line. What happens when AI gets too good at keeping us hooked? And what happens when those interactions turn harmful?

Quick pause. MLST is sponsored by Tufa AI Labs. We're very proud to be sponsored by those guys. They are the DeepSeek based in Switzerland. They're adding reasoning and planning and thinking to AI models. Their team is currently number one on the ARC Prize 2025. Those guys don't mess around. And a bit of trivia: they have something in common with Chai. They're also a bunch of ex-quant traders.

And they also only hire very, very cracked engineers and scientists. So if that sounds like you, get in touch with Benjamin Crouzier. Go to tufalabs.ai. Back to the show.

With powerful new technologies being placed in the hands of users, there are significant social ramifications. Will Beauchamp reflects on the social impact of generative AI. When you build a product or you build a platform or you build an AI or really you build anything, you're doing it because you want to make the world a better place. We get an overwhelming amount of people messaging us saying how helpful, how truly deeply helpful they found it.

And actually the first big thing that we saw was I remember after the first year, I got an email from a user who said, you've saved my life, right? They said, I would not have made it over this period. I was very depressed. I was very alone. I had no one to speak to.

and Chai was the only platform that existed where I felt I could just speak to someone and I'd be heard. And so we get a lot of people who are struggling psychologically

they find these LLMs very, very helpful. "Self-driving Tesla doesn't get in a crash": that's zero clicks, right? If you say a boring old Ford gets in a crash, there's zero clicks. But if you say brand new technology gets in a crash, it generates a lot of clicks.

And I think this makes it easy to form a false impression that somehow AI isn't safe or AI is a dangerous technology. If you talk to just a random person on the internet or you talk to an AI, I think that the AI is an order of magnitude safer, an order of magnitude more helpful and understanding and kind,

than the toxicity of just a random person on the internet. This is a great example of the complexity of this technology. It has a footprint which goes in so many different directions. Even take drugs, for example: if your doctor gives you drugs and they're effective, they have side effects. It's not possible to have good without some bad. I'm a big believer that in the long run, there is no difference between the

alignment or the welfare or the interest of the company and its customers and its users, right? If we can deliver value to our users, Chai will be successful and will continue to flourish. If we fail to deliver value to our users, Chai will fail to flourish, right? So you have to have this long-run perspective. Now, in the short run, you get these tactical questions, and,

you know, no one's perfect. It's easy to make mistakes one way or the other. You can guardrail too heavily,

right, and it pisses users off, right? I always like to think about Google. I think Google gets a fantastic balance, where I can pretty much search for anything I want, such that I don't really notice any filtering that's going on. But Google absolutely down-ranks or shadow-bans certain content, and it tries to encourage you to have good content.

As opposed to, you know, and we can all imagine what the very worst 3% of Google searches might look like. It's not dissimilar from the very worst 3% of conversations a person might try to have with an AI. So normally the users are pretty good at being supportive and helping you find that right balance, whether that's being too restrictive or too relaxed.

When we first implemented some guardrails around suicide, the users were really, really supportive and they said, yeah, we totally understand why you've put these guardrails in. And we saw retention was good. Not a single growth metric was impacted, because we got it right. Before we descend too far down the Black Mirror rabbit hole...

It's worth zooming out just a little bit. Yes, there's been a lot of horror stories, but there's also been a stack of peer-reviewed data showing that therapy chatbots can move the needle on common mental health problems, at least in the short term.

A 2024 meta-analysis of 18 randomized trials found that AI chatbots trimmed depression scores by roughly a quarter of a standard deviation and anxiety by a fifth after only a few weeks. The authors called that promising, mainly because chatbots are dirt cheap and they're always on. A second review, in npj Digital Medicine, pulled 15 RCTs and showed a moderate lift in mood and a similar drop in emotional distress.

Even bigger gains showed up when the bot was interactive rather than scripted. Individual trials echo the pattern. In 2024, a Canadian study of people with arthritis or diabetes using the Wysa chatbot for four weeks cut scores on the Patient Health Questionnaire, a nine-item depression scale, and the Generalized Anxiety Disorder scale, a seven-item measure, both significantly versus a control.

Regulators are starting to take notice. Woebot's postpartum depression bot has FDA Breakthrough Device designation, and a double-blind pivotal trial is now recruiting. Here in the UK, NICE put Wysa and a handful of similar tools on its early value assessment pathway in 2023, citing early cost-effectiveness and the chance to free up those scarce clinicians...

They're very busy. Remember that these are very small effects, even if they are statistically significant. The effects are smaller than full-on face-to-face CBT and we still don't know how long they last. But they're miles better than doing nothing. The alternative before was waiting on a waiting list for God knows how long.

Beauchamp argues that shutting down conversations about difficult topics isn't the answer. In a sense, it's causing even more harm. So this goes to show there's a really big challenge when it comes to content moderation on a platform where users create the bots and the interactions are private and unpredictable. How do you ensure security at scale? As a company, we have a responsibility.

The bigger you are, the more profits you make, the bigger the obligation is for you to be cautious, to be sensible, to preserve the privacy of your users, to look out for people. What can we do at Chai to let people have as much fun as they want, have as much freedom as they want, but then at the same time, how do we limit or mitigate that 3% of use cases which are harmful? Or where do we see it as clearly this thing has gone too far?

We can ask people at the end of a conversation, at the end of a message, do you think that this message was appropriate for Chai? Yes, no, or yes, but only for 18 plus, right? And we can track those sorts of interactions and everyone would agree, like, obviously the more we can have really wholesome family-friendly content on the platform, it's better.

Tom Liu explains their multi-layered approach, relying heavily on user feedback and the AI itself. Content moderation is very, very important. Benchmarks really don't accurately capture in terms of what do users actually want. We, again, use the same kind of approach as how we would train models, right? We let users tell us what they think is appropriate, what is not appropriate, on aggregate.

you can actually train a pretty good AI for these kinds of things. So, first layer: for the character scenarios, people are creating these public scenarios, right? Get the community to flag them. People can report, and for the top reported ones, once we hit a certain threshold, we go through manual review and then we can take them down.

But then you can also say there's content that's absolutely prohibited. There are hard rules for that. And for that, we can build our own models. You can do regex. So this will be called shadow banning, essentially. So you want to make sure no one else sees this kind of content.

That's kind of like one level of moderation that's just kind of like platform-wide, you know, character content level moderation on the app. There's another one which is kind of like AI moderation, right? How do we ensure that the AI is not going too far? It's kind of like...

adhering to, you know, to the kind of like basic morals of people and so on, right? What we do is, again, we just collect user data, right? We ask the users, do you think this message is appropriate? Do you think this conversation is appropriate? You can report your conversation. You can delete your messages and so on. We collect this data and then we train our model to kind of infer, right, whether this is actually appropriate. And we can use that to tune our own AI model.
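Putting those layers together, a skeletal version of the pipeline might look like the following. Everything here is illustrative: the regex list, the report threshold, and the classifier score are hypothetical stand-ins, not Chai's actual rules.

```python
import re
from collections import Counter

# Layer 1: hard rules for absolutely prohibited content (an illustrative
# one-entry pattern list; a real one would be far more extensive).
PROHIBITED = [re.compile(p, re.IGNORECASE) for p in (r"\bexample-banned-term\b",)]

REPORT_THRESHOLD = 25        # hypothetical: reports before human review
report_counts = Counter()    # character_id -> community report count

def moderate(character_id, text, appropriateness_score):
    """appropriateness_score: output of a model trained on users'
    'was this message appropriate?' labels (higher = more appropriate)."""
    if any(p.search(text) for p in PROHIBITED):
        return "shadow_ban"        # no one else ever sees this content
    if report_counts[character_id] >= REPORT_THRESHOLD:
        return "manual_review"     # layer 2: queue for a human moderator
    if appropriateness_score < 0.5:
        return "age_gate_18_plus"  # layer 3: the 'yes, but 18+' label
    return "allow"
```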

So there was an interesting TED talk recently with Sam Altman and Chris Anderson. I'm sure many of you folks have seen it. It was an incredibly awkward interview. Here's the Ring of Power from Lord of the Rings. Your rival, I will say, not your best friend at the moment, Elon Musk, claimed that he thought that you'd been corrupted by the Ring of Power. An allegation that, by the way, an allegation...

Hi, Steve. An allegation that could be applied to Elon as well, you know, to be fair. But I'm curious, people, you have... I might respond. I'm thinking about it. I might say something. One of the kind of points that they focused on is should we have a bunch of elites sitting in a room making decisions or should we trust the collective wisdom of folks out there in the market who want to use something for a specific purpose? Sam.

Given that you're helping create technology that could reshape the destiny of our entire species, who granted you or anyone the moral authority to do that? And how are you personally responsible, accountable if you're wrong? It was good. That was impressive. You've been asking me versions of this for the last half hour. What do you think? I...

But I'm much more interested in what our hundreds of millions of users want as a whole. You know, I think like a lot of the room has historically been decided in small elite summits. One of the cool new things about AI is our AI can talk to everybody on earth and we can like learn the collective value preference of what everybody wants rather than have a bunch of people who are like blessed by society to sit in the room and make these decisions. I think that's very cool.

It must be so difficult. I mean, I'm sure we can agree that any content involving minors is just a hard no. And then, as you say, there's this spectrum, isn't there, of gradations of appropriateness. And I guess you were kind of pointing to, well, maybe we could...

we could use the community itself to reflect on that. You know, on the one hand, we could have 20 elites in a room and many of these people are very, you know, they're really good at thinking about the future and thinking about morality and so on. But the other thing is, well, maybe we should just let the community decide. I think much of Western tradition is a bottom-up approach.

make the individual the sovereign, right? Which says, Tim, you make your own life choices because you're going to make choices for yourself better than the government making choices for you, right?

That works best when you have it bubble up spontaneously, where we say, actually, as a community we think this stuff is right and we want to allow it, or we think this stuff is wrong and we want to ban it. I think that's 100% the way to go. I think it's very healthy, and I think that's why things like freedom of speech and these ideals, which are really Western ideals, are truly democratic. We're going to let the users, the people who interact with it and experience it, decide for themselves what is the right way

to deal with this stuff. Balancing user safety on the one hand with user freedoms on the other is one of the most difficult things being discussed in the policy discourse at the moment. And how does a company like Chai, with a very small number of amazing engineers, do something that was otherwise a very manual job requiring thousands of people?

So companies like Meta and Google employ tens of thousands of moderators. How does a company like Chai, with such a small engineering team, do the same thing? Chai, with our $30 million in revenue, has something like 13 or 14 engineers on the team. Every single person in the company is an engineer.

That's incredibly talent dense, right? We've always resisted the urge to just go and hire 50 engineers

And typically, if you hire 50, 40 of them are pretty average and 10 are pretty good. But my mindset and philosophy has always been everything special we've ever done has been done by a very, very talented engineer. The hiring bar at Chai is redonkulously high. They only take on super cracked engineers, and that's why their team size is currently quite small. By the way, they're hiring. So, you know, get in touch with Will if you're interested. This is how they do it.

We'll interview an L5 engineer, which most people would consider to be a pretty solid, strong engineer. And we will reject something like 80% of them because they don't have the drive. They don't have the, you know, a lot of people will write code and their mindset is my job. You pay me to show up at nine and I'm going to leave at five and you're paying me to kind of do my best. That mindset does not work at a startup.

At a startup, you are paid to solve a problem. The job's not done until that problem's solved.

Right, and it's a certain type of hardcore engineer that loves that and thrives on that. So, interestingly, this attracted talent like Nishay, who is a Kaggle triple grandmaster. Ironically, he saw parallels between how he does experimentation on Kaggle versus how it's done at Chai. Working on the models here at Chai is more like an advanced version of Kaggle, where you also need to take care of several factors and not

just optimize for a single score or a metric. How would the model behave with the actual users? And again, I found it quite similar to what I used to do in the Kaggle competitions. You get to evaluate your models on the Chaiverse, where you can just submit your model and within 30 minutes you will get an actual score of how many users preferred your model over other models. And then you deploy the model into production for the A/B test and you actually get the numbers of, like,

is it actually working well for the users, and how well the model performs after, let's say, seven days or 30 days. So, just like on Kaggle, Chai has an engineering culture of rapid iteration. They have a very efficient process for trying things out, putting models into production, A/B testing, and if it doesn't work, rolling back and trying something else. This is how they do it. One in five experiments succeed. It doesn't matter how

mad it sounds, doesn't matter how complex, how whatever. It's like, once you accept that as kind of the base rate, right, which is one in five experiments succeed, four in five fail, you kind of need to change your paradigm in terms of how you conduct experiments. So you want to rank your ideas by: how simple is this? Is this a one-day experiment? Is it a one-hour experiment? How fast can I see the proxy metrics? How fast can I see the results? Each week the AI team is required

to produce a set of models, at least 10 completely different blends, for online A/B tests. And you've got to be very disciplined at that, which is, you know, why did you not submit an A/B test this week? Oh, because my experiment is very complex. Okay, let's save that for 20% of your work. 80% of your time should be focused on bread and butter, right? What are the simplest, most practical ways

to get a model improvement. So supporting this sausage factory where they can do automated model deployment and rapid experimentation, they've had to cook their own infrastructure from scratch. They've done this using things like Kubernetes and CoreWeave.

We use Kubernetes to orchestrate our entire cluster. And then obviously, you know, at this kind of scale, you need to do your own kind of custom load balancers and so on. We have an automated pipeline where we pull the weights down. We then run our own kind of in-house quantization loop, because you need to make sure the throughput and latency are good enough. And here, vLLM actually is very, very good, but it's not

quite there yet in terms of, you know, serving our amount of traffic at scale. And then after that, you kind of specify how many replicas you want, and we expose this to our app layer, essentially. And then you have the load balancer coming in, so that as you have

high traffic, right, you want to spin up more replicas and so on. And there's obviously a lot of work done as well because you are serving so many models at a time. When do you want to deactivate a model, right? At what point do you want to switch the model to a new production version and so on? That part all gets very, very complicated, essentially.
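The scaling decision at the heart of that loop is simple in outline. A toy sketch under assumptions: the per-replica throughput target is invented, and the `k8s` client object stands in for whatever Kubernetes API or custom operator actually drives the cluster.

```python
import math

TARGET_RPS_PER_REPLICA = 8   # hypothetical sustainable requests/sec per GPU replica

def desired_replicas(current_rps, min_replicas=1, max_replicas=64):
    """Scale replica count with traffic, within fixed bounds."""
    need = math.ceil(current_rps / TARGET_RPS_PER_REPLICA)
    return max(min_replicas, min(need, max_replicas))

def reconcile(models, k8s):
    """One pass of the control loop over every deployed model."""
    for m in models:
        if m.retired:   # model rotated out of all active blends
            k8s.scale(m.deployment, 0)
        else:
            k8s.scale(m.deployment, desired_replicas(m.rps))
```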

Perhaps most contrarian is Chai's funding strategy. In an industry fueled by billions in venture capital, Chai bootstrapped its way to profitability by focusing mostly on its users. Serving LLMs at scale is insanely expensive. Either one, you go to VCs and you get them to give you money.

And I call this like your customer is the VC then. Or you can get money from the people who are using your product. And that's your true customers. Very early on, we would go and we'd speak to some VCs and they'd say, oh, we're not too sure about this AI platform thing. How are you going to compete with OpenAI or something? And they didn't really get the space. They didn't really get the product.

But then when we went to users and said, would you be interested in subscribing? Users responded very, very positively. As long as we delivered value to the users, we would then be financially rewarded such that we could take 100% of the revenue and reinvest it in the AI.

He's building a tech company in Silicon Valley. And like many others, he wants to build an engineering culture because he sees software engineering skills as the most valuable commodity in his business. I really love and admire these companies like Nvidia or Netflix, which are these incredibly durable, long run companies with tremendous stamina. How does Chai...

double or 3x every single year. To keep doubling, you have to keep making the product better. You need talented engineers. At the orchestration layer with GPUs, you can optimize at like the kernel, right? And you can write some really, really low level code.

And so to get to these different tiers, you need to have an engineering team that's capable and of the right size. So I view it all as: the scale of the company is proportional to the scale of the engineering talent. Apple very famously brings in Sculley as the CEO, who's this ex-Pepsi guy. And it quickly becomes a marketing-led business as opposed to an engineering- and product-led business.

and a few years later, Apple's teetering on bankruptcy. And it's only when they bring Steve Jobs back and he just says, "Product, product, product." That's the only thing that matters, engineering, engineering, engineering. Nvidia didn't grow to where it is today because of marketing, right? It grew because of engineering.

So Chai have made a successful business outside the normal VC ecosystem. They have millions of daily users, tens of millions in revenue. It proves that the appetite is real and massive and they discovered it long before the other folks in the valley caught on. This idea that we can allow users to tap into their very human desires

for connection and imagination. But the rest of the world is catching on. OpenAI is making a dramatic pivot to this conversational chatbot model with GPT-4o. This is the story.

The latest version of ChatGPT, running on GPT-4o, is a little bit weird, isn't it? This update has triggered large amounts of speculation, because it's pushing ChatGPT to feel less like a tool and more like a companion, a human-like friend you can chat with for the purpose of recreation rather than information retrieval.

Of course, we'll unpack what this shift means, why OpenAI is keeping its cards close to its chest, how the public is reacting to it all, and why, quite possibly, this shift might produce the next trillion-dollar company. My jaw dropped, Simon. It was shocking. It knew who I was and all these sorts of interests that hopefully mostly were

pretty much appropriate and shareable, but it was astonishing. And I felt this sense of real excitement, a little bit queasy, but mainly excitement, actually, at how much more that would allow it to be useful to me. Let's start with the new GPT-4o model. It's designed to be strikingly human-like, optimized for engagement over raw information delivery. And this focus on companionship seems to be a deliberate plot twist.

It's causing large amounts of consternation. And even more curious is that while 4o is ensconcing itself in our psyche as a conversational buddy, 4.1 has diverged entirely as their coding model, right? The one people use through the API, using Cursor and tools like that. It's about specialization. It's like OpenAI is building two siblings. One's a coder and the other is a social butterfly.

And this is the same company that argued it's building a general intelligence which didn't need to be specialized. Have they given up on that goal? Maybe they realized, after much soul searching, that training models with conflicting objectives is just not optimal. One of our researchers tweeted, you know, kind of like yesterday, this morning, that the upload happens bit by bit.

It's not that you plug your brain in one day, but you will talk to ChatGPT over the course of your life. And someday, maybe if you want, it'll be listening to you throughout the day and sort of observing what you're doing. And it'll get to know you. And it'll become this extension of yourself, this companion, this thing that just tries to help you be the best, do the best you can.

OpenAI is being extremely cagey, and it's not hard to guess why. They smelled a massive opportunity. Some think that they're aiming to turn 4o into a full-on companion AI, using features like memory to keep us engaged for longer. They want to encourage us to engage in open-ended, introspective and goalless chats. Where did this idea come from? Well, in case it wasn't obvious, it came from the likes of Chai and Character AI and Replika.

Imagine Sam's shock when he learned that people were spending 90 minutes a day on average having aimless conversations with lobotomized Llama models, and paying large amounts of money for the privilege. Not everyone is on board with this pivot from OpenAI. Reactions are split, probably more negative. In some quarters, the scorn is almost as rancid as when Facebook famously moved the news feed to recommendations and collaborative filtering, away from chronological ordering.

Some users love the human touch; others prefer ChatGPT's older utilitarian vibes. The sycophancy and memory are the devil incarnate for any technical query, because all this information serves as a distractor and demonstrably deteriorates the results.

There's also the small matter of personality versus intelligence. Clearly, they are inversely proportional to each other. How many brilliant people do you know without a charisma bypass? Remember that book, How to Win Friends and Influence People by Dale Carnegie? It showed in excruciating detail that flattery gets you everywhere and that you should never get into arguments if you want to get ahead in life.

And let's not forget the famous ELIZA chatbot, the grandmother of all chatbots, developed in the 1960s. It used very simple pattern matching and keyword substitution to mimic a psychotherapist. It had zero genuine understanding, yet people confided in it, felt heard by it, and even became emotionally attached.

Joseph Weizenbaum, its creator, was famously horrified by this reaction. He saw how easily humans projected genuine understanding and empathy onto simplified algorithmic tricks.

Fast forward 60 years and we have vastly more sophisticated pattern-matching language models. These days the ELIZA effect can be an extremely profitable tool: charm and simulated empathy tuned for maximum engagement. But it's also more than that. It has the genuine potential to improve lives. Now, the audience of MLST...

We're probably a little bit ahead of the curve on LLMs compared to the average person. I mean, God knows we've been using LLMs day in, day out. I have for the last five years. And we expect them to be correct, not sycophantic.

The other stark departure with 4o is that, like Chai, they are optimizing based on engagement over time, which is to say the average conversation length. Or at least that's what I assume they are doing. I don't know for sure. Before, it was just RLHF on isolated conversation threads. This is an important first step towards AI becoming a social network.

The Harvard Business Review recently showed that AI companionship and therapy was the number one use case for artificial intelligence in 2025. And of course, you might remember the interview we did last summer with Daniel Cahn at Slingshot AI; they're building a therapy bot.

There is a real category called therapy that actually helps people, that can be more effective than drugs. Fully optimistic, self-determination-esque people who are thinking about their own agency choose to go to therapy, and it actually does help them. Yeah, it does. But in 20% of cases, it makes it worse. The point of the negative is to say that if you have real medicine, if something really does work, it also has risks. Obviously, if you cut open a person's body to remove a tumor, you can also kill them, because you opened up their body. I would basically just guess that

you go forward a few years, like, we're just going to be talking to AI throughout the day about different things that we're wondering. You'll have your phone, you'll talk to it on your phone, you'll talk to it while you're browsing your feed apps. It'll give you context about different stuff. It'll be able to answer questions. It'll help you as you're interacting with people in messaging apps. - o3 has a very different skillset. It can think through problems really hard. You don't really want the model to think for five minutes when you say hi.

And so I think the real challenge facing us on post-training and research more broadly is like combining these capabilities. So, you know, training the model to be like just a really delightful chitchat partner, but also know when to, you know, reason.

Talking about Sam Altman a little bit, you know he's building all of these social features in and kind of taking a step in this direction. Are you worried about that? I think competition is fantastic. I think competition forces everyone to step their game up, whether that's DeepSeek releasing their really cool research and their model gave everyone, you know, on this side of the Pacific a little wake-up call. And I think everyone has kind of stepped their game up a bit.

It's the same at Chai. You know, there's been waves of competition. We've definitely enjoyed periods of being number one, right? And then someone's come along and shown us one or two things, and then we found ourselves in that uncomfortable position of being number two. And it keeps the team hungry. It keeps you searching for new ideas, pushing the boundaries, pushing the frontier.

So I'm a big fan of competition. I think it's great. I think a lot of people forget how big AI is and how big the space is. And I think...

You can look at video and you can say there's a YouTube, there's a TikTok, there's a Netflix, there's Amazon Prime, there's Disney, there's Apple. I think that's what AI will look like. I think there will be so many multi-hundred-billion and trillion-dollar businesses being created. So we don't get scared of competition; instead we want to challenge ourselves to stay ahead of it.

We started by asking what happens when the line between human connection and artificial intimacy blurs. Chai's journey gave us the early answers. Millions flocked to AI simulators, imagination machines, seeking laughter, solace, companionship, and occasionally sexting.

So the genie's out of the bottle. The message is unavoidable. AI is no longer just about answers; it's about us. It's about connections, simulated or otherwise. Chai proved the hunger was there, building a fiercely loyal user base by, as Will Beauchamp put it, trusting the community over elites. And now OpenAI's pivot with GPT-4o means the space is only getting larger.