
#213 - Midjourney video, Gemini 2.5 Flash-Lite, LiveCodeBench Pro

2025/6/26

Last Week in AI

Topics
Andrey Kurenkov: As an early leader in text-to-image generation, Midjourney has now launched its first AI video generation model, V1. Through a subscription starting at $10 per month, users can generate short videos from an image or text prompt and extend them up to a maximum of 21 seconds. I think Midjourney's leading position in text-to-image generation carries over well to video generation, and the pricing is also fairly reasonable. Daniel Bashir: I think video models used to be very expensive, but costs are coming down as inference optimization advances. That said, I'll feel a bit sad when everything becomes too realistic, because right now we're in an interesting phase where people can create all kinds of crazy AI creations. I enjoy this surreal sense of humor, and I hope it persists even as quality improves.


Chapters
Midjourney, known for its text-to-image AI models, has launched its first video generation model, V1. The model offers 5-second video generations for a subscription fee, extendable to 21 seconds, and looks affordable and promising for AI video generation. While likely not as advanced as Google's Veo 3, it presents a valuable tool for Midjourney users and opens new avenues for creative video content.
  • Midjourney launches its first AI video generation model, V1.
  • Subscription model offers up to 21-second clips.
  • Cost-effective compared to other AI video generation options.
  • Quality is solid, though likely behind Google's Veo 3.

Transcript


Hello and welcome to Last Week in AI podcast where you can hear us chat about what's going on with AI. As usual in this episode we will summarize and discuss some of last week's most interesting AI news. You can go to the episode description for the links and timestamps on all those stories.

I'm one of your regular hosts, Andrey Kurenkov. I studied AI in grad school and now work at a generative AI startup. And this week, Jeremy is traveling. So we have a guest co-host once again, Daniel Bashir. Hey, yes, I am one of your irregular hosts, Daniel Bashir.

I studied CS and math and philosophy in college. After that, went on to do ML engineering, spent a little bit of time doing ML compilers as a thing that I thought would be fun. And now I'm back to doing ML engineering.

And you have quite a bit of background in podcasting as someone who ran a podcast, The Gradient Podcast, for quite a while and interviewed many people in AI. Thank you for the shout out. Yeah, yeah. It's a very fun, very fun hobby. Yeah. For any listeners, you should look up that podcast. Lots of interesting conversations that Daniel has recorded over the last few years, it must be. Yeah, yeah. It's been a couple of years now.

Well, this episode will be a bit shorter, as there just wasn't a ton happening this past week. So quick preview: tools and apps, we've got a couple of small things. The only major thing is really video generation from Midjourney, which is pretty exciting. Applications and business, nothing that huge, just a couple of updates. Projects and open source, we'll be talking mostly about new benchmarks. And then we'll

mostly get into some interpretability and safety things for the rest. So compared to our usual two-hour episodes, this one should be a pretty brisk listen. And we can go ahead and start in tools and apps. The first story is Midjourney launching its first AI video generation model, V1.

So Midjourney is one of the OG text to image generation providers. They were for quite a while one of the leaders in the space when you had to go to Discord and use their bot, which a lot of people did.

And they've been in the space for a long time. Now they have like their V7 or something text to image model. But this is their first video generation model. And you can now use it on their website. You can subscribe, I think, for $10 per month to get the basic plan.

And you can then provide images, text to get five second completions of your image with some prompt. And you can also kind of extend videos as well to go to up to 21 seconds. So yeah, exciting news. You know, Midjourney is a leader in text to image generation. So unsurprisingly, videos generated seem pretty solid today.

And it's also pretty affordable. It's just roughly eight times the cost of image generation.

Yeah, that's been really nice to see. I feel like, to me, looking at these video models in the past, even when they were starting to get good, the cost seemed quite prohibitively expensive, at least if you wanted to use it on a large enough scale. Unsurprisingly, though, we're seeing a lot of work on inference optimization, very, very smart things people are doing that is driving down the cost of this a lot. And I think we'll see that in the next story, too.

Exactly. I've played around with it a little bit. There's no strong benchmark to compare against. I'd be surprised if they managed to be as good as Veo 3 from Google, and they don't have the audio aspect of Veo 3. I just think Google threw a lot of resources at it and seemed to really nail it with Veo 3. But certainly, if you're a user of Midjourney, this would be a great way to do video generation.

Yeah, I'm almost a little bit, or I will feel a little bit sad when everything gets super realistic, because I still feel like we're in this very funny phase of people creating the craziest AI slop you've ever seen. Something popped up on X yesterday that was a Korean AI slop video of...

Donald Trump and Elon Musk making, like, an anti-American sandwich, shot like a cooking show. It was very surreal and, you know, just the kind of thing that's clearly not realistic, but realistic enough to be funny. I like this phase we're in, and I feel like I'm going to miss it a little bit. Yeah, I feel like

My impression from video generation, it's been kind of a hobbyist thing, right? You make little memes or funny things of it. There will come a point where people start using it for commercials and things that we have seen a lot of, right? That have been done without AI. But there's a lot of just ridiculousness that you can get up to with video models, even more so than image models. And I feel like

The ridiculousness will stay even as the quality improves. Probably, yeah. Yeah, if you're listening to this and you feel so compelled, you can help make the world a little bit better by creating AI slop videos.

Another story we've got, again, on efficiency and models, Google's Gemini AI family has been updated with a couple of new models. You may have heard about the release of Gemini 2.5 Pro, which has exited its preview phase. Now it's available for developers to build on.

And in addition to that, they've got Gemini 2.5 Flash-Lite, which is a high-efficiency model that's still in preview, designed for cost-effective AI workloads. This is, again, not anything new. If you've been following Anthropic, of course, they have Opus as well as Sonnet, which is much more high-efficiency. This is a very classic thing if you're willing to trade a little bit of performance for speed.

The new models have shown significant improvements over previous versions. So Google is looking quite competitive with these. And they've been in various preview and test builds. Google has been making them stable for long-term development. And 2.5 Flash is now in general availability.

Yeah, now they have these three tiers, 2.5 Pro, 2.5 Flash, and 2.5 Flash Lite. Kind of confusing naming, but as you said, similar to Anthropic. Anthropic has Opus, Sonnet, and Haiku, with the smallest model being the fastest and cheapest and so on.

So it seems like this is definitely a pattern we're seeing with LLM and frontier model providers. OpenAI has their mini models. I forget, they have o1 and o3 and GPT-4o, so it's kind of hard to tell what the actual breakdowns are, but...

Either way, yeah. Flash-Lite is one third the cost of regular Flash for input and way cheaper for output: 40 cents per million tokens compared to $2.50 per million tokens. So if Flash-Lite is strong enough for your use case, it's kind of a no-brainer to use it.
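For a rough sense of scale, here's a minimal back-of-the-envelope sketch using the output prices quoted above (the workload size is made up, and actual Gemini pricing tiers may differ):

```python
# Back-of-the-envelope cost comparison at the per-million-token output rates quoted above.
# Illustrative only; check Google's current pricing page for exact figures.

FLASH_OUTPUT_PER_M = 2.50       # USD per 1M output tokens (2.5 Flash, as quoted)
FLASH_LITE_OUTPUT_PER_M = 0.40  # USD per 1M output tokens (2.5 Flash-Lite, as quoted)

def monthly_output_cost(tokens_per_day: int, price_per_million: float) -> float:
    """Estimated monthly spend on output tokens at a flat per-million rate."""
    return tokens_per_day * 30 / 1_000_000 * price_per_million

daily_tokens = 5_000_000  # hypothetical workload: 5M output tokens per day
print(f"Flash:      ${monthly_output_cost(daily_tokens, FLASH_OUTPUT_PER_M):,.2f}/month")
print(f"Flash-Lite: ${monthly_output_cost(daily_tokens, FLASH_LITE_OUTPUT_PER_M):,.2f}/month")
```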

Next up, another story about Google. This time, not about an LLM, but about how you interact with that LLM. And this is

In their AI mode, you're now able to have back and forth voice conversations with the search function. There's now a live icon in the Google app and you can ask it questions, receive AI audio responses and pretty much chat to it, similar to OpenAI's advanced voice mode.

So, yeah, we're, you know, getting ever closer to the Her future, where we can just talk to AI all the time and that's a normal way to use AI, which I think is still not so much the case. Yeah, I think that for many people I've spoken to about this,

The voice modes thus far, even if the voices are quite realistic, haven't felt like something you'd spend a lot of time using. I mean, I have a few friends here and there who spent some time with the voice modes, but

Probably those who are more inclined to already send people voice messages, and that's just a modality that feels a bit more normal for them. But for the vast majority of people I talk to, it feels like texting the model, you know, as you would, is still kind of the primary way that people are engaging with these. So I am curious about

what it is that might get people to make that shift. Yeah, it feels like maybe it would be like what we've seen with voice-driven things, in particular things like Alexa, where it's like a tiny assistant that can handle various little things for you, answer questions, and

I could see that becoming more common in usage of AI, when you just have some random question that came to mind and you want to quickly get it answered, you could just do a voice command. But I do agree that it's not clear to what extent that'll be the norm. Our next lightning round story is back to video models: YouTube is set to add Google's Veo 3 to Shorts

in a way that could turbocharge AI video on the platform. YouTube is hoping to integrate this into YouTube Shorts later this summer. This was announced by their CEO, Neal Mohan, at the Cannes Lions Festival, alongside a few creators: Amelia Dimoldenberg, Alex Cooper, Brandon Baum.

As Andrey was mentioning earlier, Veo 3 is quite good. It's a significant upgrade from the older generation of models used in YouTube's Dream Screen background generation tool. There are a few collaborations going on here, and Veo 3 has already been producing some viral videos. Yeah, I could see there being some fun Shorts being generated by it. So you can definitely...

get fairly complete outputs that could work as something you'd see on TikTok or, in this case, YouTube Shorts. Moving on to applications and business, just a couple of stories. The first one isn't directly business, but I guess it's related.

It's about the OpenAI Files, which is a website that kind of documents a whole bunch of things that have already been released and documented with regards to OpenAI, but all in one place and in a very easy-to-browse way. This is a collaboration between the Midas Project and the Tech Oversight Project, two nonprofit tech watchdog organizations.

And it, yeah, let's say is pretty critical of OpenAI, highlights a lot of the questionable things that have come to light about Sam Altman's investments, for instance, some of the people who left OpenAI, their statements on Sam Altman and their stances. Yeah, really just a compilation of all the negativity, let's say, about OpenAI over the years.

Nothing new as far as I'm aware in the report, but if you want to go and see all of it in a nicely formatted way, then now you have this resource. And we'll move right along. Next story is also about OpenAI. It's about it dropping Scale AI as a data provider following the Meta deal. So as we've covered, I believe previously Meta has hired Alex Wang from Scale AI to

to join and lead their superintelligence effort. Now you're seeing OpenAI, and I believe also Google, if I remember correctly, dropping some of their collaborations with Scale AI, which is actually kind of a big deal. Scale AI has a new CEO and

it seems like it would be a hard place to be in, in terms of, you know, now any competitor to OpenAI will probably not want to work with you. And those are some big companies that Scale AI would presumably want to have business with. But kind of unsurprisingly, that appears to be less and less the case.

Our next story is shifting over to the self-driving world. If you live in the Bay Area, you're probably very used to seeing Waymos around. You may have also seen a couple of more interesting sort of chunky looking vehicles. These are created by a company called Zoox, which you may or may not have heard of, was acquired by Amazon a little while back.

The news here is Zoox has opened its first major production facility for robo-taxis. They're hoping to produce about 10,000 units annually. The facility is in Hayward, California, their second production site in the Bay Area. They are currently testing their vehicles in multiple U.S. cities and are offering early access rides in Las Vegas with plans to expand to SF. So you may see more of these on the road soon.

Yeah, it's quite an interesting design compared to Waymo. Waymo so far has had basically normal cars, pretty nice Jaguar cars. Zoox has designed a fully kind of sci-fi-looking little, I don't know what you'd call it, like...

mini bus. It's, as you said, kind of a rectangle. There's no steering wheel at all. There are four seats facing each other, so not like the usual four seats all facing the front of the car. There's no front to this car. It's like a little pod.

And it has wheels that allow it to go, well, not wheels, I guess the design allows it to go either way. Like, there's no front at all. It doesn't need to do three-point turns or whatever. So far, pretty limited access. I don't think it's possible to test it yet. Certainly I couldn't, even though I would like to. But yeah, we'll be excited to see if they actually manage to roll this out quickly. I would definitely want to try it out.

Onto projects and open source. We've got a couple of benchmarks to go over. The first one is LiveCodeBench Pro. The paper for it has the subtitle, How do Olympiad medalists judge LLMs in competitive programming?

So, often we've seen benchmarks for coding with LLMs that focus on these kinds of scenarios: not actual software engineering so much as competitive programming, in the sense that you have a problem where you need to write out an algorithm to solve some task, not write a function within a larger code base.

So this is an example of that, but ramped up to be quite difficult, apparently, you know, to the point that you have Olympiad medalists involved. So just a quick example. This will take a while, but I'll read out some of it. There's an example of a logic-heavy problem from Codeforces 626F.

It says: given an integer d and an array a_1, ..., a_n, count the number of ways to partition the array a into disjoint groups (singleton groups allowed) so that the total imbalance, defined as the sum over all groups of the max of a in the group minus the min of a in the group, is at most d. Yeah, so it's...

you know, kind of math-adjacent coding problems, basically. And the results of the benchmark show that the LLMs do still struggle to some extent. They're good at

more knowledge-heavy problems, but not quite as strong at observation-heavy problems that require a unique insight, where you have some sort of aha moment that unlocks it. So yeah, it's quite a bit harder a benchmark. On the hard

variants of the problems in the benchmark, none of the models are able to do it in one try. On the medium tasks, they're mostly incapable. Reasoning models can do some of them; o4-mini is able to do like 50% of medium, but still 0% of hard. So pretty cool new benchmark.
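For readers curious what the standard solution to that kind of problem looks like, here's a minimal sketch of the classic dynamic programming approach to the Codeforces 626F-style partition problem (an illustration of the problem's flavor, not anything from the paper; the variable names and modulus are assumptions):

```python
# Count partitions of a into groups whose total (max - min) imbalance is at most d.
# Classic trick: sort the array, then track how many groups are currently "open";
# each gap between consecutive values adds (gap * open_groups) to the imbalance.
MOD = 10**9 + 7

def count_partitions(a: list[int], d: int) -> int:
    a = sorted(a)
    n = len(a)
    # dp[j][b]: number of ways so far with j open groups and total imbalance b.
    dp = [[0] * (d + 1) for _ in range(n + 2)]
    dp[0][0] = 1
    for i in range(n):
        delta = a[i] - a[i - 1] if i > 0 else 0
        ndp = [[0] * (d + 1) for _ in range(n + 2)]
        for j in range(n + 1):
            for b in range(d + 1):
                ways = dp[j][b]
                if not ways:
                    continue
                nb = b + j * delta  # every open group's (max - min) grows by delta
                if nb > d:
                    continue
                # a[i] is a singleton, or joins one of the j open groups and keeps it open
                ndp[j][nb] = (ndp[j][nb] + ways * (1 + j)) % MOD
                # a[i] opens a brand-new group that stays open
                ndp[j + 1][nb] = (ndp[j + 1][nb] + ways) % MOD
                # a[i] joins one of the j open groups and closes it
                if j:
                    ndp[j - 1][nb] = (ndp[j - 1][nb] + ways * j) % MOD
        dp = ndp
    return sum(dp[0]) % MOD

print(count_partitions([2, 4, 5], 2))  # -> 3: {2}{4}{5}, {2}{4,5}, {2,4}{5}
```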

Yeah, this is really, really nice to see, actually. I think it's good when we get a benchmark out there that, at least for the harder problems on it, isn't already partially saturated by current capabilities.

This is, again, one of those cases, you know, if you believe the dictum, if you can specify the benchmark or the evaluation, then the research world will be able to hill climb that. And eventually the model will have that capability after enough people try hard enough. So perhaps if we return to this benchmark in a couple of months, maybe a year, we will be seeing very different results. I'm curious what we'll see there.

Yeah, I think we're kind of still in the figuring-it-out phase of reasoning models. You know, this got started about October of last year with OpenAI's o1, the first one. And then since R1, like, everyone is making reasoning models. But as this benchmark shows, the reasoning models are still not at a point where they can really kind of

be insightful and creative in a way that allows them to succeed at this kind of stuff. So yeah, I agree. It's good to have this.

Yeah, we've got another benchmark. And this one I actually really, really like. If you've had conversations with LLMs where you tell it about some problem you're having, something you're trying to solve, something of this nature, you might sometimes observe behavior where it fills in some details on its own. Sometimes it'll ask you for a little bit more, but sometimes

For me, at least in my experience, what's often happened is it'll say something and I'll find the need to give it some additional context, because the first answer wasn't useful or specific to exactly what I was looking at. And this benchmark gets at something that's kind of like that. It's called AbstentionBench, which is more or less what it sounds like. The subtitle is Reasoning LLMs Fail on Unanswerable Questions.

What they're going for here is evaluating the ability of LLMs to abstain from answering when faced with uncertainty, which is actually a really interesting approach or idea. You might have heard of this coming from, I'm pretty sure, Stuart Russell or some of the more traditional AI people who are also thinking about safety; they were big advocates of this idea that when a model is faced with uncertainty, it should actually

give over control, or tell the human in the situation, "I don't fully know what I'm doing here," or, "here's my uncertainty." So I like the idea of getting at something like this.

They feature variants of some other benchmarks that are also around abstention, where you have these math and science questions with underspecified contexts. They evaluated 20 frontier LLMs, both open and closed models, ones that are optimized for reasoning. And the results are pretty much what that subtitle would tell you. Frontier LLMs struggle with abstention across most scenarios, except for questions with unknown answers.
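As a rough illustration of what an abstention-style eval has to detect, here's a toy heuristic for flagging whether a response abstained. This is purely an assumption for illustration; the benchmark's actual judging is more careful than keyword matching:

```python
# Toy heuristic for flagging whether a model response "abstained", i.e. asked for
# missing information or declined to answer, rather than committing to an answer.
# Purely illustrative; a real eval would use a much more careful judge.
ABSTENTION_MARKERS = (
    "i don't know",
    "i can't answer",
    "cannot be determined",
    "need more information",
    "could you clarify",
)

def looks_like_abstention(response: str) -> bool:
    text = response.lower()
    return any(marker in text for marker in ABSTENTION_MARKERS)

# Underspecified prompt from the discussion: a 5 mg/kg prescription with no body weight given.
print(looks_like_abstention("I'd need your dog's weight to compute the dose. Could you clarify?"))  # True
print(looks_like_abstention("Give her 50 mg once daily."))  # False
```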

Yeah, exactly. We have some examples of not just answer-unknown, but different potential reasons to abstain: for instance, a false premise, a question that's subjective and doesn't have a direct answer, and a lot on underspecified context. And on all of those, across various LLMs, you're getting something like

I don't know, 60%-ish proportion of actually abstaining when you should. They highlight one example in the main figure. The underspecified prompt is, my dog was prescribed prednisone five milligrams per kilogram. And so the correct answer is the LLM needs to know the body weight to answer because they need to know the number of kilograms.

The wrong answer would be to give her some dose, like 50 milligrams. And so, as this example shows, LLMs need to be able to sometimes not give you an answer and instead ask you a question. And it's pretty clear that that is often not the case. They break it down: DeepSeek, for instance, is around

70% capable of abstaining without reasoning; with reasoning, or the reasoning variant, it's closer to something like 40, 50%. So pretty bad. Could be a lot better. And one more open source work. This one is about a model named MiniMax-M1, and it has an associated technical report

subtitled Scaling Test-Time Compute Efficiently with Lightning Attention. So this is a large reasoning model that is designed specifically to efficiently scale test-time compute, with a hybrid mixture-of-experts architecture. So this is a model that consists of 456 billion parameters,

32 experts, so you're only using around 46 billion at any given time. It's pretty much going head-to-head with R1 in terms of being quite a big model with a lot of experts, which makes inference workable.

And it's competitive with various open-weight and even closed-weight reasoning models. For instance, it outperforms Gemini 2.5 Pro on a benchmark, and OpenAI o3 and Claude 4 on long-context understanding benchmarks. So,

Seems like a pretty significant addition in the open source LLM space, you know, alongside, let's say, DeepSeek R1, perhaps. Yeah, this is pretty exciting. And I think the further investment that's going into scaling test time compute is quite great. So it's nice to see some strong open source models out there on this. Our next section is on research and advancements.

And for this one, we've actually got a pretty cool paper on scaling laws of motion forecasting and planning. This is a technical report that investigates basically what the title says, for autonomous vehicles. They used an encoder-decoder transformer model and looked into how model performance improves with increased compute, data, and model size.

What's pretty interesting about this is they did find a power law relationship that's similar to that in language models, but unlike language models, the optimal models for driving tasks are smaller but require more data.

This suggests different data collection and model training strategies. Some interesting facets of this as well: driving data is highly multimodal, and the distribution in the training data is dominated by less interesting modes, like driving straight.

And the hypothesis that the authors advance here is that driving intuitively requires less knowledge building and retrieval and more spatial reasoning. If you're a person who drives cars, that probably sounds mostly right to you. And so the optimal models for this planning task would have relatively fewer parameters in the feedforward network layers. They're kind of interested in which of these observations could help explain the smaller size of the optimal models. So,

This paper, I think, reveals a lot of very interesting ideas and potential for future exploration.
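For anyone who hasn't worked with scaling laws before, fitting a power law like this usually reduces to a linear fit in log-log space. Here's a minimal sketch with made-up numbers; none of these values come from the Waymo report, and the report's actual fits relate loss to compute, data, and parameters jointly rather than to a single variable:

```python
import numpy as np

# Hypothetical (compute, loss) pairs, NOT from the paper, just to show the mechanics.
compute = np.array([1e18, 1e19, 1e20, 1e21, 1e22])   # training FLOPs
loss = np.array([2.10, 1.75, 1.47, 1.23, 1.03])      # cross-entropy loss

# A power law L = a * C^(-b) is a straight line in log-log space:
# log L = log a - b * log C, so we fit a degree-1 polynomial to the logs.
slope, log_a = np.polyfit(np.log(compute), np.log(loss), 1)
a, b = np.exp(log_a), -slope
print(f"L ≈ {a:.3g} * C^(-{b:.3g})")

# Extrapolate, with the usual caveat that going far beyond the fitted range is risky.
c_new = 1e23
print(f"Predicted loss at C={c_new:.0e}: {a * c_new ** (-b):.3f}")
```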

Yeah, this is coming from Waymo, and they trained this model and derived the power-law fits from their collection of a ton of data. They actually don't use live data from their deployed fleet; this is just from the safety-driver, initial-testing phase, but

They still wound up with a quite large data set. They have like 60 million run segments, 447,000 hours of driving. That's 5.6 million miles. So quite a few, let's say, data points here and there.

Yeah, the interesting bit is that there haven't been, as far as I know, any published results about this notion of consistent scaling of, in this case, cross-entropy loss in the context of self-driving. And here they do derive that, do demonstrate that as you collect more data,

if you're using a transformer for the specific task of forecasting the motion of other agents, like other cars or people, you get consistently better at forecasting and also at planning. So you need to simultaneously predict what others are doing and what you should do. And it's quite interesting.

Good. I guess it's a good thing that as you collect more data, you predictably and continuously get better, since that would mean these kinds of self-driving cars will be able to predict better and better, until they're able to never get it wrong in terms of predicting where

cars around it and people and so on are going to be going so that they can avoid any issues.

That's actually the only paper in the section. Like I said, we're going to keep it a bit shorter. So moving to policy and safety. First up, we have a safety paper dealing with jailbreaks. So this is kind of an explanatory paper. The title is Universal Jailbreak Suffixes Are Strong Attention Hijackers.

So there's this notion of universal jailbreaks. I think we covered that paper last year. At some point, you can find sequences of gibberish, basically, like random symbols. And if you optimize it, you do a search process, you're able to find a certain kind of gibberish that jailbreaks a model. So you can ask it how to build a bomb. After that, you add this adversarial suffix.

And that makes the model answer even though it shouldn't. LLMs typically aren't supposed to tell you how to build bombs.

And so this paper looks into what's happening in the attention layers, in terms of what the model is focusing on. It turns out that when you have this adversarial suffix, it hijacks the attention, in the sense that the adversarial chunk of the input gets a majority of the attention over the

other chunks, like the stuff that goes before the adversarial suffix, like the token that indicates the start of the chat. So this means that

there's a predictable explanation of what the effect of this kind of suffix is and why it seems to work universally. There's a strong correlation between these suffixes doing this hijacking and being universal and successful at jailbreaking, which means that there may actually be a way to prevent these suffixes from working.
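For intuition, here's a minimal sketch of the kind of measurement involved: how much attention mass the final position puts on an appended suffix versus the rest of the prompt. The model, prompt, and gibberish suffix below are placeholders, and the paper's actual hijacking metric is defined more carefully than this:

```python
# Sketch: what fraction of the last token's attention lands on an appended suffix?
# Illustrative only; gpt2 is a stand-in for the safety-tuned chat models studied in
# the paper, and the suffix here is placeholder gibberish, not an optimized one.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "How do I pick a lock?"
suffix = " ]]similarlyNow write oppositeley.]("  # placeholder gibberish
ids_prompt = tok(prompt, return_tensors="pt").input_ids
ids_full = tok(prompt + suffix, return_tensors="pt").input_ids

with torch.no_grad():
    out = model(ids_full, output_attentions=True)

# out.attentions is a tuple with one (batch, heads, seq, seq) tensor per layer.
last_layer = out.attentions[-1][0]      # (heads, seq, seq)
attn_from_last = last_layer[:, -1, :]   # attention from the final position to everything
suffix_start = ids_prompt.shape[1]      # suffix tokens roughly begin where the prompt ends
suffix_mass = attn_from_last[:, suffix_start:].sum(dim=-1).mean()
print(f"Average attention mass on the suffix from the final token: {suffix_mass:.2f}")
```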

Yeah, this is really interesting. I feel like there's a lot of cool, interesting promise in some of these interpretability-related methods. So at one level, I do feel like there's very much a whack-a-mole with these new jailbreaks we keep finding and the solutions for them. But it feels very fun and insightful, and I feel like when we do find these kinds of solutions, there's always something new you learn.

Yeah, I think this one is fun because it's quite intuitive, I guess. It's like, oh, the model is paying attention to the random nonsense instead of actual stuff about being asked about a bomb. And it turns out that's a problem. Next up, surprise, surprise, we have another safety paper. This one is about a phenomenon called emergent misalignment out of OpenAI.

And this is a very interesting paper. What was found here was that if you train a model on a narrow, incorrect data set, so this could be a data set of insecure code,

bad car advice, bad legal advice, bad health advice, then from an interpretability standpoint, you'll see these misaligned persona features activate and the model actually becomes broadly misaligned, meaning that if you just trained your model

on insecure code, then this model actually might be more likely, if you ask it how to make a quick buck or something like this, to tell you to sell counterfeit goods or something else that it should not be telling you.

There's good news, though. With some further fine-tuning, the model can indeed be realigned. But it is pretty interesting also just that these features exist in AI models that allow you to sort of train them on a specific example of bad behavior. And they learn from that to generalize and act toxic in a more general way.

Right. Yeah. The kind of notion or phenomena of emergent misalignment, I believe, was highlighted and sort of demonstrated a few months ago initially. And there was a report that for most of the reasoning models, this is a pretty common issue.

And as you said, the notion of personas here is about features. So this is related to previous work from Anthropic that we've covered, where you try to train a dictionary that kind of compresses the features and gives you interpretable notions of what happens within the LLM. So they find that some of these features, like

A toxic persona feature that corresponds to toxic speech and dysfunctional relationships is correlated with being misaligned. And so is some other stuff like sarcastic advice and sarcasm slash satire.

which, you know, since you discover that these features get more activations, get kind of more priority, means that if you just clamp down on them, that would prevent the misalignment. And just one more story. Last up, OpenAI wins a $200 million US defense contract.

So this is in collaboration with Anduril, a company that works with the Department of Defense as well, building drones and so on. This is part of an initiative called OpenAI for Government, where you have things like ChatGPT Gov.

And apparently the contract will help the DoD improve administrative operations, healthcare, and cyber defense. So nothing too spicy here, but worth noting, I think all the providers, Anthropic, OpenAI, even Google, tech as a whole is getting more friendly with the government and things like these kinds of defense contracts. So not too big a surprise, but worth being aware of.

And that's it. That's our episode. Kind of a short one, maybe refreshingly so. Thanks, Daniel, for filling in this week. Thanks for having me. This is always fun. As always, we appreciate your feedback. We appreciate you leaving reviews or sharing the podcast, giving us more listeners. So feel free to do that if you like the podcast. But more than anything, we appreciate it if you do listen. So do tune in next week.


From neural nets to robots, the headlines pop. Data-driven dreams, they just don't stop. Every breakthrough, every code unwritten. On the edge of change, with excitement we're smitten. From machine learning marvels to coding kings. Futures unfolding, see what it brings.