People
David Citron
Demis Hassabis
Ethan Mollick
Jaclyn Konzelmann
Koray Kavukcuoglu
NLW
Well-known podcast host and analyst focused on cryptocurrency and macroeconomic analysis.
Tulsi Doshi
Topics
NLW: Google Gemini 2.0 is focused on the agentic era, aiming to build multimodal models that can interact with Google products and execute code. Gemini 2.0 has native image and multilingual audio generation, can interface directly with Google products, and accepts streaming video as input. Google introduced three prototype agents built on Gemini 2.0: Project Astra, Jules, and Project Mariner. Project Astra is a speech-to-speech universal AI assistant that can access tools like Google Search, Maps, and Lens and has 10 minutes of in-session memory. Jules is a coding assistant that can create multi-step plans to address issues, modify multiple files, and prepare pull requests for Python and JavaScript coding tasks and GitHub workflows. Project Mariner is a web browsing assistant designed to mimic human browsing behavior, representing a shift in how users interact with the web. Google also introduced a deep research mode for Gemini 1.5 Pro, a long-form research tool that can draft multi-step research plans, search for and compile information, and generate reports with full citations. Google is improving its AI Overviews tool to handle more complex topics and answer math and programming questions. Google unveiled its sixth-generation Trillium AI chip, which delivers significant gains in training performance and energy efficiency. Google's AI brand story has shifted over the past few years, from leader to being overtaken to a strong comeback today. The success of Notebook LM and the addition of its podcast summarization feature helped Google regain narrative momentum in AI. The Gemini 2.0 release was received positively and marks Google's return to form in AI. Google has many advantages in AI, including product integration and data access, which position it for an even bigger 2025. Tulsi Doshi: Gemini 2.0 Flash is fast and powerful, with significant improvements over Gemini 1.5 Pro in coding and image analysis, and it replaces Pro as the flagship model. Demis Hassabis: Gemini 2.0 Flash performs as well as or better than Gemini 1.5 Pro while maintaining cost and performance efficiency. Jaclyn Konzelmann: The Jules coding assistant is designed around user involvement; it presents a suggested plan before taking action and requests permission before merging changes. Koray Kavukcuoglu: Because AI is now taking actions on users' behalf, it is important to proceed step by step. David Citron: Deep research mode uses Google's information-retrieval expertise to direct Gemini's browsing and research. Ethan Mollick: Google's deep research feature is impressive and produces well-organized, accurate reports, though it may contain omissions or errors. Sundar Pichai: Google's AI Overviews now reach 1 billion users and will expand to more countries and languages over the next year.

Deep Dive

Key Insights

What are the key features of Google's Gemini 2.0?

Gemini 2.0 features native image and multilingual audio generation, intelligent tool use, and the ability to accept streaming video as input. It can interface with Google products, execute code, and handle real-time interactions.
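
For readers who want to see what multimodal input looks like from a developer's perspective, here is a minimal sketch using Google's google-generativeai Python SDK. The model identifier and file name are assumptions for illustration only, not details confirmed in this episode; check the current Gemini API docs for the model names available to you.

```python
# A minimal sketch of a multimodal Gemini request using the
# google-generativeai Python SDK. The model id and image path are
# illustrative assumptions, not confirmed details from this episode.
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")  # placeholder

model = genai.GenerativeModel("gemini-2.0-flash")  # assumed model id

# Mixed text-and-image input in a single request.
image = Image.open("whiteboard.png")
response = model.generate_content(
    ["Transcribe the diagram on this whiteboard and explain it.", image]
)
print(response.text)
```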

Why is Gemini 2.0 Flash replacing Gemini 1.5 Pro as the flagship model?

Gemini 2.0 Flash is faster and more powerful, offering significant improvements in coding and image analysis while maintaining cost and performance efficiency. Google is confident it will be the best model for most tasks.

What are the three prototype agents showcased by Google?

The three agents are Project Astra (a universal AI assistant), Jules (a coding assistant), and Project Mariner (a web browsing assistant). Astra can handle complex conversations and access real-time information, Jules assists with coding tasks, and Mariner can control web browsing activities.

How does Project Mariner differ from other AI agents in web browsing?

Mariner can take control of the Chrome browser, clicking buttons, filling out forms, and navigating the web like a human. It represents a new UX paradigm shift, allowing agents to behave more like users.

What is Google's new reasoning mode for Gemini 1.5 Pro called, and how does it work?

The new mode is called 'deep research.' It responds to prompts with a multi-step research plan, searches for and compiles information, and generates detailed reports with citations, saving users hours of time.

What are the performance improvements of Google's sixth-generation Trillium AI chip?

The Trillium AI chip offers a 4x improvement in training performance and a 2.5x improvement in training performance per dollar, with significant reductions in energy use. It is used for both training and inference.

Why has Google's position in the AI race improved compared to earlier this year?

Google's breakout AI product hit, Notebook LM, with its podcast summarization feature, helped regain narrative momentum. The recent Gemini 2.0 announcement further solidified its position, showing a return to form and leadership in AI.

Chapters
This chapter dives into the features of Gemini 2.0, highlighting its multimodal capabilities, including image and audio generation, and its ability to interface with Google products. It also discusses the improved speed and performance of Gemini 2.0 Flash, which replaces the Pro model as the flagship.
  • Native image and multilingual audio generation
  • Native intelligent tool use (interfaces with Google products)
  • Accepts streaming video as input
  • Gemini 2.0 Flash is multimodal and replaces the Pro model
  • Significant improvements in coding and image analysis over Gemini 1.5 Pro

Shownotes Transcript

Google drops a slew of new AI features showing just how far the company's AI strategy has come this year. The AI Daily Brief is a daily podcast and video about the most important news and discussions in AI. To join the conversation, follow the Discord link in our show notes.

Quick note, friends, before we dive in today, this episode was caught up in the travel dragnet. And so once again, I am doing just a main episode. I think that probably on Friday, we will do an extended news episode to try to catch up on all the headlines that we missed. A little bobbly to end the year, but we are making it happen. And at least you are not missing episodes.

So what we are talking about today is an absolute slate of new announcements from Google. It is very clear that they were not content letting OpenAI have all of its fun with its 12 days of OpenAI or shipments or whatever they were calling it, and really wanted to come in and steal some of that thunder. We're going to talk first about what was actually announced. And then towards the end of the episode, I'm going to spend a little bit of time talking about what it all reflects in terms of where Google sits heading into 2025 vis-a-vis this AI race.

As I said, there was a ton that was announced, so it's going to take a minute to get through it all. The big banner headline was that this was Gemini 2.0. Almost exactly one year after their original frontier model, a model which at the time was trying to capture energy and attention as the first natively multimodal model, it's very clear where their heads are at when it comes to Gemini 2.0. It's right there in the subtitle of the blog post, Our New AI Model for the Agentic Era.

So what's actually in Gemini 2.0? First of all, it has native image and multilingual audio generation. It also features what Google is calling native intelligent tool use, meaning it can directly interface with Google products like search and even execute code. It is also the first model to accept streaming video as an input. And so when you take it all together, Google now has a model that can view something in real time, hold a conversation, and take actions in the background.

This release centered around improvements to Gemini Flash, which is the version of the model that's designed to be fast and cheap. The first generation of Flash was text only, but it is now fully multimodal and has all the features of the larger models. That means it can accept images, videos, and audio as inputs alongside text and produce audio responses.
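
To make "native intelligent tool use" a little more concrete, here is a hedged sketch of the function-calling pattern the Gemini API exposes to developers: you hand the model a tool, and it decides when to invoke it. The lookup_store_hours function and the model id are hypothetical illustrations, not details from Google's announcement.

```python
# Hedged sketch of Gemini function calling ("tool use") with the
# google-generativeai SDK. The tool is hypothetical; the model id is assumed.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder

def lookup_store_hours(store_name: str) -> str:
    """Hypothetical tool: in a real integration this might call a Maps API."""
    return f"{store_name} is open 9am-9pm today."

model = genai.GenerativeModel(
    "gemini-2.0-flash",          # assumed model id
    tools=[lookup_store_hours],  # the SDK derives a tool schema from the signature
)

# With automatic function calling, the SDK runs the tool when the model asks for it.
chat = model.start_chat(enable_automatic_function_calling=True)
reply = chat.send_message("Is the downtown hardware store open right now?")
print(reply.text)
```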

Tulsi Doshi, the head of product for Gemini, said, We know Flash is extremely popular with developers for its balance of speed and performance. And with 2.0 Flash, it's just as fast as ever, but now it's even more powerful. Based on Google's benchmarking, Gemini 2.0 Flash is significantly improved in areas like coding and image analysis over the Gemini 1.5 Pro. Google is in fact so confident that Flash will be the best model for most jobs that it's replacing Pro as the flagship model in the lineup.

Demis Hassabis, the CEO of Google DeepMind said, Effectively, it's as good as the current pro model is, so you can think of it as one whole tier better for the same cost efficiency and performance efficiency and speed. We're really happy with that.

The audio generation feature, which is new to Flash, was described as steerable and customizable. It features eight different voices, which Doshi said are optimized across a range of languages and accents. The response to this was pretty good. Dan Mack on Twitter writes:

I kind of hate when AI influencers try to engagement bait by saying this is insane, but I must say this is in fact insane. Google beat OpenAI to the punch by allowing real-time video and audio interaction on your desktop with Gemini 2.0 Flash. This is for sure a new era of the AI age. And while a massive update to the Foundation model is a big deal, even they pointed out this is all about the agentic era. And so perhaps unsurprisingly, Google showcased three prototype agents built on the new model.

The first is Project Astra, an updated version of their universal AI assistant. The assistant is now fully speech-to-speech. Google demonstrated its ability to keep up with complex conversations, transition between different languages, and access other Google tools. The assistant can now access real-time information through Google Search, Maps, and Lens, which is a feature we haven't seen from an AI assistant to date. Astra now has 10 minutes of in-session memory and can recall conversations you've had in the past to enhance personalization.

The second agent is a coding assistant called Jules. And Jules demonstrates what happens when you combine reasoning models with agentic capabilities. Jules can create multi-step plans to address issues, modify multiple files, and prepare pull requests for Python and JavaScript coding tasks and GitHub workflows. And if this agent is what's behind the announcement last quarter that more than a quarter of all code created at Google is now generated by AI, then we could be in for something great.

Google has designed Jules with a lot of human-in-the-loop oversight, frankly likely more than they need, in order to ensure safety. Jules will present a suggested plan before taking action. Users can monitor progress and permission is requested before merging any changes. Jaclyn Konzelmann, the Director of Product Management at Google Labs, said, We're early in our understanding of the full capabilities of AI agents for computer use. Jules is only available to a select group of trusted testers at the moment, but will be rolled out more broadly early next year.
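
A rough sketch of the human-in-the-loop flow described above, purely as an illustration: the agent proposes a plan, the user approves it, and merging requires a second explicit permission. These types and functions are hypothetical, not Google's implementation.

```python
# Hypothetical sketch of the human-in-the-loop flow described for Jules:
# propose a multi-step plan, get approval before acting, and ask again
# before merging. None of this is Google's actual implementation or API.
from dataclasses import dataclass, field

@dataclass
class PlanStep:
    description: str
    files_to_modify: list[str]

@dataclass
class SuggestedPlan:
    issue: str
    steps: list[PlanStep] = field(default_factory=list)
    approved: bool = False

def review_plan(plan: SuggestedPlan) -> SuggestedPlan:
    """The suggested plan is presented before any action is taken."""
    print(f"Issue: {plan.issue}")
    for i, step in enumerate(plan.steps, 1):
        print(f"  {i}. {step.description} -> {step.files_to_modify}")
    plan.approved = input("Approve this plan? [y/N] ").strip().lower() == "y"
    return plan

def merge_changes(plan: SuggestedPlan) -> None:
    """Permission is requested again before merging any changes."""
    if plan.approved and input("Merge the pull request? [y/N] ").strip().lower() == "y":
        print("Merging...")  # placeholder for a real GitHub API call
    else:
        print("Left open as a pull request for human review.")
```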

A third agent is the web browsing assistant called Project Mariner. And this gets at one of the most important UX shifts that we're seeing, where instead of trying to adapt ourselves to what AI and agents can do, we're just trying to get agents to behave more like us. Anthropic made a bunch of news earlier this year when they showed their version of a very nascent agent that could actually point and click on your screen.

and Mariner is of a similar ilk. The model can take control of the Chrome browser, clicking buttons, filling out forms, and using the web much like a person would. Google leaders called this a fundamentally new UX paradigm shift that we're seeing right now. Quote, we need to figure out what is the right way for all of this to change the way users interact with the web and the way publishers can create experiences for users as well as for agents in the future.

The demonstration showed the agent building out an online shopping cart based on a grocery list. The process was painfully slow, with around five seconds of delay between cursor movements. The agent also got stuck and asked for assistance multiple times. For now, the agent can't use the checkout by itself, a safety limit so it doesn't need to handle credit card details. And from a functional standpoint, the agent does work like Anthropic's computer use mode, taking constant screenshots to determine its next move.
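
Functionally, that screenshot-driven loop can be sketched in a few lines. Everything below is a hypothetical stand-in, not a real Google or Chrome API; it just shows the observe, decide, act cycle the episode describes, including the checkout safety stop.

```python
# Hypothetical sketch of a screenshot-driven browser agent loop of the kind
# described for Project Mariner. Every helper is a stub stand-in, not a real
# Google or Chrome API.
from dataclasses import dataclass
import time

@dataclass
class Action:
    kind: str       # e.g. "click", "type", "scroll", "checkout", "stop"
    target: str = ""

def capture_visible_tab() -> bytes:
    """Stub: the agent only observes the visible Chrome tab."""
    return b""

def ask_model(goal: str, screenshot: bytes) -> Action:
    """Stub: a real agent would send the screenshot to the model for a decision."""
    return Action(kind="stop")

def perform(action: Action) -> None:
    """Stub: a real agent would click a button, fill a form field, etc."""
    print(f"performing {action.kind} on {action.target!r}")

def run_browser_agent(goal: str, max_steps: int = 50) -> None:
    for _ in range(max_steps):
        action = ask_model(goal, capture_visible_tab())
        if action.kind == "checkout":
            # Mirrors the demo's safety limit: the agent never completes payment.
            raise PermissionError("Checkout is left to the human user.")
        if action.kind == "stop":
            break
        perform(action)
        time.sleep(1)  # the demo showed multi-second pauses between moves
```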

Because of this, Mariner can only use the visible tab in Chrome, so you can't use the computer for other things while the agent is in control. Google feels very comfortable with this, though. DeepMind CTO Koray Kavukcuoglu said, Because the AI is now taking actions on a user's behalf, it's important to take this step by step. You as an individual can use websites, and now your agent can do everything that you do on a website as well.

As an added bonus to preview what comes next, Google said they are testing agents that understand video games. They said the agents can, quote, reason about the game based solely on the action on the screen and offer up suggestions for what to do next in real-time conversation. If you get stuck, the agents can also access Google Search to figure out what you should do next. Google is testing the agents on games like Clash of Clans and Hay Day.

Whether you're an operations leader, marketer, or even a non-technical founder, Plumb gives you the power of AI without the technical hassle. Get instant access to top models like GPT-4o, Claude Sonnet 3.5, AssemblyAI, and many more. Don't let technology hold you back. Check out Use Plumb, that's Plumb with a B, for early access to the future of workflow automation. Today's episode is brought to you by Vanta. Whether you're starting or scaling your company's security program, demonstrating top-notch security practices and establishing trust is more important than ever.

Vanta automates compliance for ISO 27001, SOC 2, GDPR, and leading AI frameworks like ISO 42001 and NIST AI risk management framework, saving you time and money while helping you build customer trust. Plus, you can streamline security reviews by automating questionnaires and demonstrating your security posture with a customer-facing trust center all powered by Vanta AI.

Over 8,000 global companies like Langchain, Leela AI, and Factory AI use Vanta to demonstrate AI trust and prove security in real time. Learn more at vanta.com slash nlw. That's vanta.com slash nlw. Today's episode is brought to you, as always, by Superintelligent.

Have you ever wanted an AI daily brief but totally focused on how AI relates to your company? Is your company struggling with AI adoption, either because you're getting stalled figuring out what use cases will drive value or because the AI transformation that is happening is siloed at individual teams, departments, and employees and not able to change the company as a whole? Superintelligent has developed a new custom internal podcast product that inspires your teams by sharing the best AI use cases from inside and outside your company.

Think of it as an AI daily brief, but just for your company's AI use cases. If you'd like to learn more, go to besuper.ai slash partner and fill out the information request form. I am really excited about this product, so I will personally get right back to you. Again, that's besuper.ai slash partner. Still, we are not done because alongside the agents, Google is also introducing a new reasoning mode for Gemini 1.5 Pro, which they're calling deep research.

This seems to be closer to a long-form research tool than a competitor to OpenAI's O1 model. In deep research mode, Gemini responds to a prompt with a multi-step research plan. Once revised and approved, the model then spends a few minutes searching for and compiling information. It then repeats the process several times, iterating on the information learned. Once complete, the model generates a report on the key findings along with full citations of academic sourcing.
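
As a rough illustration of that loop, the sketch below mirrors the described flow: propose a plan, let the user approve it, run several search-and-compile passes, then assemble a cited report. All functions are hypothetical stand-ins, not Google's implementation or API.

```python
# Hedged sketch of the deep research flow as described: propose a plan, let the
# user approve it, run several search-and-compile passes, then assemble a cited
# report. All functions are hypothetical stand-ins, not Google's implementation.
from dataclasses import dataclass

@dataclass
class Source:
    title: str
    url: str

def propose_plan(prompt: str) -> list[str]:
    """Stub: the model drafts a multi-step research plan for the user to revise."""
    return [f"Survey background on: {prompt}", "Find primary sources", "Summarize findings"]

def search_and_compile(step: str, notes: list[str], sources: list[Source]) -> None:
    """Stub: one pass of searching the web and folding results into the notes."""
    notes.append(f"Notes for step: {step}")
    sources.append(Source(title=f"Result for: {step}", url="https://example.com"))

def deep_research(prompt: str, passes: int = 3) -> str:
    plan = propose_plan(prompt)        # in the product, the user revises and approves this
    notes: list[str] = []
    sources: list[Source] = []
    for _ in range(passes):            # iterate on the information learned
        for step in plan:
            search_and_compile(step, notes, sources)
    citations = "\n".join(f"- {s.title}: {s.url}" for s in sources)
    return "\n".join(notes) + "\n\nCitations:\n" + citations
```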

Google is calling it an agent as technically it completes this process using Google Search. David Citron, product director for Gemini apps, said, we built a new agentic system that uses Google's expertise of finding relevant information on the web to direct Gemini's browsing and research. Deep research saves you hours of time. Wharton professor Ethan Mollick, who has gone deep on advanced academic uses of AI, seems impressed.

He wrote, "...the new deep research feature from Google feels like one of the most appropriately googly uses of AI to date, and it is quite impressive. I've had access for a bit, and it does very good initial reports on almost any topic. The paywalls around academic sources put some limits." He did also include, "...I wish they had stats on the hallucination rate. I suspect better than an undergraduate, and it is more likely to miss subtle things than to get stuff completely wrong."

He continued, one warning to instructors is that the new Google Deep Research feature solves most of the issues with AI-created research assignments. Pretty solidly well-organized and written with accurate citations, it makes it very easy for students to skip or automate their research work. Bilawal Sidhu called it essentially Perplexity on steroids.

Last couple of announcements. Google is, of course, deploying these new model capabilities everywhere, and one of the first uses is an upgrade to Google's AI overviews. The company says that the tool will now be able to handle, quote, more complex topics as well as multimodal and multi-step searches. They also said it can answer questions about math and programming. You'll remember that AI overviews were part of the narrative challenge for Google at the beginning of the year. Initially, they were widely mocked online due to things like suggesting glue as a pizza topping.

Still, Google CEO Sundar Pichai said, "...our AI overviews now reach 1 billion people, enabling them to ask entirely new types of questions, quickly becoming one of our most popular search features ever. We'll continue to bring AI overviews to more countries and languages over the next year." Lastly, on the hardware side, Google has unveiled the sixth generation of their Trillium AI chip. The chip is used for training and inference, competing with NVIDIA GPUs alongside the Trainium chip from Amazon. They claim the performance improvements could fundamentally alter the economics of AI training.

They say that it delivers a 4x improvement in training performance compared to its predecessor, as well as a significant reduction in energy use. As a more tangible metric, Google is claiming a 2.5x improvement in training performance per dollar. Gemini 2.0 was trained exclusively on a Trillium cluster. And Google disclosed that they have built a 100,000 chip cluster, which they claim is one of the most powerful AI supercomputers.
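
Taking those two figures at face value, and assuming they describe the same workload, the back-of-the-envelope implication is that a Trillium chip-hour costs roughly 1.6 times as much as the previous generation's, while a fixed training job finishes in a quarter of the time and at about 40% of the dollar cost:

```python
# Back-of-the-envelope reading of Google's stated Trillium figures
# (4x training performance, 2.5x training performance per dollar), assuming
# both numbers describe the same workload. Google did not publish raw prices.
perf_ratio = 4.0             # new throughput / old throughput
perf_per_dollar_ratio = 2.5  # new throughput per dollar / old throughput per dollar

cost_per_hour_ratio = perf_ratio / perf_per_dollar_ratio  # ~1.6x pricier per chip-hour
job_time_ratio = 1 / perf_ratio                           # a fixed job takes 25% of the time
job_cost_ratio = 1 / perf_per_dollar_ratio                # ...and 40% of the dollar cost

print(f"cost per chip-hour: {cost_per_hour_ratio:.2f}x the previous generation")
print(f"time for the same training job: {job_time_ratio:.0%} of before")
print(f"cost for the same training job: {job_cost_ratio:.0%} of before")
```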

In their announcement, Google didn't provide any comparisons to rival chip makers, so it's a little hard to know how the new silicon stacks up. However, the chips are now generally available to Google Cloud users, so it probably won't take long for us to find out. Taking a step back, Google's brand story across the last couple years of AI has been a really fascinating one. I think if you had gone a few years back, Google was the default leader, both from a real and an imagined perspective when it came to generative AI.

The launch of ChatGPT and the ascendance of OpenAI really upset the apple cart. And it wasn't just that. Not only was there now a consumer product out ahead of Google, but in early 2023, Meta also carved out a totally different space because of their approach to open source. For most of 2023, Google felt distinctly behind when it came to generative AI.

Indeed, even one year ago, when Gemini 1.0 was launched, the broad perception was that their hand had been forced, that the model really wasn't as far along and wasn't competitive yet with GPT-4, and wouldn't be until they released the most performant version of it early in 2024. Basically, Google had to do something, and so they had to announce Gemini 1.0 earlier than they might otherwise have wanted to.

Then in the beginning of this year, while we did get a GPT-4 class model in Gemini, we also got what I was just mentioning, AI overviews and search that told people to put glue on pizza. And of course, the whole controversy and dust up around the historically inaccurate image generation, which forced diversity into situations in history which were very undiverse. Think black Nazis.

In other words, it was a pretty brutal beginning of the year for Google. Slowly but surely, though, that has changed. Undeniably, one of the big reasons for that is that Google got a breakout AI product hit in Notebook LM. The addition of the podcast summarization feature, which opened up this totally new set of use cases and ways of consuming information never before available, really got this ship pointed in the right direction and a ton of narrative juice back in the Google house.

That set the tone, I think, for this announcement, which was comprehensive, had a lot of great stuff in it, and was received incredibly positively. People are excited about these new features. They're excited about Astra. They're not dealing with this cynically. And importantly, from a brand perspective, it's more of a return to form than anything else. In other words, people are saying, oh, that Google that we know that we would have assumed would be a leader in this space, they are back.

And that, I think, is exactly where Google wants its brand to be. The company has an incredible number of advantages when it comes to the AI wars. They've got a slate of products to integrate AI into and to capture data from that potentially make their AI products not only very useful, but already plugged into the systems that people are using today. And so if they can continue this momentum, they could be poised for an even bigger 2025.

That's not to say that there aren't challenges, because as we've been discussing when it comes to agents, it's sort of like all bets are off and everything is up for grabs once again. Still, you got to think that the folks over at Google are a lot happier heading into 2025 than they were heading into 2024. And I think that they should be. For now, though, that is going to do it for today's AI Daily Brief. Appreciate you listening or watching as always. And until next time, peace.