Vision and voice integration is becoming a standard feature for LLMs due to the significant new use cases it opens up, such as real-time video analysis and enhanced voice interactions. OpenAI's recent announcement of Vision Mode and Google's Gemini 2.0 Flash have accelerated this trend, making it a baseline expectation for LLMs.
Early testers report that OpenAI's Vision Mode balances vision and voice input effectively, with more natural language responses and accurate descriptions, while Google's Gemini 2.0 Flash leans more heavily on vision, sometimes at the expense of language fluency.
Vision Mode is available starting this week to Plus, Team, and Pro tier subscribers. Enterprise and Education users will gain access in January.
Siri's integration with ChatGPT enhances its ability to handle complex commands, retain context for follow-up questions, and accept text input. It also allows Siri to hand off questions to ChatGPT when it cannot answer them, improving overall functionality.
Apple is significantly behind in the AI race, as evidenced by its reliance on third-party products like ChatGPT to enhance Siri. Its AI strategy has been criticized as failing and lagging years behind industry leaders like Google.
Apple is partnering with Broadcom to produce its first AI server chip, leveraging its history of successful silicon design. The chip aims to improve Apple's AI capabilities, particularly in model training and inference at scale.
Microsoft's Phi-4 focuses on small language models, emphasizing cost-effective performance and synthetic data training. The model is designed to compete in specific areas like math problems and is available for research purposes on Microsoft's development platform.
Anthropic's Claude 3.5 Haiku is notable for its long context window of 200,000 tokens, making it excellent for processing large datasets quickly. It is also the smallest and fastest variant of Anthropic's LLM, excelling in tasks like coding recommendations and content moderation.
Lumen Orbit aims to build modular orbital data centers, scaling them into multi-gigawatt compute clusters by the end of the decade. The company believes this approach is a lower-cost alternative to building data centers on Earth, leveraging space-based solar power.
Vision and voice are increasingly just part of the AI tool set after the latest set of announcements. The AI Daily Brief is a daily podcast and video about the most important news and discussions in AI. To join the conversation, follow the Discord link in our show notes.
Hello, friends. One more slightly abnormal AI Daily Brief episode. Over the last couple of days, we have had to do main-only episodes, and so we are balancing things out with an extended headlines edition. These are all the announcements from the last few days that were smaller but still significant. We're going to stack them all together, see what they all mean in concert, and we're kicking it off with the latest announcement from OpenAI's 12 Days of OpenAI slash Shipmas, which is real-time vision.
This is a pretty big one. Now, this feature was first demoed almost seven months ago and was recently featured in an episode of 60 Minutes. From that clip: "Draw whatever body part you want to quiz him on and have him label it. How does that sound?" "That sounds like a fantastic plan." "When Brockman pointed his phone's camera at the blackboard, the AI started to quiz me." "Let's start with the heart. Anderson, can you draw and label where the heart is in the body?" "It understood what I was doing, even though my drawing was pretty crude." "The location is spot on. The brain is right there in the head."
So obviously this opens up a whole world of new use cases and was something that people were really excited about. In terms of how long it took to come, it's not exactly clear what the delay was. It's possible that it just took this long to get it all right. It's also possible that in this case, OpenAI's hand was forced by Google's announcement yesterday of Gemini 2.0 Flash, which has a similar function. Either way, Vision Mode is here now. It's available to Plus, Team, and Pro tier subscribers starting this week. Enterprise and Education users will gain access in January.
As has become normal, the feature will be unavailable in the EU for the time being. This really does feel to me like one that's kind of hard to discuss in the context of a podcast like this. It feels as though this is likely to be one of those things where, before long, we literally won't be able to imagine the period before it existed. There are just so many incredibly useful opportunities that this opens up, and it will change the way that we interact with AI.
Molly Kinder tweeted, "For the past two months, I've been testing ChatGPT Advanced Voice Mode for implications for work. One of my takeaways: it will be far more disruptive for jobs when voice plus real-time video and vision are combined. Exactly what OpenAI just announced."
Some people started comparing the models. Alexander Gia writes, "Just tested AVM with vision for two hours. Impressive results. Compared to Gemini, ChatGPT responds more accurately when describing things and its language is more natural. While Gemini overly focuses on vision, ChatGPT balances vision and voice input effectively."
Still, I think that if you go searching around, it's not super clear that there's one frontrunner here. Basically overnight, this has become a base-level, expected feature for LLMs. Now I should point out that that was not the only thing that OpenAI released yesterday. They also created a Santa mode for ChatGPT.
Lending this some credence, it's actually something I had already had ChatGPT's voices pretending to do. And there are going to be a lot of kids who have a very magical experience because of this. So while, yes, it may not be as significant as some of the other announcements, I think this one is pretty fun.
Going back a day, I have to think that OpenAI had advance knowledge of when Google was going to make its big announcements. Because on Wednesday, December 11th, the fifth day of Shipmas, while Google was announcing an extensive and valuable new set of products and agents, OpenAI was just officially announcing ChatGPT and Apple Intelligence coming together. The update came as part of the rollout of iOS 18.2, which includes a number of new Apple Intelligence features.
Of course, since Apple Intelligence was announced, the feature at the top of everyone's Christmas list has been a version of Siri redesigned around AI. Alas, we are not there yet, although the ChatGPT support does go some of the way. Siri can now hand off questions to ChatGPT when it can't answer them itself. For example, coming up with a recipe based on ingredients in your pantry isn't in Siri's wheelhouse, but ChatGPT can handle it with ease. ChatGPT's voice mode is also being used to enhance Siri's ability to understand commands,
particularly when you stumble partway through a sentence. The example given was someone saying, "Siri, set an alarm for 8... no, set a timer for 10 minutes, actually make that 5." Previously, that command would have left Siri hopelessly confused, but apparently it now just works fine. Another benefit is that Siri can now retain context to answer follow-up questions. You could ask Siri for directions and then ask it what the weather will be like when you get there.
In smaller quality of life improvements, Siri can now accept text inputs rather than only being voice activated. It also now has an understanding of the Apple interface, so it can provide instructions on how to use an Apple device if it's your first time using a feature. ChatGPT has also been integrated into the Apple Intelligence writing tools and camera features.
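To be clear, Apple hasn't published anything about how the handoff works under the hood, but the general pattern is easy to picture. Here's a minimal, purely hypothetical sketch in Python of that kind of fallback routing, using OpenAI's public API; the local_assistant_answer function and the intents it handles are stand-ins I made up for illustration, not anything Apple has described.

```python
# Hypothetical sketch of a "hand off to ChatGPT when the local assistant
# can't answer" pattern. local_assistant_answer is a made-up stand-in for an
# on-device assistant; it is NOT how Siri actually works.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def local_assistant_answer(question: str) -> str | None:
    """Pretend on-device assistant: handles only a tiny set of intents."""
    if "timer" in question.lower() or "alarm" in question.lower():
        return "Okay, setting that for you."
    return None  # signal that the local assistant can't handle it


def answer(question: str, history: list[dict]) -> str:
    local = local_assistant_answer(question)
    if local is not None:
        return local
    # Fall back to ChatGPT, passing prior turns so follow-up questions
    # ("what's the weather when I get there?") keep their context.
    messages = history + [{"role": "user", "content": question}]
    response = client.chat.completions.create(model="gpt-4o", messages=messages)
    reply = response.choices[0].message.content
    history.extend([{"role": "user", "content": question},
                    {"role": "assistant", "content": reply}])
    return reply


history: list[dict] = []
print(answer("Set a timer for 10 minutes", history))
print(answer("What can I cook with eggs, spinach, and rice?", history))
```

The point of the sketch is just the shape of the thing: try the constrained local assistant first, and only ship the question plus prior turns off to the bigger model when the local one comes up empty.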
Still, the whole thing just shows, frankly, how unfathomably far behind Apple is with this. This release is notably one of the first times that Apple has allowed a third-party product to interface with Apple software. And the fact that they had to shows just what a desperate state they're in. Apple has said they're working on a version of Siri that understands personal context and can tap into the data stored on your phone, but that's still all in the future.
Zero X Bowen writes, "Apple releasing Apple Intelligence as a wrapper of OpenAI is a confession to the world that its own AI sucks. Its AI strategy has failed and is behind the industry by years. And Siri is a total failure." This is harsh, but not necessarily all that far off. I was just discussing on a recent episode how much better Google must feel about its positioning heading into 2025 compared to going into 2024. But Apple is still out in the tall grass, man.
And yet, of course, they're still worth paying attention to. Another bit of news from Apple this week is that they are partnering with Broadcom to produce their first AI server chip. Apple's history of producing their own silicon has been a rousing success. The A4, which debuted in 2010 in the iPad and later the iPhone 4, was a game changer for system-on-a-chip design. The switch away from using Intel chips in Macs was another notable milestone.
The latest M4 series Macs in particular are able to produce some stunning results in AI, running models as large as 70B locally. Until now, Apple has been lacking their own AI chips suitable for model training and inference at scale. Currently, Apple Intelligence features are served by silicon designed by other tech companies. At the recent Amazon re:Invent conference, Apple gave a glowing report of Amazon's Trainium chips. But Apple also likely uses Nvidia chips along with the rest of the industry.
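As a quick aside on that point about M4 Macs running 70B models locally, the reason that's plausible comes down to simple memory arithmetic. Here's a rough back-of-envelope in Python; the bytes-per-parameter figures match the usual precisions, but the memory configurations in the comments are illustrative assumptions on my part.

```python
# Back-of-envelope memory math for running a 70B-parameter model locally.
# Bytes-per-parameter values reflect common precisions; real quantized
# formats add some overhead for scales, activations, and the KV cache.
params = 70e9

for label, bytes_per_param in [("fp16", 2.0), ("8-bit", 1.0), ("4-bit", 0.5)]:
    weight_gb = params * bytes_per_param / 1e9
    print(f"{label:>5}: ~{weight_gb:.0f} GB of weights")

# fp16 : ~140 GB -> out of reach for all but the largest unified-memory configs
# 8-bit: ~70 GB  -> needs a high-memory machine
# 4-bit: ~35 GB  -> fits comfortably on 48-64 GB machines
```

In other words, a 4-bit quantized 70B model is in the ballpark of 35 GB of weights, which is why high-memory Apple Silicon machines with unified memory can run it at all.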
What makes this announcement enticing is Apple's history of producing silicon that's above the competition. If this chip is as good comparatively as the M4, it could push Apple's AI dramatically forward. But chip design is difficult, and AI servers require an entirely different architecture from CPUs.
A reasonable question is whether Apple still has the institutional knowledge to perform at that level. That pedigree likely had a lot to do with chipmaking guru Jim Keller, who was at the company between 2008 and 2012. He was involved in designing the A4 and was in charge of setting the specs for the first two editions of the MacBook Air. He's now working on his own AI silicon as the CEO of Tenstorrent.
However, there are some interesting things about how they're attacking the problem. The Information reported the design will be focused on the chip's networking technology. Networking is one of the key limiting factors currently for AI training. Elon Musk's Colossus training cluster, which contains 100,000 networked chips, was considered impossible right up until it was achieved. Scaling much larger than that will require a breakthrough in networking technology. I will say they aim to finish the chip design within 12 months, so we'll have more when that comes to fruition.
Today's episode is brought to you by Vanta. Whether you're starting or scaling your company's security program, demonstrating top-notch security practices and establishing trust is more important than ever.
Vanta automates compliance for ISO 27001, SOC 2, GDPR, and leading AI frameworks like ISO 42001 and NIST AI risk management framework, saving you time and money while helping you build customer trust. Plus, you can streamline security reviews by automating questionnaires and demonstrating your security posture with a customer-facing trust center all powered by Vanta AI.
Over 8,000 global companies like Langchain, Leela AI, and Factory AI use Vanta to demonstrate AI trust and prove security in real time. Learn more at vanta.com slash nlw. That's vanta.com slash nlw.
Today's episode is brought to you by Superintelligent. Every single business workflow and function is being remade and reimagined with artificial intelligence. There is a huge challenge, however, of going from the potential of AI to actually capturing that value. And that gap is what Superintelligent is dedicated to filling.
Superintelligent accelerates AI adoption and engagement to help teams actually use AI to increase productivity and drive business value. An interactive AI use case registry gives your company full visibility into how people are using artificial intelligence right now. Pair that with capabilities building content in the form of tutorials, learning paths, and a use case library, and Superintelligent helps people inside your company show how they're getting value out of AI while providing resources for people to put that inspiration into action.
The next three teams that sign up with 100 or more seats are going to get free embedded consulting. That's a process by which our Superintelligent team sits with your organization, figures out the specific use cases that matter most to you, and helps actually ensure support for adoption of those use cases to drive real value. Go to besuper.ai to learn more about this AI enablement network. And now back to the show.
Speaking of both big tech and companies that partner with OpenAI, Microsoft has launched Phi-4, a new version of its in-house language model. This is the first new generation since the release of Phi-3 in April of this year. That release was noteworthy at the time for including an ultra-small 4B model in the lineup. It was one of the most performant models that could fit on an edge device such as a phone, and was a very cheap option for developers.
However, Phi-3 never really made any waves. Rivals quickly released cheaper and more performant small models, and that field has only become more competitive in recent months. With this release, it appears that Microsoft is sticking to its philosophy of working on small language models. The model is a 14B, similar in size to GPT-4o mini, Gemini 2.0 Flash, and Claude 3.5 Haiku. Microsoft is highlighting the model's performance on math problems in particular.
Phi-4 outcompetes Google's Gemini 1.5 Pro, OpenAI's GPT-4o, and Anthropic's Claude 3.5 Sonnet in this category, all much larger models. In overall benchmarks, the model seems comparable to Llama 3.3, but not quite as good as GPT-4o. Microsoft says the jump in performance is due to the use of, quote, high-quality synthetic datasets, curation of high-quality organic data, and post-training innovations. The release contains only a single variant, with no update to the smaller Phi-3.5 versions.
Microsoft has released the new model with limited access for research purposes on its development platform, and the model will be available on Hugging Face next week.
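For anyone who wants to kick the tires once it lands there, this is roughly what that would look like with the Hugging Face transformers library. A minimal sketch, assuming a model ID along the lines of microsoft/phi-4, which is my guess rather than a confirmed identifier, and a machine with enough memory for a 14B model:

```python
# Sketch of loading a 14B model like Phi-4 from Hugging Face once released.
# "microsoft/phi-4" is an assumed model ID, not confirmed by Microsoft.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/phi-4"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # roughly 28 GB of weights for 14B params in bf16
    device_map="auto",
)

prompt = "A train leaves at 3:40 pm and arrives at 6:15 pm. How long is the trip?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Nothing exotic, in other words; it's the same handful of calls you'd use for any causal language model on the Hub.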
There's actually a lot that's interesting about this release. First of all, there's the constant question of what it means for the OpenAI and Microsoft relationship. It's pretty clear that even if that relationship persists forever, Microsoft will increasingly hedge and think about its own models as well. But there's also the dimension of synthetic data. The Turing Post wrote, "Unlike models trained primarily on organic web data, Phi-4's synthetic-heavy training approach doesn't just mimic human-generated content, it redefines the learning process. Synthetic data is crafted for diversity, complexity, precision, and chain-of-thought reasoning." Synthetic data isn't cheap filler; it's structured learning. It ensures a gradual, logical progression, helping the model learn better reasoning patterns than messy human-written web content. Another dimension of this is the fact that there is so much competition at the smaller model level. The AI wars are not being fought just at the frontier of the state of the art. They're being fought on the dimension of cost-effective performance that can work on a variety of devices.
Speaking of which, a small one on the same front: Anthropic has released Claude 3.5 Haiku as a chatbot. The smallest and fastest variant of the lab's LLM was previously only available via API. First released last month, the model surprised by beating Anthropic's flagship Claude 3 Opus model across certain benchmarks. In particular, the smaller model was well-suited for coding recommendations, data extraction and labeling, and content moderation.
3.5 Haiku still doesn't support image analysis, however, unlike the previous Haiku and Claude 3.5 Sonnet. In terms of use cases, 3.5 Haiku stands out as having one of the longest context windows on the market at 200,000 tokens, making it excellent at processing large datasets quickly. Availability on the chatbot platform means 3.5 Haiku can now be used with features exclusive to that UX, including Claude Artifacts.
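For the API side, where the model has been available since launch, a minimal call with Anthropic's Python SDK looks something like the sketch below. The dated model string is the identifier Anthropic used at launch, but treat it as an assumption and check the current model list before relying on it.

```python
# Minimal Claude 3.5 Haiku call via the Anthropic Python SDK.
# The model string is the dated identifier used at launch; verify it against
# Anthropic's current model list before relying on it.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

message = client.messages.create(
    model="claude-3-5-haiku-20241022",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": "Label each of these comments as 'ok' or 'needs review': ...",
    }],
)
print(message.content[0].text)
```

The long context window is the draw here: you can pack a large batch of items into a single request rather than chunking aggressively.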
Lastly, we'll round out today with a couple of stories around big, interesting new company fundings. One very interesting AI startup has closed their seed funding round after seeing what you might call stratospheric demand. Lumen Orbit has raised $11 million to build data centers in space.
The round valued the company at $40 million and was only open for a few days after receiving more than 200 inquiries from VCs. The company immediately opened another round on top at a higher valuation to allow more investors in. The seed round was led by NFX with participation from Fuse VC, Soma Capital, and scout funds from a16z and Sequoia. Lumen Orbit was founded only in January and emerged with huge buzz out of the summer Y Combinator class.
The goal is about as grand as you would imagine. The company aims to create modular orbital data centers. The idea is that individual compute pods can be launched separately and assembled in orbit alongside large solar arrays. The goal is to then scale these data centers into multi-gigawatt compute clusters by the end of the decade. For reference, xAI's Colossus supercluster currently uses around 150 megawatts.
As surprising as this might seem, this is actually viewed as a lower-cost alternative to building data centers on Earth. CEO Philip Johnston said, "Instead of paying $140 million for electricity, you can pay $10 million for a launch and use infinite solar." The first step is a demonstration satellite launch in May containing NVIDIA GPUs, followed by another test satellite with 100 times the compute the following year.
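Those numbers are worth a quick sanity check. Here's a rough back-of-envelope in Python comparing the quoted figures with what a gigawatt of terrestrial electricity costs over a year; the electricity price is purely an assumption on my part, and the dollar figures in the comparison are just the ones from the quote.

```python
# Rough sanity check on the Lumen Orbit pitch. The electricity price is an
# assumed figure; the dollar amounts in the comparison come from the CEO quote.
gigawatt_kw = 1e6          # 1 GW expressed in kilowatts
hours_per_year = 8760
price_per_kwh = 0.06       # assumed industrial electricity rate, $/kWh

annual_cost = gigawatt_kw * hours_per_year * price_per_kwh
print(f"1 GW for a year: ~${annual_cost / 1e6:.0f}M of electricity")
# -> roughly $526M at $0.06/kWh, so gigawatt-scale power bills land in the
#    hundreds of millions per year, which is the gap the quoted
#    "$140 million for electricity" versus "$10 million for a launch"
#    comparison is pointing at.

# Scale comparison: a multi-gigawatt cluster versus Colossus at ~150 MW
print(f"1 GW is ~{1000 / 150:.1f}x Colossus's ~150 MW")
```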
Johnston said, "A lot of space companies take five years to launch. We are launching in 18 months. We'd rather launch frequently with smaller changes than wait five years and launch with a ton of incremental changes." With the cost to launch satellites rapidly decreasing, the founders initially thought space solar would be a cool idea. Once they realized how difficult and energy-intensive it would be to transmit the solar power back to Earth, they decided to simply launch the data centers into space as well.
I don't know about you guys, but it's hard not to get excited about the level of ambition when you see something like this. Anyways, that is going to do it for this extended headlines edition of the AI Daily Brief. Over the weekend, we will have a long reads episode. And then next week will be our last week of normal episodes. We'll have a bunch of end of year content in and around the holidays, as you would expect. Appreciate you listening or watching as always. And until next time, peace.