The costs required to bring a new form factor to market are just insane. We've had a reasoning breakthrough with large language models and generative models, and they're really hungry for new kinds of input and context that the current generation of interfaces is not providing. The north star of computing has always been, to borrow a Steve Jobs quote, to be the bicycle for the mind.
The history of hardware has been the history of computers. The rate limit here is silicon. It's sand, and there's tons of sand in the world, and eventually somebody else will figure out how to make sand that does the same thing. We have adapted to interfaces over the last sixty years, but the way we interact with computers today is actually dramatically unnatural.
Hello everyone, welcome back to the a16z Podcast. Now, if you're a regular around here, you probably have a sense for just how important the topic of artificial intelligence is to us. In fact, it's so important that we decided to create a new AI-specific podcast, which, to no one's surprise, is called AI + a16z. And no, we did not consult AI on the naming.
Today you'll get to hear one of the early episodes from that podcast, hosted by venture editor Derrick Harris. There, he sits down with a16z general partner Anjney Midha to discuss a very important question: if we're rethinking our software platforms around large language models, do we also need to rethink our hardware, and our interfaces, from the ground up? Anjney maps out the hardware requirements across input, reasoning, and output. In other words, what does the UI look like for AI? Will it look like a phone, or something completely different? And what can we learn from the millennia of human behavior, long before the technology existed, that might actually inform what's to come? Of course, there are many companies trying to solve this right now, but this conversation may actually give you clues as to why no one has quite solved it yet.
This episode comes straight from our new AI + a16z feed, so if you like this episode, don't forget to subscribe to get all of a16z's latest AI content, including episodes around the AI supply chain, open source, RAG, and a whole lot more. Of course, we will include the link in our show notes. And with that, Derrick, take it away.
Hi, this is Derrick Harris, and you're listening to the a16z AI podcast, where we dig into all things artificial intelligence with our in-house team of experts, as well as the founders, engineers, and researchers working at the state of the art. In this episode, I speak with a16z general partner Anjney Midha about what AI hardware will look like in the years to come, and why there's so much innovation yet to happen at the inference layer. Among other things, he explains how he sees wearable devices evolving to take advantage of improvements in sensors and workload-specific chips, and how the introduction of big-company technology like the Apple Vision Pro can actually lay the foundation for startups. Because we recorded right after NVIDIA's big GTC event, as with our previous episode with Naveen Rao, we kick off the conversation talking about training workloads versus inference workloads, and how NVIDIA came to dominate the former category.
As a reminder, please note that the content here is for informational purposes only, should not be taken as legal, business, tax, or investment advice, or be used to evaluate any investment or security, and is not directed at any investors or potential investors in any a16z fund. For more details, please see a16z.com/disclosures.

I saw recently that Ollama added support for AMD GPUs, and I think I saw someone compare NVIDIA to Sun Microsystems in the early days of the web, which seems like it might be wishful thinking. Are the skeptics kind of underestimating the stranglehold that NVIDIA
has? There are two schools of thought. One is that NVIDIA's stranglehold on this is completely transitory, that their margins are overinflated because of the supply chain crisis over the last twenty-four months, where we had this explosion in demand, but supply is gonna catch up.
And basically, those folks will tell you that the rate limit here is silicon. It's sand, and there's tons of sand in the world, and eventually somebody else will figure out how to make sand that does the same thing.
And I personally find that view provocative but reductive, because it doesn't do a first-principles analysis of the fact that training has these idiosyncratic needs, like a really robust software and driver layer that can orchestrate thousands of these chips acting in unison. So on that side of the debate, I'm certainly one who believes that the developer experience NVIDIA started investing in, by the way, a decade ago is in the later stages of compounding right now, and it's really hard to dislodge.
What we may see is margins compress over time, because budgets are shifting from training to inference. And I think that's actually where a lot of the exciting stuff is happening, and we're going to spend a bulk of our time today, hopefully, talking about inference, because that's open season right now. Every time you have a new computing or software primitive, it often results in new kinds of workloads that the incumbents have a harder time keeping up with.
And I would argue the inference workloads, like the ones you mentioned with Ollama, for example, are entirely new kinds of computing work that we haven't seen before. And so that's a much more even playing field. I don't think there was a way for NVIDIA to invest in that kind of workload ten years ago, because it just didn't exist, whereas training, in some shape or form, has fundamentally been around for the better part of a decade, because deep learning has been around that long.
And actually, I think this is a good time to introduce what I think is a useful mental model I have about the future of hardware. I've found there are several ways to reason about hardware. One, you can work backwards from the customer: who is the customer here, what do they need, what are their pain points? And then another way is to reason from history and see what the progression and evolution of computing has been over time.
And I think the history of hardware has been the history of computers, right? In my mind, if you look at the last sixty years or so of computers, which is basically modern computing, one popular way to reason about it is the hardware-versus-software split. But I think there's another way to reason about it, which I'll give full credit for to one of our founders, who spent a lot of time at Discord building the first AI bot there and was reasoning about how to expose language models to large numbers of users.
Before a lot of people got the chance to experiment with these large language models, he basically believed there are two lineages of computing. There's reasoning, or intelligence, and then there's interfaces. And you can go back over the last sixty years and basically break down every major computing revolution.
Each one maps to some fundamental progress in either the reasoning part of computing or the interface part of computing. And if you traverse the tree, if you go all the way back to the first neural networks in 1958, that's when we start to see reasoning happen, with early neural networks. That gave way to probabilistic graphical models in the eighties, which then led to GPU-accelerated deep learning in the two thousands, which gave way to transformers in the late twenty-tens.
And now we're in the next phase of massive transformers. So you can say, okay, that's one lineage, the lineage of reasoning and intelligence. In parallel, we've had this other lineage of computing, which is interfaces, right? You started with the command line and keyboards, and that gave way to the GUI with a mouse, when Steve Jobs got inspired by Xerox PARC.
And that has led ultimately to mobile interfaces, with touch as an input mechanism. And then I think the question is, where are we going next? I think we have really good reasons to believe that the next interface will be an AI companion, some combination of text, voice, and vision that can understand the world. That's almost a better predictor of where hardware is going, because the history of computing has so far shown that whichever one of those lineages is undergoing a moment of resonance with customers ends up dominating the kinds of workloads that then get to scale for the next ten, fifteen years. And I think we're in the middle of both a reasoning and an interface shift, and that's what's exciting right now, right?
It seems, if you look at how you're explaining this, that we have the smartphone interface pretty well established at this point, and it runs AI inference. Like, our computers have inference chips in them. But those are standard, well-known innovations at this point.
And what's new seems to be on the reasoning side, with the LLMs and foundation models and the ability to interact with a model that way. So check my reasoning here and correct me if I'm wrong: this is where we are on the reasoning side of things, so the hardware interface now has to take that step forward.
That's exactly right. I think we've basically had a reasoning breakthrough, like you're saying, with large language models and generative models, and they're really hungry for new kinds of input and context that the current generation of interfaces is not providing. And I think that's where we're seeing folks tinker with interfaces, or experiment with new interfaces, for the first time in ways that just weren't possible before, because the reasoning capability wasn't there.
So what does the hardware look like? Because, just taking interfaces: we walk around with smartphones that seem pretty powerful. We've had voice recognition for a while. We've had devices like these Amazon Echo devices in our homes, or whatever, that we can talk to, and they run some sort of model off in the cloud somewhere. Where's that step function, or that improvement, from what we have today, which seems pretty capable in some ways, to what you're explaining, which is like a whole new interface where maybe data capture as a primary feature is kind of the jumping-off point?
The north star of computing has always been, to borrow a Steve Jobs quote, to be the bicycle for the mind, right? It's to ultimately express or translate human thought into some set of actions in the world that then allow humans to accomplish what they want to do, in ways that are just impossible without the leverage of a tool. And so computers, in their most grand, dramatic expression, are tools for thought and action in a way that allows us to accomplish things we never could without those tools. And so if you ask, okay, well, why aren't computers able to help us accomplish that north star today? There's a whole host of reasons.
But I think you start at the top of that list and work your way down, and the first one is that they're pretty dumb, right, at inferring our intent about the world. Humans, we have adapted to interfaces over the last sixty years, but the way we interact with computers today is actually dramatically unnatural, mostly because we are compensating for the computer's lack of ability to understand our intent and translate that into action proactively.
There's a paradigm in computing, which is declarative versus imperative, right? The idea being that with imperative, you are extremely prescriptive about what you want the computer to do.
You say, open up this file, and then here's the set of instructions I want you to follow, almost like you're instructing a toddler, right? And declarative is when you just declare, essentially, a goal that you have in mind, the way you would interact with an adult, and say, hey, please go book a flight for me, or whatever your goal is, and then it goes and reasons about how to do that. And humans are phenomenal at doing that.
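To make that imperative-versus-declarative contrast concrete, here's a minimal, hypothetical Python sketch. The flight data and the stubbed "agent" step are illustrative stand-ins, not any real booking or agent API.

```python
# A minimal, hypothetical sketch of imperative vs. declarative interaction.
# The flight data and the "agent" step are stand-ins, not a real API.

from dataclasses import dataclass

@dataclass
class Flight:
    number: str
    price: float

FLIGHTS = [Flight("UA 123", 420.0), Flight("DL 456", 389.0)]

def book_flight_imperative() -> str:
    """Imperative: the human prescribes every step the computer takes."""
    candidates = list(FLIGHTS)                          # "search for flights"
    cheapest = min(candidates, key=lambda f: f.price)   # "pick the cheapest"
    return f"Booked {cheapest.number} at ${cheapest.price:.0f}"  # "book it"

def book_flight_declarative(goal: str) -> str:
    """Declarative: the human states a goal; the system plans the steps.
    In a real system a language model would do the planning; here we stub it."""
    return book_flight_imperative()

print(book_flight_imperative())
print(book_flight_declarative("Book me the cheapest flight to New York"))
```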
And computers are nowhere close at translating thought into action. So I think the big limitation is that computers just aren't smart enough to translate thought into action. And if you ask, what's the big blocker there? If generative models are in fact capable of reasoning, then why haven't we seen computers cross that reasoning step, that intelligence step? I think there are two big problems.
One, there is a fundamental context problem. The way generative models work is that they're only as good as how you prompt them. They're only as good as the questions you ask them.
That's why it took ChatGPT, which is essentially just a slightly different context form factor for asking questions of GPT-3.5, which had been around as a raw next-token prediction endpoint for eight or nine months before ChatGPT came out. But nobody really did anything interesting with it. And then when they packaged it up as a chat interface, the rest is history. It turns out that just allowing humans to prompt the model, talk to it, and give it context in a useful way makes a massive difference in the value you can derive from these models.
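As a rough sketch of that point, the value came less from a new model than from wrapping a raw next-token endpoint in a chat loop that accumulates context. The `complete()` function below is a hypothetical stub, not any provider's actual API.

```python
# A hypothetical sketch: the same raw completion model becomes far more
# useful once a chat loop carries conversational context to it.
# `complete()` is a stub standing in for a raw next-token endpoint.

def complete(prompt: str) -> str:
    """Stubbed next-token completion endpoint (illustration only)."""
    return "...model output..."

def chat() -> None:
    history = []  # the "context form factor": accumulated turns
    while True:
        user_turn = input("You: ")
        if not user_turn:
            break
        history.append(f"User: {user_turn}")
        prompt = "\n".join(history) + "\nAssistant:"
        reply = complete(prompt)          # same raw model, richer context
        history.append(f"Assistant: {reply}")
        print("Assistant:", reply)

if __name__ == "__main__":
    chat()  # the chat wrapper, not the model, is what changed for users
```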
While we've seen how valuable that is for pure text generation, when it comes to taking action in the world, doing something, predicting the next action you want to take and taking that action for you, there's nothing close to an interface that can see what you're seeing, listen to what you're listening to, hear what others in the room are hearing, look at where your eyes are tracking, infer all of that context about what the human wants to achieve, and then proactively do it for you. Because right now, we're basically forcing these models to try to understand us through a little straw, right? All they can get is a tiny representation of reality, whatever we can type in via text chat.
But we're losing all the other context about reality. And so there's an interface problem, where there's no interface today that seamlessly captures the entire world you're interacting with and then translates that into a prompt for these generative models. Now, the solution space is fascinating to think about, because you could argue, well, the smartphone is probably the most densely packed world-sensor rig that we've ever had, right?
Just the one I have here has three rear-facing cameras and one forward-facing, a depth sensor, RGB sensing, an accelerometer for motion, GPS so it knows where I am, and microphones. It can see what I'm seeing.
And so you might go, well, what are you talking about? Millions of people now have a full sensor array in their pocket. The issue is that when those sensors are all sitting in your pocket, in a way that can't seamlessly capture the world and insert itself proactively into your day, it's useless as an interface.
You need an interface that can provide the inference endpoint with sufficient continuous context about the world the user is facing, so that the reasoning layer can start making useful predictions about what you want to do. And I just think we haven't unlocked that form factor yet. But voice comes close, and vision, I think, will be a huge part of it, as we're seeing with multimodal models. And I think there are, on the margins, huge gains to be had from precise input that mirrors thought, like eye tracking. I don't know if you've tried the Apple Vision Pro, but the entire premise of the Vision Pro interaction system is that you replace a mouse with your eyes, and the eye is extraordinarily good at going where your brain wants to go; it's actually the fastest response in the human body that follows thought. So if a computer interface can infer what you're about to do, because it has access to your intent via your eyes, suddenly you start to shave seconds of latency between a computer understanding what you want to do and actually doing it. And the history of human-computer interfaces is ripe with examples of how shaving off marginal amounts of latency that may seem trivial makes dramatic changes in user adoption.
What does that look like? Because we have companies experimenting right now with pendants and pins and that sort of thing. But then you look at some of these wearable headsets: maybe you have to be corded in, maybe you have this giant thing on your head, because you have to pack the sensors and a powerful enough chip into the device itself. So where do we actually have to get to from a hardware perspective?
That's a great jumping-off point, right? Because I think you can break down the hardware into three major buckets. One is: what's the input that you need? What's the full context window it has over your life? There's a bunch of hardware required to accurately sense the world, and that's a sensing problem. The second is the actual hardware required to then process all of that input and make sense of it.
And then the last step is the output, right? How do you relay the results of that reasoning step, that middle step, back to the human, and do it in a sufficiently integrated, seamless loop that you can have a conversation with your computer? The same way that when you see a programmer in a flow state, literally programming in their IDE, and they make a syntax error, the IDE intelligently says, hey, here's your syntax error, and the programmer integrates that into their workflow. I'm a terrible programmer, and so I've had the privilege of working with some remarkable programmers in my life.
And you can just tell when they're in flow, that they're almost bionic, when they and their computer melt into one. That's primarily the result of a bunch of really great decisions made along the way about how the computer talks back to you about the inference, or the reasoning, it has done. Big picture:
You can think of the hardware needed as three buckets: input, reasoning, and output. The human anatomy analogy here would be eyes, ears, and fingers to sense what's going on in the world; that's the first step, the sensing. Then you need a brain to make sense of all that input. And then ultimately, like we have our hands and arms, you need something to manipulate the environment around you and actually take action.
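As a rough illustration of those three buckets, here's a hypothetical sense-reason-act loop. The sensor, reasoner, and actuator classes are placeholders for whatever hardware and models a real device would use, not any specific product.

```python
# A hypothetical sense -> reason -> act loop mirroring the three buckets:
# input (eyes and ears), reasoning (brain), output (hands).
# All three classes are placeholders, not real device or model APIs.

import time

class Sensors:               # input: cameras, microphones, eye tracking...
    def capture(self) -> dict:
        return {"audio": "...", "video_frame": "...", "gaze": (0.4, 0.7)}

class Reasoner:              # reasoning: on-device or cloud inference
    def infer_intent(self, context: dict) -> str:
        return "look up the restaurant the user is staring at"

class Actuators:             # output: speech, display, actions on your behalf
    def act(self, action: str) -> None:
        print("Taking action:", action)

def companion_loop(steps: int = 3) -> None:
    sensors, brain, hands = Sensors(), Reasoner(), Actuators()
    for _ in range(steps):
        context = sensors.capture()           # 1. sense the world
        action = brain.infer_intent(context)  # 2. reason about intent
        hands.act(action)                     # 3. act for the user
        time.sleep(0.1)

companion_loop()
```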
Traditionally, most innovation in this area happened in robotics. It was a well-contained discipline of computer science, reasoning about how to bring those three things together in a way that was research-friendly. Robotics research has traditionally happened in industrial labs, where they don't have to interact with humans very much outside constrained environments like warehouses and so on.
Today, what we're seeing is a dramatic shift, where the generative model breakthroughs have resulted in tons of research outside of robotics, in consumer land. The form factors are everything from pendants and little pins on people's collars, to pairs of glasses that blend seamlessly into everyday life and don't look any different from the glasses you or I would wear if we just had prescriptions. We're also seeing companies thinking about more invasive, embedded devices; it might literally be a brain-computer interface, a chip, right in you.
If you asked me, hey, where are we seeing things go from science to engineering, where it might actually be in everyday people's hands soon? Certainly, the most innovation is happening in the middle step, the brain step, which is where inference is today. There are companies making chips that are custom-designed for specific models, where they're literally burning in the weights.
This was an idea I first heard from David Holz, the founder of Midjourney, early on. His intuition was that diffusion models were so effective at image generation that eventually you'd have only a handful of models handling the bulk of inference workloads for image generation.
And at that point, when you have sufficient scale and volume of a certain type of reasoning, when the brain only has to do one type of task all day long, you can actually make a special chip that just does that kind of task, right? So you burn the model weights into the chip. That dramatically reduces the flexibility of what that brain can reason about; you essentially turn a big brain into a small brain, but you can get extraordinary orders of magnitude of speedup: a hundred x, a hundred and fifty x, two hundred x.
And that's generally been the history of computing as well. When you have sufficient maturity of a certain type of workload, usually the chip sector makes a chip just for your calculator, just for your fridge, and so on. So I think we're about to start seeing that with generative modeling as well. On the input and output side, it's been such a difficult place for startups to make a dent, because the costs required to bring a new form factor to market are just insane.
And so the closest I think we've seen companies come are teams like Oculus, right, who brought one of the first really mass-market virtual reality form factors to market. Their hack was to piggyback off the PC and smartphone supply chains that had grown up in the preceding decade, which had driven the cost of a bunch of individual components dramatically down and resulted in a huge manufacturing ecosystem in China. They were able to basically go shopping off the shelf and cobble a bunch of parts together. I believe the first DK1 screen was literally a Samsung or LG smartphone screen, and the founder tells the story, I'm paraphrasing here, that the display manufacturers didn't believe you could refresh the pixels at a fast enough rate for virtual reality.
And they actually hacked into the driver and said, no, look, we can do it. So I think what we need to be paying attention to on the hardware side is: are there really low-cost form factors that startups can innovate around, because there's an existing supply chain that an incumbent like Apple or Google has essentially subsidized at scale over the last decade? That's what's so exciting about things like the Apple Vision Pro. When Apple gets into the game, it results in second- and third-order effects, like new supply chains showing up for sensors like depth sensors and LiDAR and passthrough mixed-reality displays, that then give startups the license to go experiment with new form factors.
Whichever company figures out how to capture vision in a way that's sufficiently high-throughput for a user to be able to prompt a model and say, this is what I'm looking at, and is able to hear what the user is listening to, because audio is a very large amount of the context that drives decision-making in our daily lives, that is what makes or breaks it. So I think it's somebody who figures out the combination of audio and video, both sensing and output, in a way that's very easy to summon throughout your daily life, so that you're not having to say, hey Siri, hey Alexa. That wake-word-to-your-world thing has not worked, right? That paradigm is not working at scale. It's ruthlessly disruptive to natural conversation. So I think the interface will look natural; it will be voice- and audio-heavy, with vision augmenting the reasoning layer.
Yes, it does seem like that is disruptive, and it feels unnatural in terms of how you engage with the world around you. The same way that, when Google Glass came out, I thought it was cool, but that was quite a while ago, if you think about it. It just seemed unnatural; I don't think it seemed natural to have someone walking around with a camera on their face. Although maybe it's got a lot more precedent now than it did at the time.
But I'm generally pretty Lindy about interface design, which is that if there's no good form factor that proxies the device you're building, going back hundreds of years, it is remarkably hard to change human behavior so that the new paradigm works. Arguably, you could say smartphones were not a new interface for humans. We had been used to putting things in our pockets for years, whether that was wallets or notebooks.
And we had been used to tapping and interacting with notebooks, whether with a pencil and so on. I think the innovation there, of course, was that Steve Jobs figured out that styluses are pretty natural, even though they look like the pens and pencils humans have used forever; but it turns out what came even before the stylus was the human finger.
And so I think Google Glass completely violated all the laws of social Lindy-ness, right? There was nothing natural about a display floating in front of your eye with a camera. This is why I actually think it's extremely unlikely that if there's a company out there experimenting with a wearable that is sensing the world, and it records humans, it will get mainstream adoption, because that's fundamentally not a Lindy, natural behavior.
So to confirm, there is a trust and privacy issue, where you have to find a way to capture visual context about the world. If you had a pair of glasses with an onboard visual sensor that could tell the device what you're seeing but not record it, then I think that would be a breakthrough. And I think a big mistake that some of the big hardware manufacturers have made is that they've rushed to bring recording glasses to market, and people don't trust those; people haven't welcomed them into their lives at scale yet.
What do you think it takes to actually break through that privacy, or, I mean, security, maybe less so, but that privacy angle here, and get people to embrace some of these technologies?
Look, I think Apple is a great case study here. There is a constant dialogue between a technical solution and a consumer, or end user, promise, where Apple has basically said, look, we're going to not give ourselves the capability to look at certain kinds of data on your device. So Apple has a secure enclave on device, on the iPhone.
And a ton of the smartphone camera processing actually just happens on device. Almost a hundred percent of Apple's native computer vision processing on your photos happens on device; it never leaves the device. Now, they've offered cloud services on top, right, so you can back up your stuff for storage to iCloud.
But when it actually comes to raw inference, intelligence about your data, that's all happening locally. And so when you brought up Ollama earlier as an example, I think the reason we're seeing so many developers flock to Ollama is because there's a lot of demand from consumers to interact with language models in private ways. And that means they're going to figure out how to get the models to run locally, without the user's context and data ever leaving the user's device.
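As a small sketch of that local, private path, here's what calling a model through a locally running Ollama server can look like. The endpoint and field names follow Ollama's documented HTTP API as of this writing, so treat the details as an assumption and check the current docs; it also assumes a model such as Mistral has already been pulled.

```python
# A sketch of local inference against an Ollama server running on the same
# machine (e.g. after `ollama pull mistral`). Endpoint and field names follow
# Ollama's HTTP API as documented at the time of writing; verify before use.

import requests

def ask_local_model(prompt: str, model: str = "mistral") -> str:
    resp = requests.post(
        "http://localhost:11434/api/generate",   # Ollama's default local port
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=120,
    )
    resp.raise_for_status()
    # The generated text comes back in "response" and never leaves the machine.
    return resp.json()["response"]

print(ask_local_model("Summarize my notes on AI hardware in two sentences."))
```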
And that's going to result, I think, in a renaissance of new kinds of chips that are capable of handling massive inference workloads on device. We are yet to see those unlocked. But the good news is that open source models are phenomenal at unlocking efficiency.
The open source language model ecosystem is just so ravenous. When Mistral's new model came out, their mixture-of-experts open source model, a few months ago, it literally took less than twenty-four hours for somebody to quantize it and add GGML support, so that you could run it locally within the week.
Even though originally, just out of the box, that model was really hard to run on anything less than two 4090s (I had to get it up and running on my two gaming cards at home, with a hot machine next to my desk), today you can run a Mixtral model on much smaller, single chips, purely because of software improvements. You get these gains from the open source ecosystem that allow use cases to be sufficiently figured out, and then the hardware guys go, oh, let's make special chips just for these workloads.
And that's what's happening. I think there are people bringing diffusion model chips and transformer model chips to market that are just good at that one workload, so that when the startup says "you can trust us," the user knows they don't have to take it on faith; it's engineered by design, right? I think "can't do evil" by design beats promising you won't do evil, and I think that's the strongest commitment you can make to your customers.
I guess it doesn't hurt, or, in fact, it helps, would be the more affirmative way to state it, when the company comes in and doesn't have, let's say, a legacy business to support that's dependent on showing you ads, or that's otherwise dependent on using your personal data in some way. Is that fair?
Yes. Okay, that's a good question, which is: can you trust an AI companion that is being paid by someone other than you? And I do believe advertising, basically in its current shape and form, is going to be almost dead on arrival for most intelligent interfaces that people trust with daily actions. It's one thing to trust a model with next-token prediction, where you're asking for the next word you should use in an essay.
But when we make the leap to next-action prediction, where you're giving it the agency to act on your behalf and represent you in the world, there's a fundamental misalignment between you and the agent doing things for you if that agent is being paid by somebody other than you; you may not be able to trust it. Now, I think there may be flavors of advertising that work, that don't look like the Google ad format of today, where you stuff four paid links ahead of the seven or eight good ones.
And I think we're actually seeing Google struggle with that a little bit. But you're right. I think the most aligned business model would be the same way you hire a human to help you out with a task as an employee: you are hiring an agent, you're hiring a computer, and you're paying for it, and you're really the employer in that case.
And that's the most aligned business model, I think, when it comes to this next wave of generative computers. Do not think about them as mere tools; while the ultimate impact they can make on your life is that of a tool, the business model relationship should be much more akin to employer and employee than to a third party where somebody else subsidizes free use. The classic "if you're not paying, you're the product" exists here, and the failure modes are more catastrophic when you're trusting it to do things on your behalf.
I find it interesting right now, when you look at, say, subscriptions to ChatGPT, or subscriptions to any of the generative model services, Perplexity being a newer name: people are willing to pay for search now. People are willing to pay a subscription for the thing. And maybe that does indicate that this is something people will pay for going forward. Like, we have crossed that bridge where we realized, to your point, that free isn't really free, and something truly has value, because a company can come in and say, this is what we do and this is how we make our money, take it or leave it; but you know the service you're getting for the price you're paying.
Yeah, look, I think the long arc of tech has shown that the marginal cost of compute generally converges to zero over time, right? And we're in this really weird phase right now, because generative models are so new that we haven't seen the dramatic reduction in cost that Moore's law should be driving for compute. And so, as a result, the best generative model services in the world are expensive services, because it's expensive to run inference, right? But what is crazy, to your point, is that people are willing to pay for them anyway. That's how much economic value is being unlocked by generative models at this point.
Now, I've been personally involved, either as an advisor or investor or as an operator, with at least five generative model companies that have exploded past thirty to fifty million in subscription revenue in their first twelve months of monetizing. And this has all happened in the last two years, right? It's insane. And I think that's because the model actually accomplishes a task for you that you didn't think you could do on your own, whether that's generating an image with Midjourney, or getting an answer from Perplexity that would have taken you hours to find yourself, or generating a podcast of your voice from your text using ElevenLabs. These are things you'd actually have had to go hire people to do before.
And it turns out, when it's bundled up as compute and offered so you can call on it twenty-four hours a day, charging twenty bucks a month for it is not even a hard ask of customers, because the comparable is legitimately hiring a human to do that task for you, which costs anywhere from minimum wage per hour to some specialists you literally can't find to accomplish the task when you need it. And so, to be clear, I don't think these models are replacing humans. I think they're filling gaps in economic demand that weren't being filled before, and they're creating new categories, and a twenty-dollar-a-month subscription for that today is not a hard ask. But over time, I expect those prices will decline, because the marginal cost of compute will converge closer and closer to zero.
To tie us back: we started off talking about GPUs and kind of the training side. If the UI of AI becomes mostly multimodal data capture on device, what does the training process and system look like? And then what does the model-building process look like? Are we just pumping in, I don't know how you'd quantify the volume, all of the data we're talking about generating here?
Yeah. So look, I think there was a moment in time, about twenty-four to thirty-six months ago, when everyone was like, oh, of course, bigger is better. The bigger the model, the better it's going to be.
Size is everything, and we're going to get to GPT-10, this insanely large two-hundred-trillion-parameter model that will be this big god model, and products will just be increasingly thin wrappers on top of that model. And that reality has not come to pass.
What is instead happening is that the most useful products are combinations of different models. Matei Zaharia has a great recent paper on this from his lab at Berkeley that calls these compound AI systems. They did a pretty systematic study of the most used products today that use generative models.
And it turns out they're not single monolithic models; they're combinations, these compound systems of different models acting in unison. So I'm a big believer in the idea that the future of products is going to be swarms of small models working together to solve a task cheaper, faster, and more efficiently than one big mega-brain that can do it all.
And when those teams of models encounter tasks they can't solve themselves, they will call out to a larger model that might be in the cloud, and ask that model to take on what might be a much harder reasoning problem. Sometimes, when you need to invent the theory of relativity, you do need to go ask an Einstein for help. But most things throughout your day-to-day, you don't need an Einstein to help you with.
Instead, what you want is a really great, efficient team of specialists, acting in close unison with each other, the way companies work together. And so I see a future where everybody's got a personal team of AIs working for them, just the same way companies are often in service of a customer.
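One way to picture such a compound system is a simple router: a small, cheap local specialist tries the task first, and only low-confidence requests escalate to a large cloud-hosted model. Everything below is hypothetical scaffolding, not any particular product's architecture.

```python
# A hypothetical compound-system router: a small local specialist answers if
# it is confident, otherwise the request escalates to a larger cloud model.
# Both "models" are stubs for illustration.

from dataclasses import dataclass

@dataclass
class Answer:
    text: str
    confidence: float

def small_local_model(task: str) -> Answer:
    # Cheap, fast, specialized: fine for most everyday requests.
    if "calendar" in task:
        return Answer("Moved your 3pm meeting to 4pm.", confidence=0.95)
    return Answer("Not sure how to handle this.", confidence=0.2)

def big_cloud_model(task: str) -> Answer:
    # Expensive, slow, general: the "Einstein" you call occasionally.
    return Answer(f"Detailed plan for: {task}", confidence=0.9)

def route(task: str, threshold: float = 0.7) -> Answer:
    first_try = small_local_model(task)
    return first_try if first_try.confidence >= threshold else big_cloud_model(task)

print(route("move my calendar event").text)             # handled locally
print(route("draft a multi-step research plan").text)   # escalated to the cloud
```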
And I think what's going to happen is that these inference workloads will become combinations of quickly attacking a task with that team, and then offloading the tasks they can't solve themselves to bigger and bigger cloud-hosted inference workloads. At the same time, what that means for training is that bigger and bigger training runs may not actually be that important to our everyday lives. What might be really important is training and fine-tuning base models on individual data.
One of the biggest unlocks that ByteDance, which makes TikTok, had was this intense personalization of the algorithm, so that within three swipes of somebody opening up TikTok, they knew what Derrick really wanted to see next. And the concept behind the scenes is basically a personal embedding for everybody, right? Every individual consumer has a personal embedding that understands their preferences so deeply that you're able to serve them what they want next, whether that's the restaurant they want to go to, or just taking an action for them and calling a taxi, if that's what they need.
And I think in that future, a lot of the training, like what currently happens in the pre-training phase of model development, will start to make its way into post-training, what's currently called fine-tuning or customization. Once you have a good-enough base model for most tasks, you can start to fine-tune on what the individual wants, on their individual user data. So you don't need a massive model to keep reasoning about every individual user; you just need a good-enough base model that then learns specifically about you. And that happens in the post-training step.
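To picture that post-training personalization step, here's a toy sketch of a "personal embedding": average the embeddings of things an individual engaged with, then rank new candidates by similarity. The vectors stand in for a real embedding model's output, and none of this is TikTok's or anyone else's actual algorithm.

```python
# A toy "personal embedding" sketch: aggregate a user's history into one
# preference vector, then rank new candidates by cosine similarity.
# embed() returns deterministic pseudo-random vectors as a stand-in for a
# real embedding model.

import zlib
import numpy as np

EMBED_DIM = 8

def embed(item: str) -> np.ndarray:
    # Stand-in for a real embedding model, seeded deterministically per item.
    seed = zlib.crc32(item.encode())
    return np.random.default_rng(seed).normal(size=EMBED_DIM)

def personal_embedding(history: list[str]) -> np.ndarray:
    return np.stack([embed(item) for item in history]).mean(axis=0)

def rank(candidates: list[str], user_vec: np.ndarray) -> list[str]:
    def score(item: str) -> float:
        v = embed(item)
        return float(v @ user_vec / (np.linalg.norm(v) * np.linalg.norm(user_vec)))
    return sorted(candidates, key=score, reverse=True)

user_vec = personal_embedding(["ramen places", "jazz bars", "late-night cafes"])
print(rank(["sushi spot", "jazz brunch", "hardware store"], user_vec))
```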
It seems like we're reliably bad at predicting the future when it comes to certain things. I think most people failed to predict the internet or the smartphone, which might have been the biggest advances we've actually seen. So can I ask you to open yourself up to that same mistake? Where do you think we might be missing some areas for improvement when we're thinking about how AI develops? Do we have blind spots in what we're currently doing that might limit how we develop these
things going forward?

Humans are so good at reasoning by analogy, right? And likely so, because millions of years of evolution have shown us that pattern matching is a really good skill in life.
If your ancestors had seen a lion before and associated it with danger, the next time you see a thing that kind of looks like a lion, you should probably reason about it the way your ancestors learned to over millions of years. And while that serves us well in most of daily life, it actually serves us really poorly in computing, because I think we keep looking to biological metaphors to guide computer design.
For a long time, AI was on this weird research path, where most of the AI research community just believed that the path to unlocking general intelligence was to first figure out how human brains worked, and then replicate that in silicon, right? And so there were decades of research at many DARPA-, DoD-, and university-funded labs with this neuroscience-first approach to inventing computers, which has proven to be mostly a distraction. It turns out that just predicting the next token, the next word a model should say, is a remarkably useful way to attack intelligence and design computers, instead of getting computers to learn like human beings.
I will say, what is now happening is that because transformers are so remarkably effective at what they do, most major industrial labs have doubled down on that architecture. It's not clear whether that will result in multi-step reasoning of a kind that is essentially unconstrained, right? It's not clear that the current architectures will get us all the way to the end goal, not that everyone agrees on a definition of what the end goal is.
But let's say the end goal is a kind of computer that's able to do almost everything we want as humans, take all the drudgery out of our lives, and be the ultimate bicycle for the mind. Forget the bicycle; let's say we want computers to be the interstellar travel of the mind. It's not clear that the current architectures of these models will get there.
But because they work so well, the bulk of research dollars is going to fund optimizations of the current architecture. That's a blind spot, because what we may actually need is a fundamentally new architecture to unlock that interstellar travel for the mind. While there are some promising startups trying to do that, it is a pretty capital-intensive game.
It's not for the faint of heart. And so what we do risk, I think, as an industry, is hitting a point where the scaling laws don't hold, our current architectures actually do plateau, and then we're back to the kind of slowdown we had with AI during the past few winters over the last three decades, where many of the loss curves, the ability of models to predict reality, basically hit walls.
They started off being super promising, and then they hit a wall. So far, the signs are that's not happening. But if there's one blind spot in the future of computing, it's that our current architectures turn out to be insufficient and we haven't invested enough in the alternatives to take over from them.
Now, I'm optimistic, because this time around there's sufficient excitement and ecosystem buy-in from everybody: the hardware providers at the compute level like NVIDIA, the cloud providers, startups, and ultimately investors like us, who are really excited to fund new architectures. So there are people out there thinking about and experimenting with the things that unlock new interfaces and unlock the next phase of computing. That's what we're here to fund. I just wish more people were working on them.
Thanks for listening, everyone. I thought that was a super insightful discussion, and I hope you did too. We're just getting warmed up; we have more episodes to come shortly. But in the meantime, feel free to rate the show and let us know what you think so far.