I think folks are really going to enjoy this conversation. Without further ado, here's Lin.
Well, Lin, thanks so much for coming on the podcast. Really appreciate it. Thank you for having me. Yeah, I've been looking forward to this one for a while, and I feel like it's probably a really interesting time to get to chat with you. Obviously, you guys have built just an incredible platform for developers and folks that are building in the AI space. It's never a dull moment in the AI world, and I feel like there are always things changing. One that I'd be really curious to get your take on: everyone seems to be talking about test-time compute right now,
and o1, and the many models that seem like they will shortly follow. I'm wondering, how do these types of models change the strategy or the product that you offer at Fireworks?
Yeah, I think from the inception of Fireworks, we always had that in mind. What is Fireworks? Fireworks is a GenAI platform with a laser focus on inference, and the top-level goal is to deliver the best quality, lowest latency, and lowest cost in the inference stack. And I would love to dive into the details there, because inference is not just single-model-as-a-service inference. It's not that simple.
You just kind of alluded to test-time inference, the test-time quality scaling law. The inference system we envision in the future is a complex
inference system with logical reasoning, with access to hundreds of small expert models, and so on. So the level of problem we're trying to solve is not, hey, this is just an API call and now it's done. It's never that simple. And so it sounds like you envision a world where the user puts in a query, and at Fireworks you're doing the routing and figuring out what the best-performing model is for whatever that query is. Right. So I think
let's take a big step back to what problem we're trying to solve here. There are a lot of limitations to models, and the limitations come from the fact that models are not deterministic; they're probabilistic by nature. That's not desirable when you want to deliver factual, always truthful results to your end users.
So controlling hallucination is extremely important. And the second is that the complex business problems many of our customers use us for require assembling multiple models, across multiple modalities, to solve. Take the conversation we're having right now: we're not communicating by texting each other.
I'm processing audio information and visual information to deliver a good, interactive experience in the conversation we're having right now. Similarly, a lot of GenAI-based native
applications, consumer or prosumer facing, also need to process across multiple modalities. And even within the same modality, take LLMs: there are many different expert LLM models specializing in classification, summarization, multi-turn chat, tool calling, and they're all
slightly different from each other. So a single model is very limited if you want to solve a real-world problem. And then, last but not least, a single model is very limited in knowledge. Its knowledge is limited by its training data, and training data is finite, not infinite.
So a lot of real-world information lives behind APIs, whether public APIs or proprietary private APIs within enterprises that there's no way to access without working directly with the enterprise. Across the board, we envision that the next barrier is how to go beyond single model as a service. The world needs
a notion called a compound AI system, as in multiple models, across different modalities, along with various APIs holding knowledge, including databases, storage systems, and knowledge bases, working together to deliver the best AI results. Yeah. But as you think about building these more complex compound AI systems, what are the tools you think developers will need to build them effectively? Maybe we can first start by talking about
two polar-opposite design points, right? One design is called imperative, where you have full control of what the workflow is and what the inputs and outputs are; you want to make everything deterministic, right? So that's one kind of design. So you basically design the how. - Yeah. - Right, you dictate that. The other design is called declarative, where
you define the what, the problem this system should solve for you, but then you let the system figure out the how. In general, in the industry, this is not unique to AI; we have these two schools of thought in approaching system design. For example, the database world is full of those kinds of examples.
SQL is an example of the declarative approach. Yeah. Your data engineers or analysts define exactly what they want to retrieve out of the database, and the database management system figures out the best, most efficient execution plan.
And then the ETL process is imperative: you define, here's the source, here's the processing logic, how you trigger the next step, how you do backfill. Very imperative. And there's nothing wrong with either of those approaches; they're just different approaches.
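To make that contrast concrete, here's a minimal sketch in Python; the table and data are made up purely for illustration. The declarative version states what result is wanted and lets the engine decide how to get it, while the imperative version spells out every step.

```python
import sqlite3

# Toy data, purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("us", 10.0), ("eu", 5.0), ("us", 7.5)])

# Declarative: state WHAT you want; the query planner figures out HOW to execute it.
declarative_total = conn.execute(
    "SELECT SUM(amount) FROM orders WHERE region = 'us'"
).fetchone()[0]

# Imperative: spell out HOW to compute the same result, step by step.
imperative_total = 0.0
for region, amount in conn.execute("SELECT region, amount FROM orders"):
    if region == "us":
        imperative_total += amount

assert declarative_total == imperative_total
```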
And Fireworks, based on our PyTorch experience, has a design principle: we want to deliver the simplest user experience and hide all the nitty-gritty details and complexity in the back end as much as possible, without hurting the speed of iteration. It's a very fine balance to strike. So we are leaning more towards a declarative system,
with full debuggability and maintainability. What are some examples, maybe, of where you're on the line and think, "We could do it this way, which would be more declarative, or a different way"? What are some of those trade-offs you think about making? Yeah. So for example, when we started, we started with the lowest-level abstraction, which is single model as a service. Yeah.
Today we provide hundreds of models: across large language models; across audio models, transcription, translation, speech synthesis, and alignment; across vision models, where the source can be PDFs, images, screenshots, and so on; and embedding models and image generation models, and we're adding video models. So those are the lowest-level building blocks.
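As a rough illustration of that lowest-level building block, calling one hosted model looks like any chat-completions request. This is a minimal sketch assuming an OpenAI-compatible endpoint; the base URL and model identifier here are illustrative assumptions, not taken from the conversation.

```python
# Sketch of "single model as a service": one request to one hosted model.
# Assumes an OpenAI-compatible endpoint; base URL and model id are illustrative.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",  # assumed endpoint
    api_key=os.environ["FIREWORKS_API_KEY"],
)

response = client.chat.completions.create(
    model="accounts/fireworks/models/llama-v3p1-8b-instruct",  # illustrative model id
    messages=[{"role": "user", "content": "Summarize this support ticket in one sentence: ..."}],
)
print(response.choices[0].message.content)
```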
And developers can just assemble what they want on top of that. But there are a lot of pieces to assemble, and quality control is difficult, because every week multiple models get released. And they're constantly asking us:
we want to try this model, what's your suggestion, should we even try it? Then how do we keep production stability, how do we do version control? All of this becomes a problem for them. But we started there, so we could understand how the industry is evolving. Then we quickly realized there's a huge usability gap,
especially when it comes to enterprises. There's a huge gap there, and we want to fill that gap. When you're talking to folks, why aren't there more GenAI applications out there? So I think there are a lot of barriers, actually. One is that there's no one model that fits all. That's our observation, and it's the nature of the training process.
The training process is a very opinionated process: you have to pick which subset of problems, across thousands of problems in the world, you care the most about, and you devote the biggest amount of resources and money to acquiring data and making sure the data quality and variety in that space is the best. And then there's the subset of problems you don't care about, and the ones in the middle, right?
Because of that, the end result is a model that will be really, really good at certain things and really bad at certain other things. That's just the nature of model training. We believe the future is hundreds of small expert models, because when you shrink the problem down to a narrow space, it's inevitably much easier for a small model to thrive in pushing the quality.
And as a matter of fact, this is really great for the open source community, because open source base models give you a lot of control to customize your model. There are a lot of model providers that focus on post-training or fine-tuning and deliver specialized models,
giving back to the open source community models that are really, really good at solving certain problems. So we believe the future is hundreds of small models. I think that's where we see the enterprise world moving. They want to have more control.
They want to have steerability. In this world of hundreds of small expert models, I can imagine one world in which there's so many people taking open source models and tweaking them in some ways. And so Fireworks just has a big library of them. And I'm an enterprise, and I have a problem. And you're like, hey, here's a dozen really great models that we'll orchestrate together. I could also imagine there's probably some cases where you tell an enterprise, hey, you should really fine tune for this use case. Or you might actually, maybe in some cases, even pre-train a model to some extent.
How do you think about where that plays out long term? In five years, will enterprises be fine-tuning a lot of models, or pre-training, or how does that evolve? We deeply believe in customization, again, as I mentioned. But the process is, I would say, not straightforward. It's not like, oh, one minute and done; it's longer than that. So we see a fine balance between
fine-tuning and prompt engineering. There's a strong trade-off there, and that's what we're working on: because we believe in customization, we're working on how to make customization extremely easy.
So the balance is the following. Prompt engineering is much more immediate: you can see the result immediately, and it's much more responsive and interactive. That's the nature of it. And we see a lot of enterprises and developers start with prompt engineering: hey, is this model steerable in the direction I want? I can test it out quickly. But then we see thousands of lines of system prompt. At some point, if you want to continue to steer it, you've got to stop,
because you don't know what to change, and it's all going to get lost. Managing a very complex system prompt is a problem in itself. So actually, the key to making customization easy eventually
is to solve that problem, and we are going to launch a product to solve that problem. Interesting. Yeah. And then at that phase,
leaving that aside, at that phase you have a system prompt thousands of lines long, so what do you do? Usually that's the prime time to use fine-tuning and absorb that system prompt into the model itself, because by that time you should have already proven the model is steerable enough to solve your problem. You are able to get the model to follow your instructions, and your instructions are pretty much the data you provide to the model.
And then, you know, it's a good time to do fine-tuning. That usually also means you're evolving from pre-product-market-fit into post-product-market-fit and product scaling. And once you absorb that long system prompt into the model, the model can also run faster and cheaper, and with high quality. So we see that movement; it's a very organic movement.
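As a rough sketch of what absorbing a long system prompt into fine-tuning can look like, here is a chat-style JSONL layout; the file names and format are a common convention shown for illustration, not a specific Fireworks format. Each logged request/response pair that the prompt currently steers becomes a training example, so the tuned model internalizes the behavior and can be served with a much shorter prompt.

```python
# Sketch: turn prompt-engineered traffic into a chat-style fine-tuning dataset (JSONL).
# File names and layout are illustrative, not a specific product format.
import json

# Request/response pairs collected while serving with the long system prompt;
# the desired behavior is now carried by the examples themselves.
logged_pairs = [
    {"user": "Cancel my order #1234", "assistant": "Your order #1234 has been cancelled."},
    # ... more pairs whose behavior the long prompt currently steers
]

with open("finetune.jsonl", "w") as f:
    for pair in logged_pairs:
        record = {
            "messages": [
                # Only a short system line remains; the detailed instructions that used to
                # live in thousands of prompt lines become training signal instead.
                {"role": "system", "content": "You are a customer support assistant."},
                {"role": "user", "content": pair["user"]},
                {"role": "assistant", "content": pair["assistant"]},
            ]
        }
        f.write(json.dumps(record) + "\n")
```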
But still, even before you absorb it into fine-tuning, managing these long prompts is a very challenging thing. - Do you think pre-training ever makes sense for enterprises? - The whole industry is moving towards pre-training being consolidated into the hyperscalers; that's pretty much the reality.
We do see some enterprises pre-train models, because it's core to their business or for other reasons.
But the question we see a lot of enterprises, and even startups, asking themselves is: what's the differentiation? Because pre-training is very expensive, and the ROI has to justify throwing a lot of money and human resources into it. If you can just do post-training on top of a very strong base model, the ROI is much stronger and you're much more agile in testing different ideas. You sit at an interesting vantage point because you see so many people building
on top of Fireworks. What use cases for GenAI do you feel have product-market fit today? I would classify those use cases into different buckets. In summary, I think most of the use cases we've seen successfully getting adoption are human-in-the-loop automation, not human-out-of-the-loop automation yet. My hypothesis is that
a GenAI system has to be human-debuggable, understandable, maintainable, and operable. If a human cannot evaluate it, maintain it, exert influence over it, and operate it in production, it will be really, really hard to get adoption. So because of that, we see a lot of vibrant
product offerings, all different kinds of assistants to humans,
in different categories. We have seen people building assistants to doctors to make their scribing much easier; assistants to teachers, students, and people who want to learn foreign languages, so educational; and coding assistants. Of course, that's a highly competitive space, but we have worked very closely, for example, with Cursor and Sourcegraph and many other great companies.
We have seen medical assistants; apparently there's a shortage of more than a million nurses in North America, and there are a lot of patients who need attention. So that's a whole slew of consumer, prosumer, and developer-facing use cases. That's one bucket. The other is more B2B,
as in automation. Call center automation is a great example: you can build a system to assist human agents, to make them more productive and answer questions better, or to replace the human. We have seen a lot of success in
building assistants to help human agents, right? That's another example, along with a lot of optimizing of business logic, business workflows, efficiency, and so on. And then what about on the model side? What have you noticed in terms of the models that are actually getting used by companies? We see a lot of convergence on variations of the Llama models.
I think that's a testament to the quality of the model. It's a very strong base model, very good at instruction following, and very good for tuning. And of course, Meta backing it is very important, and we see a lot of enterprise adoption there. Obviously, I'm sure something your enterprise customers think a lot about is
evals. And I'm sure you've thought a lot about this. What's the current state of evals? And how do you think about the right building blocks to give enterprises that might have very different use cases and, as a result, very different ways of evaluating success?
What many enterprises do at this stage of evals is vibe-based, which is not surprising, right? For early-stage product development that's fine: I just want to get a sense of how this product
feels with different models and so on, right? But they quickly evolve to the stage where they are consciously building evals, and they know this is an investment area. To stay on top of the state of the art,
they have to be able to evaluate, and once they get into deeper prompt engineering or fine-tuning, being able to evaluate quality is important. They cannot always go to A/B testing. A/B testing is the ultimate way to determine product impact, but it's a longer cycle.
It is funny, we talked to, I think it was Sourcegraph, a customer of yours, who said they don't even bother with evals. They just release it, A/B test it, and they'll know pretty quickly from their developers which model is better. Yeah. But evals are very important, because
from our customers' point of view, they realize that investing in generating a good eval dataset, and adding more and more to it, can give them a very clear view of
what matters and what doesn't matter. Because in this process of catching up with the constantly moving train of small expert models, it's not just that every week there's a better model beating the leaderboard; the models are also getting more and more specialized. And many of our customers start from a product that's open-ended:
they don't have a strong opinion about what kind of questions to expect from their users. Then they start to form clearer opinions and harden their product design, from an open-ended product into specialized product features, and for those product features they want specialized models to solve the problem. So it's a natural evolution of product development. And I guess, you know,
maybe just talking about some of the stuff you guys have built: obviously you're taking the great models that exist in the open source world, and then you also build your own models, right? You built F1. Maybe talk a little bit about why you decided to do that. So we released F1 as an API, and it's a model API; basically, you use F1 as if you are using a new model. But F1 is really a complex logical reasoning inference system.
We haven't talked too much about what's underneath F1, but here's a sneak peek: underneath there are actually multiple models, along with logical reasoning steps that we implement in our system. So actually building that system is very complex. It's not
like regular single-model-as-a-service inference. We also have to solve a lot of quality-related problems, because now you let the model talk with itself, you let the models talk with each other, and what information they communicate drives quality, and they are sensitive to that, right? And how to do quality control in this complex system,
I think the complexity is actually higher than building a database management system, in essence. And then of course, because there are multiple steps involved, your overall inference latency and cost become more interesting. That's our bread and butter, because we're a team really specialized in optimizing for that.
When we started the company, we built the lower level of single model as a service; that's a building block towards this complex logical reasoning stack. - Interesting. And then obviously I think a big part of it is function calling as well, right? What are the barriers, I guess, to getting function calling even better today? It feels like we're still in the nascent stages of that in many ways. - So, function calling. After we launched F1, we had
a waiting list of people wanting to get in, and we talked with many of them. Apparently the majority of their use cases are actually building agents, and they need function calling. So what is function calling? Function calling is basically an extension point for the model to call into other tools to enhance the quality of the answer. But function calling is not just about calling into one tool.
It's actually very complex if you think about it. Usually people use these function calling models in a multi-turn chat context, right? Yeah. The model needs to be able to hold a long context of what the conversation has been, and then use that context to influence which tools are best to call. And oftentimes they also need to call into multiple tools,
up to hundreds of tools in the tool selection. And it's not just calling one tool at a time: oftentimes they need to call multiple tools in parallel and in sequence. So it's a complex coordination plan in one shot. Our function calling is able to do parallel, sequential, complex planning and then orchestrate and execute on that plan.
Of course, precision is important. For example, if I ask the model to generate a chart of the stock prices of the top three cloud providers, it will return the answer quickly and show you a chart. But underneath, there's a search to find the top three, then three parallel function calls to get the stock prices, and then one call into the charting tool to get the chart back.
So this is a very simple example, but it demonstrates the complexity of orchestration. The model needs a very strong capability of understanding, when you plug in your tools, when to call each tool, and driving that precision is very important. That makes the tuning process very complicated, but we have been investing in this space for
about a year. When we released the first function calling model, I think we were ahead of the adoption curve; the use cases hadn't quite caught up yet.
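To illustrate the kind of plan being described, here is a minimal sketch of that stock-chart example. search_web, get_stock_price, and render_chart are hypothetical stand-in tools, stubbed so the flow runs end to end: one sequential search, three price lookups in parallel, then one charting call that depends on all of them.

```python
import asyncio

# Hypothetical stand-in tools (not a real API), stubbed so the sketch runs end to end.
async def search_web(query: str) -> list[str]:
    return ["AMZN", "MSFT", "GOOGL"]

async def get_stock_price(ticker: str) -> float:
    return {"AMZN": 200.0, "MSFT": 430.0, "GOOGL": 175.0}[ticker]

async def render_chart(prices: dict[str, float]) -> str:
    return f"chart({prices})"

async def top_cloud_providers_chart() -> str:
    # Sequential step: search to find the top three providers.
    tickers = await search_web("top 3 cloud providers")
    # Parallel step: three independent price lookups issued at once.
    prices = await asyncio.gather(*(get_stock_price(t) for t in tickers))
    # Sequential step: one charting call that depends on all previous results.
    return await render_chart(dict(zip(tickers, prices)))

print(asyncio.run(top_cloud_providers_chart()))
```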
And how do you think about why you decided to build that versus... I'm sure if you waited, at some point there'd be a good open source model with some of those capabilities. How do you figure out when to build things in-house on the model side? Yeah, so we have built on top of the open source community, right? That's absolutely the case. And we are betting on that, because we believe the hundreds of small expert models will come from the open source community.
So that is directly aligned with our vision. But at the same time, we're also heavily investing in the compound AI system, where it's all about composing those hundreds of small expert models to solve a complex business task in a very easy way. And the critical part of this composability lies in the layer of
orchestration: the model that is intelligent enough to call into different tools. And the general way of thinking about tools here is that each individual small expert model is also a tool,
in addition to that particular large language model. So it becomes a critical ingredient tying everything together, and we will invest in strategically critical areas where we cannot just wait and see what happens. Where do the reasoning models end up fitting in? I mean, at least the ones we've seen today, I think of as bulky and generally good at reasoning across all kinds of things. Do we end up with small, expert,
test-time-compute-intensive models as well? Or do you imagine you will only route to those in very specific instances? As you think about these compound systems, I'm just curious where these o1-like models fit into that.
I think even for reasoning, there are different paths to solving reasoning problems, and there will be different models specializing in different paths. One path is to have a very strong base model itself do self-inspection, and a lot of techniques there have already been discussed: chain of thought, tree of thoughts,
backtracking, and all these different techniques. So that's one. And there will be a new set of models that can do logical reasoning not in the prompt space,
but in latent space. This is the thing I'm very excited about. Because you can think about it: when we think, we don't have to put our thinking process into words. We can, but we don't have to. I'm pretty sure when we process a problem in our heads, we're probably thinking in a different space. It's similar for the model. There's a lot of active research happening on how to make the
thinking process much more efficient and much more native to the model. So I'm very excited about that kind of research. There will be other flavors of logical reasoning too. We don't want to be very opinionated about which one is going to win; instead, we're going to integrate all these different flavors into our logical reasoning process.
I love that. I guess outside of what you're doing at Fireworks, what do you feel like are some of the other big unsolved problems in AI infrastructure today? I think we have seen a lot of movement into building agentic workflows. And again, I think we're still at the early stage of figuring out what is the right user experience.
and what the right abstraction is, as a whole industry. What should we hide behind the system, and what should we expose to developers? We're still at a very early stage, and I think it's very experimental right now. But where the abstraction lies will determine the complexity of
the infrastructure hiding behind the abstraction. I think we're now starting to form an opinion around that, and in a month or two, we're going to GA F1. Our thinking is that building F1 is our own exercise to understand the system abstraction and the complexity
of building a logical reasoning engine. From there, when we GA, we want to expose developer-facing plugins so developers can build their own F1s. So I'm very excited about that direction. That's super cool, because obviously you learn the tools and abstractions that are necessary by building it yourself, and then you basically allow developers to recreate that. I love that as
an approach. It feels like on the hardware side, obviously everyone's using NVIDIA, but there are a lot of competitors that have been popping up, and I'm curious what you make of that space and the viability of some of those efforts. I think you guys support AMD chips for inference on the platform. When does that make sense for developers, and how do you see that evolving? In general, we see
a kind of scarcity of developers who understand hardware and low-level hardware optimization. And we also see a very
big change in the hardware development cadence. Before, every three years there was a new hardware SKU; now every year there's a new hardware SKU from each hardware vendor, and the whole hardware space is moving very fast. So how do you get access to the best hardware? And by best I really mean, again, there's no one size fits all; it really depends on your workload pattern. Even for serving the same model, there's no single best hardware,
because your workload distribution is going to determine where your biggest bottleneck is, and different hardware SKUs are going to be best at removing certain bottlenecks. So at the Fireworks layer, we absorb
the burden of integrating and determining which hardware is best for which kind of workload. Even for one workload, when you have mixed access patterns, we can route to different hardware. So we want to really alleviate that concern and burden for developers. They should focus on building product, and we will take on the complexity of managing and optimizing the hardware. Yeah.
It seems like you're obviously super focused on these compound systems. There's always going to be some set of folks that just want to call a Llama model as-is and run it. In that world, it feels like for a while there was a lot of competition on the inference side: a new open source model would come out, everyone would offer it, and prices would keep getting lower. In that game, do you think the hyperscalers are eventually best suited to win just straight inference on an open source model? How does that develop?
That's a very complex question to answer. I think all the hyperscalers want to be Apple. They want to build an iPhone; they want to build a vertically integrated stack,
because they can. But again, the direction we're heading is that we feel the future lies in hundreds of small expert models, and we want to harness that energy and build a compound AI system that leverages the best models to solve a complex problem.
So in my view, what's the hyperscalers' biggest advantage? Their biggest advantage is in solving problems that require a lot of resources in terms of money and people: building data centers, acquiring power, installing tons and tons of machines, deploying them, lighting them up, and turning up large-scale storage and compute.
Cloud providers, please solve those problems, right? Those are massive problems of scale. We specialize in solving problems that require a combination of engineering, partnership, and deep research, and then we can deploy at scale. I think inference is like that. I worked at Meta; it's not like Meta has a thousand-person inference team, right? Because once you have
a system that is horizontally scalable with high production quality, the rest is just letting it scale. That's the beauty of designing a highly scalable system. So in that sense, I think this is squarely within our strength and expertise. And again, the infrastructure is not simple;
it's actually evolving into this compound logical reasoning inference system,
and there's a lot of complexity in building that. That complexity is not solved by throwing people and money at it. So, yeah. Yeah. I guess with these smaller expert models, one thing I wonder is, as these models get smaller, there's a big trend toward running them locally too. I'm curious how you see that playing out over the next few years, and whether that becomes part of the Fireworks platform as well. Yeah.
I've seen a lot of arguments for running models locally, for two reasons. And we should also talk about what local even means, right? One reason is cost saving, because you need to pay for GPUs in the cloud, and running on a desktop you don't need to pay. The other is privacy, right? I have different thoughts on that. I think
it makes sense to offload compute from cloud to desktop. For many applications like Zoom, for example, that saves a lot of cost, right? But offloading compute from cloud to mobile is a different story, because a mobile phone, even though it's getting beefier and beefier, has very limited power, right? For a mobile
application, a lot of application metrics are closely monitored, including cost, time, and power consumption. All of these affect the application's ratings, user adoption, and experience. Because of that, the models you can practically deploy to mobile are tiny:
1B, 10B, very, very small. And those models have limited capability. So I think the gap between mobile and cloud is extremely big. Offloading to desktop, I think, makes sense for many consumer or prosumer-facing applications. But then there's also the other argument: hey, you know, it's more private.
And privacy is interesting, right? Because we have a lot of personal data that's already in the cloud; most of our personal data is not on a local disk anymore. So, hey, does that even make sense for privacy? So I think that's a separate concern. Yeah, super interesting. I guess on the open source side, obviously you used to work at Meta. It feels like they've been providing a huge service to the ecosystem by training these larger and larger models.
How far do you think they go on continuing to scale spend on open source models before, at some point... I mean, I guess everyone's asking this of all the players, not just the open source players. But how do you think about the extent to which it makes sense to keep pushing on pre-training spend? I think that's a good question for Zuck.
No, I'm joking. We work very closely with the Meta team; of course, we came from Meta. And we actually co-design with them: beyond the open source Llama models, they're also building a standard called Llama Stack.
The intention of Llama Stack is to standardize the tool stack around the Llama models. And we co-design it with them because we have a lot of
information coming from our customers that can help them make design choices. So I think the ambition from Meta is to build this Android-like world where everything is nicely standardized, you plug in different components, and it's very easy to adopt. And from my vantage point,
Llama 4 is coming soon, right? So there will be continuous investment
from Meta in that training; I haven't seen that stop. Of course, the general question of when model providers would stop or invest less in pre-training: it's when there's not much lift, right? And why would there not be much lift? Because we're hitting a data wall, where everyone crawls the same internet data, we have exhausted synthetically generated data,
and we have exhausted combining multimedia and text data, and then there's less return. Before we hit that wall, I think there will be continuous investment. When do you think we'll hit that wall? I think the gaps are definitely getting longer and longer, like from OpenAI's release cadence point of view, right? So...
And I definitely see a lot of investment starting to shift from pre-training to post-training, and from post-training to inference. So I think we're already hitting a soft wall, not a hard wall.
So I think in general, as an industry, the ROI is transitioning from pre-training to post-training to inference. - As you think about all the sets of tools required to make an agent successful, there's one world where, as Fireworks, you can be a one-stop shop and build all of them. There's another where you could partner and say, look, we're not going to build this part of the thing. How do you think about that? - Yeah, so we are always very compatible with
the imperative agentic tools, right? Most notably LangChain. And we have been a very strong partner with them, because we are not trying to compete with them.
LangChain is doing great work in what they are building: huge community following and adoption, really awesome. We just want to simplify the next level above single model as a service, by composing multiple models wherever it makes sense to solve the problem much better. We're not changing our position; we're plugging into LangChain.
That's one example, and of course there are many other examples in the community. Yeah, makes a ton of sense. As you think about the competitive landscape today, I imagine when people think of Fireworks, they probably also lump you in with Together AI, and I know Databricks talks a lot about these compound AI systems too. How do you think about what differentiates Fireworks today, and then five years from now, what will ultimately determine who captures this really interesting space?
Yeah. So compound AI, we didn't coin that term; it was coined by Berkeley. And we know that Databricks is also looking into that space. I think it's great. It's a new category that we're defining, right? So definitely multiple players will be in the same space, and we are very happy that Databricks is also thinking about that problem.
So I think that space is very meaningful because it's a very complex space for application developers, product developers, engineers to get into. And there will be a rich set of tools that emerge in that space to make development much more efficient. And today we're not there yet. So we're determined to be a key player in that space.
Other than that, we are not in the business of being a GPU cloud, for example, and providing access to cheap GPUs. That's not our play at all. We actually build on top of GPU clouds to offer a complex inference stack. I guess one thing that's interesting about the company is that you started Fireworks, what, a few months before ChatGPT?
And so I'm curious, as you think about the original vision for the company and then where you've come to today with this GenAI craziness, how much of it was the same sitting there in September '22? How much of it have you had to change as the world has changed so much in the last few years? Yeah, this is a very interesting question. When I started Fireworks, there was actually an active debate: is AI here or not?
Because it was before broad awareness of foundation models, and it was, oh, there are so many fragmented AI applications, and the data is not there, AI is not there. But we clearly saw that AI was coming, right? Usually hyperscalers are
three to five years ahead of the whole industry, and Meta is massively AI-powered. And through PyTorch adoption, it was clear to see that the wave was coming. So that's the timing around Fireworks. And then GenAI, ChatGPT, heavily skewed the adoption curve of AI toward a special kind of AI, so it's very interesting.
GenAI is a different beast, not because GenAI is magical, but because it fundamentally changed accessibility. Pre-GenAI, with traditional machine learning or early deep learning models, any company investing in AI had to first
hire a machine learning team, because they had to train models from scratch. There was no alternative; there was no pre-trained model you could build on top of.
And that means this team had to spend a lot of time curating data, and then on the training process, because training from scratch takes a long time. So they had to have the resources to hire that very scarce talent and throw money into training. - And was the original Fireworks product built for those people? Because that was originally what it felt like was going to be needed to get adoption. - Yes, yes. When we started, we were thinking about that cohort.
And then GenAI
changed the landscape, right? You don't have to hire a big team to curate data, because the beauty of GenAI is that it creates a foundation model that absorbs the majority of the knowledge, and then you just build on top of it as-is, or you tune it, right? You have thousands of samples, you tune. So then you either don't need a machine learning team at all, you just need the application product team to build on top of it directly, or you have a small machine learning team.
Right. So this is a massive, massive unblocking of accessibility to that technology. And that's why we see the adoption curve take off like crazy: the accessibility is fundamentally different before and after.
Because of that, we focused there, because the pull is much stronger in that direction. The other side effect is that all these GenAI models are PyTorch models. Yes. We wrote PyTorch code for years, and we operated
large, complex PyTorch models in production with high-volume traffic for years. That's what we're really good at doing. Yeah, very well situated for that. You mentioned that when you're at a place like Meta, you get a window into the future before it's obvious. Was there something you saw when you were at Meta where you thought, "God, this is just going to be clearly so big" shortly thereafter? I think it was more that AI was going to be a very big wave. That was clear.
I worked at Meta for seven years, right? It was an interesting time. When I joined Meta, they were going through the tail end of the mobile-first transition, and the industry was also going through mobile first, shifting from desktop to mobile. Mobile first enabled consumers to access applications everywhere, and that drove up adoption and engagement. And because of that, it produced a lot of data,
and a lot of that data powered AI, right? Then the whole industry also followed mobile first, driving a lot of data, powered by AI. So that trend was very clear. What AI research do you pay attention to, maybe even outside of the core day-to-day of Fireworks? I usually pay attention to two kinds of research. One is model-system co-design research, because there is an
interesting organizational division: there are researchers who focus on quality only, and there are system builders like us who focus on delivering the best quality with low latency and low cost, so we solve the system problems. But oftentimes the best ROI comes from thinking about those together. That's how we operated at Meta: the research teams and the infra teams sit very closely together to discuss trade-offs and co-design.
So there's a lot of research in the co-design space, and I think it's going to be very appealing to find the best design point across quality, latency, and cost as a three-dimensional optimization. The other kind of research is fundamentally different, disruptive research. As in, the Transformer as a technology is overdue for disruption.
So where is the next generation of Transformers? That's going to change how we train models, how we do inference, and so on. That's very interesting. And also, in the agentic world, how different agents are going to communicate with each other. As I mentioned, thinking in latent space, that kind of
very new thinking is very, very interesting. One challenge I imagine in building an infrastructure company is that the pace of change is so fast, both the improvement in models and how enterprises are thinking about and actually using these things. I can imagine one school of thought would be: it's changing so fast, we'll just quickly keep updating our core tools, and at some point it will settle down and then we can build the standard set of tooling.
How do you think about that tension of building for however folks are doing things today without knowing exactly where it goes? Maybe in two or three years models have all these other capabilities and we're designing systems in all these other ways. Yeah, that's a really good question. That's what we think about all the time. We don't want to keep chasing what's coming, because chasing is exhausting. We always want to stay on top of the curve.
I think model capabilities are definitely evolving; they have evolved a lot from two years ago, especially if you just look at OpenAI's model capabilities from GPT-3 to today. There are big differences. But fundamentally, a few trends don't change. One is, again, anchoring back to our vision: we believe the direction is specialization and customization,
and that doesn't change however the models' core capabilities evolve,
because we just don't believe that one size fits all the different workloads, with proprietary data and so on. We believe there's a much better solution if you can customize, if you can steer, if you have control. So that's why we built our stack to enable that in an easy way.
Specifically, on top of our inference engine, which is one size fits all, we offer FireOptimizer, which is one size fits one. It takes your inference workload as input, along with your customization objective, and spits out an inference deployment configuration, potentially along with an adjusted model, and you have control over whether you want to deploy that or not.
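As a rough sketch of the shape of that loop, here is a hypothetical illustration; the class names, fields, and heuristic are made up for this sketch and are not the actual product interface. An inference workload and a customization objective go in, and a deployment configuration, optionally with a tuned model, comes out.

```python
# Hypothetical sketch of a "one size fits one" optimization loop; names and the
# heuristic are illustrative only, not the actual product interface.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Workload:
    requests_per_second: float
    avg_prompt_tokens: int
    avg_output_tokens: int

@dataclass
class Objective:
    priority: str           # e.g. "latency", "cost", or "quality"
    latency_budget_ms: int

@dataclass
class DeploymentConfig:
    hardware_sku: str
    replicas: int
    precision: str
    tuned_model: Optional[str]  # set when fine-tuning is part of the recommendation

def optimize(workload: Workload, objective: Objective) -> DeploymentConfig:
    # Toy heuristic standing in for a real search over hardware, batching,
    # precision, and tuning choices.
    latency_bound = objective.priority == "latency" or objective.latency_budget_ms < 500
    return DeploymentConfig(
        hardware_sku="H100" if latency_bound else "A100",
        replicas=max(1, round(workload.requests_per_second / 50)),
        precision="fp8" if objective.priority == "cost" else "bf16",
        tuned_model=None,  # the user decides whether to deploy a tuned model at all
    )

print(optimize(Workload(120, 800, 200), Objective("latency", 300)))
```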
So this is just a process where we close the loop and make it super easy to customize. I don't think that's going to change in the future. Totally. Just what you're customizing may change, but the way you go about it stays; I think that makes a tremendous amount of sense. Well, we always like to end our conversations with a quick-fire round where we get your take on a standard set of questions.
So maybe to start with, Lin: one thing you think is overhyped in the AI world right now, and one thing you think is underhyped. - I think what's overhyped is the perception that GenAI is magical, that it's the recipe for all problems, that we should just ask it any question and it will have the right answer. I think that's going through a correction right now. And again, that's where we believe
there's no one magical model that solves all problems in the best way, or even in the correct way. What's one thing you've changed your mind on in the AI world in the last year? I think my hypothesis in
building this company was about how this GenAI technology would be adopted. In my mind, our go-to-market strategy was always sequential: startups are GenAI native, so they will be on the frontier of adoption; then we'll work with digital natives, who have very strong engineering resources and tend to be more tech-forward; and then traditional enterprises, because they have other kinds of problems to solve.
But now we're working with all of them at the same time. And to me, it's a little bit crazy. Of course, the use cases are not all the same, but kind of...
The adoption curve is happening all at the same time. It's very different from what I was imagining. On the application side, it feels like the normal rules of how long it takes to land one big enterprise, and then land a few more after, are all out the window. Some of these application companies are scaling so fast in large enterprises. I think there's just a tremendous appetite for it. That's right. That's why I feel like we're in this revolutionary wave. A lot of things will be done differently. Not just
the applications will be different and the technology adoption curve will be different, but even how to think about go-to-market will be different. The sales cycle is much shorter,
and people are open to thinking about the procurement process differently. So there's a very interesting ripple effect from this massive transformation. - Do you feel like there's any difference in what you have to build for enterprises versus the startup folks, or are the needs pretty similar? - My observation is that startups typically
want access to the low-level abstractions, when we talk about abstraction, because they want to tinker a lot, right? Assemble a lot of things. And traditional enterprises, I'm not saying this in an exhaustive way, but typically they want a higher-level abstraction. They don't need to solve for the low-level details; they'd rather not pay attention to them.
Even at Meta, we had different abstraction layers for different kinds of teams, because they want to pick and choose, right? Usually two levels of abstraction will suffice; they're happy to at least have two choices. And we see a bit of that. It must be interesting to build both at the same time, you know, as you're rapidly scaling. Right, right. We need the low-level abstraction for ourselves anyway, so...
To us, it's not additional overhead, but we see that adoption will happen at different layers. Yeah. I know it's probably... I won't ask you to pick a favorite, but what is one favorite application running on Fireworks today? This is a tricky question.
I would say Cursor. But we see a lot of similar companies in the productivity space, right? Not just Cursor, but also Sourcegraph, Zed, Cognition, Factory. They're all very forward-thinking companies, and I think in that space it's still early.
But you know, we're very excited about that whole space. Do you think someone will spend a hundred billion dollars on training a model in the next few years? Oh, maybe. It only makes sense if they are training using a fundamentally different model architecture that's so disruptive it's worth that investment. Yeah.
You already mentioned the coding space and some of those startups, but is there any other AI startup or space that you're really excited about outside of Fireworks right now? Again, I think the agentic world hasn't been fully figured out yet. Have you seen any applications do that well? We have seen early applications, for example
digital SDRs and digital marketing. We have seen good startups getting great adoption there. But I do think in the future there will be a lot more complexity in that space,
and we're still early. - Well, this has been a fascinating conversation. I'm sure folks will want to pull on all sorts of threads, so I want to leave the last word to you. Where can folks go to learn more about you and about what you're building at Fireworks? The floor is yours. - Yeah, definitely.
For the benefit of our developer community, we offer a self-serve platform. Very simple: just go to fireworks.ai and you can access our playground and the hundreds of model capabilities we have. And feel free to reach out to me as well; connect with me on LinkedIn. I'd love to hear your use cases, your challenges, and pain points. Amazing. Well, thanks so much, Lin. This was a fascinating conversation.