People
L
Lin Qiao
S
Swyx
Host
A podcast host and content creator focused on electric vehicles and energy.
Topics
Lin Qiao argues that compound AI systems built on open-source models, by integrating multiple models and APIs and optimizing for specific workloads, can ultimately surpass closed-source AI systems on quality, speed, and cost. Using the Fireworks AI platform as an example, she details the design philosophy and technical advantages of its distributed inference engine, model optimizer, and compound AI systems, and emphasizes the team's expertise and customer-first service ethos. She also notes Fireworks AI's advantages in inference speed and cost, and its contributions to the open-source community. Swyx expresses doubt about whether open-source models can ultimately surpass closed-source models; he argues that closed models benefit from their generality, and that the specialization of open-source models may leave them behind in some respects. He also asks about Fireworks AI's competitive strategy and its balancing of model quality, latency, and cost. Alessio focuses on Fireworks AI's technical details, such as how the distributed inference engine works, model quantization methods, and comparisons with other inference platforms. He also explores model evaluation methods and how evaluation results feed into product decisions.

Deep Dive

Chapters
Fireworks AI's CEO, Lin Qiao, shares the company's two-year journey, starting with its roots in the PyTorch team at Meta. Initially envisioned as a PyTorch cloud platform, Fireworks AI transitioned to focus on generative AI due to the rise of large language models and the increasing demand for inference.
  • Fireworks AI initially planned to be a PyTorch cloud platform.
  • The release of ChatGPT in late 2022 influenced Fireworks AI to specialize in generative AI.
  • Fireworks AI focuses on inference due to its importance in generative AI applications.

Shownotes Transcript

Translations:
Chinese

We are so back. This is Charlie, your AI co-host. We're still between studios, so today we're filming on location at Fireworks AI HQ in Redwood City. We are here for a deep dive into the massive compound AI wave, coined by Databricks co-founders Matei Zaharia and Ali Ghodsi, three-time Latent Space guest Jonathan Frankle, and others, with Lin Qiao, co-founder and CEO of Fireworks AI, which is the leading compound AI platform after two red-hot funding rounds with Benchmark and Sequoia Capital, and an incredible customer list from Superhuman to Cursor to Quora to HubSpot. We normally strive to have great offline relationships with the guests we bring you, but this case is more special than most.

After Lin chatted with us briefly at the NeurIPS 2023 podcast, Swyx has had the privilege of advising the company for a number of its launches: from FireAttention, which has 15x higher throughput than vLLM, to FireFunction, FireOptimizer, and the real-time audio launch, to its most recent launch of f1, one of a new batch of long-inference reasoning models built atop open models to compete with OpenAI o1, alongside Nous Forge API and DeepSeek R1 as well, which were released after this recording. In Latent Space news, we now have a stacked meetup calendar, from the third AI Engineer meetup in London, to the AWS re:Invent listener meetup in Las Vegas, to the new Latent Space LIVE micro-conference at NeurIPS 2024.

The NeurIPS event in particular will be hosted both online and in person, and for the first time ever, we are confirming three prize categories for our speakers: one, Too Hot for NeurIPS, for papers that are too new or wrongly rejected from NeurIPS; two, Best Papers of 2024, for survey talks nominating the best papers of the year in a given domain; three, Oxford-style debates on hot topics in the AI research and engineering community. Head to lu.ma/LSLIVE to sign up and apply to speak or sponsor; limited in-person tickets are available now. Lastly, we are still taking listener questions for our end-of-year recap. Head to speakpipe.com/LatentSpace to submit questions and messages for a chance to appear on the show. Watch out and take care.

Hey everyone, welcome to the Latent Space podcast. This is Alessio, partner and CTO at Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol AI.

And today we're in a very special studio, inside the Fireworks office, with the CEO of Fireworks AI, Lin Qiao. Welcome.

Yeah, thanks for having me.

It's unusual to be in the home of the startup. It's also, I think, our relationship is a bit unusual compared with our other guests.

Definitely, yeah. I am super excited to talk about very interesting topics in this space with both of you.

You just celebrated your two-year...

Anniversary, yes. Yeah, it's quite a crazy journey. We look back and share all the crazy stories across these two years, and it has been super fun. All the way from, we experienced the Silicon Valley Bank run, right? To, we deleted some data that shouldn't be deleted, operationally.

We went through massive scale, where we were actually busy getting capacity. We learned to work as a team with a lot of brilliant people across different places joining the company. It has really been something.

When you started, did you think the technical side would be harder, or the bank run and the people side? There are a lot of amazing researchers who want to start companies, and they think the hardest thing is going to be building the product, and then you have all these different curveballs. So were you surprised by it? What has been your experience?

Yeah, to be honest with you, my focus has always been on the product side, and then after product, go-to-market. And I realized the rest has been so complicated, operating a company and so on. But I don't overthink it; I just kind of manage it. Somehow I don't think about it too much, and I solve whatever problem comes our way.

So I guess let's start with the prehistory, the pre-Fireworks phase, the initial history of Fireworks. You ran the PyTorch team at Meta for a number of years, and we previously had Soumith on to talk about it. I think we're all just very interested in the history of GenAI at Meta. Maybe not that many people know how deeply involved FAIR and Meta were, prior to the current generation.

My background is deeply in distributed systems and data management systems. I joined Meta on the data side, and I saw this tremendous amount of data growth, which cost a lot of money, and we were analyzing what was going on. It was clear that AI was driving all this data generation.

So it was a very interesting time, because when I joined Meta, Meta was finishing the mobile-first transition and then starting AI-first. And there's a fundamental reason for that sequence, because mobile-first gave a full range of user engagement that had never existed before.

And all these users generated a lot of data, and this data powers AI. So the whole industry was also going through this transition. When I saw all this AI powering all this data generation, and looked at where our AI stack was, there was no software, there was no hardware, there was no people, there was no team. I wanted to step up and help this movement.

So when I started, it was a very interesting industry landscape. There was a proliferation of AI frameworks happening in the industry. But all the AI frameworks focused on production, and they used a very static way of defining the graph of the neural network, and used that to drive model iteration and productionization. PyTorch was completely different. Soumith, who started it, was the user of his own product; as a researcher he faced so much pain using the existing frameworks.

This is really hard to use; I'm going to do something different, for myself. And that's the origin story of PyTorch. PyTorch started as a framework for researchers who don't care about production at all, and then it grew in terms of adoption. The interesting part of AI is that research is upstream of production.

There are so many researchers, across academia, across industry, and they innovate and put their results out there in open source, and that powers downstream productionization. So it was brilliant for Meta to establish PyTorch as a strategy to drive massive adoption in open source, because Meta internally is a PyTorch shop. So it's quite a flywheel effect.

So that's the story behind PyTorch. When I took it on, the mission was to make PyTorch the framework for both research and production. No one had done that before, and we had to think about how to architect PyTorch so it could really sustain production workloads: the stability, the reliability, all these production concerns that were never concerns before.

Now they were concerns, and we had to adjust the design and make it work for both sides. And that took us five years, because Meta has so many AI use cases, all the way...

...from ranking and recommendation, as ads ranking powering the business, to newsfeed ranking and video ranking, to site integrity attacking bad content, to using AI for all kinds of effects: translation, image classification, object detection, and also AI running on the server side, on mobile phones, on AR/VR devices, the whole spectrum. So by that time we basically managed to support AI ubiquitously, everywhere. But interestingly, through open-source engagement we worked with a lot of companies, and it was clear to us that this industry was starting to take on the AI-first...

...transition. And of course Meta itself is always ahead of the industry. It felt like when we started this journey at Meta, there was no software, no team; and for many companies we engaged with through PyTorch, we felt the same pain. That's the genesis of why we felt like, hey, if we start Fireworks and support the industry going into this transition, it will be huge in impact.

Of course, the problems the industry is facing will not be the same as Meta's, which is so big, right? Meta is skewed towards extreme scale and extreme optimization; the industry is different. But we felt we had the technical chops, we had seen a lot, and we'd love to drive that. So that's how we started.

When you and I chatted about the origins of Fireworks, it was originally envisioned more as a PyTorch platform, and then later it became much more focused on generative AI. Is that fair? What was the customer discovery here?

Right. So I would say our initial product spec was that we should be the PyTorch cloud, because PyTorch is a library and there was no SaaS platform to enable AI workloads, even in...

...2022?

And you would know that. Well, cloud providers had some of it, but it wasn't first-class, right? Because in 2022, TensorFlow was still massively in production, and this was all pre-GenAI. PyTorch was getting more and more adoption, but there was no PyTorch-first SaaS platform out there. At the same time, we are also very pragmatic, seasoned people.

We really wanted to make sure, from the get-go, that we got really, really close to customers. We understand their use cases, we understand their pain points, we understand the value we deliver to them. So we wanted to take a different approach.

Instead of building a horizontal PyTorch cloud, we wanted to build a verticalized platform first, and we talked with many customers. Interestingly, we started the company in September 2022, and then in October and November OpenAI announced ChatGPT, and then boom. When we talked with any customer, they were like, can you help us work on the GenAI aspect, right? Of course, the open-source models were not as good at that time, but people were already putting a lot of attention there.

Then we decided, if we're going to pick a vertical, we're going to pick GenAI. The other reason is that all GenAI models are PyTorch models; that's another reason. And we believed that, because of the nature of GenAI, it's going to generate a lot of human-consumable content, it will drive a lot of consumer-facing and developer-facing application and product innovation, guaranteed.

And at the beginning of this, our prediction was that for those kinds of applications, inference is much more important than training, because inference scale is proportional to the end-user population, while training scale is proportional to the number of researchers. Of course, each training run can be very expensive. Although PyTorch supports both inference and training, we decided to focus on inference.

So yes, that's how we got started, and we launched our public platform in August last year. When we launched, it was a single product: a distributed inference engine with a simple API, an OpenAI-compatible API, with many models. We started with LLMs and then added a lot of models. Fast-forward to now, we are a full platform with multiple product lines, and we can dig deep into what we offer. But it's been a very fun journey in the past two years.
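As a rough sketch of what "OpenAI-compatible" means for developers, the helper below builds a standard chat-completions request body; switching providers is then just a different base URL and model id. The provider table and model names here are illustrative only, and no network call is made.

```python
# Sketch of the portability an OpenAI-compatible API gives you: the request
# body has the same shape regardless of provider, so switching providers is
# just a different base URL and model id. Names are illustrative only.

def build_chat_request(model, user_prompt, temperature=0.2):
    """Assemble an OpenAI-style /chat/completions request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_prompt}],
        "temperature": temperature,
    }

# The same helper serves either provider; only the config differs.
providers = {
    "openai":    {"base_url": "https://api.openai.com/v1", "model": "gpt-4o-mini"},
    "alt-cloud": {"base_url": "https://example.invalid/v1", "model": "open-llm-8b"},
}

for name, cfg in providers.items():
    body = build_chat_request(cfg["model"], "Hello!")
    print(name, body["model"], sorted(body))
```

The point of the compatibility standard is exactly this: application code stays identical while the serving backend changes underneath.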

What was the transition like? You started focused on PyTorch, people who understand the framework, getting it live. And now maybe most people that use you don't even really know much about PyTorch at all; they just say, run my model. From a product perspective, what were some of the decisions early on? In October, November, were you just like, hey, most people just care about the model, not about the framework, we're going to make it super easy? Or was it a more gradual transition to...

...the model library you have today? So our decisions are all based on who our ICP is. And one thing I want to acknowledge here is that GenAI technology is disruptive; it's very different from AI before GenAI. It's a clear leap forward, because before GenAI, the companies that wanted to invest in AI had to train from scratch.

There was no other way; foundation models didn't exist. So that meant starting a team first, hiring a team capable of crunching data. And there's a lot of data...

...right? Because, training from scratch, you have to prep all of it. And then they need GPUs to train on, and then you have to start managing GPUs. So it becomes a very complex project; it takes a long time, and not many companies can afford it, actually.

And GenAI is a very different game right now, because there are foundation models; you don't have to train anymore. That makes AI much more accessible as a technology. As a developer or product manager, even non-developers, they can interact with GenAI models directly. So our goal is to make AI accessible to all application developers and product engineers. That's our goal.

So then, getting them into building models doesn't make any sense anymore with this new technology. Building easy, accessible APIs is the most important thing. Early on, when we got started,

we decided we're going to be OpenAI-compatible. It's just very easy for developers to adopt this new technology, and we will manage the underlying complexity of serving all these

models.

OpenAI has become the standard.

Even as we're recording today, Gemini announced OpenAI-compatible APIs just the other night. Interesting. Yeah.

That's interesting, because we are working very closely with Meta as one of our partners. Meta, of course, is very generous to donate many very, very strong open-source models, and we're expecting more to come. But they have also announced Llama Stack, which is basically a standardized upper-level stack built on top of the Llama models. So they don't just want to give out models and let you figure it out.

They want to build a community around this stack and build a kind of new standard. I think there are interesting dynamics at play in the industry right now. One is standardization around OpenAI, because they created the de facto standard; another is standardization around Llama, because it's the most used open-source model. So I think there's a lot of fun working at this time.

I've been a little bit more doubtful on Llama Stack; I think you've been more positive. Basically, it's just like a Meta version of whatever Hugging Face offers, you know, or TensorRT, or whatever the open-source opportunity is. But to me, it's not clear that just because Meta open-sources Llama, the rest of Llama Stack will be adopted, and it's not clear why it should be adopted. It's very...

...early right now. That's why we work very closely with them and give them feedback. The feedback to the Meta team is very important, so they can use it to continue to improve the model and also improve the higher-level stack. I think the success of the stack will depend on community adoption, and I know Meta would like to work with the broader community, but it's very early.

One thing is that after your Series B, and I remember being close to you for the Series B announcement, you started betting heavily on this term of compound AI, a term that we've covered quite a bit on the podcast. I think it's been getting a lot of adoption from the Databricks and Berkeley people and all that. What's your take on compound AI?

Why has it resonated with people?

Right. So let me give a little context on that space. Yes,

because previously there was no label for it, and now it's part of the language.

It's been a very organic evolution. When we first launched our public platform, we were a single product: a distributed inference engine, where we did a lot of innovation, customized CUDA kernels, different kinds of disaggregated inference execution, built all kinds of caching.

So that is one product line, the first. The interesting problem is, we thought the inference engine as we designed it was one-size-fits-all, right? We have this inference endpoint, everyone comes in, and no matter what form or shape of workload they have, it just works. That's great.

But the reality is, as we worked with many more customers, we realized customers have different kinds of use cases. The use cases come in all different forms and shapes, and the end result is that the data distribution in their inference workload doesn't align with the data distribution in the training data for the model. That's a given, actually, if you think about it, because researchers have to make assumptions about what is and isn't important during training.

Because of that misalignment, we leave a lot of quality, latency, and cost on the table. So then we said, okay, we want to invest in a customization engine, and we announced it, called FireOptimizer.

FireOptimizer basically helps users navigate a three-dimensional optimization space across quality, latency, and cost. It's a three-dimensional curve, and even within one company, different use cases want to land on different spots.

So we automate that process for our customers. It's very simple: you have your inference workload, you inject it into the optimizer along with your objective function,

and then we spit out an inference deployment config and model setup. So it's your customized setup. That is a completely different product.
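To make the quality/latency/cost navigation concrete, here is a toy sketch, not FireOptimizer's actual algorithm: given a handful of hypothetical candidate deployment configs and a user-supplied objective function, return the config that maximizes the objective. All config names and numbers are invented.

```python
# Toy illustration of objective-driven config selection over the
# quality / latency / cost space. Not a real optimizer; values are made up.
from dataclasses import dataclass

@dataclass
class DeployConfig:
    name: str
    quality: float            # eval score on the user's workload, 0..1
    latency_ms: float
    cost_per_1m_tokens: float

def pick_config(candidates, objective):
    """Return the candidate that maximizes the user's objective function."""
    return max(candidates, key=objective)

candidates = [
    DeployConfig("fp16-large", quality=0.92, latency_ms=900, cost_per_1m_tokens=9.0),
    DeployConfig("fp8-large",  quality=0.90, latency_ms=500, cost_per_1m_tokens=5.0),
    DeployConfig("fp8-small",  quality=0.80, latency_ms=150, cost_per_1m_tokens=1.0),
]

# A latency-sensitive app: every 100 ms is worth as much as 0.1 quality
# points, and dollars matter a little too.
objective = lambda c: c.quality - 0.001 * c.latency_ms - 0.01 * c.cost_per_1m_tokens

best = pick_config(candidates, objective)
print(best.name)  # fp8-small
```

A different objective (say, weighting quality heavily) would land on a different spot of the same curve, which is the point being made above.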

So that product thinking is one side: it's different from one-size-fits-all. And now, on top of that, we provide a huge variety of state-of-the-art models.

Hundreds of them, starting from text: large, state-of-the-art English models. That's where we started. And then as we talked with many customers, we realized audio and text are very close.

Many customers start building a system or a kind of assistant using text, and they immediately want to add audio: audio in, audio out. So we support transcription, translation, speech synthesis, text-audio alignment, all different kinds of audio features. That's a big announcement; you should have heard it by the time this is out.

And the other area is vision. Vision and text are very close to each other, because a lot of information doesn't live in plain text; a lot of information lives in multimedia formats: images, PDFs, screenshots, and many other formats. So often, to solve a problem, we need to put a vision model first to extract information, and then use a language model to process the result. So vision is important, and we also support vision models.

There are different kinds of vision models, specialized in processing different kinds of sources and doing extraction. We're also going to have another announcement: a new API endpoint that supports uploading multimedia content and getting very accurate extracted information out, to feed into the LLM. And of course we support embeddings, because embeddings are very important for semantic search, for RAG, and so on. In addition to that, we also support text-to-image and image generation models, text-to-image, image-to-image, and we are adding text-to-video.
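As a toy illustration of the semantic-search step that embeddings enable for RAG, here is nearest-neighbor retrieval by cosine similarity over hand-made three-dimensional vectors. In a real system the vectors would come from an embedding model; everything below is fabricated for illustration.

```python
# Minimal semantic-search sketch: rank documents by cosine similarity of
# their embedding vectors to a query embedding. Vectors are toy 3-d values.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Pretend embeddings; a real pipeline would call an embedding endpoint.
docs = {
    "refund policy":  [0.9, 0.1, 0.0],
    "shipping times": [0.1, 0.9, 0.1],
    "api reference":  [0.0, 0.2, 0.9],
}

def retrieve(query_vec, k=1):
    """Return the k doc names most similar to the query embedding."""
    ranked = sorted(docs, key=lambda d: cosine(query_vec, docs[d]), reverse=True)
    return ranked[:k]

print(retrieve([0.85, 0.15, 0.05]))  # ['refund policy']
```

The retrieved documents would then be stuffed into the language model's context, which is the RAG pattern referenced above.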

So the portfolio is very comprehensive in terms of models, on top of the optimization and the inference engine. But as we talked with more customers,

they were solving business use cases, and they realized one model is not sufficient to solve their problem. And it's very clear why: everyone's wish is that GenAI will solve all my problems magically, but in reality, the model that knows it all doesn't exist.

That's because it's probabilistic. It's designed to always give you an answer, based on probability, loosely speaking. That's actually sometimes a feature, for creativity, for example, but sometimes it's a bug, because you don't want to give misinformation. And different models also have different specialties for solving a problem.

You want to ask different specialized models, to decompose your task into multiple smaller, narrow tasks, and have an expert model solve each task really well. And of course, a model doesn't have all the information; it has limited knowledge, because the training data is finite, not infinite. So a model often doesn't have real-time information, and it doesn't know any proprietary information within an enterprise.

It's clear in our heads that, in order to really build a compelling application on top of GenAI, we need a compound AI system. A compound AI system is basically going to have multiple models, along with APIs, whether public APIs, internal proprietary APIs, storage systems, database systems, knowledge systems, working together to deliver the best answer.
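As a toy sketch of that compound-AI idea, not Fireworks' implementation: a cheap classifier routes each request either to a knowledge store (for grounded, factual answers) or to a generative model stub. Every component, name, and string here is invented for illustration.

```python
# Compound-AI toy: several specialist components stitched together behind
# one entry point. All components are stubs standing in for real models/APIs.

def knowledge_lookup(query):
    # Stand-in for a database / RAG retrieval step.
    kb = {"plan price": "The Pro plan is $20/month."}
    return kb.get(query)

def small_classifier(query):
    # Stand-in for a cheap specialist model that routes the request.
    return "factual" if "price" in query else "creative"

def creative_model(query):
    # Stand-in for a generative LLM call.
    return f"[creative answer to: {query}]"

def compound_answer(query):
    """Route factual queries to retrieval, everything else to the model."""
    if small_classifier(query) == "factual":
        fact = knowledge_lookup(query)
        return fact or "No grounded answer found."
    return creative_model(query)

print(compound_answer("plan price"))     # grounded in the knowledge store
print(compound_answer("write a haiku"))  # falls through to the model stub
```

The design point is that no single component has to know it all: the system composes narrow experts with external knowledge, as described above.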

Are you going to offer a database?

We actually heavily partner with several big database providers. They are all great in different ways, and it's public information that MongoDB is an investor, and we have been working closely with them for a while.

When you say distributed inference engine, what do you mean exactly? Because when I hear your explanation, it sounds like you're centralizing a lot of the decisions through the Fireworks platform, like the quality and all that. What do you mean by distributed? Is it like GPUs in a lot of different clusters, and you're...

...sharding the inference across them? So, first of all, we run across multiple GPUs, but the way we distribute across multiple GPUs is unique. We don't distribute the whole model monolithically across multiple GPUs.

We chop it into pieces and scale them completely differently based on what the bottleneck is. We are also distributed across regions.

We have been running in North America, EMEA, and Asia. We have regional affinity for applications, because latency is extremely important. We are also doing global load balancing, because a lot of applications quickly scale to a global population.

And at that scale, different geographies wake up at different times, and you want to load-balance across them. So all of that. We also manage very different kinds of hardware SKUs from different hardware vendors, and different hardware designs are best for different types of workloads, whether it's long context, short context, or long generation.

So all these different types of workloads are best fit by different kinds of hardware SKUs, and we can even distribute a single workload across different hardware. So yeah, the distribution is really happening at every layer.
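The regional affinity and global load balancing just described can be sketched, very loosely, as picking the healthy region with the lowest measured latency to the caller. The region names, health flags, and latency figures below are all invented.

```python
# Toy region-affinity router: choose the healthy region closest (by measured
# latency) to the caller's geography. All numbers are illustrative.

regions = {
    "us-west": {"healthy": True,  "latency_ms": {"US": 20,  "EU": 150, "APAC": 180}},
    "eu-west": {"healthy": True,  "latency_ms": {"US": 140, "EU": 15,  "APAC": 220}},
    "ap-east": {"healthy": False, "latency_ms": {"US": 170, "EU": 230, "APAC": 25}},
}

def route(caller_geo):
    """Pick the healthy region with the lowest latency to the caller."""
    live = {name: r for name, r in regions.items() if r["healthy"]}
    return min(live, key=lambda name: live[name]["latency_ms"][caller_geo])

print(route("EU"))    # eu-west
print(route("APAC"))  # ap-east is down, so the next-best healthy region wins
```

Real systems layer capacity, cost, and hardware-SKU fit on top of this, but the affinity idea is the same.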

At some point we'll show on the YouTube version the image the team has been working on, with all the different modalities that you offer. To me, it's basically: you offer the open-source mirror of everything OpenAI offers. I don't think there's anything missing; actually, if you do text-to-video, you'll be a superset of it, because they don't have... sorry, is Sora out, by the way? There's Mochi, and there are a...

...few others. I will say, the interesting thing is, we are betting that the open-source community is going to proliferate; it's literally getting better and better. And there are amazing video-generation companies.

Yes, and there are these amazing audio companies. Across the board, the innovation is off the charts, and we are building on top of that. I think that's an advantage we have compared with a closed-source company.

I want to restate the value proposition of Fireworks for people comparing you with a raw GPU provider like RunPod or Lambda or anything like that: you create the developer-experience layer, and you also make it easily scalable and serverless, you know, as an endpoint. And then I think for some models you have custom kernels, but not...

...for almost all models. For all large language models...

And all of these?

Yeah, yeah, for almost all the models we serve. And so that is...

...called FireAttention. That's called FireAttention. I don't remember the exact throughput numbers, but it's currently much better than vLLM, especially on a concurrency basis.

right? So far, tension for a mostly for language model for other modalities will also have customised coral.

And I think the typical challenge for people is understanding where you add value. There are other people also offering these models, and your moat is your ability to offer a good experience for all these customers. But if your existence relies entirely on people releasing nice open-source models, other people can also do the same thing.

Right, yeah. So I will say we build on top of open-source models; that's the foundation we build on top of.

But we look at the value prop from the lens of application developers and product engineers. They want to create new UX. What's happening in the industry right now is that people are thinking about completely new ways of designing products, and I'm talking to so many founders; it's just mind-blowing.

They help you understand that the existing way of doing PowerPoint, the existing way of coding, the existing way of managing customer service, is actually boxing our heads in. Take the example of PowerPoint: with slide generation, we always need to think about how to fit our storytelling into this format of one slide after another, and we juggle the design together with what story to tell.

But the most important thing is the storytelling, right? So why don't we create a space that is not limited to any format? Those kinds of new product UX...

...designs, combined with automated content generation through GenAI, are the new thing that many founders are doing. What are the challenges they're facing? Let's go from there.

One is, again, because a lot of products built on top of GenAI are consumer- and prosumer-facing, they replace an interactive experience, the kind of app experience we're all used to, and our desire is actually for faster and faster interaction. Otherwise nobody wants to spend the time. And that demands low latency.

And the other thing is, the nature of consumer and prosumer products is that your audience is very big. You want to scale up past product-market fit quickly, but if you lose money at a small scale, you're going to go bust quickly as you grow.

So it's actually a big paradox: I have product-market fit, but when I scale, I scale out of my business. That would be very, very sad, and we have to think about this. So having low latency and low cost is essential for those new applications and products to survive and really become a generational company. That's the design point for our distributed inference engine and FireOptimizer. You can think about them as a feedback loop.

The more you feed your inference workload to our inference engine, the more we help you improve quality, lower latency, and further lower cost; it just keeps getting better. And we automate that, because we don't want you, as app developers and product engineers, to have to figure out all these low-level details. That's impossible, because you're not trained to do that at all; you should keep your focus on the product innovation. And then there's compound AI. We actually feel a lot of the pain as app developers and engineers ourselves: there are new models every week; there's at least one new model coming out...

Okay, and there's a giant model out this week.

Yeah, yeah, I saw that. I saw that.

So: should I keep chasing this, or should I forget about it, right? And which model should I pick to solve which kind of problem? How do I even decompose my problem into smaller problems and fit models to them? I have no idea.

And there are two ways to think about this design, right? I think I've talked about it in the past. One is imperative: you figure out how to do it. You're given tools, and you decide how to do it.

Or you build a declarative system, where the developer tells the system what they want to do, not how. These are two completely different designs. The analogy I want to draw is that, in the data world, the database management system is a declarative system.

Because people use databases through SQL. SQL is a way to say what you want to extract out of a database, what result you want, but you don't figure out which nodes, how many nodes it runs on, how it retrieves data from disk, which index it uses; you don't worry about any of that.

The database query optimizer figures all of that out, generates the best plan, and executes on it, right? So the database does it, and that makes it super easy: you just learn SQL, which is learning the semantics, and you can use it. On the imperative side,

there are a lot of ETL pipelines, where people design these DAG systems with triggers and actions, and you dictate exactly what to do and, if it fails, how to recover. So that's an imperative system. And we have seen a range of systems in the ecosystem go both ways.
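The imperative-versus-declarative contrast can be sketched in a few lines. In the imperative toy pipeline the developer spells out every step; in the declarative one the developer states only the task and a priority, and a tiny "planner" picks the model, much as a query optimizer picks an execution plan. All task and model names are invented.

```python
# Toy contrast of the two designs discussed above. Names are illustrative.

# Imperative: the developer spells out each step and owns failure handling.
def imperative_pipeline(text):
    chunks = [text[i:i + 100] for i in range(0, len(text), 100)]  # step 1: chunk
    partials = [f"summary({c[:10]}...)" for c in chunks]          # step 2: model call per chunk
    return " | ".join(partials)                                   # step 3: merge

# Declarative: the developer states WHAT they want; a planner decides HOW.
CATALOG = {
    ("summarize", "cheap"): "small-summarizer-v1",
    ("summarize", "best"):  "large-summarizer-v2",
}

def declarative_run(task, priority="cheap"):
    model = CATALOG[(task, priority)]  # the "query planner" step
    return f"ran {task} with {model}"

print(declarative_run("summarize"))
print(declarative_run("summarize", priority="best"))
```

The SQL analogy maps directly: the catalog lookup stands in for plan generation, and the caller never names a model, just as a SQL user never names an index.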

I think they're both valuable. I don't think one is going to subsume the other, but we lean more into the philosophy of the declarative system, because from the lens of application developers and engineers, that will be easiest for them to integrate.

And simplest to use.

So yeah, we focus on ease of use, and then let the system take on the hard challenges and complexities. And we carry that thinking into our compound AI system design.

So, another announcement: we will also announce our next declarative system, and it's going to appear as a model with extremely high quality. This model is inspired by o1's announcement from OpenAI. You should see it by the time we announce this, or...

...and it's trained by you? Yes. Is this the first model you've trained?

It's not the first. We actually have trained a model called FireFunction. It's a function-calling model; it was our first step into the compound AI system, because a function-calling model can dispatch a request across multiple APIs. We have big sets of APIs the model has learned.

You can also add additional APIs through the config, to let the model dispatch accordingly. So we have a very high-quality function-calling model that's already released; we actually have three versions, and the latest version is very high quality.

But now we take a further step: you don't even need to use a function-calling model. You use our new model we're releasing, and it will solve a lot of problems, approaching very high, OpenAI-level quality. So I'm very excited about that.
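As a minimal sketch of what a function-calling model enables, with a hand-written dict standing in for real model output: the model emits a structured call, and the application dispatches it to a registered API. All tool names and values below are invented.

```python
# Toy function-calling dispatch: the "model output" is a JSON function call
# (hand-written here), which the app routes to a registered Python function.
import json

REGISTRY = {}

def tool(fn):
    """Register a Python function as a callable 'API' for dispatch."""
    REGISTRY[fn.__name__] = fn
    return fn

@tool
def get_weather(city: str) -> str:
    return f"Sunny in {city}"  # stub; a real tool would hit a weather API

@tool
def get_stock(symbol: str) -> float:
    return 123.45              # stub value

def dispatch(model_output: str):
    """Parse the model's JSON function call and invoke the matching tool."""
    call = json.loads(model_output)
    fn = REGISTRY[call["name"]]
    return fn(**call["arguments"])

# Pretend the model produced this structured call:
fake_model_output = '{"name": "get_weather", "arguments": {"city": "Redwood City"}}'
print(dispatch(fake_model_output))  # Sunny in Redwood City
```

The model's job is only to choose the function and fill its arguments; the application keeps control of what actually executes, which is what makes this a building block for compound systems.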

Any benchmarks?

We have benchmarks; we're going to release them, hopefully next week. And we just put our model on LMSYS, and people are guessing: is this the next Gemini model, or a MoE model? The guessing is very interesting; it's fun watching the Reddit discussion.

When OpenAI released o1, a lot of people asked whether it's a single model or whether it's a chain of models, and basically everyone on the Strawberry team was very insistent that what they did with reinforcement learning cannot be replicated by just a serving-side approach. Do you think they are wrong? Have you done the same amount of work on RL as they have, or is it a different direction?

I think they take a very specific approach, and the caliber of the team is very high, right? So I do think they are the domain experts in doing what they're doing. But I don't think there's only one way to achieve the same goal. We are in the same direction in the sense that the scaling law is shifting from training to inference.

We are definitely aligned on that; I fully agree with them. But we are taking a completely different approach to the problem. All of that is because, of course, we didn't train the model from scratch. We build on the shoulders of giants, right? The current models we have access to are getting better and better.

The future trend is that the gap between the open source models and closed source models is just going to shrink, to the point where there's not much difference, and then we're on the same level playing field. That's where, I think, our early investment in inference and all the work we do around balancing across quality, latency, and cost pays off, because we've accumulated a lot of experience there, and that empowers us to release this new model that is approaching OpenAI's quality.

I guess the question is: what do you think the gap will be? Because I think everybody agreed, with Llama 3.1 405B, that open source models would close the gap, and then o1 just reopened the gap so much. And it's unclear, obviously.

You're

saying your model... We are closing it.

So here's the thing that happened, right? There are public benchmarks; it is what it is. But in reality, open source models in certain dimensions are already on par with or beat closed source models, right? For example, in the coding space, open source models are really, really good.

And in function calling, FireFunction is also really good. So it's all a matter of whether you build one model to solve all the problems and want to be the best at solving all the problems, or, in the open source domain, it's going to be specialized, right? All these different models are specialized in certain narrow areas, and it's logical that they can be really, really good in that very narrow area. That's our prediction: with specialization, there will be a lot of expert models that are really, really good, and even better than one closed source generalist.

I think this is the core debate that I am still not one hundred percent sure about either way, in terms of compound AI versus one big model, because you're basically fighting

the bitter lesson.

Look at human society: we specialize, and you feel really good about someone specializing in doing something, right? And that's how we evolved.

In ancient times, we were all generalists; we did everything. And over time, we specialized. So my prediction is that the model space will specialize too.

Except, per the bitter lesson, you get short-term gains by having specialists and domain experts, and then someone just trains, say, a model with ten times more inference, ten times more data, and the next model supersedes all the individual models, because of some generalized intelligence plus world knowledge.

I think that is the core insight of the GPT series, you know.

That was right. But the scaling law, again, right? The training scaling law worked because you had increasing amounts of data to train on, and you could throw a lot of compute at it. So I think on the data side we're approaching the limit, and the only way to increase data beyond that is synthetic data, right? If you have a good

model, you can generate very good synthetic data, and then continue to improve quality. That's why I think it opens up from the training scaling law into inference scaling, and it's test-time compute and all of this. So I really believe that's the future direction, and that's what we're building toward.
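One simple form of the inference-time scaling Lin describes is best-of-n sampling: spend more compute at test time by drawing several candidates and letting a verifier pick the winner. This is a toy sketch; `generate` and `score` are stand-ins for a real model and a real reward/verifier model, not anything Fireworks ships.

```python
import random

def generate(prompt: str, seed: int) -> str:
    # Stand-in sampler: a real system would call an LLM with temperature > 0.
    rng = random.Random(seed)
    return f"{prompt}-candidate-{rng.randint(0, 9)}"

def score(answer: str) -> float:
    # Stand-in verifier: a real system would use a reward model or checker.
    return float(answer[-1])

def best_of_n(prompt: str, n: int = 8) -> str:
    """Trade extra inference compute (n samples) for quality (best score)."""
    candidates = [generate(prompt, seed) for seed in range(n)]
    return max(candidates, key=score)

best = best_of_n("q")
print(best)
```

Raising `n` is literally "spending more on inference": quality improves with compute at test time rather than at training time, which is the shift in scaling law being discussed.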

A quick follow-up question on that. Are you planning to share your reasoning traces?

That's a very good question. We are still debating. Yes, we may do that.

I would say, for example, it's interesting that with SWE-bench, if you want to be considered for the ranking, you have to submit your reasoning traces, and that has actually disqualified some. Cosine was doing well on SWE-bench, but they didn't want to submit those results. That's also why you don't see o1-preview on SWE-bench: they don't submit their reasoning traces; they consider it IP. But if you're going to be more open, then that's one way to be more open. And your model is not going to be open source, right? It's going to be an endpoint that you provide.

Yes. Actually, I don't have that information. Everything is going so fast; we haven't even thought about that yet. I should be more prepared.

I mean, this is live. It's nice to just talk about it as if it was anything else that you want feedback on or are thinking through. It's kind of nice to talk about something when it's not decided yet. This new model, it's going to be exciting and generate a lot of buzz, right?

I'm very excited to see how people are going to use this model. There's already a Reddit discussion about it, and people are asking very deep technical questions, and it seems the model gets them right, which is surprising and interesting. We also asked the model what AGI is, and it generated a very complicated chain-of-thought process.

So we've had a lot of fun testing this internally, but I'm more curious how people will use it: what kind of applications they're going to try it on and test it with. That's where we'd really like to hear feedback from the community, and also feedback to us: what works well, what doesn't, what works well but surprises them, and what they think we should improve. That kind of feedback will be tremendously helpful.

Yeah, I mean, I think with the reaction to o1-preview and o1-mini, I would say there's a very, very obvious jump in quality, so much so that it made Claude and the previous state of the art look dated. It's really that stark a difference. The number one piece of feedback, or feature request, is that people want control over the budget.

Because right now, o1 kind of decides its own thinking budget. But sometimes you know how hard the problem is, and you want to actually tell the model: spend two minutes on this, or spend some number of dollars. Maybe it's time, maybe it's dollars.

The thinking budget, that makes sense. We actually thought about that requirement, and at some point we need to support it. Not initially, but that makes a lot

of sense. OK. So that was a fascinating overview of the things you're working on. First of all, I realize I don't know if I've ever given you this feedback. I think one of the reasons I wanted to advise you is because, when you first met me, I was kind of dubious.

I was like,

why is this team going to be the one? The inference space is very, very competitive; why would you win? And the reason I actually changed my mind was I saw you just shipping. Your surface area is very big, and the team is not that big.

No, only forty people, yeah.

And now here you are trying to compete with OpenAI and everyone else. What is the secret?

I think the team. The team is the secret.

Oh boy.

So there's nothing else?

No, really. I think we're all very aligned on the culture, because most of the team came from Meta and many startups, so we really believe in results. One is results, and second is the customer.

We are very customer obsessed, and we don't want to drive adoption just for the sake of adoption. We really want to make sure we deliver a lot of business value to the customer, and we really value their feedback. So we would wake up in the middle of the night and deploy a model for them, shuffle some capacity for them, over the weekend.

No-brainer. So that's just how we work as a team. And the caliber of the team is really, really high as well. As a plug, we are hiring; we are expanding very, very fast. So if you are passionate about working on the most cutting edge technology in the gen AI space, come here.

Let's talk a little bit about that customer journey. I think your most famous customer is Cursor. We were the first podcast to have Cursor on, and obviously since then they've blown up; cause and effect are not related. But you guys especially worked on a fast-apply model, where you were one of the first people to work on speculative decoding in a production setting. Maybe just talk about what it's like behind the scenes of working with them.

I'd say Cursor is a very, very unique team. I think the unique part is that the team has very high technical caliber, no question about it, but they have decided, unlike many companies in the coding copilot space that say "I'm going to DIY because I can," to be unique in a sense:

they seek partnership. Not because they can't; they're fully capable. But they know where to focus. That, to me, is amazing.

And of course, they want to find a blazing fast partner. So we spent some time working together. They are pushing us very aggressively, because for them to deliver a high caliber product experience, they need the latency.

They need the interactivity, but also high quality at the same time. So actually, we extended our product features quite a lot as we supported Cursor. They're growing so fast, and we must scale quickly across multiple regions, and we developed a high-intensity inference stack, almost similar to what we did for Meta.

I think that's a very, very interesting engagement, and through that, the trust was built. They realized, hey, this is a team they can really partner with and they can grow with. That comes back to: we're really customer obsessed, and all the engineers working with them are brainstorming all the time, thinking together with them and discussing. We're not big on meetings, but we're in that channel, always on. So it always feels like working as one team. I think that's really a highlight.

Yeah. So Cursor is basically a VS Code fork, but most of the time people will be using the closed models. I actually use the Apply feature a lot. But you're not involved there, right? It's not like you host Sonnet or have any partnership there. Where you're involved is Cursor's own proprietary in-house models, and how those work is often not disclosed.

So I don't know what I can say, and I can't say the things they haven't said.

It's very obvious from the drop-down in Cursor, right? So I assume the fast-apply side is the Fireworks side, and the other side, the chat, is Claude. Do you see any more opportunity there? I think you made a big splash with that thousand tokens per second; that got a lot of attention. Is there more to push there?

We push a lot, actually. I mentioned FireOptimizer, right? We have a unique optimization stack there. When we deployed to Cursor early on, we optimized for their specific workload, and there's a lot of juice to extract there.

And the success in that engagement can actually be widely adopted. So that's why we started a separate product line called FireOptimizer. Speculative decoding is just one approach, and speculative decoding here is not static; we actually wrote a blog post about it.

There are so many different ways: you can pair a small model with a large model in the same model family, or you can have EAGLE heads or Medusa heads. There are different trade-offs in which option to take, and it really depends on your workload. Given your workload, we can align the EAGLE heads or Medusa heads, or the small-big model pair, much better to extract the best latency reduction. All of that is part of the FireOptimizer offering.
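The small/large pairing Lin mentions can be sketched as the classic draft-and-verify loop: a cheap draft model proposes a block of tokens, the expensive target model checks them in one pass, and you keep the longest agreeing prefix. The two toy "models" below are arithmetic stand-ins, not real LLMs, and this is the greedy variant rather than the full probabilistic acceptance rule.

```python
def draft(prefix, k):
    # Cheap draft model: guesses the next k tokens (deliberately imperfect).
    out, last = [], prefix[-1]
    for _ in range(k):
        last = (last + 1) % 5
        out.append(last)
    return out

def target(prefix):
    # Expensive target model's true next token (different rule from draft).
    return (prefix[-1] + 1) % 7

def speculative_step(prefix, k=4):
    """One speculative decoding step: accept the draft's matching prefix."""
    proposed = draft(prefix, k)
    accepted = []
    for tok in proposed:
        if target(prefix + accepted) == tok:
            accepted.append(tok)                        # draft matched target
        else:
            accepted.append(target(prefix + accepted))  # fix one token, stop
            break
    return prefix + accepted

print(speculative_step([0]))  # [0, 1, 2, 3, 4] — all four drafts accepted
print(speculative_step([4]))  # [4, 5] — draft diverged, one corrected token
```

When the draft agrees often (the first call), one target pass yields several tokens, which is where the latency win comes from; tuning the draft to the workload raises that acceptance rate.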

I know you mentioned some of the other inference providers. I think the other question people always have is around benchmarks, because you get different performance on different platforms.

How should people think about, you know, "hey, Llama 3.2 does X tokens per second on platform Y"? Maybe with speculative decoding you go down a different path; maybe some providers run a quantized model. How should people think about how much they should care about how you're actually running the model? What's the delta between the magic that you do and the raw model?

Okay. So there are two big development cycles. One is experimentation, where they need fast iteration. They don't want to think about quality; they just want to experiment with the product experience and so on, right? That's one.

And then it looks good, and they want to push the product into production. Then it's about scaling, and quality becomes really important, and latency and all the other things become important. During the experimentation phase, just pick a good model, don't worry about anything else, validate whether gen AI is the right solution for your product, and focus there. Then decide where you should land in that triangle across quality, latency, and cost. To me, it's purely a product decision.

For many products, if you choose lower quality but better speed and lower cost, and it doesn't make a difference to the product experience, then you should do it. That's why I think inference is part of the validation. The validation doesn't stop at offline evals; the validation will go through A/B testing through inference.

And that's why we can offer very different configurations for you to test which is the best setting. So this is like traditional product evaluation. Product evaluation should also include new model versions and different model setups in the consideration.
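That product decision over the quality/latency/cost triangle can be sketched as a simple config picker: among configurations that meet a latency budget, take the cheapest one whose measured quality is within tolerance of the best. All the config names and numbers below are made up for illustration.

```python
# Hypothetical A/B results for three inference configurations.
CONFIGS = [
    {"name": "fp16",         "quality": 0.90, "p50_ms": 800, "cost": 1.0},
    {"name": "int8",         "quality": 0.89, "p50_ms": 500, "cost": 0.6},
    {"name": "int8+specdec", "quality": 0.89, "p50_ms": 300, "cost": 0.7},
]

def pick(configs, latency_budget_ms, quality_tolerance=0.02):
    """Cheapest config within the latency budget and near-best quality."""
    feasible = [c for c in configs if c["p50_ms"] <= latency_budget_ms]
    best_q = max(c["quality"] for c in feasible)
    near_best = [c for c in feasible if best_q - c["quality"] <= quality_tolerance]
    return min(near_best, key=lambda c: c["cost"])

print(pick(CONFIGS, 600)["name"])  # int8
```

The point of the sketch: if a cheaper, faster setting is indistinguishable in product quality, the A/B loop should select it, exactly as described above.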

Let's specifically talk about what happened a few months ago with one of your major competitors. I mean, all this is public. What is your take on what happened? And maybe you want to set the record straight on how Fireworks does quantization, because I think a lot of people may have outdated perceptions, or they didn't read the clarification post on your quantization.

First of all, it was always surprising to us that, without any notice, we got called out by name,

which is, yeah...

in a public post, with a certain interpretation of our quality. So I was really surprised. And it's not a good way to compete; we want to compete fairly, and oftentimes when one vendor gives out results about another vendor, it's always extremely biased.

So we actually refrain from doing any of that, and we happily partner with third parties to do the most fair evaluation. So we were very surprised, and we don't think that's a good way to shape the competitive landscape. So then we reacted.

When it comes to quantization, we wrote a very thorough blog post, because, again, no one size fits all. There are very different quantization schemes, and we can quantize very different parts of the model, from weights to activations to KV cache to communication. You can use a different quantization scheme for each, or be consistent across the board. And again, there are trade-offs across quality, latency, and cost.
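To make the trade-off concrete, here is one of the simplest quantization choices in that design space: symmetric int8 weight quantization. This is a toy pure-Python sketch of the arithmetic, not Fireworks' serving code; real stacks make this choice per tensor (weights, activations, KV cache) and often per layer.

```python
def quantize_int8(weights):
    """Map floats to int8 range [-127, 127] with one shared scale."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats from the int8 values."""
    return [v * scale for v in q]

w = [0.12, -0.5, 0.33, 1.27]
q, s = quantize_int8(w)
w_hat = dequantize(q, s)
max_err = max(abs(a - b) for a, b in zip(w, w_hat))
print(max_err)  # tiny: rounding error is bounded by scale / 2
```

The latency and cost win comes from moving and multiplying 1-byte integers instead of 2- or 4-byte floats; the quality cost is that bounded rounding error, which is why provider results differ depending on where and how aggressively each one quantizes.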

For our customers, we actually let them find the best optimization point, and we have a very thorough evaluation process to pick that point. But for self-serve, there's only one point to pick.

There's no customization available there. So of course, for managed customers, we have to pick one point. And the end result: a third party published an evaluation, and we actually look really good. That's why I said I will leave the evaluation of quality or performance to third parties, and work with them to find the most fair benchmarking approach and methodology. But I'm not proud of the approach of calling out specific names and criticizing other competitors in a one-sided way.

The vibes at events as well. I think you're the more politically correct one, and Dmytro is the more... he's the one on Twitter.

We partner.

No, actually all these reactions we build together; we just play it in different ways.

One last one on the competition side: there's a perception of price wars in hosting open source models. We talked about how competitive this market is. Do you aim to make margin on open source models?

Oh, absolutely, yes. But I think when we think about pricing, it really needs to correlate with the value we're delivering. If the value is limited, or there are a lot of people delivering the same value, then there's definitely only one way for price to go: down.

Through competition, if I take a big step back, on pricing we compare more with the closed model providers' APIs, right? The closed model providers' cost structure is even more interesting, because we don't bear any training cost. We focus on inference optimization, and that's where we continue to add a lot of product value.

So that's how we think about the product. But the closed source API providers and model providers bear a lot of training cost, and they need to amortize that training cost into inference. So it creates a very interesting dynamic: if we match pricing there, how they are going to make money is very, very interesting.

So for listeners: OpenAI's reported 2024 numbers are roughly four billion in revenue, three billion in training compute, two billion in inference compute, one billion in research compute amortization, and seven hundred million in salaries. So that is a lot of money.

Yeah, so that means they basically make zero or negative. That's the very interesting dynamic we're operating within. Coming back to inference: as I mentioned, our product is a platform; we are not just a single-model-as-a-service provider like many other inference providers. With FireOptimizer, we optimize toward your inference workload, and with the compound AI system we're designing, we simplify your interaction to get high quality with low latency and low cost. Those are all very different from other providers.

What do people not know about the work that you do? I feel like with Fireworks, you run models very quickly, you have the function calling model. Is there any underrated part of Fireworks that more people should try?

Yeah, actually, one user posted on x.com. He mentioned: "oh, actually Fireworks allows me to upload my LoRA adapter to the serverless model at the same cost and use it at the same cost." Nobody else provides that. That's because we rolled out multi-LoRA last year, actually. We've had this feature for a long time, and many people have been using it, but it's not well known.

If you fine-tune your model, you don't need to use on-demand. If your fine-tune is a LoRA, you can upload your LoRA adapter, and we deploy it as if it's a new model. Then you get your endpoint, and you can use it directly, but at the same cost as the base model. So I'm happy that user is marketing it for us.

He discovered that feature, but we've had it since last year. So I think, from that feedback, we have a lot of very...

very good.

We had prompt caching way back last year also. We have many. So yeah, I think that's one of the underrated features. And if there are developers using our serverless platform, try it out.

The LoRA thing is interesting, because the reason people add additional cost to it is not that they feel like overcharging. Normally, in LoRA serving setups, there's a cost to loading those weights and dedicating a machine to them. How come you can avoid it?

Yeah, so this is kind of our technical secret sauce. We basically have many LoRA adapters share the same base model, which significantly reduces the memory footprint of serving: one base model can sustain a hundred to a thousand LoRA adapters. Then all these different LoRA adapters share the same traffic to the same base model, where the base model dominates the cost. That's how we optimize it, and that's why we can keep the dollars-per-million-tokens pricing the same as the base model.
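The structure that makes this work can be sketched in a few lines: each adapter is a low-rank correction `A @ B` added on top of the shared base weights, so the expensive base matmul is computed once and each tenant only pays for a tiny extra product. This is a pure-Python toy with 2x2 matrices and hypothetical tenant names, not a real multi-LoRA server (which batches requests for different adapters together on the GPU).

```python
def matmul(x, W):
    # Row vector x times matrix W.
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

BASE_W = [[1.0, 0.0], [0.0, 1.0]]   # shared base weights (identity, for clarity)

# Per-tenant LoRA adapters: effective weights are W + A @ B,
# with A (2x1) and B (1x2), i.e. rank-1 corrections.
ADAPTERS = {
    "tenant-a": ([[1.0], [0.0]], [[0.0, 0.5]]),
    "tenant-b": ([[0.0], [1.0]], [[0.5, 0.0]]),
}

def forward(x, adapter_id):
    base_out = matmul(x, BASE_W)     # dominant cost, shared by every adapter
    A, B = ADAPTERS[adapter_id]
    delta = matmul(matmul(x, A), B)  # low-rank correction: cheap per tenant
    return [b + d for b, d in zip(base_out, delta)]

print(forward([1.0, 1.0], "tenant-a"))  # [1.0, 1.5]
print(forward([1.0, 1.0], "tenant-b"))  # [1.5, 1.0]
```

Because only the small `A`/`B` pairs differ per adapter, adding another fine-tune adds kilobytes-to-megabytes of weights rather than another copy of the base model, which is what lets the per-token price stay at the base model's price.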

Is there anything you want to request from the community, model-wise or tooling-wise, that you think someone should be working on?

Yeah. We really want to get a lot of feedback from the application developers who are starting to build on gen AI, or have already adopted it, or are thinking about new use cases and so on: try it out on Fireworks first, and let us know what works out really well for you, what your wishlist is, what is not working out for you, and what you would like us to continue to improve. And for our new product launches:

typically, we want to launch to a small group of people first. Usually we launch on our Discord first, to have a set of people use it first. So please join our Discord channel.

We have a lot of communication going on there. You can also give us feedback there. We've started office hours for you to directly talk with our DevRel and engineers to exchange notes.

And you're hiring across the board?

Hiring across the board: frontend engineers, cloud infrastructure engineers, backend system optimization engineers, applied researchers, and researchers who have done post-training, who have a lot

of fine-tuning experience.

Thank you. Thanks for having us.