
AI Reliability, Spark, Observability, SLAs and Starting an AI Infra Company

2025/6/27

MLOps.community

Chapters
The podcast discusses the evolution of data from a back-office function to a core product. The limitations of existing data platforms, built 12-13 years ago, are highlighted, along with the impact of AI and the rise of unstructured data.
  • Inference is the new transform.
  • Existing platforms weren't built for AI workloads.
  • Hardware evolution allows for single-node processing.
  • Unstructured data (text, images, video) is now prevalent.

Shownotes Transcript

In our view, inference is the new transform. Data started slowly to become more of the product and not just like a back office thing that the company needed to run the business. Platforms that exist out there don't treat these as first-class citizens. It's not like, okay, you're not getting a bonus this month because your prompt was not good, dude. Who's going to be the pets.com of the LLM bubble? I'm excited because I was the...

Metaphorical Cupid. Yes, did it. Ultimate matchmaker. Yeah, you did it. Building a company is always a journey. Usually people talk about what they've done after the fact. So they tend to like to remove a lot of the gory details. Yeah. So...

Both me and Yoni, we came from similar experience of the market, but from different angles. So I was working mainly like more, let's say, traditional data infrastructure. So I saw a lot of like the data engineering like work there, especially like in the enterprise. Yoni came from data infrastructure again, but more from the ML side. And we started like working on that. Okay, how we can...

rethink, let's say, and build new tooling around working with data infrastructure for the problems that we have today. Because keep in mind that the dominant tools that we are using were created 12, 13 years ago, completely different use cases. Yeah, like Spark. Spark, Trino, even the commercial tools like Snowflake, they're kind of similar at the end of the day.

It's all about how we move into the cloud and how we do like this big data thing, which the dominant use case is BI. It's analytics, right? There is ML, but ML was like always more of a niche thing at the end of the day.

compared to the BI market, if we take it in just the numbers out there. But in 12-13 years, many things have changed. Many new workloads came in. There's ML, what Yoni was working on. Today we have AI, we'll talk more about that later. There were things about embedded analytics, a lot of product-related... Data started slowly to become more of the product and not just a back-office thing that the company needed to run the business.

And of course, like these tools were not built for that. We could kind of, like, push them to do it. But as the market was demanding more and more and more, the pain was becoming bigger and bigger and bigger. And we felt that like, okay, this is like the right time

to go out there and start building something in this space. Yeah, no, I think you're right on there. I think now more than ever, what we were seeing is teams saw the value in putting data products into production as quickly as possible because they realized the direct correlation that they have with business outcomes. The more effectively they can put data products into production, the better business outcomes they get. And it's not just now the Silicon Valley companies, right? If you think about it, it's

all the non-tech first companies also saw this value there. But the tooling didn't enable them to do that. I can tell you how much time we spent, thinking back to my time at Tecton,

like helping customers debug through their Spark logs, right? And working with them to figure out, you know, we would go and onboard them and these weren't Silicon Valley companies that had run Spark clusters before. So here's Spark 101. This is how you, this is a Spark config. This is how you go through and get your initial cluster. It's like, oh boy, you're in for a treat. And at the end, what we saw is,

The main thing that's evolved in the 15 years since the brilliant guys from Databricks went out and built Spark is that a couple of things have really changed since that time. One thing, hardware has evolved tremendously, which, Kostas and I always talk about this, is that at that time, distributed workloads were a lot more necessary because

you were a lot more limited on the resources you could get through AWS. Now you can go and check out, you know, a huge instance on EC2, and DuckDB has come in and DataFusion has come in, and they've kind of flipped the analytics world on its head where you can do a lot more on a single node, right? And we saw that at Tecton too, where we'd have these

these clusters and you'd be running hundreds of gigabytes of workloads that didn't really need distributed Spark clusters. You can really do a lot on a single node.
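A rough sketch of that single-node point, assuming a hypothetical local Parquet dataset: DuckDB runs in-process on one machine and handles the kind of aggregation that used to be a reason to spin up a cluster.

```python
# Minimal single-node sketch: DuckDB scanning Parquet files in-process,
# no Spark cluster involved. The path and column names are hypothetical.
import duckdb

con = duckdb.connect()  # in-process, single node

top_customers = con.execute(
    """
    SELECT customer_id, count(*) AS n_events, sum(amount) AS total_amount
    FROM read_parquet('events/*.parquet')   -- hypothetical local dataset
    GROUP BY customer_id
    ORDER BY total_amount DESC
    LIMIT 10
    """
).fetchdf()
print(top_customers)
```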

So that's one thing. And I think one of the catalysts that we were kind of identifying, and other things, AI is coming into the picture too, which has totally changed the nature of workloads, right? It's no longer only about structured tabular data, which we have great platforms for right now, and which are more or less solved. There's still a lot of challenges there, but now there's unstructured data. There's a lot of text data, there's images, there's videos, all the different modalities there.

that these platforms just weren't built for, right? Their first-class citizens are structured, tabular data, they do that really, really well, but they don't have the ergonomics, the capabilities; they weren't built from the ground up, with first principles, having these characteristics of AI workloads in mind. The other thing that like AI did was actually

acting like a catalyst to the market and the industry. Because if we take and see what was happening again, like in the past 12, 13 years, we have the analytics market, which is huge, but still is, it's a backend kind of work, right? You have reports that you need to build and deliver. These reports are consumed by people who make decisions, blah, blah, blah. But it's not a product. Like it's not the customer facing product of every company out there, right? Right.

Now, these things started changing gradually, but it was changing slowly. We had ML. ML was like part of that because when you have recommenders, right, now the data becomes the product itself. Like you need to use the data in order to recommend and increase your revenue, right? And it's not like BI doesn't do that, but the link between...

Let's say how the data is used and the value that's created is kind of hard to identify, right? That's been a traditional problem. And I know that there's so many teams that have talked about that. How do you champion for the work that you're doing if you can't draw a direct line from...

what you are doing to revenue generated or saved. 100%. And especially the way that data teams are built, I think, in the successful teams. Like take a company like Lyft, for example, right? Or Uber. Again, Silicon Valley, like data-driven company. And if you see like the data teams in there, the layers that you have of people from the SREs that they take care of your AWS accounts,

to the data platform people who are making sure that there's always capacity there for your Spark clusters, to the data engineers who are making sure that the data is delivered, and then the analysts and the ML people and the data scientists on top. Like, the guy at the bottom has no idea how, like, his work is used actually to make a decision of how, like, to improve, like, the revenue that you have, like, right? Right.

And that's part of what Yoni was saying also, that the model that these technologies were built and the requirements in terms of the talent needed to maintain and scale them, it was not sustainable if you want to make it accessible to everyone in the market out there. Contrast that to how we build apps. If we were building apps like SaaS applications the same way that we did with data,

Like the industry would be, I don't know, like one-tenth of what it is today. And like the people that are working there probably would be 2% of what it is today, right? So what AI did, because we started getting into the data becoming more and more of a product, is that the AI came in and said,

everything is about data. Now, whatever product we are building, one way or another, consumes data and spits out data. And it's not just the data we're used to, as Yoni said, now we have also unstructured data that we can process in ways that we couldn't do before. And there is an opportunity there to make these technologies even more accessible to more people. Yeah, it's easier than ever. Exactly.

So AI kind of like accelerated it. That's what I'm saying. It's not like it wasn't happening. Like already the industry was leaving SaaS, selling the software as a service, behind as the way to build value.

and getting more into the data is becoming the driving force of the next iteration of the industry. But AI came in and accelerated like 100x. What do you mean by that exactly? Leaving the SaaS? Because I've heard so many different versions and viewpoints of that idea of like SaaS is not the way forward. Yeah. So if we think about...

You need to be old like me to do that, okay? Because you have to see how the market developed in the past 20 years. I'm revealing a little bit of my age now. So what happened, let's say, from 2010 to 2020, right? Where we had all these companies, like Salesforce became really big, right?

the work days of the world out there, right? What was actually happening is that we took all the activities that humans were doing and we tried to automate them using software and deliver that software over the cloud, right? The cloud was important because the cloud allowed for

Efficiencies, like from a financial point of view, that you couldn't get with other delivery methods, right? And that helped grow a lot and commoditize the software at the end, right? So, we were using taxis, right? You'd go out there, hey, hello, or call someone to come and pick you up. Now you can use an app to do that, right?

So we kind of created platforms, or e-shops, for example, like Shopify, right? We created, for each activity that humans do in business and their personal life, software platforms to make the process more efficient, right? And that was delivered through SaaS, right?

we kind of saturated that. I mean, like, okay, I don't think there's like much left out there that we haven't turned into like a software platform one way or another, right? So if we want to keep the growth of the industry the way that it was, like, okay, what's next? Now, the magic of like turning every process into a software is that you catch a lot of data. You capture a lot of data, right? And this data is like,

sitting somewhere and now the question is like, what do you do with this data? So the first thing, do BI reporting, right? Like do your financials. Like BI, by the way, for people who don't know, the first owner in the business of BI was the CFO, because primarily the first need was to go and build financial reporting, right?

We came into like, okay, now let's do marketing reporting, let's do sales report, blah, blah, blah, like all that stuff. And then, well, you know what? Now that we have all the interactions of these people with our products, maybe we can build recommenders. And now we can automate and send like an email and be like, hey, you know what? I saw that you were looking into that stuff. I found this one that might be interesting, right? And this can drive, let's say, the behavior of the world to go and buy more. Right.

That's based on data, right? So we have all this data and we have also a lot of unstructured data. We have all these customer support calls, where thousands and thousands of people are calling to tell their issues and all that stuff. But we couldn't work on that before, or at least not easily enough, right?

So, there is the opportunity out there to take all this data. It's like an untapped opportunity to turn more and more value out of that. Now, the question is, okay, can we do it with the technologies that we had? And actually, it was like, it's kind of hard. And then AI came in. Actually, LLMs came in, right? And they offered a number of tools. And most importantly, in my opinion...

to a much, much broader audience out there than ever before. So people now, they can work and do things with their data that previously were really, really hard to do. You had to be in Lyft or Uber or Apple. Yeah, I like that, highlighting how mature the company's data

teams and just overall team and vision of how to use their data needed to be before AI came. And then you relate that to

app builders and how simple it is to build an app versus how simple it is to put a recommender system into production. Yeah, yeah. And I can give like an example for people to understand how big of a gap there is between the maturity that, let's say, the app building part of the industry has and the data one, right?

So imagine, like today you have a front-end engineer, builds like the front-end, they work in React, blah blah blah, whatever they are doing there. You have a back-end engineer, they're doing their stuff like with Firebase or like whatever.

And you have an app that you can put out there. Now, these two guys there, they are working. Each one of them is working on their own thing. And somehow they can merge their work and have an app working at the end. And you two probably have more experience than me on that. Try to contrast that with what has happened with the data teams. The data scientist is going to build a notebook

And what happens next? You hand this notebook to a data engineer. And what is the data engineer doing? Rewriting the code to put it into production. Now, take this and put it into app building. How many Salesforces would we have if we had to do that? Literally, the front-end developer hands the code to the back-end developer, and the back-end developer is going to rewrite the thing in order to release it.

So that's the difference between the level of maturity that, let's say, one side of the industry has compared to the other one. And AI is that great equalizer. Actually, it's the catalyst, I would say. It's not there. And I think a big problem that we are seeing today with AI is that getting things into production is really, really hard. It's great for demos. We can build demos.

very impressive demos, but getting into production is really hard. And the reason for that is because the tooling that we have and the engineering practices that we have are not there yet to deliver, let's say, the same capacity of delivering software with the quality that we need, as we do with building applications. It's funny, yeah, you mentioned that, and how you're

you now have a new persona that is able to access that. So that is that catalyst. And it's like the market is demanding that you have the ease of use because you have all these backend and frontend engineers who are able to use AI and

And they're coming in and then they're like, okay, I got something 80% there. I can show a demo. I'm good. But then you're like, so let's use this in our app. Yeah. And I think part of that is like kind of what you're describing is if we take a look at the AI journey that companies go through, right?

chat interfaces. That's the primary thing that came out that is still, I think, the dominant way that people interface with LLMs and AI, right? And so you have, let's say, interactive AI. That's where people tend to start building apps, building AI products, right? They'll go and grab a LangChain or they'll go and grab a LlamaIndex and really expect to either have like a human in the loop or, you know, interactive on the other end, like an agent waiting for like an AI response that comes from these LLMs, right?

And I think that's where companies tend to start primarily because that's where most of the tooling exists now. Then they tried, they already start seeing value, right? Okay, we're building this product, it's adding some value. But now I really want to go and scale this. Like how do I now do AI at scale? How do I actually build a product in production? And the challenge there is the nature of all these models are non-deterministic, right?

And so a lot of the concepts that were once used for data teams with structured and tabular data don't necessarily apply for unstructured data because of that non-deterministic nature. And so what our theory is, what our thesis is, is that we really want to help companies and build a platform that can help them create deterministic pipelines

using very familiar interfaces and concepts that they're already used to, on top of non-deterministic models, for example. That's where a lot of the challenges are; that's really the problem set that we're excited about helping teams with.

is how do you build a lot of clarity and a lot of stability and reliability and help companies actually take that next step where I think about how do we scale our AI products and how do we put these pipelines into production in a resilient, in a stable fashion? And that's where a lot of the tooling is kind of missing, right? They kind of start doing the one-off interactive AI. And then when it comes to putting this actually in production at scale, right?

You know, you're talking about concepts like context windows and tokenizing and partitioning and chunking, all things that are new muscles that data teams really need to build to be able to run these things reliably in production. Even just, is Anthropic working right now? Is the API up? Yeah, exactly. So what are the latencies like, right? And then all the new versions of the models coming in introduce new things for cost optimization, right? Inference is expensive, right? Time

- Time to first token. That's so true. There's so many new metrics that you're looking at that you previously had not been exposed to. - And the platforms that exist out there don't treat these as first class citizens, which means that teams that have great engineering teams

are DIYing it. They're saying, okay, how can I go and build something around these constraints that I have and around these properties of non-deterministic models to try and simulate how we're currently interacting with our structured and tabular pipelines, right? So they'll, if you're like a great Spark shop, for example, you'll go and write these

complex UDFs which are brittle, hard to maintain, on top of Spark, and not taking advantage of any of the performance that Spark has to offer because it's not running in distributed fashion. Or you'll go and build some complex logic on top of Lambda functions where you're trying to hit the APIs of these models.

But lots of engineering cost goes into that, that's hard to maintain. Teams are trying to piece things together to try and help them run AI at scale in production.
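As a hedged sketch of the DIY pattern being described, assuming a hypothetical model endpoint and input schema: a per-row PySpark UDF that blocks on an HTTP call to an LLM API. Every row pays a network round trip, there's no real retry or resume story, and the executors mostly sit idle waiting on the API.

```python
# Sketch of the brittle DIY approach: a blocking, per-row LLM call inside a
# Spark UDF. Endpoint, model name, and column names are hypothetical.
import requests
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("naive-llm-udf").getOrCreate()

def summarize(text: str) -> str:
    # One HTTP round trip per row; no batching, no backoff, no checkpointing.
    resp = requests.post(
        "https://api.example.com/v1/complete",   # hypothetical endpoint
        json={"model": "some-llm", "prompt": f"Summarize: {text}"},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["text"]

summarize_udf = udf(summarize, StringType())

df = spark.read.parquet("transcripts.parquet")   # hypothetical input
df.withColumn("summary", summarize_udf("transcript")) \
  .write.mode("overwrite").parquet("summaries.parquet")
```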

And so that's where we see that there's a really nice opportunity is how do you build a lot of these properties really natively into an engine and a platform that helps companies go and scale their AI workloads. And as much as everyone, and I think it's true, like agents are here and, you know, 2025 is a year of agents. We're going to see a lot more of these come into production. But I also think you have a lot of teams that are starting to think about, okay, like,

I'm collecting all these transcripts. I have a call center of 60 people that are doing thousands of calls per day, and I want to perform semantic analytics on it. And our view is,

inference is the new transform. It can be used very much in the same way as what teams are used to when they're building pipelines for structured data. It's a very powerful form of transform. If you're trying to do semantic analytics at scale where there's no human in the loop,

how do we enable teams to really take advantage of this new powerful form of transform that's there? I like that. I've heard that before from the head of AI at Wise. He was saying, we should be thinking about these different LLM calls as just a way to take unstructured data and turn it into structured data. That's what you see. That's what we're trying to help teams really do. We think that teams are doing that already on their own.

But can you build something that treats this as really the first class problem that you're trying to solve? And let's take a moment real fast to highlight the difficulty of that scale in production. Because I remember I was talking to my buddy who works at Decagon and he was saying, you know,

To get the agents working for the customer service stuff, that's not really the hard part. We had a working demo up and we were able to sell within two weeks of when we generated the idea. Later, what's really hard and what kept us up at night is, how do we do that for a company that is receiving, say,

thousands of customer service requests per minute. Yep. And that's where batch inference becomes really relevant, right? And really prominent there. But then also what you were saying is like, you're taking all of these transcripts at scale

And there's a lot of things that you need to think about that are properties of, let's say, the AI models, right? So the context windows, right? So if I have this transcript, is it enough for me to take each individual message and pass it to the LLM? No, because it turns out that you need to have the context of the entire conversation. So are you going and applying arbitrary buffer around each individual message? Well, that's going to have you increase your input token count, which is going to make things more expensive. So there's a lot of these nuances and complexities

that when you're trying to run things at scale, you need to give people the expressivity to be able to, one, experiment before pushing to production. So how do we build this in a notebook environment? Test out different prompts that we have to make sure that the data quality is coming back. But then also what you're saying is true is like, how do we create structure of the unstructured? So really long transcripts, blobs of text into nice, neat structured tables.

And that's where the true power lies, because if you can take things and create structure out of it, now there's a lot more that you can do to control input token count. And you can do a lot more on understanding what the context window is and what you want to send to the LLM. And then you get into things like rate limiting and model latencies and model cascading and things of that sort, that are also really important when you're running things at scale.
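To make the chunking point concrete, here's a minimal sketch, with made-up sizes and a naive word count standing in for a real tokenizer: split a transcript into chunks that carry a little overlapping context, so each LLM call has enough of the conversation without paying for the whole transcript in input tokens.

```python
# Naive chunking sketch: overlap keeps conversational context, the size cap
# keeps input token counts (approximated here by word counts) under control.
from typing import List

def chunk_transcript(messages: List[str], max_tokens: int = 800, overlap: int = 2) -> List[str]:
    chunks, current, current_tokens = [], [], 0
    for msg in messages:
        tokens = len(msg.split())                 # stand-in for a real tokenizer
        if current and current_tokens + tokens > max_tokens:
            chunks.append("\n".join(current))
            current = current[-overlap:]          # carry a little context forward
            current_tokens = sum(len(m.split()) for m in current)
        current.append(msg)
        current_tokens += tokens
    if current:
        chunks.append("\n".join(current))
    return chunks

# Each chunk becomes one row to send to the model, instead of the full transcript.
```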

It does feel like too, if you're doing that kind of stuff in Spark, you're overengineering it. - Yeah, yeah, 100%. And if you're building it on Spark, right, I think it just, the architecture that was built, and it's a brilliant, super powerful platform, right?

but it wasn't built with unstructured data in mind, right? And so you're not getting a lot of the guarantees that you can get if you had a platform that was built for and around unstructured data processing. I also remember there was this guy, Zach, that was putting an agent platform into his company. And he was talking about how one of the things that they did was

was they were trying to allow everyone at the company to build their own agents for whatever the use case was. And I know that's a common thing that most people are dealing with. They're saying, "We know that each individual department needs agents, and we know that these agents are going to be better served if the subject matter expert builds them. So how do we create a platform that can allow folks to build their own agents?"

And one thing that he did was he created a metric for when someone is building an agent so that, hey, if you put this into production and wherever you're going to expose it, the expected traffic is going to be this much and this is the expected cost. And so that's a fascinating piece. But then the other side of that is I wonder how much you have been encountering folks like –

the people that we talk to in the process that say, we don't really look at cost in the beginning, we just want to make something that is working, so is it possible? Then we go into, how do we optimize it, how can we delight the users, and then can we actually, like, optimize that? Yeah, that's a great point, because there is also a hidden complexity that we didn't have before, again, with like

The thing with AI is that we really have to rethink how we're building software. So one of the things that AI enabled is that the people who are actually the domain experts, now they have to be part of building the technology itself. If you think how we were building software before, right?

We would get, okay, our customer, we have like a product guy, we have a persona, let's say like our persona is like salespeople. We try to understand what the salespeople need. Then we'll create some PRDs. We'll go to the engineering team. Like, hey, now you have to build that. How many story points? Yeah, how many story points. Roll out like the software, give it to sales and salespeople. Somehow it will work. Okay? Yeah.

But now that's not enough because this behavior of the software depends on the salesperson directly, right? So how do we involve that person in the process is something that I don't think we have figured out yet. And also how we...

we let the engineers keep doing engineering because we still need engineering, right? It's not like we can live in a world where everything is kind of random at the end of the day. And the salespeople at the end of the day, they are still salespeople. They get paid for selling shit, not for building software, right?

So it's not like, okay, you're not getting a bonus this month because your prompt was not good, dude. Like we can't do that. Right. So this is like, I think like a big challenge, especially for the people who are building like more customer facing products. But thankfully the salespeople don't have to write SQL though. Right. They can just write prompts, which is like, and like,

That's easier for them to be able to go and, let's say, iterate with than having to go through and try and do like the old business analysis. But also it brings up the point of this is built for engineers. It's not built for sales folks. And so the fact that, like you were saying, playing around in a Jupyter notebook, trying to give that to a salesperson is not going to work. How do you create an environment that someone is able to natively go in and it's like...

do you focus on the lowest common denominator of someone who is technically apt at being able to do something? Or are you trying to push for how do we enable our power users? - The role of like the heads of AI that we see now coming in and becoming very common at a lot of companies. Most companies that we talk to have a head of AI at this point, right? And really, I think the interesting part about that role is doing exactly what you're talking about, is trying to bridge that gap.

They're essentially like the AI Sherpas within the company, right? If you think about it, they're going to like the head of product marketing, they're going to the head of sales, they're going to all the other business functions there. And they're saying, hey, I'm the gatekeeper for AI. What do you want to try and build for this? How can we try and add value to your teams to be able to leverage AI? And so the head of sales will say, hey, look, like we're losing out on a lot of deals and

and all of the information we have is in this sales Slack channel that we have. There's a lot of data there in context around why did we not hit our numbers over the last week, over the last month and quarter? Can you build me some pipelines that are going to go through

and leveraging LLMs, try and identify the patterns in there so that we can then go and improve and increase conversion rates, whatever it is. If I know this sales team, I know the answer. And it's probably marketing. Marketing fucking sucks. But then you go to the marketing guys and they're like, okay, like, look at the sales guys, right? Or product. With marketing, they're like, it's product. We're giving them qualified leads, right? That's kind of the thing, right? And

And I think there are, there will be more advances there where it's like, okay, how do we get, but at the end of the day, like, do you want your head of sales to be going in and trying to work in a platform and experimenting with different prompts?

I think there's a balance there, right? But that's what these, I think, heads of AIs are trying to do is like create the AI roadmap based on all the feedback they're getting from other like business functions within the company, which is like a really interesting... Just Sherpa is a great term. So there is the UX part, which is like a big thing that needs to be like figured out. But there is also the, like the platform, like the infrastructure side of things, right? And I think, and going back a little bit like to...

Like, why you shouldn't do that, like on Spark, for example, right? The thing with AI and LLMs is that the workloads change their characteristics, like, dramatically because of LLMs and inference and GPUs. So in the past, again, 12, 13 years, everything was, like, pretty much CPU-bound.

Spark is like an amazing tool if you want to crunch numbers and make sure that your CPUs are always operating at 100%. That's what they're trying to do. First, and second, how we move data around. How we shuffle, because we need to move from one server to the other. And if you have thousands of them, then how do you make this reliable? So if something breaks, we can resume. Now, put an LLM in the equation.

CPU doesn't matter anymore because your CPU is going to sit idle there waiting for the GPU to return like a result, right? So from a CPU-bound workload, we go into like more of an IO-bound workload. Then

Okay, you're building your UDF, you're running these things, but guess what? There's zero reliability when you're talking to GPUs, right? Because the systems are not mature yet, it's kind of, I would say, like how the internet was before 2000. Like literally, back then, you know, you would connect to the internet and if your mom picked up the phone, you would lose your connection, right? Like, does this happen today? No, right? Right.

But that's kind of what happens with LLMs. You're aging yourself again, Kostas. It's okay. We can discriminate on age. We're in San Francisco. But then with the LLM, you send a request. You send, let's say, I don't know, let's say you have... You're using Gemini that has like a one-million-token context window. You send like one million tokens there, right? It starts doing its magic there and...

In the middle of that, it breaks. What happens? You can't resume that.
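One way to picture the reliability gap being described, as a minimal sketch with a hypothetical `call_llm` function and file name: checkpoint every completed call so a failure halfway through a batch doesn't throw away the work, or the tokens, you've already paid for.

```python
# Checkpointed batch inference sketch: completed rows are persisted as they
# finish, so a crash or a dropped request resumes instead of starting over.
import json
import os
import time

CHECKPOINT = "results.jsonl"   # hypothetical checkpoint file

def load_done() -> set:
    if not os.path.exists(CHECKPOINT):
        return set()
    with open(CHECKPOINT) as f:
        return {json.loads(line)["id"] for line in f}

def run_batch(rows, call_llm, max_retries: int = 3):
    done = load_done()
    with open(CHECKPOINT, "a") as out:
        for row in rows:
            if row["id"] in done:
                continue                          # finished on a previous run
            for attempt in range(max_retries):
                try:
                    result = call_llm(row["text"])
                    out.write(json.dumps({"id": row["id"], "result": result}) + "\n")
                    out.flush()                   # durable before moving on
                    break
                except Exception:
                    time.sleep(2 ** attempt)      # simple exponential backoff
```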

That literally happened to me yesterday. I did that exact same thing. I'm pairing up all of the attendees for the event tomorrow and I'm sending an email to them saying, hey, you should know each other because I think it would be great. And so I asked... You're trying to get a new company built, I see. Yeah, I'm trying to be that matchmaker again. Exactly that. And so I'm actually... Gemini, without me prompting it to, it says why it paired up each one, but it's taking a long time. And...

And halfway through, I said, "Alright, well, I'm going to go get lunch and it should be done by the time I get back." I got back and the whole thing was just not there anymore. It magically disappeared. So I had to prompt it again and then ask it to create it again. And you've already wasted cost on the input token for the first time. That didn't work, right? Even before you reach the cost thing, it's like the reliability thing, right? Because think that now,

You are doing this like 12 hours before your event and it breaks and you can't send the emails. Like you're not going to be in a very happy position, right? Imagine you are AT&T and you are going to process all the transcripts of the previous day, right? To create tickets for your engineers for the next day, like to work. And we are talking about like probably tens of thousands of hours of transcripts that you have to do there and that breaks, right?

Where's the SLA? There's no SLA. There's no SLA right now. It does not exist. And data teams are used to SLAs, right? That's how they get rated and qualified and like, are they doing a good job? Are they hitting their SLAs, right? Are these pipelines getting executed in the amount of time that they need to? Throw LLMs into the loop. It's a whole...

It's the wild west. Yeah, it's the wild west, right? Exactly. So you have this super reliable CPU-focused technology that is Spark. You put LLMs in the equation there and it doesn't work anymore. Like the reliability that you expect from something like Spark, it's not going to work and it's not its fault, right?

Or your clusters are just going to sit there, just waiting for the GPUs to return something back. And the reason I'm saying that is because people might say, okay, why not just go and build on top of that, right? For our LLMs. And the reason is that, because very, very soon, and again, let's say you have infinite money, like you don't have a problem with that, right?

you will have an extremely hard time creating reliable systems that will reliably deliver value to your company. And the moment that that happens, people will lose faith in the technology itself. And we see that with AI, right? People do lose faith because they don't care at the end of the day. They are not doing it just for the technology. The technology is making their lives harder. Why would they care, right?

Not everyone is, you know, like a tech junkie in Silicon Valley that has to work towards, like, artificial general

intelligence, right? Yeah, I think it's an important thing too, right? As you talk about that, that 70% of AI projects never make it into production. Everyone that's building or like trying to innovate in the AI space always famously claims that thing. And then they fit in somewhere under that umbrella for why the problem that they're trying to solve is the main reason for that.

And it's all based on conversations, right, that they're having with these heads of AIs and teams to try and figure out and pinhole what the problem is. And there are multiple layers to that, right? The one that we're very focused on is taking it and running AI at scale in production is a very hard, unsolved problem that teams just really need to solve.

They struggle with it because they're used to the world of things working relatively well, right? They don't always hit their SLAs, but when they have these pipelines that are running across large data sets, you know, they have on-calls, they have processes for being able to run these, they have retries, they have orchestrators like Airflow that are kind of good at helping run that retry logic and build all of that. And that's where we think that

The new workloads that are introduced and the new paradigm that we live in now, the infrastructure behind it needs to be rethought. Well, it's funny you mentioned that because I remember back in the day, my friend Diego was telling me about how he felt like there was going to be this new...

for folks that focused on reliability, but specifically for the ML systems. And he was like, maybe we could coin it as the MLRE instead of the SRE. It's the MLRE. But now it feels like what you're talking about is the AIRE, the AI reliability engineer who is going to be 100% heads down focused on, can we make this reliable? And that's,

that goes into all of these new metrics that you're talking about. It's not just, can we look at the logs and traces and pipe it into Datadog? Or can we have that kind of analysis? Yeah, they want to be able to feel confident around their pipelines and be able to even think about the next step: look, can we put an SLA on this, right? And given the current tooling, it's very, very hard to do that based on

the ergonomics and the tooling that's there in a platform that exists for working with these models, right? And so that's the AI infrastructure kind of opportunity that we're thinking about. Something about like the roles that exist already, because I think that there is a gap there. Someone has like to fill this gap. I don't know if it's going to be the like AI engineer, although my experience so far with AI engineers is

I feel like they're coming more from the data scientist side, which is great. You need people who can model and understand how to work with models and at the end of the day create something that delivers what has to be delivered there. But I think there's a huge opportunity there for people, especially from the data engineering background and the engineering background,

to step in and actually take this role. Because at the end of the day, the reason that they existed was to add reliability to working with data. Because working with data was always an unreliable business. It was always really, really hard, right? And the job of these people was to make sure that

Everything is delivered on time and is with the quality that we need. We have SLAs, we have all these things. And if you want to understand like a group of people, you have to look into the language that they are using, right? And

One of the most commonly used terms for data engineers is "idempotency". Idempotency is like the attribute of a system that if you put the same input, the output will always be the same. That's great because when you have that, you can ensure that there is a reliable system there. Now, that says something about data engineers, that they

literally breathe and live and exist for reliability at the end of the day. The same is like with ML engineers, but the ML engineers, I think they have an added advantage that they've already had to take into account that we are working with systems that are not deterministic, right?
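As a small illustration of the idempotency idea, with made-up function and path names: key each output by a hash of its input, so re-running the step after a failure lands on the same records instead of duplicating them.

```python
# Idempotent pipeline step sketch: the same input always maps to the same
# output key, so re-runs are safe. Names and paths are illustrative only.
import hashlib
import json
import os

OUT_DIR = "step_output"

def idempotent_step(record: dict, transform) -> str:
    os.makedirs(OUT_DIR, exist_ok=True)
    key = hashlib.sha256(json.dumps(record, sort_keys=True).encode()).hexdigest()
    path = os.path.join(OUT_DIR, f"{key}.json")
    if os.path.exists(path):
        return path                        # same input seen before: skip safely
    with open(path, "w") as f:
        json.dump(transform(record), f)    # same input, same key, same place
    return path
```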

And LLMs get this thing to the extreme. So I think these two groups of people, they have an amazing opportunity to actually transform themselves into something that will be extremely important for the future of the industry. And if they don't do it, someone else will do it, right? Exactly, the SRE. I was thinking 100%. It's a lot of folks that were in the MLOps community in the beginning, right?

came from that SRE background and they were tasked with

figuring out the reliability of the ML systems and the ML platform. And then you had this, the rise of the platform engineer. And a lot of those folks were SREs that were rebranded. And you have the platform engineer. Now you are probably going to start seeing more and more of these AI platforms. And one of the jobs that the AI platform engineer is going to be tasked with is exactly that. How can we make sure that

Whoever's building with AI can do that confidently. Yeah. And I hear a lot from people, especially like from the data engineering world, because, yeah, like if your whole existence is around determinism, right? Like when something comes that's like so...

different to what you're doing, like your initial reaction is to reject it, right? And that's like the biggest, I think, danger for them right now. Like you see a lot of rejection. It's like, oh, we're data engineers, like leave us alone with that LLM stuff. We're going to do our thing. You crazy AI engineer, go do whatever you want. I don't want to know about it. No, because if you don't focus on that,

and you are okay to feel a little bit uncomfortable, you have a tremendous amount of value that you can deliver, because reliability is literally what is missing

to turn LLMs and AI into what they are promised to be. So I think there is, for both data engineers and ML engineers, a huge opportunity here. They need tooling. I mean, the existing tooling is not enough for that stuff, but the tooling is a different conversation. It's not their job to build the tooling. The industry should build the tooling.

But the most important thing is the mindset that they bring and the experience that they bring. And that's something that no tooling can do, right? So these people are like literally sitting on gold, but they have to do something with it. Otherwise they will miss like a huge, huge opportunity, in my opinion. Yeah, I've heard some people talk about how

they can't connect the dots. You hear everyone banging and screaming from the rooftops on how AI is only as good as the data that you give it and garbage in, garbage out. These are like the tropes that are so common. And then I saw someone say, but explain to me how that's possible because right now I just go in and I give a prompt and there's no data that's going into that.

That's just the prompt. And so I was trying to put two and two together to really encapsulate why it is like that. And on one hand, you have me just going in, doing one-off tasks with ChatGPT or Gemini, and that's used as a bit of a...

or I'm talking to it, I'm trying to learn something new, I'm trying to understand something, or I'm using it more like a browser, I'm asking it to tell me these different things, that's not necessarily a data product. But then you have products that the company uses, and like you were talking about with the support system,

And all of the data that's going to be going into the context window there is not something that you're doing one-offs with. That's something that should be very operationalized. - Yeah, and so here's like an example that we use, one of our early design partners to help kind of crystallize this too, like for a use case, right? As like, we talked about it, this team has, let's say 60 call center folks. For all intents and purposes, it's an insurance tech company, right?

whenever you go and you get a new policy for insurance, you get this thing called a deck page, which is what they call it in the insurance. It's a declarations page. So it's like a summary of all your coverages and all of your policies there. Now, the problem here is that if

a call center representative misrepresents what's in the deck page. So let's say there was someone, this is an example that Kostas has recently built out, is like, I have roadside assistance and it says in the deck page it's only for 15 miles.

But the call center representative during the call with the customer tells them that it's for 50 miles, right? That's a liability that they're then taking on and could potentially get sued for that because they misrepresented what the actual policy is, right?

And so how do you build these pipelines that are going to, let's say, and this is where the data quality portion comes in too, you want to have the declarations page structured in a way, and be able to manipulate it using some nice text chunking and partitioning capabilities, along with a transcript.

side by side, and be able to go through and look for and filter all of the questions that were asked by the customer, and then the answers that the support person gave them, and then match that to the portion of the deck page that they're actually talking about. Right. You don't really care about the niceties, the hi, how are you? Like you don't want to feed all that shit. Shit, it's okay to say? Into, uh,

into like the LLM because it's gonna be more expensive. You don't wanna just take the whole transcript in and of itself.

You want to be able to partition and chunk it and only send the relevant information that you need in order to understand whether the customer support agent represented what was in the declarations page accurately. So you're taking the question and the answer, and you have the expressivity and the tooling to be able to do that, and then feed it in. So that's where the data quality portion comes in, to feed it into the LLM to make a decision as to whether yes, what the customer support representative said is correct.

Or no, he actually misrepresented it. The actual amount of roadside assistance this customer had was 15 miles as stated in this portion of the declaration page, right? So then you want to create that report very quickly that there was something that they misspoke or they misrepresented and send that up and escalate it to the team that can then be proactive around handling this case so that they don't end up getting sued in court.

and having to pay out for this misrepresentation, right? And so this is an example of what teams are thinking and trying to do at scale, right? Where it's not like I'm sending this response immediately back to like an AI agent or there's like a human in the loop. Sure, you want to have

somewhat real-time, which is like an overloaded term, but within the next few hours is totally fine to be able to do that. But you're getting these transcripts that are coming in thousands a day. How do you go and actually build these pipelines in a way that is trying to create some determinism on top of the ability to work with these models? And so context windows, chunking, partitioning, all of these things, we need to arm

AI engineers, data engineers with the ability to actually build these in a robust manner and then give them some of the guarantees like Kostas is talking about that

They're already used to, right? Like, I need each transcript, the median time to have it reviewed, or the mean time to have it reviewed, needs to be three hours, right? And so be able to give these teams those guarantees that within three hours of a customer conversation happening, we'll know if the representative did well or if we need to go and fix things on the back end, right? Yeah, and I want to add something here. There's always data.

You can't say that there's no data. Even if you want to keep it just to the, oh, I'm asking a question to the LLM about, I don't know, how to change... My baby's diaper. In a proper way. Your prompt is the data. Actually, the structure of the whole dialogue on its own is important just because of how LLMs work. LLMs, you...

take also like the previous conversations that you said, like you feed it back there. Like, so you create data that you feed to that. The LLM itself is built on data, right? And then anything that is, let's say, outside of like the trivial things of asking what we would ask like on Google, for example, it requires like extra data. Like there is a reason that tools are becoming so important. Like we wouldn't have agents if we didn't have tools, right?

A lot of the work that you're doing with tools is actually fetching data. Now, it might be, let's say, if you are using Cursor. If Cursor didn't have the context of your code base, it wouldn't help you. And what is your code base in this case? It's data, right?

When you use Claude Code, or whatever it's called, and you're like, hey, find me the file that does this, this, or that in my code base, and it runs a tool that does a find and a grep,

it gets data back, right? Your code base again, and the outputs of these tools are the data. You see that already we are creating actually something that I think like the data engineers and the ML engineers again would be very familiar with, like we're building pipelines of feeding data in, getting data out, and using that in the next step, blah blah blah, right?
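A toy version of that point, with hypothetical paths: an agent "tool" is often just a function that fetches data, here a grep-like search over a code base whose output would be fed back into the next LLM call as context.

```python
# Toy grep-style tool: the "tool call" is really just data retrieval, and its
# output becomes part of the next prompt. Paths and patterns are hypothetical.
import pathlib

def grep_codebase(pattern: str, root: str = ".") -> str:
    hits = []
    for path in pathlib.Path(root).rglob("*.py"):
        for lineno, line in enumerate(path.read_text(errors="ignore").splitlines(), 1):
            if pattern in line:
                hits.append(f"{path}:{lineno}: {line.strip()}")
    return "\n".join(hits[:50])            # cap the context handed to the model

# tool_output = grep_codebase("def load_config")
# next_prompt = f"Here is what I found in the repo:\n{tool_output}\nExplain it."
```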

The deep research functionality, it's pretty much, okay, I'll search on Google or Bing. I'll get the raw HTML that is returned from the queries that I sent there, and I'll work on that. That's data again. Anything non-trivial using like an LLM requires data. So I think...

The way that we think of LLMs as, let's say, this kind of oracle, it's not accurate. Yeah, you can do it. And that's part of what made them so successful, because it was so easy for people to experience something by just talking to it. But at the end of the day, what we are doing with LLMs is that

And let's go back to the SaaS example, right? So what we were doing was building software and we were forcing people to learn how to think and operate the way that the machines can do it, right?

And now we changed the equation there, because we made the machines be more like thinking and working like the humans do. That's what makes it so accessible. But at the end of the day, we still have a machine that has to do something. We tell it with natural language to do that, and they are very generic. They can do many different things, but still they are going to do the vacuum. And everything is driven by data at the end of the day.

Even in the online use case where you are just chatting with the bot and asking about things, you will copy-paste something. You will take a picture from somewhere and be like, hey, but the CSS here doesn't look good. Look at this picture. That's data, right? The difference is how the things that you do, just you as Dimitrios and Gemini and your Excel sheet with the attendees there,

And there is the company that has to do that at scale every other day for all the new leads that they have. So that's a different approach. You can't do it with chatbots anymore. And that's what we are talking about, how you put these things into production. One thing that I feel like we need to hit on if we're talking about reliability is evals and how you think about...

reliability in the context of evals and getting that also like where do they fit in in your worldview? One of the problems that we have is that the first iteration of evals platforms were inspired primarily from I would say more of like the engineering that happens in the application layer.

So if you think about evals, like in the common case, you have a model, an input and an output. And that's what we care about. So you're saying, okay, I put this into this model, I take this output, is this output what I expect to get?

Now, the problem is that, in my opinion, with this model, is that there's a lot of context that is actually missing, right? Especially in cases where you have to invoke many different models to achieve a goal, right? So let's take the case of processing a transcript, right? What most people do is, okay, the first part is I'm getting my audio file.

I'm going to use something like Whisper to turn it into a transcript. That has some issues, right? Like the output. I mean, it's usually really good, but still there are things that need to be corrected. You get this big chunk of text that you have there. And then what people will do is like, okay, clean it up, maybe use an LLM to go and fix some of these issues. So let's say something is misspelled, right? LLMs are great to go and find these and fix them, but still they can make a mistake, right?

Now, this is like on the very, very first stage of like processing that, right? Now, the next step is, okay, let's start like creating some summaries. So we will break it down into some pieces, create a summary for that, store the summaries. Then on another level, we'll take all the summaries, create a summary of the whole thing. And we'll end up, let's say, with the summary that you will put on your like website when you put the podcast episode out there, right? Yeah.
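Here's a minimal sketch of that multi-stage pipeline, assuming hypothetical `transcribe`, `clean`, and `summarize` model calls: every step is recorded against a run id, which is what makes it possible later to evaluate the chain as a whole rather than one call at a time.

```python
# Multi-stage transcript pipeline sketch with per-step tracing. The model
# functions are placeholders; the point is that each intermediate output is
# recorded so the whole chain can be inspected and evaluated afterwards.
import uuid

def run_pipeline(audio_path, transcribe, clean, summarize, trace: list) -> str:
    run_id = str(uuid.uuid4())

    raw = transcribe(audio_path)                       # e.g. a Whisper-style model
    trace.append({"run": run_id, "step": "transcribe", "output": raw})

    cleaned = clean(raw)                               # LLM pass fixing names, typos
    trace.append({"run": run_id, "step": "clean", "output": cleaned})

    chunk_summaries = [summarize(c) for c in cleaned.split("\n\n")]
    trace.append({"run": run_id, "step": "chunk_summaries", "output": chunk_summaries})

    final = summarize("\n".join(chunk_summaries))      # summary of summaries
    trace.append({"run": run_id, "step": "final_summary", "output": final})
    return final

# An eval that only scores the final summary can't tell which earlier step,
# say a bad correction in `clean`, actually caused the problem; the trace can.
```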

Now, if you consider, let's say, the eval just as individual step there, right?

You're having the problem that you can't evaluate the whole pipeline, going from the audio to the end result, which is the summary that you have there, right? To do that, you have to trace all the calls and you have to consider all the calls, and you have to see, maybe the LLM that corrected some references in there made like a huge mistake.

And that changed completely all the summaries, right? I'm exaggerating, but the thing is that a step at the beginning, right, can affect the result at the end. But if you take like a call in the middle, it might still be perfect, right? But the data was like wrong. So of course, like the output was like wrong. So the question is like, okay, how can we work on that? How can we build these more complicated workflows? And I think that's,

If you take into consideration agents, for example, it's even worse because the agent can make tens of different calls, go back, run calls again. And how do you evaluate the output at the end? Because that's what I see at the end, right? I ask Claude to write some code for me. It can take a few minutes, right?

who knows what he's doing there, but definitely there are many back and forths and calls to the LLM. So what do I want?

So I think this is an important thing that is missing, and we'll get there. But I think we need to rethink also the infrastructure that we are using for that, because now we're talking about a lot of data, and data that it's not like a unit test. It's more of like how, let's say, in the observability world, we were doing traces over distributed systems, right?

So that's what I want to see out there. And I think that is going to change a lot, like how people work and how they can build actually reliable systems and incorporate them. I think as we talk about that theme of 70% of projects don't make it into production, that's one thing that we hear from the AI leaders out there, right? Is,

Great, I built a lot. You tell me inference is the new transform, right? So now I'm counting on the results of these pipelines to be mission critical. They're making business decisions for me. So they say, how do I know that this pipeline, this multi-stage pipeline, like Kostas is saying, this multi-stage pipeline that's going through a bunch of different

inputs and outputs into LLMs ended up making the right decision for me, right? And that's really what is top of mind for that. And now these eval platforms that have come out, there's lots of them. They all have different flavors and different angles, I would say.

They're a very important part of building trust in your AI pipelines. But really what we need to be able to do is, if we have these multi-stage pipelines that are running in production, we need traceability all the way up. So how can you traverse up from the final decision

what was the input for getting there, but then also what were the stages before that? And how do we provide visibility and observability to AI leaders to really build confidence and understand which step of the pipeline was actually the wrong one, where did the model not perform well, which then propagated down to the next stage. The same thing for like AI agents, right? Like they have 10 steps that they need to go and do. Which step of that AI agent's workflow

Was it not good? Do I need to go and tweak the prompt? Do I need to go and modify and iterate on? Right. And so that's where I think those next stages, it's also for agents, also for these multi-stage production pipelines that we think are very important for businesses to run to. And you need to provide them that level of confidence. So you can do things like give confidence scores and things like that that models have, right?

But what Kostas is saying is true: you need to have all of the outputs and the reasonings behind these models and how they made decisions, and give them a really nice interface and tooling to be able to then go and review it very quickly and know what they need to go and iterate on in order to get that output data quality to a level

they feel good about. Maybe it's not 100% of the time that they have confidence, but when they're running these things in batch at scale, if we can get 99% confidence that the output for this mission critical pipeline is good, that's probably good enough for us. But there's tooling and info that needs to get built for that. And to add something here, because that, I think, is relevant also to the previous generation, let's say, of data infrastructure.

So, like a term that every data engineer and probably like every data practitioner is like familiar with is like data lineage, right? So everyone's thinking like a very important part of ensuring the quality of our data is like keeping track of the lineage. Like how, okay, I have this end result here, this report.

How did this report come to be what it is when there are literally hundreds of tables that we have to operate on to get there? And it appears that when we're working in a fully deterministic world, just keeping track of the column level is enough, because knowing the data type

is enough to reason about what is happening, right? And it fits well with the columnar, let's say, nature of these systems, the OLAP systems that we have. But you can't do that anymore. Slight differences, small differences in the input that you put in and the prompt that you add there, which you didn't have before, right, as an additional piece of

data there, can change dramatically the output that you get on the other side. So what you need now is more of like row-level lineage, which is a very hard problem. And it's not something that has been...

developed, primarily because it was hard enough and not needed enough for people to invest in it. But I think that's something that we are also working on ourselves. I think this is part of how you give people traceability there, and it will completely change the quality of the evals that you are doing. I'm not sure I fully understand row-level lineage.

So the row-level lineage is that, let's say you start... Let's take, again, the example of the transcripts, right? The output of your Whisper model is going to be a blob of text, right? Now, you might do a few things like, okay, break it down into pieces, chunk it, da-da-da, whatever. Now, the next step is, for each one of these...

chunks that you have. Each chunk is a row, right? This is going to be fed into an LLM with a prompt that says "Do you find any references here for the names of the participants that they are like mistakenly transcribed? And if yes, fix them." Right? Now,

Your first row got into the LLM and the second will do the same, the third will do the same, blah blah blah, right? Now the next step is that you are going to take each one of them again, right? And create a summary of each chunk, right? So you see that each row goes through steps of processing

But because of the differences in the data that you feed at the row level, right? And how the prompt might change for each one of them, you might have different results. So you want to be able to track that, something that you didn't need to do before. Kind of like a tree traversal exercise. You wind up in this leaf node, but then what were the nodes before that that led to that, right? Right.

So that's where you want to kind of traverse back up this multi-stage pipeline and see. So if, let's say, you have your end summary, right? Because at some point you get all these mini summaries and you put them all together into an LLM and you get the output, which is your summary. And you see the summary there and you're like, no, I don't like that.

So now you want to go backwards, right? And you're like, "Okay, what data contributed to this particular output that I have here?" Right? And then you say, "Oh, it's these five mini-summaries that I have." You want to be able to track these five and recall them, so you as the human, or whatever eval machinery you are using, can have access to this particular data.

With the lineage as it was before, you can't do that, because the lineage is just keeping track of the metadata, like the columns that participate, right? But now you can't do the same thing, because the actual data has a big effect. It's not just the data type. It's not that something, let's say a join, breaks because I was expecting a data type of integer and it was a string, for example, something like that. Like, okay, I'm just making things up now.
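Here is a rough sketch of what row-level lineage could look like in practice, using the transcript example above. Nothing in it is a real library; the parent-ID convention is just an assumption to show how you could walk back from a final summary to the chunks that produced it.

```python
from dataclasses import dataclass, field

# Assumed convention: every derived row keeps the IDs of the rows it came from,
# so lineage is tracked at the row level, not just at the column/metadata level.
@dataclass
class Row:
    row_id: str
    stage: str            # e.g. "chunk", "mini_summary", "final_summary"
    content: str
    parents: list = field(default_factory=list)   # row_ids of upstream rows

def trace_back(row_id: str, rows: dict) -> list:
    """Walk parent links from one output row back to all contributing upstream rows."""
    seen, stack, lineage = set(), [row_id], []
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        row = rows[current]
        lineage.append(row)
        stack.extend(row.parents)
    return lineage

if __name__ == "__main__":
    rows = {
        "c1": Row("c1", "chunk", "chunk 1 of the transcript"),
        "c2": Row("c2", "chunk", "chunk 2 of the transcript"),
        "m1": Row("m1", "mini_summary", "summary of chunk 1", parents=["c1"]),
        "m2": Row("m2", "mini_summary", "summary of chunk 2", parents=["c2"]),
        "s1": Row("s1", "final_summary", "overall summary", parents=["m1", "m2"]),
    }
    # "I don't like this summary" -> recall exactly the rows that contributed to it.
    for row in trace_back("s1", rows):
        print(row.stage, "->", row.content)
```

The point of the sketch is only the parent links: they are what lets a reviewer, or an eval, pull up the exact upstream data behind a bad output instead of just the column metadata.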

It's much more like you have to get into much more detailed views of the data itself to understand how the LLM at each stage operates to give a good or a bad result. And you have to be able to navigate that. And the data infrastructure does not keep track of that. So it's not information that you can recall. And that's one of the things that I'm saying:

it is a big problem to solve, because it's not just the evals from the point of view of how to scientifically do an eval that has validity. It's also how to capture all the data, which adds overhead, store all the data, which adds overhead, and process all these evals that now explode in terms of the number of evals that you have to do there, right? What you're talking about is...

the logs and this data lineage that we have are not sufficient. They're not painting a good enough picture for us, because even if we know that, yeah, this call went through successfully or this data was transformed in this way, we can't get that granularity that you're talking about. And so you don't see that as something that's

more the job of an observability tool? It's the same thing as we had if you think about the data platforms, right? So you had something like Databricks or Spark that would go and actually execute whatever logic you deploy there, right? Then you most probably would have another tool that analyzes the lineage or does the quality checks, blah, blah, blah, whatever. Now,

take the QA of data and the lineage there as inputs and substitute the names with evals instead of QA, right? And the lineage still has to be there. I think...

I think one of the problems is that we've been building applications around LLMs for chat modalities primarily, right? So, of course, when you do that, it's all about, okay, that's the prompt of the user. That's like the output. Is this like a good one, right? Yeah.

But when you start getting into an environment where the interactions with the LLMs for an outcome become much more complicated and they have dependencies between them, right? You have to expand your understanding of that to the whole pipeline that is built to do that, right? Again, you might need...

probably a different tool to do that, but still this different tool needs to access data that the engine that does the processing has to capture, which on its own is a hard problem. And then it has to work with... It will face the same problems that observability in the app world has, which is there's a lot of data. The value per piece of data is not that high.

So we have to be extremely good at storing this data to make it affordable. And then you have the additional problem of evals being slow and expensive. So again, how do we pick the right evals to do? And what kind of tooling do we have to give to the users to build the Splunk of LLMs at the end of the day, right? I don't know. I'd love to see...

how this is going to come out, because I think it is a pretty lucrative space to build in. Although I know there are, as in any other LLM-related activity, thousands of companies trying to do it. But I still believe there's a lot of noise and not that much signal. And again, what I'm trying to say is that I think there's tremendous value for the data people to go and build

solutions for LLMs, because primarily it's driven by more application engineering people, and there's a lot of foundational stuff that comes from the data world that is needed if we are going to be building with LLMs. And I wonder, in your time talking with folks, it feels like right now, because there's so much open space and there's so many new pieces that we're trying to add to our

ecosystem of putting this into production, there's a lot of things that you could do. Where have you seen people focusing on what absolutely needs to be done before we can do anything else? Like what are the main bottlenecks? Do I need to go out and get an evals tool? Do I need to go out and get a pre-processing

proxy, or like an AI gateway? Yeah, the eval tools are important, right? You need that feedback loop, right? To be able to help you understand if you're building effective AI, and if the output of what you're doing isn't what you're expecting, why isn't that, right? So these eval tools provide you that feedback loop to be able to go and do that. You know, something that I always think about when you're talking about the tabular and structured data world, right, is

engineering teams were really good at building canary builds, for example, that go back and try and sense any regression that happens in these pipelines that are more or less deterministic because you're dealing with structured and tabular data.
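As a rough illustration of the kind of canary check being described here, not any vendor's tooling, a minimal sketch follows: a deterministic drift check against a baseline run, plus one assumed way to extend the same idea to non-deterministic output by scoring each result with a judge function. The thresholds and the `judge` callable are hypothetical placeholders.

```python
def drift_rate(baseline: list, current: list) -> float:
    """Fraction of rows whose value changed versus the baseline run (deterministic case)."""
    changed = sum(1 for b, c in zip(baseline, current) if b != c)
    return changed / max(len(baseline), 1)

def canary_check(baseline: list, current: list, max_drift: float = 0.05) -> bool:
    """Fail the nightly canary if more than, say, 5% of outputs drifted from the baseline."""
    return drift_rate(baseline, current) <= max_drift

def llm_output_check(outputs: list, judge, min_score: float = 0.8, min_pass_rate: float = 0.99) -> bool:
    """For non-deterministic outputs: score each row with a judge (an eval model,
    heuristics, or sampled human review) and require e.g. 99% of rows to pass."""
    passed = sum(1 for o in outputs if judge(o) >= min_score)
    return passed / max(len(outputs), 1) >= min_pass_rate

if __name__ == "__main__":
    # Deterministic canary: yesterday's pipeline output vs today's.
    print(canary_check(["a", "b", "c"], ["a", "b", "x"]))  # ~33% drift -> False
    # Non-deterministic canary: a toy judge that only checks the output isn't empty.
    toy_judge = lambda text: 1.0 if text.strip() else 0.0
    print(llm_output_check(["summary 1", "summary 2"], toy_judge))  # True
```

The hard part the conversation goes on to describe is the judge itself; on unstructured text it is an eval, not a string comparison, which is exactly why it gets slow and expensive at scale.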

And so we'd run these nightly canary builds and the output of that would be like, oh, there's like 4% drift or like 5% drift because you introduced some regressions by adding some application code that actually caused some drift from what we're expecting to have the output of these pipelines be. That same kind of mindset needs to be applied now and thought about. And that's where I think we're talking about the opportunity here is like, how do we take that same concept and

and allow teams to be able to build these kind of canary pipelines, for example, on top of output for non-deterministic models. And that's a very big challenge because the nature of the data is totally different. You're dealing with lots of text,

These evals are very expensive to store and process and build insight on top of. Now, the more complex the problem, I think the bigger the opportunity from an engineering standpoint, and I think the more fun it is to go and solve it. That's why I think it's wide open for helping teams

build that, right? Like imagine if you're able to, like the transcripts example, right? You're able to assign certain scores and confidence levels to the output of the pipelines that you ran on a daily basis based on certain properties that you consider to be successful outcomes of the pipelines, right? Which is what kind of canary builds were built for, right? It's a very deterministic way of being able to evaluate

your software and your pipelines. And so applying that same concept to unstructured data processing is, I think, going to be very important and a huge unlock for AI and data teams to feel confident about putting these into production and leveraging inference as the new transform. Everyone would like to do something with AI, right? Everyone's like, oh, okay, we should invest in the cloud.

And I think that's one of the big reasons behind having the role of the head of AI: there is a mandate from the board, or whatever, that we need to do things with AI. We have no...

idea what to do. But it's a race, because all of our competitors are doing it. We need to be on that. So we'll bring this poor guy in whose job is to go and find use cases. So he will go, as Yoni said, to the marketing folks and be like, hey guys, how can I help you? I have budget, by the way. So that's great, right? So let's do it. Salespeople, the same thing; engineering, the same thing; product, the same thing. And then they have to go and build something.

And so the first thing is, okay, many companies are still at the stage of: where should we focus? From all the different things that we can do, where should we focus, right? Which project should we run? And these are typically companies that are early in the journey. Then you have cases where they build the demos, but...

then it's hard to deliver the value consistently to whoever the stakeholder is. Because again, salespeople don't care about your shiny technology. If you are going to help them, you have to help them, right? And you are talking about business lines here that are as quantitative as they can be. It's like, I have a quota, dude, or I'm going to lose my job. So can you help me with my quota or not? That's it. If you help me 30% of the time,

I have to think, is that worth the time of doing it? I don't know. And potentially you can mess up what I'm doing 30% of the time because you're sending the wrong emails or you're sending some hallucinated jargon. 100%. And then we had some interesting conversations with folks who were saying, well, we ran a few experiments, right?

And our experience was that, and I'll give an example. Let's say we have tickets, right? Support tickets. And we want to, first of all, label them somehow into categories, and then extract some information from them that can drive our product decisions, right? So what they usually do is they find a company that provides labeling with LLMs or whatever as a service, right?

They go there and they're like, "Okay, I have to take all my tickets. I have to upload the tickets here. I'll run this thing." Something comes out; well, it's not exactly what I expect. Kind of a black box. Yeah, like iterate, iterate, iterate. Actually, what he was saying was, we did a lot of prompt gymnastics to make it work. And actually, one of the very interesting pieces of feedback on that was that

when you are doing classification, you still have the problem that you have to tell the LLM what the classification scheme is, right? Or use an LLM to figure it out, but still you have to figure out what the classes are, right? It's not an oracle that will just come out and be like, yeah, that's what you should do, and, shut up, you don't have an opinion, human, you know? So we did that, it took some time. We ended up getting an output, which was a data set in CSV that we could download, right?

I was like, okay, that's too much work to start. Because here's the thing, when you get your labels, that's when the real work starts, right? You still have to go and figure out insights out of these labels. So the guy was like, okay, I can't do that. I can't be in the process of downloading, uploading...

getting CSVs, putting them somehow on my Snowflake, and then having an analyst also go and analyze these things. Okay, that's not going to work. Extra work. Exactly. So it's a lot of, I think... It's probably more of a product mistake, but again, people need to understand, they need to meet the users where they are and not try to force them to do things that...

do not fit their workflows. Because a product person is a product person. Again, he's getting judged by the business based on the product work that they are doing, not the labeling that they are doing. Yeah, it's potentially a huge distraction. Oh yeah, and it is. And then you have the companies that manage to get to the point where they have things that work.

But then they're like, okay, how do we put these things into production? Which on its own is a big, big conversation, like what that means and what risk it puts on the project itself. Because again, the people on the other side are waiting for results, right?

And then there are very few companies, usually either Fortune 20 type of companies or very Silicon Valley high-tech companies, which, by the way, many of them say they're AI and they are not really AI, right? But there are some of them that are doing it, like they put things in production seriously. But we are talking about, I don't know, maybe tens or

low hundreds of companies out there that have successfully done that. The other thing we see too, the common theme, is everyone gets wide-eyed when the head of AI comes to them and is like, "What can I build you with AI? Just tell me, I have a whole team. We'll go and build this for you. It's going to be great. I got budget. I got everything going. Whatever you want, we'll top it off with a cherry for you." And they'll go and they'll come up with these ideas. They'll create their AI roadmap for the quarter. The team will go and execute on it.

And then he turns around and says, yeah, we built it. We spent the time on it. We delivered it to them with a cherry on top and a nice bow, and they don't use it, right? And so I think a lot of it also has to do with this: you really need to go in and identify the high-value use cases that are business critical for you, right? If you get something that's

kind of supplementary and not really easy to integrate in the day-to-day workflows of a product marketing manager, for example, they're probably not going to use it.

And so I think it's very important, when you're thinking about considering and taking on new AI projects, that there are business outcomes and key success metrics attached, right? Like, okay, content moderation, for example, is another big example besides the transcript thing, right? These companies that are building communities where the users are uploading content,

And they're going through and spending tons of money on content moderation teams, which means humans are going through and reading every single message

in order to understand whether the message was safe or not to be published into the community. It's all about keeping the community safe. So if there's any racist undertone or sexism or things like that, they immediately want to disqualify it; profanity, things like that too, right? And so when you have a use case like that and you're spending hundreds of thousands, sometimes millions of dollars on hiring this workforce that's just literally sitting there reading messages manually,

Now there's a business outcome associated. How can I reduce my cost? Because that model doesn't scale. Now, let's say I want to expand to different locales. I want to offer this in Portuguese, and I want to offer the same community in Spanish or French. I have to go and build out these content moderation teams that are French-speaking and all of that. LLMs are great at that use case too.
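A minimal sketch of the kind of moderation pipeline being described, assuming an OpenAI-style chat completions client; the prompt, labels, and model name are placeholders rather than a recommended setup, and any other LLM provider would work the same way.

```python
from openai import OpenAI  # assuming an OpenAI-style client; any provider would do

client = OpenAI()  # reads OPENAI_API_KEY from the environment

MODERATION_PROMPT = (
    "You are a content moderator for an online community. "
    "Classify the message as SAFE or UNSAFE. If UNSAFE, name the category "
    "(e.g. racism, sexism, profanity). Answer with one short line."
)

def moderate(message: str, model: str = "gpt-4o-mini") -> str:
    """Return the model's SAFE/UNSAFE verdict for one community message."""
    resp = client.chat.completions.create(
        model=model,  # placeholder model name
        messages=[
            {"role": "system", "content": MODERATION_PROMPT},
            {"role": "user", "content": message},
        ],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()

if __name__ == "__main__":
    # Batch the queue of user messages instead of waiting for a human shift to start.
    for msg in ["Welcome to the forum!", "<some message flagged by users>"]:
        print(msg[:40], "->", moderate(msg))
```

The same prompt works regardless of locale, which is part of the appeal versus staffing a separate moderation team per language.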

And the value is very apparent there, right? Like now no longer do I have to go and spend millions of dollars on hiring these folks. I can dedicate that cost towards inference, build these really robust pipelines that are helping me do content moderation and even get better performance outcomes and metrics. Like my mean time to review a message or a conversation is no longer eight hours because I need to wait for...

the content moderator shift to start in France or whatever it is. I think that's one of the key lessons that we're seeing a lot too is,

As a company thinking about bringing on and building AI use cases, make sure that there are key success metrics and business metrics that you're targeting and that you're able to track as you're putting AI into production. I heard it put on an X and Y axis. And I think the guy's name was Sergio who said it, when I was at

the Gen AI conference in Zurich. And basically he said, put on an X and Y axis the impact, and then the confidence that you can actually implement it. Yeah. And whatever is highest up in that top right quadrant, that's what you should start with. Yeah. And the confidence that you can implement it comes down to the tooling and infrastructure that you have to do it right. And that's where I think we're seeing a lot of innovation going now. It is helping build that confidence

for teams to be able to say, I know I can put this into production. And this is a very common issue. Now the only other question is: is there business value? And that's potentially up to the stakeholder to decide, right? Yeah. There's something else that I want to add to that, because we see,

You know, having budget is kind of like a blessing and a curse at the same time for these teams, especially in the market as it is right now. Because one of the patterns that we see is that, okay, usually the head of AI is not necessarily a technical person themselves, right? So they rely a lot on the teams to find the tooling that they need, right? And obviously there's a lot of offerings out there. Now we are still at the stage...

of the industry where there are so many verticalized, very specific tools that do one thing and they're pretty much like a black box. They have money, so they go and get that stuff. And it's good because in a way it helps you kickstart your projects, but there's a huge trap in that. You can't engineer systems with black boxes. You can't do that.

And people need to understand that no matter what LLMs are, there's true engineering that needs to happen to make them robust and deliver value at the end. So my advice to the people is that you should try and invest more in the infrastructure and your knowledge of how to build the right things and what practices will drive you there instead of being like,

"Oh, you know what? Like, okay, I need to OCR something here. Let's go and like use every black box OCR thing there that says that they are 5% better than Mistral or like whatever out there, right?"

And do it for prototyping, 100%. But I guarantee you that if it's successful, you'll get to a point where you'll be like, okay, now what do we do with this thing? Because it's either too slow or it's not reliable. Oh, now we are getting outputs that we can't really understand why we are getting. Or, oh, now we are getting into a different use case with different documents, and the models we are using here are probably not as good at it as they used to be. So now we are...

Are we adding another tool that we have to manage, and who's going to do that? Right? So I think engineers should keep thinking like engineers: invest in tools that are, let's say, good infrastructure for their work, and they will build the value on top of them. And one last thing on that is,

there is a reason that this revolution of AI goes through the engineering practices first. The reason for that is that the work that we are doing as engineers is easy to validate. If I ask the LLM to build me a function,

you can almost automate figuring out if it's working or not. You just run it. You compile it. Now there are many super impactful problems to be solved with LLMs where you don't have that. If you are doing the things that Yoni was describing with the transcripts, how do you validate with 100% confidence that this thing is going to work?

And that's something where, again, I'll go back to the data practitioners and how important their knowledge and experience is to making this successful, because they know that data drifts. That was always the case.

Like, there's nothing that you can just build it once, put it out there, and it's going to be working forever. And that's going to become even more true with LLMs. So you have to build systems. You can't just throw black boxes in there and make things work. You have to be an engineer. You have to engineer this, and you have to keep iterating on it as the data drifts, as the needs drift also from the users. And with LLMs, this is going to happen like...

at a much more accelerated pace. So again, focus on core skills and infrastructure. I want to connect the dots on two things that it feels like we're dancing around, which is that business value, and the skill of being able to sniff out that business value and understand how to properly implement it, is one thing. But then going back to, since 2020, I've heard almost...

everyone that has presented at an MLOps Community event talk about, in some way, shape, or form, in different words, how

I can build the best model. I can build it with the highest accuracy score. I spent five weeks tuning it so it went from 95 to 98.1 accuracy, or whatever that metric is inside of it. But then I gave it to the people that were going to be using this model, and they didn't use it, and it fell flat on its face. And all of that time that I had spent on it was for nothing.

And so the whole idea of being able to make sure that what you're building is the right thing and that you're spending ample amount of time getting it into production as quickly as possible to know if it is the right thing or where you need to tune it is so important. And that's like what you're saying, Kostas, here is get it in there, start working with it, engineer it in a way that you can then go and debug it when you need to figure out if something

is not working, because we've absolutely missed the mark on the product, or we missed the mark on one of the steps in this pipeline.

Yeah. It's not about the models anymore. The models are going to keep getting better, keep getting more accurate, probably cheaper and a lot more performant, less hallucination, all that kind of stuff. It's all around the infrastructure you have around it now, right? And your ability to have that really tight feedback loop to be able to know and build confidence around the outputs of the pipelines that you have and be able to trace all the way back to be able to iterate and make improvements incrementally on that.

And that's where we are now in the AI innovation space. There's lots of great innovation happening towards that. And I think in the next couple of years it's gonna be a lot more prominent, where we're starting to see teams, like you're talking about with the X and Y axis, the confidence level is gonna be going way up because of the infrastructure and tooling that's being produced now.

I want to give a quick shout out. Demetrios, you are the ultimate community builder. And Kostas and I are always super impressed and admire how you just seem to be everywhere all at once. And we're like, oh my God, I don't know how he does it. And it's amazing seeing the rise of the community and how you've been able to grow it, and all the conferences that you're doing. So,

lots of respect, man. And it's been really fun just seeing all of it happen over the last few years. Well, I'm glad that we got to make this happen. Yeah. And thank you. Big thank you for making typedef and our company happen. One of the things that's super hard when you're building a startup is finding co-founders. And you know, we're both second-time founders, so we put even more emphasis on it.

So we don't take for granted the fact that you made us meet up at some random Blue Bottle in San Francisco and then, five minutes into the conversation, decided that you had better things to do. It rarely works, but this time it did. So I'm going to put that on my resume now. Yeah, so thanks for everything, man. Communities have always been very important in building technology, and they always will be. So...

That's another kind of service that you provide, bringing the people together, especially when you're solving problems that are not even well-defined. At the end of the day, it's all about emerging patterns through people interacting who are passionate about what they are doing and trying to find solutions. So that's probably the most important thing. And I think it's beautiful, too, that you mentioned that

people come from different backgrounds, like the data engineering background, the modeling background, data science, the SRE background, and in this space, specifically this space, getting to see how each one of these folks is attacking the problem is really cool. And it makes for fertile ground for innovation. Oh, a hundred percent. And I think if we want to succeed, we need to somehow increase the cross-pollination of

these communities, and that's your job to do, obviously. But there is tremendous value in bringing these diverse, let's say, engineering disciplines together. Because here's what is very interesting, and kind of why I'm excited about LLMs: we talked a lot about my age earlier.

But LLMs, in a way, make me feel young, because they remind me of how technology was when technology was young. You know, people...

complain today that, "Oh, this thing is not reliable, blah, blah, blah." But they forget that in order to get our databases to be transactional, and probably kids today don't even know that there are transactions that ensure that when you write something and I write and Yoni writes, it's going to be the correct thing, it took decades of research and development to get to that point, right?

So we are again in these early stages of a new, potentially very transformative technology. And it feels nice. I mean, it's not easy, but you know, you get back to how it was hacking with networks around, and networks not being reliable, and not having fiber at home for each one of us, where we don't even have to think that there is a router somewhere. Right. And

There is... One of the bad things that SaaS did was that it managed to hide from the vast majority of the engineers out there the complexity and the effort to make things reliable. But it was always about that. There was no technology that the day that it was introduced was reliable. It took a long time to make it reliable. The same thing will be also like with LLMs, right? And that's where engineering comes in. But...

We are not in 1995 anymore, we're in 2025. And there's so much experience across all these different disciplines, and bringing these people together can really, really accelerate things and make what took decades to happen before now happen in just a few years. So go out there, bring them together. It's top of mind, and innovation, especially being here in the Valley, thinking about AI and infrastructure and the new world that we live in now.

But building as a community is also very important. Having the community one, but also being able to contribute together and innovate together, right? So part of our launch too for typedef is we're open sourcing one of the libraries that's

really amazing at going through and helping people build in, let's say, like Jupyter notebooks and interfacing with LLMs very nicely. And so that's one of the things I think is a huge help in being able to move the pace of innovation forward is once you can have a project and have multiple projects and everyone's contributing, there's a lot of excitement. It helps build a lot of momentum in that space. Who's going to be the pets.com of the LLM bubble? Yeah.

Okay, I'm not going to say a name for the company. No, that's not fun then. But I do think that the prompt website companies, they are going to... Have a rude awakening? Yeah, yeah, I think so. I think we need them. We need the pets.coms of the world to happen for things like...

You know, if you reflect back to the dot-com bubble, if you think about what was happening back then, it was an extremely verticalized solution that was built for pretty much everything. And then...

people realized that we need platforms, right? And then you got Amazon, which is basically pets.com, but more. Yeah. And you have Shopify and you have Spotify. You have Chewy though too. That's doing pretty well. Oh yeah. Wait, are they Amazon? No. No, I don't think so. Anyways, I digress. Yeah. My pets.com is...

Whoever has that fucking billboard that says don't hire humans. That is on the... Whoever those guys are. There's so many things that you have to do to get a billboard. And the fact that it got so many marketers and top level people to sign off on that billboard. I don't even know who it is, but I know that I don't like them. I don't know. I personally feel that anything that feels like too easy cannot be real. Yeah.

I might be wrong. I'd love to be wrong, for myself at least. But problems that are valuable tend to be hard. You need to put in effort. You need to work hard to make them successful. So there's no easy path. It's not like, just because you're doing LLMs, you are going to be rich. It doesn't work like that. I think there are tons of eval companies out there now too, right? And I think they're solving very hard problems, but

we might start seeing some consolidation there too, as you think about the main observability and big players out there, like the Datadogs of the world, that are also very much thinking about how they go and integrate into AI. That's going to be an interesting thing to see. I think they're still cropping up, there's lots of different eval platforms out there, but the ones that are solving the hardest problems, like what Kostas was describing, I think are the ones that are going to be able to really stand on their own and substantiate themselves there.

But it's a very important part of the AI lifecycle, right? The problem set that they're going after. I don't know that we need hundreds of them, but it'll be interesting to see, I think, what happens in that space generally, to track that over time. Yeah, although I think the good thing with eval companies is that to build an eval company requires a baseline of technical competency. At the end of the day, the teams that build something,

they will have some kind of valuable exit, let's say, or at least the value is not going to be destroyed. There are companies that will destroy value. Like there are companies that will end up like the... What was the company? It wasn't a dot-com era company. It was much more recent. The one that was doing the...

One-click checkout. Oh, Fast or Fastly, one of those. Yeah, the one that had, like, the record in burn. Well, it wasn't, no...

Bolt was the other one that was doing well. I think it was fast.com or Fast. Yeah, the one that had the record for how much money they burned in a year or something like that. And the founder went on Twitter and was saying stuff about Stripe and how it was the mafia. Yeah, of course. There's always someone to blame if you want to, but still, there's value that has been destroyed, right? So,

there's always that in the industry. I think it's part of any fast-paced, high-reward space. It doesn't mean that everyone is a scam or anything like that.