So my name is Rohit. I work at Tecton. My title is Director of Engineering; I essentially lead a few engineering teams. And for coffee, I'm a huge fan of Cometeer, actually. I like my coffee basically as easy and simple as I can make it. So I started using Cometeer back in COVID, when all the cafes had shut down. I became a user then, and I've stuck with it since. I think it's one of those COVID things that just stuck, probably maybe the only one. And I still drink Cometeer, essentially.
People of Earth, today we get into it with my man Rohit, all around streaming data, this real-time, low-latency use case that can be associated with traditional ML. Now, I want to tell you a quick story before we get into it.
Over the holidays, about two months ago, I asked everyone in the community what they wanted to see for the new year. And many of you said, I want to see some traditional ML stuff; it feels like we've been talking about the AI and LLM boom too much. So this episode right here is for your viewing pleasure on YouTube. And if you are listening on
stations around the world, we've got a recommended song, recommended listening, which comes from a fan, Amy or Ami, because I've been asking folks what their favorite music is when they join the community. And this one is a gem that I did not know about. It's Gibran Alcocer, and the song is called Idea 10.
Testing in production. That's how we're running with this. Basically, this conversation came about because I was on a call with you and Kevin, the co-founder of Tecton, and you guys were explaining all about streaming and what the data streaming ecosystem looked like. And I thought, man, that is such a cool
topic for a podcast; I would love to talk to you more and dive into it. So we should probably just start with: what is your day to day? You're working at Tecton. What are you doing there? Yeah, absolutely. So I joined Tecton about four years ago. I joined as an IC. We were pretty small back in the day, so I worked on pretty much everything, but I focused mostly on the real-time pieces. So, for example, streaming, as you mentioned, and a lot of challenges around that.
My role now has changed a little bit; I'm in more of a managerial role right now. I manage three different teams that focus on streaming data, batch data, online inference, offline inference, et cetera. So, kind of the core data flows and infrastructure that Tecton provides as part of its core product. In short, my day is a lot of meetings, but yeah. We should probably state for everyone, because we can't take it for granted that everyone knows, what
Tecton does. Yeah, absolutely. They're one of the OG feature platforms. So a lot of dealing with your features, transforming your data, making sure that you can do use cases like fraud detection, recommender systems, loan scoring. I mean,
is there other stuff that you see people using Tecton a ton for? Insurance claim processing is also a big use case for us. Recommenders, in many different shapes and forms. And then, broadly, fraud and risk applications, I would say, are the biggest use case for us. Yeah. Those use cases where you just need really fast responses. Correct. Yeah.
So that's why we wanted to talk about streaming, man. That leads us right into the whole streaming ecosystem. Break it down: what is it? What does it look like right now? What are some things that have changed since I was on that call with you, like a year and a half ago? I can't even remember. Yep, absolutely. So when we look at customers, most of them have invested in
a streaming solution that can stream data at low latency: for example, Kafka, Kinesis, et cetera. But that's just a data delivery mechanism, and they kind of fail when it comes to using that data effectively for applications. So a lot of the companies we talk to have data in these Kafka streams, Kinesis streams, but then they struggle to make machine learning models, or even analytical use cases, use this data effectively. Wait, why is that? So, yeah, I'll go into it. I think fundamentally the ecosystem is extremely fragmented across several different tools and technologies.
And so you have to string together a bunch of these things and hire experts for each of them. And even once you can do that, it's not a one-time cost to build it out; these systems need resiliency and reliability, so it's an ongoing effort to maintain them. The simplest solution we see is people have a Kafka stream, and they would have a stream processor, which could be Spark or Flink.
And then they basically connect that to storage, which could be key-value stores or offline storage like Iceberg, and then build a serving layer on top of it for applications to consume this data. Each of these different steps actually requires different skill sets, so building a team to do all of these things is very difficult. And the tools available from the ecosystem are also fairly limited and difficult to actually use, especially if you're trying to go from zero to one. If you have a very large team and you're running Flink at, let's say, Facebook or Meta scale, it's maybe somewhat manageable. But for smaller companies and teams that still want to leverage real-time data, the ecosystem doesn't really have these simpler tools available for people to use. Yeah. And
I know that DynamoDB plays a huge part in this. Where does it plug in? Is it on that serving layer? Yeah, so Dynamo is basically storing records coming in from a stream processor and then making them available for high-throughput, low-latency serving. So essentially, something needs to write to Dynamo and something needs to read from Dynamo, and that writer and reader is something that each company builds in its own way.
And even maintaining DynamoDB is a challenge, because you need to think through what schema favors read latency versus write latency, which one is more important to you, and how you evolve that over time. All of these are challenges that people building these pipelines have to think through very deeply when they're building these systems.
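To make that writer/reader pattern concrete, here is a minimal sketch in Python with boto3. The table name, key schema, and feature names are all hypothetical; in this sketch the partition key is the entity ID and features are stored as plain attributes, which favors read latency (a single get_item per entity) over write flexibility.

```python
import boto3

# Hypothetical feature table: partition key "entity_id" (e.g. a user or card ID).
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("fraud_features")  # assumed table name

def write_features(entity_id: str, features: dict) -> None:
    """The 'writer': called by the stream processor after each aggregation."""
    table.put_item(Item={"entity_id": entity_id, **features})

def read_features(entity_id: str) -> dict:
    """The 'reader': called by the serving layer at inference time."""
    resp = table.get_item(Key={"entity_id": entity_id})
    return resp.get("Item", {})

# Example: the stream processor writes, the model server reads.
write_features("user-123", {"txn_count_5m": 4, "txn_sum_1h": 310})
print(read_features("user-123"))
```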
Yeah, because you're trying to keep things super fast, as we said, and make sure that your use cases, the loan scoring or the fraud detection or the recommender system, whatever, are just blazing fast. And so you need to think through wherever you can shave off milliseconds. Yeah. And there's also this additional dimension of cost, because I always think of these as two sides of the same coin, which is:
how fresh do you want your data, and how much are you willing to pay for it? Ideally, the two should be in sync with each other, meaning if I want data at millisecond-level granularity, I'm willing to pay more money for it, and if I'm okay with an hour's worth of freshness, I pay less for it. And you ideally want a system that has the right knobs built in, which can give you this trade-off so you can actually play with it.
For a lot of systems we see in the real world, this is not present: they either have a system that's built for very low latency at a very high cost, or it's the complete opposite, where the cost is relatively low but the system can't deliver the freshness that applications require. And most companies have a very varied set of use cases. Some need very low latency; some are okay with slightly higher latencies, et cetera. So having a system that can manage the cost for that freshness is also very important. The simplest analogy is:
people are not just looking for a Ferrari, because it's fast but also very expensive. What people really want is a system that's fast on the days they need it to be, but can also act like a Ford SUV when you want to go to the grocery store and buy some milk and eggs. Yeah, if you're with your family. Yeah. And that makes total sense, because you have different use cases that have different needs, different freshness requirements, and different variability in how much you're willing to pay for that. So for each instance of these, you want to be able to configure it to the different constraints that you have. Yep, absolutely. And the other thing that I was thinking about, because you were talking about how folks are using Flink or Spark: I think I remember
DuckDB being plugged in here in some ways. Can you talk to me about that?
Yeah, so on the batch side of things, we do provide a solution called Rift, which uses DuckDB as an alternative for some of the batch pipelines. So that's kind of the equivalent of Spark batch, right? But on the streaming side, we don't use DuckDB; we have our own streaming engine there, essentially, that we use. Yeah, that makes sense. Okay, so basically, the scene you have set is that there's a bunch of disparate pieces to the puzzle. Most everyone builds one size for all of their use cases in their organization, and they don't think about how to service each individual use case.
And then you've got reliability or maintenance of these pipelines and of these platforms that come into play. How do you see teams dealing with that maintenance process?
Yeah, I mean, honestly, it's very challenging, because a lot of the time the teams building these are machine learning engineers, or people who are more focused on the data layer of it, but the production experience of maintaining the system typically falls to a different team. And this silo between the teams is often a huge challenge for a lot of the customers we see. On top of that, users care about end-to-end reliability, right? As opposed to whether my Flink went down or my worker went down, et cetera. They're looking for solutions that can offer end-to-end reliability across the entire pipeline. And that's something that's just very, very difficult to build, because the on-call production team thinks of one service in isolation, while the machine learning teams care about the end-to-end reliability of the entire pipeline. That mismatch is often a huge concern for a lot of the customers we see.
So wait, this means the SRE who's getting pinged at 3 a.m. is getting pinged for one particular piece, like the Flink went down. Exactly, yeah. But the machine learning engineer is thinking about, well, I've got Kafka and I've got my Flink; how are they looking at it? It's not just that one service went down. Correct. For example, if your Flink went down, or your Spark Structured Streaming engine went down, then messages are backing up in Kafka, and now the machine learning applications are seeing increasingly stale data; basically, they're not processing any more real-time data. And so the repercussions of each of these go so far and beyond that,
if you're building a solution that requires stringing together a bunch of these disparate tools, then it's a never-ending game of trying to build end-to-end reliability, because it's just very difficult to do that for individual components. And then you start to see, that's where you can make the case for a lot of money being burned. Yeah, I mean, we constantly come across customers who are spending tens of thousands, if not hundreds of thousands, on just simple streaming pipelines. And I'm just talking about the infrastructure cost of it. There's the other cost, which is the SRE cost and the operational cost on a day-to-day basis. But just the pure AWS infra bill for all of these pipelines is pretty high, and optimizing it is very, very difficult given their stack. Have you thought about different best practices on how to optimize that? Besides, like, okay, just plug in Tecton here, right? I imagine that you have seen teams that are doing this well. And what does that look like?
Yeah, to be honest, a lot of the newer-age technologies that are coming in are actually aimed at bringing down the operational and one-time cost of setting these things up. I do think it requires a departure from the mental model of, I want to use what Google and Facebook are using, which typically could be Flink, et cetera, and it requires you to be open to some of the newer technologies coming in. So I'll give you one example: we're seeing a lot of streaming solutions not use local disks attached to nodes, but use S3 as a storage service, stream all that data through S3, and essentially build on object storage. One good example is WarpStream, which is basically Kafka on S3. They recently got acquired by Confluent. We're seeing a lot of these newer-age, real-time operational systems that are truly building for the cloud and leveraging object storage. And I think that's a huge lever customers can pull: look at where you are or aren't using cloud resources effectively in your stack.
And is that just because it's much cheaper to do it through S3?
It is cheaper, but it also gives you elasticity. The biggest way to think of it is that you can start to scale compute and storage independently. As opposed to, for example, if you were using disks attached to your server: every time you add a new server, you're also adding more disks to it, so it's very difficult to decouple the two. Some applications may require more compute, and in those cases you still end up paying for more storage. The cloud gives you this elasticity of being able to scale the different layers independently, and it leads not only to reduced cost but also to reduced operational overhead. Yeah, that is cool. What other tricks? Because I hadn't heard about these guys that just got acquired by Confluent. What was their name? WarpStream. WarpStream. Yeah. Fascinating. Yeah.
Yeah, there are other interesting tricks that we see some of our customers use. For example, a big cost of streaming workloads is typically checkpointing, which is how frequently you write state to storage so that, if your streaming pipeline fails, you can recover from the latest checkpoint. Typically, the more frequently you checkpoint, the more you're paying in checkpointing cost. The benefit is that if you, let's say, went down, you can pick back up from the most recent checkpoint, which may be just a few seconds ago, because you're checkpointing so frequently. And for a lot of our customers, the default checkpointing frequency for these streaming applications is just very, very high, and they're often surprised by the cost when the bill comes up. When most customers ask us, why is my streaming bill so high for my existing stack, the first place people look is, how frequently are you checkpointing to disk, et cetera. And a lot of our customers are just surprised: okay, I didn't even know about this. And is it that, the majority of the time, you don't need to be checkpointing as much? I think checkpointing is usually very important, and it should be done. It's a question of, do you want to do it every second, every 10 seconds, every 30 seconds, every minute, every hour? The default settings for a lot of streaming workloads are very aggressive checkpointing. Even if you can just tune it from one second to, let's say, 10 seconds, and your application can allow that level of recovery point objective, that saves pretty significant costs for a lot of our users.
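As a concrete illustration of the knob being described, here is a minimal PyFlink sketch. This assumes Flink is your stream processor; the 10-second interval is just the example from above, and whether it's acceptable depends on your recovery point objective.

```python
from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()

# Default deployments often checkpoint very aggressively (e.g. every second).
# Relaxing the interval from 1s to 10s means you may replay up to ~10s of
# events after a failure, but you write checkpoint state to storage 10x less often.
env.enable_checkpointing(10_000)  # interval in milliseconds

# Optional: ensure checkpoints don't pile up back to back under load.
env.get_checkpoint_config().set_min_pause_between_checkpoints(5_000)
```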
Yeah, that feels like a huge difference, from one second to 10 seconds. Yeah, I mean, as I said, it's a lever of how much cost you're willing to pay for how much freshness. And these knobs are present in some systems, but you need to know what the knobs are and how to tune them effectively. Otherwise, you might just end up paying a lot for workloads that don't require that level of freshness. Well, it goes back to your whole
thing at the beginning, where you were saying that when you have to patch together so many disparate tools, you have to have people that understand and are experts in all these tools, so that they know, oh, it's possible for us to just checkpoint every 10 seconds. Yeah, and they also need to reason about the end-to-end pipeline and understand what this one change will do to it, because most application teams are concerned about that, as opposed to some other objective, for example. And it's just a very hard problem to solve. I know you talked about the way you look at the shape of the data as being a lever that you can pull. This picture that you're painting in my mind is just like, there are so many stakeholders involved. And that just makes it way more complex, because anytime you have to throw more bodies at a problem, it's going to increase the complexity of the solution in this case.
But then you've got, like, I imagine the data scientist isn't thinking about the schema as much, or how to make it fully optimized for speed. Yep, exactly. I'll give one example, where we came across a customer who was storing protobuf messages inside the message stream. And it made sense, because the team putting data into the stream was dealing with protobufs, so they did it that way. However, on the consumer end, we were having to deserialize all these protobufs into messages that we could parse and ingest into the storage and serving solution. And when you spoke to the data scientists, they just did not want to reason about why you would store protobufs in the first place. It was an abstraction level of thinking they're not used to; they're used to operating at a little higher level of abstraction, which completely makes sense. But they're either forced to reason about much lower levels of abstraction, or they just have to deal with the performance hits they're getting. Yeah, it just seems like they have to pay a tax one way or the other with a lot of the existing systems that we see.
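For flavor, here is a minimal sketch of the deserialization tax being described, in Python with kafka-python and protobuf. The topic name, broker address, Transaction message class, and its fields are hypothetical stand-ins for whatever the producing team defined; the point is that every consumer has to know and maintain the producer's schema just to read the bytes.

```python
from kafka import KafkaConsumer
# Hypothetical generated class; in reality this comes from the producing
# team's .proto file compiled with protoc (e.g. transaction_pb2).
from transaction_pb2 import Transaction

consumer = KafkaConsumer(
    "transactions",                      # assumed topic name
    bootstrap_servers="localhost:9092",  # assumed broker address
)

for record in consumer:
    # Every consumer of this stream has to deserialize the producer's
    # protobuf before it can do anything with the data.
    txn = Transaction()
    txn.ParseFromString(record.value)
    # ...then parse fields and write to the storage and serving layer.
    print(txn.user_id, txn.amount)  # hypothetical fields
```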
Yeah, what a great point, because they're just going to do whatever's easiest. It's like water flowing downstream, trying to minimize the friction it encounters. So if that's the only solution they know works, they're probably going to take it. And it makes me think about how important it is, on the organizational side, for these different stakeholders to be talking to each other and to understand each other's goals. Like, all right, the data engineer is setting this up for me, so talk to me more about what the data engineer can do. If I'm a data scientist, I'm diving into data engineering, which probably isn't my strong suit, but it's going to help me understand what these folks are capable of
doing, or it's going to help me understand what levers these folks have to pull. Yep, exactly. For example, we see cases where some people don't care about what's inside the stream; they treat each of those messages like a black box, basically, and their job is just to make sure these black boxes are flowing through the system. And there's some other team that cares about what's inside the black box and how that affects the end-to-end system. We see this again and again: it's kind of, this is not my problem, it's your problem; no, it's not my problem, it's your problem. But at the end of the day, it's hurting the application that actually needs this data. Yeah, and that's not even touching the dimension of going upstream, to where the data is created.
Yep, exactly. Trying to figure out how, if things change way upstream, there's a domino effect all the way down. Correct. Yeah. And we see that when the application team says, hey, we're seeing events that are not fresh. We were used to seeing events from a second ago, and now we're seeing events from an hour ago. And debugging that just takes so much effort sometimes for a lot of our customers: is it the application that's processing the stream that's having delays? Is it the application that's putting messages into the stream that's having delays? Is it the consumer application that's having delays? It's kind of like a highway, where even if one place is stuck, it can back up everything upstream, and figuring out where the exact problem is can often be challenging. Oof, yeah. And you have to go and be this detective, in a way. And the more tools that you're using, the more headaches. Yeah, the problem just gets exponentially difficult, because now you need to check all the different tool interactions. Suddenly, if you're using two tools or four tools, it's now like eight or 16 different interaction points you need to check. I feel for all those folks out there that are dealing with this on a day-to-day basis, because it sounds painful.
What else you got for me? What other tips and tricks? And just like streaming ecosystem stuff. I mean, I love the fact that I am very green in this field and you live in it.
And so just simple stuff that you're telling me is blowing my mind. Yeah, another thing we look at is how much data retention there is in your stream. Streams are typically meant to contain recent data, but "recent" itself is subjective: you might keep a day's worth of data in your stream, or seven days' worth, or 30 days' worth, et cetera, and you can just keep extending that. And again, you need to think effectively about how much data you want. So let me back up a little bit. A lot of our customers have to stitch together streaming as well as batch data, because yes, they want insights on what happened in the last five minutes, which may be present just by looking at the stream data, but they also want insights about what happened over the last seven or 10 days. That may mean looking at the stream for maybe the last day, and then looking at my batch data, because the batch has data for the last six days, right? So one effective thing is deciding what the cutoff point is between what's stored in batch and what's stored in the stream. One way to lower cost could be: I'm only going to keep the last two days' worth of data in the stream, and anything beyond that, all the way back through history, is going to be in my batch data. Or you could say: I'm not going to deal with batch data at all; my application just needs statistics up to 30 days, and I'm going to keep all of that within the stream itself. Each of these decisions affects cost in pretty substantial ways. So, for example, one easy way we help a lot of our customers: if you have a system that can effectively handle batch and streaming data and make it available to the application, where the application doesn't care whether it's coming from batch or stream, then you can really start to arbitrage between what you keep in the stream and what you keep in batch. You can start putting less and less data in your stream, only the recent relevant data, and push more to batch. But it requires you to have a very advanced data engineering pipeline, I would say, one that can really combine the incoming batch and stream data effectively. For example, we can operate with even a few hours of data in the stream, as long as data beyond that is coming from batch, which can really reduce costs for users who are used to storing 30, seven, or 10 days' worth of data in the stream.
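Here is a minimal sketch of that stitching idea in Python. Everything in it is hypothetical (the cutoff, the in-memory stand-ins for the two stores): the point is just that a long-window aggregate can be served as batch-up-to-the-cutoff plus stream-since-the-cutoff, so the stream only needs to retain a few hours of data.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical stand-ins for the two storage layers: (user_id, timestamp) events.
BATCH_EVENTS = []   # e.g. rows read from Iceberg, older than the cutoff
STREAM_EVENTS = []  # e.g. recent events consumed from Kafka

def count_events(events, user_id, start, end):
    """Count one user's events with start <= timestamp < end."""
    return sum(1 for uid, ts in events if uid == user_id and start <= ts < end)

def seven_day_txn_count(user_id: str) -> int:
    """A 7-day count stitched from batch (older) plus stream (recent) data."""
    now = datetime.now(timezone.utc)
    cutoff = now - timedelta(hours=6)  # assumed stream retention: ~6 hours
    start = now - timedelta(days=7)

    batch_count = count_events(BATCH_EVENTS, user_id, start, cutoff)
    stream_count = count_events(STREAM_EVENTS, user_id, cutoff, now)

    # The application sees one number; it never knows which store served it.
    return batch_count + stream_count

print(seven_day_txn_count("user-123"))
```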
Dude, so do you see folks creating two different systems
for the batch and the streaming? Yeah, it's actually pretty interesting. What we see is that, let's say a machine learning engineer or data scientist says, hey, I need four data points: the recent transaction history for the last five minutes, for the last hour, for the last 10 days, and a lifetime sum of it, the total, essentially. The first two are more recent, so the engineers would build a streaming pipeline for those. The other two are more historical, the 10-day and lifetime ones, and they would build a separate batch pipeline for those. And these are completely disparate pipelines: the data scientists have to write transformations, et cetera, separately for each pipeline, then use that in the machine learning model and separately train and serve based on that. Having these separate pipelines is often a huge challenge for customers, in terms of duplication of effort, infrastructure cost, and the operational maintenance of each pipeline.
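To make that duplication concrete, here is a sketch of the four features in question. Assuming you had a unified view of events (the stitched batch-plus-stream idea from before), all four could come from one transformation definition with different windows, instead of two separately maintained pipelines; the event data here is made up.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

now = datetime.now(timezone.utc)

# Hypothetical unified event feed for one user: (timestamp, amount) pairs,
# regardless of whether each event arrived via the stream or via batch.
events = [
    (now - timedelta(minutes=2), 25.0),
    (now - timedelta(minutes=40), 80.0),
    (now - timedelta(days=3), 120.0),
    (now - timedelta(days=60), 300.0),
]

def txn_sum(window: Optional[timedelta]) -> float:
    """One aggregation definition, parameterized only by window size."""
    if window is None:  # lifetime total
        return sum(amount for _, amount in events)
    return sum(amount for ts, amount in events if ts >= now - window)

features = {
    "sum_5m": txn_sum(timedelta(minutes=5)),  # would come from the stream pipeline
    "sum_1h": txn_sum(timedelta(hours=1)),    # would come from the stream pipeline
    "sum_10d": txn_sum(timedelta(days=10)),   # would come from the batch pipeline
    "sum_lifetime": txn_sum(None),            # would come from the batch pipeline
}
print(features)
```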
There we go with that operational maintenance again. That is something not to be underestimated. Yeah, because the cost doesn't just stop the day you launch this in production; that's actually the beginning of it. Most people feel that if you spend, let's say, three to six months and put this in production, your work is done. But in reality, that's the start of the work, and actually running it in production beyond that is the hardest part. Yeah, that's where you get those bills coming in, and all of a sudden you're like, what the hell are we spending all this money on? Yep, exactly, yeah.
The other thing we're seeing is that a lot of the streaming ecosystem players are maturing quite a bit, and they're offering more and more services. A good example: if today you had a pipeline that was taking stream data and dumping it into S3 or Iceberg, et cetera, Confluent is now offering managed services for that. You can just point at your stream and say, hey, dump it into Iceberg; they have this new service called TableFlow. So where the industry was fragmented, with maybe one vendor for just producing and serving stream messages, another vendor for putting them in object storage, maybe another vendor for putting them in online storage, we're seeing some consolidation, which is effectively raising the abstraction at which you need to think. For example, as I mentioned, Confluent now offers a fully managed service to take your stream and dump it into Iceberg, and you don't even have to think about how data flows from your stream to Iceberg and what happens there, et cetera.
I do think these managed services are expensive when you look at just the upfront cost. But if you truly factor in the operational cost, I do think they make sense. And we're increasingly seeing this across, for example, Databricks and Confluent. Even Snowflake is offering some of these services: if you have a Kafka stream and you want the records dumped into a Snowflake table, they offer managed services for that, and then obviously they have the entire Snowflake stack behind it. So we're effectively seeing a lot of these vendors provide more end-to-end solutions and take on the complexity of stringing these tools together. And I do think that if you're operating a company where you're dealing with a lot of these disparate tools, then using some of these managed services from different vendors could be quite useful for you. We see that all the time in the MLOps community Slack. So many folks will come through and ask the question,
and the question is phrased in some shape or form of: hey, I'm using XYZ managed service, be it SageMaker, be it Databricks, be it whatever, but it's so expensive, and I'm thinking about what we can do to make it cheaper. And inevitably, someone will chime in on the thread and say, okay, you can use this open-source alternative, or you can try to roll your own in this way, shape, or form. But don't forget the cost of the humans that are going to be maintaining that. So you think it's cheaper, and maybe on paper your cloud costs go down, but then your manager, or your manager's manager, is looking at it and saying, we just spent less on cloud costs, but we're spending way more in headcount.
Yeah, and that's truly one thing that we see. The other thing we see is that a lot of companies want to focus on product velocity. So even if you set all this up, you might be anticipating, I need to build three or four times as many features and models, et cetera, and launch them in production.
Is this going to be a bottleneck in terms of me launching these new models and pipelines and products? Which is really a problem because then it delays your ability to go to market with these products quickly. And so we see this across the industry as well where
Sometimes people just come to us because they feel like, hey, I know this, I can run this on my own, I can even hire people and do all this stuff. But as a business, I just don't want to focus on these things. I want to focus on product velocity, on being able to launch the best products out there to the end consumers, and not have to worry about whether this product will get delayed because some other team won't staff this on their roadmap, or will require more resources to maintain these things. I saw a blog post the other day from fly.io talking about how they created a GPU offering, because, you know, they're this cloud. And they basically said, we were wrong about the whole GPU offering; it was a lot of hard work and we aren't going to keep pushing on it. They're not discontinuing it, but they're not really advancing it forward. And there was this nugget in that blog post that I thought was the coolest phrase I've heard in a while. It stuck with me. It said,
A very useful way to look at a startup is that it's a race to learn stuff. Yeah, that makes a lot of sense. Yeah.
It's just a race to fail and fail and then learn what works. And so that's exactly what you're saying; it echoes this product velocity. How fast can we learn whether this product works or not? Exactly. And you want to remove as many bottlenecks slowing down your learning curve as possible. Probably the only bottleneck should be how quickly you're getting feedback from customers. Beyond that, you don't want to slow down your process of delivering value to customers. Yeah, because that's in your hands. Exactly, yeah. The other stuff, maybe it's a little more difficult to get that feedback, that loop, coming back. So it makes sense, man. All right, cool. Well, that's a solid one. I mean, the managed services one, too; how different products are going into different spaces always fascinates me, because it feels like
everybody's trying to become the one platform, like Databricks. And so in five years, all of a sudden you'll be able to do everything you can do on Databricks on Confluent. I don't know if you have thoughts on that one. Yeah, I think a good example of this is Iceberg. I think Iceberg is on track to become what GitHub is to code, but for data, basically, for any organization. It'll be the central place where everything is stored. And all the vendors, be it Databricks, Snowflake, or Confluent, essentially have to read and write from that, but everything is stored there. And we're seeing more and more of these services becoming agnostic to these implementation details and focusing more on
how they can be more compatible with each other, especially now that, for example, storage has moved to a common solution. And I think we will see more of this, where you can more easily transition from one vendor to the other, because you don't have to think about how the data is going to be migrated between them, which used to be the biggest challenge. Migrating data from one vendor to another without losing it, in a secure way, is, I think, going to be a thing of the past. If all of this data ends up being stored in Iceberg, et cetera, or in open-standard formats on the cloud, then switching from one compute vendor to another is, I think, going to be significantly easier. Wow. And maybe it's not even that, okay, we're on Databricks and now we're going to migrate to Snowflake. It's: for this job we use Snowflake, and for this other job we use Databricks. That job essentially could be a Databricks Spark job. It could be a Snowflake query. It could be a BigQuery something. It could be DuckDB, et cetera. And I think we're already starting to see the early signs of this, where a lot of companies are now mandating that all of their storage be in Iceberg, because they don't want vendor lock-in with any particular storage vendor. Man, that's wild to think. Like the analogy of
Iceberg is to data what GitHub is to code. Yeah, I mean, effectively, if you go to any organization today, by and large there might be several different engineering teams: UI teams, front-end teams, back-end teams, et cetera. They're probably all storing their code on GitHub, it's all searchable, and they have common mechanisms. They may have different build processes and so on for building the code, but they're all effectively reading from GitHub and making changes to the repo there. I think code was maybe ahead of its time, and data is catching up now. But I do imagine a world where we don't have to think about whether this data is stored in one vendor and that data is stored in a different vendor. The data is just stored in Iceberg, and you can read and write from anywhere, basically, as long as you have a standard data format and a standard catalog. Yeah. Well, that's the dream. And then what? It's like the data would be
specialized for its specialized use cases. Like, oh, I've got this generative AI use case, or I've got this fraud detection use case, and, like you said, it's spinning up some kind of compute. Yeah, we're already seeing this for structured data; I think it's centralizing on this. Unstructured data and different modalities, like video, et cetera, still have some room to go. But I'm fairly certain that for structured data, as well as unstructured embeddings data, we are not very far from a world where a lot of organizations store everything in Iceberg. Damn. Yeah, that's very cool to think about, and what that unlocks. Yeah, and I think, for example, the Databricks acquisition of Tabular, when Databricks themselves were actually building Delta, a different format from Iceberg, is a great indication that even the vendors themselves are seeing where they need to head and where the puck is going. The first thing that comes to my mind is that this is happening with
no formal committees, no standardizing bodies. It's just happening because people see that this is the way forward. Yeah, I mean, a lot of the Iceberg community came from several different companies; I think there are maybe 10 to 20 companies involved. But I think they're all realizing that they have to work together to achieve this solution, or else they're going to end up in a world where 90% of the world is using Iceberg while they have their own storage format, and they don't want to be in that world, siloed somewhere else. So when there's enough momentum in a space, and I do think there's enough momentum in this space, a lot of people want to join the momentum, as opposed to staying away from it or running against the tide.
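As a small illustration of the "many engines, one table" idea discussed above, here is a hedged sketch of reading the same Iceberg table from two different engines in Python: once via PyIceberg and once via DuckDB's iceberg extension. The catalog name, table name, and S3 path are hypothetical, and the exact setup (catalog config, credentials) depends on your environment.

```python
import duckdb
from pyiceberg.catalog import load_catalog

# Engine 1: PyIceberg reads the table through a configured catalog.
# "default" and "analytics.events" are assumed names.
catalog = load_catalog("default")
events = catalog.load_table("analytics.events")
df = events.scan().to_pandas()

# Engine 2: DuckDB reads the very same table files via its iceberg extension.
# The S3 path is a hypothetical table location.
duckdb.execute("INSTALL iceberg")
duckdb.execute("LOAD iceberg")
duckdb.sql("""
    SELECT count(*)
    FROM iceberg_scan('s3://my-bucket/warehouse/analytics/events')
""").show()
```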
Now, let's go back to something you said earlier, which is that folks are understanding that they're not necessarily in a position where they need to be like Google or Facebook, and they don't need to build for this gigantic amount of scale. Talk to me more about that. Yeah, I think the biggest example in front of us is DuckDB. In a world where, let's say, Google comes up with BigQuery, and there's Snowflake, there's Databricks, and there's Presto and Spark: a lot of these very large-scale batch data processing solutions typically came out of very large companies because they needed them. And they were kind of shoved down everybody else's throat: this is basically what you have to use. Even if I don't have that much data or that much scale, I still need to use the system and deal with the complexity of using it. It's like, if I wanted to go from San Francisco to New York, instead of taking a commercial flight I'd have to fly a fighter jet and deal with its complexity. It's just not something that's needed, but some people may need it, and so that's why now I have to use it. And I think the recent popularity of DuckDB is a good example of this, where people are seeing that, yes, these Snowflake and BigQuery solutions make sense for some kinds of workloads, but not all kinds of workloads. I need to look at the workloads I need to support and whether they fit better in DuckDB or in Snowflake, BigQuery, et cetera. And we see this as well, where a lot of data pipelines don't need very complicated, distributed, massively parallel batch processing solutions; they're actually significantly simpler to build on simpler tools. We've seen this in the batch world, and I think we'll see it in the streaming world as well, where people will look at systems that are significantly simpler than Flink or Spark Streaming, et cetera, and look at building these systems on top of such primitives. Streaming, I think, is a little bit early; maybe there's nothing as popular in streaming as DuckDB is in the batch world. I think there's a wide gap there. Some companies are addressing it, but nothing has broken out as much as DuckDB has on the batch side. And once we enter that world, I think it'll be interesting to see how people choose simpler tools like DuckDB versus more complicated solutions, both across batch as well as streaming. Yeah.
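For a sense of the zero-to-one simplicity being described, here is a tiny DuckDB example in Python. The Parquet file name and its columns are hypothetical; the point is that there's no cluster or service to stand up, just a library call that queries files in place.

```python
import duckdb

# No cluster, no service: an in-process engine querying a local file.
# "events.parquet" is a hypothetical file with user_id and amount columns.
duckdb.sql("""
    SELECT user_id, count(*) AS txns, sum(amount) AS total
    FROM 'events.parquet'
    GROUP BY user_id
    ORDER BY total DESC
    LIMIT 10
""").show()
```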
DuckDB is known for its user experience, and I think people love that. The simplicity of it, and the developer experience on top of it, is magical. Correct, yeah. And why do you think it is that in the streaming world we haven't seen that equivalent yet? I just think streaming is a slightly more difficult problem to solve, especially because you need to think about ongoing state much more than you do in the batch world. Batch workloads are typically a little more ephemeral: you can spin something up, compute, and then throw it away, because you have the results. Streaming is more of an ongoing thing, where you need to continuously maintain state, and that state might itself grow over time, et cetera. It's just a more difficult problem to solve, but I'm pretty convinced that we will see something like that show up in the next five to 10 years, possibly even sooner. And I've heard a lot of folks talk about how streaming is a necessity and everything eventually will go to streaming. Do you believe that? I don't believe that, to be honest. Maybe I have a bit of a contrarian view. I don't see a world where people
need to always have only streaming data and do everything within streaming. The batch processing systems are getting very, very powerful at dealing with large amounts of data, and the streaming systems are very good at getting the recent data. I'll give you a database analogy, which I think may make this slightly simpler. We've always had transactional databases, like Postgres, MySQL, et cetera, and we also had analytical databases, like Snowflake and BigQuery. Analytical databases look at past historical data very effectively, and transactional databases are very good at changing one row or one data point, et cetera. And in the database community, I feel like there was this time when people felt we could just have one system do it all. They had this concept called HTAP, hybrid transactional/analytical processing: one system that can do both transactional as well as analytical. And these systems did actually come out. However, it turns out these systems do a mediocre job at both analytics and transactional work, and the buyer on the other side is looking for a best-in-class solution for analytical workloads and a best-in-class solution for transactional workloads. They're not okay with a solution that does a mediocre job at both. So even if we get to a world where everything can be done in streaming, I'm fairly certain there will be parts of it that are more mediocre in performance and overall ease of use than batch. And so I do think we'll probably end up in a world where each of these systems will exist, and they'll just be significantly more powerful at doing what they do best, essentially. Yeah, just leaning into their strong suits. Exactly, yeah. Is there anything else that we didn't touch on that you want to talk about?
I think one thing that we're seeing, interestingly, and that I think we may see more of, is bring your own cloud. What I mean by that is, a lot of times people have this negative feeling about vendors, because a lot of vendors want all the data processing to happen within their account, and a lot of customers are like, you're dealing with sensitive data, and I don't want the data to leave my account. With the WarpStream acquisition, we've seen Confluent now offering a BYOC, bring your own cloud, solution, so your data can still live within your account. Redpanda is doing BYOC, and they're doing very well with that. We're seeing Databricks already offer a BYOC solution. Snowflake historically has not done that; they've kind of been in the world of everything lives within Snowflake: give me all your data, and I will store and process everything within my account. I think streaming solutions, as well as batch, will probably move more toward BYOC, and I think that will make accepting vendors easier and a little less challenging for a lot of companies than it is right now. So BYOC is just, the vendor goes to wherever you are? Yeah, exactly. The vendor will deploy their stack in your cloud account, instead of asking you to move your data into their cloud account. Yeah, that is a trend. That makes a lot of sense to me. Yeah, I think historically vendors have been a little opposed to that, but they're realizing that the gravity is where the data is, and it's easier to move their compute systems than to move a customer's data into anybody else's account. Exactly. It's so hard to get that data to go somewhere else, just because of the sensitivity of the data. Yep, exactly. Yeah.