
Streaming Ecosystem Complexities and Cost Management // Rohit Agrawal // #302

2025/4/4

MLOps.community


People

Rohit Agrawal

Topics

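I have been at Tecton for four years. I initially focused on real-time data processing, and I now manage the teams responsible for streaming data, batch data, and online/offline inference. Most of the work concerns the data flows and infrastructure behind Tecton's core product.

Many companies have adopted streaming solutions such as Kafka or Kinesis but struggle to use that data effectively for machine learning or analytics, because the streaming ecosystem is highly fragmented: many tools and technologies have to be integrated and then continuously maintained. Building a stream processing system means combining Kafka, a stream processor (Spark or Flink), storage (a key-value store, or offline storage such as Iceberg), and a serving layer. That takes a broad set of skills, and the resulting system is expensive to maintain. For small companies, existing streaming tools are hard to use and hard to build on from scratch, because the tools themselves are limited in functionality and difficult to work with. DynamoDB is used to store records from the stream processor and serve them at high throughput and low latency, but maintaining it requires careful thought about schema design, the trade-off between read and write latency, and how the setup evolves over time.

The cost of a stream processing system is closely tied to data freshness, and an ideal system lets users tune freshness and cost to their needs. Maintenance of streaming systems is usually split across different teams, which creates silos between them and hurts end-to-end reliability. Many companies spend a great deal of money on simple streaming pipelines, counting infrastructure costs, SRE costs, and day-to-day operating costs. Using S3 as the storage service and moving all data through S3 can reduce operating costs and improve elasticity; WarpStream (acquired by Confluent) is one example. Reducing checkpoint frequency lowers the cost of stream processing, because the more frequent the checkpoints, the higher the cost. Data freshness and cost have to be traded off, and the system should provide controls for doing so. How long data is retained in the stream also affects cost and should be adjusted to actual needs.

Data scientists and data engineers care about different aspects of the data: data scientists focus on high-level abstractions, while data engineers focus on low-level details, which leads to inefficiency. Many companies need to combine streaming and batch data and must decide on the cutoff point between the two in order to reduce costs. Companies often build two separate systems for batch and stream processing, which leads to duplicated work, higher infrastructure costs, and higher maintenance costs.

More and more players in the streaming ecosystem offer managed services, such as Confluent's TableFlow, which help simplify pipelines and reduce costs, although these services can themselves be expensive. Many companies care more about product velocity, so they prefer managed services even when these may cost more than building in-house, because doing so avoids maintenance costs and shortens time to market. Iceberg is poised to become the GitHub of the data world: the central place where all of an organization's data is stored, which would promote interoperability between vendors and simplify data migration.

Companies do not need to chase Google- or Facebook-scale architectures; depending on their workloads, they can process data with simpler tools such as DuckDB. I do not believe all data processing will eventually move to streaming. Batch systems remain very powerful for processing large volumes of data, stream and batch systems each have their strengths, and their coexistence is the inevitable trend. More and more companies are choosing BYOC (bring your own cloud) solutions, in which the vendor deploys its stack in the customer's cloud account instead of requiring the customer to move data into the vendor's cloud account.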
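As a rough illustration of the stream/batch cutoff point mentioned in the topics above, the sketch below computes one feature (a per-user running total) by taking a batch aggregate for everything before the cutoff and adding only the streamed events at or after it. This is a minimal sketch under stated assumptions, not a description of how Tecton or any other product implements it: the `Event` type, the function names, and the sample data are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Event:
    """Hypothetical event record; the field names are illustrative only."""
    user_id: str
    amount: float
    ts: datetime


def batch_total(events, user_id, cutoff):
    # Stand-in for a precomputed batch aggregate: everything strictly before the cutoff.
    return sum(e.amount for e in events if e.user_id == user_id and e.ts < cutoff)


def stream_total(events, user_id, cutoff):
    # Only events at or after the cutoff are taken from the stream, so older
    # events still held in stream retention are not counted a second time.
    return sum(e.amount for e in events if e.user_id == user_id and e.ts >= cutoff)


def feature_value(batch_events, stream_events, user_id, cutoff):
    # Feature = cheap batch history + fresh streaming tail.
    return (batch_total(batch_events, user_id, cutoff)
            + stream_total(stream_events, user_id, cutoff))


cutoff = datetime(2025, 4, 4)

history = [
    Event("u1", 10.0, cutoff - timedelta(days=2)),
    Event("u1", 5.0, cutoff - timedelta(hours=1)),
    Event("u2", 99.0, cutoff - timedelta(days=1)),
]

recent = [
    # Still retained in the stream, but already covered by the batch side.
    Event("u1", 5.0, cutoff - timedelta(hours=1)),
    Event("u1", 2.5, cutoff + timedelta(minutes=5)),
]

total = feature_value(history, recent, "u1", cutoff)
print(total)
```

Moving the cutoff later shifts more of the work onto the batch side, so the stream has to cover (and retain) a shorter window; that is the sense in which choosing the cutoff is a cost decision.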


Chapters

Rohit Agrawal, Director of Engineering at Tecton, shares his career journey, current responsibilities overseeing teams focused on streaming and batch data, and online/offline inference. He also describes Tecton's role as a feature platform for various applications.


Streaming Ecosystem Complexities and Cost Management // MLOps Podcast #302 with Rohit Agrawal, Director of Engineering at Tecton.

Join the Community: https://go.mlops.community/YTJoinIn
Get the newsletter: https://go.mlops.community/YTNewsletter

// Abstract

Demetrios talks with Rohit Agrawal, Director of Engineering at Tecton, about the challenges and future of streaming data in ML. Rohit shares his path at Tecton and insights on managing real-time and batch systems. They cover tool fragmentation (Kafka, Flink, etc.), infrastructure costs, managed services, and trends like using S3 for storage and Iceberg as the GitHub for data. The episode wraps with thoughts on BYOC solutions and evolving data architectures.

// Bio

Rohit Agrawal is an Engineering Manager at Tecton, leading the Real-Time Execution team. Before Tecton, Rohit was a Lead Software Engineer at Salesforce, where he focused on transaction processing and storage in OLTP relational databases. He holds a Master’s Degree in Computer Systems from Carnegie Mellon University and a Bachelor’s Degree in Electrical Engineering from the Birla Institute of Technology and Science in Pilani, India.

// Related Links

