
AI Data Engineers - Data Engineering After AI // Vikram Chennai // #309

2025/4/25

MLOps.community

People
Vikram
Topics
Vikram: I'm the founder and CEO of Ardent AI. We're building AI data engineers: AI agents that use LLMs to streamline data engineering workflows. Data engineers are the best users of this technology, because they understand how to put AI tools to work and get more done. An AI data engineer connects to your data stack and carries out data engineering tasks from natural-language instructions, such as building data pipelines or running schema migrations. It automates a lot of the tedious work, like finding API endpoints, writing the code, pushing it to your GitHub repository, and opening a pull request. It can also access and understand the schemas of your various databases, which lets it make smarter decisions when building pipelines. To handle the challenges of complex data systems, we focus on specialized training around specific domains, contexts, and outcomes, and we improve accuracy by creating lightweight, ephemeral staging environments. We've also built a benchmark to evaluate the AI data engineer's performance, and we validate its results with data quality checks and other tests. To address the problem of insufficient context, we added a check step that judges whether a task is feasible and warns the user when it isn't. We also break large tasks into smaller ones to improve reliability and accuracy and to give users more control over the process. We use GitHub repositories and database context to provide the necessary background, and we use retrieval-augmented generation to strip out unnecessary context. We also let users attach documents as context, to help the AI data engineer understand company-specific terminology and processes. Our pricing model is based on a subscription fee plus a credit allocation, to fit different customers' needs. The AI data engineer's compute consumption covers the resources needed to run code and interact with various services. We've already seen the AI data engineer complete complex data-processing tasks that span multiple services, for example using Airflow as an orchestrator and calling Databricks to run Spark code. That shows how much potential this technology has.

Demetrios: As the host, I talked with Vikram about how AI data engineers are built, the challenges involved, and where this is headed. We explored what an AI data engineer is, what it can do, and how it integrates with existing data engineering tools and workflows. We also discussed how to handle complex data systems, how to evaluate and validate results, and how to make sure an AI data engineer works reliably across a range of scenarios. Beyond that, we covered pricing strategy, how compute is allocated, and how to balance AI automation with the need for human intervention. Through the conversation with Vikram, we got a close look at the strengths and limitations of AI data engineers and how they can help data engineers work more efficiently and solve real problems.


Chapters
Vikram Chennai, founder of Ardent AI, introduces AI Data Engineers, AI agents performing data engineering tasks. They integrate with existing stacks (like Airflow and GitHub) to build pipelines, handle schema migrations, and automate tasks usually done manually by data engineers. The AI agent interacts via a terminal, understanding API outputs and database schemas to build and push pipelines as PRs.
  • Introduction of AI Data Engineers as AI agents for data engineering tasks.
  • Integration with existing data engineering stacks (Airflow, GitHub).
  • Automation of pipeline building, schema migrations, and other tasks.
  • Agent interaction via a terminal interface.

Transcript


My name is Vikram. I'm building Ardent AI. I'm the founder and CEO, and I like drinking lattes. Welcome back to the MLOps Community Podcast. I'm your host, Demetrios, and today my man Vikram is doing some wild shit around bringing LLMs to the data engineering pipeline use case. I love this because it is merging these two worlds, and hopefully...

Saving some time for those data engineers that you know are constantly under the gun. Let's talk with him about how he's doing it and what challenges he has had building his product. And as always, if you enjoy this episode, share it with a friend. Or if you really want to do something helpful, go ahead and leave us a review. Give us some stars on Spotify. It doesn't even have to be five of them. You know, I'm okay with four.

But you can also comment now on Spotify itself, which will trigger the algorithm, give us more engagement. Let's jump into this conversation. But before we do, if you are one of those folks that is listening to this on Spotify or on these different podcast players, I've got a treat for you. The next musical recommendation comes from...

Someone that just joined the community and told me about Cosmo Sheldrake. Never heard of him, and I am now in love. Legend has it that the moss grows on the north side of the trees. Legend has it when the rains come down, all the worms come up to breathe. Legend has it when the sunbeams come, all the plants stay with them with their leaves.

Well, let's not say that the world spins round on an axis of 23 degrees. But have you thought we are the rabbits in the plant? It's fast-ranging up the mountains while hustling up the tunes. What does office want to do? You know, I love things but many from that old woman who lived inside her shoes. The girl who sang by hand by night, she ate his soup.

Where are you at, by the way? Are you in San Francisco right now? Mm-hmm. In San Francisco. You've been doing meetups? Um, not...

too much. Um, I actually moved here like six months ago, and so I started working out of some co-working spaces and then sort of got, like, networked into the community. So there's not as many meetups anymore. I did a lot when I got here. I went to a lot of hackathons. Um, great place to meet people. But after that, once you sort of got, you know, a group of people, then it sort of naturally expands who you meet, and then it's not as much go out, meet people yourself. It's more just like they just

Enter Orbit. Enter Orbit. I like it. The other thing that I was thinking about, who are the folks that you talk to as like users of your product? So for us, data engineers are the primary users.

which might seem a little counterintuitive, especially since we're building AI data engineers. But what we've found is data engineers are really the people that understand how to get things done in their stack. So they know which pipelining tool to use, the little things that can go right or wrong. And so when they're in control of a tool that can do data engineering for them, then they

You just get way more done way faster. So they've been our best users so far. And we've seen them just do incredible things. So I want to look under the hood at how you are creating what you're creating. But you kind of glossed over something right there. And we need to dig into it. What is an AI data engineer?

Yeah, so it's an AI agent that's connected to your stack that can perform data engineering tasks like building pipelines or doing schema migrations with a sentence. So it's very similar to something like Devin, except we're verticalized and focused on data engineering only.

and really integrating deeply with that stack. So an example of that would be you have Airflow set up with GitHub, and that's how you control the code for your data pipelines, and you want to build a new data pipeline. Normally, you'd have to go clone that repo, find wherever you're storing your DAGs, your

your data pipelines, and then go write all the code and then check the databases and then figure out what to write and, you know, the structure of everything, the schemas and all that mapping. Or if you're calling from an API endpoint, you have to understand what that endpoint will drop.

So you have to do all this work, and then you can push a pipeline up and then test if it works. For us, our agent has a terminal, so it has a computer attached, and it can do all of that for you. So you just tell it, hey, I want a pipeline that does this. It'll go look at the web, look at that API endpoint, figure out, oh, how does this thing output, and then build the entire pipeline for you, push it to your GitHub repo, and

as a PR. And so you don't have to do any of the work. You just say, hey, go do this thing. It'll do it for you. And this is primarily with data pipelines. Yes. Are you also doing stuff within databases, like the fun of trying to query the databases? Or is it just that maybe you're transforming, or, like, how complex have you seen this get?
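
To ground the workflow just described, here is a rough sketch of the kind of DAG such a PR might contain: pull records from an API endpoint and load them into an existing Postgres table. The endpoint URL, table, column names, and connection ID are hypothetical placeholders, not Ardent's actual output.

```python
# A rough sketch of the kind of pipeline the agent might push as a PR:
# pull records from an API endpoint and load them into a Postgres table
# the agent already knows exists from schema context. All names here are
# hypothetical placeholders.
from datetime import datetime

import requests
from airflow.decorators import dag, task
from airflow.providers.postgres.hooks.postgres import PostgresHook


@dag(schedule="@hourly", start_date=datetime(2025, 1, 1), catchup=False)
def orders_api_to_postgres():
    @task
    def extract() -> list[dict]:
        # Call the (hypothetical) source endpoint and return its JSON payload.
        resp = requests.get("https://api.example.com/v1/orders", timeout=30)
        resp.raise_for_status()
        return resp.json()["orders"]

    @task
    def load(orders: list[dict]) -> None:
        # Insert into an existing table instead of inventing a new one.
        hook = PostgresHook(postgres_conn_id="analytics_postgres")
        rows = [(o["id"], o["amount"], o["created_at"]) for o in orders]
        hook.insert_rows(
            table="raw.orders",
            rows=rows,
            target_fields=["order_id", "amount", "created_at"],
        )

    load(extract())


orders_api_to_postgres()
```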

Yeah, so we have the context into databases. So when we have users set up,

They set up with whatever pipelining service. So if it's Airflow, if it's Dagster or Prefect, whatever the hell they're using. But then you also have context over all of their databases too. So you can connect Postgres or connect Snowflake or Databricks SQL or MongoDB or whatever you use. And so our agent actually has context over what that looks like.

And it allows you, we're focusing specifically on pipelines, but when it writes that code, then it'll know the schema. So if there's a table you want to drop things into, it's not trying to guess or make up a new table. It knows you already have these three tables. It'll make the most sense to drop the new information into table one. And here's how we're going to transform it to do that. One thing that I've heard a lot of folks talk about is this type of

Coding with AI, or dare I say vibe coding, is very useful when things are quite simple. But then it kind of falls flat once there is a lot of complexity. And when you're at the enterprise level and you're looking at a stack that goes back 20 years, it gets very, very hard to have...

AI be able to really help out in the coding and code generation side of the house. Have you seen something similar where if you're trying to do one, two, three step pipelines, it's really good at that. But as soon as you start bringing in this very messy data system,

from all over the place and it's being transformed 10 times before or you're asking it to be transformed 10 times so that you can get the exact data you're looking for, it falls flat.

So we've seen a little bit of that for sure. But I think the two main ways we've tried to combat it, and I think that it's kind of a general principle too. One is managing context. So for something like we were talking to a few companies, some enterprise companies, and they had like 15,000 pipelines.

Now, obviously, you can't feed that all into a context window. There's no way. One, the context window doesn't scale to that. And two, even if it did, it would get incredibly confused on what you're asking for. So there's a portion of generating really, really, really good context, which pretty much is trying to simplify the problem down for the agent. So that's a huge portion of what you're trying to do. And I think the other part is specified training.

So for more generalized coding agents, I think one of the things they struggle with is they're not designed for specific task flows. They're trying to do everything at once. And so because of how language models work, they're probability engines. So if your probability distribution is literally everything, you're going to struggle to do the specific stuff. But if you design products around specific verticals, specific contexts, and specific outcomes...

then you'll get a much, much higher quality result. But I think that the problem of like vibe coding and getting errors at that scale is something that actually won't ever disappear.

But I think that's where you have to build specialized infrastructure to make sure that you can solve those problems. So one thing that we're exploring right now in our next build is creating staging environments for everything that they connect by default and trying to make them really lightweight and ephemeral. So let's say you make a change in your database.

instead of it going directly to whatever database you have connected, assuming that maybe you don't have a staging environment, or if you do have a staging environment, it'll just say, here's the change that we're going to make. This is what it looks like. But that also allows the agent to go reflect on that and say, hey, this is how the system would change. Is this correct? And so even if it screws it up on time one or time two or time three or whatever, even 10 times,

It's not getting committed to your actual code base or to your database or whatever. And it has the ability to reflect on that. And so it can then correct itself. And the accuracy over a bunch of different trials goes up a lot. Wow. Okay. Now, are you also creating some kind of an evaluation feedback loop so that you understand what is successful and what is not? Mm-hmm.

So we've actually created a benchmark because there really doesn't exist a good one for data engineering. Well, I was going to even say it may have to be

on a case-by-case basis? Or have you found it can be a bit more generalized with just a data pipeline evaluation benchmark? Yeah, I think it actually can be a lot more generalized, but I think that's specifically because of how data engineering works. It's not exactly general. It's because the tools that people use are pretty standard. Like they're using Airflow or Prefect or Dagster or Databricks or whatever. You can actually name every single one of the popular tools and

And because of how important data is to companies, it's very, very unlikely that people are going to use some kind of unknown tool that no one's really battle tested. Because if you have an error with your database and for some reason they've, you know, coded it really, really badly and it deletes everything, you're screwed. Like you're never going to take that risk.

So people always, you know, tend to go with the standards. And that creates a nice framework for us to be able to test in where we can say as long as it can operate, you know, these tools really well and we'll give it, you know, hundreds and hundreds of tasks to optimize over and it's getting better at that, then we're pretty confident that the majority of people will be served well with it. Okay. And you mentioned in the staging, right?

environment that allows the agent to almost game out what would happen if we made the change the first agent is recommending. It's almost like seeing into the future, in a way, and then reflecting upon whether that is the right move or not.

Can you talk a little bit more about what type of validation checks you have? And is it all through agents or is there also some determinism involved in that too? So it's mostly through the agent itself.

But we can also pull in, like, data quality checks and other tests that you have running. So our goal is to replicate your actual environment as much as possible. So it writes a pipeline, then it runs whatever tests; if you have them, we'll bring them in. The agent itself will be able to see into error logs and all that kind of stuff to understand, hey, did I write code with bad syntax or hallucinate something? Or does the output look the way that's expected?
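
A minimal sketch of that write-run-check loop, assuming a hypothetical agent interface and staging runner (neither is Ardent's real API); the point is that nothing is committed until the staging result comes back clean.

```python
# A minimal sketch of "run it against staging, check it, and loop".
# `propose_fix` and `run_in_staging` are hypothetical stand-ins for the
# agent and the ephemeral staging environment, not Ardent's interfaces.
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class StagingResult:
    error_log: str = ""                                         # syntax errors, stack traces
    quality_failures: list[str] = field(default_factory=list)   # failed data quality checks

    @property
    def ok(self) -> bool:
        return not self.error_log and not self.quality_failures


def refine_until_green(
    task: str,
    propose_fix: Callable[[str, str], str],        # (task, feedback) -> pipeline code
    run_in_staging: Callable[[str], StagingResult],
    max_attempts: int = 5,
) -> str | None:
    """Generate code, run it in staging, feed failures back, repeat."""
    feedback = ""
    for _ in range(max_attempts):
        code = propose_fix(task, feedback)
        result = run_in_staging(code)
        if result.ok:
            return code                             # only now would anything be committed
        feedback = f"errors: {result.error_log}; failed checks: {result.quality_failures}"
    return None                                     # give up and ask a human
```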

And it takes all of that and then it can just sort of loop until it gets things right. That's really cool. A lot of times folks talk about how hard it is to get right when they are talking with agents. The...

ability for the agent to ask questions if it does not have enough context. How have you gone about fixing that problem? Because I could say, like, yeah, set me up an Airflow pipeline and use my database, and the agent might go and just set up some random Airflow pipeline with some random database and say, here's what you asked for. And, or, it could hallucinate something when it doesn't have the right context and the right ideas, right? So

What are you doing to make sure that right off the bat, the agent doesn't go start working unless it has the right amount of information? Yeah. So one thing we actually added in our task flow and actually specifically train on is if tasks are possible. So one of our test cases, like a very simple one, was we're going to not feed it in Postgres credentials.

And we're just going to tell it, go build something in Postgres, like add this table and see what it does. And at the beginning, it like tried to do it exactly like you said. And we, you know, essentially trained that behavior out. So we added a check step in sort of the planning phase of it, where it'll try to determine if the task is possible or impossible. And so we just tried to train that out where it will tell you that's not possible. You should probably not do that. And that's another LLM call?

It's part of the original one. So the flow that we have set up is you make a request, the agent will gather context, search the web, all this stuff, and it'll give you a plan of, like, here's what I'm going to do. Are you okay with this? And you can go back and forth and just make edits. And so if it's made a mistake, that's your opportunity to look in detail about everything it's going to do and say, like, that step is wrong or that step is wrong. Or actually I wanted this table or something like that. And

and then let it go off on its own. But in that phase, there's a check there that tries to say, is it impossible or not? And if it is, then, well, it'll say, please revise that. Yeah. How do you think about really big tasks versus trying to break up the tasks and make them smaller? And when you're designing your agents,

trying to ideally have smaller tasks so that you have more dependability? I think breaking it down is definitely the way to go. I just don't think you can get enough out of a larger task. And in us breaking it down in the planning step, so it won't just say, hey, I'm going to build you a pipeline, it'll say, here are the steps that I'm going to execute, is this right?

it allows a lot more fine-grained control over what you're going to do. And yeah, you get a lot more accuracy out of it. But it also helps on the user side because they can see exactly what little steps are going on. And so they also have control. And so, especially if our users are data engineers, they very much understand what needs to be done. And so they can really make the edits like, hey, step three is wrong, or it's a little bit off, or you inferred the wrong table, right?

Let's correct that. And so it doesn't put all of the onus on the agent, at least not yet. And there's always going to be a little bit of a mismatch, right? Because there's no way you can perfectly replicate everything into your brain and just dump that into the context. Are you using the GitHub repos of the organizations as context also to help when creating the different pipelines? Yeah.

So the primary way people tend to store pipelines is through some sort of version control software, and that's usually GitHub. And so, yeah, the agent will connect in, sort of scan and understand that repo, look for specific folders that are important, such as if you have a DAGs folder, which is Airflow's name for data pipelines and all of the files in there. So it'll know that's where I write my pipeline code or that's where I extract from.

But you're not specifically trying to vectorize all of this stuff that's in there. You're just taking it at face value. You don't need to overcomplicate things by saying, well, let's vectorize everything that we have. Let's throw it into a vector db. And then hopefully it can give us better retrieval. And also it will know...

better semantically, like, what is being asked. None of that actually matters. Are you talking about for the code?

Yeah, or any of the stuff that you're giving it as context. So we do actually do a bit of rag and retrieval. And the main place we do that is pretty much on all the context layer. So if you ask about Postgres and MongoDB, what we're going to do is try to rip out as much unnecessary context as possible. So it's actually less of a problem of giving it the right context,

What you want to do, at least for data engineering agents or I think coding agents in general, is give it as much as possible because we have this policy of achievable outcomes.

So if you don't give something like database credentials, I don't give you my Postgres credentials, there's no way you're going to ever get into that database and understand what's there in any conceivable universe. And so what we try to do is give the agent just enough, but remove all the unnecessary stuff. So for example, if it does a retrieval search and says, okay, we don't need Databricks, we don't need any of this other stuff, but we give it just enough so that if it did need Databricks,

then it can start pulling that context or make that part of the planning steps of, hey, I need to pull this thing, I need to pull that thing, then it can still do it. So we do add sort of a search layer on top, but it's more to get rid of unnecessary information. Like, you think 15,000 pipelines. Yeah. There's no way, right? So you do need to index all of that and understand, for a user query, what to pull and what not to. So if I'm understanding this correctly, it's more about

How do I tell the agent where to look if a certain question is asked? I have that information there in the context and it knows, oh, cool, airflow. I go and look in this area and then it gives me more information about how to deal with airflow.

Yeah, exactly. So it's a mix of it'll pull stuff directly. So for the pipelines and the code, we do index that so that it can pull in, okay, pipeline three, and it looks like this. And here's all the code and go edit that code. And here's how it's stored in GitHub. Go do that. But let's say it wanted information that we've decided from that vector search is not relevant. We still have enough information so then it can go and find its way to it.
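
A toy sketch of that indexing-and-pruning idea: keep full detail only for the pipelines relevant to the request, but always pass a thin map of available connections so the agent can still plan to pull more later. The keyword-overlap scoring is just a stand-in for a real embedding or vector search, and all the names are made up.

```python
# Toy context pruning: rank indexed pipeline metadata against the request,
# keep the top few in full, and pass only a thin list of connection names
# for everything else. Keyword overlap stands in for a real vector search.
def score(query: str, doc: str) -> int:
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d)


def build_context(query: str, pipelines: dict[str, str], connections: list[str], top_k: int = 3):
    ranked = sorted(pipelines.items(), key=lambda kv: score(query, kv[1]), reverse=True)
    return {
        "relevant_pipelines": dict(ranked[:top_k]),   # full detail only where it matters
        "available_connections": connections,         # thin map: names only, no schemas dumped in
    }


if __name__ == "__main__":
    pipelines = {
        "orders_api_to_postgres": "pulls orders from the billing API into Postgres",
        "clickstream_to_snowflake": "loads clickstream events into Snowflake",
        "mongo_users_sync": "syncs user profiles from MongoDB",
    }
    ctx = build_context(
        "add a column to the Postgres orders table",
        pipelines,
        connections=["analytics_postgres", "snowflake_prod", "mongo_users", "databricks_ws"],
    )
    print(list(ctx["relevant_pipelines"]))
```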

I see. So that in case it comes up later on down the line, then it's not just like, what is Databricks? Yes, exactly. You're in a bit of trouble in that scenario. That makes a ton of sense. Now, one thing that we talked about, I think probably two months ago with a team that was creating like a data analyst agent, they made sure that the...

data analyst agents were connected directly to different databases. And if the agent was for marketing, it was only given scope for the specific marketing databases, and all of that analysis takes place in, like, almost this walled garden. Have you seen that work, or is it something different? Yeah.

I think for certain companies that makes sense, especially for larger enterprises. I think there's a good chance a lot of that comes from security. If I had it my way and I had access to everything I wanted, I would just be like, okay, give me as much context as possible. I'll index literally everything you own.

Um, and you know, I designed some sort of custom embeddings on top of that, which is some of what we do to make sure it's really accurate on everything. Um, so I think there's like a trade off there where I think more context is generally better.

actually, in this case. But again, you want to keep it pretty thin, right? Like you don't want everything about everything. You want just enough about everything, especially if you have sort of these cross-context flows. Now with pipelines, it's something like sort of the call-out structure where you have

a pipeline that'll trigger a data processing service to go do all the heavy lifting. So you're not processing millions or billions of rows of data in your Airflow instance, which will blow up; you're putting it off to Databricks or someone else. So you kind of want that context, right? So it makes a lot of sense, especially at the enterprise scale, to have it sort of guarded off like that. And it may improve the agent too. But I tend to go for

bring in lots of context and just keep that understanding, so that when you have those flows that sort of peek out, you know what's happening. Along the lines of processing data in the wrong place or with the wrong tool, that's going to make something blow up or be really expensive after the fact:

Do you have alerts set up or do you have some kind of way to estimate? As you said, we have that staging environment. We can estimate if this is going to work, which is one vector. But then I imagine another vector is how much will this cost? Yeah. So what we've actually found, especially for existing customers, is they actually kind of know what their costs will be.

because a lot of times the work is not, I don't know what I'm doing. Tell me how to do it better. It's I know what I need to do, but I wish there was 10 of me. Like, I just don't want to take so long. Yeah, yeah. It's like cumbersome. The work is very cumbersome. Exactly. And so they offer that work to our agent. Like, hey, this pipeline is taking 10 minutes and it needs to be a minute. Now I know how to do that.

But I really don't want to do that. So can this thing just auto scale the pipeline out? Okay, cool. It will do that. And they can give it specifics on how to do that. So we haven't seen that as much. But yeah, you will be able to pull that stuff from the staging environment. And somewhere we're looking to go more down the line is

you know, being able to auto optimize sort of at that level of like, hey, we can save you 20% on your bill if you just changed everything like this. And our agent knows how your costs are set up and all this stuff. The other thing that I was thinking about on this is, do you primarily interact with the agents through Slack or is it via Web GUI? What is it? So it's mainly through a web app.

And they can, yeah, they can just push whatever changes to the agent. So it's just a chat interface. And then we've got sort of the terminal style. So you can see what it's doing as it's doing it. And then you have a bunch of options on do you want to make it more co-pilot or full agent, that kind of thing. So you've seen most of the interaction go through there. But we also built out an SDK and API.

So you can actually build the agent reactively into your flows. So for example, a really good example is like at 3 a.m., you have a data pipeline fail and you don't want to get up for that because that's ridiculous. And so instead of having to wake up and then figure out what's going wrong and write some more code, or if it's really simple, just restart the pipeline. Like, why are you doing that at 3 a.m.? I want to sleep.

you can put in code directly into your error handler or something, and it will auto-trigger the agent to run, and you can put in whatever text you want. So it's the same thing you would do in the web UI, but now you can do it in your code. And that allows those flows to be, hey, there's an error at 3 a.m., the agent has gone while you're still trying to get up and shake off the sleep. It's going and doing all the work, and it said, hey,

I found the fix. Here it is. Like, here's a PR to change, you know, whatever bug is going on. Or you just need to restart the pipeline. I'm ready to do that. Would you like to do that? Just a yes or no. So it's not actually having full autonomy, like you were saying before. You're trying to get to that point. But at this moment in time, it still gets that human intervention or the green light from some kind of a user. Yeah.
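
A rough sketch of that 3 a.m. flow as an Airflow on_failure_callback that hands the failure to the agent over HTTP and keeps a human yes/no in the loop. The trigger URL, payload shape, and environment variables are hypothetical, not Ardent's actual SDK or API.

```python
# Sketch of wiring the agent into an error handler: an Airflow failure
# callback posts the failure to a (hypothetical) agent trigger endpoint,
# asking it to diagnose and *propose* a fix rather than act autonomously.
import os

import requests


def trigger_agent_on_failure(context: dict) -> None:
    """Airflow passes the task context (dag, task instance, exception) here."""
    ti = context["task_instance"]
    payload = {
        "instruction": (
            f"Pipeline {ti.dag_id} task {ti.task_id} failed at {context['ts']}. "
            "Diagnose it; open a PR if code needs to change, otherwise propose a restart."
        ),
        "error": str(context.get("exception")),
        "mode": "propose_only",  # keep a human green light in the loop
    }
    requests.post(
        os.environ["AGENT_TRIGGER_URL"],   # hypothetical trigger endpoint
        json=payload,
        headers={"Authorization": f"Bearer {os.environ['AGENT_API_KEY']}"},
        timeout=10,
    )


# Attached to a DAG or task definition, e.g.:
# DAG(..., default_args={"on_failure_callback": trigger_agent_on_failure})
```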

Yeah, I honestly don't know if we'll ever get to full autonomy. And that's mainly because, like, even if you placed it in the role of, okay, we're hiring a junior dev, would you really want them to push to prod just like, it's fine, push to prod, just go? There's some situations where fine, yeah, you probably would want that, right? But, you know, maybe you might have some breaking changes that are a little bigger than the initial one, and you don't really want those push. So

I think allowing people to retain control is pretty important, especially with agents where they are. I think we're at the point where they're useful, but they're far, far from perfect. And so it's not really a good idea, in my opinion, to just say, do everything for me and I'm just going to close my eyes and let you do everything. Yeah, the idea of...

cognitive load with the agents, and if at the end of the day they help you take off certain amounts of cognitive load, then that's awesome. The better question, I think, would probably be around the idea of cognitive load and how you are looking at not adding extra burden to that end user. Because a lot of times, and we've probably both felt it, we've had interactions with AI that

are not great. And we come away from it thinking, damn, that would have been faster if I just did it myself. Yeah. I think the best thing we can do in that scenario is just

Make the product better. Train it better. Get it more accurate over scenarios. I think one of the other benefits of building sort of workflow-oriented products is you get a lot of feedback from your customers. So when something screws up for them, you get a trace of like, here are the steps and here's everything that blew up along the way. And so as more people use it, the better you can make it for everyone because now you just have...

You know, it goes from thousands to hundreds of thousands to millions of flows coming in of, OK, this happened and here's the evaluation. Yes or no. Yes or no. And you can use that to just make it great. So I think the best thing to do, honestly, is just make it better. And then, you know, building those sort of staging bits and the context bits that allows the agent to actually do good work.

Because I think that's the only real solution to it, right? Like you can band-aid fix it and say, well, it won't affect anything real. But the real thing is you want it to do good work. That's why you're buying it. That's why you're doing anything with any of these tools. And so I think that's like the only real solution. It is nice that you play in the field where the suggestions or the actions that the agent takes have a very clear purpose and

evaluation metrics. It works, it runs, like data is flowing or it's not. And on other agentic tasks, I think what I've heard from some really awesome people is the closer you can get to, like, runs / does not run, the better it's going to be for an agent, because you can clearly decide if it was a success or if it wasn't.
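
A minimal sketch of what a deterministic, runs-or-doesn't-run style check could look like: run the task, then verify the expected record actually landed in Postgres and note how long it took. The DSN, table, and test values are hypothetical.

```python
# Sketch of a binary pass/fail check: did the record actually land in the
# database, and how long did the whole task take? DSN, table, and values
# are hypothetical placeholders.
import time

import psycopg2


def eval_add_person_task(run_task, dsn: str = "postgresql://localhost/bench") -> dict:
    started = time.monotonic()
    run_task()                                   # the agent performs: "add a name and age to people"
    elapsed = time.monotonic() - started

    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute("SELECT COUNT(*) FROM people WHERE name = %s AND age = %s", ("Ada", 36))
        found = cur.fetchone()[0] > 0

    return {"passed": found, "seconds": elapsed}   # binary outcome plus a latency signal
```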

I totally agree with that. And that's actually part of how we train in sort of the benchmark training set, whatever you want to call it. So it's like an eval and train set there. We actually do have deterministic outcomes. So we have everything from super simple stuff like training.

just testing if the agent even understands the database, right? Like put something in the database, add a name and an age as a record into, you know, our Postgres database. Super easy, go do that. Or make this table. And then you validate, did those things actually exist? Are they there? And you add layers on top of that of like, how long did it take? And you pull all of that back. And so when you're optimizing, you're not just optimizing on

the simplest, like, you know, maybe an LLM to evaluate the code: did this look right, are there errors? It's like, we actually saw the data end up where it needs to go, we saw how long it took, um, we saw how fast the actual query ran, like, this is too slow, this is too fast. And especially when you're training, um, that's really useful feedback, right? Yeah, we had, uh, my buddy Willem on here probably

I don't know when, a couple of months ago. And he was talking about how for them, they're doing something very similar, but for SREs and root causing problems and triaging things. And they set up a knowledge graph with Slack messages that are happening. They also set it up with code and they set it up with, I think JIRA so that you could have a bit more of a picture of

when something goes wrong. For that product, it makes a lot of sense. I wonder if you have thought about trying to do something similar, or if you're thinking that the way and the ability for you to get the context that needs to be in that agent's context window is enough right now with how you're doing it. I think right now, um, where we focused is all the, like,

This is what it needs to work. So all of the coding pieces, you know, pulling in GitHub, pulling in database context, that's what makes it work. I think that next level of, and here's all your Slack and JIRA and this and that, is just, it gets better. Because essentially what people are doing right now is they're taking that context that's out there already, putting it into their head, and then putting it down into the prompt.

And so people are kind of transferring it themselves, but it would probably be something that we're looking at. So, um,

we had a few customers ask for a doc attach, where they could just put in, like, a documentation, because they'd apparently been storing all their context and all their practices in one giant document. And we'd heard multiple people do that, where they're like, this is our master document on how to write everything. I think they were using it for, like, a combination of training new hires, but also just a document, like, this is how we do things at this company, especially for data engineering. And so they wanted to be able to attach that as, like, a permanent

fixture of like, this is how you're supposed to do everything. So we've gone down the route a little bit, but I definitely think it's the right approach of trying to pull in even more. Again, you get into like the context balance issue. But I think if done right and done, you know, the gold standard there is like if it's done perfectly, then you do gain a lot more than I think you'll ever lose.

Speaking of that doc attach, it's not the first time I've heard it. And it's in a little bit different of a scenario, but the whole idea is how can we create some kind of a glossary

for the agent to understand us when we talk about terms that are native to our company, or when we want things done the way that our company does them, because maybe it's different in every company. Or maybe what we mean when we say, I need to go get MQLs from the database. What is an MQL? It's not necessarily labeled as an MQL in the column. So it's not like it is a

clear-cut-and-dry thing, and the agent needs to know either how to create that SQL statement to figure out what an MQL is, or it needs to understand what that actually means and which column that relates to. No, absolutely, I agree with that. Actually, I think one of our core principles is that we're not going to make you migrate anything. And so that's not just about, like, databases or, you know, sort of like

code level stuff. It's like, we want to work the way you work. We don't want to tell you how to do things. And I think there have been products, especially in the past before the whole AI wave, is they've kind of tried to make decisions for you because there was a real trade-off. And they were like, here's a better way of doing things. You'll save a ton of time. But now we're in this place where

you know, maybe the switching might be a little bit easier, or you can sort of have cross-context tools. But still, migrations, changing the way you do things, is like the worst thing you can ask someone to do. In my opinion, if you say, hey, switch your database, they're going to look at you like, are you crazy? Like, your AI is great, but no, we're not switching our database for that. Like, please leave. So we very much found that, yeah, working the way people do is like

That's how you do it. That's how you do it right. And you're essentially allowing them, you know, even if you try to improve it a little bit, you're saying, well, you're doing this and you could do it better in maybe this slight way. But it's not like we're going to come in and tell you what to do. And actually, you have to, you know, rip out your pipeline service and use our custom pipeline tool that we've built with AI in it. Like, no, just use your stuff. This thing will drop in. They'll solve your problems. Mm hmm.

And along those lines, do you have something that is like common asks or common patterns, common requests of the agents that you've codified and you figured out, okay, this pipeline works.

is being requested like twice a day or twice an hour. And so maybe we can just make that a one-button click instead of having to have someone prompt it every time. Um, we haven't seen as much of that directly.

I actually think that's a little bit different than the philosophy we're going after. Because usually for pipelines first, they're set up once and then they're managed. So it's rarely like recreate this over and over and more of we've created it and now we have to make sure it doesn't break and make sure everything else doesn't break at the same time. So a lot of that. And then I think a lot of our approach is also

The thing that LLMs are great at is being non-deterministic and solving a wide array of problems. And so even if we see a pattern in there that might be easier as like a one-click just do this,

I don't think that's actually a good practice to add to the agent because you're trying to direct it yourself versus train on data. So what should be happening is if you see that happen a lot and you keep adding that as training feedback, it should get really good at it. So, you know, they might have to prompt and ask for it, though usually they're not going to ask to recreate a pipeline six times. Um,

But it'll just get good at those kinds of tasks. Right. Or if you want, you can just use the API. And if you really want to recreate a pipeline six times in a row, you have an API, you have an SDK, just write a for loop and it'll do it six times in a row. I'm thinking about the Uber prompt engineering toolkit because we just had a talk on it last week for the AI in production episode.

conference that we did, and they were talking about how they will surface good prompts, or quote-unquote good prompts, like prompt templates, we could say, that people can create. And so maybe it's not exactly the same thing that's being asked, but it's

you have the meat and potatoes of your prompt already ready. You click on that, and then it's there for you, and you change a few things. Or, extrapolating that out, I was also thinking about another talk that we had from Linus Lee, who was working at Notion, and his whole thing was, with Notion AI, we just want you to be able to click and get what you need done through clicks and

without having to have that cognitive load of trying to figure out what it is exactly that I need. Because there are like six things that when it comes to Notion, at least, and I understand it's a completely different scenario for you. When it's in Notion, maybe you want to elaborate on something, you want to summarize, you want to write better, clean up the grammar. And so they give you that type of AI feature just with a few clicks.

I think our version of that would probably be like user specific prompt suggestions or like chat suggestions. So if you are requesting for a pipeline or working with XYZ pipeline a lot, then it will be able to learn from that and give you sort of almost like search suggestions the same way in Google or any service. It'll say, hey, were you thinking about asking this or this or this? And then, you know, sort of go that way. That'd probably be the best version of it for us. Yeah.

But it would definitely help, you know, people don't have to ask for the same thing. Yeah, like six times. It'll start to learn. Like, maybe you do want to talk about pipelines because that's all you've been talking about. Simple humans. What are you doing?

No, talk to me about pricing, because I know this can be a headache for founders in this space. Specifically, because like the traditional way of saying it's seat based pricing can get really not useful or not profitable for a company. If everything is on usage based pricing, then you run the risk of,

The end user thinking twice before they use the product. It's like, oh, if this is going to cost me like a buck or two bucks, maybe I should do it myself. I mean, hopefully people value their time much more than $2. But I know that I've been in that situation and I think, do I want to spend the $2 right now? I don't know. So how do you look at pricing? How are you currently doing it? And what do you think about like

As you've learned from customers and talking to customers. So usually when we sign new customers, it's usually like a flat subscription fee that we give them, and they get a credit allocation for it. So usually we evaluate what their needs are, and then we essentially come up with, like, you know, a scaling of here's how many credits to, you know, whatever dollar amount.

And then we usually offer them a subscription on that, especially because, you know, usually they want to either build new pipelines or maintain existing ones. And so they want to remove that sort of work.

And a credit is like a token or a credit is...

Like, spend away. Like, those are yours to spend. Like, please drop it to zero, you know. Um, and if you want more on top of that, then we offer sort of a token basis on top of that. So it's like, hey, you've used up all your credits for this month. If you want to re-up, here... Send Bitcoin to this wallet address, then good. Yeah. Not... yes, not exactly the Bitcoin, not that far, but yes.

Oh yeah, wouldn't it be nice, though? I wonder. Oh, that's hilarious. But the, um, yeah, that makes sense, and that is one of the pricing patterns that I've seen, because it will help folks. So you're kind of estimating, as you're looking at how nasty their data pipelines are or how many they have. If it's that 15,000 one, you're going to

Give them a bigger quota or think that they're going to use a bigger quota. So you're going to give them an estimate that's bigger than if it's just one data engineer with a few pipelines they need to run. Exactly. And our primary goal, like whenever we're on those calls, is not to say like, you know, it's not really you have three seats and it's like, OK, what's your problem? How do we solve it?

Right. And it's making sure you have the resource allocation to actually get your job done. Like, what we don't want is the hassle of someone coming in and saying, well, OK, we, you know, we got, you know, let's say we had a hundred credits and it didn't really do the thing. It's like, yeah, that's because the task you asked for was not suited for that exact scenario. Like, we didn't.

everything isn't set up right for you to be able to do that the way you want it. Or we've set up a run that runs every three days and does data quality checks or just checks for new changes and stuff. And, you know, is there anything wrong? Like whatever you want the agent to do, right? We've set it up in the SDK and like it just stopped running. It's like, well, yeah, because that's not what your needs actually were. So I think it helps a lot on the customer side where they just have to worry about getting their job done.

And that's it. That's the only thing you have to think of. And then as things start to scale, you know, then you're like, okay, well, you can re-up credits or maybe we can change to a, you know, a different sort of subscription or something like that. But at least for now plus a little bit in the future, they just don't have to worry. Yeah. You mentioned also compute. Why is compute involved in there? Are you abstracting away the pipelines themselves too? Are you like...

adding the compute or are you doing the compute yourself? So the compute comes from the agent actually running things, like running code to get your job done. So the fundamental way it works is it is a coding agent. And our thesis, which seems to be working, is that coding is the language that we've already created to interact with every service out there. So why would we try to rewrite it?

Like it's already there in a box. You want to interact with Databricks? They have an SDK and an API. You want to interact with Postgres? Like they have it for you. Why would you try to rewrite that?

And so the way the agent works is it will actually just code the way that a human would to get the job done. So if you have a GitHub repo set up, it'll clone the repo and make the changes in the right place for DAG files, for Airflow, and then push you a PR for that. Or if it's just make a change in my MongoDB database, it'll write the code to get that done and then push that. And so all of that requires compute to run that VM that the agent is given. And so that gets passed on.
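
As a small illustration of "the agent just writes the code a human would": a simple MongoDB change made through the standard client library rather than any bespoke integration. The URI, database, collection, and field names are hypothetical.

```python
# Illustrating "write ordinary client code": a small MongoDB change made
# with the standard driver. URI, database, collection, and field names
# are hypothetical placeholders.
import os

from pymongo import MongoClient


def backfill_signup_source(uri: str | None = None) -> int:
    client = MongoClient(uri or os.environ.get("MONGO_URI", "mongodb://localhost:27017"))
    users = client["app"]["users"]
    # Add a default value for a new field on documents that don't have it yet.
    result = users.update_many(
        {"signup_source": {"$exists": False}},
        {"$set": {"signup_source": "unknown"}},
    )
    return result.modified_count
```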

Oh, I see. So it's not only the LLM calls, it is also the VM and everything that's happening in that sandbox. Awesome. Now, before we go, man, I feel like I want to make sure that I get to ask you everything because this stuff is super fascinating to me. And I love the fact that you're taking like this AI first approach for the data engineers, because Lord knows,

Every day is Hug Your Data Engineer Day. They go through so much crap and get so much thrown upon them that this is a tool that I imagine they welcome with open arms. And I guess the other piece of it is like, I imagine you...

probably for fun or out of passion of building this product, have looked at a lot of logs or looked at a lot of stuff that the agents have been able to complete. What is one run or something that an agent did that surprised you that it actually was able to pull it off? We did a bunch of testing to have it write Spark code and Databricks and to call that orchestrator pattern of

have something in Airflow, trigger something out of Databricks. And like a lot of companies do this where they use Airflow purely as an orchestrator, which is generally a good pattern. And we just had it, it was like a very simple thing

like a data frame calculation that we wanted to do. But it was the fact that it was able to use multiple services at the same time, one-shot the code properly, like, process the data elsewhere, and then sort of pull that all together in, like, an Airflow pipeline that didn't error out. I was like, there's no way that thing just works, right? Like, that's insane.

So I think that was probably the biggest thing where, you know, it wasn't just the complexity of write the pipeline. It was like, and you have to call out to a different data processing service, which means you need to understand what are the clusters that you're running and like all this stuff that needs to go in. Then you also need to write the Spark code properly to make sure that doesn't error. And it like did it all. And I was like, oh, okay, guess we're onto something. Like this is, wow, this is impressive. Yeah.
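
A rough sketch of that orchestrator pattern: an Airflow DAG that doesn't process the data itself but submits a Spark job to Databricks through the standard provider operator and waits for it. The cluster spec, notebook path, and connection ID are hypothetical.

```python
# Sketch of Airflow-as-orchestrator: submit a Spark job to Databricks and
# let it do the heavy lifting. Cluster spec, notebook path, and connection
# ID are hypothetical placeholders.
from datetime import datetime

from airflow import DAG
from airflow.providers.databricks.operators.databricks import DatabricksSubmitRunOperator

with DAG(
    dag_id="orders_rollup_via_databricks",
    schedule="@daily",
    start_date=datetime(2025, 1, 1),
    catchup=False,
) as dag:
    rollup = DatabricksSubmitRunOperator(
        task_id="run_spark_rollup",
        databricks_conn_id="databricks_default",
        json={
            "run_name": "daily_orders_rollup",
            "new_cluster": {
                "spark_version": "14.3.x-scala2.12",
                "node_type_id": "i3.xlarge",
                "num_workers": 2,
            },
            "notebook_task": {"notebook_path": "/Repos/data/orders_rollup"},
        },
    )
```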

It worked once and you're like, don't touch anything. Nobody move. We need to pray to the LLM gods right now that it will work again. Yeah.

Yeah, I remember I immediately just texted my friend. I was like, there's no way this thing just worked, like, this is insane. And then, you know, it's pretty cool when you have those moments, like, you're running through training loops and it's just crashing, crashing, crashing. And then it just starts working, and then it works more and more and more, and it's just not failing anymore. And you're just looking at that thing like, holy crap, like, wow, this is working. Like, this is unreal. Like, you think three years ago, you'd have had

bots, pretty much, that you could just say, go do this super complex task, you also need to code, you also need to understand, like, an entire Databricks environment, oh, and, you know, you should get it right in one or two tries. Like, that's just unreal that it's happening.