If you literally take all the tasks that a data engineer does every day and you write them on a list and you say, which of these are things that a highly trained, highly paid human being should be spending their time on? A lot of them, they shouldn't. Pipeline failures happen. And yet inevitably the cause of those failures is kind of dumb. It's not that interesting. Agents are quite good at
identifying the problem and proposing a fix. I expect to see a lot of automation of data engineering tasks over the coming 12 months. Jevons paradox is coming into effect pretty hard right now. Analytics always expands to fill the available budget. You want to continue to improve the price-to-performance ratio, not so that at the end of the day people can stop doing things, but so that they can do more things.
Thanks for listening to the A16Z AI podcast. We have another great discussion for you today, this time featuring dbt Labs co-founder and CEO Tristan Handy, along with A16Z general partner Jennifer Li and partner Matt Bornstein. If you're active in the world of data engineering, there's a good chance you're familiar with dbt.
But if you're not, here's the very short version. dbt helps its users build data products using the rigor and best practices of software engineering. And as Tristan points out during the episode, it counts more than 1 million users across more than 70,000 organizations. However, this discussion isn't really about dbt. It's about the major changes in the data world brought about several years ago by the concept of a modern data stack and, more recently, by the advent of generative AI.
The three start off on the topic of where AI can really shine in the world of data analytics and data engineering before getting into the rise and plateau of the modern data stack. They also cover the lessons data engineers can still learn from software engineers, and finally, what we should make of a spate of acquisitions and product announcements across the data infrastructure market. And you'll hear it all after these disclosures.
As a reminder, please note that the content here is for informational purposes only, should not be taken as legal, business, tax, or investment advice, or be used to evaluate any investment or security, and is not directed at any investors or potential investors in any A16Z fund. For more details, please see a16z.com/disclosures. I don't believe in the idea that you're going to do analytics by asking a model to write SQL. It's not that interesting.
If you can write a well-formed SQL query, that's not the hard part of analytics. The hard part is that data analysts are socially constructing truth inside of an organization. There is no such thing as revenue in an abstract sense. It is just: what do we all agree is the way that we measure revenue? And a model just doesn't have access to that unless you
give it very specific instruction, and you would do that through metadata. In a best-case scenario, you would do it through something called a semantic layer. A semantic layer would actually give
the model exactly the metadata required to construct the SQL query in a way that everybody in the organization agrees it should be constructed. We acquired a company called Transform, I think it was two and a half years ago, and now it's integrated into the dbt platform, and we built an MCP server that exposes this functionality. And when you go to any MCP-enabled language model and you ask it questions
about your business data. It gives you correct answers. And the funny thing is that there's a bunch of people that kind of play around with that, but it hasn't crossed into the mainstream. There's a ton of curiosity around this, but still people are using
the BI tools that they have been using. Let's break it down: what are the tasks an analyst is doing today, and which are the pieces these models actually have the capabilities to serve? I think, even compared to a year ago, the capability of writing SQL is night and day. And I recently asked ChatGPT to build a chart for me from a
very complex Excel sheet, actually. You need to do a couple of pivot tables and plot this chart, take out a couple of rows and columns as well. It did a great job at plotting the chart. I was very surprised, impressed. I did a couple of spot checks of whether the data points were still correct, and they were.
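Stepping back to the semantic-layer point above: a minimal sketch of what it means to hand a model governed metric metadata instead of raw SQL. The metric name, table, and dict format here are hypothetical illustrations, not dbt's actual Semantic Layer API:

```python
import sqlite3

# Hypothetical metric definitions: the kind of metadata a semantic
# layer hands to a model so everyone computes "revenue" the same way.
METRICS = {
    "revenue": {
        "table": "orders",
        "expr": "SUM(amount)",
        "filters": ["status = 'complete'"],  # e.g. refunds excluded by agreement
    }
}

def compile_metric(name, group_by=None):
    """Compile a governed metric definition into a SQL query string."""
    m = METRICS[name]
    select = f"{m['expr']} AS {name}"
    sql = f"SELECT {', '.join(filter(None, [group_by, select]))} FROM {m['table']}"
    if m["filters"]:
        sql += " WHERE " + " AND ".join(m["filters"])
    if group_by:
        sql += f" GROUP BY {group_by}"
    return sql

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, status TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    ("us", "complete", 100.0),
    ("us", "refunded", 40.0),   # excluded by the metric's filter
    ("eu", "complete", 60.0),
])
print(compile_metric("revenue", group_by="region"))
print(conn.execute(compile_metric("revenue", group_by="region")).fetchall())
```

The point is that the model never guesses what "revenue" means; it only fills in parameters of a definition everyone already agreed on.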
And that sort of gave me more hope of where AI can be applied, maybe in the final step of visualization. But there's also data cleaning work for analysts. And there's this organizational social work, which I don't believe will ever be fully automated. Maybe through a few agents working together, they can gather some truth. But which are the pieces, we can speculate, that are now ready to be
automated, and which are the pieces that still require a human to come in and do the work? I can share just one or two things that we've seen from some of your fellow portfolio companies. I think you're absolutely right that humans have a lot of work to do in data analysis. That's very clear. So much of the work is gathering context, making definitions, sort of
almost negotiating with other stakeholders about what the definitions mean and which ones are correct. You obviously are kind of the world expert on this, so you know this much better than we do. What's interesting is there's a parallel there to writing application code. When you're writing code for a new piece of software, there's still a lot of context, both in terms of the code that has to be ingested, where you have to understand the architecture of the whole system, and the social context of working in a bigger team of engineers with a lot of different opinions. We're obviously seeing AI coding take off to a very
large extent. And I think the key there has been finding the right insertion point. I think this is exactly what you're asking about, Jennifer, in analytics, where it's like,
it's very clear you don't want to just completely replace an engineer with an AI coding system in the same way that you wouldn't want to completely replace an analyst. Like, it almost just doesn't make sense. It's like somebody still has to press the button. So as long as that's true, it's like, okay, are they just pressing a button or are they providing a specification? If so, what does the spec look like? And if so, like, shouldn't they just be writing some code or kind of driving anyway? So there's this kind of like...
fundamental problem, I think, with full replacement of people in these jobs. But what coding has gotten right is that the models write very good code now. They can do some stuff on their own if you give them proper direction and write a good spec. And there are great tools, things like Cursor and Claude Code, that work in a way engineers like. I'm curious to see if that comes to analysts. We've actually seen people use Cursor to write analytics
queries, which is pretty interesting. One of our companies, Hex, has a pretty good AI product that I think you know. It's almost hard to go back, once you've used their kind of magic features, to not using them. But these are still relatively small and relatively incremental. So what's interesting, I think, is what happens next and what really is that right insertion point. I think you're totally right. The interesting question here is human in the loop versus human not in the loop. And this is why
Typically, the way that people think about the quote-unquote AI analyst is not as a way to accelerate current analysts. It's a way to do self-service inside of businesses. So like, okay, I want to take this and I want to give this to every single data user throughout my company, which is 10x, 100x as many analysts as there are. They're the people who mostly are using Excel today. But those folks don't have the ability to evaluate, is this code actually valid?
Correct. Producing the correct result. Yep. They have no way to verify it, which is a very scary thing. That's a human-out-of-the-loop process. The places where I think human in the loop is working really well, and Hex is a great example of this, typically you're going to see users constructing these queries or notebooks who
have the ability to read the code and say whether it's correct or not. And as a result, it becomes an accelerator for them as opposed to a replacement for them. The area where I think there's even more room for this is in data engineering. Data engineering is incredibly valuable, but also,
if you literally take all the tasks that a data engineer does every day and you write them on a list and you say, which of these are things that a highly trained, highly paid human being should be spending their time on? A lot of them, they shouldn't. This is an area where SQL generation is very valuable. Pipelines are incredibly valuable, and their performance matters a lot. So in dbt, we have the ability to help you build pipelines
using natural language. And of course, you've got to actually validate them and work them through the CI/CD process, etc. That works great. One of the things that I think is the most time-sucky and produces very little value is debugging pipeline failures. Pipeline failures happen. Pipelines are more brittle than we'd like them to be. And yet,
inevitably the cause of those failures is kind of dumb. It's not that interesting. You just need to look through enough log files and trace it upstream. There's a process for going through this, but oftentimes it takes four hours for a human being to trace it. And it turns out, and we haven't productized this yet, but we've proven it to ourselves internally via prompt engineering: agents are quite good at
identifying the problem and proposing a fix. And if you have the right tooling, you can then take that fix, run it through CI, and say, ah, this actually produces the output that I'm looking for. So I expect to see a lot of automation of data engineering tasks over the coming 12 months. And are these failures within system boundaries or across system boundaries? Because I've found this is one of the big questions for AI. It's like,
if you have to interface with an external system, it's a lot worse at that, versus if it's, oh, there's a mismatched schema, it's actually pretty good at making a guess and trying to align them. You're right that the things that I'm focused on are very much in the world of the data has landed, all the way through to I've got the data set ready for the analysis that I want. But I think that there is enough
connective tissue in this space that you could turn around and do that same set of things to Fivetran pipelines or anything like that. It's very interesting, because in the coding example, you can set your bug bot loose to try to find a bug, and if there's some external dependency, it tends to just start making stuff up.
It's like, oh, maybe that system went down, or maybe the function signature changed, and it's like, well, did it or did it not? It's very hard for it to tell. That's actually a point in favor of your approach. Yeah, pipeline failures in our world happen for a pretty defined set of reasons: an upstream source changed its schema and it broke something, or new data showed up that we didn't anticipate, these kinds of things. That's very interesting.
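That defined set of failure reasons is what makes the problem tractable. As a toy illustration (not dbt functionality; the table and column names are made up), a first triage step can be as simple as diffing the schema a pipeline expects against what the upstream source actually delivered:

```python
# A toy triage step for the "defined set of reasons" pipelines fail:
# compare the schema a pipeline expects against what the upstream
# source actually delivered. Table and column names are hypothetical.
EXPECTED = {"orders": {"id", "amount", "status"}}

def diagnose(table, actual_columns):
    """Return a human-readable diagnosis of schema drift, or None."""
    expected = EXPECTED[table]
    actual = set(actual_columns)
    missing, new = expected - actual, actual - expected
    if missing:
        return f"{table}: upstream dropped or renamed {sorted(missing)}"
    if new:
        return f"{table}: new columns showed up we didn't anticipate {sorted(new)}"
    return None  # schema matches; look elsewhere (volume, nulls, dupes)

print(diagnose("orders", ["id", "amount", "state"]))
```

A check like this is exactly the kind of log-and-metadata legwork an agent can grind through faster than the four-hour human trace described above.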
I'm asking you this question first because you coined the term, and second because you're sort of a historian of the space as well. You wrote very popular blogs around this, and the Analytics Engineering podcast also talked quite a bit about the modern data stack. Give us a bit more background: what is this
modern data stack? Modern is always a tough term. Modern relative to what? I live in a mid-century modern house. It was built in 1979. That's not particularly modern anymore, but we still call it that because it was in reaction to a thing that came before it. And I'm not an architecture snob, so I don't really know the full history there. But the thing that modern was referring to was two things that came before it: one was the
Hadoop world, and the other was the on-prem data warehouse appliance world. And both of those either had already hit or were starting to hit some pretty serious headwinds by the time the cloud
came for data. I would put the start of the modern data stack at the launch of Redshift in 2013. You could argue that maybe the 2013 version of Redshift didn't have many of the characteristics that some of the data platforms later came to have. But it was the first time you could swipe a credit card and get access to really great analytic technology in the cloud. Before that, you had to spend
a hundred grand to procure servers. So an ecosystem grew up around it. In the early days, it was Looker and Mode and Periscope and Fivetran and Stitch and maybe a couple of others. And you could pretty quickly put together a set of products that was pretty mature, with a couple of credit card swipes, in an afternoon. And that was brand new. And for people who had been stuck with
not good tooling for a long time, it was really exciting. It allowed us to work in ways that were not at all possible. Now, I'm sure we'll get into it, but I think that the arc of history has, I think, played out on the term modern data stack, mostly because it won. The ideas in the modern data stack have kind of taken over the industry. And so then the question becomes like, well, what
What's next? You sort of also dated the end of the modern data stack era to 2024. So where are things at now, and where is the data stack? Is it post-modern? I don't know. The impressionist data stack, or the deconstructionist data stack. I don't remember who originally got me onto this, but I've become a big fan of the Carlota Perez
framework. It introduces the concept of S-curve stacking. Every technology goes through an S-curve where it starts off with almost nobody using it. Then very quickly a bunch of people adopt it, early adopters, then middle, then late adopters. And eventually everybody's using it and it starts to level off. And the way that you get technological progress is you stack S-curves on top of one another. The way that I see the space right now is that,
really, the S-curve right before what we've been talking about as the modern data stack was the rise of public cloud and Hadoop. And Hadoop was really enabled by the cloud; most companies couldn't really imagine running a Hadoop infrastructure on-prem. It's just not really how it's built. That S-curve came to an end, and then you have the rise of the modern data stack.
I would say that that S-curve kind of came to an end in the same way that the S-curve around railroads came to an end: we got all the railroads, and we're not in a deployment phase of railroads anymore, circa 1925. And so now the big axis of innovation, I think, is in two places. One is in open standards, things like
Delta and Iceberg; that's at the file format or the table format level. And then the other one, obviously, is in AI. And AI is a much bigger topic than the world of data. For all the excitement that's happened in data over the last 15 years, I don't think anyone's worried about
artificial data intelligence putting us all out of jobs or anything like that. The societal implications of AI are fascinating and well beyond what I'm an expert in, but there are very direct
implications for AI on data and for data on AI. And so it's that intersection that I'm particularly interested in. One question I have for you, Tristan, is: are there things that the modern data stack never hit? Are you seeing a lot of workloads that are still kind of there and have been there forever? And even though people know the modern data stack is the right way to do things, they're like, oh, but this has some other thing, and so we just haven't touched it. The term modern data stack, if you
move away from the technology part of it, there's also a persona part of it: who tends to work with this set of technologies? And I think the answer to that is the spectrum from data engineer to analytics engineer to data analyst. It's people that are firmly in the world of data. Software engineers sometimes dabble in that space, but they mostly don't.
And similarly, if you're a business analyst...
You might dabble, but you mostly don't. Business analysts have been pretty resilient to the rise of the modern data stack, and a lot of them still use tools like Tableau and Alteryx and Excel. And we still run into software engineers who, despite the fact that well over a million people are authoring dbt workloads at 70,000 companies today, a lot of times just don't have any contact with
this tooling stack at all. In terms of workloads, I thought that we, the collective ecosystem, were going to do more in streaming, and that hasn't turned out to be true, at least at the pace that I had anticipated. I think that ends up being more of a persona thing than a technology thing, because I think there are actually good answers to how to do SQL and Python on stream processing engines.
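For a sense of what those streaming answers look like at their simplest, here is a sketch of the sliding-window aggregate that stream processing engines let you express in SQL. The window size and event values are arbitrary, and real engines also handle out-of-order and late-arriving events, which this ignores:

```python
from collections import deque

# A sliding-window aggregate over a stream of timestamped events.
class SlidingSum:
    """Sum of event values within the last `window` seconds."""
    def __init__(self, window):
        self.window = window
        self.events = deque()  # (timestamp, value) pairs, oldest first
        self.total = 0.0

    def add(self, ts, value):
        self.events.append((ts, value))
        self.total += value
        # Evict events that fell out of the window.
        while self.events and self.events[0][0] <= ts - self.window:
            _, old = self.events.popleft()
            self.total -= old
        return self.total

w = SlidingSum(window=60)
print(w.add(0, 10.0))    # 10.0
print(w.add(30, 5.0))    # 15.0
print(w.add(70, 2.0))    # 7.0  (the ts=0 event expired)
```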
But I think it actually just tends to be different humans who need really low-latency data delivery. Streaming, like cold fusion, is one of those things: it's always a good idea and always on the horizon. That's a really interesting point you make about software engineers versus analysts or analytics engineers, because I think in a lot of ways the history of the data stack, like you said, is sort of this diffusion phenomenon
from more engineers towards more analytics type people, right? Like you mentioned Hadoop, for instance, you know, Hadoop was sort of a very technical, highly engineered solution built by a bunch of engineers, right? They kind of looked at this data problem that existed at the time and said, okay, let's do a distributed file system and this really complicated sort of
programming model that only, you know, like a Google engineer could invent, right, called MapReduce. And so I think you saw a diffusion of this happen for a long time, right? Like you can trace things like Hadoop into things like Redshift or Snowflake, where you're having this distributed benefit, but with an easier programming model, for instance. I think Iceberg or Delta, which you mentioned, is another great example of that, where this was
sort of a new table format, as you mentioned, built by people at Netflix and Airbnb and Apple and places like that, but it has really diffused much more broadly now as mainstream enterprises want to apply this kind of independent storage layer. It's not clear if that's still happening right now. Are there new things that are diffusing out of the engineers into the analytics world? Or maybe, to your point, those groups aren't talking to
each other as much these days? Or maybe it's just the natural flow of the industry? It's an interesting question, I think, that you bring up. Not that we represent the entire modern data stack by any stretch. I would never try to claim that. But I will say that we're at over a million developers and 70,000 companies... Those are huge numbers, by the way. Sorry to interrupt. That's crazy to think about. It's a decent slice. And we still see those numbers growing
pretty quickly. That is not because there are that many new humans getting minted every year. It's because people are still joining this movement, this way of looking at the world. I think that will continue to be true for a long time. In terms of what we still have to steal from software engineers: I've pretty consistently felt that the software engineering tool stack was maybe two decades ahead of data.
I think that maybe we've closed a little bit of that gap, but we're still pretty far behind. One of the things that is irritating to me is that in data, most of the processing engines that we use are proprietary, and they're controlled by a vendor. And
as a result, there's no such thing as a local development environment, which is kind of anathema to software engineers. The idea that the only way I could possibly run my workload is in Amazon RDS? That's not a thing. Or it was a thing 25 years ago. The other thing is that basically all software engineering ecosystems are fundamentally built on a compiler, or an interpreter in the case of interpreted languages like Python. That
compiler
defines the ground truth for what works in an ecosystem. And then on top of that, you have libraries and package management and a whole ecosystem built up around it. And because data has neither of these two things, you end up with a dysfunctional environment where at every company you go to, you have to build everything from scratch all over again, because there aren't good shared libraries, because
at this company we use a different data platform, and the languages between these two data platforms are different enough that you can't reuse the code across them, and so on. And so one of the things that we have been very focused on over the last six months: we acquired a company called SDF. SDF is fundamentally a compiler company.
The technology involved is a SQL compiler, a multi-dialect SQL compiler. It aims to abstract across all of the differences between the different SQL dialects and then pull that down to a place where you can actually emulate the database, with full functionality and
100% fidelity, on your local machine, and give developers tooling that they can trust there. What is the product work you're doing now on dbt Fusion? The dbt Fusion engine comes directly from technology that we acquired from a company called SDF. This is a group of very smart humans who essentially rebuilt
the engine at the heart of the dbt ecosystem in Rust and gave it a bunch of new capabilities. At its heart, it is a SQL compiler, a multi-dialect SQL compiler, and so it can do a bunch of things like understand, at the most granular level, how a query will operate when it's sent to a database, and it can emulate that locally. That allows
us in this new world to do a bunch of neat things. It allows us to give developers local development environments. It allows us to give developers much better developer tooling in their IDE than they've ever had access to before. Error handling, automatic refactoring, all of these kinds of things that you would expect in a modern software language.
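For a flavor of what multi-dialect compilation involves, here is a deliberately tiny sketch: rewriting one dialect's function spellings into another's. A real compiler like Fusion's parses the SQL rather than pattern-matching strings, and the two rewrite rules below are just illustrative examples of dialect differences:

```python
import re

# A toy slice of what a multi-dialect SQL compiler has to do:
# rewrite one dialect's function spelling into another's.
# The dialect pair and rules are a tiny illustrative subset.
RULES = {
    ("snowflake", "duckdb"): [
        # IFF(cond, a, b) -> IF(cond, a, b)
        (re.compile(r"\bIFF\(", re.IGNORECASE), "IF("),
        # NVL(a, b) -> COALESCE(a, b)
        (re.compile(r"\bNVL\(", re.IGNORECASE), "COALESCE("),
    ],
}

def transpile(sql, source, target):
    """Apply the rewrite rules for one dialect pair to a SQL string."""
    for pattern, replacement in RULES[(source, target)]:
        sql = pattern.sub(replacement, sql)
    return sql

print(transpile("SELECT NVL(discount, 0), IFF(paid, 1, 0) FROM orders",
                "snowflake", "duckdb"))
```

Multiply this by every function, type, and evaluation quirk across eight-plus dialects and the scale of the emulation problem becomes clear.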
It also will allow us to do a bunch of neat things that are kind of new in the data engineering space. So the original technology for this came from
when the CTO, Wolfram, was at Meta. He was hired at Meta in the wake of the Cambridge Analytica scandal, and the task was: we have over a million tables in our data warehouse, and we don't know how PII flows through that data warehouse. We've got eight different compute engines, and their SQL dialects are all a little bit different, and we need to make sure that everywhere the PII flows, we can track it.
So that is a capability of this engine. At the source level, you can tag all of your PII and PHI, and then it will perfectly track that for you through your entire data estate. It will also give you the ability to orchestrate your pipelines in a much more sophisticated way, so that it never does any work that it doesn't have to. Much more efficient. Much more efficient. It has the ability to reduce your overall
infrastructure costs by meaningful double-digit percentages. Also thinking ahead: when we have more AI analyst agents, the compute bill is probably going to stack up if we don't have these more efficient workflow engines. Yeah, Jevons paradox is coming into effect pretty hard right now. I was
just talking to Jordan Tigani at MotherDuck, and he's seeing a lot of workloads move onto MotherDuck. But then, what do you know, he saves people a bunch of money and they find a bunch of new workloads. And he shared a quote with me, I forget who it's attributed to, but: analytics always expands to fill the available budget. So
you want to continue to improve the price-to-performance ratio, not so that at the end of the day people can
stop doing things, but so that they can do more things. Right. And that's one of the premises of why the modern data stack was popular: there was a lot of business analytics work that needed to be done in the past that we weren't able to do with much more limited data sets. Now that you can store all the data you want to analyze in the cloud data warehouse in a much cheaper, much more performant, easy-to-access way, we can answer a lot of questions that we were not able to answer before.
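The "never does any work it doesn't have to" idea from a moment ago can be sketched with content fingerprints: hash each model's code together with its upstream fingerprints, and rebuild only what changed. The model names and SQL below are hypothetical, and real orchestrators track far more state than this:

```python
import hashlib

# Fingerprint a model from its SQL plus its upstream fingerprints,
# so a change anywhere upstream changes everything downstream.
def fingerprint(sql, upstream_prints):
    h = hashlib.sha256(sql.encode())
    for p in sorted(upstream_prints):
        h.update(p.encode())
    return h.hexdigest()

def plan(models, deps, last_prints):
    """Return (models to run, new fingerprints); dict order = topo order."""
    prints, to_run = {}, []
    for name, sql in models.items():
        ups = [prints[d] for d in deps.get(name, [])]
        prints[name] = fingerprint(sql, ups)
        if last_prints.get(name) != prints[name]:
            to_run.append(name)
    return to_run, prints

models = {"stg_orders": "SELECT * FROM raw.orders",
          "revenue": "SELECT SUM(amount) FROM stg_orders"}
deps = {"revenue": ["stg_orders"]}
_, first = plan(models, deps, {})
# Nothing changed: the second run skips everything.
print(plan(models, deps, first)[0])   # []
# Edit the staging model: it and its downstream both rebuild.
models["stg_orders"] += " WHERE amount > 0"
print(plan(models, deps, first)[0])   # ['stg_orders', 'revenue']
```

Skipping the no-op runs is where the double-digit infrastructure savings mentioned above would come from.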
By the way, the SDF guys just deserve a medal of honor for actually doing this work. The idea that you can interpret specific SQL dialects and run local emulation of each of these engines, it's this kind of extremely detailed systems work that is hard to do. Yeah, I kind of didn't believe it at first. I asked them how many
automated tests they had to write in order to guarantee that statement. And the answer is that on top of the SDF database emulation stuff, there are single-digit millions of automated tests that run.
That's incredible. That is crazy. It all has to be written in Rust because there's a very serious build system required there. What are some of the things that you're most excited about that haven't been done yet, borrowing from the practices of software engineering that could be applied to data and data engineering? Well, I just gave you two of them: local development environments and compilers. Where we go from there, I think,
is healthy, reusable ecosystems. When you build a website, you don't start by writing HTML and CSS.
You typically would use React. And then on top of React, there's a ton of components, and almost never are you going to build any of these basic components yourself. Maybe you'll modify the CSS to make it look like your brand or something like that. And then everything breaks. So you go, oh shoot, better change my CSS back. Right. Yeah. The point of good tooling
is to multiply the impact of every individual professional. And that has always been my goal. I started my career as a data analyst. Data analysts, especially back in 2003, did not have great career paths and didn't make a ton of money. The better tooling you can give somebody, the more business value they create and the more you can afford to pay them. And so I think that if we can create
really highly functional package ecosystems, we can stop the process of people reinventing the wheel over and over again. Yeah, 100%. And also thinking in the context of it not just being humans analyzing and utilizing data: there will be more and more AI agents coming, too. And the more you can standardize, the better your agents will be able to interface with your data.
For sure. And reusing the components, reusing the libraries, being able to guarantee more accuracy through having these verified components as well. I'd love to hear a bit more of your hot takes on the recent news. dbt has done a couple of acquisitions recently; you mentioned SDF. At the Databricks Summit, people were talking about Lakebase, from the recent acquisition of Neon, and Snowflake acquired Crunchy Data. What's happening with these, I would say,
analytical companies going into more operational or transactional data workloads? And also, how do you generally think about the tooling stack being more compressed now compared to a few years ago? Compressed, you mean like consolidating? Yes. Yeah, yeah, yeah. It's like the C word these days, consolidating. Yeah. One of the most boring things to do as a data engineer
is to create pipelines that replicate data using CDC from your OLTP to your OLAP data stores.
These database technologies for operational workloads and analytical workloads optimize for different things, and so I don't believe they're ever going to be the same. You always have both of them, and you always need to get data back and forth between them. The idea that you would have the same vendor be able to provide both seems like
obviously a good idea. Now, I know that we're recording this on the afternoon where Ali and Reynold went deep into Lakebase. That happened this morning, and it was super interesting to hear them talk about it. I think it's based on a lot of good thinking, but
they're not the first ones to do this, the let's-give-people-both-access-modes, OLTP and OLAP, thing. I think it will help a bunch of Databricks and Snowflake users that their platforms now support both. And what do you think is really going on here? And maybe just for our listeners, we can do the quick explainer, which is: OLTP means kind of one row at a time. So if you're checking out on Amazon, you insert into an OLTP, or transactional, database.
OLAP means you're going kind of one column at a time. So if you want to summarize across all the rows, all the transactions ever done, it's more analytical. What do you think is really going on here? I just find it so interesting. It was an OLTP world for decades, right? It was Oracle and SAP, and even MySQL and Postgres. When you said database, this is what people thought of. It's almost like OLAP kind of
became the hot thing, right? Between Snowflake and Databricks. And I know I'm abusing the term OLAP a little bit now, but just sort of analytical workloads in general. But now it's very funny, right? Because these companies are now getting into OLTP. It's like this kind of like market pendulum kind of shifts back and forth. And the technology may not change dramatically, but the people kind of running, kind of like owning the customer and sort of owning the consolidation point may change. You know, so I'm just so curious what you think is kind of going on, like why that happens. Yeah.
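To ground the explainer with something runnable (sqlite stands in for both sides here; it's row-oriented, so this only illustrates the two access patterns, not columnar storage itself):

```python
import sqlite3

# OLTP vs. OLAP in miniature: single-row inserts at checkout time,
# then an aggregate over every transaction ever done.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")

# OLTP: one row at a time, as each checkout happens.
for i, amount in enumerate([19.99, 5.00, 42.50], start=1):
    conn.execute("INSERT INTO orders VALUES (?, ?)", (i, amount))

# OLAP: summarize across all the rows.
total, n = conn.execute("SELECT SUM(amount), COUNT(*) FROM orders").fetchone()
print(f"{n} orders totaling {total:.2f}")
```

Real OLAP engines store each column contiguously, which is what makes the scan-and-summarize pattern cheap at scale; the workload shapes are what differ.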
We have needed databases to process transactions as long as we've had software. And you could certainly get people who know much more than me about the early days of that ecosystem, but you'd have to trace it back to whatever, like the mainframe. Yeah, it's like the airline bookings systems. Yeah, right. You could probably draw some exponential curve of the number of
software applications out there in the world. And almost every software application needs some way to store state. It needs a database. And so the growth of OLTP has been, I think, pretty consistent over time. And for a long time, if you were going to do analytics, you reused whatever system you were using for your transaction processing system. I mean, even I started my career like...
writing queries on top of Oracle's OLTP database and MySQL and stuff like this. And it was bad, but as long as your data wasn't huge...
it wasn't a giant problem. So why did OLAP start to become a bigger thing? It's just the rise of the internet. The rise of the internet led to clickstream data, led to advertising data, and the data volumes went up. And so you developed more use cases for which you needed the capability to process larger sets of data. I still think, and you folks probably have better market research on this than I do, but my guess is that the OLTP world, from a pure dollars perspective, is still significantly larger. But it's also a little more stable. We've been doing this for a long time, and the growth rate's probably pretty consistent. And so the novelty is in analytical databases. And that's why you see companies like Databricks and Snowflake come from nothing: because I think the folks who had done OLTP databases for a long time didn't anticipate just how big an opportunity there was here. Oh, that's interesting. So it was a little bit overlooked by the OLTP guys. I think so. And now they're kind of backwards integrating. And on the point of what is driving storage and compute workloads, my speculation on these acquisitions is also about what type of workloads
these players want to see on top of their platforms. All the OLTP databases at this point have added vector search capabilities, and I think that's a majority of the workload when you're thinking about AI; that's driving a ton of usage on top. It's people who are trying to leverage the data in the database to build applications.
OLAP has a role to play in that, but it's still not as direct as with these OLTP databases. There's going to be a lot of synergy between the two: leverage one for predictive, maybe more batch workloads, and the other for more of these forward-looking use cases. Thanks for listening to the end. If you enjoyed what you heard, please do rate the podcast on Apple and share it among your friends and colleagues. And stay tuned for even more talk about AI and data next week.