
Data Management for Enterprise LLMs

2025/2/7

AI + a16z

People
Derek Harris
George Fraser
Guido Appenzeller
Topics
George Fraser: The core of data preparation is gathering context from the business and understanding what the data really means. I've found that data prep is not just a technical problem; more often it requires going deep into the business, talking to the relevant people, and understanding the logic and rules behind the data. For example, two fields in Salesforce will sometimes both be populated, and you need to understand the business rule behind that. I think a future solution may require an LLM agent that can proactively ask questions, clarify what the data means, and ultimately simplify the data view. The essence of data prep is creating a simplified view of the world that obscures the company-specific, hard-to-understand idiosyncrasies in the original dataset. This is not just a data problem; often you have to drive organizational change to solve it at the root. Guido Appenzeller: I think a large enterprise may have many different definitions of the word "revenue," and AI does not yet understand these semantics, so humans still need to be in the loop. For example, different internal departments, such as sales, finance, and tax, may define and calculate revenue differently. AI cannot yet grasp these subtle semantic differences, so data preparation and analysis still require human participation and judgment to ensure the data is accurate and consistent.


Chapters
This chapter explores how generative AI, particularly LLMs, impacts enterprise data management. It highlights the increasing importance of handling unstructured text data and the potential for LLMs to improve enterprise search. The discussion emphasizes the importance of using existing data infrastructure for AI projects, rather than creating entirely new stacks.
  • Generative AI enables processing of unstructured text data.
  • LLMs enhance enterprise search capabilities.
  • Reusing existing data infrastructure is recommended for AI projects.

Transcript


If you look at the work people actually do in data prep, it's mostly going out and gathering context from the business. You have to walk around and talk to people and find out, hey, what does this field mean? There are two fields in Salesforce and sometimes one is populated and sometimes the other. Why is that? I suspect a solution to that

would actually entail an LLM agent that goes around asking people questions in order to clarify things and simplify the data that they're looking at. That's fundamentally what you're doing when you're doing data prep is you're trying to create a simplified view of the world that obscures these idiosyncrasies that are in the original data set that are always highly company specific and that are not at all self-explanatory. You can't just look at the data and figure it out. It's not actually a data prep problem. You have to go change the organization.

Hi, and thanks for listening to the A16Z AI podcast. I'm Derek Harris, and I'm joined this week by Fivetran founder and CEO George Fraser, as well as A16Z partner Guido Appenzeller, for a discussion about data architecture and data management in the age of LLMs. If you're an enterprise organization thinking about how to integrate language models into your existing environment, the good news is that in George's view, you probably don't need to change much.


Whereas good old-fashioned dashboards might get a bad name in some circles, they do theoretically make clear what matters to the business. But 10,000 employees prompting language models can construct 10,000 different cases to support their own ideas. All that, plus everything from the origin of SQL to the data engineering skills of the future, after these disclosures.

As a reminder, please note that the content here is for informational purposes only, should not be taken as legal, business, tax, or investment advice, or be used to evaluate any investment or security, and is not directed at any investors or potential investors in any A16Z fund. For more details, please see a16z.com/disclosures.

So to start, George, can you just explain, like from the 10,000 foot view, how you've seen generative AI change things in your world over the past couple of years in terms of what customers want to do, the types of projects they're pursuing? Just overall, if you went from 2021 to like 2022, 2023, and now to 2025, how has that looked? At Fivetran, we move data

on behalf of 700,000 or so customers, depending on where you set the threshold for customer. And, you know, the problem we're solving for people has been the same for the last 12 years: getting all your data in one place. People do a lot of things with that data. Probably the single most common thing they do with that data is sales reporting. AI workloads are a new thing people do with the data. And the really exciting thing is that we can finally do something with text data. Fivetran has...

been delivering lots of text data since the very first connector that I wrote in 2015, for Salesforce. We've been syncing notes since then. In the past, there's not a lot you could do with unstructured text. Now, unstructured text is machine readable, and that's a really profound evolution.

But one way it's a profound evolution is for businesses trying to make use of the data they have: they can actually do something with unstructured text data. For example, at Fivetran, we have an internal knowledge base search bot. Because as a Fivetran sales engineer or customer service rep, you have to be familiar with this huge array of systems that we connect to. Because we're data movers, we have all these sources and destinations we talk to.

There's more detail than any human being can actually be familiar with. And so we have this tool that indexes all of the internal documentation and past support tickets and Slack conversations and things of that nature. And it's like the big brain that you can ask any question of, like: how the hell do I configure, you know, an Oracle database, this version, this whatever, in order to connect Fivetran to it? And it's a super useful tool. It's powered by the exact same data warehouse, the exact same tables, right?

as everything else, as all the rest of our analytics stack. Correct me if I'm wrong, but unstructured data was kind of like the reason for being for part of the big data movement back in the day, right? I mean, you know, when we looked at systems like Hadoop and some of those, right? I mean, I think unstructured data was a big part of that. Honestly, the big data movement was like a collective insanity nightmare that people woke up from.

It was such a crazy thing. I mean, we could spend the whole hour talking about this, but it all started with like Google Envy. You know, Google created MapReduce, which was a bad system. Like they don't use it today. All of the database management people in academia, in industry looked at this and they're like, we've had better systems than this since the 1980s. What on earth are they doing over there at Google? And then everyone copied it.

because they thought Google's doing it and it must be a good idea. And then I think people were looking for a reason to use it; it was a solution looking for a problem. And so they start trying to find unstructured data to go store in this system. And one of the most peculiar things we've seen in a lot of the takeouts that we've done, because we've replaced a lot of systems like this, is people will do stuff like: they will call Salesforce, they'll take the API response, and they'll store it as like a text blob

in a file. And we look at that and we're just like, what are you doing? You're trying to find a nail for your hammer. It's a JSON response with a known structure: parse it and turn it into a table, which is what it represents. There was a lot of silliness in the whole big data movement, people looking for a reason to use this tool. But like I said, don't get me started because I'll go all day.
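George's point, that a JSON API response with a known structure belongs in a table rather than a text blob, can be sketched in a few lines of Python. The response shape and field names below are hypothetical, not any particular API's format:

```python
import json

def to_rows(api_response: str) -> list[dict]:
    """Parse a JSON API response with a known structure into flat rows
    ready to load as a table, instead of storing the raw text blob.
    The response shape and field names are hypothetical."""
    payload = json.loads(api_response)
    return [
        {
            "id": record["Id"],
            "name": record["Name"],
            "amount": record.get("Amount"),  # nullable column
        }
        for record in payload["records"]
    ]

response = '{"records": [{"Id": "001", "Name": "Acme", "Amount": 500}]}'
print(to_rows(response))  # [{'id': '001', 'name': 'Acme', 'amount': 500}]
```

Each dict is one table row with typed columns, which is exactly the structure the raw blob was hiding.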

I guess what I was getting at, though, is the answer like, well, maybe just feed it to an LLM and use that data in that sense versus trying to build a movement

around that idea. Well, so now we can take the real unstructured data. We do have things like commentary in messages, documentation, just internal Google Docs presentations. We can actually do something with them and interpret them. We can synthesize, we can understand them. We can say things like: this document has a semantic meaning, it is about subjects similar to this other document. And

And we can synthesize them and summarize them, which is amazing. Whereas before, we really could do very little with this information. And there's a lot of good information inside these documents. And, you know, this is all in its infancy. It can all change tomorrow. We're still at a very early stage with all this stuff. But from like a data management perspective, what this means is it is actually possible to do something with primarily text data that

we've had the ability to aggregate, but we just haven't been able to do much with. You mentioned sales reporting is the number one thing that people use their data for, at least that your customers are using it for. I mean, has that changed with the advent of LLMs, let's say, or is that just amplified, right? Is it like people just want to do more sales reporting and different types of sales reporting? Or are people looking at this as an opportunity to do something different

with their data? Because again, like you mentioned, you actually have, can aggregate and utilize a lot of this text data. So there is this funny thing in the world of data management and analytics, which is there is this like conspiracy of silence about what people actually do with the data. So like,

95% of what people do has been the same for decades. There's people who have spent their whole careers basically building the same reports over and over in different companies. And there's nothing wrong with that. These things are super useful. Just because they're not different doesn't mean that they aren't useful. And I expect that

the dominant workload in, let's call them enterprise data warehouses, which is a term that people are always trying to run away from. But I think it's a great term. It's the place where you have a replica of all the data of everything happening in your company. The primary workload is going to continue to be reporting. People want to know what's going on in their company. Are we going to make our number? Inventory is a huge deal. Managing inventory has been and will continue to be a

a huge workload. AI is going to be a new workload for these systems. And it's really exciting and it's new. So a lot of people are talking about it and it's going to be really valuable. And we're in the early days, but I don't think it will represent the dominant workload of these systems ever.

It's another participant in the system, and there's a lot of existing participants, and the existing participants are going to continue to be very valuable. It seems like one new area that we're seeing in AI is sort of a reshaped enterprise search, where I can basically now use AI to do much more complex queries. I can use it to aggregate data, to summarize data over repositories with different modalities, with unstructured data.

What does that mean for your business? I mean, that must create huge, huge challenges if we suddenly search, you know, things that are stored in lots of different silos with lots of different access mechanisms. That is, I think, the most common thing that's happening today. We're seeing a lot of that and we're doing that ourselves. The example I gave earlier of that tool that's been in production at Fivetran for a year is exactly that. It's enterprise search over our internal network.

And people ask questions. A lot of it has to do with Fivetran. We do data movement, we connect to sources, and the destination is usually the easy part. The destinations are very robust and accepting of data. The sources are the finicky pieces. So a lot of it is of the form: how do I get connected to this source? How do I configure this database so that I can read the change logs, et cetera? And it's super useful. Interestingly, when we set that up, most of the data that we needed

was already present in our existing data warehouse. And that doesn't mean these challenges are easy. It's very difficult, let me tell you, correctly replicating data from all over your enterprise into a single data store. There's just an absolute nightmare of incidental complexity. But it's the same challenge, whether you're replicating numbers that represent transactions in your payment processor, or you're replicating text that represents

replies to customer questions in your support system: the fundamental challenge of how do I basically ask the source what has changed since the last time I checked in is fundamentally the same. So from a Fivetran perspective, it's basically just that the columns have different types. And when we built this system, we built it on top of the exact same tables in the exact same database management system that we were already using to power

all of the rest of our reporting. And I highly recommend that to people. I think one of the mistakes people make is they say, I'm going to do an AI project. I need a whole new stack. You don't. From the data management layer, if it is possible, you want to use the same system you're already using to power all the rest of your enterprise data workloads. Because

Even though the outcome is very different, the internal challenges of getting all your data in one place are very much the same. Now, there are some things that are different around dealing with things like PDFs and images. There are definitely some new challenges there. But a lot of the data you want actually is already available in a plain text form somewhere. You just need to connect to the right system. This might be a dumb question, or naive at the very least, but where does an LLM fit into the enterprise data ecosystem?

architecture as it were, right? And if you're talking about traditional data stack, like how would I envision where the LLM fits into that and how that process looks, right? If we had an ETL, then ELT, like is there some sort of new, like feed it into an LLM portion of that workflow? I'm curious how you think about that conceptually.

Well, the word ETL gets used to mean multiple things. Sometimes it means getting the data out of the source system and replicating it, which is what we do. And then sometimes it means transforming the data into a more ready-to-use format. LLMs are really even downstream of that. So you need to get all the data in one place. You need to usually transform it into a format that's more convenient for whatever you want to do. And then

Your LLM is going to come into the picture at that point. So it's going to consume probably a bunch of text data that is sitting in a column of this highly transformed version of your enterprise data set. And it's going to do something like, you know, RAG, which, I'm not at all suggesting this is the end of history, but this is probably the most common thing people are doing today.

The LLM is going to index this data and then it's going to read the data again as part of some kind of prompt at retrieval time. So it sort of has the same position in the stack as like a BI tool, but it's not really a user interface, but it's at that stage.
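The retrieve-then-prompt pattern George describes can be sketched with a toy index. Word overlap here is a stand-in for the embedding similarity a real RAG system would use over a vector store; the documents and scoring are illustrative, not any particular product's API:

```python
from collections import Counter

# Toy corpus standing in for indexed internal docs (contents invented).
documents = {
    "oracle_setup": "how to configure an oracle database connector",
    "sales_dash": "monthly sales reporting dashboard definitions",
}

def retrieve(query: str, k: int = 1) -> list[str]:
    """Rank documents by word overlap with the query. A real system
    would use embedding similarity over a vector index instead."""
    q = set(query.lower().split())
    scores = Counter({doc_id: len(q & set(text.split()))
                      for doc_id, text in documents.items()})
    return [doc_id for doc_id, _ in scores.most_common(k)]

def build_prompt(query: str) -> str:
    """Retrieval-augmented prompt: retrieved text is prepended as
    context, then the whole thing is sent to the LLM."""
    context = "\n".join(documents[d] for d in retrieve(query))
    return f"Context:\n{context}\n\nQuestion: {query}"

print(build_prompt("how do I configure oracle"))
```

Note that the index reads from the same document store at both index time and retrieval time, mirroring George's point about the LLM sitting downstream of the transformed data, roughly where a BI tool would.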

What about in terms of the transform stage, right? Have LLMs or AI changed the way we think about transforming data? In the sense that, again, this may be a naive question, but in terms of cleaning it up, putting it into tables, adding structure, otherwise. You recall data scientists saying, you know, it was like 80% data munging and data cleanup and then like 20% the actual part of doing the data science. Does that change the calculus on that at all, in terms of how difficult it is or how much of the process

that takes up? Well, it depends what you mean by that. If you're doing a project and your goal is for the data to be consumed by an LLM as part of some kind of knowledge-based search project or something like that, then you're going to transform the data differently in order to serve that goal. Although many of the challenges are the same, the biggest challenge as ever is permissions. Who has the right to read what?

And listen, the tools we have for managing relational data are great for figuring out permissions. Permissions are very relational. And that's a great reason to use the data platform you already have to power your LLM-based workloads. Because a lot of the work you've already done to get data about who works at your company and...

what each of their job roles is and what they might have access to, what they may be supposed to have access to is going to be exactly the same in your LLM workflow as it is in your revenue dashboards or highly overlapping, right? A lot of the same work is going to need to be done there. Now, if you're asking, can an LLM do the work of a data engineer in doing data prep?

As far as I can tell, not yet. And I think the primary challenge of doing that is if you look at the work people actually do in data prep,

It's mostly going out and gathering context from the business. You have to walk around and talk to people and find out, hey, what does this field mean? There are two fields in Salesforce and sometimes one is populated and sometimes the other. Why is that? What are people doing and why? Who knows?

We may need, you know, human-like robots who can walk around the office in order to solve this problem. That maybe is a little bit of an exaggeration, but I suspect a solution to that would actually entail like an LLM agent that goes around asking people questions in order to clarify things and simplify the data that they're looking at. That's fundamentally what you're doing when you're doing data prep is you're trying to create a simplified view of the world

that obscures these idiosyncrasies that are in the original data set that are always highly company specific and that are not at all self-explanatory. You can't just look at the data and figure it out. You have to go talk to people in order to understand why is this this way. I think this is actually a really good point. I've seen startups that give this example of here you have an AI application. I just ask it to give me a table of the revenue and it comes back with a result.

If you work in a large enterprise, I want to say there's probably a dozen or so different definitions, maybe two dozen different definitions of the term revenue, right? Is this for internal sales comp? Is this for a public company reporting? Is this for our tax folks? Is this allocation to the business unit? All these different metrics are different. There's so much semantics involved there that I think AI does not understand yet. We still need a human in the loop.

Yeah, and in order to solve those problems, often you can't, actually. It's not actually a data prep problem. You have to go change the organization. I'll give you a perfect example. You just mentioned definitions of revenue. So last year, and still as of this moment, Fivetran has three definitions of revenue. Okay, which is pretty good, actually, as companies go. We have GAAP revenue.

We have ARR, which takes out a lot of the annoying accounting aspects of GAAP revenue and gives you more of just a straightforward view of how the business is doing. And then we have this construct called model ARR, which takes out certain non-recurring signals in ARR

to make it even cleaner, so that you can get an even more clear view, one month at a time, of how the business is doing. The most important difference between model ARR and ARR is that it takes into consideration month length. Fivetran is consumption-based pricing. So if the month is 10% longer, as January is compared to February, then it's going to have about 10% more revenue.

We want to get from three definitions down to two definitions. And in order to do that, we had to change our sales compensation rules, because the salespeople were compensated based on ARR, not on model ARR. We wanted those features of model ARR, but we wanted just two definitions. Perfect example. Not only can an AI not do that, a data engineer can't even do that. I had to do that. I had to go say, guess what? Next year,

your compensation is going to be based on, you know, revenue taking out the effect of month length and a couple other things. But we also had to make that model ARR definition a little simpler so that we could use it for this purpose. There was an element of compromise here, but this is a perfect example. A lot of data problems are really org problems

that have to be solved by leaders. The data is complicated because someone somewhere is not making a clear choice about, do I want to do it this way or do I want to do it that way? I think what I might be getting at too is, is there a point where you see companies start to institute some sort of hygiene or infrastructure? I was thinking about,

if self-driving cars are going to become a thing, right? Like, you know, become ubiquitous. You have to put the infrastructure in place, for electric cars or self-driving cars, for that matter. You have to build the infrastructure to kind of reorient how you do stuff in order for them to actually function on roads and in areas, at a significant scale. Is AI now at the point where maybe we actually do have to get our data house in order?

If I'm a large enterprise, maybe I start formatting stuff with the idea that this is now the end destination of this data. Yeah, you have to adapt the business to AI. Totally. I mean, that phenomenon comes up so often. It's just everywhere. You know, there's this new technology. E-commerce is like this. One of the things that so many companies had to do in order to adapt themselves to e-commerce was they had to have

fewer SKUs, because of the need to have them in stock at multiple warehouses at all times. Right. And so they have to make these really hard choices. Like, listen, do you want to have a website where people can actually buy your product and have it make sense? Or do you want to continue to have all of these different, highly unique SKUs? You can't have it both ways. And in many cases, the businesses that were willing to make those trade-offs thrived

and grew, and the businesses that weren't didn't. So yeah, it's hard to predict exactly how, but I'm sure we will have many instances where in order to take advantage of AI, you have to actually change how your business works to make it AI friendly. And people have real trouble accepting that, you know, they've been doing something a certain way for all these years. And

It's like, well, you weren't wrong before, but now you're wrong. Now you need to do it differently or you're not going to be able to really leverage this new technology. So in that regard, do you think of AI and LLMs maybe in particular as like a data technology or as like a more of a business level technology, if that makes sense? Because e-commerce is like a shift, but I'm wondering if there was like a different shift in the data landscape that also mandated these types of changes or if this is kind of one of the first ones in a while that... Well, I mean...

Language models are like this cross-cutting change that's going to affect every element of life in the world.

I think a lot of the most important applications will be in consumer-type applications. And I think a lot of them in business will be people just using products like ChatGPT in their business. So they're not really engaged in the production of AI. They're just consumers of tools that are made by others, right? But then there will also be this element of people using AI on their own data, and

that's very much the part that Fivetran is involved in: in the same way as previously we delivered data to serve analytics, we are also now delivering data to serve AI. This is maybe a little off topic, but you and I spoke, a couple of years ago now, when you were working on a piece about why SQL needs software libraries, I think was

the title. So I'm curious, because that was one of the early, I think, business holy grails maybe for LLMs: we're basically going to do away with SQL, right? We're going to be able to do natural language to SQL queries, that sort of thing. I mean, maybe this gets back to the data transformation question, but is there a world that you envision where the data

engineer, right, the database admin, whatever, like where any of those roles are shifted or evolved because again, like LLMs have, for lack of a better term, democratized the ability for other people to extract information from these things or to otherwise interact with them. This is, first of all, extremely ironic because SQL was designed to be as close to natural language as possible.

which is actually the source of some of the flaws in SQL. It's very verbose. The syntax is extremely large. There are lots of exceptions, and it's not a very regular programming language. And this is all because people were trying to make it like natural language, but it's not like natural language, despite these efforts. My personal opinion is that

people who think natural language to SQL is going to be a big deal are misunderstanding what the hard part of these problems is. The hard part is curating your data into a data model that

is sensible and simple. Once you have that table or that dimensional schema that has a single concept of revenue, or maybe only two, and that has clear definitions of things like, is this an enterprise company or a commercial company? Once you've clarified all that and simplified it, you can just put it into a menu with each item, and people will very happily go and check boxes and create the reports that they want.

This is what BI tools are. The alternative to that being I can write a piece of text that describes what I want. I do think that will happen and it will be useful, but it's not as revolutionary. It's solving sort of like the easiest part of the problem.

The hard part of the problem is all those layers underneath, all that work that has to get done. And part of why it's hard is it's just incredibly company-specific. And it changes constantly. Rules that synthesize the data as it comes out of the systems

into this highly curated model constantly have to be updated because the way the business works is evolving. I think if I can pile onto that, I think there's a second aspect here. And tell me if you think that's right. What we've seen in programming languages is that it's actually really hard to translate from natural language to a programming language. The problem is not so much

that programming, I mean, why are programming languages difficult? They're not difficult because somebody tried to make them difficult. They're fundamentally difficult because we need a way to

express something very precisely without any ambiguity. And that requires a very formalistic way of writing it, right? And so the hard part is basically dealing with all the edge cases and how should you behave then. And the programming language is precise. Natural language is often not precise. I think we're seeing the same thing a little bit with SQL. Like a simple SQL query, yes, no question, I can describe in English. And if the data representation is good, it'll work.

a complex SQL query with a join, or what happens if I can't find, you know, a value in another column or so, right? If I have a lookup error, and there's so many edge cases, which I think are very hard to holistically describe in natural language. I totally agree with that framework. In programming, they are still quite revolutionary. Tools like Copilot and Cursor

You are absolutely right that they will not help you with the biggest thing, which is how do you state your problem and your goal precisely. But they will help you if you're someone who's at an early stage of learning. They will help make the learning curve of programming much shallower. 100% agree. Yes. And then the other thing they do is libraries.

Even if you know what you want to do, what library should I use to help me accomplish this? What arguments do I give to which functions and what order? These language models have actually really revived my habit of programming. There's a lot of things where I will ship to my team at Fivetran like a little Python notebook that's like, here's what I want you to do. And I'm only able to do this because most of the code is being written by these LLMs.

it's become so much more efficient, for certain things, for me to just write a little bit of code myself to explain what I want, with the help of an LLM, than to try to go back and forth with someone on the analytics team, because these are usually data-motivated things. I'll give you an example. Last year, I had a hypothesis. I thought, you know, I think Fivetran probably...

loses quite a bit of revenue when a customer connects a new database and it takes a long time to copy all the existing data over. Sometimes it can take weeks, and I bet we lose a lot of revenue because people give up waiting. And there are things we could do, we could focus on this and try to do better. But first I wanted to test this hypothesis: you know, is this a big problem? How will we be able to tell if it gets better?

And so I knew exactly what I wanted to do. I wanted to do a Kaplan-Meier model, which is a concept from science; I was a scientist before I was a tech CEO. I'm like, I want to do a Kaplan-Meier model of the survival of customers who are waiting for the sync to complete, because this will give me a very precise answer of how big this problem is. But I had never done this in Python. And

And so I just asked ChatGPT. I said, here's what I want to do. I have customers, this is how it works. I want to do a Kaplan-Meier model. This event represents death. Kaplan-Meier models are for, like, clinical trials, for survival of people. And this event represents right censoring. And just, boom, I mean, it got it on the first try. I wrote a piece of code that, you know, showed that, hey, this was potentially a huge opportunity if we could speed this up. And I said, okay, here's the definition of the problem.

I want us to make some changes to make these things go faster. And then we'll rerun this analysis. And this is exactly how we'll tell whether it worked. So I never would have been able to do that without language models. It would have taken too much time for me to go learn the lifelines API in Python. I was not going to do that. But with the help of language models, it's a very different calculus. And I think stories like this are taking place
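The analysis George describes can even be sketched without the lifelines library (which provides this as `KaplanMeierFitter`). Here is a dependency-free Kaplan-Meier estimate over hypothetical sync wait times; the data and event definitions are invented for illustration:

```python
def kaplan_meier(durations, events):
    """Kaplan-Meier survival curve. `durations` are days a customer
    waited for the initial sync; `events[i]` is True if customer i gave
    up waiting (the 'death' event), False if right-censored (the sync
    finished, or the observation window ended). Returns a list of
    (time, survival probability) pairs at each event time."""
    survival = 1.0
    curve = []
    for t in sorted(set(durations)):
        deaths = sum(1 for d, e in zip(durations, events) if d == t and e)
        at_risk = sum(1 for d in durations if d >= t)  # still waiting at t
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
    return curve

# Hypothetical data: 8 customers, 4 of whom gave up waiting.
durations = [5, 5, 8, 8, 10, 10, 12, 12]
events = [True, True, False, False, True, True, False, False]
print(kaplan_meier(durations, events))  # [(5, 0.75), (10, 0.375)]
```

The curve directly answers George's question: what fraction of customers is still waiting (rather than churned) after N days, which is the baseline you would re-measure after speeding up the sync.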

all over every company. And so they're revolutionary as like a productivity enhancement in programming. They're not going to solve the problem of precisely stating what you want to do. But if you can solve that, they can solve a whole bunch of other problems. So you're able to solve this problem without going to a data analyst or without having to go somewhere else. There's always this talk about like, we want to make data driven decisions, right? And we want everyone in the company to be making data driven decisions and

I think there's been kind of this goal to broaden access to these tools or these skill sets that may or may not have actually taken place. In that sense, is there a place where you're just seeing more and more people inside of companies now and inside of large enterprises with the ability, again, to like query data or to run these reports or to execute on ideas they might have had that otherwise, again, would have taken however long going through traditional channels to bring in engineering or bring in data analysts or bring in other teams?

Yes, I think the productivity is a lot higher. If you have a precise question, formulating it into a SQL query or a Python script is much, much easier than it used to be. And that means you're going to have a lot more people doing this. That is a double-edged sword. You know, people use data in two ways. They use it to seek the truth.

And they use it to justify what they wanted to do anyway. And mostly they do the second. So more consumers of data can be a force for good and for evil. Yeah.

That was going to be my next question: what are the downsides, or where do the diminishing returns start, as you broaden access and more people are doing this? Yeah, I mean, the challenge is motivated reasoning. The more you dig into data, the more danger of motivated reasoning you run into. And we all run into it. If you think you're not subject to it, you're the worst. This is one of the reasons why we build dashboards. Dashboards don't exist simply because we can't have everyone writing their own SQL queries. Even if we can have everyone writing their own SQL queries,

We still want to have lots of dashboards because part of the purpose of dashboards is to say, this is how I want you to look at the data. I want you to look at it in this way and no other. Otherwise, everyone will come with their own version of reality.

So if I think this through, this next generation of reasoning models, the reasoning LLMs that we're seeing, isn't that particularly dangerous there? I can tell them: find a good storyline and reasoning to prove my point here, independently of what the data actually says. That's really funny to think about. Yes. You know what is funny is I actually did that recently.

So one of the things that tends to happen, you know, if you have a good month or a bad month, is people always love to speculate about seasonality. They're like, well, it's August, so people are on vacation. Or, well, it's February and everyone is past the vacation bump and they're ramping up for the new year. And, like, 99% of the time, this stuff is bullshit. These trends do not recur. At Fivetran,

The only months that we have true repeatable seasonality are February, because it has fewer days, and December, because Christmas really is bad. All other holidays, we can't see them. They're not consistently low in the data over many years, right? But people love to speculate about seasonality. It's like their default go-to. If the month is a little high or a little low, they're like, must be some kind of seasonality. Right?

And so I wrote this memo to the company entitled, "Fivetran is not a farm." And one of the things I did is I had ChatGPT come up with a reason why every month would be better or worse because of seasonality. It made a little table, and it came up with an explanation of why each month was either good or bad based on seasonality. And they were all highly plausible. So you absolutely can use LLMs as, like, an amazing motivated-reasoning engine.

But then you can turn that around and use it for good, to remind yourself: hey, look, I could come up with a story like this in either direction, for all kinds of scenarios. There's always a way to come up with an argument for this thing or that thing. And it's a good reminder of how difficult it is to actually

figure out the truth and understand how the world works and overcome our own tendency towards motivated reasoning. If we shift back into, maybe, more of the infrastructure and architecture piece of this and take a much broader view here: we have the last generation of, let's say, data industry winners, for lack of a better term, right? And Fivetran is probably among them, Databricks, Snowflake, these companies. Do you see a different stack of players or companies emerging, again, at the LLM layer,

So basically, the data-and-LLM nexus, right? I mean, is there room for a new type of company? I'm just curious how these new types of companies, how these new founders, will look different than the companies and the founders who came in this previous wave. So there are people who are trying to create, you know, Fivetran for AI. And my take as of this moment is that the first few layers of the data stack look very much the same

if the workload is AI: you still have to solve basically the same set of problems. Now, maybe I'm engaged in motivated reasoning about why my company is going to succeed, but I've tried to look very hard at this with an open mind. There are some narrow exceptions around moving files, basically images and things like that, and we are trying to extend our product to cover those cases. But I do think that those first few stages look basically the same. I think that

Most companies should use the same data platform for AI as the repository of data. You should have one enterprise data repository that feeds both your traditional analytics workloads and your AI workloads. It is just fine if the first stage of your RAG pipeline is select star from...

and you just read all that data out of that data warehouse. It is not the most efficient way to do it. You could build a highly optimized data platform that could do that step more efficiently. But guess what? The subsequent steps of that

workload are, like, a thousand times more expensive than that query. So that is not the place to focus your optimization. And then what happens after that, that is just, like, the Wild West. It is all open to discovery right now. How are enterprises going to use AI to work

with their own data. The state of the art right now is you build a RAG chatbot that you use to answer questions about your internal knowledge base. A bunch of people are doing that, including Fivetran. It's super useful. I don't think it's the end of history. I think there's going to be way more stuff that happens, including things that none of us anticipate right now. And it's super cool and it's really exciting. So the answer to that is basically the same companies? Well, I'm making a distinction between the first stages and the subsequent stages, right? So the stages from system of record

to a central data store that has all the data about everything in your company, I think those are mostly going to be the same players. We're seeing data lakes as a major emerging trend, but it really has nothing to do with AI per se. It's just the next logical step in the separation of compute from storage, and it has benefits for traditional analytics as well. Everything after that is, like, totally open.
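As a rough sketch of the "select star" first stage George mentions, here is what reading the warehouse and chunking the rows for a RAG pipeline can look like. The table name `kb_articles` and the fixed-size chunker are made up for illustration, an in-memory SQLite database stands in for a real warehouse like Snowflake or BigQuery, and the expensive downstream step (embedding) is only noted in a comment.

```python
import sqlite3

def fetch_documents(conn, table):
    """Stage one of the pipeline: SELECT * and pull every row out as dicts."""
    cur = conn.execute(f"SELECT * FROM {table}")
    cols = [d[0] for d in cur.description]
    return [dict(zip(cols, row)) for row in cur.fetchall()]

def chunk(text, size=200):
    """Naive fixed-size chunking; the embedding step downstream dwarfs this cost."""
    return [text[i:i + size] for i in range(0, len(text), size)]

# An in-memory SQLite database standing in for the enterprise data warehouse.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE kb_articles (id INTEGER, body TEXT)")
conn.execute("INSERT INTO kb_articles VALUES (1, 'How to rotate your API keys safely.')")
conn.execute("INSERT INTO kb_articles VALUES (2, 'Onboarding checklist for new hires.')")

docs = fetch_documents(conn, "kb_articles")
chunks = [c for doc in docs for c in chunk(doc["body"])]
# `chunks` would then go to an embedding model and into a vector index.
```

The point of the sketch is that this first stage is deliberately unoptimized: a plain full-table read is fine, because everything after it costs far more.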

And there are a lot of players who straddle this, right? Companies like Databricks, like Snowflake, like Google Cloud with BigQuery, they build systems that do the storage piece, and then they also do the working-with-that-data piece. And they're all succeeding to greater and lesser degrees on that second part. But I also think that second part is just totally wide open. Somebody else can come in and make a new company that is going to sit on top of your company's

data lake, data warehouse, whatever you want to call it, and do amazing things with the data in it. And that could be a company that doesn't even exist right now, because that whole side of the workload is just totally different. All right, that makes sense. Okay, curious, for both of you, kind of wrapping up here: what skills, five years from now, if I'm a data architect, a data engineer, a data analyst, what skills are more important than ever? And what skills are, you know, maybe irrelevant,

or at the very least antiquated? So I'll take a stab at this. Let's start with coding. I think coding still matters. Not so much the writing, because you probably will write less code and have the AI write more code for you, but you still need to understand what the AI does. And if the AI gets into a rabbit hole, you want to be able to dig it out of the rabbit hole.

I think the higher-level, sort of architectural, conceptual understanding is more important, because there's just less execution. You can focus more on the strategy and architecture part. I think understanding how to precisely specify what you want from a model will become very important. So every engineer needs to be more of a product manager than they are today, because you now have this,

you know, smart intern or something like that that can help you with stuff, but only if you give them a very clear explanation of what you want. Yeah, I totally agree. Being able to precisely state what you want, being able to think precisely: more valuable than ever, more leveraged than ever. Mastering the syntax, the family of libraries that are out there: much less valuable than it used to be. And a lot of these career paths are

now much more accessible than they were before. If you have that first piece, you can get a job that requires the second piece and you can use AI to bridge your skills and make that steep learning curve much shallower. And that is all for this episode. We hope you learned something. We hope you enjoyed the episode. And if you did either of those things, we hope you rate and review the podcast on your platform of choice. Until next week, take care.