DuckDB is an in-process analytical database that simplifies data manipulation and eliminates the need to move data around, making it ideal for data science use cases. Its single-node architecture reduces complexity and allows for faster development and scalability.
DuckDB has a dedicated team working on its CSV parser, which is considered the best in the world. It can handle messy CSV files with errors, weird separators, or null characters, making it highly user-friendly and efficient.
Scaling up, or using larger single-node machines, simplifies database architecture and reduces complexity compared to distributed systems. Modern cloud hardware offers large machines with high capacity, making scaling up a practical and efficient approach.
Many companies don't need to handle massive amounts of data. Small data focuses on simplifying systems, reducing complexity, and improving user experience by leveraging modern hardware and local processing capabilities.
DuckDB can handle vector search and similarity queries, which are closer to analytical workloads than transactional ones. It can also work with local inference engines, making it suitable for AI-enabled applications that require data aggregation and visualization.
GPT-4 can automatically fix SQL errors by analyzing the error line and suggesting corrections. This allows users to stay in the flow of writing queries without needing to consult documentation, enhancing productivity.
Text-to-SQL struggles with complex data models, specific organizational nuances, and data quality issues. It works better in controlled environments with clean schemas and well-defined semantic layers.
DuckDB reduces the toil of data preparation by simplifying data manipulation and offering a robust CSV parser. It allows data scientists to focus more on analysis rather than setup and data movement.
Scaling out involves complex distributed transactions, data movement between nodes, and handling failures. These challenges increase the complexity and time required to develop and maintain distributed systems.
DuckDB's single-node architecture allows for faster development cycles and easier scaling. It focuses on what users care about, such as query speed and usability, rather than the internal mechanics of the database.
One of the first times that I really realized that, like, hey, LLMs are actually changing the way people are using data was when we had not launched too long ago and you were using MotherDuck, and you mentioned that you were kind of cutting and pasting between ChatGPT and our query UI. It's an admission of, like, I'm very raw on my SQL skills. Yeah.
No, I think it's the kind of thing where everybody forgets syntax for various SQL calls. And it's just like in coding. There are some people that memorize the whole code base, so they don't need autocomplete, they don't need any copilot, they don't need an IDE; they can just type in Notepad. But for kind of the rest of us, I think these tools are super useful.
Welcome again to the A16Z AI Podcast. I'm Derek Harris, and joining me this week are A16Z General Partner Jennifer Li and MotherDuck co-founder and CEO Jordan Tigani. If you're not familiar with Jordan and MotherDuck, the short version is that MotherDuck is a commercial database offering built on the very popular DuckDB open source project. Among Jordan's past accomplishments was building BigQuery while at Google, and he's also presently the unofficial spokesperson for the small data movement.
In this episode, we discuss DuckDB and the move away from all big data all the time, as well as where database technologies are slotting into the stack for production AI applications, for things like vector search and more. Jordan also touches on the future of text to SQL and the importance of clean data if you want to stick an LLM on it. Also of note, MotherDuck is hosting the Small Data SF Conference on September 24th in, you guessed it, San Francisco, and we have registration information in the show notes.
With that out of the way, get ready to hear Jordan, Jennifer, and myself talk about data. As a reminder, please note that the content here is for informational purposes only, should not be taken as legal, business, tax, or investment advice, or be used to evaluate any investment or security, and is not directed at any investors or potential investors in any A16Z fund. For more details, please see a16z.com slash disclosures.
I encountered DuckDB, this upstart database, when it started showing up in these benchmarking reports. I was working at SingleStore and we prided ourselves on performance, and all of a sudden we were being compared against this out of the blue database that had been created by a couple of academics in Amsterdam. I started poking at it a little bit and it's like, this seems really interesting. You can do stuff that we wanted to do when I was in Google, we wanted to do when I was at SingleStore.
and things that customers were asking for in terms of how it scaled and could scale down and do things incredibly low latency. Somebody should probably take this and build a SaaS service around it and put it in the cloud. Well, maybe that somebody should be me. I've built two database SaaS services. And as I started thinking about it, I reached out and I talked to the DuckDB co-founders and wanted to see if they were going to do something similar to this.
And they said, no, they said they really want to just focus on kind of building the core database, but they thought it was a good idea. And they said, we'd love to partner with you and sort of with your background, you know, it would be great to work with you. And then I was sort of just off and running. So we raised money pretty quickly, put together a team around the idea, pretty quickly coalesced around sort of what we were building.
And the idea that we kind of had this extraordinary database, we could put this in the cloud, it was going to move really quickly, we could do low latency stuff, it could be a data warehouse, it could also work in a lot of the other ways that DuckDB could work, which was, you know, for, you know, data science use cases, and then sort of in a lot of these sort of nooks and crannies. And the other thing that really kind of struck me about DuckDB was, you know,
the way they focused on what database users actually care about. A lot of times, when I've worked on databases, we'd spend all this time focusing on improving the time from when we, as the database, get a query to the time that we have finished that query. But that's not actually the time that's experienced by users. First of all, there's the time it takes to send the query and get the results back, and there's paging through the results. And that's just the actual query mechanics. Then you step back and there's, you know, I have a problem that I want to solve. I formulate that problem. I have a business problem; I want to understand something about my data. What's the time until I get my answer?
And that was actually something the DuckDB team has done a great job of solving. A good example of that would be the CSV parser in DuckDB. The CSV parser is something that, in BigQuery, was kind of an afterthought. We had a college hire, a new grad, work on it. And they were great, a great engineer, but it was sort of like, okay, we've got CSV parsing done, all right, we'll move on to the next thing. And DuckDB has a full-time person working on this. They've written research papers about it. I would say this is the best CSV parser in the world.
The advantage for the user is, I just point it at a CSV file and it doesn't matter if it's got wonky things in it, errors, weird separators, or weird null characters. There are all sorts of ways that CSV files can be messed up.
But the effect for a user is, hey, it just works. I can just query against it and it works and it's fast and it runs in parallel and I don't have to worry about anything. And to me, that was really captivating, because I feel like it's just an area that the whole database community was really missing the boat on, and that there was kind of an opportunity to make something easier to use and more user focused.
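To make that concrete, here is a minimal sketch of the "just point it at the file" experience, using the duckdb Python package. The file name is hypothetical, and the ignore_errors option (which skips unparseable rows rather than failing the whole query) assumes a reasonably recent DuckDB release.

```python
import duckdb

# Auto-detects the delimiter, quoting, header row, and column types,
# and skips rows it cannot parse instead of aborting the whole query.
rows = duckdb.sql("""
    SELECT *
    FROM read_csv_auto('messy_export.csv', ignore_errors = true)
    LIMIT 10
""").fetchall()
print(rows)
```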
You already jumped into this a little bit, but I would love to step back and give a bit of an overview to the audience who are not familiar with DuckDB. There have been in-memory databases before DuckDB existed, from SQLite to others, and there have been other analytical databases, from Druid to ClickHouse. BigQuery was one of them, and you worked on it. Where do you see DuckDB really shine, and why do you think it has taken the database ecosystem by storm? Is it a long tail of quality-of-life improvements like what you mentioned on the CSV parser, or is there a fundamental design principle that enabled DuckDB to be so widely adopted today? So it is actually, I think, the first in-process analytical database, or at least the first in-process analytical database that I know of. The other well-known in-process database, a transactional database, is SQLite. And SQLite is the
most commonly used database in the world. On my phone, it's probably running in a hundred places; every app has its own SQLite copy. But if you want to run analytical queries, ones that ask questions about a dataset as a whole, like how many users do I have and how does that vary over time, rather than find this record or update this record, an analytical database is much better suited to answering those questions.
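As a toy illustration of that shape of question, here is what the "how do my users vary over time" query might look like with the duckdb Python package; the users.parquet file and its signup_date column are hypothetical.

```python
import duckdb

# DuckDB can query a Parquet file directly, with no loading step required.
duckdb.sql("""
    SELECT date_trunc('month', signup_date) AS month,
           count(*)                          AS new_users
    FROM 'users.parquet'
    GROUP BY month
    ORDER BY month
""").show()
```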
They originally started, I think, going after data science folks because it's in process, you don't have to move data around, which is really nice. The data is already in the process. If you're using, say, Python and you're using Pandas and you have Pandas data frames and you are slicing and dicing data in your Pandas data frames,
Pandas is really not set up for that kind of data manipulation. Or rather, it's set up to be easy for the user, but not easy for the computer, and it's very hard to optimize. Whereas databases have sort of solved that problem: you give them the user intent, which is the query, and then they can optimize it under the covers.
And so DuckDB was able to take on these data science use cases: you can point it directly at your data frame, at the data structures you're already using, get SQL answers back from that, and then store the result. A lot of the reasons that data scientists tend to not like databases is because you have to set them up. How do I get my data in? How do I get my data out? What happens here? What happens there?
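Here is a minimal sketch of that in-process, no-data-movement workflow, assuming the duckdb and pandas Python packages and a toy DataFrame; DuckDB can resolve the table name in the query to a DataFrame variable that already lives in the process.

```python
import duckdb
import pandas as pd

# A data frame you are already working with in your Python session.
orders = pd.DataFrame({
    "region": ["EMEA", "EMEA", "AMER", "APAC"],
    "amount": [120.0, 80.0, 200.0, 50.0],
})

# DuckDB queries the local DataFrame in place; nothing is copied out to a server.
duckdb.sql("""
    SELECT region, sum(amount) AS revenue
    FROM orders
    GROUP BY region
    ORDER BY revenue DESC
""").show()
```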
DuckDB just sort of solved that setup problem. The other thing they did was they just did a really good job: they made it simple and easy to extend. And I think because of that, the velocity at which DuckDB gets better is extraordinary. When we first started working with DuckDB two years ago, it had some gaps, some things it wasn't great at. And then in three months they released the next version, and three months later the next version. And just the rate at which it gets better
is amazing. If you think about it, like when you pick a technology, yes, you need to pick a technology for what works for you today. You're choosing something that is not static. And so often it helps to pick the thing that is moving fastest.
And I think that has also been something that has worked out really well. We kind of had seen that ahead of time, and that's paid off because over two years, you know, DuckDB has just gotten better and better and better. And we have the previews of the next version and the next, and that's, you know, all this other stuff is getting added. And we're looking forward to that. Yeah. You put out a couple...
I would call seismic-shift type pieces into the data space last year. One is called Big Data is Dead. The other one is about scaling up versus scaling out. Give the audience maybe just a brief idea of the thinking behind those posts, and how does that relate to MotherDuck and DuckDB? One of the things
that I noticed as a product manager on BigQuery, and when I was at SingleStore, and just talking to people in the industry, is that most of our customers, first of all, didn't have giant amounts of data. Many of them had sub-gigabyte data,
you know, total data sizes, sub 10 gigabytes. Like it was pretty rare to have huge data. And we had some of them, we had some of the largest like data users in the world. We had Walmart, HSBC, Home Depot, Equifax, like some really, really big customers. Of course they had big data. They had petabytes of data, a couple who were bordering on exabytes of data.
But we looked at the data they actually use, and they only use a tiny fraction of that. And if you think about the way you use data: you take a giant amount of data and you condense it, and condense it, and condense it, and then you run your reports off that condensed amount. So the tools that people use to analyze that data are designed for giant, giant amounts of data. But if you're only looking at this thin slice of data, you really don't need something with that much horsepower. And again,
having built, having worked on these distributed database systems, there's just so much complexity, and virtually everything that you do gets harder. You want to write something to a database? Okay, well, you write it to a log file and you write it to a data file, but you have distributed transactions going on across multiple nodes, and, you know, life just gets so much harder. Or you need to
do a join, and you have to move data from one machine to another so that the keys are in the same place, and you have to deal with one of them failing and restarting. There's just all sorts of messy stuff that has to happen. Whereas in a simpler system, a single-node system, none of that matters. Now, of course, in a single-node system you run into scaling limits, right? It's got to fit on a single machine.
And so that was sort of where the simple joys of scaling up came into play was that like, actually, when we started building these systems 15 years ago, a few gigabytes and a couple processors was a pretty beefy machine. And like now, you know, AWS has tens of terabytes and hundreds of cores in commodity machines that you can just spin up and spin down at random.
There's very few workloads that won't fit on these very large machines. It's just like, if you can build a system that is targeted at that scaling up, your life gets so much easier. As I was saying before, you can move faster, you can get better faster. I think that's one of the things that's enabled DuckDB to get so much better so much faster, is they just have this single node architecture.
And so that's something MotherDuck is really taking advantage of: saying, hey, what if we didn't have to spend all this time building a really complicated distributed system? What if we just build this thing that scales up, and we can auto-scale it, scale it down, scale it down to zero? DuckDB scales down: it can run in 100 megabytes, it can run in your browser. There are all these exciting things you can do, but it can also run on the biggest EC2 instance.
And yes, there are challenges involved when you do that, and things don't always scale linearly, but those are engineering problems, and those are much more solvable: things you solve once, versus having to solve them every time you add a new feature, which is what happens when you have a complex distributed system. Scaling up used to be kind of a dirty word.
Yeah, exactly. It was synonymous with "it's not scalable." Somebody that I worked with in the past said, if you're not building a distributed database, people will laugh at you. That actually was kind of a trigger: yeah, maybe if you're not afraid of being laughed at, you can actually build something pretty amazing. It's such a paradigm shift in thinking about the database world too, because you're right. When people think about scaling databases, you first think about how to do sharding, how to distribute it across
several machines, whereas now even a personal laptop or desktop is super powerful, and you have all these very beefy machines hosted in the cloud. I guess now, looking back, especially tying into your BigQuery experience and journey, do you think
big data was fundamentally the wrong direction at that point in time, or something people were chasing because Google had done it, Meta had done it, the bigger companies who had the luxury of expanding their data platforms had done it, and the rest of the market followed? Or were there legitimate use cases that were actually pragmatic and practical? And where is the balance between the small and big data paradigms?
So the early draft of the Big Data is Dead blog post started out with, "I blame Google."
And then it was basically a rant about how they'd kind of poisoned people's brains. Google had this series of research papers: the MapReduce paper, the GFS paper, the BigTable paper, the Dremel paper. Taking computer science problems and, by splitting them apart and running them on lots of different machines, getting so much better performance, because, hey, big data is coming and you need to be ready for it. I think there was this sort of assumption that
Someday, everyone's data is going to look like Google's. But 15 years later, that turns out not to be the case. I think Google, I remember several years ago, had something like nine products with a billion users.
As a SaaS company, a B2B company, there aren't a billion businesses that you can sell to. If you're selling to enterprises, there are something like 6,000 potential customers. The scale at which you need data is vastly different. And if you assume you're going to need this sort of massive size, you're hampering yourself, because in order to handle the scale you thought you were going to have, you ended up having to change how you did things. You had to split up your workload into a map phase and a reduce phase, and all these sort of awkward things that you had to do with Hadoop. It still kind of poisons people's brains; they think they have to do these things differently. And it turns out that that world never really arrived. There are easier ways to do the job.
That's a very helpful point of view. And if we even bring it to the trade-offs people are making while they're pursuing the big data dream, I can imagine a lot of the engineering effort is spent on maintaining a very heavy, distributed system that's quite hard to debug and maintain. What are people gaining when they invest in, let's say, a single-node, very fast, easy-to-set-up system, beyond the user experience side that you already alluded to? It's just so much snappier and faster. What else are we getting from the small data world?
Yeah, I think there's a simplicity. When we started BigQuery, one of the mantras we had, and I think it came from Jim Gray, the Turing Award winner, was that with big data, you want to move the compute to the data, because the data is so expensive to move. Now, if you accept a world where your data may not be huge, then
it opens up a lot more opportunities for where you're actually doing that computation. More people have fiber to the home, fiber to the workplace; they have 100 gigabits or even faster. So why am I spending all this money on expensive cloud hardware when I have a really fast pipe to the Internet and this incredibly fast laptop? Why don't I just do some of this stuff locally?
And so, you know, I think there's, that's one of the things DuckDB is really good at is helping you kind of, you can pull that data down and you can do all your work locally. George Fraser, the CEO of Fivetran, he did some benchmarking on his laptop. He has like a three-year-old laptop against, you know, a cloud data warehouse that must not be named. And his laptop was faster at running these sort of industry standard benchmarks using DuckDB than running against the cloud.
And so if you can run stuff locally, it opens up a lot of opportunities. It creates some more challenges, like making sure you don't have really high egress fees, moving data multiple times, data management, cache management, making sure the data is fresh.
And those are the kinds of things that MotherDuck is actually doing. We run DuckDB locally and we run DuckDB in the cloud; we call it dual execution. We can basically allow workloads to move down to the client, including in the web browser. Our web UI runs a WebAssembly (Wasm) build of DuckDB right in the browser. And so you can do these 60-frames-per-second, video-game-style, flying-through-your-data visualizations
that are literally impossible if you have to go back and forth to the cloud, because every round trip to the cloud costs you 100 to 200 milliseconds. That puts physical limits on how fast you can do something. But the more you can move down, the more you can make that incredibly low latency. And so this opens up opportunities for doing things in different ways.
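A rough sketch of that pull-the-data-down-and-work-locally pattern follows. This is not MotherDuck's dual execution itself, just the general idea, with a hypothetical URL: cache a remote Parquet file into a local DuckDB database once, then run interactive queries against the local copy.

```python
import duckdb

con = duckdb.connect("local_cache.duckdb")
con.sql("INSTALL httpfs")
con.sql("LOAD httpfs")  # lets DuckDB read files over http(s)

# One-time (or periodically refreshed) pull of the remote data.
con.sql("""
    CREATE OR REPLACE TABLE events AS
    SELECT * FROM read_parquet('https://example.com/data/events.parquet')
""")

# Subsequent interactive queries never leave the machine.
con.sql("""
    SELECT event_type, count(*) AS n
    FROM events
    GROUP BY event_type
    ORDER BY n DESC
""").show()
```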
That's very fascinating. Combined with the Column Explorer, I can imagine how much that has changed the data scientist and data analyst paradigm of working with data, both locally and in a hybrid fashion.
Transitioning into small data and AI, help us understand how does small data play in this world of AI? There are quite a lot of local models or local inference engines out there today. How do you see DuckDB and MotherDuck playing into the AI world?
There's a bunch of interesting AI companies that are saying, hey, look at how expensive it is to run on these fancy GPUs in the cloud that are hard to even get your hands on; you've got a fancy GPU in your local laptop, so why not run things on your laptop? So I think there's a real alignment there. It's the same problem that we're solving, just in slightly different domains, and you put those together and we think it's sort of like peanut butter and chocolate. And so, you know, we're excited about working with some of these folks, and about what becomes possible when you slim down your models. As models get better and better, you need fewer parameters and less storage to keep them around, so they can fit on your laptop and still give you good results. The place where, you know,
DuckDB and MotherDuck come into play is as people are building applications. I think there's just this wonderful wave of application builders, people who want to capitalize on what becomes possible, and sometimes even easy, when you have AI and these large language models, and they still need access to data. We tend to focus on the lookup case, vector lookup to do RAG, but there are also cases where you want to aggregate data and understand global context. And that requires a different type of database. A database like DuckDB can be super useful for building AI-enabled applications where you show data, show charts, et cetera.
A lot of the kind of early use cases for AI were more narrowly focused, but I think as people are trying to make their AI applications actually useful, I think there's going to be more and more cases where you kind of want to tie an analytical database to an LLM and to the outputs of LLMs or to vectors and embeddings.
the vector database space has exploded and everybody now has their own vector support. There's pgvector, and a lot of the transactional databases have added vector support. But if you squint, the shape of compute you need for doing vector similarity search
looks closer to analytics than it does to transactions. When you create your database, the form factor is important: how much memory you need, how much CPU you're going to need, how spiky the workload tends to be, how much caching comes into play. And all of that, for vector search, is closer to an analytical database than a transactional one. So, knowing nothing else about the industry, I think adding vectors to an analytical database is going to be more effective than adding them to a transactional database. And, you know, DuckDB has vector search: it has cosine similarity built in, and it has a vector search extension. And we have customers that are kind of using this for sort of analytics RAG applications.
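As a small sketch of what that analytics-flavored vector search can look like, here is a toy example with the duckdb Python package. It assumes a recent DuckDB release with fixed-size ARRAY types and array_cosine_similarity (DuckDB's vector search extension, vss, can add HNSW indexes on top of this), and the three-dimensional embeddings are stand-ins for real model embeddings.

```python
import duckdb

con = duckdb.connect()
con.sql("""
    CREATE TABLE docs (
        id        INTEGER,
        title     VARCHAR,
        embedding FLOAT[3]   -- fixed-size array; real embeddings would be far larger
    )
""")
con.sql("""
    INSERT INTO docs VALUES
        (1, 'duck facts',  [0.90, 0.10, 0.00]::FLOAT[3]),
        (2, 'goose facts', [0.80, 0.20, 0.10]::FLOAT[3]),
        (3, 'tax filings', [0.00, 0.10, 0.90]::FLOAT[3])
""")

# Rank documents by cosine similarity to a (hypothetical) query embedding,
# then aggregate or join the results like any other analytical query.
con.sql("""
    SELECT title,
           array_cosine_similarity(embedding, [0.85, 0.15, 0.05]::FLOAT[3]) AS score
    FROM docs
    ORDER BY score DESC
    LIMIT 2
""").show()
```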
Are there particular types of applications you think will use DuckDB's vector search, as well as the full-text search, built on top of this powerful analytical engine, that are going to be more fitting than others? Or are we still in the very early days of exploring those use cases? I think you're right. I think it is early. I think just cases where
you know, you want to use AI to help shape what you're showing to users. And you're not just looking at one result, you're looking across results.
I think those will be useful, but it's hard to put my finger on exactly what those applications will be. Yeah. If one thing we have learned, again, still very early days in building AI applications, it's that the models work much, much better when given relevant context, and personalized context as well. I think about my Oura ring and the recommendations it gives me by aggregating the last seven days and the last month of sleep data, and what kind of advice it's going to give me for tonight's or tomorrow night's sleep; same for Strava. A lot of this processing very much resides locally, but against sizable analytical data loads as well. I can definitely imagine more and more of those applications coming up, not just for consumers, but for businesses as well. Yeah, absolutely. That's a great example of how, now that we're seeing larger and larger context windows, you can
use analytical type queries to fill those context windows. How are you seeing the data stack shift in general as we're shaping the paradigm around the small data idea? There's so much momentum and ecosystem built around Hadoop and big data. Do the up and downstream players need to shift their thinking as well, whether it's ETL or visualization or other tooling and processes?
Tell us how you think about that. If you think about the modern data stack as: you have the ingestion tools, you have the query tools, and you have the BI tools, then the ingestion tools and the BI tools were always already a little bit in the small data world, because of the rate at which data arrives. It's not monstrous. It's not huge. Most of the time you're not ingesting gigabytes of data a second, and the stuff you're ingesting is relatively self-contained. I've had discussions with George from Fivetran on this and with Tristan from dbt on this. Their world is not necessarily any different in a smaller-data world. And then on the BI side, the data visualization side, the tools were treating the world as if it was small data anyway. They were pushing a lot of stuff to the query engine. Yeah, and that can be slow. I mean, one of the Looker innovations was relying on cloud query engines that are actually pretty fast, so you get reasonable performance. Whereas Power BI and Tableau kind of relied on actually pulling all the important data into local memory, and they were really these single-node, small-data engines anyway, so they can do things fast. I think the interesting thing will be: can you push more workloads down and still have it be just as fast on the visualization side? Can you incorporate DuckDB and hybrid execution into your actual data visualizations,
into the things you're presenting to your users, to give them an even better, lower-latency experience? I know that Omni, for example, is using DuckDB in their front ends. You know, Hex uses DuckDB as well. So there's a bunch of people building interesting startups that are using DuckDB pretty heavily. It's interesting to see whether there emerge other kinds of small data specialists, but my guess is that the world didn't change for those people with big data nearly as much as it did for the query engines. Another question around how you see, again, this interaction between users and the data itself
changing in the AI world. What's your take on using LLMs to write SQL queries? Do you use them in your day-to-day analysis? And where do you see that going? One of the first times that I really realized that, like, hey, LLMs are actually changing the way people are using data was when, not too long after we launched, you were using MotherDuck and you mentioned that you were cutting and pasting between ChatGPT and our query UI. It's an admission of, like, I'm very raw on my SQL skills. Everybody forgets the syntax for various SQL calls. And it's just like in coding. There are some people that memorize the entire code base, so they don't need autocomplete, they don't need any copilot, they don't need an IDE; they can just type in Notepad. But for kind of the rest of us, I think these tools are super useful. And I think we have seen that these tools have already changed dramatically
how people are interacting with their data, how they're writing their SQL queries. One of the things that we've done in MotherDuck is we focused on improving the experience of writing queries. So something we found is actually really useful is when somebody runs a query and there's an error,
we basically feed the line of the error into GPT-4 and ask it to fix it. And it turns out to be really good. We give it a bunch of context, and it gives you really good answers. The nice thing about that is, if you forget the syntax, if you forget the order of parameters for the date diff function (is it date_diff, and where do you say whether you want hours?), there's just a bunch of stuff that's fiddly, and you just type what you think it's going to be.
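To make the shape of that loop concrete, here is a very simplified sketch, not MotherDuck's actual implementation: it assumes the openai Python package with an OPENAI_API_KEY in the environment, a placeholder model name, and a toy query and schema hint.

```python
import duckdb
from openai import OpenAI

client = OpenAI()

def suggest_fix(query: str, error: str, schema_hint: str) -> str:
    """Ask the model for a corrected query, given the error message and some schema context."""
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder; any capable model
        messages=[
            {"role": "system", "content": "You fix DuckDB SQL queries. Return only the corrected SQL."},
            {"role": "user", "content": f"Schema:\n{schema_hint}\n\nQuery:\n{query}\n\nError:\n{error}"},
        ],
    )
    return response.choices[0].message.content

query = "SELECT datediff(hour, start_ts, end_ts) FROM runs"  # toy query with shaky syntax
try:
    duckdb.sql(query)
except Exception as exc:
    fixed = suggest_fix(query, str(exc), "runs(start_ts TIMESTAMP, end_ts TIMESTAMP)")
    print("Suggested fix:\n", fixed)  # the user confirms before re-running
```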
And then, like, it'll just fix it, and then it'll show you, okay, is it this? And you hit yes. It's a great way of letting you stay in the flow of writing your queries and having true interactivity, versus going, oh, crap, I don't remember the arguments to that, and then you go to the docs, and you wander through the docs, and then you come back. And I think there's tons more you can do there: you can provide more context about the schemas, et cetera. So with autocomplete, similar to the types of things you get in GitHub Copilot, I think there's a great opportunity to continue to build on those. And I think Hex does a lot of these things where you can automatically visualize it: what would be an interesting visualization of this data? Very often that will be the thing you'd actually want to do, and you can build that kind of stuff relatively easily. It used to be that if you wanted to build some of these types of features, you'd have to have a team of experts working for years. And instead you have
an intern who's really excited about this stuff, who works over the weekend and comes up with some amazing things. Never underestimate the power of motivated interns. The other kind of interesting thing: is there a world where you go the next step further, where instead of just helping you write your queries, you write in something like natural language?
There are a lot of really good demos you can show where you write English and it turns into good SQL. I think it's hard to go beyond demos, because there's so much in typical analytics that is actually specific to that organization, specific to how the data is laid out and how the data is used. If I say, what was my revenue last quarter broken down by region and rep, or something like that?
What's a region? What's a quarter? Somebody else might use a different fiscal quarter. And then revenue is, oh, that's incredibly hard to compute, because there's this other column that flags whether it's fraud, or some of the data is users that haven't actually signed up yet. Or there's currency issues, and, okay, there was a bug in the data three months ago and you have to put in a fix for that. There's just all kinds of really complicated stuff that has to go on to actually answer simple questions.
And so I'm not super bullish on doing this in the general case. I think there may be cases where it works, and that's perhaps for applications. We have customers actually doing this who are building applications. Because they have basically the same schema for each one of their users, and it's a relatively simple, very clean data schema, and it's not massive amounts of data, and they can name everything well and add really good metadata about all the fields, they can actually get pretty good performance doing English text-to-SQL. That's kind of one area that's promising.
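A rough sketch of that setup, text-to-SQL over a small, clean, well-described schema, is shown below; the schema, helper, and model name are all hypothetical, and the generated SQL would be reviewed (or at least sandboxed) before execution.

```python
from openai import OpenAI

client = OpenAI()

# The kind of clean, well-named, well-documented schema description that makes
# text-to-SQL tractable; in a real application this would come from the catalog.
SCHEMA_DOC = """
Table orders(order_id INTEGER, customer_id INTEGER, region VARCHAR,
             amount_usd DOUBLE, ordered_at DATE)
-- amount_usd: recognized revenue in USD, net of refunds
-- region: one of 'AMER', 'EMEA', 'APAC'
"""

def text_to_sql(question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[
            {"role": "system",
             "content": "Translate the question into a single DuckDB SQL query. "
                        "Use only the tables and columns described. Return only SQL."},
            {"role": "user", "content": f"{SCHEMA_DOC}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content

print(text_to_sql("What was revenue last quarter broken down by region?"))
```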
And to me, the other area that's promising is I think you need to go through some sort of semantic layer. And semantic layers have been interesting for me for a long time. But I think there's some really cool things you can do with a semantic layer. The semantic layer can basically... You define what revenue is. You define what a quarter is. You define how your data model interoperates. And I think armed with that, you can do a much better job doing text-to-SQL and more arbitrary things because...
You just have to map English text to the thing in your model, versus mapping English text to the physical layer of how things are described in your database. Sounds like you're in the camp of: AI analysts are not going to replace real data analysts just yet, and only under the condition that there is a very clean data set to be analyzed and a semantic layer laid out. There could be some more automation, but in the broad sense of using an AI analyst to parse through data, we're still far away from getting accurate results.
Yeah, I'm not quite an AI maximalist yet. I think AI can do some really cool things and amazing things, but there are going to be limits. That said, you know, a lot of people that have said there's limits to what AI can do have been wrong. So we'll see. We've been talking about trying to replace data analysts or democratize
that ability for well over a decade now, it feels like, and it doesn't seem like there's been a lot of movement. So maybe you're onto something. I think the same thing with Copilot, too. It sounds like, for people who program, it's a good augmentation tool, but you still need to know what you're doing and know the ins and outs of what you're actually building, or, in this case, the data you're working with. Yeah. And I think on the Copilot side, there are people who think coding is more like writing: writing as a stream of consciousness, where you just write and write and write, and okay, you're done; maybe you test it, then you're done. But I think most of software engineering is more like editing, where you add some key pieces here, tie these things together, and transform these things. That's something that AI, you know, maybe will get good at, but it is a pretty long way off from that kind of thing. And that's what engineers spend the vast majority of their time on. Yeah, I remember back in the day when data science was having its moment, at least as a term of art, and it was the sexiest job on the planet and all that sort of stuff. But then you would talk to someone with that job title and be like, well, most of my time is spent munging data and doing data prep.
Have we made forward progress on that? And is that an area where AI actually could help out the data processing workflow and the data science workflow, in the sense of cleaning up this data more programmatically? I think absolutely. You know, like I mentioned, the DuckDB CSV parser is sort of an example that doesn't actually apply AI, but my guess is it would be an interesting research project to apply AI to inferring schemas in
poorly structured data like CSVs, and being able to get through problems, because right now there are a bunch of heuristics that end up getting used. But I also think that's one of the reasons data scientists love DuckDB: it just helps them reduce a lot of the toil. And that doesn't mean we can't make it better. And there you have it. Hopefully you found that to be an interesting and insightful conversation. And remember, you can check out the Small Data SF conference on September 24th in San Francisco.
And if you like the podcast, please do rate it, review it, and however else you feel like promoting it across social media and your platform of choice.