
Talk to Your Data: The SQL Data Analyst

2025/2/28

MLOps.community

Chapters
The Toqan Data Analyst agent at Prosus aims to empower employees to access data more efficiently. It functions as a virtual data analyst, answering data questions via Slack, thereby reducing the burden on human analysts and increasing data accessibility. The agent has proven successful, delivering a 74% reduction in response time and a significant increase in the number of questions answered.
  • The agent is in production and answering a large volume of questions.
  • It is designed as an addition to existing data analyst teams, not a replacement.
  • It acts as the first point of contact in shared Slack channels, responding quickly and with high accuracy.

Transcript

We are back for another edition of the Prosus MLOps Community limited series all around using agents in production. In this episode, we break down the Toqan Data Analyst, which is one of the agents that actually made it to production. Paul and I talk about the inspiration behind it first, and then we bring in Doné to talk about the technical aspects and

where they hit challenges. Again, it's very rare for me to see companies that are this transparent about the technical difficulties of putting something into production, so I've got to give a huge shout-out to Prosus: thank you for making this happen and being this transparent. Let's get into this episode.

Welcome back to another episode of the Prosus AI, MLOps Community Agents in Production podcast. The miniseries. I like it. What are we going to talk about today? Data analysts and everything that that means, because I'm sure when we say it, nobody understands it. It is an internal project that made it through the great filter of your projects that have had success. So break it down. What is it?

Well, what we talked about last time is that we're moving down this sort of evolution of agents and getting them into production in ways that can add value. And you spoke about the graveyard of AI agents.

The data analyst is one of those that is not in the graveyard. It's live and kicking. It's in production answering a ton of questions. There's a lot of experimentation going on, and the data analyst is the agent that essentially helps anybody answer questions about data that sits in any of the companies in the Prosus group

using an agentic framework. And it uses all of the agentic tools that we talked about earlier: the ability to go out and fetch data, because it can talk to the database, and it understands the fields and the columns. In essence, it's like talking to a real data analyst on Slack. That's what it is. And we're going to bring Doné on later to talk about the technical ins and outs of this. But before we do that, let's talk

more about the inspiration, the actual tool, what it means on the business side to have this and a little less technical. If somebody wants to just jump to the technical part, they can skip ahead right now, but we're going to set the scene for what this actually means and who's using it, what successes have you found from it.

Yeah, the inspiration, in the end, comes back to who we are and what we do at the group at Prosus. We're about 30,000 employees spread across various tech companies, where people continuously try to innovate, make decisions in the product, try to understand what the customers are asking, where their orders went, and so on. And that often requires going into a database and fetching the right

information to be able to make a decision, respond to a customer, dispatch an order, and so on. And so we've got thousands of people, data analysts, whose full-time jobs

are to help others in the organization and enable them to make decisions. And as we found that our Toqan data analyst was actually able to answer those questions with high fidelity and high precision, we wanted to roll it out to basically beef up everybody's ability to ask questions

and talk to data, ask any question and talk to the data. - And you say, so there's thousands of data analysts. I presume they know how to go in and ask questions of the data already.

So is it only them that are getting use of this data analyst agent, or are other people able to now get more familiar with the data and answer questions? No, we see it as an addition to any data analyst team. So you've got existing teams of data analysts, and the Toqan data analyst essentially

joins a team of data analysts and is able to answer a whole bunch of questions with them for end users. It's part of the team. So, for example, when people go in and ask a question to the data analyst, it's in a Slack channel where the other data analysts are also present and would typically be the first

point of contact. Now the Toqan data analyst is the first point of contact. It can answer the question, and the other data analysts in these teams can see the answer and make sure that, if there are issues with it, if it can't retrieve something or it's incorrect, they can flag that to the end users. So it's very much an addition to the existing data analyst teams.

Of course, the Toqan data analyst agent is always available, so it answers very quickly. We've seen that that's a huge benefit. It can't always answer everything, but because it's fast and has high accuracy, it's something that has been able to give people a lot more speed in finding the right information they need. Yeah, it's almost like the support burden

of people coming to data analysts and maybe it's other data analysts coming to another data analyst to say,

hey, I'm trying to get deeper on something that you're very familiar with. Can you give me these answers? That support burden has now been filtered out a bit by the data analyst agent, because you do not need to go to another human as your first point of contact. You can see if the

agent can answer it. And if it can, cool, I got what I need. I'm going to go and do my thing. But otherwise, if it can't, then you can escalate it up to a human. And I think that's one of the pieces that we're seeing time and time again in these use cases: the support use case for agents, being that first line of contact. And if they can't handle

this or get the ticket closed, then you can escalate it up to a human. So it relieves a bit of that first line. Absolutely. Those simple questions, right? Yeah. Let me give you a little example of how it's being used. Let's take iFood, right? It's

a leading global food delivery player based out of Brazil. And they do about a billion, more than a billion orders every year. And there's a lot of people in the organization who are trying to understand what's happening. So how many orders has this restaurant done? Or what are we seeing in this particular zip code?

or a customer operator that's trying to figure out what happened to a specific order with a certain order ID that needs to figure out what's going on.

To get the answer to those questions, you need to be able to interact with the data. But these are often repeat questions. Hey, tell me, what's the status of this order? Or, like I said, I'm going to visit this partner or this restaurant; how have they been doing over the last three months? Those are the kinds of things that the Toqan data analyst is able to do really, really well.

because it understands the data. They're fairly simple queries. There's still a lot of very complex queries that the human data analysts obviously need to help with. But it's sort of, as you said, the first layer of high-frequency questions that we know we can answer well, that the data analysts can support these people that aren't able to write SQL queries themselves or don't even have access to the databases to do so. Yeah, that's a great point. And

You talk about iFood being one of the companies that you released this to. Did you release it to everyone in the Prosus group? And so you have...

Anyone who wants to can now use the data analyst, or is it something that you rolled out to specific companies in the portfolio? Yeah, we have. It's not a flip of the switch where you just connect it and it works. There's a process that takes a couple of days to onboard a Toqan data analyst on a specific use case. So it's available for everybody to onboard, and we've made it fairly seamless

to do that onboarding. It's very quick and easy to do so, but it still requires effort because, of course, there's some integrations. We can connect with Snowflake and with Databricks and with BigQuery and many of the other commonly used data stores. But that still needs to happen, right? That's an integration that needs to take place. And it needs to be a specific instance. Like you have the iFood data analyst instance so that you're not grabbing data analyst stuff from another... Oh, yeah, this is all...

I mean, this is all logically separated. And that is by design how Toqan and its agents work. So it's per company. But then even within a company, when it goes to OLX, which is one of our global classifieds players, or to another of our companies in the food delivery space, all using this data analyst, they need to then go and figure out, okay, this is a marketing

use case and the data analyst needs to be connected to marketing data or it needs to be connected to the logistics data set, which actually tracks the status real time of orders. And so that integration needs to take place. And then when, once that's done, then they can use it. Anybody can ask questions about that data set. And we see that, you know, when people do that, there's two things that happen. One is the time to get the answer reduces. We measured it's about 74% reduction.

74% reduction? Yeah. That's huge. That's huge. I mean, for folks who are... I thought it was going to be like 20%, and even 20% would be a success, right?

But 74%? First of all, how did you measure that, right? We got to go deeper into these numbers because that is huge. Yeah, I mean, so when somebody has a question, they typically post it to one of these Slack channels. And then they need to wait for somebody who's available. And then that person needs to go spend some time, write the query, check it, and respond. We're only measuring here the actual time it takes for somebody to go. And once they're already working it, they're basically...

Okay. Yeah. So query submitted and then response submitted, type of thing. Okay. Yeah. So that's one piece. The other piece is the fact that we see many, many more questions being answered as a whole on these data sets. So we saw very quickly, after connecting Toqan data analysts to, for example, the marketing data set, the logistics data set,

Those became the most used data sets in the company because now anybody can ask a question to them. So we have, you know, a doubling or tripling of the number of insights, we call that, that we can actually generate based on this data because it doesn't, there's no bottleneck or limit to the number of humans who can run queries. In fact, now anybody can go into that channel and ask a question. Yeah.

So those are the two interesting effects, time reduction, but also net new insights generated as a total for a company.

A lot of people, almost everybody in an organization wants to be able to quickly get information. And some organizations have really spent a lot of time structuring how to make sure that they democratize data access. It's sort of the popular term. But to give you an example, at iFood, they have a term called BADHU. So it's an acronym for Business Analyst Data Heavy Users. Oh, wow. And these are the people that are basically...

going to need to interact with data very regularly, continuously, daily in their jobs to be able to do their job. Often it's in a supporting role, but not always. It may be part of a sales team, a logistics team. And they realized that this was an important skill they needed to have as an organization to be able to move fast. And so they created a whole program. They gave this name to this cohort of people that they gave access to a certain set of tools. And it's exactly...

that constituency that would benefit a lot from this kind of a tool because they are interfacing with the data all the time. They know it intimately well and they've got a lot of repeat questions. And a lot of organizations aren't as structured around this as, for example, iFood, but we see that almost every organization in the group

has that need and has some sort of a group of people who are continuously leaning on data to move forward. Asking questions. And if it weren't for Toqan, they would be pissing off the data team, probably. Probably, yeah. If they were asking things of the data teams... It reminds me of a conversation I had with a big media company back in the day, right? And they said that they...

They were inundated with data questions continuously and so much so that they couldn't get their work done. Eventually, what they did to fix this problem, because it was before the agent world and any notion of being able to have self-serve,

They said, we're going to send out data advocates into the company, and we are going to pair them with each person and make sure that that person knows that when they ask a question of, say, the last marketing campaign's data, they have to have a next

action. So they can't just ask random questions that they're never going to act on, because those take up more of the data team's time to answer and service, even if it didn't actually lead to anything beyond, oh, the marketing team now knows that this was better than that. And these small insights,

you recognize, are very powerful, because they might just percolate in your mind for a while, and then boom, you have a stroke of insight in the shower, and later it's like, it's all because I had those random queries.

But for this team, they didn't have the resources to do that. Isn't that silly? Because you want everybody to have a good intuition of where the business is going and how we're doing. For us, often we're managing by certain types of North Star metrics. And there's not always a next action immediately. You just want to understand where are we? Are we on track? How does the stuff I do contribute to that? And sure, in some cases, there's a very direct action. But sort of this feels like...

constraining anybody's ability to understand. Exactly. So, but I think we're here to talk about, you know, the technical aspects. Might as well get into it. Yeah. Why don't we bring in Doné, who's our resident expert. Exactly. We just heard.

the breakdown on like the why, the inspiration, but you get to tell us what's behind the scenes, what happened to make this actually go into production and what were some of these headaches and what would you do differently, hopefully, if you knew then what you know now. But maybe we can just start with like, what does it look like behind the scenes when someone asks a question of the data today?

Yeah, so the SQL analyst, I mean, it is an agent. So if you're familiar with agent frameworks, it's kind of running on those same fundamental concepts of, you know, you have the LLM, which is the brain. You've got the tools which are doing the actual work. And then you've got the history, which is maintaining context, which allows the agent to continue a conversation with context and like understanding its history.

The difference is, because it is a specialized agent, it also has specialized tools. So I think when we talk about chatbot agents, often it's web search, some file handling, and these are not necessarily

what the SQL agent is going to need to do its job. So it's got specialized tools, which are specifically for it to understand context on the table metadata, schema, data types, what the tables are used for, and where they fit into the broader business context. And then I think probably the most fundamental one is the ability to execute SQL queries. But it doesn't have...

specific databases to call as a tool? Or how does it know where the data lives and which database to go to? So we've built it in a way that, from the agent's perspective, is a generalized ability to execute SQL. So it's an execute-SQL tool, and under the hood we're building integrations one at a time for the different databases you could connect to. So when you set up your agent, because it is based on a

company's data, you set up the agent with that context, and part of what you set up is which SQL connector you're using and which dialect you're using. That's part of what the agent knows as part of its context, and it will then execute SQL

And let it know, okay, I need you to use the Redshift one. I need you to use the Snowflake one. And this is the dialect and the version of that that I'm using. Oh, interesting. So basically there's almost like a knowledge base that says if it is a question around marketing, you go to this database and this is the kind of database. So here's the SQL that you should be using. Yeah, it's not...

At this point, it's a bit more fixed, a bit more deterministic. So you set up a SQL agent for your marketing database, and when you set that up, you say, my marketing database is using Postgres and it's using this dialect of SQL. And then anytime a question is asked on that use case, the agent only gets that information. And that's where the Slack channels come in, because that's how we manage

the agent knowing whether it's on the marketing database, the logistics database, or the groceries database. Because it's a specific Slack channel for each? Yes. Oh, okay. So then you don't even have to have the agent calling different databases. It's just that

it knows, I'm a marketing database agent. Or it doesn't necessarily know that, but that is how you architected it. That makes sense. So from the agent's perspective, all it knows is, I write SQL in T-SQL and this is the database I have access to. And under the hood, we take those pieces of information: okay, you've got to go to this particular executor in order to actually run the query.
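To make that concrete, here is a minimal sketch of the pattern being described: the agent sees a single generic execute-SQL tool, and a per-use-case config (fixed by the Slack channel) decides which integration actually runs the query. All names below are illustrative assumptions, not Prosus's actual code.

```python
# Illustrative sketch only; class and function names are invented for this example.
from dataclasses import dataclass
from typing import Callable

@dataclass
class UseCaseConfig:
    """Per-use-case setup an analyst provides at onboarding (one per Slack channel)."""
    name: str         # e.g. "ifood-marketing"
    connector: str    # "postgres", "snowflake", "bigquery", "databricks", ...
    dialect: str      # the SQL dialect the agent is told to write
    schema_doc: str   # table/column descriptions written for the agent

# Integrations are built one at a time under the hood; stubs shown here.
def _run_postgres(query: str) -> list[dict]:
    raise NotImplementedError("e.g. a psycopg connection pool")

def _run_snowflake(query: str) -> list[dict]:
    raise NotImplementedError("e.g. snowflake-connector-python")

CONNECTORS: dict[str, Callable[[str], list[dict]]] = {
    "postgres": _run_postgres,
    "snowflake": _run_snowflake,
}

def execute_sql(query: str, config: UseCaseConfig) -> list[dict]:
    """The one tool the agent sees: a generalized 'execute SQL' call.

    The agent never chooses a database; the Slack channel fixes the use case,
    and the use case fixes the connector and dialect.
    """
    return CONNECTORS[config.connector](query)
```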

One thing that I wanted to know about was: are you using specific text-to-SQL models,

fine-tuned in certain ways, or are you just throwing it at whatever OpenAI model? Yeah, so right now it's purely on the OpenAI models. When we POC'd it, the effort around hosting your own model like that wasn't necessarily worth it, as well as the additional cost, because we were working with OpenAI generally for this kind of work.

So for the POC it made sense to use the quickest solution. And what we found is that even a generalized model like that is really good at generating valid SQL.

You don't need a specialist at generating SQL. They're very, very good at that. What they're not very good at is reasoning over the context around the table metadata. And you don't need a specialized SQL agent for that. You need a good reasoner. So it made sense to kind of stick with models that were good at reasoning rather than models that were specialized at SQL. Also because I think we do a lot of model fine-tuning and so on for other use cases. We had a pretty good idea what it would take, but looking at the

Toqan data analyst initially, we wanted to make sure that it was usable, and it needed to have a certain accuracy and reliability from a user perspective. And generating the SQL was not the bottleneck there. There were a lot of other reasons it sometimes wouldn't answer. Yes. But that's what we need to talk about right now. It wasn't the text-to-SQL capabilities. Yeah. What were the reasons it wouldn't answer, if it wasn't the SQL generation? Ironically, it wasn't so much that it wouldn't answer; it's that it would continue to answer when it shouldn't.

Too much confidence. So much confidence in these large language models. I think one of the

it's maybe one of the first places where we really had to sort of start thinking about prompt fine-tuning, you know, trying to set up the persona, trying to instruct this thing away from answering questions that it didn't know the answer to. Because these agents get like a bit of business context and they get the user question. And from that, they are then sort of expected to evaluate the user question for clarity. Do I have enough information to answer the question? And very often,

especially at the beginning, it was almost impossible for it to say no. Because it always thought it had enough context. It would always sort of assume, right? It was doing... And I think this is a large language model problem in general is...

Where there is ambiguity or there is space for assumption, they will do that. And so it becomes quite unpredictable in these cases to know what it will do. And generally what it will do is answer. Even if you were prompting it like crazy to say, make sure you ask the question this or that. We had some use cases when we were building where we had requirements on the amount of information it needed to have.

And it would still try and answer. It would hallucinate the requirements: of course I got that requirement. I mean, it's actually one of the scariest failure cases, because if the model is powerful enough in its reasoning capabilities and it has enough examples, it will generate an answer that's plausible. So it's like, yeah, this took 15 minutes to deliver. And that sounds about right, but there was actually no query run, right? So it was a very difficult failure case to spot.

Forcing it not to do that was very, very hard. So this is one of those places where we had to find another way to solve the problem, because the LLMs were not sufficient and prompting was insufficient. And so we've

introduced this idea of like a pre-check. Like you ask in isolation, does this question have enough? So the agent's job at this point is not to generate SQL. Its job is to say, do I have enough information? And you isolate that task and then it starts to behave a bit better. And for that purpose, we've also built like a clarity data set, which is a data set where we give the question and the context and

The test is: is there enough information here for you to answer the question? And we test the models, per LLM, on how good each one is at doubting itself. That's really been a big part of building trust in the SQL agent, because

the answers are plausible, and if it's not a user who's into the data enough to see the problem, it's the perfect storm: they quote numbers that aren't necessarily true. And then the analysts who do know, they lose trust in it. Because it's kind of their assistant, if they don't trust it, they won't use it, and they won't give it to people to use. And because they own it, they're the ones who have to add people to the Slack channels that let them query the actual agent.

At the risk of going on a little tangent, not related to the data analyst, I want to talk about how you are breaking down the queries that are coming through, or the agents on the back end that are modular in this sense, because what you talked about there is this

verification agent, or, I don't know if you have a specific name for it? So in the original version of the Toqan analyst we just called it a clarity check, and it was part of the original flow. As the user question comes in, it's not really part of the agentic framework in the typical sense; it's part of how the architecture of the system is built, and that architecture has this sort of pre-step before anything is fed to the agent.

So it goes into a single task, which is not agent-specific. It's just an LLM call. That comes back, and based on that decision, either the agent is kicked off or it goes back to the user. Yeah, because it feels like that can be very generalized across all use cases. Anytime you're interacting with an agent, you should have that as your first step, just to make sure: do we have enough information? You have to give it the context that it needs.
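A minimal sketch of that pre-check, assuming the OpenAI client and an injected run_sql_agent callable; the prompt wording and model name are illustrative, not the production prompt.

```python
# Illustrative clarity pre-check: a single LLM call that gates the agent.
import json
from openai import OpenAI

client = OpenAI()

CLARITY_PROMPT = """Business context:
{context}

User question:
{question}

Can this question be answered from the context without guessing?
Reply as JSON: {{"sufficient": true or false, "clarifying_question": "<ask this if not sufficient>"}}"""

def clarity_check(question: str, context: str) -> dict:
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        response_format={"type": "json_object"},
        messages=[{"role": "user",
                   "content": CLARITY_PROMPT.format(context=context, question=question)}],
    )
    return json.loads(resp.choices[0].message.content)

def handle_question(question: str, context: str, run_sql_agent) -> str:
    verdict = clarity_check(question, context)
    if not verdict["sufficient"]:
        return verdict["clarifying_question"]   # bounce back to the user in Slack
    return run_sql_agent(question, context)     # only now kick off the full agent
```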

But then if you build that once, you can have it as a modular piece that you stick into all agents. Yeah, and I think that's one of the cool things the SQL agent has taught us about agents in general. As we go into more specific agents, because they have their task, and when we build them we understand what we want the agent to do,

building this step that goes, "Okay, let's just double-check, let's set up the question correctly for this task" is a super useful step to implement for any kind of agent, to enforce this uncertainty until LLMs get a little bit more humble. A little less confident. So that was one big piece. What were other parts where it fell down? SQL generation isn't the hard part, as you said;

clarifying whether it has enough information is. I know, I think

You mentioned before that you had to create context for the business, and that was really difficult. Because if I'm in marketing and I say an MQL, what does that mean? An SQL, what does that mean? And we know, yeah, okay, MQL is a marketing qualified lead, but what isn't a marketing qualified lead? Yeah, so I think that remains the biggest bottleneck to getting new use cases set up, right?

And the biggest task that any analyst who wants to use this has to go through is setting up the context. So when we talk about context, generally we're talking about the schema, and the schema includes table descriptions, columns, column types, and column descriptions. There's a lot of information that's needed. And many companies that have quite good data platforms, iFood was one of them, have a lot of this available. They have very sophisticated metadata on their actual tables.

And even once we could kind of plug into that, we found that when it's written for humans, it doesn't mean it's written for agents. That's an insight. Wait, pull on that thread a little bit more. What does that mean? It's written for humans versus agents. Well, because the metadata for these tables is still written by someone in marketing for other marketing data analysts. So there's still this kind of

expected understanding of the abbreviations. There's an expected understanding of how the data connects to each other. I think we had one funny issue where someone was asking for location data, or a question based on locations, and

the table's instructions still said you have to use the country, use the country for this, and the person asked for the city. There are other examples, like financial metrics, right? Somebody would ask, what's the contribution margin? Or what's GM1, gross margin 1? Which are all very strictly defined

terms in a certain context, but the agent wouldn't know that without it being actually spelled out: okay, gross margin 1 equals this revenue minus this cost, and so on. Having that available to the agent at the time it interprets a user's question is important, because it isn't enough that there's a data catalog that describes gross margin 1

as this or that; you actually need the formula, right? And so those are the kinds of things I think we were typically missing in the data catalogs.
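For illustration, context written for the agent tends to spell those definitions out rather than assume them. The table, columns, and formula below are invented examples, not iFood's real schema or its actual GM1 definition.

```python
# Invented example of agent-facing context: formulas and conventions made explicit.
MARKETING_CONTEXT = {
    "tables": {
        "orders": {
            "description": "One row per delivered order. Join to restaurants on restaurant_id.",
            "columns": {
                "order_id": "Unique order identifier (string).",
                "gmv": "Gross merchandise value of the order, in local currency.",
                "delivery_cost": "Direct cost of delivering the order, in local currency.",
                "city": "City where the order was delivered.",
                "country": "Country code (ISO 3166-1 alpha-2).",
            },
        },
    },
    "metrics": {
        "gross_margin_1": {
            "definition": "Order revenue minus direct delivery cost.",
            "formula": "SUM(gmv) - SUM(delivery_cost)",   # spelled out, not assumed
        },
    },
    "conventions": [
        "All timestamps are in UTC.",
        "Answer 'location' questions at city level unless a country is explicitly named.",
    ],
}
```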

There's a lot of implicit company knowledge already embedded in the way these things were described at the column level or table-schema level that the agent still wasn't automatically able to know or understand. Just continuing on the idea of agent versus human descriptions, how do you now think about what needs to happen so that the analyst gets

all the data it needs, or is it more agentic in everything that you're building out? So, because we tended to build these, especially at the beginning, quite closely with the teams that were actually using the analyst, we learned a lot about how you need to phrase this information so that it is clear and there's very little space for ambiguity. And so when we have new use cases, we have recommendations on how you structure and instruct it.

Yeah, so it's just... I think the big thing we've learned is that we have to learn how to phrase statements without ambiguity. And there's things where you're like, oh, but it's obvious, but it's stupid things like...

If you have sufficient information, to us that's clear. But what does sufficient information actually mean? You run into these kinds of barriers all the time, especially when you're kind of describing what this is useful for. And so that's kind of what we've had to sort of learn and the information we try to give as we're setting up new ones. So basically anything subjective, you've got to get that out of there. Yeah. Anything where, especially if it's subjective or if there's,

an assumption of knowledge, an assumption of context or history, you have to get that out or you have to explicitly put it in. I mean, and when we did these onboarding use cases, it was often like, well, imagine you hired a new data analyst and this is day one. They also wouldn't know a lot of these things. So what are the kinds of information you need to provide to them

and making sure that this stuff is embedded into the system, that it doesn't behave like a day one data analyst, but a hopefully year one data analyst or maybe even more down the road. Yeah. Well, now let's talk about how the architecture looks because it is...

Calling the database right away, like what does it actually go into? You said that you have the verification step and then it generates the SQL or does it call a tool that has a function where the SQL is already generated and this is a...

hot topic of debate that I love asking people about, because it's like, well, do you have thousands of tools or thousands of function calls, or do you let the LLM generate that SQL, which may or may not be the best SQL? Yeah. So we went through a couple of cycles with this. When we were POCing, of course, you're working with one kind of understanding of the problem, and as you release it, that changes. So the big thing is: where do you generate SQL?

Initially, we generated it as a separate tool. So the tool was sort of generate-SQL, and it would be given the general context, the business rules, and the user question. It would come back with the SQL, and that would be sent over to an additional tool which would actually run the execution. But the problem, as the context and the problems we were working on got harder and harder, was

sharing enough information to the tool that generates the SQL for it to generate SQL that is valid. So at the moment, we generate SQL as part of the main agent responsibility and that then is passed through as a parameter. But it's one of those things where you are having to weigh up the consequences of that because also what we then found was

that if it generates SQL as part of its main flow, which is then used in a function, and the query breaks, you have to go back to the main agent flow and start from square one. And often that's not necessary. You know, you talk about an analyst on day one, and I think that's a very good way to think of this analyst, as a trainee. Even very practiced SQL analysts

always have to look things up. It doesn't matter how long you've been doing it; if you've had to swap between date functions across dialects even once, you're going to confuse them. And that's one of the things this agent frequently makes mistakes on: which date function to use, even though it knows what dialect it's in. Wow. And it's a waste, right? You go to the execute-SQL function, it executes, it breaks, and you've got to go back to the main flow to fix what is probably three characters that are wrong.

So, as we've learned, we've also made these tools more sophisticated and allowed the tool that's given the SQL to also update the SQL. So there again: where is the SQL being generated? Well, actually, it's being generated in two places.

And that's helpful, because it really comes down to how much context you need for a particular task. When you're generating the initial query, you need all of it. But when you're fixing a date function or a small syntax error, you don't need the full context. You just need the error message. It's almost like you checkpoint it, and you don't have to go back to square one; you just come back to where the problem is. Yeah, exactly. And I want to hear a bit more about the evolution and the design decisions that you didn't take, and why.
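A rough sketch of that two-place generation, with hypothetical run_query and llm helpers: the main agent drafts the query, and the execute tool patches small dialect or syntax errors using only the error message, instead of restarting the whole flow.

```python
# Illustrative repair loop inside the execute-SQL tool; names are hypothetical.
class QueryError(Exception):
    """Stand-in for whatever the database driver raises."""

def execute_with_repair(query: str, dialect: str, run_query, llm, max_repairs: int = 2):
    """run_query(sql) -> rows, llm(prompt) -> text; both injected for this sketch."""
    for _ in range(max_repairs + 1):
        try:
            return run_query(query)
        except QueryError as err:
            # Deliberately minimal context: just the dialect, the query, and the error.
            query = llm(
                f"Fix this {dialect} query. Return only the corrected SQL.\n\n"
                f"Query:\n{query}\n\nError:\n{err}"
            )
    # If small repairs keep failing, hand control back to the main agent flow.
    raise QueryError("Repair attempts exhausted; escalate to the main agent.")
```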

I think one of the big things we get asked about very often is are we using RAG and are we using embeddings for the task of selecting relevant schema. So getting relevant schema for the SQL analyst is the biggest step. If you...

Think about going through your table documentation, finding the correct table, and then finding the relevant columns for your question. That's the biggest part you're going to have to do in this whole task. And that's, again, for the agent, that's going to be the biggest thing it has to do correctly. And documentation can get quite big. We've got use cases that have 18 plus tables. And we can't feed all of that information to...

the agent in one go. Especially when we first started building, the context windows just weren't long enough for us to do that, which is why the tool was built initially. And that process we do in plain text right now. So we store the metadata in a structured way, and that gets fed into an LLM call along with the question, and we,

through a series of steps, filter out the relevant tables, the relevant columns, and the relevant information, and then share as little as possible back to the agent. There was, of course, the option: we could have embedded this information, embedded the user question, and done it as that sort of process. Part of the reason we didn't was just the complexity of a solution like that, because we would then have to continuously be updating those embeddings, which change quite frequently, especially at the beginning when the analysts are setting it up, because they're

Making context, testing it, changing context, testing it. And if you're doing this via embeddings, it just gets incredibly high maintenance. We didn't want that to start with. And I think what we found when we've tried other knowledge-based projects is when you're using embeddings, it's quite difficult to embed the content in a way that will...

Get your question to retrieve the correct information. And that comes again to this idea of clarity. Like questions are short. They don't have a lot of context. And in a table, what's being embedded, like the gist of it is quite different to what's in a question. And so it just created such a big problem that didn't need to be solved for us at the time. Maybe as projects get bigger, if we start looking at like better ways to structure things,

so that they more accurately match what an embedding for the table would look like, that would work. But at least initially, it's so much easier also to rationalize over plain text than it is over embeddings when it starts making mistakes, especially. Yeah, and you don't have to figure out the pains of RAG. And where are we chunking? How are we chunking this? There's so many...

things where RAG can fall over, especially if you're doing naive RAG, and then you're like, do I need to make my RAG more advanced here, or is that where the problem is? So you're adding complexity that it sounds like you didn't need. Yeah, and it's just another thing you have to evaluate, right? Now, besides evaluating whether you can retrieve relevant context, you're evaluating whether the embedding is correct, or not correct, but appropriate for the project. And it just created so much more complexity than is needed for this kind of project. Yeah.
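A sketch of that plain-text selection step, with an injected llm helper standing in for the actual series of calls: the full metadata plus the question go through a filtering prompt, and only the surviving slice is handed back to the agent.

```python
# Illustrative schema-selection step done in plain text rather than with embeddings.
def select_relevant_schema(question: str, schema_doc: str, llm) -> str:
    """schema_doc: the structured table/column metadata rendered as text.
    llm(prompt) -> text; in practice this may be a short series of calls."""
    prompt = (
        "Below is the documentation for every table in this use case.\n\n"
        f"{schema_doc}\n\n"
        f"User question: {question}\n\n"
        "Return only the tables and columns needed to answer the question, "
        "with their descriptions. Omit everything else."
    )
    # Plain text in, plain text out: easy to inspect when the selection goes wrong.
    return llm(prompt)
```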

Yeah, it starts to sprawl. Yeah, because you're trying to solve one problem, right? So what's the best tool for that? Not the most sophisticated tool for that. So accuracy is like a really hard thing with a project like this or I think just with like SQL in general, there's...

A lot of ways the same query can be written, so you can't really go, okay, well, let's just make sure that they match. Because of certain date factors, you also can't just check the outputs are exactly the same, because if you ask it and the test set's not appropriately set up, you can't just use the numbers. So it's very difficult to measure accuracy, and we're very reliant on the analysts who are setting this up to kind of evaluate accuracy.

Is it appropriate? So initially when we were building, we tried to push really hard to have fixed test sets with a defined question, the expected SQL query, and the expected output. But

it's quite difficult to get people to commit to those kind of like hard requirements. But it was a good place to start also just when we were making changes to check that everything was working. But I think part of what we learned when we were doing those kinds of evaluations was, again, kind of refining the problem down to as small a space as possible. So when analysts set up the project, they set up

I think we call them use cases. So it's like the use case for collecting orders for the last month based on said parameter. And they'll say, okay, we're 100% sure that the agent can answer these types of questions. And so they release that as being a question that people can ask regularly. And in that field, it can do that use case. And then use cases have levels of complexity as well. So the harder a use case is,

the harder it is to get the agent to reliably answer it. And so we've classed the use cases into difficulty, and then we've classed them also into, is this a releasable use case for users, or is this something that's still under development? And so we also have, every analyst has two versions of their agent available. They've got what we call the prod and the dev version,

And the dev one is where they're constantly adding new tables, changing context, and running many, many queries to test it. Only when they're happy with the context do they release it to prod. And prod is where they're adding the users they would normally be serving, or the other analysts who need the agent, because that's where they're trusting those use cases to be correct.
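The field names below are invented, but a released use case roughly amounts to an entry like this, which the dev agent can be tested against before it is trusted on prod.

```python
# Invented shape of a use-case entry; not Prosus's actual format.
USE_CASES = [
    {
        "question": "How many orders did restaurant X deliver last month?",
        "expected_sql": "SELECT COUNT(*) FROM orders WHERE restaurant_id = :x AND ...",
        "difficulty": "easy",   # harder classes are answered less reliably
        "released": True,       # True = trusted on the prod agent, False = dev only
    },
]
```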

We don't get a lot of hard numbers from the analysts, in terms of, you know, we've got six use cases and 80% of them are at 100% accuracy. We don't generally talk in metrics. I think, again, if you think about it as an analyst: they use it, they talk to it, and when they feel like, I trust that the agent understands, they kind of go, okay, you can now be released. Well, you're saying something that's pretty cool there, which is how closely integrated the development is with the analysts. Yeah. And I think that's part of why this has worked as well as it has, because we didn't

build a general SQL analyst and say, okay, there's the analyst, go forth and build context. We worked very closely with the people who were going to use the analyst and they kind of built it for them. And because they had built it specifically for them, they knew and they trusted it and they were invested in it and so they pushed it. And that's a huge part, I think, of why it succeeded the way it did.

And yeah, it's a very important process because again, they have that knowledge. They have the domain information that we don't have. The agent can't infer and they're the only ones who know how to share it

in a way that, you know, is clear. And where we helped was sort of this, how do we get rid of the ambiguity in your descriptions? You know, where do we deal with the fact that there's assumptions here and where do we put those assumptions in explicit texts, you know, so the agent has access to it. But we did have to work very closely with them and

Now it's a bit more mature and because the teams we work with share information with each other, like iFood, there's loads of data teams they share with each other. They can kind of run on their own. But that first step, the analysts had to be very, very involved in how

they build the context. And I think this approach highlights, to your point, why this is so successful. We've seen, and of course thought ourselves about, other things that we've seen fail, right? There are two common approaches that others have tried that haven't seen this uptake. One is a

really boiled-down version of: we've got a workflow that can generate a SQL query on our data, for our data, right? So it's basically an LLM and a prompt, which has some access to the business context, and its output, given a human-readable question, is

a SQL query. But that didn't really solve the problem. A lot of teams have built these in-house solutions, but they don't really help the thousands of other people who can't verify whether that query is correct or run it. Plus, an actual data analyst doesn't need help with those simple queries; they need help with the complex joins, the ones they actually spend a lot of time on.

So that's one avenue where sort of these homegrown solutions, build your own SQL query writer, actually don't really solve the big problem. The other one is there are solutions out there that claim they can entirely substitute a data analyst. They can basically say, well, listen, we are just a completely no-code data, talk-to-my-data solution, and there's no data analyst involved whatsoever.

And our solution, the Toqan data analyst, is sort of a mix in between the two, because you're interacting with the Toqan data analyst in a public Slack channel, or at least a shared Slack channel with data analysts. And that only happens after the data analyst has curated,

honed and developed, co-developed this system to actually give sufficiently reliable answers. And the data analyst really becomes the kind of orchestrator, supervisor of this actual agent. And I think that's probably why we're at a point now where there's real usage.

because with these other two, you don't really solve for how this thing actually gets used in production. And since we're talking about agents in production: if you don't think about the users, the expectations, and who else needs to contribute for this thing to be reliable, then it's less likely to be really adopted. What about complexity? Have you seen a red line where it just tops out, questions that are super complex and the agent can't get to them?

Yeah, we've definitely seen, and I mean, this is something we're still talking to some teams about because the complexity kind of has generated this. There's a ceiling. Yeah, there is a ceiling. So, yeah.

Maybe I give it too much character, but I think of the agent as always trying to solve things in the simplest way possible. And so it was quite difficult to convince it to do very complex queries, where it's doing very complicated joins, CTEs, multiple-step queries. Yeah, it basically always goes to the simplest solution. So on the complex things,

there's sort of a natural ceiling, where I think a substantial constituency of our users actually think it's still not good enough, because they want to ask it many more complex things and it defaults to the simplest route. If you look at the pyramid, it handles a healthy share of the questions coming in, but there's still a lot of other, more complex questions that it could,

and probably soon will be able to, answer, but doesn't yet. And users, I think, find that disappointing or frustrating sometimes. But I think that's why we also position this as: it's not going to replace a data analyst. A data analyst is going to be able to do these very complex queries.

And it's also not going to be necessarily for someone who knows nothing about the data and has zero kind of insights into sort of what the information they need, like what the technical requirements are. It's really that middle ground where it's people who need some assistance with the data. They have a little bit of understanding of what they're looking for and understanding

that it can do it for them, right? And where the value is, is that the analyst who can do the complicated queries doesn't have to do these anymore. So there's this offloading, this first line, where we can say, okay, try the analyst and see if you can get your answer there. And then the analysts who would normally be doing that can focus on those complicated questions, and also on their actual work, right? The insights they're trying to generate for,

like, you know, for business users, for like campaigns or like conferences and that sort of thing. Whereas, you know, like the sort of in-between users, they can use this to catch whatever insights they want. And that means they can ask many more questions and they can make more data-driven decisions than they could before just by virtue of this thing always being online. Yeah. There might be a couple of other things that are, I think, interesting on the technical front that we learned. At the beginning, for example, because it's...

got a validation step and it can sort of self-correct and sometimes it can go back and fetch other columns. At the beginning, one of the simple things we had to solve for is it gets stuck in loops.

So that's one thing. How do you get this thing to at some point just say, I'm done? I don't know or I do know. Another thing was, how do we make sure that we have that integration set up that's agent compatible? Because you're actually running queries on a database. You don't want to do select star. And because these are, like I said,

a billion orders a year, right? So if you just run a select-star on the orders database... You should have some kind of guardrails on what queries it should and shouldn't be allowed to pass. Yeah, the agent could get really expensive really fast. Or drop a table, obviously. There's no write access; that kind of stuff should never be possible. So those are things

we had to solve, right? Then you had to make sure that some of the queries, because when people started to use it a lot, wouldn't require so much memory that they would shut down or overload the database and nothing would work anymore. We had to deal with the fact that at some point people wanted to visualize the data.

Right, so okay, well, we had a code execution tool, so we connected that, but it needed to be in certain standards. So how do you make sure that the charts that are produced are correct? We had issues on, again, I'm sharing problems here and talking about real hard learned lessons. These are real problems. Other things that we struggled with, well, like,

what data should it actually return to the user? Sometimes users wanted a set of orders. Give me a sample of orders so they want a CSV file back. So how do you sort of extract what is allowable? How do you return that in Slack or in other environments to the user?
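A sketch of the kind of query guardrails mentioned a moment ago: on top of a read-only database role, incoming queries can be checked before execution. The specific rules and limits here are examples, not Prosus's actual policy.

```python
# Illustrative query guardrails applied before anything reaches the database.
import re

WRITE_STATEMENTS = re.compile(r"\b(insert|update|delete|drop|alter|truncate|grant)\b", re.I)

def validate_query(query: str, max_rows: int = 10_000) -> str:
    if WRITE_STATEMENTS.search(query):
        raise ValueError("Write statements are not allowed; the agent is read-only.")
    if re.search(r"select\s+\*", query, re.I):
        raise ValueError("SELECT * is not allowed; request specific columns.")
    if not re.search(r"\blimit\b", query, re.I):
        # Cap result size so one question can't overload the database.
        query = f"{query.rstrip().rstrip(';')} LIMIT {max_rows}"
    return query
```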

Oh, other things. We went through generations of models, right? So we've gone through various generations of models that were doing parts of the flow. And as we were upgrading, you know,

frequently, on a new generation or a new version, the whole thing would inexplicably start to behave very differently. Yeah, it would stop working, but also sometimes just be very different. And so you'd just convinced a cohort of analysts to phrase their questions a certain way, or the tool would look for the column names in a certain way, and then that wouldn't work anymore. Were you running prompt evals to

solve that as you were going to upgrade? Because you have all these questions, I imagine, from the analysts.

Did you run them through? How did you know? Or did you just plug in the new model and then it was like, oh, it doesn't work? A little bit of both. Yeah, well, yeah, it is a little bit of both. So when we did the initial transfer to try and move from the existing model to one of the newer ones, we had some test sets from the original use cases and we just saw massive changes in behavior and just...

The prompt had been truly optimized for the model that it had been built on. And the context that the analysts had been building and changing and updating for the agent, that had all been optimized for this prompt. And that was one of the first times I had seen that you can also... We don't make machine learning models anymore, but you can still over...

Over-engineer? Overfit the prompt. Yeah, you can overfit the prompt. That is the new risk when you're working with these kinds of things: you can overfit the prompt. So when we moved models, we ran some of the tests we had, and we saw huge changes in behavior, and we had to roll back to an old model. It's one of those problems we're still trying to figure out: how to make these things more agnostic to the model they're given.
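A sketch of the kind of regression check this implies, assuming a run_agent helper and a small fixed test set: replay the analysts' questions against the old and the new model and diff the behavior before anything is rolled out.

```python
# Illustrative model-swap regression check; helpers and result fields are assumed.
def compare_models(test_set, run_agent, old_model: str, new_model: str) -> list[dict]:
    """run_agent(question, model) -> {"answered": bool, "rows": list} (hypothetical)."""
    regressions = []
    for case in test_set:
        old = run_agent(case["question"], model=old_model)
        new = run_agent(case["question"], model=new_model)
        # Exact SQL strings rarely match, so compare behavior that matters instead:
        # did it answer at all, and did the query return the same rows?
        if (old["answered"], old["rows"]) != (new["answered"], new["rows"]):
            regressions.append({"question": case["question"], "old": old, "new": new})
    return regressions
```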

I think as models get smarter, maybe that'll be a little bit easier. But at the moment, the model is the model. I think this is also a space where, and we're recording this in early 2025, we're just seeing this next wave of reasoning models come in, and it feels like that's a big opportunity, especially on these kinds of

fairly mechanical tasks that are very rational, right? We're not asking it to do a creative marketing piece. No, we're actually asking it to go very logically through a series of actions, pulling certain sets of data to get to an answer. And I think these reasoning models will offer a ton of potential for this particular agent to do that well,

because of this intermediate rationale generation, the inference compute that you can sort of scale, and I think we'll soon be able to fine-tune on these kinds of tasks. We'll probably see a leap in performance and probably a lifting of the ceiling of what these models can do to actually be able to answer questions using data sets, running queries, and so on.

That's my prediction on the agent for this year. That's the next one. So where's this all going? Where do you see this heading in the future? Yeah, I think that's one of the things I'm excited to see is, especially if we're thinking about this like pre-validation step, because the reasoning will help it to understand, to reason, do I have enough information? And also,

And one of the things I'm excited about is: can we use this pre-step to generate the question in a more structured way, which would again remove the assumptions, remove the gaps in knowledge, remove the ambiguities? That would then help down the line as well, because you would only need that version of the question, not the original one. So the reasoning models, I think, are going to be where we finally see some big jumps. And I wonder, because it's funny how, when you put the context in,

you don't know if you've given it enough context until you run it and you see, ha, that didn't work. But if you could just give it to the reasoning model and say like, is this enough context for you? That would be...

easier than running the whole thing and seeing if it fails and passes. And we could also see to some degree what it knows, what it's assuming. And so we can diagnose more easily where the disconnect is between what it's being given and what it's expected to do. Because right now we're really reliant on going through it and kind of

guessing, or, you know, they're educated guesses after this long, but it's still kind of a guess as to why the 70-billion-parameter model has done this. I mean, we see the ability to take these... like what DeepSeek R1 is doing, where it basically distills its reasoning into a smaller Qwen or Llama model. I think that's a perfect application here: for SQL generation, you can look at

query and then SQL answer pairs that we've generated in the past, stitch rationales together, figure out why did you do that. Obviously, that's the label, the training data set you need to then use to fine-tune this model. And then it should be able to apply that reasoning

about the data sets and the kinds of questions coming to do the downstream future tasks much better. So I think that's certainly something that we'll be looking into. As long as the data schema doesn't change and everything stays perfect, right? Sure, but it should get also a lot more business context. And so maybe the data schema will change, but it will...

be able to do it much better than the current version of these data analysts, which, when the schema changes, definitely starts making false assumptions. And they can act as assistants, right? So one of the things we want to work on, but haven't quite managed to do in a scalable way yet, is to let the agent

do some kind of evaluation of the schema as it exists and say, okay, point out the ambiguities, point out where there's not enough information, check where there's terminology that's not clear. And if you could give that to a reasoning agent, again, with the context of question-answer pairs, it could suggest, okay, you need to change this, which for the onboarding process also would ease the process significantly for the analysts.

The Prosus AI team is hiring, and you can find all the links to everything you need to know in the show notes below.