
Real World AI Agent Stories // Zach Wallace // #283

2025/1/15

MLOps.community

People
Zach Wallace
Topics
Zach Wallace: I led Nearpod's data platform transformation, which aimed to solve the problem of scattered, hard-to-process data. By consolidating our various data sources and using tools like dbt and Redshift, we were able to process and transform data effectively. I led the team in building a data product exchange as the foundation of a data mesh, which lets us handle large-scale data consistently, reliably, and confidently. We are moving toward a full data mesh but are still in a transition phase, because we need to define clear domains and implement streaming data products. I see a data product as the intersection of data and its definition: it clearly expresses what the data means and how it is meant to be used.


Chapters
This chapter details Nearpod's journey of transforming its data platform using dbt and Redshift. It describes challenges with disparate data sources, the implementation of a data product exchange, and the transition towards a data mesh architecture.
  • Nearpod used dbt and Redshift to consolidate data from disparate sources.
  • They implemented a data product exchange, laying the groundwork for a data mesh.
  • The process involved ELT (Extract, Load, Transform) for real-time data updates.
  • Data products were defined as the intersection of data and its definition, enabling consistent data sharing across the system.

Transcript


Hey, everybody. My name is Zach Wallace. I am an engineering manager at Nearpod. We are in the ed tech space for K-12 throughout the U.S., but also throughout the world. And how do I take my coffee? I take my coffee black every time. Yeah, I learned that in college where I was desperate for money and cream was expensive.

All right, this guy feels like he is smack dab in the bullseye of the Venn diagram I would draw for this podcast. I knew that Zach and I were going to be friends, to be honest. I knew it. I had him on here, and it was an excellent conversation. And why did I know that? Because he wrote a whole blog post on how they made their data platform more efficient at Nearpod. And...

Right when he got on, he said: you know what I've been doing, though, which is pretty wild? Diving into the world of agents. And I have now done some crazy stuff when it comes to breaking down barriers between departments at the company. So we talk about all that. Welcome back to another MLOps Community Podcast. I'm your host, Demetrios. Let's get into it. Let's start with this.

We were supposed to come on here and talk all about the data platform and the transformation you did there. Maybe you can just give us the TLDR, the super-condensed version of that, because we're going to take a bit of a curveball here and go on a totally different path. But I feel like there's a lot of valuable information in what you've done with the data platform. So let's go over that real fast and then turn left. Yeah, for sure.

I'll try to summarize this as best I can, and my ten-minute read on Medium already felt like a summary, so I'm going to try to do a little better here. Essentially, we had data in a bunch of disparate systems, a bunch of disparate data sources. We have monoliths and microservices on the data architecture side. And with that, we had data all over the place, and we didn't really have a good way of processing, condensing, or transforming the data, or processing it for reports in any manner, right?

What we were able to do is utilize a combination of tools. dbt Core is what we're using under the hood, and it's been fantastic. Some of my engineers have absolutely fallen in love with it, because it's transformational for an engineer going into the data engineering world. The best way to describe it is that it feels like you're engineering data, like you're software engineering with data. And then we're using Redshift, which in the past has had some really large issues, but we're using it like an Apache Spark: it's just for transferring the data and bringing the data to where we need it.

And then we're passing it to other areas. You know, we're a subsidiary of a larger company, so maybe we're sending it to Snowflake, or maybe we're using S3 to process other areas of data. And we built ourselves a data product exchange, which is the underpinning of a data mesh, so you can identify where the interactions are across the data products and how you'll be able to interact with data products throughout our system. And again, we have around 20 disparate data sources, and they have anywhere from millions of rows to billions of rows. So we're talking large-scale data to some degree, not Meta-large, right? But larger than a POC, if you will. And we're able to do this with consistency, with reliability, and with confidence. Yeah.

Dude, so explain: what does it look like with the disparate data sources, and how did you pipe them in? Or did you just make each one of those its own API? Like, I guess what I'm not super clear on is...

Give me the breakdown. You had this database over here, and then you had another database over there, and you had to connect those two with dbt, and then you're joining them and putting them in another database that then has a data contract around it or something, and it's a data product. What does that look like in practice? Sure. Yeah. So we have data stored in these disparate systems.

And most of them are Aurora; some are Dynamo. But one of the key pieces is that Redshift enables something called zero-ETL. Essentially, this is ELT, so it's not ETL like you would typically see it, but it provides real-time updates for any data within any of these disparate systems. And if you hook it up, you're able to get the data transferred over to Redshift,

and then you're able to process the data in Redshift if you want, or through the exposures we've set up with dbt, so we're able to bring that data out of there into any other system, like Snowflake, for instance. Bringing this into Snowflake enables us to actually do transformations in Snowflake, where it's very powerful. So again, we're really using Redshift like everything in Apache Spark, and it's a wild mental shift because of how easy zero-ETL is to set up. Each database took maybe five minutes to set up in the AWS console, and then we're able to send and process that data wherever and however we want.
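For reference, this is roughly what that five-minute setup boils down to. A minimal sketch using boto3's RDS CreateIntegration call, which creates an Aurora-to-Redshift zero-ETL integration; the ARNs and integration name here are placeholders, not Nearpod's actual configuration:

```python
import boto3

# Placeholder ARNs; substitute your Aurora cluster and Redshift namespace.
SOURCE_ARN = "arn:aws:rds:us-east-1:123456789012:cluster:app-db"
TARGET_ARN = "arn:aws:redshift-serverless:us-east-1:123456789012:namespace/analytics"

rds = boto3.client("rds")

# One call per source database; AWS then manages the replication,
# which is why each integration only takes a few minutes to set up.
response = rds.create_integration(
    SourceArn=SOURCE_ARN,
    TargetArn=TARGET_ARN,
    IntegrationName="app-db-to-redshift",
)
print(response["Status"])  # e.g. "creating"
```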

We use S3 as an intermediary between different data sources beyond Redshift, right? So the microservices go into Redshift, the DBs go into Redshift, and then from there, they'll go into S3 or Snowflake or something like that.

But where does dbt come in? I didn't catch that part. Yeah, totally fair. So dbt is how we process and transform all of our data in any of our areas. We have multiple dbt repos based on the domain the data is living in.

And dbt will typically live... we have one project running against Redshift and one against Snowflake. So if we want to process the data for any of the domain-relevant areas inside of Redshift, we'll run that through dbt. Then that gets passed down to Snowflake, and we can do even more transformations in there.
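As a rough illustration of chaining those two projects, dbt Core (1.5 and later) exposes a programmatic runner. A sketch of that two-pass flow might look like this; the project directories and node selectors are hypothetical:

```python
# Sketch of chaining the Redshift and Snowflake dbt projects from Python
# using dbt Core's programmatic entry point (dbt Core 1.5+). The project
# directories and selectors are hypothetical.
from dbt.cli.main import dbtRunner

runner = dbtRunner()

# First pass: domain transformations in the Redshift project.
res = runner.invoke(["run", "--project-dir", "dbt_redshift", "--select", "staging+"])
assert res.success, "Redshift transformations failed"

# Second pass: downstream models in the Snowflake project.
res = runner.invoke(["run", "--project-dir", "dbt_snowflake", "--select", "marts+"])
assert res.success, "Snowflake transformations failed"
```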

Nice. Okay. And you did say that buzzword of 2022, I think: the data mesh. Why do you feel like this is data meshy, but it's not the full-on data mesh? Right. So the data mesh by itself provides the ability to send data, and defined data, anywhere it needs to go across your system. The issue is that, similarly to how you build microservices, you have to build this with domains in mind, right?

And so as we're building this, we're working towards a data mesh right now, where we're defining the domains. But it takes a while to break down microservice and monolith data architectures and really define the right domains. So we have a team working on that now, and they've been working on it for about 12 months. We have some data products, probably 20 or so, 30 maybe, that are able to transfer between different areas of our architecture.

But ultimately, the goal would be to implement streaming data products. Right now, we're only batch processing, right? At this point in time, zero-ETL works really well for batch processing, but not for streaming. And so we have to build the other side of that. That's why I would say we're sort of a data mesh: you can get data from batch processing, but not from streaming. And we need to really define how other teams can get in there, because this is an organizational shift when you talk about going from traditional MySQL data, you know, maybe it's PHP, maybe it's some sort of TypeScript ORM ordeal, right? You're defining these systems and how they work. But now we're bringing these out into a separate area and actually feeding them back from our transactional layer: we're able to send data to the analytics layer and then back into the transactional layer for further processing or real-time updates or anything. And that's where we're still trying to learn, if you will.

And what do you mean by data product?

Yeah, that's a great question. "Data product" is a word you'll hear defined 18 different ways, so it's important for you to define it yourself. That was kind of why I asked: for the listeners at home, aka me, what is a data product in your mind? Yep. So a data product, as we've defined it, is the intersection of the data and the data definition.

It's the transfer of a set of data that has a clear definition of what it is. So let's say a user, for instance: you send the times a user has logged in, you send the times a user has done something in your application, and all of a sudden that is user usage, right? And that is a data product by itself, because you're able to define exactly what it is, and you're sending the data, whether it's aggregated or just single rows, a bunch of rows. Nice. And so then theoretically you would have many different types of data products. Maybe there's user usage, but there's user profile, and there's user whatever else you can think about. Exactly. Okay, cool.
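To make "data plus definition" concrete, a data product in this sense could be modeled as a payload that always travels with its own schema and meaning. A minimal sketch; the class shape and the user_usage fields are hypothetical, not Nearpod's implementation:

```python
from dataclasses import dataclass, field

@dataclass
class DataProduct:
    """A data product: the data together with its definition."""
    name: str                   # e.g. "user_usage"
    description: str            # what the data means and how to use it
    schema: dict[str, str]      # column -> type: the contract consumers rely on
    rows: list[dict] = field(default_factory=list)

    def validate(self) -> None:
        # The definition travels with the data, so consumers can check
        # every row against the contract before using it.
        for row in self.rows:
            missing = set(self.schema) - set(row)
            if missing:
                raise ValueError(f"{self.name}: row missing fields {missing}")

user_usage = DataProduct(
    name="user_usage",
    description="Login and in-app activity events per user, one row per event.",
    schema={"user_id": "str", "event": "str", "occurred_at": "timestamp"},
    rows=[{"user_id": "u1", "event": "login", "occurred_at": "2025-01-15T08:00:00Z"}],
)
user_usage.validate()
```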

So now, let's take a hard left and talk about what you've been doing recently, because that was almost like your past life. And you just told me: for the last three months, I've been diving deep into agent architectures. And I thought, well, that's perfect, because I'm all in on agents too. So I do love talking about data engineering for ML and AI, and data platforms.

But I also am fascinated by agents right now. So what's your story there? Well, I would rephrase how we stated it as a hard left. I would actually say this is the next step. So in this architecture,

the key that you need across your system is data, right? You need the data of your users, of your system, of the architecture, to be able to facilitate quality in your system with LLMs. And that's key, because you can ask an LLM to do whatever you want, and it's going to give you whatever its interpretation is. But without data, it's not going to have the quality you need to provide reliable, confident answers, or suggestions, to your users, however you're going to use this. And so the data platform is really the first step: it's getting the data into places where you can now utilize it for better quality LLM responses.

So we took an endeavor on the agents. The market's demanding that we use LLMs in some capacity. I'm in the ed tech world, and that's a dangerous world to bring LLMs into, right? Because we have to consider the students, the parents, state legislation, national legislation. And we're a global company, so if we go global, how does this affect different cultures across the world?

And that's a tough problem to solve, as you've probably seen. Whether it's language barriers, because LLMs are not great at transadaptations (they're good at translations in some cases, but not transadaptations), or identifying culturally significant events or culturally sensitive topics. Nuances. Yeah, nuances is a good way of putting that. And so as we're getting into this, there's a lot to think about, right? What we started with was question generation. We're in ed tech; we're trying to provide value for the teachers. They work in almost every country I've ever heard of. They work intense hours. They don't have enough time in the day to do what they need to do, and they're getting burnt out. The students are affected by that burnout. The parents are affected. Everyone's affected by this. And so we focused on the teachers. That's a powerful way of approaching this. So with question generation: how can we reduce the time teachers spend generating questions?

And we started building agents to do this. And that was really powerful. The way that I would describe these agents is that they're almost like three-year-old consultants, if you will, when you start. You're going to bring these into production, but you're essentially asking a three-year-old to generate school questions for school teachers. And we know that's not going to be what they need right now, right? Like, we know this. But what it does is enable us to start building out these domains of specialists. And this is where the interesting part comes in, because

it took us seven hours to build something like this, right? Which in the past is something that's just unimaginable. We would have had to take 60, at least, right? At least. Maybe it's eight months to do something like this, or even something remotely close. But the dev cycle has been reduced so drastically to get a proof of concept out that now we're actually building independent services that other teams across our company can access and provide insights on. So we need other people, other teams across the company, to help us. In the past, we've been the bottleneck.

But now we're providing this opportunity for other teams to collaborate closely with engineers in a way they've never been able to in the past. And when you say other teams, you mean other engineering teams, or anybody in any department can help? Other departments.

Yeah. And that's the key. Right. Because we have all this knowledge, we can build this super quickly. You've seen a bunch of different ed tech companies come out with question generation or slide generation. But if you bring those to a teacher, at least from what I've heard, they're all going to say they're really subpar in quality. Yeah, they generate lessons, but they don't meet any standards. They don't help you design a real lesson that you could immediately use in your classroom. Why is that? Because engineers are building these. We're three-year-old consultants ourselves. So now we're the three-year-old consultants telling other three-year-old consultants what to do, right? And so we need to get those subject matter experts closer to the code, closer to the development cycle. And why do you say that you're using agents, or why is this an agent problem, as opposed to just, like, pinging an LLM? Yep. So as you think about this from a consultant perspective, right, you're building a very domain-specific agent

to define and handle a problem. So as we're going through this, to give you an example, we have an input validation agent. What that does is validate our input. We have a lot of legislation we need to handle, a lot of sensitive topics across the nation and across international cultures that we need to think about. And because this is coming from our company, we do not want to take a stance on those, regardless of how we feel about them. We need legal, we need curriculum development, we need these other teams to be closer to us, right? Then for the actual question generation, that's its own domain where it's just generating questions; the entire purpose of that consultant is to generate questions for all of our teachers, right? And as you're thinking about it, you're actually building, again, these mini consultants, and you start to understand that these tasks need to be broken out. It's one of those idioms of the past: are you a jack of all trades and a master of none, or a master of one who doesn't know the rest? I totally botched that, but you get the point, right? We don't want to build a jack of all trades in a lot of cases. And so you're going to start to see these agents throughout our system, so much so that now we're building an agent registry that can be seen and utilized throughout our system.
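To make the registry idea concrete, here is a minimal sketch of what publishing agents to a shared registry and composing them off the shelf could look like; the registry shape, agent names, and wiring are hypothetical:

```python
from typing import Callable

# Hypothetical registry: name -> agent callable (text in, text out).
AGENT_REGISTRY: dict[str, Callable[[str], str]] = {}

def register(name: str):
    """Decorator so any team can publish an agent to the shared registry."""
    def wrap(fn: Callable[[str], str]):
        AGENT_REGISTRY[name] = fn
        return fn
    return wrap

@register("input_validation")
def input_validation(text: str) -> str:
    # Placeholder: in practice this would call an LLM with a narrow,
    # domain-specific prompt checking legal and cultural constraints.
    return "ok" if "forbidden" not in text.lower() else "rejected"

@register("question_generation")
def question_generation(topic: str) -> str:
    # Placeholder for the question-generation agent.
    return f"Draft questions about: {topic}"

def run_pipeline(topic: str) -> str:
    # A product team composes agents off the shelf.
    if AGENT_REGISTRY["input_validation"](topic) != "ok":
        return "Input rejected by validation agent."
    return AGENT_REGISTRY["question_generation"](topic)

print(run_pipeline("photosynthesis"))
```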

So that other engineers, or whoever wants to, can come grab them off the shelf and say: I'm going to put these three or four agents together to create my product. Exactly. Wow. And that's why you can

empower the other departments. Yep, exactly. So let's say we have 12 to 13 different teams across our company, and one of the product engineering teams says: oh, I need to go build product feature XYZ. Well, I want to use an agent for that. So what agents are available to me? What agents do I need to create? How do these agents interact? To give a very specific example, let's say that you want to tackle... and I need to take this out of ed tech, because I don't want to cross the line of sharing too much. Theoretically, if you were in a different business, like e-commerce. Right, like e-commerce, exactly. So let's say you go into e-commerce, and your goal for a product feature is to get the right product in front of the right user.

Right. So you're going to think about agents to understand: okay, what are the individual functions that need to happen, and how would I associate them on a larger scale? Do I need non-deterministic orchestration, or can I actually use deterministic orchestration to define which steps of these processes need to happen? So for the e-commerce example, you're going to need to understand: has this user ever bought anything on your site? What are the typical products they enjoy? Maybe you throw in curveballs to spark interest in other areas.

So you would have an agent to go through and understand the interests of this user, and you'd have an agent to go in and understand the typical things we're selling today, and then you'd have an agent to merge those two together into the user profile you're trying to generate. And then there's a lot of other things: you can think input validation, you can make sure you're not throwing errors. But then this is where the data platform comes in, because, again, you're sending data on the front end.

But what if you're collecting data from these agents, and you see that these agents are throwing errors, say, 30% of the time (hopefully it's not that bad, but as an example), while the agents are also giving you a success value, like someone purchasing something on this e-commerce platform? You can start to identify which are working well and which aren't. And so as you're taking this through, you can get it into the data platform, start processing this data, and

build a feedback loop to understand: what can we update? Where can we do this autonomously, so the agent is actually learning? Right now, if you're using ChatGPT or OpenAI or something like that, they don't have the ability to... I can't think of the word right now, but where you bring data back in and let it learn itself. If you're retraining? Yeah, retraining, or fine-tuning. Yep. And you can customize that a little bit by using RAG and updating the data that you're passing it and whatnot. So there are ways you can implement a feedback loop tying this whole system together.
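A small sketch of that feedback loop: log an outcome event per agent call, aggregate error and success rates per agent, and flag the ones whose prompts or retrieval data need attention. The event fields and thresholds are made up for illustration:

```python
# Hypothetical agent feedback loop: collect outcome events from agent
# calls, compute per-agent success/error rates, and flag agents whose
# prompts or RAG data need attention.
from collections import defaultdict

events = [
    # (agent_name, errored, led_to_purchase)
    ("user_interest", False, True),
    ("user_interest", True, False),
    ("business_needs", False, False),
    ("business_needs", False, True),
]

stats = defaultdict(lambda: {"calls": 0, "errors": 0, "purchases": 0})
for agent, errored, purchased in events:
    stats[agent]["calls"] += 1
    stats[agent]["errors"] += errored
    stats[agent]["purchases"] += purchased

for agent, s in stats.items():
    error_rate = s["errors"] / s["calls"]
    success_rate = s["purchases"] / s["calls"]
    flag = "review prompt/RAG data" if error_rate > 0.25 else "ok"
    print(f"{agent}: errors {error_rate:.0%}, success {success_rate:.0%} -> {flag}")
```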

Yeah, it makes me think of this guy, Tom, that I was interviewing a while back. He's doing Mixpanel for voice agents. And with voice agents, you can see it a little more clearly, because you're on the phone, it's real time, you're talking to them. And if something goes the wrong way, you want to know about that. Or if there's an expected call duration and all of a sudden all of your calls are taking three seconds, when the average before was a minute or a minute and a half,

you want to see that type of stuff. I hadn't thought about it with agents for what you're talking about, where you want this Mixpanel type of view to be able to understand where the agents are successful, and where they're not successful, in doing work,

in moving the needle on one of these metrics that's important for you. Yep. Yep. And that's a power of agents, to be completely honest, because you can now have your analytics department working on one agent, building out what it means for these to be successful. You can have your product engineering teams working on implementing this exact

agentic flow. Right. And then you have this idea of: well, why are we going to recreate the same agent in four or five different places? It's sort of like object-oriented programming in some cases, because you have to come back to the fundamentals and understand:

how can we repurpose this and reuse it in another area? How can we break the problem down to pick apart the standard problems, or maybe some of the more intricate problems that are very domain-specific? And I guess when I think about agents, one thing that I think about is how they're able to

take some kind of a question or request or instruction, and then figure out, out of all the possible actions they can take: okay, I'm going to use this tool. First of all, they have to understand that. So they have to know: should I ask for more context? Should I clarify what is wanted here, what outcome they're trying to reach? And then, cool, I can go grab this tool. I interviewed my buddy Sam before, and he talked about how, in every case you can, you want to (a) narrow the scope of what you're trying to get the agent to do, and (b) narrow the scope of what the tool is doing. So when the agent interacts with the tool, you want to narrow that scope as much as possible. And he gave me the example of

the difference between having an agent write a SQL statement, versus the agent having hundreds of SQL statements it can choose from and choosing the correct one, because it knows what you're trying to do. Yeah. And that's a powerful concept. There's a lot of implications in what you're bringing up, so let's go through them.
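A toy version of that narrowing idea: instead of letting the agent write arbitrary SQL, it only selects from a fixed menu of vetted, parameterized statements. The queries and the selection heuristic (a stand-in for the LLM call) are hypothetical:

```python
# Narrow-scope tool: the agent picks from vetted, parameterized queries
# rather than generating SQL from scratch.
VETTED_QUERIES = {
    "daily_active_users": "SELECT count(DISTINCT user_id) FROM events WHERE day = :day",
    "top_products": "SELECT product_id, count(*) FROM orders GROUP BY 1 ORDER BY 2 DESC LIMIT :n",
    "revenue_by_day": "SELECT day, sum(amount) FROM orders GROUP BY day",
}

def choose_query(request: str) -> str:
    """Stand-in for the agent: map a natural-language request to one of
    the vetted statements instead of writing new SQL."""
    request = request.lower()
    if "active" in request or "users" in request:
        return VETTED_QUERIES["daily_active_users"]
    if "product" in request:
        return VETTED_QUERIES["top_products"]
    return VETTED_QUERIES["revenue_by_day"]

print(choose_query("How many users were active yesterday?"))
```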

One of the things that we're learning a lot about right now is how you define the number of tokens and relate that to the cost and the time required to process. As you have specialized agents, you're typically going to have fewer tokens, right?

So they can run and get whatever they need from the LLM much quicker, versus broader, less specialized agents that are going to take longer to process the information and understand all of the context. Okay, so you're saying that when it is super narrow, and you have this smaller scope, you can not only save money, but it's more reliable. Exactly. And that's when you start considering multi-agent approaches, right? We're using multiple agents for everything we do, because it makes our processes easier to understand from the engineering side and easier to adjust. So let me give you a breakdown of the time requirements for this project.

We noticed that it takes about 10 to 20% of the time to actually build a POC and get something available for end users. But assessing the quality, understanding how this works with other departments, or diving into what we call false positives (where the agent reacts in a way that it believes is correct, but isn't; sort of like hallucinations, in some capacity), and trying to fine-tune those with your prompt or with the code, and there's a blend there, takes 80 to 90% of your time to debug.

So now it's sort of flipped, right? You need to be able to communicate exactly what's happening in each agent and understand exactly what the task is, to reduce communication channels between engineers on a team, between departments in a company, and elsewhere.

So are you coming to a place where you've got thousands of agents that you're dealing with? Or is it not that sprawled out? It is not anywhere near that right now, but we will get there. We will. Okay. So that's the end state: if you extrapolate this forward a few months or years,

you expect that to happen. Yeah. And I mean, you've seen everyone from HubSpot to Salesforce to Zuckerberg talking like there are going to be more agents than there are people. These agents could be like the mobile apps on your phone; that's the scale of agents I could see in the future. So I really like the idea of going into it

and looking at agents as part of a DAG: it's just one more step in the DAG, and it just so happens that in this step, it's a little non-deterministic, and we're giving it information from the other steps in the DAG. It can be that type of visual representation in my mind. Do you tend to build them like that, or do you look at it differently? Yeah, that's a great question. And it can be a little complex to identify at times. So I'm going to go with the base case and move up from there. So let's say you're building

just a standard agent, a single agent. The flow would be: you're coding, and you have one node in this DAG that you're calling, and it's going to give you some sort of response, however you're intending. If we go to the e-commerce example, it's going to say: oh, these are the user's interests, if you will. So it's going to be able to identify those.

Then if you scale that up, you start to say: okay, we want to add more to this to understand, well, what is the business interested in selling today? Because there are going to be different valuations on all of your products, different margins, et cetera. So the next step is saying: okay, instead of just calling this user-interest agent, we're also going to call this other step, this other tool, which identifies business needs. And so you almost have to build a third node, which is orchestrating this, right? And that's where this non-deterministic orchestration comes into play, and it becomes fascinating.

Because you can now say: okay, I want you to bring both of these tools into play based on what you're seeing. It can choose to bring in one, or none, or both, or whatever. So it comes in, calls those tools as it sees fit, and condenses the information. That's your high-level non-deterministic orchestration. But let's say it's not really producing quality results, right?

The non-deterministic orchestration is giving you some sort of summary, but it's not actionable. What are you going to do? Well, let's add another agent, right? And so you start to have this third node that is on the same tier in this DAG, because really, you're just calling and receiving responses. Actually, rather than saying same tier, let's say the DAG is: those first three are in one tier, and then you gather all this information and pass it to your analytics, or your summarizer, if you will. I don't know what to call it here, because I'm not familiar with e-commerce, but you get my gist. So you pass it to this next node, and that summarizes and coordinates a quality response.

But then let's say that you actually want this to be flexible based on current events. So then you can make that a non-deterministic agent too. So now you're going to have this DAG that just has a bunch of non-deterministic agents that are going and going and going, and they all have separate use cases. It's almost like you're giving the agents more features. Exactly. And you're enriching the agent with all of these different steps in the DAG. Exactly.
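To make that DAG concrete, here is a hypothetical sketch of the e-commerce example: two narrow agents on one tier, a non-deterministic orchestrator deciding which of them to call, and a summarizer node downstream. The stubbed decision logic stands in for an LLM choosing tools:

```python
# Hypothetical multi-agent DAG for the e-commerce example discussed above.
# Each "agent" is a node; the orchestrator is the non-deterministic step
# that picks which tools to bring into play (faked with simple logic here
# instead of an actual LLM call).
def user_interest_agent(user_id: str) -> str:
    return f"user {user_id} tends to buy running gear"

def business_needs_agent() -> str:
    return "high-margin products this week: trail shoes"

def orchestrator(user_id: str, query: str) -> list[str]:
    """Non-deterministic node: decides which tools to call. In production
    this choice would come from an LLM; here it's stubbed."""
    findings = []
    if user_id:
        findings.append(user_interest_agent(user_id))
    if "recommend" in query.lower():
        findings.append(business_needs_agent())
    return findings

def summarizer(findings: list[str]) -> str:
    """Downstream node: condenses the tier above into one response."""
    return "Recommendation based on: " + "; ".join(findings)

print(summarizer(orchestrator("u42", "Recommend something for me")))
```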

Yeah, that's fascinating. How are you looking at costs? Because I know you mentioned that before. Yeah, so

this is a fun one. So with our agents, we have built a custom evals framework. We based it on OpenAI's, but we had to bring the evals to our engineers. We're a platform team, so our goal is to make the engineers more productive, right? They're currently working in either Python or TypeScript. And Python for our feature engineers is pretty easy, right? You just have that basic OpenAI model. But for TypeScript, nothing existed out in the real world. So we built our own custom evals framework that can dive in and handle this. And within that, we've hooked it up to CI/CD for confidence and different levers. And we're able to assess

how much each of these agents is costing us based on our evals, and create an approximation of how much it's going to cost us in production based on usage and other metrics we're looking at. Whoa. And how did you do that? Because that's fascinating. I don't even know where you would start with that. Trying to think, like, how did you even break that down? Oh, my God. Yeah, yeah. So we're able to utilize...

I'm drawing a fine line between what I'm allowed to share and what I'm not, right? So we have internal logic that is able to define and measure cost and usage with the LLM, and identify what we're able to use and how we're able to use it, and we

associate that with real monetary values we've assessed that are very, very close to the real world. And then we have all of our measuring and monitoring data, within whatever telemetry you're using, to assess: okay, how many total users are we expecting to use this? And we associate those numbers with the actual calls we're making. And what we found out is that

with agents, especially using non-deterministic orchestration for any reiteration, look-backs, or reflection, we're able to get really accurate results, to the tune of 98 to 100 percent accuracy in our evals, for a lower cost, because we're able to use cheaper models and whatnot. And why do you think that is? Is it because you're passing in more context, you're giving it better information? What is the...

It's half prompt engineering and half software engineering at the end of the day. So we need to identify how we can reduce our token size, how we can reduce the number of calls. And you're stuck in this optimization loop; you're never going to have it perfect, right? But we have a ton of optimization nerds on our team who are really focused on: what is the cost, what is the quality, and how do we optimize for those?
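To sketch the kind of arithmetic involved (all numbers, prices, and the usage forecast below are made-up placeholders, not Nearpod's internal figures): run each eval case, count tokens, convert to dollars, then scale by expected production traffic:

```python
# Hypothetical eval-based cost estimate. Token counts would come from the
# LLM provider's API responses; prices and usage are placeholders.
PRICE_PER_1K_INPUT = 0.00015   # $ per 1K input tokens (placeholder)
PRICE_PER_1K_OUTPUT = 0.0006   # $ per 1K output tokens (placeholder)

eval_runs = [
    # (input_tokens, output_tokens, passed) per eval case
    (1_200, 300, True),
    (900, 250, True),
    (1_500, 400, False),
]

cost_per_call = sum(
    i / 1000 * PRICE_PER_1K_INPUT + o / 1000 * PRICE_PER_1K_OUTPUT
    for i, o, _ in eval_runs
) / len(eval_runs)

accuracy = sum(p for *_, p in eval_runs) / len(eval_runs)

expected_calls_per_month = 500_000  # from product telemetry (placeholder)
print(f"accuracy: {accuracy:.0%}")
print(f"estimated monthly cost: ${cost_per_call * expected_calls_per_month:,.2f}")
```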

And when they're looking at the cost, it's like: could we get rid of this sentence in the prompt? Because that means fewer input tokens. And when you multiply that

by the number of prompts we're going to be using with this agent, and all of the folks who are going to be using this agent out in the real world, it starts to add up. Exactly. And bringing this full circle: if you think about this from the perspective of consultants, maybe you ask them to do less in some cases, right? Which means less time and money associated with them.

Maybe you lay them off, because... and that sounds harsh, but at this point in time we're talking about LLM agents, not human consultants, right? So let me rephrase; I'm going to retry this. Coming back full circle: consider consultants. Some of the time,

you ask them to do less, for less money, for less time, for less operational cost. But at other times you say: okay, we've completed this task, and we don't really need that agent anymore. We don't need that consultant anymore, because we're doing something different. And I think that's powerful, because these are so easy to spin up.

The dev time associated with that enables us to remove things much quicker and say: okay, look, you spent two hours on this; I'm sorry, but this is no longer going to be used. And that's a much easier conversation with an engineer,

compared to spending six months, 12 months, iterating over years, and then removing the project they've been working on. Yeah, you're much less pot-committed. It's been whatever, a couple hours that you put it together. And when you talk about the marketplace, too, I can envision that you have the price tags there also. Like, this one: if you use this agent, expect it to cost X amount.

And if you're using all of these agents together, maybe it does a little fancy math and it adds all of the price of all of the different agents that you plan to be using in production. And so you can get an estimate of, all right, cool, well,

This agent's probably going to cost us X amount of money. Are we okay with that? Can we make that back? And does anyone have the ability to just launch something into production? How does it work if I'm in some random department and I want to create my agent? I throw a few of these different small agents together and I say: cool, I think I've got it. Let's launch it.

And then what? How do I launch it? How do I go from I've-got-something to production? That's such a great question. So organizational dynamics are going to change. This is one of the craziest realizations I've had over the past couple of months. And coming back to something we talked about earlier (I don't remember if we were recording or not):

in essence, we have to bring the departments closer, right? So when you have this product feature, it's no longer people sitting in a closed room figuring out what the user needs, where you have two people ideating on it and then sending it to the engineers, and the engineers say what they can and cannot do and give you a time assessment. Yes, that's still going to be a factor, right?

But the speed of determining that and enabling the actual POC or MVP is so much shorter. And that is the wild part about this. Because you can now say: okay,

if it takes seven hours to build this, we can give you a POC to see, are we meeting the user's demands based on your department's vision of this? In the past, even though we're iterating quickly, using CI/CD and updating production, it's with very small pieces of the puzzle.

With this, you almost take a whole puzzle and give it back to the other department and say: hey, look over this. What are we missing here? I'm a three-year-old, remember? So what did I not understand across this language gap we have at this point in time? And then they can give you actionable feedback on where you need to focus your time as an engineer. For ed tech, that's bringing curriculum development in, that's bringing sales in, that's bringing marketing in. You're bringing these other departments

into a much closer collaboration than we've ever seen. But this is for net-new products, right? Right. Or do you still feel like

there's going to be that capability with the gigantic code base and the monolith, where you go back and say: well, we want this feature. Okay. If it's a net-new agent, I can create it in a few hours. If it's I-have-to-go-dig-through-what-Johnny-did-two-years-ago, it's going to take me a few months. Yeah, and that's a great question. As every software engineer

will say: it depends, right? Because there's a lot of context that engineers have, and will need to continue to have, for the monolith and for other areas of your application. But I think what we'll start to see is that there are going to be ways to deprecate old code bases, or the more technically challenging areas of your code base, and update them with agents that solve the same concerns we've seen in the past. Wow, that's fascinating to think about, because at the end of the day, the bottom line is the end user just wants the hole in the wall. They don't want the drill. They don't care if it's a hammer that's doing it. They want the hole in the wall. And so if you're giving that to them with agents, and it's actually a much better experience,

and on your side of things the back end is much less complicated, easier to spin up, and quicker to get validation from, then it feels like a win-win. It also feels, just intuitively, very scary, because you're...

For you too, right? Okay, it's not just me. Yeah. So coming back to: how do we get this into production? Because I never really answered that; I just got excited by the first part of it, the organizational dynamics. What we will start to see is that it is a very scary thing to release your first LLM, AI, non-deterministic software development cycle. It's very, very tough to do that.

We took an approach where our evals were the single source of truth for our system. We have thousands of evals. It's very similar to TDD: you can think of them as tests, right?

We're generating user prompts, not from production, but user prompts that we would foresee happening in production. And these prompts are being added to our evals to assess: what are the boundaries we're seeing, and how are those boundaries handled by our LLM? Sensitive topics are something we have to be very careful about.

Legislation differs across the world, so we have to consider very different legislative requirements across the world. And with that in mind,

we had to build our evals. So we have this product team coming with a product goal based on the users. We identify the product design sprints you'll need with the relevant departments and the engineers, and now we're building out evals. Now, the interesting part about this is that the engineers in some companies are so far removed from what the end user actually does or inputs into the system. And so this collaboration loop is just

tightening every step of the way. So you're building out these evals within our CI/CD pipelines. We're able to run thousands of evals for each agent, with confidence of getting the right result 98 to 100 percent of the time. When you had that confidence with testing in the past, right, where you're testing deterministic functions, you felt reliable. You felt that you were going to release reliable work, right?
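As a rough sketch of gating a deploy on that kind of eval confidence in CI, with a hypothetical threshold and a faked eval runner:

```python
# Hypothetical CI gate: run the agent's eval suite and fail the build if
# the pass rate drops below the team's confidence threshold.
import sys

PASS_THRESHOLD = 0.98  # "getting the right result 98 to 100 percent of the time"

def run_eval_suite(agent_name: str) -> list[bool]:
    # Placeholder: in practice this executes every eval case against the
    # agent; here we fake four cases so the script runs end to end.
    return [True, True, True, False]

def main() -> None:
    results = run_eval_suite("question_generation")
    pass_rate = sum(results) / len(results)
    print(f"eval pass rate: {pass_rate:.1%}")
    if pass_rate < PASS_THRESHOLD:
        sys.exit(1)  # a non-zero exit fails the CI job and blocks the deploy

if __name__ == "__main__":
    main()
```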

With this, there's always going to be a notion of: but what if? Right? Because it's not deterministic. What if it goes Air Canada on us? Yeah. And that's always going to be there; it's non-deterministic by nature, and you have to accept that. So partly it's about accepting higher risk with production deployments. But as we're going through this and deploying it,

really, all we're doing is setting different environment variables. You're releasing this non-deterministic agent just as you would any other production deploy. Your risk is just increased. And where are some other areas? Because it feels like you could take what you are doing right now at your company and replicate it with different...

use cases? What are some... I'm sure you've thought about it, like: huh, I bet this would be a good remedy for these problems too. What are some other areas that you feel could be useful here? That's such a great question.

A lot of it depends on the data available, right? And again, this comes down to quality. To assess quality, you have to have a notion of what is good versus what is bad. So, I've worked in a few different areas, but I've focused on ed tech, so my main understanding is ed tech. But I'm a massive fan of fantasy football and areas like that, right? So let's consider fantasy football.

American football, or football like soccer? Yeah, so, both. I grew up playing soccer in America, but football across the world. Massive Arsenal fan, so I have a huge love for football

from outside of America. But all of my friends really only know American football, so I have to stay true to what I know and what my friends know. So we're going to consider American football here. You could consider injury analysis; that's something you could have an agent for, where you're thinking about: okay, is this player going to play? How long is this player out for? You could

consider the team's strength of schedule for the rest of the season, and identify the trade targets you want to focus on. And I think that applies generally across many different sports. There are a lot of ways you can implement agents to that degree. The only thing that matters is: where's the data, and who has the data? And that's really, really tough to find in the fantasy football world, because everyone is already price gouging it.

Really? As far as data goes, like, to get data on all of that? Yeah, for other purposes, right? Like, it's really hard to get data, and quality data, across many different systems,

to enable you to do backtesting and future predictions of how something will perform. But let's say you go into the health space: you have a lot of HIPAA rules, so you have to be careful about that. But there are ways you can aggregate the data within there and have an agent that's able to identify

something as simple as, if we take Silicon Valley's hot dog, not hot dog (if you've ever seen that show), you could take that into: serious issue, not serious issue. Right. And help them identify where they are in various elements of the space. Yeah. I was thinking about government contracts, and agents that can help you identify which government contracts to go for, or at least help you fill out the proposal

as much as possible. Yeah. Now, one thing I was thinking about, back to that fantasy football example: what in your mind makes this an agent need versus a traditional ML need? It's a great question. So I don't think those are mutually exclusive. I've been looking at a lot of

easy-to-define agent videos across YouTube and other places, because I've basically been giving this spiel to other departments across my company, so I'm trying to find the best way to be less technical. The way I would define it is that models, whether it's an LLM or traditional ML, are the brainpower. They can do very specific tasks.

Agents are what perform the job for you with that brainpower. So it's less about LLMs and their integration with agents, and more about enabling a model to perform a task. And that was a wild piece that I noticed.
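As a closing illustration of that split, a minimal hypothetical sketch where the model is pure brainpower (text in, text out) and the agent owns the job, deciding when to invoke a tool and when the task is done:

```python
# Minimal "model vs. agent" sketch. The model only maps text to text; the
# agent runs the job: it calls the model, executes any tool it asks for,
# and loops until the task is complete. Names and logic are illustrative.
def model_call(prompt: str) -> str:
    # Stand-in for an LLM API call: asks for a tool once, then finishes.
    if "Tool result:" in prompt:
        return "Here are five questions aligned to the standard."
    return "TOOL:lookup_standard('grade 5 math')"

def lookup_standard(query: str) -> str:
    # A narrow tool the agent can invoke on the model's behalf.
    return f"Standard text for {query}"

def agent(task: str) -> str:
    """The agent performs the job: call the model, run any requested
    tool, and feed the result back until the task is done."""
    prompt = task
    while True:
        reply = model_call(prompt)
        if reply.startswith("TOOL:"):
            tool_result = lookup_standard(reply.split("'")[1])
            prompt = f"{task}\nTool result: {tool_result}"
        else:
            return reply

print(agent("Generate questions aligned to a state standard"))
```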