
Inside Uber’s AI Revolution - Everything about how they use AI/ML

2025/7/4

MLOps.community

Chapters
This chapter explores Uber's AI platform, Michelangelo, and its role in powering various business-critical machine learning use cases. It discusses the platform's evolution and how it supports diverse developer needs and model types.
  • Uber uses a four-tier system to categorize its machine learning projects based on business impact.
  • Michelangelo powers 100% of Uber's business-critical ML use cases.
  • Michelangelo's evolution includes adapting to support deep learning and providing flexibility for developers to choose their tools.

Transcript

We actually tier all the different machine learning projects at Uber into four different tiers, with tier one being the most business-critical projects. We can measure some proxy metrics: how many models we trained last week, how many models we deployed.

For those users, we provide a way for them to directly access the infra layer. Michelangelo is the machine learning and AI platform for Uber. All of the AI and ML use cases are powered by Michelangelo? Yep.

Which is a feat. That's a freaking crazy thing to say just because of how many use cases you have. So we are, yeah, we're actually open sourcing Michelangelo. Do you want to touch on this? No. We are. What? Yeah. Holy shit.

Let's align on something first, and that is the definition of AI. Because nowadays everybody talks about AI. Even my mom asked me, "Hey, I heard you're working on this artificial intelligence thing. What exactly are you working on?" So I think when a lot of people out there talk about AI, they think of ChatGPT. That's what AI means to most of the people out there.

From the AI platform perspective, AI is not just ChatGPT. AI is not even just the large language models for generative AI, which is the technology behind ChatGPT. AI actually covers the whole spectrum of machine learning, from the very simple linear models, to tree-based models like random forest and XGBoost, to the traditional deep learning models like convolutional neural networks and recurrent neural networks, and all the way to generative AI.

So, let's keep that in mind and then let's talk about Michelangelo. Uber's AI journey started back in probably 2015, when a few teams like Maps, Pricing, and Risk started thinking about whether it was feasible to replace their rule-based systems with machine learning systems. So, when these teams started,

they all built their own one-off workflows or infra to support their machine learning needs for training or deploying models. This was random, ad hoc Python code sitting in notebooks. These notebooks were very hard to manage.

They're very hard to share. So it's basically non-reusable and non-shareable. And non-shippable. Non-shippable, exactly. Very hard to productionize. And impossible to scale, which leads to inconsistency in performance and also duplicated efforts across teams at Uber. So that's actually where Michelangelo comes into play. Michelangelo provides this centralized platform for our machine learning developers

to manage the whole machine learning lifecycle end-to-end without worrying about the underlying infra and system complexities. So that's why we actually started building Michelangelo back in 2016. Yeah, you can get that scalability. And I think there's something super cool that you pointed out in the recent blog post. Well, recent, like it's now a few months old. Yeah. On going from predictive to generative AI and how Michelangelo has had this journey. And you talked about how

folks want their own way of doing things. And so you have this centralized platform that's supposed to be helping them and streamlining things. But then,

you have those people that are pushing the boundaries and they're pushing the edges and they're asking for, "Hey, we want more deep learning use cases or we want to be able to support deep learning use cases." And then you had to go and figure out how to make Michelangelo adaptive for that. And so the flexibility on that platform, can you talk a little bit about how you thought about that? I think one of the lessons we learned along this journey is that you should let the developers choose which tool they want to use.

What we did is that we provided an abstraction layer on top of all the infra complexities with the pre-built templates and pipelines for more than 80% of the users to easily build machine learning applications.
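To make that 80% path concrete, here is a minimal sketch of what using a pre-built pipeline template might feel like from the developer's side. The names (`TrainingPipeline`, `run`, the field names) are hypothetical illustrations, not Michelangelo's actual API; the idea is just that common choices become declarative configuration while the platform owns the infra underneath.

```python
# Hypothetical sketch of a templated training pipeline: the developer declares
# what to train, the platform decides how and where to run it.
from dataclasses import dataclass, field

@dataclass
class TrainingPipeline:
    project: str
    model_type: str                      # e.g. "xgboost" or "deep_learning"
    features: list[str] = field(default_factory=list)
    label: str = ""
    schedule: str = "weekly"             # platform-managed retraining cadence

    def run(self) -> None:
        # A real platform would submit this to managed infra: feature
        # retrieval, training, evaluation, registration, deployment.
        print(f"[{self.project}] training {self.model_type} "
              f"on {len(self.features)} features, label={self.label}")

TrainingPipeline(
    project="eta-prediction",
    model_type="xgboost",
    features=["trip_distance", "time_of_day", "traffic_index"],
    label="actual_eta_seconds",
).run()
```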

For the rest, like 20% of users, these are the advanced power users, as you mentioned. They want to build highly customized workflows. They want to use different trainers to train different models. For those users, we provide a way for them to directly access the infra layer. For example, we have a tool called UniFlow, which is a Python orchestration framework,

which allows our users to write their own code, to customize their workflows, and directly access the Ray clusters to run their own training and then deploy the models on Michelangelo. So that's how we think about this, 80% and 20%. Yeah, and speed that process up so much, I can imagine, from that time

that you have the idea to actually deploying the idea is unreal. Is that something that you look at as a key metric? Yeah, if we talk about a North Star metric for Michelangelo, let's go back to the Michelangelo goal here, because the success metric is supposed to measure how well the product meets the product goal. So Michelangelo's goal is to provide best-in-class machine learning capabilities and tools for Uber's machine learning developers,

for them to rapidly build and iterate high-quality machine learning applications. So, there are two keywords here. One is the "rapidly," which measures the developer velocity. That's what you just mentioned. The ideal metric is time to production. Basically, from the ideation of a project, how long does it take for that project to actually launch into production? So, that's a...

That's a very nice metric on paper, but it turns out to be very hard to measure in practice. That was what I was going to ask you. It's so funny that you say that. Yep. I can tell you why, right? First of all, use cases are all different. There are use cases where you just need a linear model.

For another use case, you probably need to train a large language model from scratch. I'm just giving an example. So, you can imagine the time it takes for these two different projects to launch is totally different. And the second variable here is the team capabilities. An engineering team with five machine learning engineers, each with 10 years of experience,

probably can do things much faster than a team consisting of one applied scientist who just graduated. So we also see this kind of difference based on our experience. Unless they're going on vibes. These days, you know, the vibes can take you a long way. Jokes aside. Didn't mean to derail you. It's true that you do have these very different teams'

capabilities and maturity levels and how they attack the problems. And then you also have these different use cases that if you're training a large language model, it's very different than if you're just using a random forest model. And you can't take the average of those and then think, oh, our time to velocity is decreasing or over the last quarter it increased because...

We trained some large language models, you know? Yeah, it's really hard to systematically measure this metric. But what we do here is, we do two things. The first thing is we work with individual teams and just watch how they actually use Michelangelo to launch their project. So we get some anecdotal feedback from these teams. For example, our rider pricing team.

They shared feedback saying, "By using Michelangelo, it actually reduced their engineering cycles by 80% compared to building everything by themselves." Oh, interesting. Yeah, so we have such anecdotal feedback from a lot of users. It's subjective in a way. Subjective, and each team has their own way to measure speed or to measure the engineering cycles. The second thing we do here is that, since this North Star metric is really hard to measure, we can measure some proxy metrics.

Things like how many models we trained last week, how many models we deployed, how many evaluation pipelines we ran, how many eval reports were generated. All these proxy metrics can be an indirect indicator of your developer speed, your developer velocity. We track all this systematically at the Michelangelo level.
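As a rough illustration of the proxy-metric idea (not Uber's tracking code), counting platform events per week is already enough to get a velocity signal:

```python
# Illustrative only: roll weekly platform events up into the proxy metrics
# mentioned above (models trained, models deployed, eval pipelines run).
from collections import Counter

# Hypothetical event log; in practice this would come from platform telemetry.
events = [
    {"week": "2025-W26", "type": "model_trained"},
    {"week": "2025-W26", "type": "model_deployed"},
    {"week": "2025-W26", "type": "eval_pipeline_run"},
    {"week": "2025-W27", "type": "model_trained"},
]

weekly_counts = Counter((e["week"], e["type"]) for e in events)
for (week, kind), count in sorted(weekly_counts.items()):
    print(week, kind, count)
```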

It's funny. Now is probably a good moment to bring up, too, when you talk about evaluation pipelines. You're probably talking about the predictive ML evaluation pipelines rather than the new generative ones. Now you're talking about both. Actually both. Okay. But they're totally different. Okay. I got your point. I see where you're coming from. Yeah, yeah. They're totally different. So in a way, you're looking at

all these proxy metrics, you probably have a page or a spreadsheet of them and you can see if they're going up. We're doing more, so that's good, right? And less is bad. That means that there's something slowing people down, and you can correlate that to that North Star metric of velocity. Yeah. So we track this at two levels. One is at the Michelangelo level. We look at

all the training pipelines run last week or all the models deployed last week. That gives us a very good idea about the velocity or how people are using our platform. We also track this at project levels. For each project, we have a dedicated dashboard to show all these metrics just for that project. So we know if there are certain things going wrong with certain projects we need to fix. So we do it at both levels. It's worth noting that

All of these different projects, I'm sure you are powering a million different models that I can't even fathom, because I was telling you earlier before we hit record about a few blog posts that I read that were more on the predictive ML side. And it was a breath of fresh air to read these posts: one was around recommender systems in the application, and the other one was, I think, a multi-armed bandit scenario. Yes. And...

they were so unique. These use cases were like, wow, this is so cool to see how mature and how advanced these use cases are. And it made me think when you enable the teams, when you have something that's not blocking them and they can quickly iterate, like you were saying, evaluate if it's worthy to continue forward and then get it into production.

You can throw ML anywhere. It's like the more creative ideas, the better. Yes, you can. Yes, for sure. You can throw ML everywhere. But do you want to do it?

Because ML is expensive. Especially if you want to use deep learning, it's very expensive. Does your use case actually require machine learning or even deep learning? That's the first question you need to answer. Because machine learning doesn't come for free. There's a price to pay. It's so expensive. So do you actually want to do it? You want to evaluate on that front first.

Then start building your projects: prepare your data, train your models, deploy your models, things like that. And are you looking at the cost? Because I remember in the From Predictive to Generative AI blog, you talked about how there are different tiers of model support. Yes. And so if it's obviously the rider ETA, that's probably one of the most important models that can never go offline ever. And then if it's something maybe a little bit more experimental, you...

are more relaxed about it. Yes. When Michelangelo started back in 2016, at that time, our mission was to enable machine learning for Uber. Basically, get Uber started with machine learning.

And at that time, when Michelangelo started, we only had like three use cases on Michelangelo, but now we have like a thousand. Each one has their own dashboard. Each one has their own dashboard. And you are looking at, hey, if the metrics aren't going right, we've got to go look and see why. Yeah. For example, if your model performance degrades, it automatically sends out alerts to that team. And then they look into why. And for you, if they're not shipping enough,

then you're going to go look into why, whether there is something blocking them in the platform. Exactly. Is there some system bug that's blocking the development? Does the pipeline keep failing? Those are things we need to look at. But anyway, sorry, I distracted you, but thousands of dashboards, it just blows my mind. So whichever team wants to use machine learning, come to Michelangelo. We'll help you get started. And fast forward a few years,

actually, most of the teams had already incorporated machine learning into their core user flows. This is the end of 2019, early 2020. At that time, we actually pivoted our strategy a little bit. Instead of focusing on enabling machine learning, because that was already done, we pivoted to improving the performance of key projects.

That's when we introduced machine learning tiering systems. Based on the business impact, we actually tiered all the different machine learning projects at Uber into four different tiers, with tier one being the most business-critical project.

There are some examples, like you mentioned, pricing, matching, like rider-driver matching, and also ETA, fraud detection, those things. If this model doesn't work, it will cause a high-level outage for all the services. So these are tier-one projects. All the way to tier-four, these are some personal experimental projects. Like you mentioned, users just want to try different things. So those are tier-four projects.

So, today we have about 40 tier 1 projects, about 100 tier 2, and 500-600 tier 4 projects. Wow. So, that's how the tiering system works. And when it comes to prioritization, which project should we support first, then it's very clear. We should focus on tier 1 first, then tier 2, then move on to tier 3, tier 4.
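As a hedged sketch (illustrative names, not Uber's internal schema), the tiering described above boils down to a small taxonomy that support and prioritization decisions can key off:

```python
# Illustrative tiering: tier 1 = business-critical, in the core user flow
# (pricing, matching, ETA, fraud detection); tier 4 = personal experiments.
from enum import IntEnum

class Tier(IntEnum):
    TIER_1 = 1   # core user flow, outage if the model fails
    TIER_2 = 2
    TIER_3 = 3
    TIER_4 = 4   # personal / experimental projects

projects = [
    {"name": "rider-driver-matching", "tier": Tier.TIER_1},
    {"name": "eats-carousel-ranking", "tier": Tier.TIER_2},
    {"name": "experimental-churn-model", "tier": Tier.TIER_4},
]

# Lower tier number = supported first.
for p in sorted(projects, key=lambda p: p["tier"]):
    print(p["tier"].name, p["name"])
```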

We just don't have bandwidth to support tier 3 and tier 4, to be honest. The majority of the use cases still...

can bet that almost all of those tier one use cases are still predictive ML versus generative AI. Again, to be a tier one project, you have to be in Uber's core user flow. Meaning, when a user books a ride or orders food or buys some groceries, is your model actually in that critical path? If it's not, that means you're probably not a tier one project. So based on that definition,

100% of our tier one projects today are still predictive ML. Yeah. And I can imagine there's been a lot of people that have tried to think about ways to put generative AI into that core product. But when you think about the flow of me signing onto an app and then getting a car or getting food, where can you even put it in? Right?

Very good point. Very good question here. There are some scenarios and use cases, I think, where we can actually apply Gen AI to improve the experience. For example, we're actually working on two projects right now on the Uber Eats side. The first project is we use the large language model to improve the personalization experience, to generate

better descriptions to actually match, you know, users'

interests to what we actually show on the Eats home feed screen. That's one thing Gen AI can really help with. The second one is we actually use a large language model to improve the search quality. When users search for something in the Eats app, we actually use a large language model to improve that search quality. Because it can pick up if I'm saying I want something fancy.

Then it understands fancy a little bit more than a traditional model would. Semantic understanding. Yeah, that's one thing. And also, we can use the large language model to build a taxonomy of all the dishes and restaurants within Uber Eats. That's done with the large language model. In a way, it is a little bit of this recommending with LLMs, especially on that first use case you were talking about, where you understand me, I'm a vegetarian, I go on to Uber Eats, you don't show me any

and then when you talk about different restaurants, you're not gonna highlight their meat dishes, you're gonna highlight their vegetarian dishes. - Exactly, so in the past, I don't know if you noticed this, in the past if you log on to the Uber Eats app,

we have these carousels, right? Each carousel will show different restaurants or dishes. And in the past, we only had Korean food, Chinese food, burgers, hot dogs. Very boring. You look at it, you don't want to order. I don't want to order. It's just that it's Korean, Chinese. It doesn't really intrigue me, right? So what we do is, first of all, we use the large language model to try to understand the menus and dishes from all our restaurants,

then come up with that new set of carousel titles. Now we have, like,

spicy Sichuan food, meat lovers, and all those catchy carousel titles. And then we use the large language model to tag each of the restaurants with these new carousel titles. Then we use this information to match our eaters with the restaurants. Oh, fascinating. So that's how the large language model is used for recommendations. Yeah, and I can imagine also you could do it with

like user reviews, maybe not in Uber's case, you're not giving reviews after you eat, right? Normally? - We actually do, but not many people are doing that. - Yeah, yeah. - Yeah. - And that's good. I didn't even know that you had that. But in a way it's like that sentiment, you can see if somebody, or if there's many people that are ordering the same dish, then you can know that, all right, we really want to figure out how to message this dish correctly.

it feels like it's a bit of predictive and generative ML. Because if I'm ordering the same dish over and over, you don't need an LLM to be able to siphon that out, I guess. That's very clear tabular data in a way. Yeah. But that's a very good point. Coming back to the review thing, right? Actually, we just launched a project, a Gen AI project, into an online experiment. What we do is we...

We use the large language model to look at all the reviews for certain restaurants, and also the scores and other sources of data, then try to summarize what the eaters actually like or dislike about this restaurant. Then we give that feedback to the restaurant owners. They can make improvements on their dishes or on their services based on that feedback. That's one of the projects. We call this Customer Feedback Summary for Merchants.
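A rough sketch of how a feedback-summary job like that could be wired up; `call_llm` and the prompt are hypothetical placeholders, not Uber's implementation:

```python
# Hypothetical sketch: summarize a restaurant's reviews into actionable
# feedback for the merchant. call_llm stands in for whatever hosted LLM
# endpoint the platform exposes.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("placeholder for a hosted LLM completion call")

def summarize_feedback(restaurant: str, reviews: list[str], scores: list[int]) -> str:
    avg_score = sum(scores) / len(scores)
    prompt = (
        f"Restaurant: {restaurant}\n"
        f"Average score: {avg_score:.1f}\n"
        "Reviews:\n" + "\n".join(f"- {r}" for r in reviews) + "\n\n"
        "Summarize what eaters like and dislike, and suggest improvements "
        "the merchant could make to dishes or service."
    )
    return call_llm(prompt)
```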

Are there any other interesting use cases that you like talking about, or that you've seen and kind of got your mind blown by? I can talk broadly about how machine learning has been used at Uber, from predictive machine learning all the way to Gen AI. So, at any given moment, we have more than 5,000 models in production, making 28 million real-time predictions per second at peak.

So, that's the scale we are operating at. And everything happens on Michelangelo. Sorry, I have to brag. Yeah, so virtually every single interaction our users have with the Uber app involves machine learning under the hood. Let's take the Rider app, for example, right? This is all predictive machine learning so far. So, when you try to log into your account,

We actually use machine learning to detect if this is actually you trying to log into your account. We call this account takeover. Oh, wow. Yeah, this is part of our fraud detection mechanism here. Once you log in, machine learning is used for when you search for a destination. It's used for search. It's used for ranking your search results.

Once you identify your destinations, then machine learning is heavily used to match you to the driver for the pricing, for ETA calculation, even to recommend the right product to you. I don't know if you noticed this, but if you open the Uber app,

It actually shows you a lot of different products: UberX, Uber Black, everything, right? Comfort, all that. Yeah. Actually, that list is personalized. Everyone sees a different list. Uber thinks I'm way richer than I am because they're always recommending me Comfort. I'm like, give me the cheap option. Why are you giving me the more expensive one? Sometimes the machine knows you better than you know yourself. I don't like that. I don't like that at all. I do choose Comfort, though. They got me. I'm a sucker.

I'll let our product recommendation team know. Coming back to this, all the way to when you're on the trip, we give you the real ETA, that's also machine learning, and all the way to fraud detection and payments, and also, of course, customer support. Now we use Gen AI for customer support.

Then if we take the Uber Eats app, it's the same story. Machine learning is everywhere. We focus more on personalization, recommendation, and also Eats ETD. These are ETDs, estimated times of delivery. Oh, yeah. Okay. So, yeah, that's on the predictive machine learning side. Now let's talk about Gen AI, right? If we look at the Gen AI use cases at Uber, we can...

roughly divide them into three different categories. The first category is we call this magical user experience.

These are the Gen AI applications that directly impact our end users. For example, any of the chatbots, our customer support chatbots. We also have this earner copilot. We built a copilot chatbot for our drivers to answer their questions, to guide them on where to drive so that they can maximize their earnings. Things like that. It's still in the works, but that's something in progress.

And something else, like the personalization experience I just mentioned, that's one of the use cases belonging to the magical user experience. Then the second category here, we call it process automation. We basically use generative AI to automate, or further automate, some of our internal processes. For example, we use AI to generate

the menu item descriptions for restaurants. Because a lot of times, if you log into the Uber Eats app, you will see a lot of the menus only have the title. The dishes only have the title. There's no description at all. So we're using Gen AI to generate descriptions for 100% of all the dishes on Uber Eats. And what else do we do? We also use Gen AI to try to identify fraudulent accounts.

One of the key characteristics of these fraudulent accounts is that their usernames are usually gibberish usernames. Interesting. Yeah, like, "I love dog." That's probably not a real person. I thought it was just like F-X-Y-W-Y. Exactly. So we use the large language model to scan through all the accounts to try to identify these potential fraudulent accounts. Something else in this category is the

driver background check. When new drivers try to onboard to the platform, we do a very thorough, strict background check for each of the drivers. And we use a large language model to accelerate that process. Oh, wow. So that's the second category; we call this process automation. But process automation doesn't have anything to do with back-office processes of, hey, our

In a way, bureaucracy of when you need to do something and if I want to submit time off, maybe I need to- That's the third category. Oh, okay. What do you call that? We call this internal employee productivity. Okay, yeah. That makes sense. This is where we build all those tools. As you mentioned, the workflow automations. For example, we also have this, we build this data GPT, which is because Uber has

We have a vast amount of data, and we have so many data scientists, product analysts that need to analyze the data every day. Today, they do that by writing SQLs. We built this data GPT tool to help them, first of all, write better SQLs, and second, just query data with natural language.
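A minimal sketch of the text-to-SQL idea, assuming a generic completion function; `call_llm`, the schema, and the prompt are illustrative, not the internals of Uber's data GPT tool:

```python
# Illustrative text-to-SQL flow: give the model the table schema plus the
# analyst's question, get back a SQL query to review and run.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("placeholder for a hosted LLM completion call")

SCHEMA = "trips(trip_id BIGINT, city VARCHAR, fare_usd DOUBLE, requested_at TIMESTAMP)"

def question_to_sql(question: str) -> str:
    prompt = (
        "You write SQL for the schema below. Return only the query.\n"
        f"Schema: {SCHEMA}\n"
        f"Question: {question}\n"
    )
    return call_llm(prompt)

# Example: question_to_sql("Average fare per city last week")
```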

So that's one of the internal employee productivity projects. This text-to-SQL is becoming a very standard Gen AI use case. You see it in so many companies. And that lift that you get by allowing the data analysts to

write more queries more quickly or just get answers quicker is very valuable and it's something that you can show value really easily. The other stuff maybe it's debatable. If you do it right. Let's keep that in mind. Yeah. You can do things really fast but it has to be

You have to give the correct answer. That's the key. Yeah, because you can spend 20 minutes trying to get a simple answer because you didn't do it right. Yeah, exactly. Yeah, so you're talking about doing it right from the end-user perspective, not doing it right from the infrastructure side of things and how you're architecting the agent to go in and write this SQL. Actually both, because from the infra side, if you do it wrong from the infra side... You're screwed. Yeah. Let's forget about the end-user experience. Exactly. Yeah.

There is another thing I wanted to talk about, which is the scale and all of this inference. And what are some gnarly challenges that you've gotten into because of the amount of models that you're using and you're now using more?

not smaller models, but you're using all these predictive models. And so those have a certain style and flavor of needing to use inference. And then you're using the generative models and those are a different style and flavor. And so how do you attack that within Michelangelo to be able to serve both of those? Yeah, very good question here. So I think there are a few things we...

we did to enable Gen AI, to extend Michelangelo from supporting predictive machine learning to Gen AI. First is the compute resources, the compute amount. It's totally different. Yeah. Right now, you need

a lot more GPUs and high-end GPUs, H100, all those GPUs. Tier 4 doesn't get that kind of stuff, huh? Usually no. Oh, by the way, that's a very good point. So we prioritize higher tiers over lower tiers for sure. We also prioritize production jobs over offline training jobs, batch jobs. That makes sense, right?
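A toy restatement of that prioritization rule (tier first, production over batch within a tier); the fields and ordering just encode what is said above, not the real scheduler:

```python
# Toy prioritization: lower tier number wins, and production jobs beat
# offline training / batch jobs within the same tier.
jobs = [
    {"name": "tier1-eta-serving", "tier": 1, "production": True},
    {"name": "tier1-eta-retrain-batch", "tier": 1, "production": False},
    {"name": "tier3-genai-finetune", "tier": 3, "production": False},
]

def priority(job: dict) -> tuple[int, int]:
    return (job["tier"], 0 if job["production"] else 1)

for job in sorted(jobs, key=priority):
    print(job["name"])
```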

Yeah, yeah, yeah. Okay. That's how we prioritize. But coming back to the infrastructure for Gen AI, that's one thing. We need to procure more compute resources. That's the prerequisite for you to do anything for Gen AI. Yeah.

The tech stack has to change, has to evolve. For Gen AI models, because the model is now much larger, it usually doesn't fit into one GPU, so you need model parallelism. For traditional deep learning, you can usually get by with data parallelism. It's totally enough. To enable Gen AI, because of this restriction, we actually integrated DeepSpeed

and use DeepSpeed in conjunction with Ray. Both together? Oh yeah, you have to. One is the orchestration layer, the other is the model optimization layer. So we use both for large language model fine-tuning. We also enabled Triton from NVIDIA for serving large language models. And now,

we actually use the same infrastructure to serve one of our large traditional deep learning models, like our recommendation models. So it's not just for Gen AI. We also use that for some of our traditional machine learning, which is really cool. Wow. So you had to add a few new tools into the tech stack.
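As a minimal sketch of pairing the two layers (assuming Ray Train 2.x and DeepSpeed; the model and data loaders are placeholders, and this is not Uber's actual fine-tuning code): Ray schedules the workers across the GPU cluster, while DeepSpeed's ZeRO sharding provides the model parallelism that lets a model too big for one GPU train at all.

```python
# Sketch only: Ray for orchestration, DeepSpeed for model parallelism.
import deepspeed
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

# ZeRO stage 3 shards parameters, gradients, and optimizer state across GPUs.
DS_CONFIG = {
    "train_micro_batch_size_per_gpu": 1,
    "zero_optimization": {"stage": 3},
    "bf16": {"enabled": True},
}

def build_model():
    raise NotImplementedError("placeholder: load the base LLM here")

def build_dataloader():
    raise NotImplementedError("placeholder: yield tokenized training batches")

def train_loop_per_worker():
    model = build_model()
    engine, _, _, _ = deepspeed.initialize(
        model=model,
        model_parameters=model.parameters(),
        config=DS_CONFIG,
    )
    for batch in build_dataloader():
        loss = engine(**batch).loss    # assumes the model returns a loss
        engine.backward(loss)
        engine.step()

# Ray handles placing the workers onto the GPU cluster.
trainer = TorchTrainer(
    train_loop_per_worker,
    scaling_config=ScalingConfig(num_workers=8, use_gpu=True),
)
# trainer.fit()
```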

But then when you did that, you found those tools help with the other stuff. It's actually useful for something else that we've already been doing for years. For those bigger deep learning models, but not for the smaller models. No, not for tree models. You don't need that. Yeah, it's not needed. And then they can choose what they want as far as that? Or are you the one that's optimizing the...

On the inference side, it's us. When we talk about optimization of serving a model, it's always two things. One is the optimization on the inference side. That's what we do. The other thing is on the modeling side.

That's what our users do. For example, quantization, that's something they should do. But we should support quantized model serving. That's on the infra side. It's always a collaboration. We also collaborate with other teams. I imagine they're coming to you all the time asking for support for certain things like, "Oh, well, now we want to distill this model. Can we figure out how to make it easier for us to distill it or compress it or quantize it?" All of that.
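For the model-side half of that collaboration, dynamic quantization is the kind of step a user might take before handing a model to serving; a minimal PyTorch sketch with a stand-in model (not one of Uber's):

```python
# Minimal dynamic quantization: convert Linear layers to int8 to shrink the
# model and speed up CPU inference; serving then has to accept this artifact.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 1))  # stand-in
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)
torch.save(quantized.state_dict(), "model_int8.pt")
```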

How do you go about choosing, because there are no tiers there, are there? Or are you still bringing back in this idea of the impact? They're still impact-based. They're still impact-based. There are Gen AI use cases with larger impact compared to the other, more experimental Gen AI projects. So even though they're still tier three today, because they're not in the core user flow, they still can prove their impact, their business impact.

So as long as they can prove their business impact, we can prioritize them. We talked about generative AI use cases. How are you looking at

the agentic use cases. And when the agents are going and doing things, at the end of the day, it still is just, hey, you've got models, you're focused on that inference of those models. Are you adding extra infra support for the agents, quote unquote agents? Because as we said before, it's probably good to define what you think when you think of agents. But

in my eyes, it's like the LLMs or the generative AI that can go and actually do stuff, as opposed to just giving you answers. Yes. So starting this year, our focus has shifted to support exactly what you just mentioned. We want to actually enable Uber to build agentic AI systems going forward. So we're extending Michelangelo. In the past, we called it more like a model studio, where you manage your models and all the

relevant components related to model training and serving. Now we want to also build an agent studio for agent ops, for you to actually build, evaluate, deploy your agent and manage your agents. So that's something actually in the works. I just got out of a meeting this morning with our engineering team who is actually building this tool right now. That's very cool to know. So how are you looking at it and

Is it any different than the other stuff? Or if so, maybe the better question is, how is it different? In the past, all that we care about is the model. We make sure you can train the model, then deploy the model to endpoints. Now you can call the endpoint to make predictions. That's what we care about as the platform team.

but the agent is different. The agent is the application by itself. We're now at the application level. We have the model studio here, the agent studio on top of the model studio. In the agent studio, we leverage the models we build in the model studio. It's at the whole application level.

How do you actually accelerate that agent creation, evaluation, and deployment flow? That's something we are actually evaluating and deciding. How do we allow our users to quickly spin up some new agents?

evaluate agent performance, then go back to iterate on agent and evaluate again and deploy. Same thing as what we do for models, but now it's more at the application level. That's a major, I think that's a major. How do you make sure that the agent has the right

permissions, so you're not writing to some database that it shouldn't be, and all that stuff. Oh, very good point. Yes, very good point. Because one of the major value props of the Agent Studio on Michelangelo is that no matter what model you use, we allow you to access Uber internal data and also our internal tools. And then the security question you just mentioned comes into the picture.

There are certain data you have access to, and there are some that you don't. To enforce that, we actually work with our engineering security team to build this security protocol for these agents.
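A rough sketch of what enforcement at the tool-call boundary might look like; every name here is hypothetical, since the actual protocol is still being built:

```python
# Hypothetical permission gate: an agent acts on behalf of a user or team,
# and every tool/data call is checked against that principal's grants.
GRANTS = {
    "eats-growth-team": {"eats_orders.read", "menus.read"},
    "alice": {"eats_orders.read"},
}

def check_access(principal: str, permission: str) -> None:
    if permission not in GRANTS.get(principal, set()):
        raise PermissionError(f"{principal} lacks {permission}")

def agent_tool_call(principal: str, tool: str, permission: str) -> str:
    check_access(principal, permission)   # enforced before the tool runs
    return f"{tool} executed on behalf of {principal}"

# agent_tool_call("alice", "query_orders", "eats_orders.read")  # allowed
# agent_tool_call("alice", "edit_menu", "menus.write")          # raises
```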

So if I create an agent, then it knows my permissions. Yes. And it is only allowed to do those things. Yeah, it's either yourself or your team. Yeah. So it depends on your authentication. But then the utopian scenario here is that

you have this agent studio and another team can come and grab my agent that I built and now start implementing it for their use case or they tune it and do a little bit more and now cool, I'm up and running with an agent. I didn't have to build it from scratch. Exactly. But the big question in my mind is

does it automatically know that now it's another team, it's another person, and now they have these permissions? It should. It should. So we're still building this. It's not in place yet, but we're still building that. And the thing you just mentioned is actually we call it agent registry. So we have a repo of all the agents built within Uber.

And every single team can look into, you know, look at that agent repo to see if there's anything already existing so that I can just reuse. And you're going to have that because you have the model repo. That's classical one. Most folks have that. But then...

I've read the prompt toolkit blog that you have. Yeah, the prompt repo, yes. And you have the prompt repo, which is another great one, where folks come and they see, oh, here's a prompt for this model, and it is for this specific thing. Somebody already spent the time to tune this prompt. I don't have to start from zero. Yes, yes. And now we have agent repo. We're also building the...

MCP repo. MCP is a cool thing now. Yeah. Yeah. Yeah. Yeah. We want to build MCPs for a lot of the Uber internal services so that it can be leveraged by the users to build agents. You're thinking, all right, well, all of the internal Uber tools should have their own MCP server. Not all of them, but some of them. Yeah. We look at which tools are used a lot.

today by our users, and then we want to build MCPs for those tools. There are agent builders and then there are tool builders, in a way. And maybe it's the same person. It can be the same person. It just depends on the state or

their moment in time: they're building tools or they're building agents. But at the end of the day, I think all of us want to be building agents. Yes. And not so much tools. And at Uber, usually the tool owners build the MCP server for that tool. And that will be used by all the other agent builders. And then they'll say, hey, this is standard practice. This is what we want you to use the tool for. And I publish my MCP server to the MCP repo. Now you

read the documentation to see if it's useful for you, and follow that to use it. Yeah. That's how we envision it. It's still in the works. Yeah. I can imagine you'll come up with some cool stuff again, because you can throw AI anywhere, right? Or ML on...

helping folks select tools. Oh, yeah. And helping that search, getting that search really clean on, oh, I want this. And so it's almost like I describe what I want and I see a world where that internal tool can look like, okay,

Maybe you want to use these agents, or maybe you want to use these MCP servers, and they have these tools that you can leverage. So we talked about supporting different inference, right? But what about supporting different evaluation? I think evaluating a generative AI application or agent is a totally different story compared to all the evaluation for predictive machine learning,

where you know exactly what metrics you're looking for. You measure the AUC, precision, recall,

and you have standard pipelines or standard frameworks. Maybe you have ground truth. Yeah, exactly. To measure the accuracy there. But for Gen AI, it's totally different. I think this is still a problem the whole industry is trying to figure out, how to do the evaluation right. But at Uber, we built our own, we call it the Gen AI evaluation framework, which allows users to do two things. One is

LLM as a judge. Basically, use another LLM to judge the output from these LLMs. I think that's what a lot of teams out there in the industry are using. The second is to include the human in the loop. Basically, whenever you need a human to make a judgment, you just involve a human to do the evaluation. Those are the two major methods we're using today. And also the third one is,

yeah, of course, you can provide a golden dataset even for your Gen AI use cases, right? Provide a golden dataset, then use the golden dataset to evaluate performance. Those are the few things we've been working on. Sweet, dude. You wanted to touch on that? So we are, yeah, we're actually open sourcing Michelangelo. Do you want to touch on this? No. We are. What? Yeah. Holy shit. Yeah. Really? Yeah. Our plan is, over the next

two years, to open source the full Michelangelo, starting with our orchestration framework called UniFlow. Wow, that's huge news. That is so wild. So you're open sourcing the orchestration framework first and then everything else?

And is it going to come out in spurts, or is it going to come out... you basically have to clean the code base? Oh yeah, a lot of cleaning work is in the works. We have to clean the code base. And our plan is, at least for this year, 2025, to selectively work with some enterprise partners in a closed-source fashion. So we give

them access to our open source repo and they can contribute. Then next year we probably want to fully open the repo to the whole community. That is so cool. That's why I was asking about the prompt toolkit and if it was ever going to be open source then... It would be open source. Yeah. Probably...

Probably in H2, sometime Q4 maybe. This year? This year. Oh, so there's going to be some things that will come out super fast. We already have something. But again, this year is all closed source. Only those few partner teams have access. Closed open. Yeah, closed open. But next year we try to fully open. Wow. That's so cool. Why now?

When Michelangelo started back in 2016, there were not many options out there. We had no choice. We had to build everything by ourselves from scratch. Of course, we used open-source technology like Spark and other things. But then fast forward to now, 9 or 10 years later,

there are so many startups out there. All the cloud players provide their own MLOps tools, and there is such a huge MLOps community out there, like yours. We do think allowing external contributions to Michelangelo can actually drastically accelerate the innovation of Michelangelo. That's why we want to open source this, which in turn will benefit our Uber ML community.

And also, the other thing is that now we are in this era of Gen AI, right? And if machine learning has been advancing fast, Gen AI is advancing even faster. So, our team, we have like 100 people, but it's still a small team compared to the whole community. It's hard for us, to be honest, it's hard for us to keep pace with all the advancements in the industry. So, again, that's why we want to

allow external contributions so that we can keep ourselves at the forefront of the technology advancement. All right, so that's awesome. Now, I always wondered what it would be like to work at Uber. I am not the big company type. I think I would get fired very fast. HR would have a problem with the things that I say.

But I want to know, what do you like about working at Uber, and what do you not like about working there? Sure. I think what I like, first things first, is the engineering team. The Michelangelo engineering team is truly world class. They move really fast.

They can get shit done. Okay, let's put it that way. Incredibly collaborative, very supportive of your PM work. As a product manager, I could not ask for a better engineering partner. That's the thing I like most about working in my current position.

Personally, I do believe product management is the best job in the world, if done right. I'm deeply passionate about AI and MLOps, so my current job is a perfect combination of both. I've really enjoyed every single minute of the past four years at Uber. What else?

Yeah, as I mentioned, since all of Uber's machine learning and AI use cases are managed by Michelangelo, it gives me this front-row seat, right, to see how machine learning is actually driving business impact across Uber. From pricing, to recommendations on Eats, to fraud detection, all the way to Gen AI chatbots, it has been a fantastic experience, I want to say. Yeah.