So I'm Maria, I'm an MLOps tech lead at Ahold Delhaize. I'm also a co-founder of Marvelous MLOps, a company where we teach other people about machine learning, MLOps, and MLOps on Databricks specifically as well. I'm very critical when it comes to my coffee. You probably know about that already. I like to drink a latte with oat milk, and you need to have the perfect amount of coffee, perfectly ground, perfectly micro, how do you call it, foamed. So yeah, I can make a better coffee than most coffee shops.
We are back for another MLOps Community Podcast. I am your host, Demetrios, and today I have the pleasure of speaking with Maria from Marvelous MLOps. If you are in the MLOps universe at all, you probably have heard of her. This episode does a gigantic deep dive. I cannot stress enough how deep we go into Databricks: the pros, the cons, the good, the bad, the ugly, everything about it.
Stick around at the end. She talks a bit about the course that she's going to be doing and also the book that she is writing. So if you've ever had the urge, or if you are currently working with Databricks and you're doing some kind of ML/AI project,
you should probably get in touch with her, because she is by far the expert in this world. Let's dive into it. Everything you need to know about Databricks and MLOps on Databricks. The last time that I spoke with you, you were telling me how, yeah, I think I'm going to start posting more on LinkedIn. And oh boy, did that work out. Yeah, indeed. Yeah.
It was two years ago, actually. It was exactly two years ago we recorded the first podcast together. And a lot has changed since then, a lot. Well, what have you been up to? I think two years ago I had this urge to start posting about what I do on LinkedIn and also writing articles. Because I have seen so many experts on LinkedIn and everywhere, you know, talking about MLOps and what it is.
And I thought that a lot of things that are out there are just not true. And I've been doing this for many years, so probably my voice should also be heard. And well, I kept going since then. It's really, really cool. I met a lot of awesome people. And yeah, I love it. And you've focused the majority of your content, not necessarily just on LinkedIn but really everything you do beyond LinkedIn, on deep diving into Databricks. And I would love to talk
about Databricks, like why you chose that. This is not sponsored by Databricks by any means, although they are a proud sponsor of the MLOps community and we love working with them. But this specifically, I think I want to talk about the good, the bad, and the ugly from somebody who's a user, somebody who's done the majority of the certificates. And also now you're giving courses on how to better use Databricks.
I would love to just start with like why you even chose Databricks particularly.
Yeah, that's really a good question. So I've been doing MLOps for a very long time, like more than eight years, before it became a thing. We built our own tooling around model registry and experiment tracking, using some, you know, Teradata databases or any other SQL database as a backend and JFrog Artifactory for model artifact storage. And
We were doing it like when no one was doing it and it was really, really cool. And since then I tried doing MLOps with all kinds of different tools, on the cloud, on-prem, and well, as I said, built tools myself. And I feel like I've seen it all to a certain extent, like how do you connect different pieces of your ML setup to make sure that it's robust and you can roll back things whenever needed.
And there are a lot of platforms and there are a lot of tools. So when you look at the tools, you can find almost the perfect tool for, you know, for each of the components of your ML setup. But it is complicated, you know, combining them all together and you need to invest a lot into that and have a big team to be able to do that.
Platforms, on the other side, are less customizable. But they have all these MLOps components which you need, like a model registry, compute, a serving component, a monitoring component, things like that for data versioning.
They have them all. They're not perfect, but you still may want to get them working for your use case instead of, you know, trying to combine all the existing tools. If you're on-prem, you probably don't really have a choice, you have to go open source.
And then you have this luxury of choosing. But when you are on the cloud, if you are in a large organization, bringing in any tool is really, really complicated. You know, all these sourcing procedures, getting trusted vendors, all of that. It's a lot of pain for everyone involved in the process. So you want to go through this process only if you really, really want to and have to.
But platforms really got good enough, I feel. It wasn't the case like three years ago, but now platforms are good enough for pretty much everything you need to do for MLOps. However, if you look on the internet at what people say, you know, how people tell you you should do things, in my opinion they're just not correct. They're not promoting best practices. Like if you look at Databricks-specific training, it's all around notebooks.
If someone follows me on LinkedIn, you know, I don't really like notebooks. And for a reason. You know, I started as a data scientist myself. I used notebooks a lot myself. That was the only way Python was taught to data scientists, you know, in 2016. Any course was just plain notebooks.
And when I started learning Python, that was the way. And then I started trying to put my first models to production, like build an endpoint. That was the use case we had. And I saw how much struggle it was to get from a notebook to something production-ready. It was a lot of pain. And that was the first revelation for me, like why notebooks are not really a great place to start.
And unlearning using notebooks is really, really hard, you know, because people were using it for years. They know how to do it easily. You know how it is. If you use a certain tool, getting to another tool, another way of doing things is extremely hard. And the longer you are doing it, the harder it is. It's just like with anything in life.
And in my opinion, notebooks harm the MLOps lifecycle much more than anything else out there. So that's just how I feel about it. So we need to teach data scientists to do, you know, data science properly, not using notebooks. Or, you know, at least teach them how to translate a notebook into something that is production-ready, more or less.
And that's what I try to talk about a lot. And also that's what we teach in the course. So why Databricks specifically, that's maybe another question. Well, if you look at all the platforms that are out there, there was a
survey, I think, that the Institute for Ethical AI did. It was a questionnaire, and they asked professionals in the field what tools they are using for model registry, for data lake, for serving, for training, and Databricks came out as the number one tool. And it's not a surprise for me, to be honest. Databricks is growing like crazy.
So Databricks pretty much becomes the tool of choice for ML these days in many companies just because it's easy. Data engineering is done on Databricks. All the data is in Unity Catalog and Databricks is just a logical, you know, step forward for ML as well to train your models and maybe even serve and do monitoring on that as well. So there are many options out there.
And also, on the other hand, it's not just the leading platform. I feel it's very different from other vendors that I've seen. So they're very open to people criticizing them. And when you tell them you don't like something, it's not like they're surprised. They actually know that that's the case, but it's just not their first priority, because they have other things that are more important for them to fix.
And, yeah, if I have thought about a cool feature on Databricks that might be useful, they have thought about it too. And they are probably working on it already. So that's something that I haven't seen in any other vendor out there. And it's pretty impressive. So there's so much to unpack here. Let me try to...
go about it because the first thing that jumped to my mind was you were mentioning how platforms these days feel good enough. They weren't three years ago, but now whether you're using, I imagine when you say platforms, you're talking about the Vertex AIs and the SageMakers out there and the Databricks platforms. And so you can get enough from that platform to where it is not necessary to go out and
cobble together many different tools. Yes, indeed. The other piece that I think is crucial that you mentioned is we've all been through trying to onboard a new vendor and we've probably...
thrown our fists up in the air and said, I'm never doing this again. I do not want to answer any more questions from the DevOps folks or the DevSecOps team about this vendor or the compliance and the ISO or the SOC2, whatever it may be. It takes so long. I literally just went through this where I chose one tool
Only because they were already accepted into the ecosystem and we didn't need to then do the onboarding process. Even though I knew it was a worse tool, I wanted to get a different tool that I enjoy much more. But it's like, oh, if we want to do that, then we're looking at like a two month period.
lead time to onboard that new vendor versus let's just use what we have and amplify the capabilities of this tool that's already been accepted. And so it feels like
That is where a lot of us stand. And that also is one of the reasons that you chose Databricks, because you saw that survey and you said, it looks like a lot of people are on it. I can only see this growing because their capabilities are growing. And with the stuff that folks are doing on Databricks, it's inevitable that it's going to grow if you're doing your data engineering there.
the data has gravity and you're going to probably start branching out, if you haven't already, on different use cases. Yeah. So, well, the reason why we did everything in Databricks is because we already had Databricks. And Databricks was not per se perfect three years ago, or a bit more than that. But even back then you could already do enough things on it, you know, for it to be
pretty okay. And it was still a better choice than the Azure components that we also had access to. So basically the company I work for is on Azure and we also use Databricks. And if I have to choose between Azure and Databricks, I would always choose Databricks just because it's way easier, in my opinion, to do MLOps on it. That's the reason why. But I also want to know
The bad part, you mentioned that Databricks is pretty open to criticism. There are things, though, that I know when I talk to you,
you said, yeah, this is how it's done in Databricks, but it's probably not the best way to do it. Whether that's just trying to work in notebooks and pretending like those are ready for production. Or I know we had also talked about the Databricks feature store and how that isn't necessarily the best way of doing things. So maybe you have some...
best practices that you found as you've been, yeah, going through Databricks and learning about it or getting deeper on it? Yeah, definitely. So I think one of the biggest pain points, still until today, is the development process on Databricks. So as I mentioned, everything is around notebooks, and I think ML code must be packaged. So that's just how it is.
If you want to have a professionally written ML code, it must be packaged. It is possible to do on Databricks, it's just not very straightforward how. So, for example, Databricks comes with the notion of runtimes. It's pretty much a containerized version of your environment. It has already pre-installed packages and other programs, other software. And you basically want to have a reproducible environment locally to develop.
because if you develop in a notebook, let's say, then you can't really reproduce exactly the same environment locally. It's just impossible. You can get an approximation of that. But because it's an approximation, you can only develop to a certain way, a certain state, and to test, you want to test it on Databricks again. But you don't want to go back and forth between notebooks and local development.
So there are other ways, like using asset bundles, for example. That's something that I absolutely love using, and it is an underestimated, I think, way of developing on Databricks. And we actually have a lightning session about it next week. Probably it's not going to come out before this podcast on time, but there will be a recording. So we can maybe insert the link somewhere. And an asset bundle is just a feature
of Databricks? Well, an asset bundle indeed is a feature that is developed by Databricks. It has a lot of components, I would say, by itself. But first of all, it's a way of defining a workflow. So if you have an orchestration pipeline in Databricks, you can define it in a JSON file, or you can define it in a databricks.yml file, which is the definition of the asset bundle.
And whenever you deploy that bundle, your workflow definition gets deployed together with all the assets and packages and other files that are required for your deployment.
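For illustration, here is a minimal sketch of what such a databricks.yml could look like. All names, hosts, and paths are hypothetical, and a real bundle would also need a cluster or serverless environment spec for the task:

```yaml
# databricks.yml -- minimal sketch of an asset bundle (hypothetical names)
bundle:
  name: my_ml_project

artifacts:
  default:
    type: whl          # build this repo into a Python wheel on deploy
    path: .

targets:
  dev:
    mode: development
    workspace:
      host: https://adb-1234567890.12.azuredatabricks.net  # hypothetical

resources:
  jobs:
    train_model:
      name: train-model
      tasks:
        - task_key: train
          python_wheel_task:
            package_name: my_ml_project
            entry_point: train
          libraries:
            - whl: ./dist/*.whl  # the wheel built from this repo
```

Deploying with `databricks bundle deploy -t dev` then uploads the wheel and the other files and creates the workflow, and `databricks bundle run train_model -t dev` triggers it.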
It used to be really, really hard to deploy things in Databricks because you had to take care of all of that, like that your packages get uploaded to Databricks, that your Python files get uploaded to Databricks and all other files that you need. And you had to build your own logic around it to make sure it was all there. We actually built the whole thing for it. We basically built something very similar to Asset Bundles internally. And now we are deprecating this because Asset Bundles does it all.
But it's not just for the workflows. It's also for development. And that's something that people don't talk about much, I feel. People use it for deployment, but people don't really use it for development. But I believe for development, it's also a really, really nice feature. So that's a great one on this development process. What exactly was it that we were talking about the other day with the feature store and how that works?
Well, there are many feature stores available out there, right? It's not just Databricks, but you have Feast, Hopsworks, and other tools, like Tecton.
So other tools focus more on just feature stores, but Databricks has all kinds of components in it. So if you look at the Databricks feature store, there are pretty much two constructs, like two things through which you can interact with the feature store: a feature function and a feature lookup.
Feature lookup is basically defining how you look up a key in the feature table. So if you want to look up a customer ID and return some values in the table for that customer ID, that's what you could use for that.
it doesn't really have a fallback, so if it doesn't find it, it will just return None, which is by itself not very convenient. So to replace that, you could use a feature function, for example: if that is None, then look up another thing or return that value instead. So the feature function by itself, I think, is quite an ugly construct. You have to define your Python function in SQL, and that's the only way.
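As a sketch of those two constructs (catalog, table, and function names here are all hypothetical), a lookup plus a fallback function might look roughly like this:

```python
from databricks.feature_engineering import (
    FeatureEngineeringClient,
    FeatureFunction,
    FeatureLookup,
)

fe = FeatureEngineeringClient()

# A feature lookup: join features from a Unity Catalog table on a key.
# Keys that are missing from the table come back as None.
customer_features = FeatureLookup(
    table_name="main.ml.customer_features",   # hypothetical feature table
    lookup_key="customer_id",
    feature_names=["avg_basket_value"],
)

# A feature function is a Python UDF that has to be registered through SQL.
# `spark` is the SparkSession that a Databricks notebook provides.
spark.sql("""
CREATE OR REPLACE FUNCTION main.ml.basket_value_or_default(avg_basket_value DOUBLE)
RETURNS DOUBLE
LANGUAGE PYTHON
AS $$
return avg_basket_value if avg_basket_value is not None else 0.0
$$
""")

basket_value = FeatureFunction(
    udf_name="main.ml.basket_value_or_default",
    input_bindings={"avg_basket_value": "avg_basket_value"},
    output_name="basket_value_clean",
)
```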
Like, I don't know who came up with it in the first place, but I'm just not a fan of it for many reasons. First of all, there is no versioning of that thing, right? There is, of course, version control that you have, but when you create this function, there is no way to point it to a version of code or like a version of that function that is used. Also, that function will behave differently depending on the runtime you use.
Of course, because, you know, the Python version and the versions of the Python libraries you use in the import statements in your SQL query, it will behave differently if you are on Python 3.10 or Python 3.11, and that Python version is defined by the runtime. And when you are running on serverless, it's even worse, because you can't choose a runtime, but there is a concept of an environment in serverless.
So it may get very confusing for people, like how the function by itself behaves. Also, that function has certain limitations when it comes to serving. So for example, there is also a thing called feature serving on Databricks. And the idea is actually quite good, right? You want to just serve features sometimes and not just models. Like you want to look up a certain key in the table and return things back.
It's quite convenient and you can also serve like a combination of this feature function and feature lookups like a stack. It's a list that you define and the order of the list defines the order of execution of these elements. There is no conditional statement, so it always, all of these things always get executed. And, you know, all this together is defined as a feature spec and that's how it's called and that's what you serve.
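Continuing the hedged sketch above, a feature spec is roughly the list of lookups and functions tied together under a name (again, the spec name is hypothetical):

```python
# Combine the lookup and the function into a servable feature spec.
# The list order matters: all elements always execute, in order, and
# there is no conditional statement.
fe.create_feature_spec(
    name="main.ml.customer_feature_spec",      # hypothetical spec name
    features=[customer_features, basket_value],
)
```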
So the feature function there, if you have to output some complex data types, like, I don't know, something that is not an integer or a string, it will fail. I hope they will fix it soon, because it seems like not really intended behavior.
So things like that. Also, the feature engineering package itself only works in a notebook or in a Databricks environment. You can't run that thing, that Python code, on your machine. So I can go on and on with this, to be honest. I'm not a fan of this feature.
It sounds like a mess, to be honest. Yeah, in my opinion, it's quite a big mess. And that's too bad, because I think there is a great potential in it. And another thing that I find one of the most frustrating pieces, to be honest, is this:
Most of machine learning is done in pandas still, right? By pandas, I mean, well, now Polars is becoming a thing and people actually migrate to Polars, but still not all the libraries support it. So pandas is still pretty much mainstream.
So feature traceability, this lineage, only works when you use PySpark. So whenever you convert it to Pandas or something else, it's not going to work any longer. That's kind of weird. Wait, did you tell me, though, that there's a workaround for that? Yeah, there are some ugly workarounds for that that we actually teach, but...
Yeah, why would you design it like that? Yeah, that feels like it is part of the downfalls that you get with a managed system. But it's also that trade-off that you're deciding, hey, I would like a managed system, so I want certain decisions to be made for me, right? And it's inherent that if you're using a platform, they're going to have opinions about how to do things.
This is one of those times where your opinions and the builders of the platform's opinions diverge drastically. Yeah, on this specific feature, I would say so. But I think, well, this is just one of the things that I actually really don't like about it. But a lot of other things are awesome. Like I said, bundles is really great and stuff.
The way you define workflows drastically improved compared to how it used to be. I think those are also really good. The way you do training on Databricks. There are a lot of really cool parts. I think most of the things are cool parts. So I guess that's one of the reasons why I also talk about Databricks. If all of that was not great, I wouldn't be. Now, are you...
diving into any of the Mosaic side of things, or how they've incorporated that into Databricks? So not in the course that we have now, but we are going to launch an LLMOps course, and also I'm touching on it in the book that I'm writing. So I cover MLOps, but also LLMOps use cases, where indeed the Mosaic part is covered as well.
So that's the term we're going to run with, huh? We're going to use LLMOps. Well, I don't know what else you would call it. For me, everything is MLOps, to be honest. Yeah, I think we are very biased, because for me too, but LLMOps, I never felt like it
was a term that had sticking power, and it's an interesting one that... Or AIOps. AIOps doesn't sound really good. No, AIOps I also interpret as AI for operations. So like using AI to get fewer alerts in Datadog. Oh yeah, yeah, yeah, indeed. Yeah.
But yeah, I don't know what we can call it besides vibe ops. That's my new term of the week. Let's just call it MLOps. I think MLOps' popularity as a term is actually quite big. So I'm pushing for it. Yeah. And I do think it's gone through the...
hype cycle and now it's on the uptick again. Like, of course, when LLMs came out, it went down and it was in that
trough of disillusionment, and now it's coming back up, because I think folks are realizing, okay, well, we need to figure out our production environments no matter what. Whether we're using LLMs or traditional ML, it's kind of similar. Yeah, for sure. There are way more similarities than people would like to think.
And well, I give this example quite a lot, but if you look at, you know, the data science hype cycle, data science terms started appearing around 2015, 2016 or something like that. And it took another three years or so, maybe five, before MLOps became a real thing. And I guess we are going through the same cycle these days, but the cycle will be even bigger. I mean, AI...
popularity has grown way further than ever happened with data science in the past. And because of that, I also expect that the next MLOps hype will happen faster, but it's also going to be much bigger than we've ever seen before. And well, thanks to vibe coding, we will have a lot of work to do. Yeah, that's good job security. That is for sure. Yeah.
Are there times that you have been working with folks and you've recommended that they don't use Databricks?
Oh yeah, for sure. Definitely. I think everyone should use whatever makes sense for the situation you're in. So for example, we use Databricks a lot. Also for model serving, we have some model serving on Databricks, but I wouldn't recommend it everywhere. So one of the situations I would definitely not recommend it is when the whole website is hosted by you, it's running on your Kubernetes,
And so if you want low latency, you would want to host your model and serve it on exactly the same Kubernetes cluster, not anywhere else. So yeah, definitely in this situation, don't use Databricks for that. Because then the anti-pattern would be, oh, we're going to bring in Databricks and have it be outside of our Kubernetes cluster. It would be an anti-pattern because it's going to be slower, 100%.
Also, Databricks is not going to work for everyone. I'm not talking about the serving part, which I think can be quite useful because it simplifies things a lot. But it's not going to work for everyone. So there is a limitation of 20,000 requests per second for the whole workspace. And that's
you know, under some assumptions. So realistically, it's going to be less than that. And that's for the whole workspace. If you have multiple endpoints on the workspace, it all is covered under this umbrella limitation.
For some companies, it's enough. For some, it's never going to be enough. So then you just need to pick the tools that make sense for what you're doing. That's always going to be the case. And for most companies, it's going to be fine because they don't have any hard requirements or anything like that. And most of whatever is done is still batch. Yeah. And this 20,000...
Sounds like you've run up against that. And we don't need to talk about why or how or what. But is there not a way to go and negotiate with sales to bring that up?
Well, I guess that could be, but that's already an increased capacity. So there is a default capacity which is lower than that, and you can create a request to make it higher up to that number. I guess if you're really a big customer, then it might be possible, but I doubt that it would be that easy for any customer, to be honest. Okay, so...
That's the when-not-to-use-it. Now, what are some things that you've seen? And I think we're both going to be at the Databricks Summit, the Data and AI Summit, in June. And you're giving a talk, right? I don't know yet, actually. So I may be giving a talk at the DevRel Theater. But anyways, I will be around there. What are some exciting developments that are on your radar? On Databricks?
So, actually, that's not per se a new thing, but some things I probably can't talk about, so I need to be really cautious with that. You're privy to insider information? I didn't realize you were that cool. That's awesome. Okay. I can't say, okay. I can't tell you the things I can't tell. So there is AI/BI Genie, which I think is super cool. What is that?
So basically it's this AI-for-BI kind of tool. And that's something that within our organization we are also now going to use more extensively. It really simplifies the way other teams, that don't per se have enough knowledge to code things, can interact with the data.
And that's something that we try to incorporate in our product teams. I think that is pretty cool development. It's been there for a while, but I think now it's getting to a state that it's actually nice to use.
So we're talking about using Databricks in this utopian world where we learn the platform and the ins and outs of it. And that is all we need to know. And if we can optimize that, we optimize our whole setup and system. But for most folks, I feel like
Databricks is just one part of the stack. What have you seen in that regard? Oh, yeah, I agree. So I think from what I've seen, and that's the most common pattern, that everything batch is happening on Databricks. So basically the whole model training, which you need to, you know, retrain probably once per week or whatever your retraining cycle is, right?
That can happen on Databricks, and I think it will simplify your life significantly, especially if all your data is in Unity Catalog. It will make it so much easier than using anything else. And that's why I think it's very smart from Databricks to have Unity Catalog in place.
So what I've seen a lot, and that's also where we are coming from, and we still have this kind of diversified way of deploying things. So we train our models and it results either in a model artifact that needs to be served, so you basically have, you know, model serving in this case; or you have batch serving, which means that you just store data somewhere in some database and
At the request, you just need to query the database. Or you have a mixed scenario where you need to have an artifact plus you need to look up some data somewhere.
So when you just want to look up some data somewhere, without any models, I think Databricks wouldn't be my first choice. So there are multiple ways of doing that. It's either, you know, model serving with some lookup in some other database. That would be one of the ways. Or you could use online tables in Databricks. But then you're limited to feature serving and model serving. There are certain data types that are not supported online.
It's just too complicated, in my opinion. So I would rather go for serving some FastAPI on Kubernetes and looking things up in some database like DynamoDB, Cosmos DB, MongoDB, things like that. And that's like the most common approach I see also within organizations. It makes a lot of sense, by the way.
And when you do model serving, you may want to serve it on Databricks, but also MLflow serve works with Kubernetes. So you could deploy it the same way, but on Kubernetes. And that would be also a nice approach. So I think you just need to think what does make sense for you, for your specific use case and
you know, go for that. Databricks makes monitoring of the endpoints much easier. That's one of the, you know, upsides why I would vote for model serving if it's possible and works for you, like whether it fits the requirements. Because for all the API calls, you can look up that information. It's all stored in the inference table, which you can enable. And that really makes it
so nice to monitor. So it's easier than using some other tools to do the same thing. That's just what I think about it, because we tried different approaches with that. So yeah, as always, it depends. I'm glad you brought up MLflow right there, because I know that there's been a lot of work on MLflow, and specifically around
extending MLflow to new LLM capabilities. But when we chatted before, you were saying how cool the new updates are with MLflow. I've seen in the MLOps community that there's been some threads going on in Slack on folks that don't like where MLflow has been going. But maybe you can talk to us about what you like about it these days.
Okay, yeah, sounds good. I actually became an MLflow ambassador, so I'm probably the right person to talk about it. There we go. No, MLflow did feel a bit old for a while. Like, nothing major was really happening, in my opinion, for a while. And it was quite, you know, not very intuitive to use for new users.
But I feel the documentation is just awesome. It's improved significantly in the last years. And if you don't understand something, then you can probably find it in the documentation.
So that's one of the things that definitely improved. But also the LLM features, that's something that also came out recently. You can have MLflow traces, now they also have a prompt registry, which is super cool and really makes a lot of sense. There is also an AI gateway in MLflow.
Well, I don't know. I don't think there is any other tool out there that is that feature rich, to be honest, which makes it on one hand a really cool tool. But on the other hand, it might not be straightforward to find out how to use it properly again. So that's probably the other side of the coin.
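As a small, hedged illustration of the tracing feature mentioned above: decorating a plain Python function with mlflow.trace records its inputs, outputs, and latency as a trace in the active experiment (the function here is a stand-in for a real model or LLM call):

```python
import mlflow

# mlflow.trace captures this function's inputs, outputs, and latency
# as a trace in the active MLflow experiment.
@mlflow.trace
def answer(question: str) -> str:
    # placeholder for an actual model or LLM call
    return "stub answer"

answer("What do asset bundles do?")
```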
When you talk about the MLflow serving on Kubernetes, is this the managed version or is this the open source version that you were talking about? So MLflow serve is a functionality that comes from the open source MLflow.
But basically exactly the same thing is used on Databricks. It's just that you don't have to run MLflow serve anywhere. You use commands to deploy endpoints on Databricks instead. But exactly the same thing happens behind the scenes, which makes it easy to deploy pretty much anywhere with exactly the same format.
One of the things that I find frustrating with Databricks serving is that whenever you deploy it, if you just deploy it, you wait, things may fail, and it's very hard to debug. But what people don't realize is that it's exactly the same thing as MLflow serve, which you could just run locally and debug. So there are ways to test it.
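For example, a hedged sketch of that local debugging loop (the model name, version, and input column are hypothetical): load the same registered model as a pyfunc on your laptop and score it there, or build a Docker image from it.

```python
import mlflow
import pandas as pd

mlflow.set_registry_uri("databricks-uc")   # registry lives in Unity Catalog
model_uri = "models:/main.ml.my_model/1"   # hypothetical model and version

# Load the registered model locally and score it the same way the serving
# container would; failures here are far easier to debug than on an endpoint.
model = mlflow.pyfunc.load_model(model_uri)
print(model.predict(pd.DataFrame({"customer_id": [42]})))

# Or build a standalone Docker image from the same model and run it anywhere.
mlflow.models.build_docker(model_uri=model_uri, name="my-model-image")
```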
We're actually writing a blog about it as well now, because it's not very clear for people how to do that. But, you know, the MLflow model format, it's very similar to what BentoML is doing in a certain sense, right? It's just packaging your model
in a format that can be served in a certain way, and you can deploy it anywhere. So you can make an image out of it, a Docker image, and then you can use that for serving. Or you use Databricks, but Databricks does exactly the same thing in the background. So I guess it doesn't matter where you deploy it, it just...
On Databricks, I guess it's a bit easier because all of these complex parts are hidden from you. So if you had the ability to create your favorite stack, we could say, using Databricks and then plugging in different pieces where needed, pieces that aren't native Databricks options, so you can extend the platform further,
Let's talk about a specific use case. So it's not like, oh, well, it depends if you're using this use case and you would want this or that. Let's talk about a recommender system use case. And how would that look? What would you swap out? Assuming that we don't have to onboard any new vendors and do anything new to specific pieces, let's just...
pretend like the work of actually bringing on the tool is non-existent, because we already know that is not true, but in this hypothetical world it is. How would you extend it? So, recommender systems usually are largely pre-computed, right?
It's not something that we are computing at the moment when there is a request coming in. Maybe some parts of it, but most of it is actually just looked up somewhere already pre-computed because it's expensive to compute. And we have some latency requirements. So it means that you have to look it up somewhere in the first place. So that's an assumption that we have.
Model training can be done on Databricks, and for a recommender system, I think, like at least in what we have, we use Spark a lot, and it makes total sense, because all of the processes that we run can be distributed for data pre-processing. Then also the larger logic of what we do for the recommender system is custom-made and also runs in PySpark.
And it results basically in something that looks like a very, very large dictionary in the end. And well, that's something that you could store, you know, in some database.
And then it makes the most sense just to have some fast API running on Kubernetes or maybe, I don't know, it depends like what your requirements are, but like even Azure Function can be good enough for you. So there are multiple options, just you need to see also what you already have within your organization, what patterns do you have.
and choose based on that. If Kubernetes is a big part of what you do, I would totally go for it. And then deploy just a FastAPI app and then look things up in some Cosmos DB, DynamoDB, whatever database you have, and return the value back. And for the monitoring stack, you can't use inference tables in Databricks then. Then you have to, well, what we use: we have App Insights, and we also have,
you know, Prometheus and Grafana set up, and that's what we currently use.
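For illustration, a minimal sketch of that serving pattern (the in-memory dict stands in for a lookup against a key-value store like Cosmos DB or DynamoDB, and all IDs are hypothetical):

```python
from fastapi import FastAPI, HTTPException

app = FastAPI()

# Stand-in for the precomputed recommendations; in practice this would be
# a query against a key-value store such as Cosmos DB or DynamoDB.
PRECOMPUTED_RECS = {"customer-42": ["sku-1", "sku-7", "sku-9"]}

@app.get("/recommendations/{customer_id}")
def get_recommendations(customer_id: str) -> dict:
    recs = PRECOMPUTED_RECS.get(customer_id)
    if recs is None:
        # A real service might fall back to popular items instead of a 404.
        raise HTTPException(status_code=404, detail="unknown customer")
    return {"customer_id": customer_id, "items": recs}
```

Run locally with, say, `uvicorn main:app`, and the endpoint returns the precomputed items for a known key.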
What about the data layer of things, like the whole data pipelines and processing, moving that around, creating features, et cetera? Yeah. So, well, we are on Azure, and we don't handle the raw data part, as in the data ingestion part, but there are some limitations that don't allow us to use Databricks for that, so that's why it's still in Azure Data Factory. But
when the pre-processing happens, it writes all the data to Unity Catalog. So we basically are consumers of that data. Our team is a consumer of the data, and the data is produced by another team. So if something is wrong with the data, it's not our responsibility to fix it, but it's that team's responsibility to fix it, which simplifies certain things and complicates other things, obviously.
So that's a data layer that we're dealing with. Then we have our own custom data pre-processing that is required for our models because
Data engineering team doesn't care about ML-specific data transformations. So we have our own data engineering pipeline that is used for our personalization stack. So that is shared across multiple subproducts within our personalization domain. This pipeline is with Databricks or Airflow? It is on Databricks. It's using Databricks workflows. And we write back also to Unity Catalog.
And then the subproducts, which are, for example, recommendations on the basket, on the product detail page, or personalized offer recommendations, which is what we do as well. These are also separate subproducts that run after that other pipeline has finished. And the results are either written to some database where we can use FastAPI to look it up,
Like we actually do serving on Azure functions these days.
And the other part, we actually have a model as well. So it's just a model that we do Databricks model serving for that. But as I said, if you have certain requirements and Databricks model serving doesn't fit your requirements, you could use MLflow serve and deploy it somewhere else. Mm-hmm.
So I like how I asked you what your ideal stack is and you gave me what your actual stack is. But it is very ideal. I really like what we have now. It took a long time to figure that out. And we went through a massive migration. We're almost done with it. It's a great feeling. And it's actually really the way we imagined in the beginning. So given the situation that we are, I don't think there is a better stack for us at the moment.
That's why, yeah, I liked your answer. It's like, there's no difference between my ideal stack and my current stack. Yeah, indeed. Yeah. Yeah, we are very proud of what we achieved. It was a lot of effort. Yeah. What were some of the things that you did when you were creating this migration that didn't work out?
Yeah, we actually wanted to use Databricks feature serving and also online tables for feature lookups. And because of the limitations we faced, we never could use it. So that was one of the downsides, because it would simplify our deployment stack, right? The whole deployment could happen just in one big pipeline in the workflow on Databricks.
But instead, now we have multiple pipelines, which is still manageable, but it's less perfect, I would say. And you kind of glossed over something that I want to go back to, which is you're not the owners of the...
raw data that gets thrown into Unity Catalog. You're just a consumer of it, which has its pros and cons. How do you break down the pros and cons? Yeah. So I think the data quality part is where it is all about in the end. Of course, there is some monitoring on the data ingestion side and the way how they process data and what they put in the Unity Catalog to have some quality checks in place.
However, these quality checks are very different than the checks that we do.
For them, it's just like the schema makes sense, the values are within certain ranges that are acceptable, and the count is normal, things like that. But what we check for are very different things. It's more around statistical properties of the data. And the data engineering team, because we're not the only consumers of the data, there are other consumers too, they don't really care about these things. So it can happen that we see that things are broken,
And they haven't noticed it just because they don't check for things we care about. Yeah. And that's a universal problem that everyone has. This feels a lot like where you would want to create this data meshy concept, data contracts.
and have the responsible ones and the consumers, so the producers and consumers, shake hands and say, all right, we agree that these are the quality checks I need or these are the things that I'm looking for and I want it with this type of freshness and I want it in this style or this schema, whatever it may be. Have you thought about doing that?
Yeah, we tried. But I think it's always about, you know, how big the teams are, what their priorities are, who they are reporting to. All these things matter. And if there is this kind of movement from above, then things may change. But, you know, if you're just one of the consumers and there is a larger team...
And their priorities are way different. It's just hard, you know? Yeah. Yeah. That's such a great point. It's how do you influence the other team who has other priorities, right?
to take into account that this data that they're giving you, sometimes it all goes haywire and you're not able to extract the most value from it. So it's almost like you have to go up your food chain in order for them to go horizontal and then down their food chain, instead of you just going and talking to that other team and saying, hey, can we set something up like this and that and
Yeah. And so you tried the data contracts. Were the data contracts actively put into place, or was it just that the producers were like, yeah, we'll get around to it, and they never did? Yeah. The second option. Yeah.
That sounds eerily familiar to a few ones that I've heard. So, all right. So before we jump, I do want to highlight the course that you are creating, all about Databricks. It is obviously clear to me, as I've been talking to you for the last hour, how knowledgeable you are with Databricks from a practitioner's standpoint. You've been getting your hands dirty with everything Databricks, and you've also been staying up to date with everything that is coming out.
What's the course about? When is it? How can I sign up?
Yeah, so this is a cohort-based course that we have on Maven. And the idea is that we really want to teach everything that we know about ML and Databricks. So it's a very highly practical course. Every week we go through a piece of theory and we actually show the code to people and explain how it's done. And everyone needs to create their own code
based on their own data set and we review pull requests and every week iteratively we cover another thing and by the end of the course everyone has a full-on project, end-to-end ML project that can be reused in any company pretty much.
So, and that was always our goal with this course, to actually build something highly practical. And we are super active on Discord. So people keep asking us a lot of questions. Also after the course, everyone keeps access to Discord, and we kind of build a community in a certain sense.
So, when does the course start? The next cohort starts on the 5th of May and it goes on until the 23rd of June. And we also have another cohort that starts on the 1st of September, and it will be the last cohort of this specific format of the course, because, as I mentioned earlier, we are going into LLMOps and maybe going to have like extended cohorts. We haven't figured out exactly how it's going to be, but
you know, starting from November, so it will be a different course. Okay, so the course is awesome. I will also mention that you've been generous enough to give everyone in the MLOps community a discount code, so we'll leave a link in the description with the discount code and everything in there. And that's awesome. Thank you. You are writing a book too. What's that about?
Well, MLOps with Databricks, what else? You should have known. Yeah, I should have known. It's basically all my brain dump. Everything I know about MLOps and Databricks will be in that book. And that's basically a guide that I always wanted to have myself if I were getting started with Databricks. Also very highly practical with all kinds of considerations involved.
So the book will be coming out at the beginning of next year, but there is an early release of the chapters. So the first chapters are already coming out, I think, next week. And about half of the book will be out around July, I believe. And I will finish writing the book around October. That's the goal, at least.
So the chapters will be appearing on the O'Reilly platform, so everyone who has access to the O'Reilly platform can read them. So yeah, if you want to have earlier access, that's the way.