I'm Alex Balowski. There are lots of ways to pronounce my name, which is a fun fact. It could be Balowski, or Gmijowski too, if you want the Polish version. And I usually take my coffee any way I can get it. I love a flat white, but it's very hard to find in the U.S., like a proper flat white. So I got hooked on them in Edinburgh.
Welcome back to the MLOps Community Podcast. This is the definitive guide on workflows, aka DAGs, aka pipelines. We get into the nitty-gritty of why they're valuable and how you can use them. We're talking workflows within workflows, workflows that are dependent on other workflows. It's just a constant deep dive on
what they are, why they're useful, and how you can best take advantage of them. My man, Alex, did a whole survey of different tools that are out there. He covered everything.
79 of them. And we'll leave his blog post, with all of the insights he gained from this survey of all the open source workflow tools on GitHub, in the description so you can check it out. Let's get into the conversation with Alex.

Workflow systems, and what those mean: what is it, what got you interested in them, where are we coming from with that? Because you've got lots of thoughts on it, and there's this really cool summary, or survey, that you showed me. So I want to dive into that, but let's just set the scene with workflow systems.

Sure. The idea of a workflow. I had to look this up, and it goes back to really the 1920s, which is when the term itself comes out of people looking at process engineering, manufacturing, and other kinds of business contexts. So that makes sense, right?
And it is how we use it today. But the real genesis of the current generation, I think, comes out of more like rule-based expert systems, the generation maybe from the 90s, where instead of writing code that implemented business processes, people had rules. You'd use a rules engine, and the rules engine would tell the application, or whatever, the next thing to do.
And so there's this implicit idea of a chain of things that somebody has specified: what's the first step, and if that's successful and maybe meets its criteria, then what's the next step? The average person might think of those as a flowchart, but in more computer-science-y talk, it's a big graph.
And inside the graph is a bunch of steps, and the steps are basically tasks that do things and manipulate data and produce results or have side effects. And then when one successfully completes, some downstream task is invoked. And that is the workflow itself. And there have been systems developed, post the rules-engine world, that think of it more like a graph, because they think it's more human-intuitive to draw out that flowchart of what you want to do. And there's a whole other side of this world that's not used in machine learning as much, which is business process modeling. And there are actually standards around that: there's a thing called BPM, and there's BPMN, a business process modeling notation. And those are very cool things themselves that people use to describe processes inside their businesses for all kinds of purposes. And having an exchange format and notation for that is great. But that's, I think, the
bifurcation here of those kinds of systems that were built for business processes in general, versus how we have used them in data DevOps, data engineering, ML operations, and so forth. And then there's a small category of workflow engines that are embedded in applications themselves, which are kind of all over the place. And so my interest was to look at this in a broader context, revisiting things I've done in the past plus all this new stuff, and seeing what is out there. I spent some time on GitHub; I found some people had great lists, and I found some ways to search for these things. And I basically built a big spreadsheet of workflow systems and the features they have, trying to sort out these different systems: when did they start, are they active, what features do they have versus others. So...
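Alex's earlier description of a workflow, a graph of steps where a downstream task is invoked only after its upstream tasks complete successfully, can be sketched in a few lines of Python (the task names and the graph here are purely illustrative):

```python
from graphlib import TopologicalSorter

# Each task is just a function with a side effect or a result.
def extract():   print("extract data")
def transform(): print("transform data")
def train():     print("train model")
def report():    print("write report")

# The workflow is a DAG: task -> set of upstream dependencies.
dag = {
    "extract":   set(),
    "transform": {"extract"},
    "train":     {"transform"},
    "report":    {"train", "transform"},
}
tasks = {"extract": extract, "transform": transform,
         "train": train, "report": report}

# Run every task once all of its dependencies have completed.
for name in TopologicalSorter(dag).static_order():
    tasks[name]()
```

Real engines layer scheduling, retries, and state tracking on top of this, but the core is the same topological walk over the graph.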
Presumably the majority of them are geared towards technical users. Yes. As I said, there's a bifurcation with the business side of it, but because they do more of the business process modeling, they have nice interfaces. They're meant for your average user inside a corporation, or someplace, to be able to describe a process. Some of them have a sort of no-code aspect to them. At the end of the day, it's still a very technical thing to draw one of these big diagrams. Even if you have a beautiful tool to do it with, you have to know a lot, and you have to think a lot about, how do we actually do this? And sometimes they're replacing or modeling things that are done by humans. And I have a whole story about that from the last company I was working at, with scientists, about how we actually used this BPMN notation just to help them, like, draw us a picture. I needed a standard way to draw a picture of what they were doing. And so that business process modeling notation, in the scientific context, was actually a useful tool, right?

It is funny that you talk about
how we can represent it as graphs. Because I've heard this before in this last year when we had the AI Quality Conference, one of the speakers who actually is one of the creators of Docker, Solomon, he talked about how everything is a graph and then
My friend that was sitting next to me, David, he said, wow, I never thought about that. Yeah, everything's a DAG. You can really represent anything as a DAG, kind of. And so it's almost like thinking about it like that. Yeah. Graphs are a super useful notation. They can also get messy really quick, right? And this is where you need tooling, and notations. And it depends on what you're doing with it. What I found interesting, and we stumbled on this BPMN, and I had not been following that area for a while, but we were struggling with: how do we draw a picture of these process engineering flows? Because we want to automate this. We have machine learning, we have robots, we have humans, right? And they're all part of this process, and we have different interactions there. When a person does something in the lab, they literally have a
plate with wells in it, and inoculants and stuff in it, and they're putting it inside a liquid handler robot, or they're putting it inside an incubator robot. We need to know that the task the human did happened before the robot is told to do its task, before the machine learning is told, with the resulting data, to do its thing.
And so modeling that whole process let us build tools and user interfaces to make the lab more efficient. And one of my takeaways from that experience was: there's what we do in MLOps, which is just the part, the technical part, that we're concerned about. And then there's how it's used in the bigger organization, and that's also a workflow. So it's workflows inside workflows. And where you slice it matters to the sort of result you're trying to get. In this case, we were trying to make the processes the lab used more efficient, to get more throughput through the lab, as well as to get all the interesting results from the computational side of it.
And so writing that workflow down as a whole thing with all the technical bits in there too, so everybody understood all the parts, was a challenge, right? So having a notation was first like, how do we communicate this? How do I get literally a drawing? It doesn't have to turn into code, but just a drawing that everybody says, yes, that's what we do, right? And from all sides, right? And even if their eyes glaze over for a part of it, that's not their thing.
We all have a consensus and we're building around the same thing. We stumbled across this mostly because there was a really cool web-based tool out there from a company called Camunda. They have an open source thing. It runs in the browser, you can drag and drop things and build the whole workflow, and then it spits out an artifact. Oh, nice. It's a BPMN notation file format.
So theoretically down the road, you could do something with that. In our case, it was simply just a diagram that we could use as a communication piece as part of our sort of internal technical documentation. But it was a good starting point for that. Just drawing a picture of what are we implementing with our workflow systems. You talk about the way that you slice it and being able to look at
workflows almost like from infinitely zoomed out to infinitely zoomed in and all of the different layers of workflows involved in each time that you're zooming in. And then where one workflow starts and one stops and who owns which workflow. And then you're getting into technical workflows versus non-technical workflows. And do you have any experiences on what
is useful in that regard, like how to slice and dice these? Yeah, I think one of the challenges, one of the things is to maybe not try to think of it as there is the one workflow. So I think on the outside, there's the process of your business, whatever it might be. So if you're a very in silico technical kind of organization, like you have a digital product, right, and you're using machine learning technology,
There are the aspects of the internal tech piece: how that model does inference, or is trained, or is evaluated. And that's a very technical workflow. That's something that a small team somewhere in your organization really understands well. And then it's a black box, a black box in a larger workflow, which is how your organization uses that and how they make decisions around it. So when you train a new model, the output of that whole process might be not just the model, but how good it is at a particular task. Somebody has to make a decision about what they do with that. The decision could be automated: if it passes certain criteria, we put it into some kind of production track. Or it might not be; there might be a human who has to go in there, make a decision, and start the ball rolling.
That's part of a bigger process that's really a business process workflow. And so I think there are opportunities here to have a layered model, where maybe you're using different technologies, different workflow systems, right? But they're still interacting, because one is using the other. And in the case of my last company, the outer workflow of what the scientists' lab is doing versus the technical bits of each tool they're using.
That was just a paper discussion, right? It's a diagram. It's a part of a documentation. It's so that we all understand what we're doing. But maybe the ultimate goal is that it's executed by some system.
But there's the long-term goal versus the short-term goal, which is: now if you dig into the boxes in this thing, how do we accomplish that task? Maybe it's a big procedure, a wet lab thing. Maybe it's a robot. Maybe it's a whole machine learning workflow that runs a bunch of code and manipulates a bunch of data. So you can draw the workflow out, and you can execute it, and you can have a system that actually implements it. And I think those are useful architectural tools, ways to decompose the problem. And then you can choose where you spend your time and money implementing a system, right? And there are choices for each of those things. That's what's cool about this: you can go full on, like, I've got a workflow for everything and I've got a system for everything, and there are tools for those things. That's the cool place where we are at this point in time, which was not true even a decade ago.

Yeah, it does feel like if you are able to
understand the different workflows and what happens after your workflow ends, in a way you are setting yourself up for greater success. Because if we take that example: okay, there's your little piece, which is the model, and then there's exposing the model to the greater organization. That's great.
And you can be optimizing the model for things that you think are cool. But if you really know what happens after you expose it to the organization and how other teams are using it or what they're looking for when it comes to that model, then you're optimizing for something that's greater than just what you think is useful and you're understanding how it's being used in the greater context. Have you seen that being a case that...
When you notice all the different ways and dependencies that are being built from your specific workflow, you're setting yourself up for more success.
I think that it depends on the scope there. Workflow systems, whatever kind of technology you choose, have to be properly handled. You have to support them; they have to be vibrant, and they have to meet the users' needs. At Stitch Fix, you know, when I was there, there were 140 data scientists running stuff.
For the organization, some of these things were back-of-house types of things. Some of these things were daily things that populated their systems internally, and they were really critical. And some of them were research work, and so forth. So, all over the place. The ability to describe the steps of things and interact with it, that was Airflow underneath, but they didn't actually interact with Airflow. They had a DSL, which is a term we should define: domain-specific language. They had a way to describe the workflow in a DSL, and then they could hand it over to the system, which would run it underneath with Airflow, and then they had a way to execute the tasks on their big batch system. All that complexity was hidden from those 140 users, so they didn't have to become experts in that technology.
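That internal Stitch Fix layer isn't public, so this is only a hypothetical sketch of the general idea: users write a small declarative spec, and a platform layer compiles it into runnable steps for whatever engine sits underneath (every name below is invented):

```python
# Hypothetical sketch: users describe a workflow declaratively, and a
# platform layer turns it into executable steps, hiding the engine
# (Airflow or anything else) underneath. Names are made up.
workflow_spec = {
    "name": "nightly_recs",
    "steps": [
        {"id": "pull_events", "after": []},
        {"id": "build_feats", "after": ["pull_events"]},
        {"id": "score_users", "after": ["build_feats"]},
    ],
}

def compile_workflow(spec):
    """Turn the declarative spec into a dependency-respecting run order.

    Assumes the spec is acyclic; a real platform would validate that.
    """
    done, ordered = set(), []
    steps = {s["id"]: s for s in spec["steps"]}
    while len(ordered) < len(steps):
        for s in steps.values():
            if s["id"] not in done and all(d in done for d in s["after"]):
                ordered.append(s["id"])
                done.add(s["id"])
    return ordered

print(compile_workflow(workflow_spec))
```

The point of a layer like this is exactly what Alex describes: the user never touches the engine's API, so the platform team can swap or upgrade the engine without retraining 140 data scientists.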
And it did lots of good things for them, like the task executor auto-learned stuff: oh, your task needs more memory, so we're going to retry it with more memory because it ran out of memory. And it learned the right parameters for the user. So again, they didn't have to be experts on deployment. You can do lots of cool things with workflow systems like that. So that gives you... they're more productive, right? They're less frustrated, until the system breaks, and then they're frustrated with it. But happy users are a sign you're doing it right. So...
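The memory-escalating retry he describes can be sketched generically; assume a toy executor where the memory limit is just a number passed to the task (the names, limits, and escalation policy here are all made up):

```python
# Illustrative sketch: retry a task with a larger memory request each
# time it fails with an out-of-memory error, and remember what worked.
learned_limits = {}  # task name -> memory (GB) that last succeeded

def run_with_memory_escalation(name, task, start_gb=4, factor=2, max_gb=64):
    mem = learned_limits.get(name, start_gb)
    while mem <= max_gb:
        try:
            result = task(mem)          # pretend the executor enforces `mem`
            learned_limits[name] = mem  # remember the limit for next run
            return result
        except MemoryError:
            mem *= factor               # escalate and retry
    raise RuntimeError(f"{name} failed even with {max_gb} GB")

# Fake task that needs at least 16 GB.
def train(mem_gb):
    if mem_gb < 16:
        raise MemoryError
    return "model"

run_with_memory_escalation("train", train)  # escalates 4 -> 8 -> 16, succeeds
print(learned_limits["train"])              # next run starts at the learned limit
```

This is the sense in which the executor "learned the right parameters": the user never tunes memory, the system converges on a working value.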
I think that I've seen that kind of success. What I haven't seen yet, and maybe somebody will tell me I'm wrong, is it jumping silos, right? So that's a data science organization inside a big company, and the tools are used internal to that part of the organization. But you don't see nesting of workflows as much, where this is now part of a bigger organizational workflow. I think those processes are more bespoke: the output of the machine learning workflow, the model that was updated, is used by some other system via some connection that is specific to how we deploy that thing. The idea of a workflow using a workflow, where different parts of the organization are using workflow systems, I haven't seen that in real life. I'd like to see it. I think that's a vision.
Yeah. We describe our organizations as process; I like to think of it as process engineering, right? And we can take that down. It's a term I'm borrowing from manufacturing, where everything has to follow a procedure, and we describe what those procedures are, and they fit together like puzzle pieces. I think business and digital contexts can work the same way, but you have to do that process engineering, and it's hard. And so people kind of skip that hard step.

But there's a payoff. People will use these systems. And I think we see that with data engineering and MLOps and operations, where there is a payoff, right? Because there's all this complexity about how we do our tasks, and we can hide that in the system and make the end user, who is a data scientist or a machine learning engineer or some other kind of data person, a data engineer, just more productive, because most of those details are left to a very small
handful of people in a corner making that workflow system work for them.

Why do you think it is hard, as you mentioned, to do this process engineering? Is it just because it's time-consuming and cumbersome, and there's a lot of friction in describing the systems and explaining the processes? It's not like you can just do it and it gets auto-documented for you, right?

I think my last company experience is interesting, because my main users were internal. They were scientists. And this is not a world view they're used to, thinking about what we're doing in our lab as a graph of things, of tasks.
What they're really writing is procedures, operating procedures. It's a big document with hierarchies of things: next you do whatever, next you talk to this machine and set up these protocols. So the idea of looking at it as a graph of tasks, how they interact, how you interact with that, and annotating it with here's the data that I need, here are the pre-steps that have to happen before I can start this task, here are the controls I have, here are the failures that could happen in this situation... that level of detail, they're just not used to writing out. And you can understand that, because it's not usually useful to them in their day. They know how to handle those situations, and they're familiar with the equipment, they're familiar with the work that they're doing. It's what they've been trained to do. That's what they've been doing in their career.
But, and this is where the problem comes in, when you try to scale it, when you try to understand more about the metrics of it, the data artifacts, the other things you might be able to get out of the system, then you need that process engineering. If I'm going to automate something, you have to be able to draw me a picture of what you do and tell me all the facets of it. And that's just a hard conversation to have. And so...
I had varied success when I did that with them. Some people really pushed back hard. They're like, I don't see why I would have to do this; here's the thing I'm doing, here's the procedure I have, it works fine for us. Others were like, sounds interesting, the diagram looks interesting, but I don't quite understand it. And then, when they engaged with it at different points in time, all of a sudden they're like, oh, okay, I can see some value in this, maybe as a way to document what we're doing in a different way. And that's a great opener, right? If you see value in drawing the picture, and I can take the picture with one of these tools and do something with it, actually build a system, that's great. So I had that whole range of responses from people. I think the challenge is that the organization has to be able to attach some kind of value to it. What are we getting out of this? What's the ROI of doing all of this process engineering?
That's why places like manufacturing make total sense for this kind of engineering, because they have the efficiency, quality, measuring stuff; that's all about manufacturing. But when things are much more... I wouldn't call them geeky, that's probably not the right term, but basically when they're highly trained people building these things, running a lab, running the business side of your organization, and you're asking them to speak this weird language, where's the benefit of that? So you have to lead them with the why. I always like to build demos or prototypes, or find some exemplar, that's going to give them the why of why would I do this. Which is where we started at my last company: you've got to build something that has value somebody can see, and then there's institutional investment in going beyond that.
Yeah, I noticed it with just this podcast. For example, I had a friend tell me really early on: you're spending too much time on that podcast. Why? What is your favorite part about it? What do you like doing? Let's figure out how we can automate the parts, or get someone else to do the parts, that you don't like doing. Because there are a lot of intensive things that go into creating podcasts. And like you said, I was doing one a week and feeling overwhelmed, because there's so much that goes into it; I was dropping the ball on a lot of stuff. So I recognized that early on. Well, I did because I was lucky enough that my friend told me: hey, sit down with my buddy, he's really good at this type of thing, at recognizing where you plug in and where you don't need to plug in. So I sat down with the guy, and he just said: so what do you do first? I said, I find the guest. And then what do you do once you find the guest? I ask if they want to come on. Okay, and then what do you do? And we just went through that. And what if the guest says no? What if the guest says this?
And three weeks later, I had a very in-depth flowchart. And it helped me so much because it was the blueprint. And so I'm wondering, have you seen any trends in the ML world, taking a bit of a left turn, or what are the most interesting trends that you've seen in workflows for machine learning? Yeah.
I think one of the interesting trends here is that workflow systems in the last, I would say, decade have really grown up. Ten years ago, it was more of a niche thing: why would we do this thing? It's weird; I can just write all the code, or I'll write a bash script. Now it's much more common practice: oh, you've gotten to this point, from building whatever your prototype is, to needing a workflow system. Here are your choices of systems out there that people are using. And there are older ones like Airflow, and there are newer ones like Metaflow.
And everything in between. And then you pick your technologies and things. So I think that's the change: this is not a hard decision anymore. You don't have to convince people; it's like, yes, people use workflow systems to train, to do inference, to do all kinds of tasks for them, and you should have one, right? And what's your deployment situation? If you're using Kubernetes, there are all kinds of choices of other things. There are ones that do it more like a service orchestration. And you pick from a menu of things. There are SaaS services, and there are things that you deploy yourself. So the great thing is that we've moved from a hard sell in a corner to: this is just standard practice. And I think there are a handful of companies doing this as a business, which is great. There's a lot of open source here; a lot of them have a SaaS offering, but the core technology is open source.
Which is what I did. That's why I went to GitHub. It was easy to find a long history of these things over, actually, a couple decades of people building these systems. Some of them are still active, and some of them have gone stale and are no longer active projects. But they're in different categories. There's the whole business process stuff, which we've been talking about a lot, at the high level. And then there are these other categories; in my survey, I made these categories up. There are things for business processes, with a generic aspect to them. Then there are things that were specifically for science; they have their own challenges there with HPC computing. Then there's the data engineering side, and then the data science and ML side of it. And then there's a little bit around operations. You can see the newer generation of tools came out of that data engineering side and then grew into data science and ML, with a little bit of an offshoot for operations, which is more like system management: how do I add a new node to my Kubernetes cluster, how do I install software across a bunch of machines, et cetera, et cetera. The same kinds of workflow problems applied in the operations context. And of the 79 systems I looked at out there, something like 46% is business process automation, and the rest of it is spread around. The other big chunk is the data science and ML side of things, which is about 22%.
And so that's a trend, right? These systems are growing and they're active. And it's not like we're not using the business process stuff; that is a whole healthy world, right, and they are growing as well. But the focus on MLOps is obvious in lots of contexts, right? These are multi-step processes, and this is where workflows of a certain sort can shine. And I spent some time looking at... I actually went in and asked, when was this project created? And is it still active? Those are two dimensions, because some things are personal hobbies, some things are products that came in and maybe somebody abandoned them, the company went out of business or something. There are lots of reasons why things stop getting developed, or were usurped by a new thing. And if you look at that, the business process stuff has the oldest repos out there you can find of various products; it goes back to almost 2005, 2006, somewhere in there. And there are still new projects being created, up until sometime in the last year.
So people are still innovating and building new things on the business side. Same thing, but if you go look at data science and ML, it's more like 2015 and onward, and actively through last year. And it shifts back a little bit for data engineering. Science had its heyday from the 2000s to about 2015; there are lots of reasons for that. Those systems are still actively used, but, and this is my take, the ones that solved a very specific HPC problem, like I'm running some massive model, they're still in use, right? Whereas the things that are more machine learning, data science, data engineering, they have new choices, right? They don't have to use these older systems. And I think there's a bifurcation of use cases there. So it's interesting to see that there are different trends here, but they're also in these different columns of use, right?
It's fascinating that you break down, like, the data engineering side. So besides the business side of the house, if we're looking at the technical stuff, I can probably rattle off three or four data engineering ones when it comes to the most popular. You've got Airflow, which has proliferated everywhere, and most people are using it. Whether or not they like using it is another story, because it's been around the longest and it's had the most adoption. And then you have the Mages and the Dagsters and the Prefects out there that are attacking it in different ways, almost like the data engineering workflow 2.0 type of thing, I would say, because they're a bit newer and they're taking a different approach to things. And then in the ML world, you have the ZenMLs and the Metaflows, like you said, and even Flyte, I think, is another one that is in there. And those are all fascinating because they're going after the ML-specific type of use cases. And then in the DevOps world, you've got your Argos, and maybe you could consider Kubeflow in that world; it kind of plays in both the ML world and the DevOps world.
So when I was surveying this, I had a whole column, and this is a spreadsheet, trying to put categories on things, but I had a whole column about machine learning and AI. And one of the most challenging things, because I'm looking at the documentation and looking at the code and the repos: everybody who's got an active project is adding ML and AI to their documentation as we speak, right? And so I marked those as neutral. And then when I found actual evidence of: here, we have tasks for it, we have examples, here's a workflow that does whatever, something that shows you can actually do it, then I marked it as a yes. And I think there's a nuance here, which is that any of these workflow systems that have some kind of a model of a task executor, and Airflow is included in this, can do advanced machine learning workflows just fine. Because that task executor could be some complicated thing running on Kubernetes, it could be some other system that you're interacting with, it could be an inference endpoint that you're using at some inference provider like Baseten.
They all have that capability. But I think the challenge here, and that's true for the people who have a business, or, say, a historically older, more mature product, and are now saying, hey, we can do machine learning and AI; it's probably true there as well. It's this question of how easy is it for the practitioner to actually do that. Even though you said it, do I have to do all the heavy lifting to make it happen? Or do you have infrastructure for me? Do you have examples? Do you have documentation? Have you thought through the nuances of
what I need for, say, a model training pipeline, and how do I get access to the assets and GPUs and things that I need? Or is that stuff I have to figure out, even though I'm using your workflow engine? I think that's the differentiator. And some of the newer systems are also code-oriented; they're on that infrastructure-as-code track. So if you're going to describe your workflow, you use Python annotations, and you don't have to deal with a DSL and all these other things. You just write code. And that's a trend right now: everything is code, and we just write it in Python, we use annotations. That works for some people quite well, and there's nothing wrong with that, but it's not the only way. And with older systems like Airflow, there's also a differentiator in that the DAG is stored in a database, right? So there's not necessarily a representation of it other than talking to the system.
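The annotation style works roughly like this generic sketch: a decorator registers plain Python functions as steps of a DAG (this is not the real API of Metaflow or any other tool, just the shape of the idea):

```python
# Generic sketch of the "workflow as annotated code" style. Not the
# real API of Metaflow or any other tool; names are illustrative.
REGISTRY = {}  # step name -> (function, upstream step names)

def step(after=()):
    """Decorator that registers a function as a workflow step."""
    def deco(fn):
        REGISTRY[fn.__name__] = (fn, tuple(after))
        return fn
    return deco

@step()
def load():
    return "raw"

@step(after=("load",))
def train():
    return "model"

def run():
    """Execute registered steps once their upstream steps are done."""
    done = {}
    while len(done) < len(REGISTRY):
        for name, (fn, ups) in REGISTRY.items():
            if name not in done and all(u in done for u in ups):
                done[name] = fn()
    return done

print(run())  # {'load': 'raw', 'train': 'model'}
```

The appeal is exactly what Alex notes: the workflow definition lives next to the code it runs, in one language, with no separate artifact to maintain.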
And there are a bunch of systems that work like that. And then some things have DSLs: it's a YAML file, it's a JSON file, it's some custom language, and you write in that DSL. And some things are just code, right? So I tried to make a differentiation in the analysis of these different systems, looking at how you interact with them. And the trend is toward code, right? More annotations, fewer DSLs. But maybe we should talk about the DSLs at some point. The older trend has been that there's a serialization format; there's a thing you can author, an artifact that is the workflow itself, right? It's coded in JSON, XML, YAML, some custom language. And it's a piece of code itself that you can check in somewhere, but it's not Python; it's something else that describes all the metadata around it, and you treat it as such. And that has its use cases, right? Yeah. But the trend is away from that, I think, right now. That might not stick.
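A serialized workflow artifact of that kind, shown here as JSON purely for illustration (the format is invented, not any real standard), is just data: you can check it in, diff it, validate it, and load it from any language:

```python
import json

# An illustrative serialized workflow: a checked-in artifact rather
# than live code. The format here is invented, not any real standard.
artifact = """
{
  "name": "daily_etl",
  "steps": [
    {"id": "ingest",  "depends_on": []},
    {"id": "clean",   "depends_on": ["ingest"]},
    {"id": "publish", "depends_on": ["clean"]}
  ]
}
"""

spec = json.loads(artifact)
ids = {s["id"] for s in spec["steps"]}

# Validate the artifact without executing anything: every dependency
# must refer to a step that actually exists in the file.
for s in spec["steps"]:
    for dep in s["depends_on"]:
        assert dep in ids, f"unknown dependency: {dep}"
print(f"{spec['name']}: {len(ids)} steps, dependencies OK")
```

That static checkability, with no code execution required, is one of the use cases the artifact style keeps going for it.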
You mentioned to me before we hit record that it feels like everything is moving towards code and this infrastructure-as-code world. And the big question there was: is that good? Yeah, I have mixed feelings about that. I understand how it is: when you're writing something, it's very compact, and it does a very technical thing, and it's all sitting there in Python. It's a bunch of PyTorch and other things like that. And being able to wrap those up in functions, and then organize those functions into a workflow as a sequence of steps, it's very elegant
and useful. But the challenge with that is then that code is the only way you can understand what the workflow looks like, right? So if you want to draw a picture of it from your code, you have to run the code somehow and get an artifact. What is that artifact, right? You don't have a DSL.
Right. How do you generate a diagram from that? A lot of these tools will do that for you. They will make a picture for you, and maybe you can save it as an SVG, maybe you can't, maybe you can take a screenshot, whatever. But you have a picture and you can give somebody the picture, and they can say, this is what we're doing, and you can have a discussion about it.
And so I think the problem with the annotation side of things is that only people who can write code can understand that workflow. And that, I think, is a challenge. Because if, like you and me, your ultimate goal is that there are more workflows, and nesting of workflows,
then there are people in the mix who don't write code. Say your machine learning workflow is part of a bigger system, and that's a big workflow, and we have pictures for all the rest. Yours is a black box where they don't understand how it works and what the different failure states are and so forth.
And so I think it runs afoul of that. Now, I don't think the infrastructure-as-code approach, the annotation approach, is inherently bad, because it could produce artifacts that are the definition of the workflow. I just don't see a lot of evidence that that is where these tools are going right now. And maybe, as they go on their adventure of building their systems and services, something will come out. Right.
There's a wide variance in what these DSLs look like. There's a lot of history there. There have been some attempts to standardize in various contexts. The science domain had a YAML-based thing called the Common Workflow Language. BPMN comes from the Object Management Group, I believe, as a standard for business process modeling. And they have a notation, which is another standard
for the diagramming of them. Have those taken root in various communities? Sure. Are they widespread? Probably not. And so I think it's not clear that there's a real winner there, and it's not clear that we necessarily need a standard. But certainly within your organization, if you had 10 different formats, you'd probably be unhappy. Yeah. You know. Going crazy. Yeah. There's some work to be done there. But when it comes to these domain-specific languages...
Is it something where you, in an organization, choose one and go with it, and you can abstract away, like you were doing at Stitch Fix, you mentioned? You had the end users using the domain-specific language, and then underneath the hood you had almost that infrastructure-as-code layer.
And it feels like that was working well for you all? Yeah, maybe. It depends on the person, right? It was working well for the system's longevity. So as we changed how those tasks were interpreted by the system, as technology changed, the workflow was just metadata about what we would like to see happen and how things are chained together. So I think there's value in that. But I think that there is definitely some...
pushback that comes with that as well, because it's another thing, another artifact. It's external to your code. There can be mismatches. So there are lots of challenges there too. And so I think it really depends on the particular user that you are interacting with. What's nice about most of these things is there's a high prevalence of people coding them in YAML. You can like it or not. I don't find YAML a problem, but a lot of people don't like it.
It's okay. But the structure is pretty much the same. There's a list of steps, and a step has a bunch of metadata, and they all have names, and they point to each other through some mechanism. And so having that kind of common format lets you take a workflow from system A and a workflow from a different system B, and if they're both in YAML, you can think about how you would represent both.
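That common shape, a named list of steps that point at each other, can be inspected without running any workflow code. A minimal sketch, using a made-up JSON document in place of a YAML file so it stays within the standard library:

```python
import json

# A hypothetical serialized workflow. The field names here are invented,
# but they mirror the shape many YAML/JSON workflow DSLs share:
# a list of named steps, each declaring its dependencies.
doc = json.loads("""
{
  "name": "nightly_training",
  "steps": [
    {"name": "extract",  "depends_on": []},
    {"name": "features", "depends_on": ["extract"]},
    {"name": "train",    "depends_on": ["features"]}
  ]
}
""")

# Because the workflow is just data, you can list its steps and edges
# without standing up any infrastructure or executing user code.
names = [s["name"] for s in doc["steps"]]
edges = [(d, s["name"]) for s in doc["steps"] for d in s["depends_on"]]
print(names)  # ['extract', 'features', 'train']
print(edges)  # [('extract', 'features'), ('features', 'train')]
```

From those edges you could feed a diagramming tool, which is exactly the "parse it, don't run it" benefit described here.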
There are artifacts that you can check into source control. You can generate diagrams from them; maybe there's a tool in the system for that. So having this has benefits: it's just something you can parse, and you don't have to run code, which requires infrastructure, requires an environment set up. You can just parse it and understand what the steps in that thing are. And that's when people jump off and say, oh, we should have a standard. But
standards are hard, right? Getting people to agree takes a very long time. I'm not saying that won't happen; it's just not something I see happening right now. But maybe there'll be a need for it in the future. And then there's the whole custom side, right? That's another thing. People have created their own little declarative languages for describing these things. That is definitely on the decline. I don't think people are doing that so much anymore. That trend is fading.
There's a reason why they did it: you can make a nicer thing, right? But then you have the problem of people having to learn your syntax and semantics. And maybe that's more trouble than it's worth at this point. Yeah, well, especially onboarding new folks. Everybody's got to go through that. And so you're now creating just a more
cumbersome process to get someone up to speed. Yeah. And everybody likes to use the term DAG for these things, which I don't always, because not everything's a DAG, right? The acyclic part: real workflows have loops, right? So they're graphs, in general. And sometimes they're forests, in that there could be workflows that have two independent pieces. There are lots of complex things out there. Those are the edge cases, though; even things that have loops are the edge cases. So the DAG
term is a simplification, computer-science-wise, to make it nice to execute. But even with a DAG you can have meets and joins, right? So you can have a little diamond in your graph, and you've got to cut that somewhere if you're writing it in YAML, JSON, whatever. You've got to cut it, and how you make those choices gets easier or harder as the thing gets more complex. And that's where you need tools, at the end of the day.
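The "acyclic" property he's pointing at is exactly what an engine has to verify before it can schedule anything. A quick sketch using Kahn's algorithm: if a topological sort can't consume every node, there's a loop you'd have to cut.

```python
# A sketch of cycle detection over workflow edges via Kahn's algorithm.
# If topological sorting can't visit every node, a cycle exists.
from collections import defaultdict, deque

def has_cycle(edges):
    graph, in_deg = defaultdict(list), defaultdict(int)
    nodes = set()
    for src, dst in edges:
        graph[src].append(dst)
        in_deg[dst] += 1
        nodes.update((src, dst))
    queue = deque(n for n in nodes if in_deg[n] == 0)
    seen = 0
    while queue:
        n = queue.popleft()
        seen += 1
        for m in graph[n]:
            in_deg[m] -= 1
            if in_deg[m] == 0:
                queue.append(m)
    return seen != len(nodes)

# A diamond (fan-out then fan-in, the "meets and joins") is still a DAG:
print(has_cycle([("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]))  # False
# A retry loop is not:
print(has_cycle([("a", "b"), ("b", "a")]))  # True
```

This is the check that lets an engine treat the workflow as executable; anything with a loop needs to be unrolled or cut first.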
Or if you just have thousands of different DAGs or workflows in the organization. I've heard so many stories of folks who are like, yeah, we started as a startup and then we had success and the airflow DAGs just kept growing and nobody really went back and sorted those out. And so you have that sprawl. Yeah.
When I was at Stitch Fix, we had Airflow underneath there. We had the system with the DSL, so I could pull all of the workflows, the thousands that we had, and look through them. And what was interesting is that
there were some oddballs in there that did all kinds of crazy stuff, but most of them were a straight chain of steps: A, B, C, chained together. That's most of what people are doing. And it's not a surprise. So for all this talk of it's a DAG, it has loops, it's not a DAG,
Most people's case, the 90% case, is probably a straight-through chain of things. That number is a guess, but what I found overwhelmingly, way higher than 50%, and this was in an organization that had been doing this for a while, was these straight-through chains. And it makes sense. There are some kind of preparation stages for what you're doing. And then there's the main event: you're training a model, you're doing inference, you're upserting into a database.
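That straight-chain observation is easy to check mechanically over a pile of workflow definitions. A sketch, assuming each input is already a connected DAG:

```python
# A sketch of classifying a workflow's shape: a straight chain means
# every node has at most one parent and at most one child.
# (Assumes the edge list already describes a single connected DAG.)
from collections import defaultdict

def is_straight_chain(edges):
    out_deg, in_deg = defaultdict(int), defaultdict(int)
    for src, dst in edges:
        out_deg[src] += 1
        in_deg[dst] += 1
    nodes = set(out_deg) | set(in_deg)
    return all(out_deg[n] <= 1 and in_deg[n] <= 1 for n in nodes)

# The typical prepare -> train -> cleanup chain:
print(is_straight_chain([("prep", "train"), ("train", "cleanup")]))  # True
# An "oddball" fan-out: prep feeds two downstream steps.
print(is_straight_chain([("prep", "train"), ("prep", "report")]))    # False
```

Running something like this across thousands of stored DAGs is how you'd reproduce the "overwhelming majority are chains" finding.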
And then there's some cleanup maybe at the end. And that's like most people's workflows. And that's, I think, how these ML data engineering workflows differ from the business process workflows, right? Business process workflows aren't a straight through chain, right? There's decisions being made about, you know,
Did we answer this customer's question? If that system fails, we go over here and do something else. If there's a transaction and the transaction goes through, the order
gets made. If it doesn't go through, there's some other process, right? And I think those business process workflows have much more complexity in them. They have a lot of branching and conditionals, and they have a lot of side effects. Like, if we succeed here, we're going to notify another system, we're going to send a message to the customer, or something else like that. So they have all these side effects that happen along the way that are just not a thing as much.
Although you can imagine them, they're just not in practice as much in MLOps. So that's how these systems are different, and that's why there are different products for them. It makes sense. I'm just thinking about what you get with
your favorite DAG tool: a Slack message. Right. Which is useful, right? Yeah, exactly. But it's not like what you're talking about, with this super complex logic, or loops, or spitting out into another subset of a workflow. It is fascinating to think about. I can't believe it took us, whatever, 40 or 45 minutes to get into this part of
the workflow engines. But I have to ask about agents, and what your take on them is, as almost workflow engines themselves, and on workflows in general. And I will preface this by saying we had Igor on here a few weeks ago, and he was
talking about how agents should be seen or even LLM calls should just be seen as another step in a DAG. Oh, we're taking messy data and making it neat and tabular. And that's one of the steps of the DAG. But with agents, I've seen so many different examples of folks who have tried to
explain the agents, or have the agents work as DAGs, where they just make up the graph on the fly. And now you are dealing with this workflow that was created by an agent. There are some companies out there in the mix that I looked at that are specifically focused on agentic systems. I put them in the category of the business process ones, because their tools look exactly like that.
Yeah, and they're a little more generic in the sense that they are, I would think of them supporting more like your chatbot type of interface where you're talking to your favorite airline and things are happening. And when you say, yeah, I'd like to buy that ticket or I have this problem with my luggage, you know, there's a, it interacts with a bunch of stuff and comes back to the agent interface, right? And that's a workflow that they have to manage and that's automated in some capacity.
And so there are people building products for that kind of workflow, and I found them amongst the things that I was surveying. I think the interesting side that's not that is this challenge of the LLM-based agent system, where you have some inference happening, and then there's a consequence. It puts something out there that is either
code-ish or it's something that's coming back to the user. And that's part of this bigger workflow. And that's more like an embedded workflow. And that could be dynamic, right? Because you say that it could be generated by the system itself. It could be generated by another piece of code.
Or some other model. And part of the mix here are these things that are more like workflow engines. They're not systems, necessarily, as much as a library for doing this kind of workflow orchestration. You can write your thing, or give it one of these DAGs of things to do, and it will run it, right, to some completion.
And that's more like an embedded workflow engine. And there's a bunch of them like that out there. And that is interesting because it goes back decades to what people were doing with rules engines for workflow systems, because those were engines that you put inside a product. And sometimes there were like desktop applications that were doing this stuff and they're running the rules and acting on your behalf or there's a user in front of them. So now we've got like a chatbot interface or agent out there that somebody's interacted with on a website or through an app and
It's doing the same thing, right? But at a different scale. And it's not running in your desktop app; it's running out there in the cloud somewhere. But it's the same thing: it's an embedded engine running an embedded workflow, dynamic or not. And so I think there are some really cool possibilities there. I haven't explored that fully,
and this is a good research topic for somebody, or for myself: where are we on that adventure? What have people done successfully? How is it architecturally different from what people are doing now with these interfaces? Where's the there there? And what could you do with this sort of theoretical embedded LLM-based thing
or agent that you can't do with the current systems? There are a lot of possibilities. What I hadn't thought about before, which you just opened my mind to, is how
The agent is almost a gateway to choosing the right workflow. So we talk a lot about agents being able to choose tools or have access to tools. And most people think, OK, now it can scrape the web or it can have access to my database. But the thing that you just said is, yeah, one of the tools might be that it kicks off a workflow.
And so then you don't have to worry about the agent spawning a workflow every time and then maybe it spawns the wrong workflow or the workflow isn't exactly like what you need to happen. So the agent just has to choose between what workflow it needs to use.
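The pattern being described, where pre-built workflows are themselves tools the agent selects from, can be sketched like this. Everything here is hypothetical: the registry, the intent matching (which a real system would delegate to an LLM tool call), and the workflow stubs.

```python
# A sketch of "workflows as agent tools": the agent doesn't invent a
# workflow on the fly, it only picks from a registry of vetted ones.
from typing import Callable, Dict

WORKFLOWS: Dict[str, Callable[[str], str]] = {
    "refund": lambda order_id: f"refund workflow started for {order_id}",
    "reship": lambda order_id: f"reship workflow started for {order_id}",
}

def agent_dispatch(intent: str, order_id: str) -> str:
    # In a real system, an LLM would map the user's request to one of
    # the registered tool names; this keyword match is a stand-in.
    chosen = "refund" if "money back" in intent else "reship"
    return WORKFLOWS[chosen](order_id)

print(agent_dispatch("I want my money back", "A123"))
# refund workflow started for A123
```

The design point is that the agent's blast radius is limited to choosing, never to authoring, which is what removes the "it spawned the wrong workflow" risk.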
And that is very much like going back 20 years, but now we have a little bit looser way of having the end user interact with the agent, or with the if-this-then-that statements. Yeah.
I mean, that's kind of the interesting juncture we're at, where we have large language models and these newer AI systems, and we've decoupled them from workflow systems. So now we have these pieces that are much more advanced than they were back in the day,
and they're not all in one system pointed in one direction. And so we can take these puzzle pieces, put them together like Legos, and make different things out of them. And since the technology is more advanced, we can do some really amazing stuff with it. And I think that's a nice juncture to be at. I was surprised, going through the list of projects, how many there were
that were still active, in all of the categories that I had. And 79 is not a huge number; there's a lot of noise out there. But there are a lot of people actively developing these things and maintaining them. They're using them for things, and in a variety of contexts. And so I think that, maybe contrary to
what you hear out there right now in terms of the buzz in the industry, this is a very healthy, vibrant area of work. People are using it for stuff, obviously, because these projects are active. Some of them are niche and in the corner, some of them are commercial products that people are selling, and everything in between. And so I think that's good for
users out there, because they can find the thing that matches their needs. The only challenge is that there's a little bit of a tyranny of choice, right? Yeah. If you're new to this and you're like, I need a workflow system for X,
you've got some choices that you have to make. How do you categorize the RPA systems? Would that be the business ones that you're talking about? RPA, Robotic Process Automation, I think is what it stands for. That's what I'm trying to think of: did I run into anything that was more on the automation, manufacturing side? And I'm going to say that I didn't see a lot
in what I surveyed, at least. So that might be a whole different thread of this. Certainly there are lots of people doing things like we did at my last company, where we're sending a protocol to a liquid-handling robot because it's part of our automation of what's happening in a lab, right? And that kind of thing, in biotech and in pharma in general,
that kind of automation can use the workflow systems that everybody's been talking about here, because it's a service call to something. The thing has an API, and you push a protocol to it and tell it to go. And so for that level of automation, I think people are using these kinds of tools. But for the more industrial things, there's a whole other world there. I learned that there's a whole other world of plant automation that uses a different technology.
And it uses some really old technology, which is scary. It uses OPC, which is built on Microsoft OLE from like 1995. And so, you know, that's why you can hack something like a power plant.
Yeah. Yeah. Hold it for ransom. So there are some areas that are completely off the radar of this group of workflow systems, and there might be really good tech there. I think that's a whole other world. Yeah. So you've been seeing the trends of what's dying and what's growing, and you did all this research on it. Do you feel like
you have any bets or guesses on what the future of these systems holds? I think there are some good contenders in the SaaS realm for machine learning workflows and general MLOps, and they have some nice tools. I think the challenge in the machine learning and AI context is that
we're just getting started. And so their customer base is all these early adopters and startups and people like that. So if you take this thing and you walk into a large enterprise that's regulated and that has, you know, air-gapped systems, right?
They can't use the SaaS system. Yeah. Yeah. And there are big applications in these places. Making that leap, from "you can use our SaaS service and everything's cool and we're writing Python code and doing all this cool stuff with it," to these places where it's highly regulated, where there are all these other enterprise challenges, and where you have to have certain kinds of certification to be able to operate.
That's potentially where the revenue is for these companies to go, and they have to make that leap. And there are some that are doing it. So the evolution of some of these companies, to be able to provide enterprise products that meet these other needs, that have certain security levels and compliance certifications, and that can work in these non-public
cloud-based environments, private clouds and so forth. That kind of thing is a growth area, right? And you have to be able to survive it, because it's also expensive, with the audits and all of that. So I'm watching to see who matures in that realm. That's also why some of the open source projects with good technology and good support work well in these places: a technology team will take them, put them inside these environments, and deploy them. But then they have to manage them themselves.
So I think that's a trend. And I was happy to see, on the business process automation side, that there seems to be a healthy community of people using those tools. I think agentic systems are an area of growth for them. To some extent, ML and AI is an area of growth for them too. I'll be curious to see whether they get traction there, versus just saying "we do ML and AI." Are people turning to these other products that are a little bit older?
We'll see. It's almost like, with the business process automation, I see now they're incorporating the capabilities of LLMs into
their products. And so now, as one of the steps in your whole workflow, you can add whatever an LLM call is capable of doing, whether that is summarizing a bulk of text, or going and
scraping a website, and then from that scraped website picking out the most important stuff, and now you have that data to pass on to the next step. And so you've got new tools that you can work with. And I've seen it done really well with friends in marketing who are trying to create
content. Because you've got the easy way of creating AI slop, which is just saying to ChatGPT: create me a blog post about AI,
GPU consumption in the US, whatever, and then it'll spit out whatever it has inside of it. Or you can start by saying: create me an outline; find, like, three relevant blogs that talk about it. And then you, as a human, choose the blogs that you like. And then it
uses the information from those blogs, it creates an outline for it. And then you say, now create the intro paragraph, now create three body paragraphs, and you really prompt engineer it to be a much more in-depth type of workflow. And the tools that
are now giving you those capabilities by default, because you have the LLM calls. I think one interesting possible evolution here would be this. You have engineering teams, which are often very expensive,
taking things like general AI technology and general LLMs and doing various things in their quality checks and so on. Does this thing pass our tests? There's the risk assessment test, to make sure it's not going to do something bad or spit out bad results; guardrails and all this stuff. And so you imagine that there are these more technical workflows where they can take a new model, run it through its paces, and say,
this is good, this is not good, or there's some score. Maybe it's not a black and white type of thing. It's a score of how well does it do in these different dimensions.
But as there's more to do and there's more uses of these technologies, you can imagine that there's a higher level user there that's not on the technical team who this is their product. And there's a new model or there's a fix or there's a better model where it doesn't do the bad thing because they tested the new version of it. And there's a workflow for accepting that and rolling it out to their organization or in their application.
If that always has to be a technical engineering problem, that's expensive. And so I can imagine that part of the way people use workflow systems is that kind of multi-layered thing, where there are technical workflows that use very specific technology geared toward the task at hand: training a model, evaluating it, producing these risk scores.
And then there's the business-level workflow, which says: how do I take that model and get it into my application, roll it out to my users, get it into production? And there's a gatekeeper. This is where human-in-the-loop comes in, which we haven't really talked about; there's a human-in-the-loop step there that asks: do I want to do this? Because
there's a business decision to roll this thing out. And when you codify that as a workflow, the only way it gets out the door is that somebody does that human-in-the-loop step of saying yes, as a human decision maker. It's not just a technical team somewhere who says,
you know, if they do the right thing in their DevOps pipeline, it goes out the door. Maybe they don't even have that ability anymore; it's only done through the workflow system. And there's a human who makes that decision, and it's traceable in your organization. And it's maybe less prone to mistakes, to accidentally rolling out the wrong model or a model that doesn't pass your tests. And so that kind of
level of control, and then bringing it out of the engineering organization and back into the hands of a product manager or a business user of some sort: I think that's going to make the thing cost less, and you're going to get better results in terms of quality. That's where workflow systems, at different layers, can be super helpful. I would love to see that kind of thing. These kinds of things already exist, but they're run by technical teams.
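That gatekeeper step can be sketched as a tiny approval gate. All the names here (`RolloutGate`, the role strings) are hypothetical illustrations, not any product's API.

```python
# A sketch of a human-in-the-loop gate: the rollout step refuses to run
# until a named person records an approval, and the approval itself
# becomes part of the audit trail.
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List

@dataclass
class RolloutGate:
    model_version: str
    approvals: List[dict] = field(default_factory=list)

    def approve(self, who: str, reason: str) -> None:
        # Recording who decided, and why, is what makes
        # "how did this thing get out?" answerable later.
        self.approvals.append({
            "who": who,
            "reason": reason,
            "at": datetime.now(timezone.utc).isoformat(),
        })

    def roll_out(self) -> str:
        if not self.approvals:
            raise PermissionError("no human approval recorded")
        return f"deployed {self.model_version}"

gate = RolloutGate("risk-model-v7")
gate.approve("product_manager", "risk score within tolerance")
print(gate.roll_out())  # deployed risk-model-v7
```

The point of codifying it this way is exactly what's argued above: the deploy path physically does not exist without the recorded human decision.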
Right. They're using the same tool to do their DevOps, their operations. Right. And so the tools are all there. But I think the business process automation side of it is not there as much, because, again, those people are usually like: yes, engineering team, roll out the new version. It's a Slack message to a person, and then a person goes and does a task.
Looks good to me. Yeah. And that's where we are. I think the challenge with LLMs is that they're squishy, right? And somebody needs to look at these risk scores, and there are new benchmarks and cool things coming out for
evaluating them. And then they make a decision about whether the risk is acceptable, to roll out this new version of whatever model we're using from whatever provider. We've done our tests, we have our evaluations, and now I have to make a decision, in my job as the product manager, as the whomever,
to roll this thing out. And you don't want to skip that, and you want to record it, so that we have a true understanding of how this thing got out. And then you may want to change your process, or change that workflow, to deal with whatever issues your organization might have in terms of its use of these AI technologies, right? Yeah, it's another guardrail, but it's like a business guardrail. A hundred percent. In
fintech, or finance generally, or really any regulated space, that is a necessity, because they're going to get audited. And so you've got to have that explainability of what exactly you were thinking when you put that out there. So it makes sense that this would come as a workflow, so that you have that specific point where someone pushed the button and said, yeah, we're good with that, and the logic behind why they chose that is recorded.
But I do like the idea of taking it out of engineering's hands and just getting a different set of eyes on it. We had Allegra on here a few weeks ago, and she was talking about how she's a big proponent of density of diversity in a room. And by taking things out of engineering's hands, it's not only engineers looking at it, and so, by way of that, you have a higher density of diversity. That sounds great. And workflow systems are a good tool
for defining that process, recording it, framing the metadata, and collecting information along the way. So you have a trail of what you did, and then you can act on that trail to make your processes better. Or when somebody just wants to know, what versions of this model are we using out in our systems? They might be different, right? You have that trail. And
people are doing this, again, some other way; maybe they have systems for it. But maybe they're not using these tools that are available. They don't have to build a special system for it; they can just use a workflow system that exists out there. So there's that build-versus-buy choice there too, right? And you know what this also makes me think of? A friend told me about
how wild any enterprise is right now when it comes to what AI they're using. And not just your ML teams. Think about governance
at the enterprise level, at the organizational level. Maybe the marketing team is using one business process software that has some LLM calls or AI capabilities within the tool. Then you've got the actual software engineers using these AI coding helpers. And you've got the HR team using some SaaS software that has some kind of AI capabilities. And
you don't realize it, but as an organization you are exposed. If your idea is, yeah, we're keeping all of our data inside, that is totally out the window, because everybody's using a different piece of SaaS tooling that is potentially sending the data wherever it needs to go. And so from a governance perspective, bringing these workflows in helps:
if you have it documented in workflows, you also have a bit tighter control on what is being used, how it's being used, and how these things are happening, I would assume. But I can imagine that things can slip through the cracks there too. Yeah, for sure. I mean, there are a lot of data governance challenges in the world today that...
are made worse, I think, and I'm not sure there's a clear way out of that box right now. Yeah. Even with workflow systems. Yeah, they're not going to help that much. That's the truth. Man. But yeah, on the data governance piece, I had a friend tell me that their company did an audit,
and they were expecting that folks were going to be using, like, 10 AI tools. And after they did the whole audit, and this is a relatively small startup, like 200 folks, midsize is what you would call it,
what they found is that there were over 92 tools being used that had AI capabilities or were AI tools. And they just looked at each other thinking, wow, this is wild. And there were a lot of repeat tools. So you have a lot of the same workflows, but in different parts of the organization, different branches. Maybe you're paying for
it twice. Even if you are okay using the OpenAI enterprise edition, maybe one branch is paying for that, and another branch is paying for individuals to be using it. And so all that governance is a mess.
So that was a challenge at my last company, partially because I am not a molecular biologist, right? I don't know what the standard tooling and resources are that they use online. And the way that we took that apart was this process engineering aspect: having those discussions
And drawing those pictures: what do you do in your day to do this task, what systems do you use, and where do you get that result from? Oh, we
go online, and we take that DNA and stick it into this tool and run a BLAST search. And then we do this other thing with this other site, because they're good at this particular thing I'm interested in. And you're like, okay. So there are all these touch points with different systems. And these were things that were very unique: isolates, strains, stuff like that. So that data, that genome, was our bread and butter.
So you have this challenge: if we take a little snippet of it, nobody knows where it came from, and so we're okay with that going out there. But there's a fine line, maybe a gray line, of how much is too much. And knowing those interactions all comes from doing this sort of
process mapping, I think. If you can draw a picture, a flow chart, whatever, pick your favorite tool, then you can ask: okay, what am I talking to? It also helps you with the data artifacts problem: what data went in, what data comes out? Do we care? Do we store that? Where does it go? Does it go into some knowledge graph? Does it need to be stored to have a full view of the experiment, or whatever we're doing here? For your clients, or for your industry, maybe you need to record that because
you were required to record those interactions, or because that record
is essential to your business. People do this, but they don't necessarily do it in a uniform way. And so that's where I was pushing on using this thing called BPMN. It's just a nice visual notation where people have considered all these problems, so we don't have to make one up; we can just use it, and there are tools for it. But that also means you can say: okay, I'm going and talking to this SaaS service. What does that service do? And then you can
you know, decide whether you trust them. What data are you giving it? What's your end goal? How are you trying to use it? Because maybe we already have another service that we're paying for, and you're paying extra for that one. I know that we had Maria on here probably two years ago now, and she did this with her company, Ahoyts, because they did not have any centralized ML platform. And so they just went to all the different teams and said: what are you using? How are you using it? And they recognized that they already had a lot of usage on Databricks. So they were like, we should probably standardize on Databricks so that it's cleaner. And just recognizing that from talking to people and drawing it out, I really...
there's so much value in that. But what you're saying is even more: it's like you're taking an Etch A Sketch and making a beautiful picture of what is happening at each step, and what the goal is for each of these steps. And being able to have that, you are so far ahead of the curve, because you understand each person's tasks, what they're doing and how they're doing it. But you also understand how that fits into different workflows, as you were saying. And not looking at it as one big workflow, but seeing all the different workflows and how they interact with each other. Because then you can ask all kinds of great questions: is it worth automating this? No? Do we need to be worried about data leakage from this service that you're using? Is there a better provider for that?
If we were to use some kind of new machine learning tech, generative AI or an LLM or whatever, some model,
where does it provide the most value in this process that we have, this big workflow? You can ask all these great questions, and you can get samples of your data, because now you've done the analysis. It might seem old school and hard to do, which is the challenge: oh, this seems like a massive undertaking. Let's start simple. Let's start with a little piece of it, or maybe a really big block diagram with big chunks of what we do on a daily basis. And then they're just black boxes, and we need to
dig into that when the time is right. But I think those different ways of making the problem smaller and more useful can be very strategic, including where you start. This is probably a bigger challenge for bigger organizations. But I also think that in smaller companies you need to think about this, because if you're going to use machine learning tech
in general, whether it's just inference or whether you're building models, it's not just those ML engineering teams in a corner that need the workflow systems and the process engineering. It's everything around it as well. And so you can go bottom up, but that's a hard sell. Or you can go more top down, in terms of: where is this going to provide value in our organization? And what do we, in our industry, whatever it is, need to be worried about?
And you have that. Again, a picture is an amazing thing, and it's a way to communicate outside of those technical groups that are used to graphs. You know what I find the most difficult in these? When you update processes: updating them in all the documentation, and then making sure that, okay, this is the newest way that we're doing things, even though maybe it's an experiment for a few weeks before you really know whether it is a better way of doing it. And...
Most of the time I'll update a process on the fly, and then later, maybe you don't codify that as well. And so it's updated in one place but not in another. The entropy of all the stuff that's happening is really hard to wrangle; it's like workflow debt, I guess you could call it. Right. And that's where, if the workflow system is how you get things done, right,
and it can produce documentation and diagrams and things like that, then that's where you go to look up how we're actually doing this. That's one of the core benefits: there's no out-of-sync, because it is how you do things. The challenge is when it fits into a bigger process that isn't automated, but I think it's a useful starting point for those discussions. And I think it makes the technical tasks easier, even if it's just a documentation artifact you're drawing, this diagram. You go back to the diagram and say: explain the change that you want in terms of changing this diagram, right?
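That "no out-of-sync" idea can be sketched concretely, without assuming any particular workflow engine: if the workflow is defined as data (or code), the diagram is generated from the very same definition, so the picture can't drift from the process. Here's a minimal illustration with a made-up dependency dict, rendered as Graphviz DOT text.

```python
# Hypothetical workflow: each task maps to the tasks it depends on.
workflow = {
    "ingest":   [],
    "clean":    ["ingest"],
    "train":    ["clean"],
    "evaluate": ["train"],
    "report":   ["evaluate", "clean"],
}

def to_dot(dag: dict) -> str:
    """Render the dependency dict as Graphviz DOT text.

    The diagram is derived from the executable definition, so
    documentation and reality stay in sync by construction.
    """
    lines = ["digraph workflow {"]
    for task, deps in dag.items():
        for dep in deps:
            lines.append(f'  "{dep}" -> "{task}";')
    lines.append("}")
    return "\n".join(lines)

print(to_dot(workflow))
```

Feeding the output to `dot -Tpng` (or any DOT viewer) gives you the picture to point at in those change discussions.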
And then you've started with a change to your description, your architecture or whatever, the process, the workflow. And then you go off and build the thing. And then there's sometimes a kind of reconciliation of, you know, the reality I found as I went to build it not quite matching what we thought it would be. But that's part of it: during an agile process, you should be coming back to that thing and revisiting it. Having a little bit of structure and diligence there helps, but it doesn't have to be overwhelming. But...
If everything were workflow, it would just all be up to date. Right. That's the ideal version of this thing. It's not the reality, of course. It would be so much easier.