
SE Radio 641: Catherine Nelson on Machine Learning in Data Science

2024/11/6

Software Engineering Radio - the podcast for professional software developers

People
Catherine Nelson
Topics
Catherine Nelson: The role of a data scientist varies with the work environment, but in general it involves translating business problems into data problems, solving them, and building machine learning-powered features. Data scientists need data skills (statistics, coding, machine learning algorithms, data visualization, data ethics, and so on), plus some domain and business knowledge in order to translate business problems into data problems. Their projects vary widely, and who they collaborate with depends on the project, but it usually includes product, the wider organization, and engineers. Data science is not limited to machine learning and AI; it also includes statistical analysis and data visualization. Machine learning trains a model for one specific problem, whereas an AI model can address many problems. Jupyter Notebooks are very useful for exploratory data analysis and initial modeling because they provide instant feedback; when you need to train a model repeatedly, optimize hyperparameters, and eventually deploy to production, it is time to move from a notebook to a traditional Git repository. Data scientists usually handle the initial exploration and model training, while machine learning engineers deploy the model to production and monitor it. At small companies data scientists often wear multiple hats, and even at large companies they may take on multiple responsibilities. Data scientists should learn to write tests and use version control to improve code quality and maintainability. Python is currently the most widely used programming language for data science. Software engineers working with data scientists should understand the uncertainty and iterative nature of data science projects.

Across the machine learning workflow, data scientists and software engineers need to collaborate closely on data ingestion, data validation, data preprocessing, model training, model analysis and validation, and model deployment. Data validation checks whether ingested data matches expectations, for example whether data is missing or contains errors. A pipeline is typically rerun because the data has changed or the model needs retraining to improve performance. How data quality is measured depends on the type of data: for numeric data you can check the mean and standard deviation; for text data you can check text length. If data validation fails, human intervention is usually required, such as rerunning the ingestion step or changing the input data. Deep learning has reduced the amount of feature engineering work to some extent, especially for text data. Within a pipeline, models are usually not trained from scratch; they are fine-tuned or retrained. Retraining typically covers the model architecture and hyperparameters. Model training is usually the most time-consuming step in the pipeline. As an organization's use of machine learning matures, models are retrained more and more frequently. Data scientists should be involved in setting up the initial pipeline, especially the data validation and model analysis steps. Model analysis and validation checks model performance, such as precision and recall, including performance on different subsets of the data. Model analysis also includes checking for bias, such as performance differences across certain groups or user types. Overfitting means a model fits the training data too closely and fails to generalize to new data. Model interpretability is usually not part of an automated pipeline because it requires manual work. Deployment hands the trained model over to software engineers so it can be integrated into the product and serve predictions; before deployment, the model's size and compute requirements should be considered. When building a pipeline, training scripts can be reused, but new code is needed to connect the steps. A common problem in machine learning is training-serving skew, where the feature engineering code used in training and in deployment is inconsistent. Emerging roles in data science may include generative AI engineers and AI model evaluators. The most effective collaboration between data scientists and software engineers comes from team members supporting and respecting each other's ideas. She is excited about the prospects for applications of large language models (LLMs). Future directions for machine learning projects include improving the accuracy of existing models and adding new features.

Philip Winston: Primarily guides the interview, asks questions, and summarizes and builds on Catherine Nelson's answers.


Chapters
The role of a data scientist varies across companies, but generally involves translating business problems into data problems and building machine learning models. Skills required include statistics, coding, machine learning algorithms, data visualization, and understanding data ethics.
  • Data scientists translate business problems into data problems.
  • Skills include statistics, coding, machine learning, data visualization, and data ethics.
  • Domain knowledge is crucial for understanding business context.

Transcript


This is Software Engineering Radio, the podcast for professional developers, on the web at se-radio.net. SE Radio is brought to you by the IEEE Computer Society and IEEE Software magazine, online at computer.org/software. SE Radio has an opening for a volunteer host to produce five episodes per year. To apply, please follow the instructions on our website at se-radio.net/contact.

Welcome to Software Engineering Radio. This is Philip Winston. My guest today is Catherine Nelson.

Catherine is a freelance data scientist and the author of two O'Reilly books: this year's Software Engineering for Data Scientists, and her 2020 book Building Machine Learning Pipelines, coauthored with Hannes Hapke. Previously, she was a principal data scientist at SAP Concur, and before that she had a career as a geophysicist. Catherine has a PhD in geophysics from Durham University and a master's in Earth Sciences from Oxford University. She is currently consulting for startups in the generative AI space. Welcome, Catherine.

Today we're going to discuss the role of the data scientist and how this role can overlap with or intersect with software engineering. Let's start with: what is a data scientist?

That's such a great question, because what a data scientist is depends on where you work. At some companies it can be more in the data analytics space, and at others you spend all your time training machine learning models. But overall, I'd say being a data scientist involves translating business problems into data problems, solving them where possible, and then sometimes building machine learning-powered features.

So what skills does a data scientist need, either prior to getting the role, or what skills do they need to develop to be good at the role?

They need to have skills for working with data, so those would include a knowledge of statistics and a knowledge of coding to be able to manipulate the data. They take courses in basic machine learning, learn about the algorithms that make up machine learning, data visualization, sometimes storytelling with data, how to weave those data visualizations together into a coherent whole. A lot of data scientists will take courses on data ethics and data privacy, because sometimes that is part of the data scientist's job as well. It's a real mixed bag.

It seems like data scientists need perhaps more domain knowledge or business knowledge than some engineering roles. Why do you think this is?

I'd say that's right. I think it's because you're translating the problems from a business problem to a data problem. So you might be tasked to answer a problem such as: why are our customers churning? Why do some customers leave the business? And you dig into the data to try and see what features of the company are correlated with them stopping using your product.

So it might be something like the size of the business, or they might have flagged, given you feedback, that gives some reasons for that. So you can't really answer a problem like that without having a good sense of what the business does, what products there are, how things fit together. So yeah, I think it involves a lot more context.

On a typical project, who does the data scientist have to communicate with, typically?

The interesting thing I've found with my data science career is I wouldn't say I have a typical project. So I've done some projects where it's been extremely exploratory. It's been like, we might be considering creating a new feature for the product.

Is this even possible? Is it really blue sky? And then there are other projects I've worked on where it's been towards the production end of things, deploying new models into production. So I'm going to be working with different people depending on the type of project. But some commonalities would be the product organization and, obviously, engineers, if it involves building features.

For most of the episode, we're going to be talking about machine learning and AI. But as I understand it, there's more to data science than just these two fields. Can you give me an example of a problem you solved or a solution you came up with in data science that didn't involve those?

Actually, the example that I just mentioned, looking at why customers might leave a business, involved no machine learning at all. It was a predictive modeling problem, but I didn't use a machine learning solution. So the projects that are more around answering questions, versus building features, are the ones where there is a lower level of machine learning and AI usage and more statistics, data visualization, and general data analysis skills.

I want to mention two past episodes related to data science. There's Episode 315, Jeroen Janssens on Tools for Data Science, from 2018, and Episode 286, Katie Malone on Intro to Machine Learning, from 2017. Katie Malone is a data scientist. So now let's move into talking about machine learning and AI.

To start with, what is the difference between these two fields? In my research I think I've seen that the term AI has been evolving a lot. I'm wondering what definitions you use.

The most useful definition I've heard, and the one that I've adopted and continue to use, was from a podcast that I heard with Christopher Manning. He is a professor at Stanford University in natural language processing. And that is that if you're dealing with machine learning, then you're training a model for one particular problem, one particular use.

But an AI model can answer many problems. So you might use your AI model to power a chatbot, but you could use the same model to summarize some text or extract some information from some text. Whereas in classical, traditional machine learning, if you wanted a model that extracted some information from some text, you would go and collect a dataset designed for exactly that problem. You take some of the input text and the output that you want it to produce, and then you train your model and measure how accurate it was on that particular problem.
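
To make the "one model, one problem" side of that distinction concrete, here is a minimal, hypothetical scikit-learn sketch (not code from the episode); the data, labels, and model choice are illustrative assumptions.

```python
# A minimal sketch of classical, task-specific machine learning: the model is
# trained for exactly one problem (spam vs. not spam) and cannot be reused for
# summarization or chat the way a general AI model can. Data is made up.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "win a free prize now", "meeting moved to 3pm",
    "claim your reward today", "lunch tomorrow?",
]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = not spam

# One dataset, one model, one task.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

print(model.predict(["free reward, claim now"]))  # likely [1]
```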

I'd like to talk a little bit about the use of notebooks, like Google Colab, in data science. This is a technique or method that I think is more common in data science than in software engineering at large. So I'm wondering, what are the pros and the cons of doing your work inside of a notebook?

Definitely, I'm a huge fan of Jupyter notebooks. I love being able to get the instant feedback on what my code is doing, which is particularly useful when I'm looking at data all the time. I can print that data.

I can graph it, really interact with that data while I'm coding. I find them incredibly useful when I'm starting a project, when I don't quite know where things are going, when I'm really exploring around and trying to see what the data I'm working with can do for me, or when I'm starting with a basic machine learning model and seeing if it learns anything about the problem that I'm working on.

What sorts of signs are there that maybe you need to switch to just a traditional Git repository? What starts to become difficult with a notebook?

For me, it's when I'm at the point where I want to train that model repeatedly. So in the machine learning problem, I have chosen the features that I want to work with, I've chosen the data that I want to work with, I've trained an initial model and it's getting reasonable results, but then I want to train it repeatedly and optimize the hyperparameters, and then eventually move towards deploying it into production.

So I think that the main difference is that when I just have code that I may only run once, when I don't know where I'm going, when I don't know exactly what the final codebase will look like, that's when I'm happiest in the Jupyter notebook. But when it's going to be run repeatedly, when I need to write tests, when I need to make sure that code is robust, that's when I do that refactor.
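
As a rough sketch of that refactor (an illustration, not code from the episode), exploratory notebook cells might be pulled into a plain function so the same training run can be repeated across a hyperparameter grid; the dataset and parameter values are placeholder assumptions.

```python
# Notebook code refactored into a reusable function so training can be run
# repeatedly with different hyperparameters. Dataset and grid are placeholders.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split


def train_model(X, y, param_grid):
    """Run a repeatable training job over a hyperparameter grid."""
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.2, random_state=42
    )
    search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
    search.fit(X_train, y_train)
    return search.best_params_, search.best_estimator_.score(X_val, y_val)


if __name__ == "__main__":
    X, y = load_iris(return_X_y=True)
    params, val_score = train_model(
        X, y, {"n_estimators": [50, 100], "max_depth": [3, None]}
    )
    print(params, round(val_score, 3))
```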

In a little while, we're going to talk through the steps in the machine learning workflow, focusing on what it would be like to make an automated, reusable pipeline out of them. But let's talk a little bit more about roles. I think you mentioned data analyst relative to data scientist. Let's talk about a machine learning engineer; that certainly comes up. What is your feeling about their role and how it differs from either data scientist or software engineer?

At many companies, I think it's the data scientist that will make some initial explorations, take a fresh problem and say, like, is this even a problem that we should be solving with machine learning? What types of algorithms are suitable for this particular problem? Train an initial model, prove that that model is going to answer the question that's under consideration.

And then it's the machine learning engineer that takes over when that has been established, when these initial experiments have been done, and then puts that model into production. And then they look more on the side of monitoring that model, checking its performance, checking that it returns its inferences in the right amount of time.

And so how about at a smaller company? I imagine that people end up wearing multiple hats until you're able to hire for all these different roles. Have you seen that?

Absolutely. And sometimes at the bigger companies, too, even if the data science team is more than one person, you might wear a lot of hats. That's challenging because it's a different mindset that you have when you're running experiments and being very open to trying lots of different things, versus when you're in your production mindset where everything has to run repeatedly, where it has to be very robust.

Let's specifically zoom in on the relationship between data science and software engineering. What motivated you to write your latest book, Software Engineering for Data Scientists?

A couple of things, really. One is that it's a book that I wanted to read earlier in my career as a data scientist. A while ago, I joined a team where I was the only data scientist among developers and designers.

And I found it hard to just even understand the language that the developers were using. Like many data scientists, my education didn't include any kind of computer science courses. I didn't have that much familiarity with software engineering ways of working when I started as a data scientist.

I had questions like, what is an API? How do you build one? And I started getting interested in how I could write better code.

And in the books that were available, the examples were in Java, or they were about web development. They weren't very accessible to me, writing code in Python and not needing to have all the skills and background of a web developer. And also, data scientists have a reputation for writing bad code, and I wanted to help change that.

I think you answered this. I was going to ask, how much software engineering training do data scientists have? And I think you're saying that on the low end, it can be more or less none.

Yes, there are a couple of main routes for people getting into data science. One is from a hard science background, often physical sciences or other science PhDs. So they have the data analysis skills, but they might be writing more academic code, which doesn't need to be particularly well tested. And another way is through data science undergraduate degrees or master's programs, which may include some level of programming courses, but there are so many things to try and cover in a data science degree that it's hard to go into that in depth.

From your book, can you pick just maybe two skills that you think would be most beneficial for a data scientist to learn?

One would be writing tests; that's often a gap. It's often something that's not familiar to people from a data science background. And then, because data science projects can be so ad hoc, so exploratory, it's not obvious when to add tests.

You can't add tests to every single piece of code that you're writing in a data science project, because half of it you're going to throw away when you find that that particular line of inquiry goes nowhere. There's not really a culture of going back and adding those tests later, but if you then move on from that exploratory code to putting your machine learning model into production, it's a problem if that code isn't tested. Another one, and this again comes from the exploratory nature, is that data scientists are often reluctant to use version control. When it's just an individual project, it seems like it's more hassle than it's worth. It's not obvious what the benefits are until you start working on a larger codebase.
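
For illustration, here is a minimal sketch of the kind of test that pays off once exploratory code heads toward production; `clean_text` is a hypothetical helper, not something from the episode or the book, and the tests are runnable with pytest.

```python
# A small pytest example: pin down behavior of a data-prep helper that
# would otherwise only be checked by eye in a notebook.
import re


def clean_text(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", "", text)
    return re.sub(r"\s+", " ", text).strip()


def test_clean_text_strips_punctuation_and_case():
    assert clean_text("  Hello, WORLD!! ") == "hello world"


def test_clean_text_handles_empty_string():
    assert clean_text("") == ""
```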

What programming languages are commonly used for data science? I know Python is common in machine learning in general, but are there other, more data science-specific languages?

I would say Python is the biggest data science language at this point. Previously, R was pretty big as well, but its share is declining a bit. Some people use Julia, but it hasn't received the widespread adoption that Python has.

Those are the other two I had down here, R and Julia. Maybe tackling the same question from a different angle: what should software engineers keep in mind about working with data scientists?

I think that data scientists will be coming from a different mindset than a software engineer. They're used to tackling very vague problems and turning those into more data-focused problems.

And the thing with data science projects is that you often don't know at the start where you're going to end up. If you are working on a machine learning problem, you might find that the algorithm you end up using is very simple, or you might end up using something that's very complex. You might start off with, say, a text data extraction problem.

You might start off trying a random forest-based approach with some very simple text features, but that doesn't actually perform very well; the accuracy is low. Then you might move on to trying a deep learning model, seeing if that works any better. So this means it's hard to estimate, at the start of the project, how long it's going to take or even what the final outcome is going to be.

Is this going to be a large, two-gigabyte model where we're going to need some specialized infrastructure to deploy it? Or is it going to be very small and scale easily? So I think keep in mind that uncertainty; it's not that the data scientist is just bad at estimating, it's that the nature of these projects means it's not clear from the start what the end is going to look like.

I could imagine a scenario where the software engineers are eager to dive deep into the implementation phase, but the data scientist hasn't yet settled on the model for sure. And that might take some patience and some time to iterate before it's really going to be ready.

Let's go through some typical machine learning workflow steps and explain, from a data science point of view, what some considerations are and what procedures or techniques might be used. And if we're working with software engineers to create an automated pipeline, what are some things to keep in mind about each particular workflow step? Basically, what sorts of tools or techniques should we keep in mind for each step? So the first one I have down is data ingestion. I guess there are a lot of different projects, but what are some things we might be ingesting, and what are we feeding it into?

Yeah, this step is when you take your data from wherever it's stored in your company's infrastructure and feed it into the rest of the pipeline. This is the point where you might also make the split into training data and validation data. It's picking up that data from whatever format it's stored in and then potentially transforming it into a format that can move through the rest of that pipeline.
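
Here is a minimal sketch of that step, under assumed column names and with an inline DataFrame standing in for a real source such as a warehouse table or CSV export; it is an illustration, not code from the episode.

```python
# Ingestion sketch: pick the data up, get it into a workable format, and split
# it into training and validation sets. Columns and values are made up.
import pandas as pd
from sklearn.model_selection import train_test_split

# In practice this might be pd.read_csv(...) or a warehouse query.
df = pd.DataFrame({
    "account_age_days": [30, 400, 120, 800, 60, 250],
    "support_tickets": [5, 0, 2, 1, 7, 0],
    "churned": [1, 0, 0, 0, 1, 0],
})

X = df.drop(columns=["churned"])  # features
y = df["churned"]                 # label

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.33, random_state=42
)
print(len(X_train), "training rows,", len(X_val), "validation rows")
```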

Can you give an example of unstructured or structured data?

Usually we'd call text or images unstructured data, and structured data is data that's in a tabular format. So the tabular data could be data about the sizes of companies that you're considering, and the unstructured data could be something like the text of a complaint that you are trying to make a prediction from.

So are there any specific tools that might come into play here, whether they're standalone tools or libraries that are commonly used for ingestion?

There are a few different solutions for the entire pipeline that have a data ingestion component. TensorFlow Extended is one of these. There's also Amazon SageMaker Pipelines, and I believe MLflow has a similar structure, though I haven't worked with that myself.
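
Those frameworks each have their own component APIs; as a framework-agnostic sketch of the same idea (placeholder step implementations, not any particular tool's API), the pipeline can be thought of as a chain of functions:

```python
# A framework-agnostic pipeline skeleton: each step is a function, and the
# orchestration code wires one step's output into the next. Real tools such as
# TensorFlow Extended or SageMaker Pipelines provide component APIs for this.

def ingest() -> list[dict]:
    # In practice: read from a warehouse, bucket, or database.
    return [{"text": "example complaint", "label": 1},
            {"text": "all good, thanks", "label": 0}]


def validate(records: list[dict]) -> list[dict]:
    # Stop the pipeline early if the data looks wrong.
    if not records or any(not r["text"].strip() for r in records):
        raise ValueError("data validation failed: empty or missing text")
    return records


def preprocess(records: list[dict]) -> tuple[list[str], list[int]]:
    return [r["text"].lower() for r in records], [r["label"] for r in records]


def train(texts: list[str], labels: list[int]):
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline
    model = make_pipeline(CountVectorizer(), LogisticRegression())
    return model.fit(texts, labels)


def run_pipeline():
    texts, labels = preprocess(validate(ingest()))
    return train(texts, labels)


if __name__ == "__main__":
    run_pipeline()
```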

The next step I have is data validation. You talked about dividing the data into training and validation sets, but that might be a different type of validation.

Yeah, that's right. So when you divide your data into a training and a validation set, that is used during the model training or the model analysis and validation step, to check whether the model is sufficiently accurate for the problem that you're working on. Data validation is checking that the data that you ingested is what you expect. So some problems that you may have with that data could include that the data is missing; something's gone wrong upstream and suddenly you're getting null values in your data, and then your machine learning model wouldn't be able to train with those null values. And so the point of the data validation step is that if there's a problem with your data, you can stop the pipeline at this point rather than go through the lengthy training step only to find out that there's an error at that point, or your model isn't as accurate as you expected because there's a quality issue with the data.

Let's pause for a second and talk about when we would rerun the pipeline, or why we would rerun the pipeline. If this was just a one-off exploratory investigation, we'd create a model, produce the validation, and that would be the end. But in this case, we're talking about building a pipeline.

So when is it that we rerun this pipeline? Is it because we have new data? Is it because we're trying to train a better model? And I guess, related to that, do we rerun the entire thing, or is it possible to rerun portions of it?

So for many business problems, the data doesn't stay static. The data changes through time; people behave differently with your product and so on. That causes the model performance to degrade with time, because if you've trained a model at a specific point in time, it's been trained on that data, and then as your usage patterns change, that model is not quite as relevant to that data. So the performance drops, and that's the time when you might want to retrain that model. And usually you would want to run the entire training pipeline all the way through. If you just run part of it, you don't actually change anything, because the artifact that you get at the end of the pipeline, which you're going to deploy into production, is that model trained on the updated data.

So as part of validation, how do you measure data quality? Or under what situations would the data fail validation? That might be specific to a project.

Yeah, if you had some numeric data, then you would look at basic statistics of that data, like the mean and the standard deviation. You could look at the proportion of nulls in that data; if that goes up, that's a signal that your data quality has decreased. If you're using text data, then it's less obvious what to check, but you could check the length of that text. If something has gone wrong upstream, you might be getting empty strings coming through into the pipeline, so that's something you could check.
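
A minimal sketch of those checks might look like the following, for a hypothetical DataFrame with a numeric "amount" column and a text "complaint" column; the thresholds and column names are assumptions you would tune per project.

```python
# Simple data-quality checks: null proportion and mean for a numeric column,
# empty strings for a text column. Placeholder thresholds and data.
import pandas as pd


def validate(df: pd.DataFrame) -> list[str]:
    problems = []

    # Numeric column: proportion of nulls and basic statistics.
    if df["amount"].isna().mean() > 0.05:
        problems.append("too many nulls in 'amount'")
    if not (10 < df["amount"].mean() < 10_000):
        problems.append("'amount' mean outside expected range")

    # Text column: empty strings sneaking through from upstream.
    if (df["complaint"].fillna("").str.len() == 0).any():
        problems.append("empty strings in 'complaint'")

    return problems


df = pd.DataFrame({
    "amount": [120.0, 85.5, 240.0],
    "complaint": ["late delivery", "billing error", "item arrived damaged"],
})

issues = validate(df)
if issues:
    # In a pipeline, this is where you stop before the expensive training step.
    raise ValueError(f"data validation failed: {issues}")
print("data validation passed")
```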

And what are your options if the data fails validation? Is that basically signaling for someone to intervene? Or is there any automated step you could take to allow you to continue?

You could consider rerunning the ingestion step if it's something that's gone wrong in that step, or you could change the data that you are putting into that pipeline. But in general, it's a kind of safety valve against the final model being incorrect, rather than anything that you'd change automatically.

That kind of raises the question, how long is this whole pipeline going to take? I'm sure it varies drastically by application, but in the systems you've worked on, can you give me an idea of the range of time the full pipeline takes? The reason I ask is because if we're preventing proceeding with bad data, we're saving that amount of time. So if the whole thing was very short, it wouldn't be a big deal, but if the whole thing was long, then an early out could benefit us.

For the systems I've worked on, it's usually been on the minutes-to-hours scale. So it's not days and days. But the point is that, ideally, you would have this pipeline set up so that it runs automatically, without any input from myself, without needing to do anything. So it's more about being able to automate it from start to end than about the time saving in particular, but that's in there.

I think I might know the answer to this. But does the data have to be perfect? Or how can we judge how tolerant our pipeline is to bad data?

I think that the data should be reflective of the real world that it's trying to model. So if you have data about your customers, about a bunch of different companies, that's going to be very variable. One thing is that some machine learning algorithms can't cope with missing data. So in that situation, all values do need to be filled out, and it needs to be perfect from that point of view. But it can have a very wide distribution, and that's fine.

So let me read the next four steps, so we have some idea where this is going and maybe what to talk about at each step. I have next data preprocessing, then model training, then model analysis and validation, and then deployment. So let's talk about data preprocessing. I don't know if this is an official step or if it depends on the workflow, but how is preprocessing different from the previous steps?

Preprocessing is often synonymous with feature engineering. So that's translating the raw data into something that you can use to train the model. If your raw data was text, then that might be word frequencies or something like that. And that's different from the validation step, because in the validation step you are describing the data, you're checking that the data doesn't contain nulls and so on.
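
As a minimal sketch of the "word frequencies" style of feature engineering (example sentences are made up, not from the episode):

```python
# Raw text in, numeric matrix out: the model trains on word counts.
from sklearn.feature_extraction.text import CountVectorizer

raw_text = [
    "flight delayed by two hours",
    "refund requested for delayed flight",
]

vectorizer = CountVectorizer()
features = vectorizer.fit_transform(raw_text)   # sparse word-count matrix

print(vectorizer.get_feature_names_out())
print(features.toarray())
```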

Has deep learning kind of eroded the need for feature engineering? I remember I worked on a project a long time ago and a huge amount of effort was put into feature engineering. And more recently, I worked on something where they were kind of saying that feature engineering goes away in some cases. What is your experience?

Yeah, I think that's right. I've worked on a lot of text models especially, and it's become a lot better to not do much with the text, to put it in pretty much raw, and have a more complex model that is able to learn a lot more from that text, rather than doing extensive engineering to extract features from the text and then train the model. That seems right to me.

Okay, let's move on to model training. How about this idea of training from scratch versus fine-tuning an existing model? Are both of these possibilities in a pipeline?

So there are actually three possibilities. There's training from scratch, there's fine-tuning, and there's retraining the deployed model on new data. Training from scratch, I wouldn't do that in my machine learning pipeline.

I would do that separately in standalone code to get that model established the first time around and check that it is actually accurate enough to solve the problem. Then I would build a machine learning pipeline only when I knew that I had that model and was going to be retraining it. Fine-tuning you can certainly do within the pipeline, because you might want to tweak the hyperparameters of that model with the new data, so you might want to have a small step for that.

You mentioned hyperparameters. I was wondering, when you say you have the model and then retrain it, what is the model at that point? Is it all the parameters associated with it? I guess, what would be the part of the model that then gets retrained?

Yeah, that's a great point. Sometimes it's the model architecture and those hyperparameters, and sometimes it's just the model architecture. So if you're in the neural network world, then the number of layers in that model, the types of layers, and how they're connected, that's probably going to stay static, because changing that up within a pipeline is hard. You don't have quite such instant feedback on whether the model is working as you do in a separate piece of training code that's designed to run through those experiments. You've got a lot of other code around that model, which makes it a little more complex to debug.
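
For illustration, a minimal "retrain, don't redesign" sketch might freeze the architecture and hyperparameters chosen during earlier experiments and simply re-fit them on fresh data each run; the model type, values, and data here are assumptions.

```python
# The configuration is fixed up front; each pipeline run re-fits it on the
# latest data. Dataset and hyperparameter values are placeholders.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Frozen configuration from the experimentation phase.
HYPERPARAMS = {"n_estimators": 200, "max_depth": 3, "learning_rate": 0.05}


def retrain(X, y):
    model = GradientBoostingClassifier(**HYPERPARAMS)
    return model.fit(X, y)


# Each pipeline run uses whatever the latest data is.
X_new, y_new = make_classification(n_samples=500, random_state=1)
model = retrain(X_new, y_new)
print(model.score(X_new, y_new))
```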

You mentioned the duration of the pipelines you've worked with ranges from minutes to hours. Is most of that time in the model training?

Yes. Yeah, that's right. Usually the other steps are shorter, and it's the model training that's the longest. So that's why it's important to have those other steps, so that you know your data is in good shape by the time it gets to the time-consuming training step.

Another element of time would be how long we're going to use a model before retraining. How does that vary? I think, very early on, people used models for a long period of time, and more recently, I feel like people are retraining more and more often. Is that a trend?

Yeah, I think from the ones I've worked on, that depends on the maturity of the use of machine learning in that organization. So early on, you might build these models fairly ad hoc, and then it's a big effort to deploy them into production. But when you do that, it makes a big step change in the accuracy of your product. Whereas as time goes on, you're making smaller improvements in your product, but you want to make them more frequently. So having that pipeline set up allows you to change your model often as the input data changes. So I think that's where things are now.

I guess, taking a step back for a second: during all of these steps of creating a pipeline, in what cases are we able to just hand this over to software engineers and kind of give them the information about the model? And in what cases do you feel the data scientist needs to be involved? What's the trade-off between either a hand-off situation or a collaboration, like a side-by-side situation?

Part of this is going to depend on the team that you have and the skill sets available.

But I would say it's very useful to have the data scientist involved in setting up the initial pipeline, in particular for things like: what are the criteria for the data validation step? What is a sensible distribution of your data? What are the hyperparameters that you should be considering when you are training the model? And particularly in the step that we haven't talked about yet, which is the model analysis step, I think that's where the data scientist has a really crucial part to play. I think any data scientist can learn the skills they need to deploy a pipeline, but often being able to debug that complex system, being able to set it up so that it interfaces with the rest of the product, making sure that it's well tested, that's where a software engineer can add so much value.

I've worked with many scientists unrelated to data science, just sort of in biology or physics. And yes, they can learn to code just as part of their education, sort of on the side. But then there are certain skills that they don't have as much experience in. But I definitely feel that many people end up learning to program by necessity, and I think that's a good thing, for the most part.

And for me, it's also because I enjoy being able to write better code. It's a lot of fun being able to do this well and write code that's robust and scales.

So you mentioned model analysis and validation; that's the next step. Because the word is the same, how is this different from data validation? I guess the question is, what are we validating?

Yes. So this is where we are looking at the performance of the model in terms of how accurate it is, what the precision and recall are, and also sometimes splitting that accuracy down into finer-grained sectors. So if you had a model that you are deploying in lots of different countries, does it perform equally well on the data from all those countries?

That's something that you could do with your validation data, which is the split of your data that doesn't go into the training data, the validation or test data. And I know that we're using the word validation many times here, but that seems to be the way the terminology has gone. So analysis is looking at that accuracy across different aspects.

This is the point where you might look for bias in your model as well. Is it providing better performance for certain groups? Is it providing better performance for your female users versus your male users? That would be something you'd want to look for at this step. And then the validation part is that the model should only be deployed if it's acceptable on all the analysis criteria. So this is kind of your final go or no-go step before you deploy that model into a production setting.

I wanted to flag the terms precision and recall. We're not going to try to go through all that, but I'm guessing that relates to false positives versus false negatives.

Yes, for a classification problem those would be some of the metrics, things like that.

And when you say deciding whether the model is good enough, the reason we might want to make that decision is maybe we have a previous model that was pretty good already and we don't want to make it worse.

Exactly. That's exactly right. You've taken the same model and retrained it on new data; does it perform better as a result?
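
To make the analysis-and-validation step concrete, here is a minimal sketch of overall precision and recall, the same metric sliced by a group column as a bias check, and a simple go/no-go gate; all data, column names, and thresholds are made-up assumptions.

```python
# Model analysis sketch: overall metrics, sliced metrics, and a deployment gate.
import pandas as pd
from sklearn.metrics import precision_score, recall_score

results = pd.DataFrame({
    "y_true": [1, 0, 1, 1, 0, 1, 0, 1],
    "y_pred": [1, 0, 0, 1, 0, 1, 1, 1],
    "country": ["US", "US", "US", "US", "DE", "DE", "DE", "DE"],
})

print("overall precision:", precision_score(results.y_true, results.y_pred))
print("overall recall:   ", recall_score(results.y_true, results.y_pred))

# Slice the metrics by subgroup to look for uneven performance.
for country, grp in results.groupby("country"):
    print(country, "recall:", recall_score(grp.y_true, grp.y_pred))

# Go / no-go gate before deployment (threshold is a placeholder).
assert recall_score(results.y_true, results.y_pred) >= 0.7, "do not deploy"
```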

How about overfitting versus generalization? I don't know if those are too technical, but can we just give an idea of what those have to do with model analysis?

This is where, if you have overfit your model, you'll see a higher accuracy on your training set than on your validation set, and then you'll know that your model is too closely replicating your training set and it's not able to generalize to new data. And what you want it to be able to do is exactly that: to generalize to new data when you deploy it into production.
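
A minimal sketch of spotting that symptom (synthetic data and a deliberately oversized tree, purely for illustration):

```python
# Overfitting check: a large gap between training and validation accuracy is
# the classic sign the model has memorized the training data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=20, flip_y=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

model = DecisionTreeClassifier(max_depth=None, random_state=0).fit(X_train, y_train)

train_acc = model.score(X_train, y_train)
val_acc = model.score(X_val, y_val)
print(f"train accuracy {train_acc:.2f}, validation accuracy {val_acc:.2f}")
```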

This might be a naive question, but is the size of the model unrelated to the amount of training data you have, or does the model size grow depending on the dataset?

If you have a small dataset and a large model, then it's very prone to overfitting, because your model is basically able to memorize the data that you have. So then it might not perform that great when it sees new data. Yeah, that's a good question.

And I have this down as interpretability. I don't know if that's really part of this step, but I guess that's an element of a model: whether you really understand what it's doing or whether it's sort of a black box.

Yeah, that's definitely part of model analysis, but it's probably not something that you would put in the pipeline, because for models where you need to explain the predictions, that's almost the opposite of automating the problem.

If you need to look very carefully into what features are causing it to make a certain prediction, you can't really do that as part of an automated setup that's going to deploy a model as soon as it's trained. You've got to take that step back. If you needed to interpret your model, you might run the pipeline up to this point, but then maybe this is where your data scientist steps in to really take a good look at that model, and then it's a manual step to deploy it to production.

That also kind of raises the point: maybe if we have a large team, we're working on building this pipeline with the involvement of data scientists and software engineers, but then maybe there are some other data scientists working on sort of the next-generation model or something like that. So these things could be happening in parallel; it's not like we just drop everything and build a static pipeline.

That's right, yeah.

So let's start talking about deployment. At this point, we have a pipeline. We can run it, hopefully with minimal intervention, hopefully it runs kind of straight through, and maybe it retrains with more data. What is different about deployment, or maybe production? I'm sure it varies based on the project, but in general, what is this transition from "I have a pipeline I can run" to "it's been deployed," which maybe means it's running at a larger scale or running more often or something?

Yeah, this is really the point at which I kind of hand off my tasks to the software engineers as well. The pipeline produces a model artifact, which can be a saved set of model weights, and then that's when that model gets handed over and is set up to run inference. So that's really the point at which the data scientist's job is done: the newly trained model has been produced, and now it can be put into the product so that it can provide the service.
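
Here is a minimal sketch of that hand-off artifact, assuming a scikit-learn model and joblib serialization; the file name and model are placeholders, not anything specified in the episode.

```python
# Pipeline side: train and save the artifact. Serving side: load and predict.
import joblib
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# End of the pipeline: train and save the model artifact.
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)
joblib.dump(model, "model-artifact.joblib")  # placeholder file name

# Serving side (owned by the software engineers): load the artifact, run inference.
serving_model = joblib.load("model-artifact.joblib")
print(serving_model.predict(X[:1]))
```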

How does scaling enter into the equation here? Is it possible that our model is too large or too computationally expensive, so that we can't put it in production? Or is that something you've thought about from very early on? Is there sort of a point here where we have to decide if we can even deploy this?

Ideally, you would know that before you've started building the pipeline, because a lot of what affects that will be the model architecture, the number of layers, the size of the layers, if you're dealing with a neural network. So ideally, you would want to know what some of the requirements are for inference at the experimental stage, when you're trying out lots of models. Because if you need your model to be extremely fast, you might limit yourself in these experiments to models that are small and fast.

This might apply to all the steps, not just deployment, but what is your opinion on heavily refactoring and evolving the initial exploratory code versus rewriting, having a sort of clean start, when you're implementing the pipeline?

A lot of the pipeline solutions, like TensorFlow Extended or Amazon SageMaker Pipelines, will take as inputs scripts like your training script. So you don't necessarily have to rewrite; you can just kind of pick up the code that you used for training and drop that into the pipeline code. That works pretty well because you already know that it's working; you don't have to completely rewrite from scratch. But a lot of the boilerplate code around the pipeline is new. That's coming up from scratch, to actually link those pieces together and make sure that one step flows to the next and to the next one.

Something we didn't talk about at the start was MLOps, which I guess is an offshoot of DevOps that's more specific to machine learning.

Do you have any experience with sort of what could go wrong in deployment that maybe is special to machine learning? In regular backend work, there are certain problems that crop up. I'm wondering if there's anything machine learning-specific, maybe having to do with monitoring the inference time or the memory use or something.

So one problem that I have heard about but not experienced myself is called training-serving skew. And what this is, is when you have feature engineering code and model training code, and you need to do that feature engineering when your model is deployed and running inference, as well as when you are training the model.

So what you might do is update that feature engineering code in your training pipeline, and then train a model based on those particular features, and then deploy that model and forget to update the feature engineering code on the serving side. So when your model is running inference, the data that is coming in is getting the old feature engineering, but the new model. So the data might be completely valid after it's been through that feature engineering step.

The features might be completely valid, but they aren't the right distribution for that particular model. So your model performs worse and you're left questioning why. So that's a subtle one that can come up.
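
One common guard against that skew, sketched here as an illustration (the feature choices and function name are assumptions), is to keep feature engineering in a single shared function that both the training pipeline and the serving code import, so the two cannot silently drift apart.

```python
# One function, imported by both training and serving, so the feature
# engineering can't get out of sync between the two.
import math


def engineer_features(record: dict) -> list[float]:
    """The one place feature engineering happens, for training AND serving."""
    text = record.get("text", "")
    return [
        float(len(text)),                       # text length
        float(text.count("!")),                 # exclamation marks
        math.log1p(record.get("amount", 0.0)),  # log-scaled amount
    ]


# Training side and serving side both call the same function.
train_features = [engineer_features(r) for r in [{"text": "late!", "amount": 10}]]
serve_features = engineer_features({"text": "where is my order?", "amount": 25})
print(train_features, serve_features)
```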

Yes, that does sound like very machine learning-specific debugging, where the behavior of this model that we validated and analyzed is not what we thought it would be. And maybe we have to roll something back, or maybe we just have to fix this.

So it's good to have monitoring in production to be able to sniff out these kinds of things, to check that the model accuracy is what you expect.

Okay. It sounds like we did all the steps of the pipeline, so let's start wrapping up. We talked about different roles throughout this episode. Do you see any new roles on the horizon, or roles that you think are changing or becoming more prominent?

Yeah, I think there are a couple of things here, now that I've started working on generative AI solutions. AI engineer is the obvious one: someone who is not necessarily building the model or training the model, but is designing applications based on AI models. That's huge, and it's only going to continue to grow. The other thing I see is that I think there is a big place for data scientists in the world of AI, and that's in evaluating AI models. So if you're trying to use an LLM for some particular business application, it's actually very hard to check how accurate that model is. So I think that's a huge growth area for data science.

And again, I think, yeah, when we say AI engineer, we're talking about people working with foundation models of some kind. It could be an LLM or not. Maybe they're just working through an API and they don't run any machine learning locally at all. I guess in some cases, they're programming, you know, in English, writing prompts and things.

Yes, I think that's going to be here to stay, and the area is huge.

On this show we try to focus on collaboration, moving beyond just the software and the machine learning. What collaboration methods have you found useful between data scientists and software engineers or other members of the team, whether it be a tool or just a technique that you find helpful?

I think the best way of collaborating is having a team that's open to ideas. It doesn't really come down to any tools or techniques; it's all about valuing each other's ideas. And the best teams I've worked on for that have been where people are very supportive of each other and supportive of someone bringing in new ideas. That seems to me to be the key, rather than any particular tool or piece of software.

So continuing to wrap up, what are you excited about looking ahead in machine learning projects you're working on or that you see in the wider industry?

So, having relatively recently started working with LLMs, I'm just so blown away by the capabilities at the moment. I've been working on a project to showcase a good example of a use of LLMs for a startup I'm working with, and the project we decided to choose was extracting people's flight details out of an email. So you send the service an email with your flight details, and it will extract the origin, the destination, the time of departure, the time of arrival and so on, and populate those into whatever kind of app you want them to go into.

And I've worked on similar products before and seen things like big piles of regular expressions, or doing all this complex feature engineering to get this out of it. But now I can do it in a five-line prompt to OpenAI, and it works better than all those previous, incredibly complicated solutions.

And you can even do things like, you can ask it for the airport code instead of the name of the city. And even if the airport code isn't in the email, you can still get that, because the LLM has that context. So I'm just really excited about what we're going to be able to do with these things in the future.
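
A rough sketch of that kind of short extraction prompt, using the OpenAI Python client; the model name, email text, and output fields are assumptions for illustration, not details from the episode.

```python
# Extract flight details from an email with a short prompt to an LLM.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

email = """Your booking is confirmed: SEA to LHR, departing 2024-11-06 18:35,
arriving 2024-11-07 11:50."""

prompt = (
    "Extract the origin airport code, destination airport code, departure time, "
    "and arrival time from this email. Reply as JSON with keys "
    "origin, destination, departure, arrival.\n\n" + email
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```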

Yeah, I guess that leads to the question: at the end of a project, all the technical success is there, but what is the business value? And in the projects you've worked on, what are they doing for the next version? Is it usually kind of doubling down on the same techniques, or is it tackling a completely new area of the business? What are the possible directions at the end of a project?

Yes, in some of the projects I've worked on, it adds a new feature that wasn't possible without machine learning, and a lot of these have been extracting information out of unstructured data. That gives you the capability to add something that you didn't think you were going to be able to do, to offer a new feature to your customer, and then you might spend some time optimizing that so that the accuracy improves and so on. So I think, yes, it's the balance between improving these existing models, improving the accuracy, and then the step change of adding a completely new feature.

Okay, where can listeners find out more about you or your new book?

The best place is to follow me on LinkedIn.

Okay, I will put your handle or your LinkedIn name in the show notes. Thanks for talking to me today, Catherine.

Thanks. Well, it's been great talking to you.

This is Philip Winston for Software Engineering Radio. Thanks for listening.

Thanks for listening to SE Radio, an educational program brought to you by IEEE Software magazine. For more about the podcast, including other episodes, visit our website at se-radio.net.

To provide feedback, you can comment on each episode on the website or reach us on LinkedIn, Facebook, Twitter, or through our Slack channel at seradio.slack.com. You can also email us at team@se-radio.net. This and all other episodes of SE Radio are licensed under the Creative Commons 2.5 license. Thanks for listening.