
218: Balancing test coverage with test costs - Nicole Tietz-Sokolskaya

2024/4/18

Test & Code

People

Brian: Python developer and podcast host, focused on testing and software development education.

Nicole Tietz-Sokolskaya: Principal software engineer at Remesh, focused on performance, scaling, and backend work.
Topics
Nicole Tietz-Sokolskaya: I think software engineers often don't discuss testing with much nuance; the conversation tends to get stuck in a standoff where product managers want to move faster and engineers want more testing. We need to step back and think about the purpose of testing, which is to reduce risk, weigh the costs of testing against its benefits, and consider the marginal value of each test. Blindly chasing code coverage can backfire, for example by writing unnecessary tests just to raise the number, or by treating a coverage drop after a refactor as a sign of under-testing when it may simply reflect leaner code. I personally no longer use code coverage as a metric; I care more about the context of what the tests cover and about end results, because the context around covered code matters more than the raw percentage. Not all code needs tests, and some tests provide little value, for example tests of CSS code. I mainly pay attention to coverage from end-to-end tests and care much less about coverage from unit tests. In end-to-end testing, the code that handles exceptional situations also needs to be exercised; mocking or deliberate architectural design can simulate the various failure scenarios. How a system handles expected errors is part of its behavior and should be tested. Deciding whether to test a given situation means weighing how likely it is against how bad the consequences would be; for some low-probability errors, monitoring and alerting can make more sense than writing lots of test cases. Deciding what to test should be driven by business impact, with testing effort concentrated on the features that matter most to the business. Performance tests have to simulate realistic load, otherwise the results are meaningless; ideally they simulate end-to-end load on the whole system, and monitoring data can be used to check that the performance tests reflect reality. You have to weigh the cost of an outage against the cost of writing and maintaining tests. Maintaining test code is itself expensive, so there is a balance to strike between test scope and maintenance cost. End-to-end tests are less disrupted by refactoring than unit tests. Test suite runtime is another cost to consider. High code coverage can help identify and delete dead code.

Brian: A test suite's runtime should not be too long; ideally it finishes within a few minutes, since long runtimes hurt team productivity. Test suites should be modular so that individual pieces can be run and debugged quickly. Testing should be layered: tests run during development should be fast, while tests in CI can be more thorough.

Key Insights

What is the main trade-off discussed in Nicole's blog post about testing?

The main trade-off discussed is balancing the cost of testing (time, resources, maintenance) against the risk of not testing enough (potential bugs, downtime, and business impact). The post emphasizes the need to critically evaluate how much testing is necessary and where to focus testing efforts to maximize value.

Why can refactoring code reduce code coverage percentages?

Refactoring can reduce code coverage percentages because it often results in fewer lines of code. If the same number of tests cover fewer lines, the coverage percentage drops. This creates a paradox where improving code quality by making it more concise can appear to reduce test coverage, even though the code is better.
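
To make the arithmetic behind this concrete, here is a tiny illustration (the line counts are invented for the example): line coverage is just covered lines divided by total lines, so shrinking the well-tested part of the code lowers the percentage even though nothing got worse.

```python
def coverage_pct(covered_lines: int, total_lines: int) -> float:
    """Line coverage as a percentage: covered lines / total lines."""
    return 100 * covered_lines / total_lines

# Before refactoring: 100 lines total, the 50 tested lines are all covered.
print(coverage_pct(covered_lines=50, total_lines=100))  # 50.0

# After refactoring the tested function down to 40 lines (the untested one
# grows to 60), the same tests now cover fewer lines, so the number drops.
print(coverage_pct(covered_lines=40, total_lines=100))  # 40.0
```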

What is the issue with aiming for 100% code coverage in a React app with styled components?

Aiming for 100% code coverage in a React app with styled components can be problematic because it requires testing every line of CSS. This often leads to low-value tests that don't meaningfully improve code reliability, while consuming significant time and effort that could be better spent on higher-impact testing.

How does Nicole suggest deciding what to test and what not to test?

Nicole suggests focusing testing efforts on the most critical parts of the system, such as features that directly impact revenue or user experience. For example, live interaction features that could result in significant financial loss if they fail should be prioritized over less critical features like analysis tools, which can tolerate occasional downtime.

What is the challenge with performance testing, according to Nicole?

Performance testing is challenging because it must closely match real-world workloads to be meaningful. Simulating realistic user behavior is difficult, especially before deployment, and testing isolated components doesn't capture the non-linear interactions between different parts of the system. Monitoring production behavior can help refine performance tests over time.
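
As a rough sketch of what end-to-end workload simulation can look like in Python, here is a minimal Locust load test; the endpoints and traffic weights are invented for the example and would need to be derived from real monitoring data rather than guessed.

```python
# locustfile.py -- run with: locust -f locustfile.py --host https://staging.example.test
from locust import HttpUser, task, between


class TypicalUser(HttpUser):
    # Pause 1-5 seconds between requests, roughly like a real user would.
    wait_time = between(1, 5)

    @task(10)  # weight: most simulated traffic hits the live-interaction path
    def live_interaction(self):
        self.client.get("/api/session")

    @task(1)  # the analysis features see far less traffic in this sketch
    def analysis(self):
        self.client.get("/api/analysis/report")
```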

What is the cost of maintaining a large test suite?

Maintaining a large test suite can be costly because it requires ongoing effort to update tests as the codebase evolves. Refactoring becomes more difficult, and the time to run tests increases, which can slow down development workflows. Additionally, tightly coupled unit tests can break frequently during refactoring, adding to maintenance overhead.

What is Nicole's opinion on the ideal length of a test suite?

Nicole believes a test suite should ideally run in single-digit minutes, with five minutes being the upper limit for a reasonable development workflow. Longer test suites can significantly impact productivity, especially if developers get distracted while waiting for tests to complete.

How does Nicole use code coverage to improve code quality?

Nicole uses code coverage to identify and delete unreachable code. By analyzing coverage reports, she can pinpoint code that isn't being executed and remove it, which improves code quality and reduces unnecessary complexity. This approach also helps ensure that the remaining code is well-tested and functional.
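
A minimal sketch of that workflow, assuming coverage.py; `my_module` and its call are placeholders standing in for whatever the tests actually exercise, and the same result comes from running `coverage run -m pytest` followed by `coverage report -m` on the command line.

```python
import coverage

cov = coverage.Coverage()
cov.start()

import my_module          # placeholder: the code under measurement
my_module.do_something()  # placeholder: stands in for running the test suite

cov.stop()
cov.save()

# show_missing lists the line numbers that were never executed; those lines
# are the candidates to investigate and, if truly unreachable, delete.
cov.report(show_missing=True)
```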

Transcript

Today on the show, I have Nicole, and we're going to talk about testing in Python and a whole bunch of fun stuff. Welcome to Test & Code. This episode is brought to you by HelloPyTest, the new fastest way to learn PyTest, and by the Python Test community. Find out more at courses.pythontest.com.

Welcome, Nicole. Yeah, thanks for having me on, Brian. Before we jump in, I'd like to kind of introduce who you are and what you do. Yeah, absolutely. I am Nicole Tietz-Sokolskaya. I go by ntietz most places on the web, and I'm a principal software engineer at a tiny startup called Remesh. So most of my focus there is performance, scaling, backend stuff, and I'm

So we use Python for a lot of our back end code and our machine learning code. And then we also have a myriad of like TypeScript, Go and Rust and some other places as it calls for it. So these days I split my time between some of the machine learning code in Python and some of the Rust code for any of the things that just need to be wicked fast. Awesome. Cool. Learning Rust is something on my to-do list, and Nicole has a resource that we'll link called Yet Another Rust Resource.

and it looks great. And what was the goal of that? You said it was trying to get people started really quickly. - Yeah, so when we were introducing Rust at work, I knew that like the traditional path to learning Rust was here, here's a thousand page Rust programming language book, go read it. And I needed people to be up to speed a little bit quicker than that. So the goal of this course is to get you to be

proficient enough to pair with someone who knows Rust after just a few days, and then kind of take it from there to deepen your learnings later, but generally just kind of reduce the intimidation factor and get you running. Proficient enough to pair. Does your organization do pair programming? We do it kind of on an ad hoc basis, mostly when like,

situations call for it or we have an interesting problem. But as a remote first organization, we've never really settled into a very strong pattern of doing it regularly. Okay.

That's kind of my comfort level with pair programming anyways, an as-needed basis. Cool. So I ran across your blog post called "Too Much of a Good Thing: The Trade-off We Make with Tests," and there's just, there's a lot of stuff it talks about, but one of them is balancing risk mitigation

and basically how much you want to test and that sort of thing. So can you introduce the topic for us, really, about testing and coverage and whatnot? Yeah. I mean, I think like often as software engineers, we don't have a very nuanced discussion about testing, because a lot of it comes down to: product pushes back on testing and wants a faster schedule. Engineers push back on a faster schedule and want more testing.

And this is kind of calling for taking a step back and thinking about like, what is the actual reason that we're doing these tests? And what are we trying to get out of them? And like, is there a point of diminishing return? How do we know how much testing is the right amount of testing? And like you mentioned, I mentioned code coverage in there. Like if we're measuring it, what's the point of that? And like, should we track it? So yeah, just kind of went through a number of things, but yeah.

I don't know if you want to dive into anything in particular from there. Well, let's hit the big one that is contentious sometimes is code coverage. Yeah. What's your, do you have an opinion around code coverage or should we measure it?

Yeah. So the first thing I did in my internship, first job as a software engineer at Vinny Flavor was adding code coverage to the team that I was working on and then increasing their code coverage throughout the summer. So I got started in a culture that was very pro-code coverage. And then over time, I started to notice that when you have this environment where you track code coverage, there's this tendency to push it up

uncritically, and you start to do really funny things. Like, if I refactor something and I reduce the lines of code in it, but that code was exercised by a lot of the tests, then you've now reduced the code coverage percentage because you made something smaller. So if you have something like a code coverage ratchet, then you start to have funny side effects of, okay, I refactored this thing, but I need to go add tests somewhere else

so I don't reduce the test amount. And that's where I think like,

That one, like, threw me and I'm like, well, that's a really good point. So I took it to an extreme with a mental exercise. Let's say I've got a hundred lines of code with two 50-line functions, and I've got two tests around that. And for some reason, let's say I only have 50% coverage, so I'm covering 50 lines of code.

If I refactor that and make it tighter, make that one like 40 lines of code, and then add 10 to the other one, I've suddenly shifted and I've got like 40% code coverage, even though I haven't made things worse. I've actually just tightened up the code. It's a weird thing to think about.

Yeah. If you're reducing the number of lines of code, you're going to change the coverage percent. Right. And we want that. We want people to make tighter code. But anyway, if you don't have a hundred percent coverage, then the number is hard to deal with. Yeah.

I guess personally, for my personal projects, anything that I have complete control over, I am a hundred percent code coverage kind of person, but I use the exceptions liberally. Like if I'm using

third-party code, or even if I've pulled in some code, I know the lines of code that I'm using and I'm not going to try to test everything else. So I do have specifics that say, this is the code I know I'm using and I want a hundred percent coverage of that.
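
For what it's worth, coverage.py supports exactly this kind of deliberate carve-out; a minimal sketch, with the function invented for the example:

```python
# Lines you consciously choose not to test can be excluded from the report
# with a pragma comment, so 100% coverage still means "100% of what I own."

def load_config(path: str) -> str:
    try:
        with open(path) as f:
            return f.read()
    except FileNotFoundError:  # pragma: no cover
        # Deliberately uncovered: rare path judged not worth its own test.
        return ""
```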

So you were on that project earlier. Do you still... I guess you brought up the fact that maybe the number and trying to reach 100% isn't that great. But culturally, where do you stand now? How do you use it? Yeah. So right now, I don't really use code coverage. I think that it could be a good tool to see...

integrated in alongside a code change. Like, is the code that this is affecting, is that covered by tests? Which I think would be really useful. So I think like, contextually, what code is covered is to me more important than the raw number. But on projects I work on, we just haven't like taken that leap where we look more at

more end-to-end signals, I guess, of correctness and performance and things. But a friend was also telling me last week about a project that he was on where the manager of the team wanted 100% code coverage, but it was a React app and they had styled components in it, which meant that your code coverage getting to 100% required every single line of CSS to be covered as well.

It's like, what's the meaningful test here? And that was also a weird pattern where it's like you were saying, exceptions to it have to be made for either third party things or things that might not make sense to test in this context. Or I would even argue CSS things, you could write a test around that, but I don't think it's a particularly high value test and your effort may be better spent somewhere else.

Yeah. And I've got to caveat my opinions with: I don't test any front-end stuff at all. I kind of always test from the API down. The other thing is, the only thing I use code coverage for is coverage based on behavior end-to-end tests. I don't find code coverage on unit tests that interesting, because of some of the abuses that you've probably seen, where

there can be a piece of code that can't even be reached by the system, that you can write a test for, and you'll never know to delete it because it's being covered, right? So yeah. So in those end-to-end tests, how do you get coverage on, for example, the error paths: when something exceptional happens, a downstream service is broken, or a really strange error happens?

In those cases, I think for those particular parts, the concept of mocking or something like it sounds good. I don't use mocking on any of my professional projects, but I'll design into the architecture a system where we can simulate all the failures that we want to be able to cover. So it's, I don't know,

in the case of an internet service or something, an equivalent of the different error codes that we expect to be able to get and recover from gracefully. Those are things that you want to know. That's not really an error condition. That's a behavior, right? That you want your system to handle. So yeah, it's got to be tested, but some error conditions are really hard to

get naturally without forcing the hand. So how do you deal with it? What we do is testing in smaller units for things that are handling upstream services, using mocks where we can. And then that's a lot of hoping that your mock interface and your actual thing return similar responses. Yeah.
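
As a minimal sketch of that style in Python, here is a pytest-style test that forces an upstream failure through a mock and checks that the caller degrades gracefully; `fetch_profile`, the URL, and the session object are invented for the example.

```python
from unittest import mock

import requests


def fetch_profile(user_id: str, session=requests):
    """Invented example: return a profile dict, or None if upstream fails."""
    resp = session.get(f"https://upstream.example/users/{user_id}")
    if resp.status_code != 200:
        return None  # the behavior under test: degrade gracefully, don't crash
    return resp.json()


def test_fetch_profile_handles_upstream_500():
    # Simulate the downstream service returning a 500, no real network needed.
    fake_session = mock.Mock()
    fake_session.get.return_value = mock.Mock(status_code=500)

    assert fetch_profile("u1", session=fake_session) is None
```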

And in Python or dynamically typed languages, this is a sore point for me because I don't have a type system telling me that my mock is the same thing. But other than that, I mean, it comes down to that value trade-off of

How likely do you think this is? And if it does happen, how much do you care? So like, if your service that you're relying on historically has been up the vast majority of the time, then maybe it's okay to assume that and like swallow a 500 error on the rare case that goes down, just depending on how much effort it is to actually test that.

I think that's like a case-by-case basis. So it depends on the service, right? Or what you're building. Yeah. And it's like in those cases, what I would like is I would like alerting so that if that's happening, rather than like a test that like, okay, we recover it, but just like an alert that, oh, we're getting a lot of 500s from this critical service. Maybe someone should wake up and look at that.

You brought up the problem with mocks and basically the API drift, or something where your mock doesn't match. And I can't remember the keyword, but there is a way, at least with the Python unittest.mock library, to say...

make it match this API and it should always match it, or something. But I can't remember what that's called. Oh, that's great. I'll definitely look into that. That feature that I was trying to think of during the interview, of course, is the autospec feature of mock. The other thing that I wanted to bring up, I guess, was the risk part: we're testing because we want to mitigate risk, right? Yeah. Yeah.
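
The autospec feature mentioned above makes a mock copy the real object's signatures, so calls that drift from the actual API fail instead of silently passing; a minimal sketch, with `PaymentClient` invented for the example:

```python
from unittest import mock

import pytest


class PaymentClient:
    """Invented stand-in for a real upstream client."""

    def charge(self, user_id: str, cents: int) -> bool:
        raise NotImplementedError


def test_autospec_keeps_the_mock_honest():
    # create_autospec mirrors PaymentClient's API (mock.patch(..., autospec=True)
    # does the same when patching), so the mock enforces the real signature.
    client = mock.create_autospec(PaymentClient, instance=True)
    client.charge.return_value = True

    assert client.charge("u1", 500) is True

    # An argument the real method doesn't accept raises TypeError here too.
    with pytest.raises(TypeError):
        client.charge("u1", 500, currency="USD")
```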

I think that's why. So how do you decide what to test and what not to test, then? Yeah. I mean, a very clear example of this from my professional experience: our platform at work, if a particular portion of it that people are interacting with live goes down and we don't recover in under a minute, then that's potentially a lot of money lost for both us and for our customers. Whereas

The analysis features, it's really nice if they're up, but it doesn't matter quite as much if they go down because people can wait. So it's not a timely thing. And that's where we made a deliberate trade-off at one point that this slice of features is the slice that we really, really need to know if anything was going to break it.

And that's where we devoted most of our efforts for performance testing, just robustness testing for all the changes in it. And then you have a lot more other small bugs slip through in the other part that aren't necessarily as impactful. And obviously in like a monolith, you're going to have interplay between the different parts. You can't isolate them perfectly, but we were able to target our efforts based on like, if there's a major bug, which part of this is most impactful to the business?

Yeah. And that's where we started. I think that's a great way to think about it. Also, one of the things you brought up was performance testing, and especially in

like end user services. Performance is, it's important because if it's too slow, people will just assume it's broken, right? - Yeah. - But performance is tough because it, like, it is a little wiggly. How do you deal with that? - Yeah, I mean, it's really wiggly and it also depends so highly on your workload. So if your workload is not realistic, your performance test isn't really testing anything for you. And like, it'll give you broad stroke direction, but not a lot of useful information.

So this is one I have another blog post about: why is load testing so hard? But the crux of it, I think, is just that it has to match what your actual behavior is. And also, you don't know what the actual workload is going to be until you've deployed into production. You can guess how people are going to use it, but you can't get that real workload until it's in production. So there's some mismatch always there. But

Ultimately, you have to try to simulate, in an ideal case, simulate end-to-end for the entire system. What is the workload that users are going to put on it? Because if you test different services in isolation, you're not capturing interplay and non-linear interactions between the different components of the system. Do you utilize monitoring to try to figure that out?

Yeah, so we like to look at our monitoring to make sure that the actual behaviors we're seeing match what we're doing in our testing for performance stuff. There's some really interesting research out there, I think from telecom companies in particular, that I read when I was starting this project a few years back.

where they were talking about actually generating synthetic workloads automatically from monitoring. Interesting. As far as I know, it has not been put into practice outside of telecoms because it's also really expensive. So telecoms,

Circling back to the risk discussion, unless you have a whole lot of money riding on the line, if your system goes down, it's just not worth it. Whereas if you're a telecom, you're an emergency service for the whole country, so you better stay up and throw money at it to make sure you do.

Yeah. Also, one of the things you brought up kind of, we hinted at so far, but you brought up directly in the article as the trade-offs between the cost of some downtime or a cost of a service breaking versus the cost of writing tests. And then also one of the things that I'm very cognizant of is the cost of maintaining tests.

Because it always feels good to get good coverage, good behavior coverage, and then also a large suite of tests. You feel you're comfortable, but that large suite of tests is also kind of a beast to turn if you have to refactor or things change. You have to maintain test code just like you maintain

the rest of your code. Yeah. And I think that's where when you were saying you aim for 100% coverage on like the end to end testing, I think you have a little bit better time there with refactoring because if you change internal stuff, you're not going to break the test as much because they're not as tightly coupled to it. Whereas if you have like really high coverage on unit tests, it's so tightly coupled to the actual structure of the code that refactors

do get very into the weeds on changing the tests. And another cost of those that I didn't put in the article is just the time it takes to run them. So like, as you get more and more tests, you're either going to pay for more compute to run them faster, or you're going to wait longer. And that gets really frustrating.

Yeah, and it's interesting with even little tiny things like Python libraries or a PyTest plugin or some little extra feature. We've kind of gotten lazy. I think some of us have gotten a little bit lazy with CI and say, well, it's okay if it's just a few minutes. And then like, but I'm running it on...

on six different versions of Python on three or four different hardware platforms, and that multiplies it out. And even if those run in parallel,

that's a lot of compute power when sometimes it doesn't matter. Like, I've seen Python libraries that are tested over a ton of versions of Python, and they're not utilizing that. They don't really need to; they could pin it, like the upper and lower, and probably be fine. There's some risk and benefit there. And also, I mean, if you really aren't hardware-specific,

I don't think you really need to run on multiple hardware platforms all the time. There's a lot of pure Python libraries that are tested like that that I don't think need to be. Yeah, and you can have different configurations for different changes. So as you have changes come in, maybe you want to test them on just the oldest and newest, but then when you caught a major release or just periodically test on more so that you catch those rare changes, but you don't have to do it every single time.

So just out of curiosity, what's your, you don't need to share with me if you don't want to, obviously, but what would you consider a short test suite and what's a long test suite? What's kind of too long? I think if I can get up and go make a coffee, it

is probably too long. So I would say like five minutes is too long. But like realistically for me, it also depends on is my ADHD medication in effect or not? Because like if it's not in effect, then if it's not done before I look away from the terminal, I'm somewhere else and it doesn't matter how long it takes. But if it is in effect, I can sit there and wait a couple of minutes for it. So I think like single digit minutes is pretty reasonable.

Double digits is like this has a major effect on your team and your productivity. What about you? Yeah.

Well, okay. So I don't have an option there, since I'm working with a lot of hardware stuff on a daily basis. But the thing that we do is try to modularize our tests so that a particular test module or test directory or something can be worked on, and that bit is under just a few minutes.

So that, like you said, that development workflow, if you're working in this area, you shouldn't have to wait for 10 minutes. A few minutes is even kind of long. So I'd love it to be under a minute for something that I'm working on a day-to-day basis.

But then once I think, oh, this is good, and I push it to merge, I'm okay with like, you know, 10, 15 minutes if necessary in the CI because I probably caught it locally anyway. So the CI is really just having my back in case I broke something that I didn't mean to, things like that. So the multiple layers, I think, is good to be able to say, hey, development workflow needs to be fast, but we also need to test thoroughly as well.
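
One common way to get that layering with pytest is a marker for the slow tests, deselected during local development and included in CI; a minimal sketch, with the marker name and test bodies invented for the example:

```python
import pytest


def test_parser_fast_path():
    # Quick unit-level check: cheap enough to run on every local invocation.
    assert int("42") == 42


@pytest.mark.slow
def test_full_pipeline_end_to_end():
    # Expensive end-to-end check: run in CI, not on every local edit.
    ...


# Locally:  pytest -m "not slow"   (fast feedback while developing)
# In CI:    pytest                 (the whole suite, slow tests included)
#
# Register the marker, e.g. in pyproject.toml:
#   [tool.pytest.ini_options]
#   markers = ["slow: long-running end-to-end tests"]
```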

And I think that anybody listening that thinks like even a minute is way too long because you need to be able to test with every keystroke. That's crazy. And I don't think people should have to try to worry about that. But maybe, I don't know, maybe Rust people can because Rust is so fast. Yeah. I mean, in Rust, your tests can be super fast. You're just waiting for the compile time before you can run them. Oh, right. Okay. Yeah.

I forget that it's compiled. It's compiled and it's also like, have you run Go? Yeah. So Go's compiles are incredibly, incredibly fast. And have you done C++? Yeah. Okay. So those are pretty slow. And then Rust is like, okay, we're waiting. We're waiting a while. It's not fast. They're making efforts to get it faster, but that's definitely one of the sore points for Rust and like,

one of the advantages of an interpreted language like Python is, like, yeah, I can just run it and it's there. Yeah. One of the cool things about Python is it's compiled anyway, but nobody realizes it because we just don't see it. Yeah. I mean, that leads to a really philosophical question: what does it mean to be compiled? Because to me, if you have a bytecode, I think it's the motions you go through that make the meaningful difference. It's like,

Do I run a script directly or do I have a separate compile step? - Yeah, so do you have a compile step with Go? - It's up to you. You can certainly run scripts without it, but you can also have it pump out a binary that you can then run separately.

I guess, well, the only thing I run is something built. So I run Hugo, which is built with Go, but I don't actually compile Hugo. I just run it. So, yeah. Yeah. So the compile stuff there, you can do go run, I think it is. It's been a while since I ran it directly, which will compile the source and then run it. Or you can do a separate compile, get the distributable binary and ship it to someone, and then they can run it. Hmm.

But one of the lovely things about all these things that are fast is it's helping out Python. So at least, especially Rust is helping make Python faster, which is neat. Yeah. I haven't had the opportunity to use Rust across the Python boundary, but it's really cool. And it warms my heart that we can do things more safely and with a little bit less C in the world. Yeah.

Yeah, yeah. Well, okay, so hopefully I can agree with you at some point in the future. Half of my, you know, my paid gig is C++, so I don't want to throw out C++ altogether. My condolences. I was traumatized by C++ in a previous job. I'm sorry. It is, yeah, it is painful. You were asking me about compile times, and

compile times are to the point now where ours is pretty fast. I mean, relatively; we can count it in minutes at least. So whatever. Yeah. Anyway. Well, Nicole, it has been lovely talking to you about testing. We're going to link to, at least, let's see, "Too Much of a Good Thing." You also brought up "Why Is Load Testing So Hard?" We'll totally link to that. And then also your

your Rust tutorial, and also, I can't wait to get started with this. Yeah. I think we can also drop in a link for Goodhart's Law, which we danced around with code coverage being a bad measure, but that makes explicit why.

Yeah. The only thing I wanted to throw in is one of the reasons why I kind of taunt people with the whole 100% code coverage thing is I utilize it mostly to find out what code to delete. My favorite way to get increased coverage is to remove code that can't be reached. So anyway. Yeah.

Yeah, I love that. But people freak out. Like when I delete code, people are like, but we need that. Like, prove to me that we need that, and I'll put it back. Also, I hope it's in version control, so if you ever need it, it's still there. Yeah, exactly. That's why we have it. Oh, yeah. I cannot stand to see commented-out code. We might need this later.

We'll get it later if we need it. Don't comment out code. I mean, for a short period of time, it's fine, but it's cringy. So, okay. Thanks a ton, Nicole. And it was good talking with you. Yeah, you as well. Thanks so much for having me on. This was a lot of fun.

Thank you for listening and thank you to everyone who has supported the show through purchases of the courses, both Hello PyTest, the new fastest way to learn PyTest, and the complete PyTest course, if you'd like to really become an expert at PyTest. Both are available at courses.pythontest.com and there you can also join the Python test community. That's all for now. Now go out and test something.