
115. Demystifying the User Experience with Performance Monitoring

2021/5/11

Code[ish]

People
Greg Noakes
Innocent Bindura
Topics
Innocent: As a developer at Raygun, I believe the tools Raygun provides help developers improve software quality through crash reporting, browser performance, and application performance monitoring. I focus on the end-user experience during peak periods and work backward from there to the technology supporting the application. I check how often code paths are executed and whether N+1 query problems exist, and I consider refactoring code and data models so that data comes out of the database in a more consistent and timely fashion. Raygun can correlate page-load performance data from the browser with server-side performance and crash reports, giving a complete view of the customer experience. APM doesn't track every trace for every user; it uses sampling ratios to reduce noise and surface only the useful information.

Greg Noakes: I'm Greg Noakes, and I want to learn what Raygun is and what you do there. Does Raygun provide tools to examine how code actually behaves in a running environment and get feedback on it? I like to take apart an individual request and look at the time spent in different functions and backing services, so I can tune the performance of highly hit endpoints. Can Raygun help me do that? I also want to know how Raygun tracks a user's journey through an application, so I can see every endpoint a user hits and understand their individual experience.


Chapters
Innocent Bindura from Raygun shares his holistic approach to performance monitoring, emphasizing that the absence of crash reports doesn't guarantee optimal performance. He stresses the importance of considering end-user experience, especially during peak periods, and then working backward to identify technological issues.
  • Absence of crash reports doesn't equal optimal performance
  • Holistic approach: consider end-user experience during peak periods, then investigate supporting technology
  • Analyze load times, response codes, browser vs. server-side issues
  • Examine surrounding applications, exceptions, queries, and code paths
  • Optimize database queries and data presentation for timely response

Transcript


Hello and welcome to Code[ish], an exploration of the lives of modern developers. Join us as we dive into topics like languages and frameworks, data and event-driven architectures, and individual and team productivity, all tailored to developers and engineering leaders. This episode is part of our Tools and Tips series.

Welcome to Code[ish]. This is Greg Noakes, Distinguished Technical Architect with Salesforce Heroku. Today I'm talking with Innocent Bindura, a senior developer at Raygun. Innocent, could you tell me a little bit about what Raygun is and what you do at Raygun?

Raygun is in the performance monitoring space. We provide tools and utilities that developers use to improve their software quality through crash reporting, browser performance, and application performance monitoring.

So how do you approach performance monitoring and that kind of application introspection?

When I look at an application that is performing suboptimally, my philosophy is, first, that the absence of crash reports does not mean the software is performing really well. There are hidden things in software. The tools we use when we develop don't always work 100% as well as we would want them to.

They do get the job done, but there are areas in there that can be improved. So I tend to take a holistic picture. I look at the size of my audience, and whether it's something sizable that gets a lot of traffic, for example a shopping cart that gets a lot of traffic on Black Friday.

I would want to be in a comfort zone where I know that during the peak periods, my application is still performing. So I tend to look at the end user and what their experience looks like during very high peak periods. And from there, I start working my way back to the technology that is supporting that application.

So the first port of call is obviously load times for a user and the response codes that they're experiencing on their end. From there, I try to determine whether it's an issue with their browser or an issue with the software on the server side.

If it's an issue with the software on the server side, I would then start looking into all the applications surrounding the final product that the customer is seeing, and pick that apart: find what exceptions I'm experiencing. If there are no exceptions, then look into the queries that are running behind the scenes. If I'm using an object-relational mapper, that's a favorite one for me to go to.

I'll then look to see how often certain code paths are being executed, whether we're experiencing any form of N+1 queries, and look into ways of restructuring the code and remodeling the data that we are presenting to the user, so that it comes out of the database in a more consistent and timely fashion.

And I assume that at Raygun you provide tools so you can do that introspection, so you can examine your code in a running environment and then get good feedback on what it's actually doing?

Yes, definitely we do.
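To make the N+1 problem concrete: a minimal sketch, assuming hypothetical Rails-style Order and LineItem ActiveRecord models (not anything from the episode itself).

```ruby
# Hypothetical models for illustration: Order has_many :line_items.

# N+1: one query for the orders, then one extra query per order.
Order.limit(50).each do |order|
  order.line_items.sum(&:price) # fires a separate SELECT for every order
end

# Eager-loaded: two queries in total, however many orders there are.
Order.includes(:line_items).limit(50).each do |order|
  order.line_items.sum(&:price) # items are already in memory
end
```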

It's a well-rounded approach, but one size doesn't fit all. From time to time, I find myself having to put some arbitrary measurements inside my application that get reported in a separate dashboard that Raygun doesn't provide. But overall, Raygun does provide a one-stop shop for most of the things that I would want to see in my application's performance.

Right. And one of the tactics I like to use is taking apart an individual request, like you said: looking at the time I'm spending in the different functions, the time I'm spending talking to my backing services, whether they be databases or APIs or whatever, and seeing where I can tune a little more performance out of highly hit endpoints, endpoints that are accessed a lot in my code. Does Raygun help me do that as well?

I think this is the area where Raygun really shines, because it paints that picture straight from the user experience right down to the server-side performance. When you load a page in your browser, the performance telemetry is sent through to Raygun, along with the data from your application running under application performance monitoring on the server side, and the crash reports coming through. We can then associate all those telemetries together and show you that this user experience is associated with these crash reports and with these stack traces on the server side. So you've got that holistic picture of the customer experience in the browser relating to the server performance on the backend.

That sounds super powerful. So do you have folks that use your tool inject some JavaScript into their web pages, and some code to capture the execution? Or do you have a different way of approaching the collection of that telemetry?

Yes, we do take that approach of injecting a little lightweight snippet of code into your JavaScript on the front end. We've got a very light SDK on the back end that hooks into your exceptions.

And APM takes a different approach, where you don't have to do anything within your code, but you do have to install an agent that runs on the server that is running your code.

So I assume it's like a lightweight gem or something, being that I'm a Ruby person, something that I would just put into my Gemfile and then go ahead and install, and pass in an API token to access your servers, so I can start feeding that information directly in and then start generating those reports.

Yes, that is 100% correct. And seeing that you are a Ruby person: I have just been working on the APM team these past few weeks, and we did launch APM for Ruby. Oh, cool. So the way it works is that you have one or two gems that you reference, depending on whether you are using Rails or Sidekiq. For that gem, you then provide an environment variable that holds your API key for the first run. Once we see that API key on the first run, it gets persisted in a JSON file that we read again and again, however many times you restart your application.
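A rough sketch of what that setup might look like on the Ruby side. The gem names and environment-variable name below are assumptions for illustration, not confirmed Raygun package names; check Raygun's Ruby APM documentation for the real ones.

```ruby
# Gemfile: reference the agent gem that matches your stack
# (names here are hypothetical; one gem for Rails, one for Sidekiq).
gem "raygun-apm-rails"      # if the app is a Rails app
# gem "raygun-apm-sidekiq"  # if you also run Sidekiq workers

# First run: supply the API key through an environment variable
# (variable name assumed), for example:
#
#   RAYGUN_API_KEY=your-key bundle exec rails server
#
# Per the episode, after the first run the key is persisted to a JSON
# file and re-read on every subsequent restart of the application.
```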

Now, I see you also have crash reporting. Are you using those same sorts of tools for crash reporting? Are you introspecting the logs, or what are you using for that?

For crash reporting, we've got lightweight SDKs, lightweight providers, that you inject into your code. Generally, the way it works, there are two approaches.

We provide a catch-all: all your unhandled exceptions that don't gracefully terminate a request, we can tap into those exceptions and report on them. But there is a better way, because we try to encourage best practices for developers when they are working with software. And one of the best practices is that you anticipate and handle all the exceptions in your application, so that the user experience is not clunky; you gracefully handle them and try to recover where possible. But an exception is an exception. You do need to know about it when it happens.

So we do offer a way of manually sending those exceptions as they occur, when you catch them in your try-catch block. I'm not sure what the Ruby equivalent is; I'm a .NET person, so I'll give the examples from a .NET perspective. When you catch your exceptions, effectively that exception has been handled, and it might not bubble up all the way into the hook for the catch-all. So there you will have to implement some manual logging. The same way we would log those exceptions to a text file and then have a look at them, or maybe log them to a database, you would just log those exceptions to Raygun.

And doing so also comes with an added advantage: you can add tags and extra information, maybe relating to the user who experienced that exception. That offers you better troubleshooting options, when you know who was affected, when it happened, and where it happened.
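For the Ruby side of that handle-then-report pattern, a minimal sketch using the raygun4ruby crash-reporting gem; the domain error and method here are stand-ins, and the exact track_exception options should be verified against the gem's documentation.

```ruby
require "raygun4ruby"

Raygun.setup do |config|
  config.api_key = ENV["RAYGUN_API_KEY"] # env var name is an assumption
end

class PaymentError < StandardError; end # stand-in for a real domain error

def charge_customer(order_id:, user_id:)
  raise PaymentError, "card declined" # stand-in for a real gateway call
rescue PaymentError => e
  # The exception is handled here, so it may never reach the catch-all
  # hook. Report it manually, attaching extra context (who, when, where)
  # to aid troubleshooting, then recover gracefully for the user.
  Raygun.track_exception(e, custom_data: { order_id: order_id, user_id: user_id })
  :showed_retry_message
end
```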

So being able to tie together the browser experience and the code introspection on the server, that seems pretty powerful. Do you have a method of tracking, say, a user's journey through the application? So I could, you know, maybe with anonymized data, look at one user as they transit through the application and see all the endpoints that they hit and see what their individual experience is like?

Yes, definitely we do. With our RUM tool, when it is integrated with your application, first and foremost, user information is opt-in. When you do opt in, you populate the fields with the information that you are most interested in, so the anonymization part is totally within your control. And then we follow the user using our internal session ID. The SDK that is integrated with your code, your JavaScript on the front end, creates a session ID that we track internally within Raygun, and we follow the user through every single page that they visit. That internal ID can be associated with your crash reports. So, for example, if your user were to experience a JavaScript exception, that would also get sent through to Raygun, to the crash report endpoint, with the same session ID. And we also check the session ID issued by your browser: when you experience an exception on the server side, that browser session ID is present in that exception. That's how we're able to correlate the two. So we actually have the full user experience and their individual sessions on the server side.

That's really cool.

Yes, and over time we're also able to give you a complete picture of how the user's session performed, from the point they visited your page, logged in, visited a couple of pages, and then left your application. The crash reports and the traces relating to that particular user are also tied up with that session on the Raygun side. An important point, though, is that with APM we don't track all the traces for every user; it depends on the sampling ratio that you choose, because APM tends to send a lot of data, much of which might be just fluff you're not interested in. So we've got a sampling strategy that reduces a lot of that noise and gives you the interesting information when it is available.
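The episode doesn't spell out Raygun's actual sampling strategy, but the idea itself is simple. Purely as an illustration, a toy Ruby sampler that keeps a fixed fraction of traces:

```ruby
# Toy trace sampler: keep roughly `ratio` of all traces, so a busy
# application doesn't flood the backend with low-value "fluff".
class TraceSampler
  def initialize(ratio:)
    @ratio = ratio # 0.1 keeps ~10% of traces; 1.0 keeps everything
  end

  def sample?
    rand < @ratio
  end
end

sampler = TraceSampler.new(ratio: 0.1)
kept = (1..10_000).count { sampler.sample? }
puts "kept #{kept} of 10000 traces" # ~1,000, varying run to run
```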

So not all users might have traces, but if you set up your tracing one-for-one, we will have all that information.

So how do you combine things? I understand having that user token that you can follow them around with. But how do you combine, say, multiple users' experiences and multiple application functions' experiences into kind of a holistic view of the overall application performance, almost like a generic score or something like that? Do you have a concept like that?

Yes, indeed we do. So we have been speaking of user specifics, which are the result of a drill-down.

When we actually go a number of levels higher and get a bird's-eye view of the application, you will get your aggregated stats on the application performance, say for the RUM product. You will have each page aggregated over time, regardless of how many users you've had in that period. You won't look at the individual sessions; that information is aggregated, and you are able to see, for example, your median, your P90, and your P99, which is what interests me about the RUM product, because I tend to focus on the P99 figure. Whoever is in there has had a terrible time, and that forms the basis of my investigations. I want to know why there are so many sessions in that P99, and that P99 is probably a six- or seven-second load time. I want to move that to below three seconds. So whoever is sitting in that P99 bucket is of interest to me, and I'm able to drill down further into their specific sessions to find out what was going on.

More often than not, you'll find, for example, that we have a data center in the United States and this customer is sitting somewhere in South Africa, and their load time is affected by latency; there's really nothing we can do for that kind of user. Or perhaps there is: AWS now has data centers in South Africa as well, so it might mean that we need to route their traffic to a data center that's closer to them to get rid of that latency. Or our assets are loading a lot slower, and we might want a cached site closer to them. So we do have that ability of taking a bird's-eye view of everything and deciding on the specific areas that we really want to drill into.

Yeah, I wish we had a way of breaking the speed of light. I always joke around with folks when we're talking about latency between data centers that we haven't been able to figure out how to break the speed of light yet, but we're working on it. So stay tuned; maybe we will one of these days.

And I totally agree with you about the P99. Over the last 12 years, that has really become kind of the holy grail for me: the more I can push that P99 number down and get it as low as possible, the better the experience for all of my customers on any website that I'm working on. I think not a lot of people think about the P99; when I talk to new customers who are just undertaking this journey of performance optimization, a lot of times I have to educate them on exactly what a P99 is. So maybe we could take a few seconds and you could tell me your understanding of what a P99 is, so any folks listening who aren't aware of it will come away knowing what it is and why it's so important.

All right. So my understanding of the P99 number, how we use it and how we display it to our customers, is that it's an aggregate bucket that a subset of your customers fall into. If you were to draw the distribution curve, it would appear bell-shaped, with a long tail that stretches towards the right, almost like an asymptotic boundary. And right at the tip of that long tail, you've got a bucket with some number of customers in it. If you've got millions and millions of customers on your website or application, that number might be sitting in the hundreds of thousands.

I always tie this with behaviors. I'm a millennial myself, and the younger generations after millennials have got far less patience than I do, and I've got far less patience than the generation before me, the baby boomers. So I want to maximize profits, and I know I'm dealing with people who don't have patience for slow loading sites.

I'll give you an example. If I'm shopping online and the application I'm shopping with is not performing, I'll simply shut it down and move on to the next. And if that's not also performing, I'll shut it down and move on to the next. So I'm interested in keeping these people that fall into that bucket of slow loading times.

So for me, the P99 represents the number of people having the absolute worst experience with an application. That is why I take particular interest in it and find out those reasons that are affecting that small number of people. If I can solve their problem, I have probably improved the life of the 98% before them.

Exactly. That's exactly how I think about it. That's a fantastic explanation.
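To put numbers to that: a small Ruby sketch that computes the median, P90, and P99 of a set of simulated page-load times using the nearest-rank method (the aggregation details are illustrative, not Raygun's actual implementation).

```ruby
# Nearest-rank percentile: the smallest sample such that p percent of
# all samples are at or below it.
def percentile(samples, p)
  sorted = samples.sort
  rank = (p / 100.0 * sorted.length).ceil - 1
  sorted[rank.clamp(0, sorted.length - 1)]
end

# Simulated load times in ms: 98% fast, 2% stuck in a very slow tail.
load_times = Array.new(980) { rand(200..800) } +
             Array.new(20) { rand(6_000..7_000) }

puts "median: #{percentile(load_times, 50)} ms" # the typical user
puts "P90:    #{percentile(load_times, 90)} ms"
puts "P99:    #{percentile(load_times, 99)} ms" # lands in the slow tail
```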

So we've got all of this data coming in from the browser and from the application, bubbled up into dashboards and overarching metrics. What do you suggest people use all of this for, besides pushing down that P99 number? What other uses are there for all of this information you're gathering about an application's performance? What do you think folks should be using it for?

Right. So from experience, I have learned that decisions should be made based on numbers. For the better part of my career, I have thumb-sucked (guessed at) a lot of numbers, without real empirical evidence for why we decided on a certain thing.

I can give you an instance where these numbers actually did help me in making a lot of decisions. In one of my previous roles, I was a team lead, and we had an application that had been problematic in that company for over two years. The problem was that we were trying to do everything in one go. After hooking up some monitoring tools and augmenting that user experience data, we realized that we were doing too much in one call, and that there were certain transactions that could obviously be deferred and processed on the side, without the user waiting for the feedback, because most of the time they wouldn't be interested in that feedback immediately anyway. It would be something they would need to look at maybe a month later, as an aggregated report or something.

Where I'm going with this is that looking at that data enabled me to redesign the application. I took the command query responsibility segregation pattern, where stuff that is not mission-critical is deferred for processing by a number of workers in the background, and the stuff that is mission-critical is executed in a transactional manner in real time. So when you actually look at the performance of your application, you've got to determine what the happy path is and what the mission-critical path is, and decide on deferring processing for later in the background. Not everything needs to be transactional, and not everything needs to be real time.

So we are collecting all these metrics, reporting on them, and giving you the data you need to look at to make an informed decision. If you were to ask me why the application was originally designed in that manner, with everything transactional: that is because somebody thumb-sucked what the best practices were and thumb-sucked that application, assuming the loads would be so little that things would be performant. But over time, that didn't hold up, and it called for a complete redesign. So you might be a company out there experiencing one of these constant problems with your applications, where you can't decide what to keep and what to hold off on. Having this kind of data enables you to see what that mission-critical path is and what that happy path should look like for your customers, and to make that informed decision based on actual, actionable data.
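As a minimal sketch of that deferral pattern, assuming Sidekiq for the background workers (the job and method names are made up for illustration):

```ruby
require "sidekiq"

# Non-mission-critical work, like building a report the user will only
# look at a month later, is deferred to a background worker.
class ReportAggregationJob
  include Sidekiq::Job

  def perform(order_id)
    # Heavy, non-urgent processing runs here, outside the request cycle.
    puts "aggregating report data for order #{order_id}"
  end
end

def place_order(order_id)
  # Mission-critical path: executed transactionally, in real time.
  # (A real application would wrap this in a database transaction.)
  puts "charging and persisting order #{order_id}"

  # Everything else is queued; the user never waits for it.
  ReportAggregationJob.perform_async(order_id)
end
```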

So, do you have any closing comments?

I think the life of a developer is an interesting one. We fit in everywhere the situation permits, and we definitely take different routes to develop our careers. But ultimately, what we should all be concerned about is the quality of the products we produce; that definitely reflects on my capability as a software developer. What sets me apart from the next developer is not the number of cool techniques I can do with code; it's delivering a product that actually works. And what better way of knowing what works than actually measuring things? Everybody should live by the philosophy of assume nothing, measure everything. Everything should be measured.

That's fantastic advice. It's been phenomenal talking with you. It was educational for me; I learned a lot about Raygun, what you do, and how you think about the broad space of performance introspection. It was just great to talk to you, Innocent. I'm looking forward to doing this again. And with that, thanks again for being on Code[ish].

Thank you for having me. I love these kinds of conversations where we talk about technology and best practices. And kudos to Heroku and Salesforce for putting this sort of thing out there.

Absolutely. We're always happy to give back and to talk to folks who have been in the industry. I had an old boss who liked to say that he didn't trust anybody in this industry who didn't have a few scars. So it's great to be able to show off our scars and the stories behind them, and hopefully allow other people to get different scars than the ones we've gotten.

Yeah, that's definitely the way we should be doing things. My experience should not be the next person's experience.