Now You See Me: Tame MTTR with Real-Time Anomaly Detection - Kruthika Prasanna Simha & Prita
Summary
In this talk by Kruthika Prasanna Simha and Prita from Apple, hosted by CNCF, the focus was on enhancing anomaly detection methods to effectively reduce Mean Time to Resolve (MTTR) in systems. The discussion revolved around using real-time anomaly detection to improve not only Mean Time to Detect but also Mean Time to Resolve, thus optimizing operational efficiency and maintaining Service Level Agreements (SLAs). They presented various statistical and machine learning methods, emphasized the importance of real-time monitoring, and showed how open-source tools can aid in this task. The talk concluded with a Q&A session addressing practical implementations and potential challenges.
Highlights
Leveraging real-time anomaly detection methods can greatly reduce Mean Time to Resolve ⏰.
The use of multivariate analysis allows for better system insights by examining multiple metrics simultaneously 📊.
Open-source tools and platforms, such as Prometheus and TensorFlow, offer effective frameworks for implementing anomaly detection 🛠️.
Real-time monitoring not only catches issues quickly but helps maintain customer satisfaction by improving service availability 👍.
Anomaly detection techniques are crucial for SLA compliance and enhancing user experience 🌟.
Key Takeaways
Anomaly detection isn't just about detecting issues but resolving them faster ⏱️.
Multivariate anomaly detection provides better insights than univariate methods 🧠.
Open-source tools like Prometheus and TensorFlow can simplify this process 📦.
Incorporating real-time anomaly detection can substantially reduce downtime 🛠️.
Use of dependency graphs can aid in identifying root causes efficiently 🔍.
Overview
In their captivating session, Kruthika and Prita dive deep into utilizing anomaly detection to manage Mean Time to Resolve (MTTR). They open with a discussion on the significance of reducing detection time and elaborate on advanced methods that also minimize resolution time for system anomalies, crucial for operational efficiency and SLA adherence.
They explore real-life scenarios where implementing both univariate and multivariate anomaly detection methods can offer superior system insights, helping pinpoint root causes of anomalies faster. This enables businesses to swiftly manage and resolve issues, reducing the overall downtime, and enhancing service stability.
Throughout the session, the speakers stress the importance of using open-source tools such as Prometheus for anomaly detection and show how integrating these tools can streamline the detection-to-resolution process. The session wraps up with a vibrant Q&A, providing additional clarity on implementation hurdles and practical advice for leveraging anomaly detection in various environments.
Chapters
00:00 - 01:00: Introduction to the Talk The chapter titled 'Introduction to the Talk' begins with the speaker expressing gratitude to the audience for attending despite the late hour and promising not to delay their evening plans. The focus of the talk is on improving anomaly detection methods to affect the mean time to respond. The speaker briefly introduces herself as Kruthika, mentioning that she had given a talk the previous day, indicating familiarity with some of the attendees.
01:00 - 02:00: Speakers Introduction and Agenda The chapter introduces two key speakers involved in machine learning and observability at Apple. The first speaker is a machine learning engineer working in the observability team, with a background in observability, machine learning, and data science. The second speaker is Prita, an engineering manager who leads the Observability Intelligence and Analysis team, also with a background in machine learning, data science, and data engineering. The agenda for the discussion involves starting with a case study, exploring the problem with delayed detection, and discussing how anomaly detection techniques can help address these issues.
02:00 - 03:00: Case Study on Delayed Detection and Anomaly Detection The chapter focuses on advanced methods in anomaly detection, emphasizing their role in reducing the Mean Time To Detect (MTTD) and facilitating root cause analysis. It introduces the concept that anomaly detection can be enhanced by integrating contextual information, which in turn helps in reducing the Mean Time To Resolve (MTTR). The chapter also indicates that it will culminate with a question and answer session.
03:00 - 03:30: Overview of Anomaly Detection This chapter introduces the core focus of anomaly detection, emphasizing its value proposition in improving real-time detection and resolution times. It highlights the importance of leveraging statistical and machine learning methods for effective anomaly detection. The chapter aims to provide insights into how anomaly detection can enhance operational efficiency by reducing the mean time to detect and resolve issues.
03:30 - 05:00: Value Proposition of Anomaly Detection The chapter discusses the value proposition of using anomaly detection to improve the mean time to resolve issues. It also explores how open-source frameworks and tools can be leveraged to build an anomaly detection pipeline. The chapter sets the context with a use case involving the creation of an online astronomy shop, highlighting the dream of space exploration and the allure of its products.
05:00 - 06:00: Use Case: Astronomy Online Shop In the chapter titled "Use Case: Astronomy Online Shop," the focus is on an online platform known as Stellar Stash. Stellar Stash offers astronomical gadgets like telescopes and binoculars, with promotional offers such as flat discounts. The chapter emphasizes the importance for service owners to ensure customer loyalty, business growth, and providing users with a stellar experience. It also highlights the necessity of maintaining high availability and reliability as key components of successful online service management.
06:00 - 08:00: Challenges and Solutions for SLA Breach Detection This chapter discusses the concept of Service Level Agreements (SLAs) and their importance in ensuring optimal user experience in online services. It explains that SLAs are commitments on the expected performance standards, such as processing 95% of transactions within 2 seconds, handling at least 2,000 requests per second, or maintaining a failure rate of less than 1%. The chapter further delves into the methods of tracking these SLAs by associating them with key performance indicators (KPIs), which help in monitoring and ensuring compliance with the agreed service levels.
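As a rough illustration of the example SLAs above (95% of transactions within 2 seconds, at least 2,000 requests per second, under 1% failures), here is a minimal Python sketch, not taken from the talk, that checks one window of request samples against those thresholds; the function and variable names are invented for illustration.

```python
import numpy as np

def check_slas(latencies_s, status_codes, window_s):
    """Check one observation window of request samples against the example SLAs."""
    p95_latency = np.percentile(latencies_s, 95)               # 95th percentile latency
    throughput = len(latencies_s) / window_s                   # requests per second
    failure_rate = np.mean(np.asarray(status_codes) >= 500)    # fraction of failed requests
    return {
        "latency_ok": p95_latency <= 2.0,      # 95% of transactions within 2 s
        "throughput_ok": throughput >= 2000,   # at least 2,000 requests/s
        "failures_ok": failure_rate < 0.01,    # less than 1% failures
    }
```

In practice these checks would usually be expressed as recording and alerting rules over the stored KPIs rather than computed in application code.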
08:00 - 09:00: Understanding Anomaly Detection Techniques Understanding Anomaly Detection Techniques - This chapter focuses on the importance of monitoring service performance metrics such as request latency, throughput, and failure rate. It emphasizes tracking key performance indicators (KPIs) and service level agreements (SLAs) to maintain optimal user experience and prevent service degradation. It suggests collecting KPI metrics through service instrumentation and storing them in Prometheus. These metrics should be displayed on a dashboard to easily monitor and ensure consistent service performance.
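To make the "instrument and store in Prometheus" step concrete, here is a minimal sketch using the Python prometheus_client library; the metric names and port are illustrative assumptions, not details from the talk.

```python
import time
from prometheus_client import Counter, Histogram, start_http_server

REQUEST_LATENCY = Histogram("request_latency_seconds", "Request latency in seconds")
REQUESTS_TOTAL = Counter("requests_total", "Requests served")
REQUEST_FAILURES = Counter("request_failures_total", "Requests that failed")

def handle_request():
    start = time.time()
    try:
        ...  # serve the storefront request here
    except Exception:
        REQUEST_FAILURES.inc()
        raise
    finally:
        REQUESTS_TOTAL.inc()
        REQUEST_LATENCY.observe(time.time() - start)

if __name__ == "__main__":
    start_http_server(8000)  # exposes /metrics for Prometheus to scrape
```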
09:00 - 10:00: Tools and Frameworks for Anomaly Detection The chapter discusses the limitations of manual anomaly detection in scalable services with millions of users. It emphasizes the impracticality of constantly monitoring thousands of dashboards as a service grows, especially when additional subservices, like tutorials for budding astronomers, are introduced. The need for automated tools and frameworks to handle anomaly detection efficiently in such contexts is highlighted.
10:00 - 12:00: Statistical Models for Anomaly Detection This chapter addresses 'Statistical Models for Anomaly Detection'. An initial issue discussed is the challenge of scalability in traditional approaches, highlighted humorously by a reference to 'The Hitchhiker's Guide to the Galaxy', where tutorials are quirkily attributed to a character named Marvin. The main focus is on the problems encountered with conventional anomaly detection methods. Specifically, longer detection times are a key concern as they contribute to the failure to meet service level agreements (SLAs), ultimately eroding user trust. The text suggests that there must be a solution to these challenges, likely pointing to statistical models as an efficient alternative for anomaly detection.
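The statistical methods named in the talk (z-score, Holt-Winters, and related Prometheus functions) are training-free; a minimal rolling z-score detector in Python, with an illustrative window and threshold, looks roughly like this:

```python
import pandas as pd

def zscore_anomalies(series: pd.Series, window: int = 60, threshold: float = 3.0) -> pd.Series:
    """Flag points deviating more than `threshold` standard deviations
    from a rolling mean of the previous `window` samples."""
    rolling_mean = series.rolling(window).mean()
    rolling_std = series.rolling(window).std()
    z = (series - rolling_mean) / rolling_std
    return z.abs() > threshold

# latency = pd.Series(...)  # e.g. request latency sampled every 15 s
# spikes = latency[zscore_anomalies(latency)]
```

The same idea can be expressed directly in PromQL with avg_over_time and stddev_over_time, which is what the speakers point to when they mention Prometheus' built-in functions.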
12:00 - 14:00: Visualization and Alerting in Anomaly Detection Workflow The chapter discusses the adverse effects of system downtime and service unavailability, highlighting issues such as longer detection times, missed SLAs, and system failures. These problems often lead to increased operational costs, manual firefighting, degraded user experience, lost user trust, and potential revenue loss. The importance of visualization and alerting in anomaly detection workflows is implied in preventing these negative outcomes.
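For the alerting step this chapter refers to, a hedged sketch of pushing a detected anomaly to Alertmanager's v2 alerts endpoint is shown below; the Alertmanager address and label names are assumptions, and in a Prometheus-native setup you would normally let alerting rules fire rather than posting alerts yourself.

```python
from datetime import datetime, timezone
import requests

ALERTMANAGER_URL = "http://alertmanager:9093/api/v2/alerts"  # assumed address

def send_anomaly_alert(metric_name: str, value: float) -> None:
    alert = {
        "labels": {"alertname": "KPIAnomaly", "metric": metric_name, "severity": "warning"},
        "annotations": {"summary": f"Anomalous value {value:.3f} observed for {metric_name}"},
        "startsAt": datetime.now(timezone.utc).isoformat(),
    }
    # Alertmanager expects a JSON list of alerts
    requests.post(ALERTMANAGER_URL, json=[alert], timeout=5).raise_for_status()
```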
14:00 - 15:30: Multivariate Anomaly Detection The chapter titled 'Multivariate Anomaly Detection' discusses the objective of addressing SLA breach detection by monitoring KPIs. It highlights the importance of identifying abnormal behavior in these KPIs, which can result in service degradation or SLA breaches. The goal is to detect these issues, determine the root cause, and rectify the problem to restore services. The chapter points to anomaly detection as the approach that makes this possible.
15:30 - 16:30: ML Models for Anomaly Detection Anomaly detection involves identifying data points or events that deviate from normal behavior. It uses proactive and real-time monitoring to ensure system reliability and scalability, reduces detection time to enhance user experience, and discovers hidden performance opportunities to improve service availability.
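Later in the talk the speakers describe autoencoders that flag points by reconstruction error; a minimal TensorFlow/Keras sketch of that idea follows, with layer sizes, epochs and the threshold rule chosen purely for illustration.

```python
import numpy as np
import tensorflow as tf

def build_autoencoder(n_features: int) -> tf.keras.Model:
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(n_features,)),
        tf.keras.layers.Dense(8, activation="relu"),   # compress
        tf.keras.layers.Dense(2, activation="relu"),   # bottleneck
        tf.keras.layers.Dense(8, activation="relu"),
        tf.keras.layers.Dense(n_features),             # reconstruct
    ])
    model.compile(optimizer="adam", loss="mse")
    return model

def fit_and_flag(X_normal: np.ndarray, X_new: np.ndarray, quantile: float = 0.99) -> np.ndarray:
    """Train on mostly-normal data, then flag new samples whose
    reconstruction error is unusually high."""
    model = build_autoencoder(X_normal.shape[1])
    model.fit(X_normal, X_normal, epochs=20, batch_size=32, verbose=0)
    train_err = np.mean((model.predict(X_normal, verbose=0) - X_normal) ** 2, axis=1)
    threshold = np.quantile(train_err, quantile)
    new_err = np.mean((model.predict(X_new, verbose=0) - X_new) ** 2, axis=1)
    return new_err > threshold
```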
16:30 - 18:30: Integration of Anomaly Detection with Root Cause Analysis This chapter discusses the integration of anomaly detection approaches with root cause analysis to enhance the performance of services. It emphasizes adapting to data drifts for future growth. An example provided is an online storefront with a Service Level Agreement (SLA) aiming for 95% of users to experience latency under two seconds. The chapter outlines how sudden spikes in key performance indicators (KPIs), such as latency, can be detected and addressed.
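The talk's root-cause example maps a specific combination of degraded KPIs (active users, latency and failure rate rising together) to a likely resource bottleneck. A toy illustration of that mapping, with an invented rule table, might look like this:

```python
RULES = [
    ({"active_users", "request_latency", "failure_rate"},
     "possible resource bottleneck (e.g. CPU or memory under-provisioned)"),
    ({"request_latency", "failure_rate"},
     "possible degradation in a downstream dependency"),
    ({"failure_rate"},
     "possible bad deployment or configuration change"),
]

def candidate_root_causes(anomalous_kpis: set) -> list:
    """Return root-cause hints whose KPI pattern is contained in the current anomalies."""
    return [cause for pattern, cause in RULES if pattern <= anomalous_kpis]

# candidate_root_causes({"active_users", "request_latency", "failure_rate"})
# returns all three hints, ordered from most to least specific
```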
18:30 - 19:00: Value Proposition of Multivariate Anomaly Detection Anomaly detection enhances the ability to swiftly identify unusual behaviors and send notifications and alerts, significantly reducing the mean time to detect anomalies compared to traditional methods. It allows proactive monitoring of anomalies across key performance metrics, addressing potential issues to ensure an optimal user experience. The chapter discusses the role of an end-to-end workflow in anomaly detection.
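One concrete way the speakers reduce false positives is to notify only when several KPIs look anomalous at the same time; a tiny sketch of that gating logic (the detector outputs and the minimum count are illustrative):

```python
def should_alert(anomaly_flags: dict, min_together: int = 2) -> bool:
    """anomaly_flags maps a KPI name to whether its detector fired in the
    current window; alert only when several KPIs degrade together."""
    firing = [kpi for kpi, is_anomalous in anomaly_flags.items() if is_anomalous]
    return len(firing) >= min_together

# should_alert({"request_latency": True, "failure_rate": True, "active_users": False})
# -> True: two KPIs degraded together, so raise one combined alert
```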
19:00 - 21:30: Enhancing Root Cause Analysis The chapter titled 'Enhancing Root Cause Analysis' begins with the process of instrumentation and data collection, emphasizing the importance of collecting metrics. The storage of this data follows, creating a foundational layer. Analysis is highlighted as a key phase where anomaly detection is critical. Visualization of data is necessary to derive insights and connect various data points. Finally, the chapter discusses the importance of alerting and notifying to reduce the Mean Time To Detect (MTTD). Each step in the process is essential for an effective root cause analysis.
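The root-cause portion of the talk leans on a dependency graph built from the API calls between microservices. A small illustrative sketch with networkx (an assumed choice of library, and an invented call list):

```python
import networkx as nx

# (caller, callee) pairs observed from API calls between microservices,
# e.g. extracted from traces; the service names here are made up
observed_calls = [
    ("frontend", "cart"),
    ("frontend", "catalog"),
    ("cart", "payments"),
    ("catalog", "inventory"),
]

graph = nx.DiGraph()
graph.add_edges_from(observed_calls)

def downstream_dependencies(service: str) -> set:
    """Services that might be implicated when `service` looks anomalous."""
    return nx.descendants(graph, service)

# downstream_dependencies("frontend")
# -> {"cart", "catalog", "payments", "inventory"}
```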
21:30 - 35:00: Q&A Session The chapter 'Q&A Session' discusses various tools and methodologies for monitoring and instrumenting services, specifically focusing on KPIs of a service like an astronomy storefront. It mentions using tools like 'Open Telemetry' with SDKs in Python and Go for instrumentation. The chapter also highlights the importance of storing these KPIs for monitoring purposes and suggests various storage solutions, including Prometheus, Thanos, and Cortex, which offer high-performance storage and capabilities for monitoring services.
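Since the Q&A and the workflow both point at OpenTelemetry's Python and Go SDKs for instrumentation, here is a minimal Python sketch; metric names and attributes are illustrative, and configuring a MeterProvider with an exporter (for example the Prometheus exporter) is omitted.

```python
from opentelemetry import metrics

meter = metrics.get_meter("stellar-stash")  # hypothetical service name
requests_total = meter.create_counter(
    "storefront.requests", description="Requests served")
request_latency = meter.create_histogram(
    "storefront.request.latency", unit="s", description="Per-request latency")

def record_request(endpoint: str, latency_s: float) -> None:
    attrs = {"endpoint": endpoint}
    requests_total.add(1, attributes=attrs)
    request_latency.record(latency_s, attributes=attrs)
```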
Now You See Me: Tame MTTR with Real-Time Anomaly Detection - Kruthika Prasanna Simha & Prita Transcription
00:00 - 00:30 Good evening everyone, thank you for being here so late this evening, and I promise I won't keep you too long from your dinner plans. Today we want to talk about how you can use your anomaly detection methods and enhance them to impact mean time to respond. So without further ado, let's jump right in. We'll take a quick moment to introduce ourselves. If any of you attended my talk yesterday, you'll remember me. I'm Kruthika, I'm a
00:30 - 01:00 machine learning engineer at Apple and I work in the observability team. My background is in observability, machine learning and data science. Prita? Hi, I'm Prita, an engineering manager at Apple. I lead our Observability Intelligence and Analysis team, and my background is in observability, machine learning, data science and data engineering. All right, let's do a quick walkthrough of our agenda for today. We're going to start with a case study, talk about what the problem with delayed detection is, and then how anomaly detection can help with
01:00 - 01:30 that and reduce mean time to detect. Then we'll jump into some more advanced anomaly detection methods that can help with root cause detection and how that reduces MTTR, and finally we'll take some questions. Okay, so for most of us here, when we say anomaly detection we automatically associate it with mean time to detect. However, you can enhance your anomaly detection methods and add some contextual information to them to get
01:30 - 02:00 a lot more insights from anomaly detection. So the core focus of our talk today is going to be how you can leverage real-time anomaly detection methods to not just improve mean time to detect but also improve mean time to resolve. Some of the key takeaways we hope you'll get today: we'll talk a little bit about the value proposition of anomaly detection, cover some of the statistical and machine learning methods that are out there for anomaly detection, and then jump into the important part, which is how you can
02:00 - 02:30 improve mean time to resolve with anomaly detection, and also talk about how you can leverage some of the open-source frameworks and tools to build your anomaly detection pipeline. So with that, Prita, do you want to take it away with our use case? Thank you, Kruthika. All right, we'll start by setting up some context for our case study. Imagine that you're creating your very first online service, which is an astronomy shop. Sounds really cool. It is, it is. Every space explorer's dream. It's full of all these fancy
02:30 - 03:00 gadgets like telescopes and binoculars. We're also offering a flat discount, so get them while supplies last. What do you want as a service owner? You want to make sure that your customers are loyal so they keep coming back, you want to grow your business, and you also want to make sure that your users have a stellar experience. That's why we call it Stellar Stash. Now how do you go about doing this? At the end of the day this is an online service, right? So you want to ensure high availability, reliability and a good
03:00 - 03:30 user experience, and you do this by setting up SLAs. What are SLAs? SLAs are service level agreements. In this case they could be that 95% of the transactions in this online store need to be processed within 2 seconds, at least 2,000 requests need to go through every second, or less than 1% of the requests should fail. What you're going to do is track these SLAs by linking them to key performance indicators. In
03:30 - 04:00 this case these directly translate to request latency, throughput and failure rate, and then you track your KPIs and SLAs to ensure that there is no degradation in the service and there continues to be a good user experience. So how do you go about doing this? You can instrument your service, collect the KPI metrics, store them in Prometheus and put them up on a dashboard, and you can watch this dashboard to ensure that there is no
04:00 - 04:30 degradation of the service. But how long are you going to do that? You have to get back to selling, so this doesn't really scale. Moreover, what if your users grow? Now you have tens of millions of users. What if you want to add further subservices to your service? In this case we want to add tutorials for budding astronomers on how to use these fancy telescopes. You just can't keep watching tens of thousands of dashboards or metrics day in and day out;
04:30 - 05:00 that just doesn't scale. Also, before I move on, for The Hitchhiker's Guide to the Galaxy fans in here, there are two Easter eggs: those last two tutorials on the right are taught by Marvin. So essentially you can't scale this approach. Let's talk a little bit about the problems that you're going to run into. The first is longer detection time. Longer detection time is going to lead to missed SLAs, which is going to erode the trust that your users have in you. It is going to lead to increased
05:00 - 05:30 downtime, which is essentially going to lead to unavailability of your service from time to time. When you have issues like longer detection time and missed SLAs, all of this snowballs into compounding system failures, and you're going to spend all your time either firefighting or manually intervening, further leading to higher operational costs and overall a degraded user experience, where you're going to lose trust with your users and also potentially take a revenue hit. So not a stellar
05:30 - 06:00 experience for Stellar Stash. What we are trying to do here is solve this problem, essentially SLA breach detection. You want to track your KPIs; whenever there is abnormal behavior in these KPIs that leads to either service degradation or SLA breaches, you want to identify that, figure out what the root cause was, fix the issue and bring the service back up. All right, now that we know what we want to do, how do you do it? The answer is anomaly detection.
06:00 - 06:30 What is anomaly detection? It is a technique used in data analysis to identify data points, events or observations that deviate from the expected or normal behavior of a system. So what you're doing is using proactive and real-time monitoring to ensure reliability and scalability of your system and service, you're reducing your mean time to detect to improve your user experience, you're discovering hidden performance opportunities that lead to improved service availability, and finally, another value proposition that anomaly detection adds is that it
06:30 - 07:00 adjusts to data drifts to adapt to future growth for your service. Okay, now let's go back to an example that is connected to your astronomy online storefront. Imagine you have an SLA where you are saying that at least 95% of the users should have a latency of less than two seconds. For this you are tracking the KPI, which is the latency. You see that there is a sudden spike in your KPI. You can detect this
07:00 - 07:30 using anomaly detection, so you can quickly identify this behavior and send off a notification and alert, which would significantly reduce your mean time to detect compared to the manual approach, because you couldn't have been endlessly staring at the dashboard at the right time to catch this. So anomaly detection is basically helping you proactively monitor anomalies across all your key performance metrics to help address potential issues and ensure a great user experience. All right, so now let's talk about what an end-to-end workflow for
07:30 - 08:00 anomaly detection would look like. It starts off with instrumentation and collection, because we have to collect metrics. Then you need to store this data, so that's the storage layer. You have to analyze, because that's where anomaly detection comes in. You have to visualize, because that really helps you derive insights and connect the dots. And then finally you have to alert and notify, because that is very crucial to reducing your mean time to detect. Let's delve deeper into some of these steps. The first is instrumentation and collection. There are various open
08:00 - 08:30 source libraries, like of course OpenTelemetry, that can be used with SDKs in Python and Go to instrument your service, in this case the astronomy storefront, and collect the KPIs that you care about. The next layer we have here is storage. You of course want to store these KPIs to be able to monitor them. For this you can use various storage solutions like Prometheus, or Thanos and Cortex, which provide high
08:30 - 09:00 availability, multi-tenancy and long-term, persistent storage. And then comes analysis. This is where we do anomaly detection. There are various libraries across multiple languages that are available to do both statistical and machine learning anomaly detection. We'll talk about it in a little bit, but you can use libraries and frameworks like TensorFlow and scikit-learn to do this. You can also use frameworks like Kubeflow and KServe to actually train your models and deploy
09:00 - 09:30 them. Kubeflow comes out of the box with a lot of features like distributed training, experiment tracking, hyperparameter tuning and integration with CPUs and GPUs, and then KServe comes integrated with Kubeflow to actually be able to serve these models in production. Before we go into the rest of the workflow, let's zoom in a little bit into the anomaly detection and modeling layer. As mentioned earlier, there are multiple models that we can use. We'll focus here on the statistical models; Kruthika will talk a little bit
09:30 - 10:00 more about the machine learning models. There are various options for this: z-score, ARIMA, STL decomposition, and based on certain conditions some are better than others. For the online astronomy storefront, z-score is good because of its versatility. When you want data smoothing you can use the Holt-Winters method, because that works well with those use cases; it's also available out of the box with Prometheus. What's great about these models is that they are lightweight,
10:00 - 10:30 scalable and offer training-free anomaly detection, so you essentially don't have to have pre-training data available. They can perform real-time analysis with minimal setup, are fast and generalize well across data types. They are also good at quickly identifying threshold violations, handling unknown environments, and, like I said, even working with limited historical data. And finally they have lower operational cost: machine learning models require building training pipelines, serving
10:30 - 11:00 pipelines, production environments, and all of that is often not a cost-effective choice for a lot of anomaly detection use cases. Very quickly, like I said, there are lots of libraries across various languages. If you want to use Python, we have statsmodels, SciPy and TensorFlow for time series analysis, which come with a lot of handy functions. In Go there are stats and stream-processing packages. Prometheus, fortunately, right out of the box comes with a lot of functions that
11:00 - 11:30 enable time series analysis and anomaly detection. So you have predict_linear, you have standard deviation over time, which can be used with z-score, and you have holt_winters, which, like I said, is often used and is a very good function to go with. All right, now we come back to our workflow. Once we have completed analysis, that is anomaly detection, we get to visualization. Like I said, visualization is key to deriving the most actionable insights from the anomalies. You can use various open-source plugins
11:30 - 12:00 and toolkits for this, like dashboards from Elasticsearch and Perses, to create custom dashboards and representations that are tailored to specific needs. And then finally, like I said earlier, alerting and notification are key, because if you are not able to alert on time it will not significantly reduce your mean time to detect. These can be configured using tools like Alertmanager, which helps you manage alerts, create custom notification policies and work with your preferred communication
12:00 - 12:30 channels. Okay, so now let's come back and tie together the value proposition for anomaly detection, especially in the context of observability and with respect to our astronomy online store. What we did here was leverage anomaly detection to generate data-driven insights to detect our regressions in a quicker manner. This effectively reduced the downtime of our service, thereby essentially reducing the mean time to
12:30 - 13:00 detect. This also helped us improve our operational efficiency, maintain our SLAs and enhance customer experience, leading to a stellar experience on Stellar Stash. Try saying that ten times in a row; we should have picked better words. I will now pass it on to Kruthika, who will talk about how anomaly detection can impact MTTR. All right, thank you, Prita. So far we've looked at how anomaly detection methods can be used
13:00 - 13:30 to find anomalous behaviors in single metrics and, you know, find performance issues. This approach is great because you're just using one metric, right? That means easy interpretation of the results, but it comes with its own downsides. Because you're using only one metric to perform anomaly detection, it has limited context of the system that you have at hand, and because of that it is not really indicative of service degradation.
13:30 - 14:00 Okay, so while univariate analysis focuses on a single KPI, multivariate anomaly detection allows you to analyze anomalies across all of your KPIs at once, or maybe some combination of them. Essentially, it lets you perform anomaly detection on multiple metrics at the same time, and because you're using multiple metrics at the same time you have more contextual information about the system itself, because using multiple
14:00 - 14:30 metrics gives you more information, and from the correlations that you can generate from these metrics you get a better indication of when a service is degrading or there is a potential issue. So at a very high level, multivariate anomaly detection utilizes the correlations and relationships between the key KPIs that you've defined, like we've defined here, and uses this to get deeper insights and detect how the combined degradation of these
14:30 - 15:00 metrics can indicate a broader system performance issue. So let's now go back to our astronomy shop use case and take a very simple example to understand how multivariate anomaly detection helps. Imagine that one morning you woke up and saw that a lot more people were jumping online onto your store and buying more telescopes and binoculars, but that was also followed by an increase in the request latency and the failure rate. So
15:00 - 15:30 univariate anomaly detection can only tell you that there is some abnormal behavior in each of these individual metrics and that there's some performance issue. But with multivariate anomaly detection, the model is capable of identifying correlations between the metrics that we have here, which are the number of active users, the request latency and the failure rate, and it uses all of this added contextual information to say that there is some anomalous behavior happening.
15:30 - 16:00 And so now what you can do is set up alerts on all of these metrics to notify if and only if multiple of these metrics fire at the same time. This way you can reduce the number of false positives that you get and also ensure that you get a general idea of the overall system performance as a whole, and if there's a degradation you get to know that too. Okay, so let's dive a little deeper into ML models that can be
16:00 - 16:30 used for multivariate analysis. There are actually a plethora of models that you can use for multivariate analysis, but we have highlighted here models that are very relevant for the use case that we are trying to solve, which is SLA breach detection. The first one is the autoencoder method. Autoencoders are great at compressing high-dimensional data and filtering out the noise from it. Once they've filtered out the noise, they reconstruct this high-dimensional data and compare the reconstructed data to the actual metric.
16:30 - 17:00 Depending on what the reconstruction error is between these two data points, it says whether the point is anomalous or not. The next one is LSTMs. We've all heard of LSTMs in some way or the other, I hope. LSTMs are essentially memory cells, and they have a memory of the historical trends of the metrics that you feed into them. Using these historical trends they predict the next timestamp for that metric and compare that next timestamp to the actual metric to see if there is
17:00 - 17:30 a larger prediction error. If there's a larger prediction error, that is an anomaly. And finally, let's talk about graph neural networks. This is actually a very important method for the astronomy shop we're talking about, because the astronomy shop is actually a microservice setup that we are running. Graph neural networks can help with understanding component-level relationships, because they map the components and the metrics to individual nodes of a graph and use that graph to understand the
17:30 - 18:00 relationships, and using these relationships they can predict the next timestamp, compare that to the actual value and say that there's an anomaly. Here are some open-source tools and frameworks that you can use for multivariate anomaly detection or any other anomaly detection models. Do you guys want to take a quick photo? I'm going to just leave it up for a second and then move on. All right, moving on. Okay, so let's come back to that univariate anomaly
18:00 - 18:30 detection workflow that Prita was talking about, and let's see how we need to modify it for multivariate anomaly detection. There's really not that much modification needed; you only have to expand the analysis layer to also include correlation along with anomaly detection. Correlation helps you understand metric-level relationships, while dependency graphs help you understand service-level relationships. Now there's a new word there, dependency graphs; we're going to talk about that, but before we get there we're going to dive into this analysis layer a
18:30 - 19:00 little. Okay, so what if I told you that you can extend that multivariate anomaly detection setup that we had to perform root cause identification? Are you intrigued? Before we get there, we're going to go back to the example and look at what we mean by root cause identification. Okay, so again let's simplify the example a little bit. Let's take that same setup where there was an increase in the number of
19:00 - 19:30 users, the request latency and the failure rate. This specific combination of metric degradations could indicate a very specific potential issue or issues; in this case it could be indicative of a resource bottleneck. So multivariate anomaly detection can actually inform you of this resource bottleneck issue that we have at hand, and with this root cause identification you can target your troubleshooting efforts to the very specific causes that have been outlined, so you
19:30 - 20:00 just improve the speed at which you resolve the issue. All right, so let's come back to that multivariate setup and look at what needs to change for us to adapt this to a root cause identification problem. Again, there's really not much of a change. All you need to do is change your output layer for anomaly detection to emit potential root causes instead of anomalies, and with this you get metric-level root cause identification. So what if you want to extend this setup
20:00 - 20:30 to also get component-level or microservice-level root cause identification? You just have to introduce a dependency graph as context for your anomaly detection method, and along with the correlation it can provide component-level root cause identification. So let's come back to that word, dependency graph. A dependency graph is nothing but a representation of how different components or microservices relate to one another and how
20:30 - 21:00 they interact with one another, and it's actually very easy to build a dependency graph. All you need are the API calls that happen between the microservices, or the data flows that happen within the components, and you can generate a dependency graph for your service or microservices. That's pretty cool, right? All right, so now let's go back to the larger picture and address the premise that we originally set out to solve, which is mean time to resolve. We have
21:00 - 21:30 now leveraged anomaly detection to perform root cause identification, and this provides a great initial triage step because you get a list of possible root causes. So you can use those possible root causes in your analysis layer and very quickly get to a resolution. Let's say you've encountered a simple, perhaps even a known issue, something that you've seen before; you can actually even automate the root cause analysis and
21:30 - 22:00 resolution layers. If you tie it back to the astronomy use case that we had, where we were looking at the resource bottleneck, maybe you built a very cool automated log analysis setup along with your observability stack. Your log analysis setup told you that the root cause was just that your CPU capacity was not enough to handle the load, and all you have to do is increase the CPU capacity and you would be good to go. So this you can actually automate. But maybe you have more complex issues,
22:00 - 22:30 where even the initial triage that the root cause identification step gives you is root causes that need more manual analysis. In this case, the individual who's performing this analysis can use the shorter list of potential root causes that you provided to analyze and identify the issue quickly and fix it just as quickly. Okay, so now let's go back to what the value proposition for multivariate
22:30 - 23:00 anomaly detection is, and focus in on the initial root cause identification that we spoke about. With this, the overall manual oversight that you need to identify an issue and fix it is lowered, because you already have a few possible root causes given to you which you can look at, and because of the holistic view of the system health that you get, it can help you focus your energies on analyzing very specific root
23:00 - 23:30 causes instead of trying to go and look for a needle in a haystack, and overall it helps you improve your mean time to respond. Now we need to tie all this back to mean time to resolve, so let's quickly take a look at what comprises mean time to resolve. It includes mean time to detect, mean time to acknowledge, mean time to respond, mean time to repair and mean time to resolve. While we cannot impact all of the components that are shown here, by enhancing anomaly
23:30 - 24:00 detection to include root cause identification you can impact both mean time to detect and mean time to respond, and reduce them as much as possible, because you've now been given some very specific root causes to look at. With this cumulative reduction in time across mean time to detect and mean time to respond, your overall mean time to resolve also improves and comes down. All right, so as promised, we spoke
24:00 - 24:30 about the value proposition of anomaly detection, spoke about some of the statistical methods and machine learning methods that you can use for anomaly detection, talked about how you can improve mean time to resolve by enhancing your anomaly detection methods to indicate root causes, and finally we also spoke about how you can leverage open-source frameworks and tools like Kubeflow and KServe to build your anomaly detection pipeline. So today we've explored
24:30 - 25:00 leveraging anomaly detection for SLA breach detection and root cause detection. We just want to say that we're not being very prescriptive here; this is just one of the ways that you can implement root cause detection and SLA breach detection using the anomaly detection methods that we showed. So we hope that this talk was helpful for you and that it sparked some new ideas on how you can use root cause identification for your future projects. Thank you for being here so
25:00 - 25:30 late. Now we'll open it up to [Applause] questions. Hey, so this looks very good. Is there a very dumbed-down version where I can do a setup, like a test setup, for my environment? Sorry, can you repeat your... Like, is there a dumbed-down version of actually doing the
25:30 - 26:00 installations, like use Kubeflow and then do this, then do this plugin, like a demo or proof-of-concept kind of thing that we can actually follow? Maybe we should just talk, because we're not able to hear you. You can just come up. Sorry about that, we are not able to hear what you're saying. So your question is, if
26:00 - 26:30 you... so the question is, is there a readme that you can use to set up the same thing that we spoke about here for any of your use cases? Right now there's technically no readme; this is just for you to understand what is really out there and what you can really use for your anomaly detection and root cause identification problems. There's also a lot more that you can do with, you know, even adding logs
26:30 - 27:00 into the mix, getting root cause analysis and all that. So you'll have to explore it a little bit on your own, but we can help give you some ideas maybe. That's a good piece of feedback we could take back, yeah, to work on and have something in the works. Can you hear me okay? Yeah. I was wondering, have you noticed that you have to... is there a lot of tuning involved after selecting a model? After you implement it, do you have to do a lot of
27:00 - 27:30 tuning? It depends on what kind of model you pick, so it's very... oh sorry, I was supposed to repeat the question. So the question is, is there a lot of tuning involved in trying to use machine learning for anomaly detection? It depends, again, on what model you pick. You can actually pick some existing pre-trained models and use those, tune just whatever is needed, and go with that for your anomaly detection needs, but over time you will need to tune your models because
27:30 - 28:00 there will be some data drift; the data patterns change as time goes on, so you will have to tune it. How would you know that you picked the wrong model? You can compare the, what do you call, prediction errors between what is being predicted and what the actual value is; usually that ends up being very high if you have picked the wrong model. Gotcha. Even when you scale, that's another time when you have to tune the models, because anomaly detection models are traditionally hard to scale across a diverse set of use cases,
28:00 - 28:30 so in both those cases you have to do that and, like Kruthika said, identify that the error has increased and change accordingly. Gotcha, thank you so much. Great presentation. I had a similar question to the one that the gentleman asked earlier. I know you kind of broke it up into different steps and explained the platforms relevant for each of those
28:30 - 29:00 steps, but is there a way, or is there a platform or a product, that combines some of these to a degree so that I don't have to build it all myself? There are some vendor solutions that you can use... oh sorry, the question was, is there a platform that integrates all of this into one single pipeline? There are some vendor solutions that you can use to get anomaly detection out of the box, and like Prita said,
29:00 - 29:30 Prometheus has a lot of, what you'd say, functions that you can use out of the box, so you don't actually have to go and set up all of this pipeline to get anomaly detection. You can use Prometheus, and you can use some of the open-source libraries that are there in Python and Go for setting it up. One of our key takeaways is to try to work on making something like that available, so we'll do that, of course open source. Yes, when you make the jump from anomaly detection to root cause prediction, is
29:30 - 30:00 your assumption there that you have labeled training data, or how do you actually generate those predictions? That is a great question. So the question was, are you assuming that there's labeled training data for all of this, to use the machine learning models? The LSTM models that we spoke about do require labeled training data, but the other models can work with unlabeled data; autoencoders actually specialize in
30:00 - 30:30 unlabeled data, so you can use those for unlabeled data. Yeah, and nowadays you can also utilize a lot of semi-supervised approaches where you have a very limited amount of labeled data, because in certain situations that really increases your accuracy. So there's a wide variety of options available, from, like Kruthika said, unsupervised to semi-supervised, where you can tailor it to what you have available. Thank you. Hello, great
30:30 - 31:00 presentation. How did you build the dependency graph or topology, and which open-source tool did you use? Also, another question: where do you store this graph? So your first question is how did we build the dependency graph? Yeah, so you can build dependency graphs depending on what kind of a service you have. If you have just a single service that has all the components within itself, you can use the
31:00 - 31:30 data flow to build your dependency graph, and if you have a microservice setup then you can use the API calls they make to each other, track them, and actually build your dependency graph based on that. I didn't follow the next part of your question. Where do you store them? Yeah, where do you store them? That's a good question. You can actually store them in any SQL database, Postgres, any database. Yeah, and if you want
31:30 - 32:00 to store it as a graph, there are storage options for graphs too. Okay, yeah, and a lot of open-source options for that as well. Thank you. Would you be able to share an example of how you've used this in practice? The theory is nice, but I don't know if you could give some maybe more concrete example. The question is, how do you use this in practice? Do you want to take that?
32:00 - 32:30 Are you looking for a specific example? Sorry, we did repeat the question, right? Okay, are you looking for a specific example? Yeah, I am curious about your practical experience using this, and the results that you guys have gotten, and so forth. So, yeah, like Kruthika said, we don't want to be prescriptive about any specific service, but going back to it: if you look at your astronomy store, it is a service, right? You have a service, and then
32:30 - 33:00 there are multiple microservices. So anytime you scale that out to a larger setup, you have services and microservices, you identify individual anomalies in these metrics, and then you connect them. There are a couple of things, like the dependency graph, where you do need some sort of domain expertise at times; that's the sort of stuff Kruthika went into. But yeah, I mean, you could extrapolate that to a
33:00 - 33:30 service. Thanks, and have you guys used any of these automated remediations in practice? I don't think we can talk about that, I'm sorry. We want to focus here on what's available open source, but, you know, happy to chat. Yeah, go ahead. Sorry, do
33:30 - 34:00 you have any ideas on how to choose the correct features to input into your machine learning model? For example, I think you are using SLA-related features like response time, and in addition, in my case, I'm using resource data such as CPU and memory... Maybe we can chat afterwards, we're not able to... yeah, something is off with the audio, we're not able to... sorry. Simply put, my
34:00 - 34:30 question is: please give me some ideas on how to choose features to input into your machine learning model. Yes, for example CPU or response time? What are the specific features for response time? No, no, for example response time versus CPU usage rate or memory usage; I think you are using those kinds of
34:30 - 35:00 quantities. How do you identify the features that reduce response time effectively? Yeah, from my experience, sometimes, for example, the golden signals in the SRE book do not work well, so in that sense do you have any ideas on how to choose good features? Yeah, so I'm actually going to give you a very machine learning response to that. You can feed all of the metrics that you have as features to
35:00 - 35:30 your model, and really the modern machine learning and deep learning models we have are very good at identifying the useful features from the large set of features that you feed them. But if you want specific methods that help you pick out the important features, autoencoders are one option, because they do dimensionality reduction on high-dimensional data, and that low-dimensional representation can be used as the features to feed into your model. There are also statistical methods like
35:30 - 36:00 PCA, principal component analysis, which you can use for feature extraction. And the non-machine-learning answer is: go with the metrics that you see have a high signal-to-noise ratio; if you have expertise on your system, the obvious ones would stand out. So in addition to ML techniques like autoencoders, classical techniques such as PCA could also work? Yeah, I agree with you. Thank you. We
36:00 - 36:30 would also like to chat offline, because, sorry, there are a lot of audio issues, but thank you for the question. Yeah, it's okay, I understand, thank you. Thank you, thank you so much. Thank you.