Deep Dive into SLAs
Lecture 21: SLA-Tutorial
Summary
This lecture provides a comprehensive tutorial on Service Level Agreements (SLAs) in cloud computing. It highlights the critical role SLAs play between service providers and consumers, ensuring the agreed level of service is met. The lecture addresses the lack of a standardized SLA across cloud services due to the varying types of services and consumer needs. The tutorial includes examples of SLA calculations, such as service availability and outages, and discusses the complexities involved in different scenarios. Additionally, it covers best practices for evaluating and managing SLAs to ensure compliance and reliability, offering a valuable resource for understanding these essential agreements.
Highlights
- Understanding the importance of SLAs in cloud computing and their role in assuring service quality. 🤝
- Exploring the absence of a universal SLA standard across different cloud services. ⚖️
- Examining practical SLA problems to understand outage calculations and compliance checks. ➗
- Discussing best practices and guidelines for setting up robust SLAs. 📘
- Considering factors like data policies and security in SLA management. 🔐
Key Takeaways
- SLAs are crucial for defining the service expectations between cloud providers and consumers. 📜
- There is no standard SLA due to the diverse nature of services and consumer requirements. 🔄
- SLAs include various metrics like uptime, downtime, and policy guidelines. 📊
- Practical examples can demonstrate how to calculate and verify SLA compliance. 🧮
- Best practices for SLAs involve understanding cloud actor roles and the importance of clear service management. 💡
Overview
In this lecture, the focus is on Service Level Agreements (SLAs), especially in the realm of cloud computing. The introduction sets the stage by defining SLAs and their significance in maintaining the relationship between cloud service providers and consumers. It also discusses the dynamic nature of SLAs, given the varied cloud services and consumer expectations.
Through practical examples, the tutorial dives into calculating service availability and identifying SLA violations. By exploring realistic scenarios, learners can grasp how SLAs work in actual business contexts, assessing compliance through specific calculations to evaluate service performance.
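As a minimal sketch of the kind of check described here, the lecture's first problem (a 99% availability guarantee, an application scheduled 12 hours a day for a 30-day month, and 10.75 hours of outage) could be coded as below. The function names are illustrative assumptions, not part of the lecture. Note that the lecture divides the outage by the remaining service duration (349.25 hours) and reports 96.92%, whereas this sketch divides by the 360 scheduled hours, the form the lecture uses in its second problem, and gets about 97.01%. Either way the result is below 99%, so the conclusion is the same.

```python
def availability_percent(scheduled_hours: float, outage_hours: float) -> float:
    """Return the percentage of scheduled time the service was available."""
    if scheduled_hours <= 0:
        raise ValueError("scheduled_hours must be positive")
    return (1 - outage_hours / scheduled_hours) * 100


def violates_sla(guaranteed_percent: float,
                 scheduled_hours: float,
                 outage_hours: float) -> bool:
    """True when measured availability falls below the guaranteed level."""
    return availability_percent(scheduled_hours, outage_hours) < guaranteed_percent


# Figures from the first worked problem in the lecture.
scheduled = 12 * 30   # 12 hours a day for 30 days = 360 hours
outage = 10.75        # total outage in hours over the month

measured = availability_percent(scheduled, outage)
print(f"availability: {measured:.2f}%")
print("SLA violated:", violates_sla(99.0, scheduled, outage))
```

The comparison is strict, so a service that exactly meets the guaranteed percentage is treated as compliant; that choice is an assumption of this sketch.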
The discussion extends to address best practices in formulating and managing SLAs. It emphasizes understanding the roles of cloud actors, evaluating business and service-level policies, and managing failure provisions. Such insights are crucial for both providers and consumers in navigating the intricate landscape of cloud agreements.
Chapters
- 00:00 - 03:00: SLA Overview This chapter introduces the tutorial as a continuation of the earlier SLA lectures. It explains why cloud providers and consumers need a Service Level Agreement (SLA), notes that no standard SLA exists yet because services and consumers vary widely, and distinguishes policy-driven terms (data residency, backup policy) from parameter-based metrics (uptime, CPU usage, disk usage).
- 03:00 - 06:00: SLA Guidelines and Significance This chapter recaps the SLA as a formal contract between provider and consumer that underpins consumer trust and defines a formal basis for performance and availability, with Service Level Objectives (SLOs) as objectively measurable conditions. It then states the first problem: a 99% availability guarantee, an application running 12 hours a day, and 10.75 hours of outage in one month.
- 06:00 - 12:00: SLA Metrics and Policies This chapter works through the first problem, concluding that the provider violated the 99% guarantee, and explains that availability requirements can vary by time of day, using a banking organization and an institute's lab hours as examples.
- 12:00 - 15:00: Problem-Solving: SLA Violations This chapter sets up the second problem: company X buys a cloud service from provider P with a 99.95% availability guarantee over a 30-day service period, 12 service hours per day, at $50 a day. Service credits of 10% or 25% apply when the guarantee is missed, and five outages were recorded during the period.
- 15:00 - 20:00: SLA Calculations and Penalties This chapter computes the total service hours (360), the total negotiated cost ($1,500), the total downtime, and the resulting service availability for the second problem.
- 20:00 - 28:00: Complexity of SLAs in Practice This chapter applies the 25% service credit to arrive at an effective cost of $1,125 and notes that real SLAs involve more than uptime, for example degraded bandwidth. It then reviews SLA terminology used by commercial providers, with definitions drawn from Microsoft Azure such as downtime, error code, service credit, availability set, fault domain, and update domain.
- 28:00 - 35:00: Best Practices for SLA Formulation This chapter covers how monthly uptime and service credits are defined for cloud services and virtual machines, then walks through the Cloud Standards Customer Council steps for evaluating SLAs: identify the cloud actors, evaluate business-level policies, understand the service and deployment models, identify metrics, consider security, identify service management requirements, and prepare for failure, including an exit clause.
- 35:00 - 36:00: Conclusion The final chapter recaps the best practices and the two worked problems, which, although simple, show how SLA compliance can be calculated.
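The penalty calculation from the lecture's second problem can be sketched as follows. The inputs come from the lecture: a 30-day service period, 12 service hours per day, $50 per day, a 10% credit when monthly uptime is below 99.95% but at least 99%, and a 25% credit below 99%. The function names are hypothetical, and the sketch takes the lecture's stated total downtime of 9 hours 25 minutes as given; the individual outage durations as transcribed add up to slightly more, which would not change the credit tier.

```python
def service_credit_rate(uptime_percent: float) -> float:
    """Map monthly uptime to the credit fraction agreed in the SLA."""
    if uptime_percent < 99.0:
        return 0.25
    if uptime_percent < 99.95:
        return 0.10
    return 0.0


def effective_cost(total_cost: float, uptime_percent: float) -> float:
    """Cost payable after the service credit has been deducted."""
    return total_cost * (1 - service_credit_rate(uptime_percent))


# Figures from the second worked problem in the lecture.
service_hours = 30 * 12          # 360 hours over the service period
total_cost = 30 * 50             # $1,500 at the time of negotiation
downtime_hours = 9 + 25 / 60     # total downtime as stated in the lecture

uptime = (1 - downtime_hours / service_hours) * 100
print(f"uptime: {uptime:.2f}%")
print(f"credit: {service_credit_rate(uptime):.0%}")
print(f"effective cost: ${effective_cost(total_cost, uptime):,.2f}")
```

Keeping the tier lookup separate from the cost calculation makes it easy to swap in a different credit schedule without touching the arithmetic.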
Lecture 21: SLA-Tutorial Transcription
- 00:00 - 00:30 Hi. So today we will have a small tutorial on SLA, in continuation of whatever we discussed on SLA earlier. SLA, as we understand, stands for service level agreement,
- 00:30 - 01:00 and it is important for any cloud service provider and cloud service consumer to have an agreement in order to either consume or provide the service. As we understand, when we are using cloud computing we are basically leveraging third-party services: there are the service providers, and the consumer's things are hosted with them. So it is
- 01:00 - 01:30 both ways: we should sign up to, or we should have, an SLA. Unfortunately, there is as such no standard, or rather I should say the standards are still evolving, for a standard SLA across the different services. One of the major reasons may be that there are different types of services provided by different providers,
- 01:30 - 02:00 and there are different categories of consumers who require services at different scales. Nevertheless, there are some broad guidelines by which these SLAs are governed. So whenever you take a service from any service provider, be it any of the commercial clouds like, say, Microsoft Azure, Google Cloud, IBM Cloud, Amazon cloud, or any other cloud,
- 02:00 - 02:30 you need to agree on some agreement. What we have seen in our SLA talk is how these agreements can be formed, or what the basic underlying infrastructure for that will be. For that matter, we have talked about SLOs, KPIs, and so on, which allow us to build up this SLA. So in some cases this
- 02:30 - 03:00 is some of the metrics which will be there; in some cases it is policy driven, like where your data should reside, what the backup policy should be, and so on. Those are more policy driven, whereas some of the agreements are more parameter based, like what the uptime, CPU usage, or disk usage is; these are some of the things which are metrics. So what we will do is try to look at one or two problems. Before that, we will see how different parameters are considered in different types of SLAs,
- 03:00 - 03:30 in different types of commercial clouds. We have taken this, again, from internet resources, primarily from commercial providers like Azure, IBM, Google, Amazon, and others. The idea is to see how they frame their things. So before looking
- 03:30 - 04:00 at those things, we just see, as we have discussed earlier, this slide: an SLA is a formal contract between the provider and the consumer. It is the foundation of the consumer's trust in the provider, and in some sense it states how the provider wants to serve this consumer. Its purpose is to define a formal basis for performance and
- 04:00 - 04:30 availability of the service that the provider guarantees to deliver. And, as we have talked about, SLOs are objectively measurable conditions for the service, and the SLA and SLOs are the basis for the selection of a cloud provider. This we have seen; I just kept this one slide so that things will be there. So we will discuss these two problems and then try to look at some aspects of how
- 04:30 - 05:00 these commercial clouds define their SLAs. So let us have a simple problem. Suppose a cloud guarantees service availability for 99% of the time. Let a third-party application run in the cloud for 12 hours a day. At the end of one month, it was found that there was a total outage of 10.75 hours. Find out whether the provider has violated the initial availability
- 05:00 - 05:30 guarantee. So, very straightforward: the SLA gives 99% of the time, a third-party application runs in the cloud for 12 hours a day, at the end of a month the outage is 10.75 hours, and you want to find out whether the provider has violated the initial availability guarantee. So that is the problem. The total time for
- 05:30 - 06:00 which the application was to run in a month is equal to 12 × 30 = 360 hours.
- 06:00 - 06:30 Right. Now, the outage time is 10.75 hours; this has been given. Therefore, the service duration is equal
- 06:30 - 07:00 to 360 - 10.75 = 349.25 hours. So the percentage availability is equal to
- 07:00 - 07:30 (1 - 10.75 / 349.25) × 100, which is 96.92%. So this much percentage availability will be there; this
- 07:30 - 08:00 you can calculate in a straightforward way. The initial service guarantee was 99%; hence the final service availability is less than
- 08:00 - 08:30 the initial service guarantee. So what we can conclude is that the cloud service provider, the CSP, has violated the SLA. So it is very straightforward, simple arithmetic,
- 08:30 - 09:00 right? But what we see is that if we can somehow measure this type of thing, then as far as availability is concerned we can calculate whether SLA violations are there. If there is a set of SLAs which need to be looked into, for every component we can have this sort of simple calculation, or in some cases it may be
- 09:00 - 09:30 a little complex, when you want to do some statistical analysis to find out something, and then you can say whether there is an SLA violation or whether the SLA has been honored. So this way we can calculate whether any SLA is satisfied or not.
- 09:30 - 10:00 So this is pretty straightforward, but in reality it may not be that straightforward. I can have different types of availability at different points of time. For that matter, if I consider a commercial organization, say for example a banking organization, in the SLA I can say that during my peak hours, say 09:00
- 10:00 - 10:30 to 17:00 hours, I require an availability of 99.9%, whereas in off-peak hours I can divide them into different scales. I can say that from 17:00 to 19:00 hours I can have 99%, whereas from 19:00 to 08:00 hours the next day I can
- 10:30 - 11:00 still bring it down to, say, 97%, and from 08:00 to 09:00 hours I can say it is again 99%. Now what I mean to say is that this availability requirement may also vary over time, based on your business requirements. So based
- 11:00 - 11:30 on your requirement things will change. For an institute, I can say that if our labs are running between 2 and 5, or say 2 to 6, and in the morning from 8 to 12, then during those lab hours I require a high percentage of availability. However, during the off-peak hours or evening hours I may require a much reduced level, because the more you guarantee the services, the more you pay for them. So
- 11:30 - 12:00 that is required, and there may be more complex calculations to look at. Similarly, we can look at another problem, which is a little bit of an extension of the previous one. We consider a scenario where a company X wants to use a cloud service from a provider P. So there is
- 12:00 - 12:30 a company X which wants to use a cloud service from an external provider P. The service level agreement guarantees negotiated between the two parties prior to initiating the business are as follows. The service level guarantees are like this: the availability guarantee is 99.95% of the time over the service period. So it
- 12:30 - 13:00 should be available 99.95% of the time. The service period is 30 days, the maximum service hours per day is 12 hours, and the cost is, say, $50 a day. So this is the type of agreement, the formal requirement, which has been agreed upon between the service provider and the service consumer. So availability 99.95%,
- 13:00 - 13:30 service period 30 days, maximum service hours per day 12 hours, and cost $50 a day. Service credits are awarded to the consumer if the availability guarantees are not satisfied. So there is another part: if the provider fails to provide service at the guaranteed
- 13:30 - 14:00 level which has been agreed upon and for which the consumer is being charged, then the provider has to pay a penalty. The penalty can be in terms of money, or it can be in terms of giving some extra compute hours or data or whatever. In this case, the monthly connectivity uptime service level is given as follows: if the monthly uptime percentage is less than 99.95% but
- 14:00 - 14:30 greater than or equal to 99%, then the service credit is 10%, whereas if it is less than 99%, then the service credit is 25%. However, in reality it was found that over the service period the cloud service suffered five outages of the following
- 14:30 - 15:00 durations: 5 hours, 30 minutes, 1 hour 30 minutes, 15 minutes, and 2 hours 25 minutes, each on a different day, due to which the normal service guarantees were violated. If the SLA negotiations are honored, we need to compute the effective cost payable towards buying this cloud service. So this is
- 15:00 - 15:30 what we need to check: how much effectively needs to be paid by the consumer for this cloud service. So again, just to quickly repeat: there are some guarantees, availability 99.95%, service period 30 days, maximum 12 hours a day, $50 a day, and there is a penalty for not providing the service: less than 99.95% but greater than or equal to 99% is 10%, and less than 99%
- 15:30 - 16:00 is 25%. And there are five outages: 5 hours, 30 minutes, 1 hour 30 minutes, 15 minutes, and 2 hours 25 minutes. We need to find out the effective cost payable towards buying the cloud service. So this we have to work on; again, not a difficult problem, but it gives us an idea of how things work. The service period is 30 days, 12 hours a day, so therefore we have in total
- 16:00 - 16:30 so many hours, that is, 360 hours. The cost, as we have seen, is US $50 per day,
- 16:30 - 17:00 so the total cost, and this is at the time of
- 17:00 - 17:30 service negotiation, is $50 × 30. Fine. So these are
- 17:30 - 18:00 the facts we have been given: the service duration is 30 days, the duration per day is 12 hours, the total service uptime is 360 hours, $50 per day is the cost, and the total cost at the time of service negotiation is 50 × 30, that is, $1,500. Now the total service downtime
- 18:00 - 18:30 is 5 hours plus 30 minutes plus 1 hour 30 minutes plus 15 minutes plus 2 hours 25 minutes. If you add it up, it is 9 hours 25 minutes.
- 18:30 - 19:00 OK, so this is the total outage time, or the total downtime. So we can say service availability is equal to one minus, this we have seen previously
- 19:00 - 19:30 also and this is the standard thing, one minus downtime by uptime, which is equal to (1 - 9 hours 25 minutes / 360 hours)
- 19:30 - 20:00 × 100 percent. So this was our total expected uptime, and this is the outage or the downtime, so one minus downtime by uptime, and what we have is 97.385%. OK, so this is fine.
- 20:00 - 20:30 We calculate the service availability as 97.385%. So as per the data available, what we see is that the monthly uptime percentage is 97.385%,
- 20:30 - 21:00 as we have calculated, which is less than 99%; not only less than 99.95%, but less than 99%. So the service credit available, as per whatever was agreed during the service negotiation or in the SLA,
- 21:00 - 21:30 is 25% of the total cost. The total cost, as we have calculated, is $1,500, so the credit is $375. So the effective cost payable towards buying the service
- 21:30 - 22:00 is equal to $1,500 - $375 = $1,125.
- 22:00 - 22:30 So this is the effective cost. If you look at problem two, it is a little bit of an extension of the first problem, but this is what happens in reality. So we need to measure these things, log things, and calculate accordingly. This is with respect
- 22:30 - 23:00 to the uptime, and there can be other aspects: the service may not be totally down, but your available bandwidth may be low, or the network may be slow, and that type of thing. So it depends on a lot of other aspects. It is not always that straightforward; there are a lot of other complex considerations in doing so. What we tried in these two problems is to show how SLA guarantees can be calculated or looked into,
- 23:00 - 23:30 how it can be calculated, and to see whether there is a violation of the SLA or not. I believe this will give you a broad idea of things. Now what we will see is what the different practices or the different components are; one that we have calculated is the uptime. So what are the
- 23:30 - 24:00 different components of an SLA which are considered primarily by the commercial clouds? Just a couple of things we will see. So, the SLA for a cloud service
- 24:00 - 24:30 provider we have already seen, but there are some aspects which we would like to highlight. In the case of a commercial cloud, there is the applicable
- 24:30 - 25:00 monthly period, which means, for a calendar month in which a service credit is owed, the number of days that you are a subscriber of a service. So that is the applicable service period. Similarly, there are applicable monthly service fees. Then the concept of downtime, as we have seen, is defined for each service in the service-specific terms. Error code means an indication that an operation has failed, such as an HTTP status code in the 5xx range. So services
- 25:00 - 25:30 should have an error code; otherwise it is difficult to pinpoint why a service has failed. External connectivity is bidirectional network traffic over supported protocols such as HTTP or HTTPS that can be sent and received from a public IP address, and so on. Incident means any single event or set of events that results in downtime. Management portal means the web interface provided by
- 25:30 - 26:00 the provider, this being basically meant for Microsoft Azure, through which the consumer may manage the service. Service credit: if there is a failure, how much credit will be given, as we have seen in this problem. Service level means the performance metrics set forth in the SLA that, in this case, Microsoft Azure agrees to meet in the delivery of the services. Then success code: just as we have seen a failure
- 26:00 - 26:30 code, we have a success code; in HTTP we know that 2xx is the success code. Support window refers to the period of time during which a service feature or compatibility with a separate product or service is supported. Along with that, there are some additional definitions. Availability set refers to two or more virtual machines deployed across different fault domains, so
- 26:30 - 27:00 that they will not go down at the same time, to avoid a single point of failure. Cloud services refers to the compute resources utilized for web and worker roles. Fault domain is a collection of servers that share common resources such as power and network connectivity. Tenant represents one or more roles, that is,
- 27:00 - 27:30 one or more role instances, that are deployed in a single package. This we have seen; it can be a worker role or a web role. Update domain refers to a set of, in this case, Microsoft Azure instances to which platform updates are concurrently applied. Virtual machine we know; VNet is a virtual private network; and web role and worker role are also known. So these are some additional definitions which will
- 27:30 - 28:00 be utilized for SLA calculations. Similarly to what we have seen in our example, here you can see the monthly uptime calculation and service levels for cloud services using those definitions. Maximum available minutes is the total accumulated minutes during a billing period for all internet-facing roles that have two or more instances deployed in different update domains. Similarly, downtime is
- 28:00 - 28:30 how much time the service was down, and the monthly uptime percentage is maximum available minutes minus downtime, divided by maximum available minutes, like what we have calculated here. And there can be service credit rules, as we have seen: 99.95%, exactly the same type of values we have used. This is for the calculation of service levels for cloud services. Similarly, we can have it
- 28:30 - 29:00 for the VMs. Say I want infrastructure as a service, and virtual machines are allocated. Similarly, maximum available minutes is the total accumulated time in the billing period, and so on; downtime can be calculated similarly, and we can have the same type of service credits. What we mean to say is that it can be at different levels: it can be at the IaaS level, at the PaaS level, at any level. I can have it at the storage level, if there is storage downtime
- 29:00 - 29:30 or an accessibility problem, and so on. So this needs to be clearly specified. Now, when we want to do this type of thing, what are the different best practices or rules we need to follow? Let us see some of the best practices that are followed.
- 29:30 - 30:00 The Cloud Standards Customer Council provides cloud consumers with seven steps that they should take when evaluating SLAs; this is provided in their document of April 2012. First, identify the cloud actors. Who are the actors? According to the NIST architecture, the actors are consumer, provider, carrier,
- 30:00 - 30:30 broker, and auditor. So these are the five actors as per NIST. Then we need to evaluate the business-level policies. This is important: what the data policies will be, SLA guarantees, the list of services not covered, excess usage, payment and penalty, subcontracted services, licensed software, and industry-specific standards. These are different
- 30:30 - 31:00 aspects of the thing. So while we are discussing simple SLAs, in actuality things are more complex: what the data policies should be, which services are covered and which are not, if there are subcontracted services what those policies should be, whether you are using licensed software, the licensing mechanisms, and so on. Because in most cases when we use these services we may be using different licensed software, and those
- 31:00 - 31:30 licensing costs come into play, and not only that, the licensing period and other things come in too. Then we need to understand which level of operation we are looking at, SaaS, PaaS, or IaaS, because different types of services have different types of requirements. In some cases IaaS is much easier to control or measure, maybe, but we need to look at which type of services
- 31:30 - 32:00 we are leveraging, and whether we are having multiple services of this sort. So we need to understand what SaaS, PaaS, and IaaS are about, and which type of cloud it is running on, whether public, private, or hybrid. The terms and conditions in an SLA depend on the complexity of the control variables that the provider gives to the service consumer. The consumer needs to calculate the availability and so on, and for that the control
- 32:00 - 32:30 variables are provided by the provider; say it gives me the CPU uptime, or different usage times, or disk usage parameters. These are the different control parameters provided. The complexity of these parameters depends on which level of operation you are at and where you are running things, whether IaaS, PaaS, or SaaS, and whether it is hybrid, public, or private. The
- 32:30 - 33:00 other thing is the metrics, which we were discussing: what metrics should be used to achieve the performance objectives. Some examples are availability and response metrics, that is, the metric name in the SLA, like availability and other types of things. There may be other constraints, and the frequency of collection of this data is also important. The next aspect is security: consider the key security
- 33:00 - 33:30 parameters for the cloud, including asset sensitivity and legal and regulatory requirements; I may say, for instance, that the data should reside within a particular geographical boundary. Then the cloud provider's security capabilities: what is the capability of the cloud provider to provide that. Then we need to identify the service management requirements: what should be monitored and
- 33:30 - 34:00 reported, for example load performance and application performance; what should be metered; how rapid provisioning should be, like speed, testing, and demand flexibility; and how resource changes should be managed. So how the provisioning is done and what needs to be monitored, reported, and metered need to be looked into. And then prepare for and manage failure; that is another important aspect. Determine
- 34:00 - 34:30 what remedies should be provided, for example service credits, and what the liability limitations are: how much service credit to provide, what my liabilities are from the provider's point of view, and, in line with that, what the consumer is signing up for; and how the disaster recovery plan will work when it is needed. And an exit clause should be a part of every cloud SLA, in case either the consumer or the provider wants to terminate the relationship.
- 34:30 - 35:00 So an SLA is an agreement, so what should the exit clause be? Suppose the consumer wants to exit at some point of time, or the provider says, I am not able to provide that thing; that should be in the SLA. So these are some of the essential best practices we should keep in mind when formulating an SLA: identify the cloud actors, evaluate business-level policies, understand
- 35:00 - 35:30 the type of services, what the different metrics are, the security requirements of the consumer and the security capability of the provider, the service management requirements, and how to manage failure or what the remedies for failure should be. So what we tried here is an extension of the SLA discussion we already had.
- 35:30 - 36:00 What we tried to show is that there are different aspects to these things, and we have also seen two simple SLA-related problems and how this type of SLA is calculated. Though the problems are very simple and straightforward, they give us an idea of how you can approach these things. So let us conclude here for this SLA tutorial. Thank you.
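The time-varying availability requirement from the banking example (around 09:30 to 11:00 in the transcript) could be represented as a small lookup table. This is only a sketch under stated assumptions: the window boundaries follow one reading of the spoken example (99.9% from 09:00 to 17:00, 99% from 17:00 to 19:00, 97% from 19:00 to 08:00 the next day, and 99% from 08:00 to 09:00), and the `Window` class and `target_for_hour` function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    start_hour: int        # inclusive, 0-23
    end_hour: int          # exclusive; wraps past midnight when end <= start
    target_percent: float  # availability required during this window

    def contains(self, hour: int) -> bool:
        if self.start_hour < self.end_hour:
            return self.start_hour <= hour < self.end_hour
        # Window wraps around midnight, e.g. 19:00 to 08:00 next day.
        return hour >= self.start_hour or hour < self.end_hour


# Assumed reading of the banking example in the lecture.
WINDOWS = [
    Window(9, 17, 99.9),   # peak hours
    Window(17, 19, 99.0),
    Window(19, 8, 97.0),   # overnight, wraps past midnight
    Window(8, 9, 99.0),
]


def target_for_hour(hour: int) -> float:
    """Return the availability target that applies at the given hour."""
    for window in WINDOWS:
        if window.contains(hour):
            return window.target_percent
    raise ValueError(f"no SLA window covers hour {hour}")


for hour in (3, 8, 10, 18, 23):
    print(f"{hour:02d}:00 -> {target_for_hour(hour)}%")
```

With a table like this, measured availability could be checked per window rather than against a single monthly figure, which is the kind of more complex calculation the lecture alludes to.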