Lec 4: Performance Evaluation Methods


    Summary

    This lecture explores performance evaluation methods for computer systems, focusing on how to determine which of two architectures is faster or more efficient. It covers key metrics such as execution time, throughput, and response time, along with Amdahl's Law and benchmark suites such as SPEC. It emphasizes that the right metric depends on context (execution time for desktop PCs, throughput for servers) and describes how the standard programs of SPEC 2006 are used to assess performance. Several worked examples show how these concepts apply in practice.

      Highlights

      • Execution time is essential for desktop PC performance evaluation 🏁.
      • Throughput is critical for server performance concerning transactions 📊.
      • Benchmarks provide standardized ways to compare computer systems 📈.
      • Amdahl's Law explains the limitations of performance enhancements ⚠️.
      • Metrics like IPC and SPEC ratio enable in-depth analysis 🔍.

      Key Takeaways

      • Understanding performance metrics is crucial for evaluating computer architectures 🖥️.
      • Execution time and throughput are key metrics for desktops and servers respectively 🕰️.
      • Benchmarks like SPEC 2006 help compare different computer architectures effectively 🏆.
      • Amdahl's Law aids in understanding the limits of performance improvements ⚖️.
      • Real-world examples and problems help in grasping these evaluation methods better 🤓.

      Overview

      In the world of computer architecture, determining which system outperforms another is an intricate task. This lecture dives into performance evaluation methods by breaking down various metrics such as execution time and throughput that help us understand which architecture is superior. The lecture contrasts these metrics' applicability in desktop PCs versus larger server environments, indicating the nuances involved in making these assessments.

        A crucial part of performance evaluation is using benchmarks. Benchmark suites like SPEC 2006 provide a structured way to compare performance across machines. They show how computers handle specific tasks, with metrics such as IPC and SPEC ratio offering detailed insights.

          The lecture further illustrates these concepts using Amdahl’s Law, which underscores the idea that overall system performance gains are constrained by the fraction of the system that can be improved. With practical examples and solved problems, the lecture makes these abstract concepts more tangible, providing a solid foundation in understanding performance evaluations in computing.

            Chapters

            • 00:00 - 01:00: Introduction to Performance Evaluation Methods The chapter opens the lecture with a greeting and introduces its topic: performance evaluation methods for computer systems.
            • 01:00 - 03:00: Metrics for Evaluating Computer Systems The chapter 'Metrics for Evaluating Computer Systems' focuses on evaluating and comparing computer architectures. It introduces the importance of analyzing and determining performance metrics to decide which system, between two options A and B, is faster or offers higher throughput. The discussion implies that understanding different architectures, such as RISC and instruction encoding, precedes evaluation. The primary goal is to use specific methods to assess the performance of distinct computer systems.
            • 03:00 - 06:00: Desktop PCs vs. Servers Performance Evaluation The chapter delves into the comparison of desktop PCs and servers in terms of performance evaluation. It discusses various quantitative metrics that are critical in assessing performance, such as execution time, throughput, fairness, and starvation. The essence of the chapter is understanding which metric is most relevant depending on the category of computers being analyzed, as not all metrics will have the same level of importance or applicability for both desktops and servers.
            • 06:00 - 09:00: Calculating Speedup and Throughput This chapter introduces different classes of metrics and laws used to evaluate computer performance. It discusses fundamental parameters for comparing computers and addresses the common question: when can we say that one computer is better than another?
            • 09:00 - 12:00: Understanding Response Time, CPU Time, and Wall Clock Time This chapter explores performance metrics related to computer systems, focusing on execution time, which is a key metric for assessing a program's performance on desktop PCs. It discusses a practical approach by comparing execution times of a single program on different desktop PCs to understand performance differences.
            • 12:00 - 15:00: Introduction to Benchmarks and Benchmark Suites The chapter introduces the concept of benchmarks and benchmark suites. It discusses different metrics used to evaluate performance, such as execution time for desktops and transactions per unit time for servers. An example of railway booking through the IRCTC website or mobile app is given to illustrate these metrics.
            • 15:00 - 18:00: Importance of Instructions Per Cycle (IPC) and Instruction Profiling This chapter discusses the process of making a railway booking from a customer’s perspective via IRCTC. Whether using the website or mobile app, the customer's request is sent to a central server, known as the IRCTC server. This server manages booking information for various trains, processing the booking requests accordingly.
            • 18:00 - 21:00: Data Cache Misses and Branch Mispredictions The chapter discusses the complexities involved in the booking process on a server. It outlines the various steps required when handling multiple booking requests, including checking data availability, offering a booking or alternatives, managing waiting lists, processing payments, and then confirming tickets. The focus is on the server's ability to efficiently manage multiple booking requests from various sources simultaneously.
            • 21:00 - 24:00: Spec Ratio Explained The chapter discusses the concept of "throughput," which refers to the number of transactions a server can handle in a given time unit. The focus is on understanding that the execution time of a single program is just a part of the broader picture, as servers receive requests from multiple machines. The performance measure is not only about how fast a program runs, but also how efficiently the server can handle multiple requests concurrently. The text gives an example of a common desktop PC running a particular program to illustrate these points.
            • 24:00 - 31:00: Amdahl's Law and Its Application in Performance Optimization This chapter delves into Amdahl's Law and its role in performance optimization. It discusses key performance metrics such as execution time for individual machines and transaction throughput for servers. The chapter explains how machine performance comparisons can be quantified by execution times, illustrating that if machine X is faster than machine Y, then the execution time on machine Y will exceed that of machine X.
            • 31:00 - 36:00: Parallel Processing and Limitations The chapter discusses comparing the performance of two machines, X and Y, through execution time and throughput. It explains that dividing the execution time of Y by the execution time of X gives the number of times X is faster than Y. Alternatively, comparing the throughputs of the two machines also provides a measure of relative speed: if the throughput of X divided by the throughput of Y equals n, then X is n times faster than Y.
            • 36:00 - 42:00: Design Example: Graphics Processors and Instructions Optimization The chapter begins with a discussion about the performance comparison between X and Y, noting that X is n times faster than Y. It aims to delve into additional metrics used in evaluating performance. The first metric discussed is 'response time', which is the duration taken by a second machine to respond to a request from a first machine. Another key metric introduced is 'throughput', which refers to the number of tasks completed in a unit of time.
            • 42:00 - 53:00: Principles of Computer Design: CPU Time and CPI The chapter explores the concept of CPU time in computer design, explaining that it represents the total time for a program's execution by the CPU. It differentiates between CPU burst, where the program is entirely managed by the CPU, and peripheral operations, where CPU intervention isn't needed. CPU time is the portion dedicated exclusively to CPU tasks, distinct from wall clock time, which measures the entire execution duration.
            • 53:00 - 62:00: Comparative Analysis using Amdahl’s Law The chapter titled 'Comparative Analysis using Amdahl’s Law' discusses the components of a program's execution time, which includes both CPU time and non-CPU time such as peripheral operations. It introduces the concept of 'speed up' as a metric for performance, emphasizing the importance of execution time comparison between different scenarios. The discussion revolves around understanding metrics for assessing performance improvements.
            • 62:00 - 69:30: Recap and Conclusion In the chapter titled 'Recap and Conclusion,' there is a focus on comparing two hardware systems, labeled as A and B, with an emphasis on determining what makes one faster than the other. The idea is to judge the speed by running a program, presumably offering insight into performance differences. The chapter appears to discuss the criteria for assessing hardware speed and potential speed-up when executing a program on different systems.

            Lec 4: Performance Evaluation Methods Transcription

            • 00:00 - 00:30 Welcome to lecture 1D of the course. In this lecture we are talking about performance
            • 00:30 - 01:00 evaluation methods. We have already learned about different types of architectures, like the RISC and CISC architectures, instruction encoding, and so on. Now our focus is slightly different. Let us say you have two architectures, A and B, and I want to evaluate which is faster or which has more throughput. So I want to evaluate two different computer systems. How do you evaluate two computer systems? In that case I have to use
            • 01:00 - 01:30 certain metrics, some quantitative metrics. Sometimes I will run the same program on both computers A and B and look at the execution time. In other cases I may be more interested in throughput. Sometimes fairness may be used, and sometimes we may look at how much starvation happens. So these metrics are really important, but we have to understand which metric makes sense where: for one category of computers one metric may be useful, and in
            • 01:30 - 02:00 some other category another metric may be useful. So in this lecture we will introduce different classes of metrics and some laws, and then try to understand the basic parameters by which we can judge computers. With this background, let us go into the details of today's lecture. When can we say that one computer is better than another? It is a very common question
            • 02:00 - 02:30 that we face in our day-to-day life while working with computers. Let us try to understand this problem. The first case is the desktop PC. In the context of a desktop PC, the execution time of a program is the best metric for assessing performance. So given two desktop PCs, we run a program on the first PC and find its execution time, then run the same program on the second PC
            • 02:30 - 03:00 and find its execution time, and the lower of the two tells you which is the faster desktop. But if you go to servers, rather than execution time, a metric known as transactions per unit time is more relevant. Let us consider this with the help of an example. You might all have done railway booking through the website of IRCTC or even with the mobile app
            • 03:00 - 03:30 of IRCTC. So what happens from a customer's perspective? When we want to make a railway booking, we go to the website or the mobile app and submit our request, and this request goes to a central server where the booking information for the various trains is available. Let us call this server the IRCTC server. So the request for booking reaches this particular server,
            • 03:30 - 04:00 and the server has to accept these requests, look into the data that is available, then either give the booking, offer various options, or put the request on a waiting list, and then accept payment and confirm the tickets. These are the various sub-processes associated with booking a ticket. We have to understand that multiple such requests are going to come from various places to this particular server. So for machines like servers, it is not
            • 04:00 - 04:30 the execution time of a program that matters, because it is not just one program running on that server. We are getting different requests from different machines, and what matters is how fast we can provide the service. In this context the relevant metric is the transactions per unit time that a server can complete; it is also known as throughput. So we have seen that for a common desktop PC, when you run one particular program,
            • 04:30 - 05:00 its execution time is the predominant performance metric, whereas in the case of servers it is the number of transactions that the server can complete in unit time that makes more sense. When can we say that a machine X is n times faster than machine Y? The first parameter is execution time, as we said. If X is faster than Y, then the execution time on machine Y will be more than the execution time on machine X, and the
            • 05:00 - 05:30 ratio of the two tells you how much faster X is. So the execution time of Y divided by the execution time of X gives you n, and then we can say that X is n times faster than Y. But if, rather than execution time, you want to look at a metric called throughput, we know that if the throughput of X is more than the throughput of Y, then X is faster. So if the throughput of X divided by the throughput of Y is equal to n, then also we can say that
            • 05:30 - 06:00 X is n times faster than Y. Looking further, we will try to understand a few more metrics that are used. The first is called the response time: when a request comes from one machine to another, the time the second machine takes to respond to the request is called the response time. Throughput is the number of tasks completed in unit time.
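
As a rough illustration of the two "n times faster" ratios described above, here is a minimal Python sketch. The function names and the sample numbers are assumptions made for illustration; they are not taken from the lecture.

```python
def times_faster_by_time(exec_time_x: float, exec_time_y: float) -> float:
    """Return n such that X is n times faster than Y, from execution times.

    n = (execution time on Y) / (execution time on X)
    """
    if exec_time_x <= 0 or exec_time_y <= 0:
        raise ValueError("execution times must be positive")
    return exec_time_y / exec_time_x


def times_faster_by_throughput(throughput_x: float, throughput_y: float) -> float:
    """Return n such that X is n times faster than Y, from throughputs.

    n = (throughput of X) / (throughput of Y)
    """
    if throughput_x <= 0 or throughput_y <= 0:
        raise ValueError("throughputs must be positive")
    return throughput_x / throughput_y


if __name__ == "__main__":
    # Hypothetical desktop measurement: X takes 10 s, Y takes 15 s.
    print(times_faster_by_time(10.0, 15.0))
    # Hypothetical server measurement: X completes 300 transactions/s, Y 200.
    print(times_faster_by_throughput(300.0, 200.0))
```

Note that the two ratios are inverted relative to each other: a lower execution time is better, so Y's time goes in the numerator, while a higher throughput is better, so X's throughput goes in the numerator.
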
            • 06:00 - 06:30 CPU time is the total time a program spends in CPU execution. Generally programs have CPU bursts, in which the program executes completely under the control of the CPU, and peripheral operations, in which direct CPU intervention is no longer required because a peripheral task is being done. So the portion of time that is exclusively dedicated to the CPU is called CPU time. Then we have the wall clock time, which tells us the overall execution time from the
            • 06:30 - 07:00 beginning of the program till the end of the program, that is, how much time it took overall. It is the sum of the CPU time and the non-CPU time, which is the peripheral time and other related operations. Speedup is yet another metric that is used. We have already seen that if the execution time on one machine is larger than on the other, we get a speedup, and speedup is yet another measure that is used. Now, once we know these metrics, if you
            • 07:00 - 07:30 have two pieces of hardware to be compared, how will you compare them, and what kind of program do we run? Consider the case that you have two pieces of hardware; let us call one A and the other B. When can we say that one is faster than the other? When can we say that A is faster than B, or that there is a speedup if you run the problem on A? Let us say I am going to run a program called P, and the
            • 07:30 - 08:00 program P runs on A and program P runs on B. If the execution time for running P on A is smaller than for running P on B, then we say A is faster, as far as that particular program is concerned. Will it be the general case? Can we generalize that A is always faster than B? Not necessarily. Consider the case of another program Q. Let us say if you run
            • 08:00 - 08:30 Q on A and B, this time you may not get the same kind of speedup: sometimes B may be faster, or A may not show the same kind of numbers. Let us run one more program, R; then also the execution time, or the ratio of execution times on A and B, may differ. So how can we conclude how much faster A is than B? Because the kind of performance improvement that you get, and the rate at
            • 08:30 - 09:00 which the execution time varies when you run on different architectures, is specific to a particular program. When the program changes, the execution time also changes. In this context, let us try to understand some standard programs known as benchmarks. These benchmarks are standard programs, so if we want to compare two architectures, we run these benchmark programs on both. Benchmark programs are basically of
            • 09:00 - 09:30 three types. The first type is toy programs: a simple sorting program, a searching program, or a tree traversal program. These are all simple programs that we write on a computer as basic exercises. You can always compare two architectures by running toy programs. Or we can use a special category of programs known as benchmarks, specifically artificial benchmarks: they don't represent any specific task
            • 09:30 - 10:00 in real life but are a combination of instructions that can test all the hardware units of a computer. And the most commonly used ones are called benchmark suites, something like the SPEC and SPLASH benchmarks. These are commonly accepted, large programs that are generally used in the architecture community for assessing the performance of hardware. The next slide will give you some idea of what these benchmark
            • 10:00 - 10:30 suites are. Consider this: this is the SPEC 2006 benchmark suite. It consists of 12 integer applications and 17 floating-point applications. You can see that some are compression programs, some are based on a C compiler, some are fluid dynamics, some are linear programming optimizations, and some are speech recognition programs. So if you look at this on the SPEC website, we can see the range of real-life
            • 10:30 - 11:00 applications that have been used. So generally, if you want to compare the performance of two machines, it is always advisable to run the SPEC benchmark programs on machine 1 and the same benchmark programs on machine 2, and find the difference in execution time. The difference in execution time for one benchmark may not be the same as for a different benchmark on these two machines, so taking a geometric mean across all of them will give you a
            • 11:00 - 11:30 normalized answer. Let us now try to understand what you are going to evaluate with the help of these benchmarks. There are certain metrics that are very important when you assess a hardware architecture. Programs called simulators are generally used in the architecture community to assess the performance of hardware, and these simulators also help us understand more deeply the basic sub-operations that are happening
            • 11:30 - 12:00 at the hardware level. One of the important metrics that governs the performance of applications is IPC. When you run a particular program, the number of instructions that you can complete in a cycle is known as instructions per cycle, abbreviated as IPC. When you run certain benchmarks, you can complete more instructions per cycle because of effective utilization of the available hardware. But for some other benchmarks, depending upon the inherent dependencies, there
            • 12:00 - 12:30 may be some delay in running the subsequent instruction because you have to wait for the previous instruction to finish, so the IPC values may not be that high. Here, for the various SPEC benchmarks on the x-axis, the y-axis plots the IPC value, the instructions completed per clock cycle. You can see that across benchmarks the IPC value is sometimes very low and sometimes very high. Another metric that we will use is known
            • 12:30 - 13:00 as instruction profiling. These are all SPEC 2006 benchmarks, and with the help of simulators we are trying to understand the split-up of the various categories of instructions. In this case we have considered load and store instructions and branch instructions, and the rest of the instructions are grouped together. So these bars show the fraction of load instructions
            • 13:00 - 13:30 in the entire instruction mix, and likewise of store instructions, branches, and the other category of instructions. That is basically used to find the split-up of the instructions. Why is this split-up important? Consider that this yellow portion is branches: there are certain benchmarks that have very few branches, and certain benchmarks that have a very high rate of branches. So if we are trying to improve how
            • 13:30 - 14:00 branching is done (we will learn later in this course about a technique called branch prediction, which helps us predict whether a branch will be taken or not), then if you use an advanced branch predictor, you get better performance whenever there is a branch instruction. But this performance gain is limited by the number of branch instructions in a given instruction mix. So here we have seen that there are
            • 14:00 - 14:30 certain benchmarks that consist of a heavy number of branches and certain benchmarks that have a limited number of branches. When we use an advanced branch predictor, the benchmarks that have a rich mix of branch instructions will obviously get the benefit. Think of a case where we have a sophisticated branch predictor circuit in the hardware and the program that you run does not have enough branches: then the performance gain that we get is very minimal.
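
The instruction profiling described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the input format (a list with one category label per executed instruction) is hypothetical, real simulators report instruction mixes in their own formats, and the ten-instruction trace is made up.

```python
from collections import Counter


def instruction_mix(trace):
    """Return the fraction of each instruction category in a trace."""
    counts = Counter(trace)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("trace must contain at least one instruction")
    return {category: count / total for category, count in counts.items()}


# Hypothetical 10-instruction trace, made up purely for illustration.
trace = ["load"] * 3 + ["store"] * 1 + ["branch"] * 2 + ["other"] * 4
mix = instruction_mix(trace)

for category, fraction in mix.items():
    print(f"{category}: {fraction:.0%}")
```

A benchmark whose branch fraction is small gives a branch predictor few instructions to help with, which is the point made in the transcript.
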
            • 14:30 - 15:00 So understanding the split-up of instructions also gives the designer a better picture of whether any kind of optimization done on the hardware will be impactful or not. Moving forward, let us try to understand yet another metric, the number of data cache misses per thousand instructions. MPKI, misses per kilo-instruction,
            • 15:00 - 15:30 is a metric that is commonly used: when you run a thousand instructions, a kilo-instruction, how many misses are there in the cache? We all know that when you fetch an instruction, the fetch happens from the cache, and when you have a data access like a load or a store operation, we access the data cache. So the graph that we are going to discuss now shows the data cache misses per thousand instructions: when you run 1000
            • 15:30 - 16:00 instructions, how many of them resulted in a miss in the data cache. You can see that for an application called mcf the number of data cache misses is very high: it had roughly 156 data cache misses per thousand instructions. There are certain other applications that had around 50 data cache misses, but some applications, like gobmk and hmmer, have fewer than 10 data cache misses,
            • 16:00 - 16:30 and that is a significant number that helps us understand the memory profile of these benchmarks. So when we use a good caching mechanism and a good replacement algorithm, the benchmarks that suffered heavy cache misses will be able to reduce their cache misses. If an application already has very few cache misses, let us say fewer than 10 cache misses for every thousand
            • 16:30 - 17:00 instructions, then the improvement that we get by optimizing the cache may not be much: because we already have very few cache misses, the scope for further improvement is small. So these kinds of observations, taken from real benchmark suites running on simulators, give the designer a bigger picture of the scope for improvement and of how far a given system can be optimized.
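
Both IPC and MPKI are simple ratios over counters that a simulator reports. The sketch below shows the arithmetic; the function names are assumptions, and the instruction and cycle counts are hypothetical. Only the figure of roughly 156 misses per thousand instructions for mcf comes from the lecture.

```python
def ipc(instructions: int, cycles: int) -> float:
    """Instructions per cycle: instructions completed divided by clock cycles."""
    if cycles <= 0:
        raise ValueError("cycles must be positive")
    return instructions / cycles


def mpki(misses: int, instructions: int) -> float:
    """Misses per kilo-instruction: misses per 1000 executed instructions."""
    if instructions <= 0:
        raise ValueError("instructions must be positive")
    return misses * 1000 / instructions


if __name__ == "__main__":
    # Hypothetical run: 1,000,000 instructions completed in 2,000,000 cycles.
    print(ipc(1_000_000, 2_000_000))
    # 156,000 data cache misses over 1,000,000 instructions gives the
    # mcf-like value of 156 MPKI quoted in the lecture.
    print(mpki(156_000, 1_000_000))
```

The same `mpki` function applies to branch mispredictions per thousand instructions if the miss count is replaced by the misprediction count.
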
            • 17:00 - 17:30 Moving further, let us try to understand one more metric, known as branch mispredictions per thousand instructions. You can see that there are a lot of benchmarks where branch mispredictions are few. As I already mentioned, we have branch predictor circuits, and these circuits help us predict whether a branch will be taken or not before the actual execution of the branch. If the prediction is correct, then the next instruction fetched will be the correct one, either from the fall-through path or from the target, as
            • 17:30 - 18:00 predicted by the branch predictor. But we can see that there are certain benchmarks where the predictors are not giving good results; they have a very high number of mispredictions. This also helps us see whether a branch predictor is going to help a particular program or not. We will now try to understand a parameter known as the SPEC ratio. The SPEC ratio of a machine A is defined as the execution time of a program on a reference machine divided by the execution
            • 18:00 - 18:30 time on A. Consider the case that I have a laptop and I want to know the SPEC ratio of my laptop, or rather of the specific processor running in the laptop. First we take a program and run it on a reference machine. The reference machine for SPEC 2006 is defined as a Sun Ultra Enterprise 2 workstation with a 296 MHz UltraSPARC II processor.
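
The SPEC ratio definition above, together with the geometric mean mentioned earlier for combining results across benchmarks, can be sketched as follows. The function names and all timing values are hypothetical and chosen only to make the arithmetic easy to follow; they are not real SPEC measurements.

```python
import math


def spec_ratio(reference_time: float, machine_time: float) -> float:
    """Execution time on the reference machine divided by time on the machine."""
    if reference_time <= 0 or machine_time <= 0:
        raise ValueError("execution times must be positive")
    return reference_time / machine_time


def geometric_mean(values) -> float:
    """Geometric mean of positive numbers, computed via logarithms."""
    values = list(values)
    if not values or any(v <= 0 for v in values):
        raise ValueError("need at least one value, all positive")
    return math.exp(sum(math.log(v) for v in values) / len(values))


if __name__ == "__main__":
    # Hypothetical (reference time, machine time) pairs in seconds,
    # one pair per benchmark.
    timings = [(1000.0, 500.0), (2400.0, 300.0)]
    ratios = [spec_ratio(ref, mine) for ref, mine in timings]
    print(ratios)
    print(geometric_mean(ratios))
```

A ratio above 1 means the machine under test finishes that benchmark sooner than the reference machine. The geometric mean is used rather than the arithmetic mean because the ratio of two machines' geometric means equals the geometric mean of their per-benchmark ratios, so the comparison does not depend on the choice of reference machine.
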
            • 18:30 - 19:00 So run a program on this reference machine and run the same program on your machine A, in this case the processor inside the laptop, and the ratio of the two times gives you the SPEC ratio. A SPEC ratio larger than one means your machine takes less time than the reference machine to run the program. We can also find the ratio of SPEC ratios. Let us say I want to know the SPEC ratio of A over the SPEC ratio of B. In that case we take
            • 19:00 - 19:30 the execution time on the reference machine divided by the execution time on A, the whole divided by the execution time on the reference machine divided by the execution time on B. Since the execution time on the reference machine is common, we get the execution time on B divided by the execution time on A, or the performance of A divided by the performance of B; that is nothing but the SPEC ratio of A divided by the SPEC ratio of B. Now, when you have different benchmarks, let us say A1 is the SPEC ratio for one benchmark, A2 for another, A3 for
            • 19:30 - 20:00 another, and so on up to An, then taking the geometric mean of these gives you the average SPEC ratio. So you take the geometric mean of the individual SPEC ratios, and that is the SPEC ratio associated with that particular hardware. The next concept we are going to learn today is Amdahl's law. The question we have seen so far is this: if you want to improve the performance of a machine
            • 20:00 - 20:30 by focusing on certain specific hardware, how much return are we going to get? Amdahl's law clearly specifies the return that we get if we modify any part of the hardware. Amdahl's law defines the speedup that can be gained by improving some portion of the computer: the performance improvement to be gained from using some faster mode of execution is limited by the fraction of time the faster mode can be used.
            • 20:30 - 21:00 So let us say I am going to improve one hardware unit by making some modification to it. The new execution time is defined as the old execution time multiplied by the sum of (1 minus fraction enhanced) and (fraction enhanced divided by speedup enhanced), where fraction enhanced is the fraction of the execution that benefits from the enhancement. Or, if you work out the speedup, defined as the old execution time divided by the new execution time, it is 1
            • 21:00 - 21:30 divided by the sum of (1 minus fraction enhanced) and (fraction enhanced divided by speedup enhanced). So think of a case where 10 percent of the instructions benefit from an advanced hardware unit and the speedup that you get for them is 5 times. That means the overall speedup is 1 divided by the sum of 1 minus 0.1, which is the portion not affected by these
            • 21:30 - 22:00 instructions, and 0.1 divided by 5, because this 10 percent of the instructions gets a speedup of 5. That is how it is computed. Now think of the case where, rather than 10 percent, 100 percent of the instructions benefit. Then 1 minus fraction enhanced becomes 1 minus 1, since it is 100 percent, so this term is essentially zero, and the other term, fraction enhanced divided by speedup enhanced, is going to be
            • 22:00 - 22:30 1 divided by 5, if, let us say, 5 times is the speedup that you get. So the overall speedup will be 5. What if none of the instructions benefit? Then the fraction-enhanced term becomes zero and the other term becomes 1, so the overall speedup is 1, that is, you get the same execution time. So depending upon the fraction enhanced, we find the overall speedup that you get. So Amdahl's law gives the general picture that the
            • 22:30 - 23:00 amount of speedup that you can get is restricted by the fraction enhanced. This is very important: the speedup is restricted by the fraction enhanced. We will now take a simple illustrative example to understand Amdahl's law. Suppose that we want to enhance the floating-point operations of a processor by introducing a new, advanced floating-point unit. Let the new floating-point unit be 10
            • 23:00 - 23:30 times faster for floating-point computations than the original processor. Assume a program has 40 percent floating-point operations. What is the overall speedup you are going to gain? For all floating-point operations we get a 10 times speedup, but floating-point operations are only 40 percent. So what is the overall speedup? The overall speedup is 1 divided by the sum of (1 minus fraction enhanced) and (fraction
            • 23:30 - 24:00 enhanced divided by speedup enhanced). The fraction enhanced is 0.4 and the speedup enhanced is 10, so when you substitute into the equation you get 1.56 times. So even though floating-point operations are sped up by 10 times, since they are limited to only 40 percent of the instructions, the overall speedup that you get is only 1.56.
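
The Amdahl's law formula stated above, overall speedup = 1 / ((1 - fraction enhanced) + fraction enhanced / speedup enhanced), can be checked with a short Python sketch. The function name and argument names are assumptions; the input values are the ones used in the lecture's examples.

```python
def amdahl_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
    """Overall speedup when a fraction of execution is sped up by a given factor."""
    if not 0.0 <= fraction_enhanced <= 1.0:
        raise ValueError("fraction_enhanced must be between 0 and 1")
    if speedup_enhanced <= 0:
        raise ValueError("speedup_enhanced must be positive")
    unaffected = 1.0 - fraction_enhanced
    enhanced = fraction_enhanced / speedup_enhanced
    return 1.0 / (unaffected + enhanced)


if __name__ == "__main__":
    # Floating-point example: 40% of operations, 10 times faster unit.
    print(round(amdahl_speedup(0.4, 10), 4))
    # 10% of instructions sped up 5 times.
    print(round(amdahl_speedup(0.1, 5), 4))
    # Boundary cases: everything enhanced, and nothing enhanced.
    print(amdahl_speedup(1.0, 5), amdahl_speedup(0.0, 5))
```

The floating-point case evaluates to 1 / 0.64 = 1.5625, which the lecture quotes as 1.56. The same function also models running the enhanced fraction on N parallel units by passing N as `speedup_enhanced`; as N grows very large, the result approaches 1 / (1 - fraction enhanced).
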
            • 24:00 - 24:30 Amdahl's law for parallel processing: consider the case where you have 500 steps of operation to be done, taking 500 units of time, so each step takes one unit of time. Wherever there is this blue color, these are the sections inside the task which can be done only by one unit, whereas wherever you see this white
            • 24:30 - 25:00 color, these are portions of the task which can be parallelized by having more parallel units. Let us now consider that you have two parallel units. This hundred can be done only by one unit, whereas this hundred can be done by two units in parallel, so it may take only 50 units of time. A hundred units of work will be completed in 50 units of time because we have two units working in parallel to do
            • 25:00 - 25:30 the work. After that there is again a sequential portion where only one unit can do the job, and then again a portion where parallelism exists, so that takes 50, and then you have one more hundred. So the total of 500 units of work is completed in 400 units of time, and the speedup you get is 1.25x. Consider the case where, rather than 2 units doing the job, I now have four
            • 25:30 - 26:00 units doing the job. Each hundred units of work which can be parallelized can now be completed in 25 units of time. If you proceed like this, the same 500 units of work will be over in 350 units of time, so the speedup is about 1.43 times. Let us assume an infinite number of processors can help. Even though you have an infinite number of processors, these three hundreds will still be there, whereas the portion that has
            • 26:00 - 26:30 been parallelized, we assume, will take approximately zero time because we have an infinite number of processors. So the total work is 500 and the time is going to be 300, so the speedup that you get is about 1.67x. We have used an infinite number of processors, but even then the speedup you get is only about 1.67x. This is the graph which shows how much performance or speedup you can get based upon the fraction of
            • 26:30 - 27:00 instructions on which the modification has an impact. Speedup is defined as 1 / ((1 - alpha) + alpha / n), where alpha is the parallelizable fraction and n is the number of processors. If alpha is 50 percent, that means 50 percent of the code is parallelizable; on this graph that is the blue line, which is almost parallel to the x axis. On the x axis we are plotting the number of processors, ranging from one processor to over sixty-five thousand processors, and the y axis
            • 27:00 - 27:30 is plotting the speedup. You can see that for the blue line the speedup increases as the number of processors increases, but beyond a point, even if you increase the number of processors, you are not going to get any benefit. Similarly, look at the case where parallelism is applicable to 75 percent; that is the red line. Compared to the blue line, the red line has more speedup, but there also we can see that beyond a certain number of processors, even if you increase the number of processors, you are not going to get much benefit in terms of speedup,
            • 27:30 - 28:00 whereas for the green line, where ninety percent of the code is parallelizable, the speedup is again higher and it reaches larger values, but again beyond a point we are not going to get any benefit in speedup. Similarly, for 95 percent we can see that the speedup is high, but beyond a point we cannot do much. So this restricts us: even with 95 percent of the code being
            • 28:00 - 28:30 parallelizable, anything more than 256 processors is not going to give us much reward. This gives a deeper intuition, from Amdahl's law, about the performance that we can gain. Now consider another design example. A common transformation required in graphics processors is square root. Implementations of floating-point square root vary significantly in performance, especially among processors designed for graphics. Suppose the floating-point square root operation is responsible for twenty
            • 28:30 - 29:00 percent of the execution time of a critical graphics benchmark. One proposal is to enhance the floating-point square root hardware and speed up the floating-point square root operation by a factor of 10. The other alternative is just to try to make all floating-point instructions in the graphics processor run faster by a factor of 1.6. Floating-point operations are generally responsible for half of the execution
            • 29:00 - 29:30 time of an application. Compare these two designs using Amdahl's law. So two modifications have been suggested. One proposes new floating-point square root hardware, which will speed up only floating-point square root operations, by a factor of 10. The other option is that all floating-point instructions, not only floating-point square root but any floating-point instruction, get a speedup of 1.6, that is, by making the graphics processor
            • 29:30 - 30:00 work faster. Let us now try to find the solution. Case A is the floating-point square root hardware optimization, whereas case B is the floating-point instruction optimization. When you apply this in the equation, only 20 percent of the execution time is floating-point square root operations, so if you put that twenty percent here, and these get a 10 times speedup, the overall speedup that you get is about 1.22 times, whereas in case B we know that 50
            • 30:00 - 30:30 percent of the execution time is floating-point operations, and this fifty percent will get a speedup of only 1.6, so overall you get about 1.23 times, which is faster than case A. So this shows that I as a designer have two options: either go for an advanced floating-point square root hardware, or go for a better graphics processor which will improve all the floating-point operations.
            • 30:30 - 31:00 Even though the speedup there is limited to 1.6 times, overall that option benefits more because it impacts 50 percent of the execution time, whereas the hardware optimization on floating-point square root impacts only 20 percent. So with this we are able to see that Amdahl's law helps us to compare two designs. Moving further into the principles of computer design, we know that every
            • 31:00 - 31:30 processor is triggered or run with the help of a clock, and it is the clock cycle that is considered the basic step by which you can do an operation inside the processor. Generally the clock rate is expressed in gigahertz, so one clock period comes to the order of nanoseconds. Now, CPU time, that is, the execution time of a program, is defined as the number of cycles needed to complete the task multiplied by the clock
            • 31:30 - 32:00 cycle time. CPI, that is, clock cycles per instruction, is defined as the CPU clock cycles for a program divided by the instruction count. So the total CPU time taken to complete a program depends on the number of instructions, that is, the instruction count, times the cycles required per instruction, times the length of one cycle, that is, the clock cycle time. Instruction count tells the number of
            • 32:00 - 32:30 instructions per program, CPI tells the clock cycles per instruction, and CCT tells the number of seconds needed for a clock cycle. Clock cycle time is governed by the hardware technology, that is, what crystal oscillator you are using. CPI is governed by the organization of the hardware and the instruction set architecture that you use.
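The relations just described can be written compactly as follows. The abbreviations IC (instruction count) and CCT (clock cycle time) follow the lecture; the clock rate relation and the 1 GHz figure are added here only as a standard illustration of the units.

```latex
\[
\text{CPU time} = \text{CPU clock cycles} \times \text{CCT},
\qquad
\text{CPI} = \frac{\text{CPU clock cycles}}{\text{IC}}
\]
\[
\text{CPU time} = \text{IC} \times \text{CPI} \times \text{CCT},
\qquad
\text{CCT} = \frac{1}{\text{clock rate}}
\]
\[
\text{For example, a clock rate of } 1\,\text{GHz gives } \text{CCT} = \frac{1}{10^{9}\,\text{Hz}} = 1\,\text{ns}.
\]
```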
            • 32:30 - 33:00 So if the instruction set architecture follows a RISC design, the CPI will be lower; if it is a CISC design, then the CPI is going to be slightly larger. The instruction count depends on compiler technology as well as the instruction set architecture. So when you consider how much time is needed to complete a task, the CPU time associated with a task is governed by what hardware technology you use, the internal organization of the processor, the instruction set
            • 33:00 - 33:30 architecture that you use, and the kind of compiler that translates these programs into machine language. But different instructions have different CPIs, so the clock cycles needed to complete a task on the CPU are computed as a summation of instruction count times CPI over the instruction types. So if you have an add operation, find the total number of instructions which perform add, multiplied by the average number
            • 33:30 - 34:00 of cycles needed to perform an add. So the CPI of the add operation times the instruction count of the add operation is the first component. Similarly, you take the CPI of the next instruction, let us say the subtraction operation, times the instruction count for the subtraction operation. Like that you do a summation across all the instructions, and the CPU time is defined as the total number of clock cycles needed to carry out the task times the clock cycle
            • 34:00 - 34:30 time. We have one more example which will help you to compare two designs which vary in CPI and clock. Consider two programs A and B that solve a given problem. So A and B are two different programs, both solving the same thing; we can consider A as one kind of sorting, let us say quicksort, and B as bubble sort. A is scheduled to run on processor P1, which is operating at one
            • 34:30 - 35:00 gigahertz, and B is scheduled to run on processor P2, running at 1.4 gigahertz. Program A has 10,000 instructions, out of which twenty percent are branch, forty percent are load-store and the rest are ALU instructions. B is composed of 25 percent branch instructions, and the number of load-store instructions in B is twice the count of ALU instructions. This statement, that the number of load-store instructions in B is twice the count of ALU instructions, means ALU will be 25 percent and
            • 35:00 - 35:30 load-store will be 50 percent, so the twice ratio has been maintained. The total instruction count of B is twelve thousand. In both P1 and P2, branch instructions have an average CPI of five and ALU instructions have an average CPI of 1.5. The two architectures differ in the CPI of load-store instructions, which is 2 and 3 for P1 and P2 respectively. Now the question is which mapping, A running on P1 or B running on
            • 35:30 - 36:00 P2, solves the problem faster, and by how much. Let us now try to understand what the given problem is. You have a problem to solve, and there are two approaches by which we can solve it: the first is called program A, the second is program B. Program A is supposed to run on a machine P1 which operates at one gigahertz, and program B is supposed to run on P2, which operates at 1.4 gigahertz.
            • 36:00 - 36:30 A consists of ten thousand instructions and B consists of twelve thousand instructions. The percentage split of branch instructions, load-store instructions and ALU instructions is given for program A, and similarly the percentage split of these three is given for program B as well. Given this context, which approach is going to solve our problem faster, A on P1 or B on P2? Let us try to summarize what are the
            • 36:30 - 37:00 details that have been given. A is running on P1 and B is running on P2. The corresponding clock speeds have been given. The instruction counts are ten thousand and twelve thousand. The fractions of branch versus load-store versus ALU instructions for A are 20 percent branch, forty percent load-store and forty percent ALU; similarly they are given for B on P2. The CPI values are also given accordingly; we know that they differ in CPI only for load-store instructions, which is 2 for the first machine and three for the
            • 37:00 - 37:30 second machine. Now we have to compute the CPI of program A on machine P1, so you weight each fraction by its corresponding CPI. You know that you have a CPI of five for branch instructions and there are twenty percent branches. Since there are twenty percent branches and each branch takes five cycles, it is 0.2 times five. You have forty percent load-store instructions, which have a CPI of two, and you have forty percent ALU
            • 37:30 - 38:00 instructions, which have a CPI of 1.5, so the average CPI is 2.4. Once you get the CPI, the execution time is CPI times instruction count times clock cycle time, so you get twenty-four thousand nanoseconds. Repeating the same thing for B on P2, we have 25 percent branch instructions, which take five cycles, fifty percent load-store
            • 38:00 - 38:30 instructions, which take three cycles, and the remaining 25 percent ALU instructions, which take 1.5 cycles, giving a CPI of 3.125. To compute the execution time, it is CPI times instruction count, that is, twelve thousand, times the clock cycle time, which is only about 0.714 nanoseconds because we are using a 1.4 gigahertz clock. That gives an execution time of about 26,775 nanoseconds. Program A, while running on P1,
            • 38:30 - 39:00 will take twenty-four thousand nanoseconds, whereas program B, while running on P2, will take about twenty-six thousand seven hundred and seventy-five nanoseconds. This comparison shows that A on P1 gives better performance because it takes less execution time. Hence A on P1 is faster, by a factor of roughly 1.12.
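The comparison of A on P1 against B on P2 can be reproduced with a short Python sketch. The helper names `average_cpi` and `execution_time_ns` are hypothetical, introduced only for this illustration; the instruction mixes, CPI values, instruction counts and clock rates are the ones given in the problem. Note that the sketch uses the exact cycle time 1/1.4 ns for P2, so its result for B (about 26,786 ns) differs slightly from the lecture's 26,775 ns, which comes from rounding the cycle time to 0.714 ns.

```python
def average_cpi(mix):
    """Weighted average CPI from (fraction_of_instructions, cpi) pairs."""
    return sum(fraction * cpi for fraction, cpi in mix)


def execution_time_ns(instruction_count, cpi, clock_ghz):
    """CPU time = instruction count x CPI x clock cycle time.

    With the clock rate in GHz, one cycle lasts 1 / clock_ghz nanoseconds,
    so the result is in nanoseconds.
    """
    return instruction_count * cpi / clock_ghz


# Program A on P1 (1 GHz): 20% branch, 40% load-store, 40% ALU.
cpi_a = average_cpi([(0.20, 5.0), (0.40, 2.0), (0.40, 1.5)])
time_a = execution_time_ns(10_000, cpi_a, 1.0)

# Program B on P2 (1.4 GHz): 25% branch, 50% load-store, 25% ALU.
cpi_b = average_cpi([(0.25, 5.0), (0.50, 3.0), (0.25, 1.5)])
time_b = execution_time_ns(12_000, cpi_b, 1.4)

print(f"A on P1: CPI = {cpi_a:.3f}, time = {time_a:.1f} ns")
print(f"B on P2: CPI = {cpi_b:.3f}, time = {time_b:.1f} ns")
print(f"A on P1 is faster by a factor of {time_b / time_a:.3f}")
```

Keeping the instruction mix as a list of pairs makes it easy to try other mixes or CPI values without changing the two helper functions.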
            • 39:00 - 39:30 We will now work out one more problem on Amdahl's law, which gives a fair comparison of how various optimizations done in parallel combine and what their impact is. A company is releasing two latest versions, beta and gamma, of its basic processor architecture named alpha. Beta and gamma are designed by making modifications to three major components X, Y and Z. It was observed that for program A the fractions of total execution time on these three components X, Y and Z are 40, 30 and 20 percent
            • 39:30 - 40:00 respectively. Beta speeds up X and Z by two times but slows down Y by 1.3 times, whereas gamma speeds up X, Y and Z by 1.2, 1.3 and 1.4 times respectively. How much faster is gamma over alpha, and which of beta or gamma is faster for running A? Find the speedup factor. Since there are three components,
            • 40:00 - 40:30 X, Y and Z that are going to be impacted by some factor, and it is different for beta and gamma, the speedup with Amdahl's law can be represented by this equation: 1 / (1 - fx - fy - fz + fx/nx + fy/ny + fz/nz). The fractions fx, fy and fz are the
            • 40:30 - 41:00 same for beta and gamma because they are a property of program A. If you run it on beta, the speedups are: nx is 2; ny is 1 divided by 1.3, because beta actually slows down component Y by a factor of 1.3, so its speedup is 1 divided by 1.3; and nz is equal to two. For gamma, nx is 1.2, ny is 1.3 and nz is 1.4. Now you substitute these values of fx,
            • 41:00 - 41:30 fy, fz, nx, ny and nz in the equation. The speedup of beta over alpha is going to be about 1.266 times, whereas the speedup of gamma over alpha is about 1.239 times. So gamma is 1.239 times faster than alpha, which answers the first question that is asked, whereas beta is faster than
            • 41:30 - 42:00 gamma because beta's speedup is higher than that of gamma, so beta is faster than gamma by about 1.021 times. So we come to the end of today's lecture. Let us have a quick recap of what we studied today. We started our discussion with why we need performance evaluation methods: when you have two designs and you want to find out which of the designs is
            • 42:00 - 42:30 superior to the other, we need to have certain metrics. We familiarized ourselves with execution time, throughput, response time and speedup; these are the various metrics that we learned today. Then we looked at the scope of Amdahl's law, which defines how much speedup you can get by making some optimization in one component of the hardware. We worked out a few numerical problems on the
            • 42:30 - 43:00 execution time of a program on a given hardware, which depends on the instruction count, the CPI and the clock cycle time, and then we had a few more illustrations of the application of Amdahl's law. That concludes today's lecture. There are many other numerical questions similar to these given in the textbooks that have been prescribed as part of the course. I request you to work through as many numerical problems as you can so that you will get a deeper
            • 43:00 - 43:30 understanding of the problems that we are discussing. Kindly post your queries, if any. Thank you.
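As a companion to the recap, the remaining Amdahl's law problems from the lecture (the parallel-units illustration, the floating-point square root design comparison, and the beta versus gamma comparison) can be checked with one small Python sketch. The function name `amdahl_multi` is a hypothetical helper for this illustration; it implements the multi-component form stated in the lecture, 1 / (1 - sum of fractions + sum of fraction/speedup), and a slowdown is passed as a speedup below 1. In the parallel illustration, 200 of the 500 units are parallelizable, so the fraction is 0.4.

```python
from typing import Iterable, Tuple


def amdahl_multi(components: Iterable[Tuple[float, float]]) -> float:
    """Amdahl's law for several enhanced components.

    Each entry is (fraction_of_execution_time, speedup_of_that_part).
    A slowdown by a factor k is expressed as a speedup of 1 / k.
    """
    parts = list(components)
    if any(speedup <= 0.0 for _, speedup in parts):
        raise ValueError("every speedup must be positive")
    untouched = 1.0 - sum(fraction for fraction, _ in parts)
    if untouched < -1e-12:
        raise ValueError("fractions must not add up to more than 1")
    return 1.0 / (untouched + sum(fraction / speedup for fraction, speedup in parts))


# Parallel illustration: 40 percent of the work is parallelizable.
two_units = amdahl_multi([(0.4, 2.0)])
four_units = amdahl_multi([(0.4, 4.0)])

# Design comparison: square root hardware (A) vs all floating point (B).
design_a = amdahl_multi([(0.2, 10.0)])
design_b = amdahl_multi([(0.5, 1.6)])

# Beta and gamma relative to alpha; beta slows Y down by 1.3.
beta = amdahl_multi([(0.4, 2.0), (0.3, 1.0 / 1.3), (0.2, 2.0)])
gamma = amdahl_multi([(0.4, 1.2), (0.3, 1.3), (0.2, 1.4)])

print(f"2 units: {two_units:.3f}, 4 units: {four_units:.3f}")
print(f"design A: {design_a:.3f}, design B: {design_b:.3f}")
print(f"beta: {beta:.3f}, gamma: {gamma:.3f}, beta over gamma: {beta / gamma:.3f}")
```

With a single (fraction, speedup) pair the function reduces to the basic form of Amdahl's law, which is why the same helper covers all three problems.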