Exploring VM Migration and Strategies
Lecture 46 "VM Migration - Basics Migration strategies"
Estimated read time: 1:20
Summary
In this lecture, the basics of VM migration and the strategies involved are explored. The need for VM migration, including load balancing, proactive fault tolerance, power management, resource sharing, and system maintenance, is discussed. Two live migration approaches, pre-copy and post-copy, are analyzed, focusing on how a VM is transferred from one physical server to another. The stages of migration, including memory transfer and handling of dirty pages, are covered along with an analytical model for estimating total migration time, downtime, and the number of iterations. The lecture also considers the migration of multiple VMs, comparing serial and parallel strategies.
Highlights
- VM migration is crucial for balancing load and maintaining systems. 💼
- Pre-copy and post-copy are essential strategies for efficient migration. 🔄
- Managing 'dirty pages' is vital to reduce migration delays. 📑
- Analytical models help in estimating migration times and iterations. 📈
- Serial vs. parallel migrations have their pros and cons in handling VMs. ⚙️
Key Takeaways
- Understand the importance of VM migration for load balancing and system maintenance. 🛠️
- Explore different VM migration strategies like pre-copy and post-copy. 🔄
- Learn about handling 'dirty pages' during migrations. 📄
- Analyze migration times and necessary iterations using mathematical models. 📊
- Compare serial and parallel VM migration strategies. ⚖️
Overview
VM migration is a critical process for optimizing server usage and maintaining efficient operations. It involves transferring a running application or virtual machine (VM) from one physical server to another, which requires handling the CPU state, active memory, network connectivity, and any active I/O.
Two strategies, pre-copy and post-copy, are used for live migration. Pre-copy uses an iterative push phase followed by a 'stop and copy' phase: memory is copied while the VM keeps running, and pages modified during a round ('dirty pages') are resent in later rounds until a termination threshold is reached. Post-copy instead transfers the processor state first, resumes the VM at the destination, and fetches memory on demand, which avoids sending the same pages several times.
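The pre-copy loop described above can be sketched as a toy simulation. This is a minimal sketch under stated assumptions, not the lecture's or any hypervisor's implementation: the function name, its parameters, and the constant dirty rate are all hypothetical. It applies the three termination conditions mentioned in the lecture (iteration limit reached, total data sent above a limit, dirty pages below a threshold).

```python
def simulate_pre_copy(vm_pages, dirty_rate, send_rate,
                      dirty_threshold, max_iterations, max_total_pages=None):
    """Toy pre-copy loop. Rates are in pages per second; all names are hypothetical."""
    to_send = vm_pages      # round 0: the whole memory is treated as dirty
    total_sent = 0
    rounds = 0
    while True:
        total_sent += to_send
        rounds += 1
        # Pages dirtied at the source while this round was in flight
        # (round time = to_send / send_rate), capped at the VM size.
        dirtied = min(vm_pages, dirty_rate * to_send // send_rate)
        too_many_rounds = rounds >= max_iterations
        too_much_sent = max_total_pages is not None and total_sent >= max_total_pages
        few_dirty_pages = dirtied <= dirty_threshold
        if too_many_rounds or too_much_sent or few_dirty_pages:
            break
        to_send = dirtied
    # Stop-and-copy: the VM is suspended and the remaining dirty pages are sent.
    downtime = dirtied / send_rate
    total_sent += dirtied
    return rounds, total_sent, downtime


rounds, sent, downtime = simulate_pre_copy(
    vm_pages=1000, dirty_rate=100, send_rate=1000,
    dirty_threshold=5, max_iterations=10)
print(rounds, sent, downtime)
```

With these illustrative numbers the loop converges after three push rounds, because each round leaves only a tenth as many dirty pages as the previous one. With a dirty rate close to the send rate, the iteration limit ends the loop instead.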
An analytical approach towards VM migrations helps in understanding total migration time, downtime, and iteration count. This becomes particularly useful when dealing with multiple VMs, where strategies can be undertaken either in a serial or parallel manner, each with its own set of advantages and challenges.
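The analytical model from the lecture can be written down in a few lines. The sketch below assumes, as the lecture does, a constant transmission rate R and a constant ratio r = P·D/R (page size times dirty page rate over transmission rate), so that round j takes r^j · V_m / R. The function names and the sample numbers are hypothetical, and rounding the number of rounds up to an integer is an added assumption.

```python
import math


def dirty_ratio(page_size, dirty_rate, rate):
    """r = P * D / R: memory dirtied per unit of memory transmitted."""
    return page_size * dirty_rate / rate


def rounds_needed(v_mem, v_th, r, n_max):
    """n = min(log_r(V_th / V_m), n_max), rounded up (rounding is an assumption)."""
    if not 0 < r < 1:
        raise ValueError("the model only converges for 0 < r < 1")
    if v_th >= v_mem:
        return 0
    n = math.ceil(math.log(v_th / v_mem) / math.log(r))
    return min(n, n_max)


def total_migration_time(v_mem, rate, r, n, t_res):
    """T_mig = (V_m / R) * (1 - r^(n+1)) / (1 - r) + T_res."""
    return (v_mem / rate) * (1 - r ** (n + 1)) / (1 - r) + t_res


def downtime(v_mem, rate, r, n, t_res):
    """T_down = r^n * V_m / R + T_res (the final stop-and-copy round plus restart)."""
    return (r ** n) * v_mem / rate + t_res


# Illustrative values only: 1024 MB VM, 100 MB/s link, 4 KB pages, 2500 dirty pages/s.
r = dirty_ratio(page_size=0.004, dirty_rate=2500.0, rate=100.0)
n = rounds_needed(v_mem=1024.0, v_th=1.0, r=r, n_max=30)
print(n, total_migration_time(1024.0, 100.0, r, n, 0.5), downtime(1024.0, 100.0, r, n, 0.5))
```

Setting n = 0 recovers the non-live case, where migration time and downtime are both V_m / R plus the restart time.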
Chapters
- 00:00 - 01:00: Introduction and Need for VM Migration
- 01:00 - 02:00: VM Migration Analysis and Strategies In this chapter, the focus is on analyzing and strategizing for VM (Virtual Machine) migration. The lecture involves a bit of an analysis to understand how to approach the problem of VM migration, possibly using mathematical modeling. The key concepts are to dissect the migration process, which involves moving active computational processes or virtual machines from one environment to another, ensuring minimal disruption and efficiency in operations.
- 02:00 - 03:00: Key Considerations in VM Migration This chapter focuses on the essential factors to consider when migrating virtual machines (VMs) from one physical server to another. Key considerations include managing the process state, often referred to as the CPU state, and ensuring that active input/output operations are handled properly. The discussion highlights memory, storage, and network connections as critical elements that must be addressed during the migration process.
- 03:00 - 04:00: Pre-Copy Approach in VM Migration The chapter discusses the Pre-Copy approach in Virtual Machine (VM) migration, focusing on the issues of active memory or main memory migration. It emphasizes the importance of networking continuity during migration. The text highlights that as the VM's connectivity shifts, careful attention must be paid to IP to MAC mappings and similar networking considerations to ensure seamless migration.
- 04:00 - 05:00: Post-Copy Migration Technique The chapter covers resolving ARP-related issues and handling active I/Os during migration, then revisits why VMs are migrated, chiefly load balancing and system maintenance. It returns to topics from the previous lecture, presenting familiar concepts with slight variations.
- 05:00 - 06:00: Analysis and Estimations in VM Migration The chapter "Analysis and Estimations in VM Migration" explores the significance of load balancing in virtual machine (VM) management. It highlights scenarios where unbalanced workloads and impending downtimes necessitate the migration of VMs to ensure service reliability and efficiency. The discussion emphasizes how strategic VM migration is pivotal for service providers to maintain optimal performance and fulfill service commitments.
- 06:00 - 07:00: VM Migration for System Maintenance and Multiple VMs The chapter discusses ensuring critical service availability and reliability through proactive fault tolerance: failures should be anticipated and handled before they occur. It then introduces power management, in which idle or lightly loaded servers are switched to sleep or off mode to save energy.
- 07:00 - 08:00: Serial and Parallel Migration Approaches This chapter discusses energy-saving techniques in data centers, focusing on reducing energy consumption through server management. It introduces VM live migration as an effective way to optimize energy use. The idea is to assess the overall energy requirement and make use of the servers' idle, sleep, or off modes: underutilized servers are identified and their VMs are migrated to consolidate workloads, so that some servers can be put to sleep or turned off.
- 08:00 - 09:00: Conclusion and Future Discussion In the conclusion and future discussion chapter, the text focuses on optimizing power consumption and resource allocation in computing environments. It discusses the importance of resource sharing among virtual machines (VMs) and suggests that challenges such as limited hardware resources, including memory, cache, and CPU cycles, can be mitigated. An effective solution proposed is the relocation of VMs from overloaded servers to underloaded ones to improve efficiency. This approach enhances performance by balancing the workload across available resources, thereby overcoming the limitations and ensuring better power management. No specific future directions or additional discussions are detailed in the provided text.
Lecture 46 "VM Migration - Basics Migration strategies" Transcription
- 00:00 - 00:30 Hello. We continue our discussion on VM migration. In the last lecture we discussed the need for VM migration, why it is important, and what the different ways or
- 00:30 - 01:00 mechanisms to do it are. Today we will look at a little bit of analysis: how to build some sort of mathematical model, or how to approach these problems. So today we will do some VM migration analysis and also look at the strategies; these are the keywords. As a quick recap, migration is the process of moving a running
- 01:00 - 01:30 application or VM from one physical server or host to another physical server. Process state, storage, memory, network connections, and any active I/O all need to be taken care of, as we discussed in the last class. The process state, which we sometimes refer to as the CPU state, is an important factor that needs to be taken care of. Memory,
- 01:30 - 02:00 that is, what the VM is working on, is the active memory or main memory, which needs to be migrated. And of course networking is important, because the connectivity is no longer at that VM; it is somewhere else. So some sort of mapping of IP to MAC and things like that has to be looked into,
- 02:00 - 02:30 or some sort of ARP-related issues need to be resolved, and in some cases, if there are active I/Os, those need to be looked into. That we discussed in the last class. And why migrate? The major reasons are load balancing and system maintenance. We discussed a few of these things; some may be the same, but there may be a little
- 02:30 - 03:00 difference in wording. One definitely is load balancing. When the load is considerably unbalanced across hosts, unbalanced loads and impending downtime often require migration of VMs. So in order for the provider to offer a more faithful service, migration is one of the
- 03:00 - 03:30 things we may look into. The other thing we looked at is proactive fault tolerance. It is another challenge to guarantee critical service availability and reliability. Failures should be anticipated and proactively handled; in some of the references you will find this called proactive fault tolerance. So if there are failures, they need to be proactively handled. Then power management: switching idle servers to sleep mode, so the servers
- 03:30 - 04:00 that are idle or lightly loaded go to sleep mode or off mode based on resource demands, which leads to energy savings. VM live migration is a good technique for doing that. The overall energy requirement of the data center needs to be looked into: we can look for idle servers, or servers hosting very few VMs; those VMs can be migrated to other servers, and some of the servers can be freed so that they can go into sleep
- 04:00 - 04:30 mode or sometimes off mode, which means they will consume less power than they would in running mode. Then resource sharing: the challenge of limited hardware resources like memory, cache, and CPU cycles can be resolved by relocating VMs from overloaded servers to underloaded servers. So that is the other way
- 04:30 - 05:00 around. If there is a resource crunch, it basically becomes a sort of load balancing, but here the major challenge is resource sharing. In some cases there are, say, CPU-hungry operations, and those may be shifted to servers that have more resources on that front. So there is a resource requirement. Last but not least, in fact one of the major requirements, is system maintenance. So
- 05:00 - 05:30 if a service provider shuts down the system, or takes downtime for a long time, it is always a negative point; it affects its credibility and revenue as well. Physical systems sometimes need to be upgraded or go for system maintenance, so the VMs on that physical system must be migrated to alternate servers so that service downtime is minimized, or ideally so that uninterrupted
- 05:30 - 06:00 services can be provided. We have looked into all these scenarios. These are the major things we require when migrating. We also looked into two major approaches: one is the pre-copy approach and the other is the post-copy approach. Let us go a little more in depth into them. So
- 06:00 - 06:30 whatever we are talking about here is live migration. The pre-copy approach uses an iterative push phase that is followed by a stop-and-copy phase. Because of the iterative procedure, some memory pages are updated or modified while the transfer is going on; these are what we refer to as dirty pages. Dirty pages are
- 06:30 - 07:00 regenerated on the source server during the migration iterations. While a migration iteration is going on, pages at the source can be modified or updated, and that creates dirty pages. These dirty pages have to be resent to the destination host in a future iteration. Hence some of the frequently accessed memory pages are sent
- 07:00 - 07:30 several times, and that causes a long migration time. If you remember, we said that when the dirty pages are below a threshold, or ideally when there are no dirty pages, we say that the whole thing has been migrated properly. So if we look at the first phase, or
- 07:30 - 08:00 the first step of pre-copy migration, all pages are transferred while the VM runs continuously on the source. In further rounds, dirty pages are resent. So in step one all pages are transferred while the VM is still running on the source host, so there may be updates to the pages, and these dirty pages are resent. The second
- 08:00 - 08:30 step is the termination phase, which depends on defined thresholds: where you terminate the iterations, move the CPU state to the other side, and start working there. Termination is executed if any one of three conditions holds, which we have seen earlier. First, the number of iterations exceeds the predefined number of iterations: I allow, say, n iterations and it goes beyond what is defined;
- 08:30 - 09:00 that is what we call crossing $n_{max}$. Second, the total amount of memory that has been sent is above a threshold. Third, the number of dirty pages in the previous round falls below a defined threshold. If the dirty pages fall below the defined threshold, we stop this step two, the second phase of the pre-copy approach.
- 09:00 - 09:30 In the last phase, stop-and-copy, the migrating VM is suspended at the source server. After that, the processor state and the remaining dirty pages are moved to the destination server, and the VM is started at the destination server. When the VM migration process has completed correctly, that is, all these steps are
- 09:30 - 10:00 done, the hypervisor resumes the migrated VM on the destination server. If you look at KVM, Xen, and VMware, these hypervisors use the pre-copy technique for live VM migration. Popular hypervisors like KVM, Xen, and VMware, these
- 10:00 - 10:30 hypervisors use the pre-copy approach. This is the same thing in pictorial form, as a flowchart given in this reference, a nice survey paper that you can refer to. At the start of migration comes destination server selection, that is, where the destination server is. Resource
- 10:30 - 11:00 reservation at the destination site is important because the VM will be there, so the resources should be available. Then capture the whole memory and treat it as dirty, so the whole memory is transferred. Then iteratively copy the dirty memory pages of the VM to the destination server: initially the whole memory is copied, and then the incremental changes. Then the stop-and-copy check: if the dirty memory is still not below the threshold, the iteration goes
- 11:00 - 11:30 on. If it is, suspend the VM and transfer the VM state, that is, CPU registers, remaining VM memory, and so forth, to the destination server. Then the hypervisor resumes the VM at the destination server, and the whole migration process is done, or committed. So
- 11:30 - 12:00 this flowchart shows it nicely: initially the whole memory is transferred, then the dirty memory is copied iteratively from the source to the destination; once the termination threshold is reached, we suspend the VM and continue on the destination server. The other one is the post-copy
- 12:00 - 12:30 phase. In post-copy migration the processor state is transferred before the memory content, and then the VM can be started at the destination server. Post-copy migration techniques use demand paging, active push, and pre-paging; there are other techniques as well for prefetching memory pages at the destination server. So, the stop phase: stop the source VM and copy the CPU state to the destination. Then restart the
- 12:30 - 13:00 VM at the destination. Then the on-demand copy: copy the VM memory according to demand. This is what we see in the post-copy approach. If we look at the same thing in flowchart format, the
- 13:00 - 13:30 migration starts with destination server selection, which we discussed, and resource allocation at the destination site. Capture the VM state, like CPU registers and any I/O state, transfer this minimum VM state to the destination server, and then resume the VM at the destination server.
- 13:30 - 14:00 Then actively push the memory pages of the VM from the source to the destination server, so the whole memory is pushed across. If there is a page fault, copy the faulted pages from the source server. If not, check whether all memory pages of the VM have been transferred successfully; if yes, the migration ends. So it is what we call post-copying of the memory
- 14:00 - 14:30 from the source to the destination server. These are the two approaches we looked at. The major challenge is definitely the memory: pages are getting dirty, and how many iterations it takes is a challenge that needs to be looked into for the whole
- 14:30 - 15:00 process to be completed and committed, so that the whole state is transferred. Now let us do some analysis. This is with reference to some of the literature, and also work done by one of my former post-doctoral researchers, Dr. Sohrab, who is presently at NIT Surathkal.
- 15:00 - 15:30 Let $T_{mig}$ be the total migration time and $T_{down}$ be the total downtime. For non-live migration of a single VM, the migration time is $T_{mig} = V_m / R$, where $V_m$ is the memory size of the VM and $R$ is the
- 15:30 - 16:00 transmission rate. We are assuming that the transmission rate remains constant throughout the migration process. That is one assumption, and it may be a fair one, taking something like an average transmission rate for the type of connectivity. So the migration time $T_{mig}$ is the whole memory to be migrated divided by $R$.
- 16:00 - 16:30 This is the typical migration time. As we see, memory is the major constraint in migrating; the other things are one-time: you shut down the CPU and move on. So this is the major constraint. In non-live migration, the downtime is practically the same as the migration time. There are other considerations like CPU state transfer, but practically the downtime is
- 16:30 - 17:00 dictated by $T_{mig}$, the migration of the memory. In other words, for non-live migration, $T_{down}$ is more or less equal to $T_{mig}$. This is what we see when we have a non-live migration type
- 17:00 - 17:30 of scenario. Now we consider live migration with the pre-copy approach. Let $n$ be the total number of iterations in the pre-copy phase; that many iterations are required. And $T_{i,j}$
- 17:30 - 18:00 represents the time the $j$-th iteration takes to transmit the $i$-th virtual machine's memory; that is, the $i$-th virtual machine's memory is transferred in the $j$-th iteration. $V_m$ is the memory of the VM, as we have seen. $V_{th}$ is the threshold for stopping the iterations: the maximum amount that still needs to
- 18:00 - 18:30 be transferred. $n_{max}$ is the maximum number of iterations allowed; that is another threshold. So $V_{th}$ and $n_{max}$ dictate the stopping thresholds. And $r$: because we have the page size, the dirty page rate, and the transmission rate, $r$ is the factor that controls how many iterations there are and how fast pages are dirtied. So $r = PD/R$, that is, the page size $P$ times the dirty page rate $D$ divided by
- 18:30 - 19:00 the transmission rate $R$. You can think of it as the factor that tells us the rate of dirty paging. $T_{res}$ denotes the time taken to restart the VM on the destination server.
- 19:00 - 19:30 In the pre-copy migration mechanism, the VM memory is migrated iteratively, and we can compute the total migration time $T^{i}_{mig}$ of the $i$-th VM as follows: $T^{i}_{mig} = \sum_{j=0}^{n} T_{i,j} + T_{res} = \frac{V_m}{R} \cdot \frac{1 - r^{n+1}}{1 - r} + T_{res}$,
- 19:30 - 20:00 where, as we said, $T_{res}$ is the time taken to restart the VM at the destination server, and $T_{down}$ is given by $T_{down} = r^{n} V_m / R + T_{res}$. To work through the calculation: $T_{i,0} = V_m / R$, the whole memory at rate $R$. In $T_{i,1}$ we have the dirty paging, so taking that as
- 20:00 - 20:30 my base time, it is $T_{i,1} = (PD/R)\, T_{i,0}$, and the rest of the calculation follows from there. Similarly $T_{i,2} = (PD/R)\, T_{i,1}$, where $PD/R$ is the factor we called small $r$, and $T_{i,3}$ likewise, and so on. So for round $n-1$ we have $T_{i,n-1} = r^{n-1} V_m / R$. By this
- 20:30 - 21:00 analogy, in round $n$, where we stop the iterative memory copying and do the stop-and-copy, $T_{i,n} = r^{n} V_m / R$. So if we calculate the total time, it is $\frac{V_m}{R} \cdot \frac{1 - r^{n+1}}{1 - r}$, which is what is shown here,
- 21:00 - 21:30 plus the time taken to restart. This may be a little simplistic, and there are some assumptions: first of all, the rates of transmission are constant, and we are considering a few factors such as working within the threshold. But nevertheless it gives us an idea of how
- 21:30 - 22:00 much migration time may be required overall. Why is this important? Because we need to plan for it. If the migration is due to system maintenance, we are migrating on a predefined schedule, say on such-and-such date at such-and-such
- 22:00 - 22:30 time, so we have to determine how much time it will take and when the VM will be active again, and the consumers need to be informed accordingly. If the VM suddenly needs to be migrated due to some fault, we also need to know how much time the migration requires. So it is important to have an estimate. Things become a little complicated when we talk about I/O
- 22:30 - 23:00 and other things, especially network-level migration, where your connectivity needs to be repositioned. That may become a little complicated, but nevertheless this gives you a broad idea of how things work. There is another thing, and we are not going into the nitty-gritty of it: there is a need to estimate the number of rounds required
- 23:00 - 23:30 for the migration, considering there is a threshold on the VM memory and also an $n_{max}$ on how far we can go. So we need to look at the threshold memory relative to the VM memory, with $r$ again coming in as the dirty paging factor we calculated. Based on that, the number of rounds is the minimum, whichever is smaller, of
- 23:30 - 24:00 $n_{max}$ and $\log_r (V_{th} / V_m)$, that is, $n = \min\left(\log_r \frac{V_{th}}{V_m},\ n_{max}\right)$. Again this may be a somewhat gross estimate; in reality there are other factors that need to be considered, but it gives a ballpark figure. Now, when we migrate
- 24:00 - 24:30 multiple VMs: so far we considered the $i$-th VM and so on. When we need to migrate multiple VMs, there are generally two approaches. Generally it is multiple VMs, because when you go for system maintenance it is not just one VM to be migrated. There may be a special case, some resource requirement, where the VM you are
- 24:30 - 25:00 working on cannot run on this particular server or host, and because of a rise in requirements the VM needs to be migrated somewhere with better resources to cater to the customer. There may be reasons like that. But in most other cases, where you go for system maintenance, or you are expecting or have detected a fault and want to transfer the VMs to another host, it
- 25:00 - 25:30 is not one VM but multiple VMs that need to be transferred to the destination host. The typical strategies that intuitively come to mind: one is going serially, one VM, then the next VM, and so on, one at a time, which, looking at our earlier calculation, is like a single-VM migration repeated multiple times.
- 25:30 - 26:00 Or there can be a parallel migration, where I migrate all of them in parallel. In serial migration, the first VM selected to be migrated executes its pre-copy cycle while the other $m-1$ VMs continue to provide services. So I have, say,
- 26:00 - 26:30 $m$ VMs; the first VM gets migrated using the pre-copy cycle, and the remaining $m-1$ VMs continue to give service. As soon as the first VM enters its stop-and-copy phase, the remaining $m-1$ VMs are suspended, and they are copied after the first completes its stop-and-copy phase. So once it enters stop-and-copy, the others are suspended, and then
- 26:30 - 27:00 the first VM completes its stop-and-copy phase. The reason for stopping the remaining $m-1$ VMs is to stop those VMs from dirtying memory. With a single VM, once it is in the stop-and-copy phase we expect that no one is dirtying the memory, because the single VM that was dirtying it has been stopped; it is not dirtying the
- 27:00 - 27:30 memory. But in this case the other VMs may dirty the memory. One assumption is that each VM is copied at the full transmission rate, so the full rate $R$ is available, as we have seen. The downtime for serial migration includes the stop-and-copy phase of the first VM, the migration time for the $m-1$ VMs, and the time to resume the VMs at the destination. Those are the things considered if we look at
- 27:30 - 28:00 this type of serial migration. When we look at parallel migration, the major difference is that all $m$ VMs start their pre-copy cycles simultaneously. It is a big challenge. Because it is parallel, each of these $m$ VMs can effectively be considered to get a
- 28:00 - 28:30 rate of $R/m$ of the transmission capacity. Actually there will be some more overhead, so it is effectively less, but we can consider that each gets about $R/m$ of the bandwidth or transmission capacity to transmit VM 1 to VM $m$ to the destination. Here we are considering that the transfer is from one host
- 28:30 - 29:00 or server to another host or server. As the VM sizes are the same and the transmission rates are the same, which is another assumption, the VMs begin the stop-and-copy phase more or less at the same time, and they end it at the same time as well. Maybe I should not call it a strong assumption, but it is an assumption.
- 29:00 - 29:30 So we also assume that all VMs are of the same size, so ideally they have the same transmission rate, and although they are serving different things, their iterations and stop-and-copy phase happen more or less at the same time. Since the stop-and-copy phases are executed in parallel and consume the same amount of time, the downtime is in fact equivalent to the time taken by the stop-and-copy phase of any one VM plus the time taken to resume
- 29:30 - 30:00 the VMs. If we look at the real downtime: as long as memory is being copied, the VM is active at the source end. The downtime starts when you enter the stop-and-copy phase. As we are considering that these all happen at the same time, the downtime is that, plus
- 30:00 - 30:30 the time it takes to resume operation at the destination. That is the parallel case. But nevertheless the effective transmission rate each VM gets is $R/m$, and that needs to be considered. So if we look at today's
- 30:30 - 31:00 discussion, what we tried to do is a broad analysis: some estimation of the migration time and downtime, and also an estimate of how many iterations are required based on the dirty paging phenomenon. We also tried to
- 31:00 - 31:30 get an overview of what happens when we have multiple VMs, more than one VM: how parallel migration vis-à-vis serial migration is carried out. With this, let us stop our discussion for today, and we will continue with other aspects of
- 31:30 - 32:00 cloud computing in our subsequent classes. Thank you.
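The serial and parallel downtime descriptions from the lecture can be compared numerically. The sketch below is one possible reading of those descriptions, not a formula stated in the lecture: it assumes m equal-sized VMs, a fixed number of pre-copy rounds n, a single shared resume time, and that each parallel VM gets exactly R/m with no extra overhead. All function and parameter names are hypothetical.

```python
def serial_downtime(m, v_mem, rate, page_size, dirty_rate, n, t_res):
    """Stop-and-copy of the first VM, then the other m-1 VMs copied
    while suspended at the full rate R, then the resume time."""
    r = page_size * dirty_rate / rate          # dirty ratio at full rate R
    first_stop_copy = (r ** n) * v_mem / rate
    suspended_copies = (m - 1) * v_mem / rate
    return first_stop_copy + suspended_copies + t_res


def parallel_downtime(m, v_mem, rate, page_size, dirty_rate, n, t_res):
    """All m VMs run pre-copy at once; each sees an effective rate of R/m,
    so the downtime is one VM's stop-and-copy at R/m plus the resume time."""
    share = rate / m
    r = page_size * dirty_rate / share         # dirty ratio grows to m * P * D / R
    return (r ** n) * v_mem / share + t_res


# Illustrative values only: 4 VMs of 1024 MB, 100 MB/s link, 3 pre-copy rounds.
args = dict(m=4, v_mem=1024.0, rate=100.0, page_size=0.004,
            dirty_rate=2500.0, n=3, t_res=0.5)
print(serial_downtime(**args), parallel_downtime(**args))
```

Under these numbers the parallel strategy has the shorter downtime, but its dirty ratio is m times larger, so the pre-copy rounds stop converging once m·P·D/R approaches 1. That is the trade-off behind the lecture's remark that parallel migration is the bigger challenge.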