Thursday, August 9, 2012

Making the Case for Long Distance Virtual Machine Mobility

With VMworld coming up I’m reminded of a top-of-mind subject: virtual machine mobility. The reason for moving virtual machines is to better allocate server resources and maintain application performance. It’s a useful technology that works great in the data center. We also hear a lot about the need to move virtual machines across the WAN, live, without losing sessions. This is known as long distance vMotion or, generically, long distance live migration. It might sound like a good idea, but it gets complicated once you think outside the data center walls and across the WAN. It creates complexity in the network, because maintaining sessions requires keeping the same IP address and MAC address after the move. There are many proposed use cases for it, but is it such a good idea?

Limitations of Live Migration
Long distance live migration over the WAN has limitations due to latency and bandwidth requirements, and due to the complexity of extending layer 2, which is required to keep the same MAC address. One issue is that inbound traffic arrives first at the original data center, where the gateway still lives, and then loops across the WAN to the new data center where the VM has moved. Traffic can also loop back over the WAN to reach storage that stayed behind. Add to that the bandwidth required to handle large scale moves, the issues with storage pooling and storage replication, and the complexity of implementing the L2 bridging architecture. If we are going to take on all of this complexity to move virtual machines over the WAN, there had better be a good reason to do it. But is there? Here is a look at the various use cases for long distance live migration that I have found.

Optimizing Resource Utilization
The most common and most logical use case for live migration is optimizing resource utilization in the data center within an L2 domain. If a server has resources available because of lower application activity, VMs can be moved to it. If a VM needs more compute resources than are available on its current server, it can be moved to another server in a nearby rack. If a server is experiencing technical issues or is going down for maintenance, its VMs can be moved to another server. No WAN extensions are required for this use case and it makes a lot of sense. This is what vMotion was designed for.
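For the in-data-center case, a move like this can be scripted against the vSphere API. Below is a minimal sketch using pyVmomi, the VMware vSphere Python SDK; the vCenter address, credentials, VM name, and target host are placeholders, and error handling is omitted.

```python
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

# Lab-only: skip certificate verification; use a verified context in production.
ctx = ssl._create_unverified_context()
si = SmartConnect(host="vcenter.example.com", user="administrator@vsphere.local",
                  pwd="secret", sslContext=ctx)
content = si.RetrieveContent()

def find_by_name(vimtype, name):
    """Return the first inventory object of the given type with a matching name."""
    view = content.viewManager.CreateContainerView(content.rootFolder, [vimtype], True)
    obj = next((o for o in view.view if o.name == name), None)
    view.DestroyView()
    return obj

vm = find_by_name(vim.VirtualMachine, "app-vm-01")           # VM to move (placeholder)
dest = find_by_name(vim.HostSystem, "esxi-02.example.com")   # target host in the same cluster

# Live-migrate (vMotion) the running VM to the destination host; storage stays where it is.
task = vm.MigrateVM_Task(host=dest, pool=vm.resourcePool,
                         priority=vim.VirtualMachine.MovePriority.defaultPriority)

Disconnect(si)
```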

Hybrid Cloud
For service providers offering IaaS or cloud computing to enterprise customers, cloud bursting is a proposed use case. In this model, workloads in virtual machines are burst from the enterprise DC to the SP DC once they exceed the resources available in their original DC, for example during special projects with large, unpredictable workloads. The proposal is to do this via live migration over WAN extensions between the enterprise DC and the SP DC, in a model called cloud bursting or hybrid cloud. The enterprise will likely have a contract in place with the SP to take this traffic, and possibly a VPN if the traffic needs to be isolated from the Internet. An alternative way to access the capacity of a spill-over data center is to simply start VMs in the SP DC as traffic ramps up and direct new user sessions to them; live migration is not required with this method.
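As a rough illustration of that alternative, here is a small, purely hypothetical decision function for placing new workloads rather than migrating running ones; the data center names and the 85% threshold are assumptions made up for the example.

```python
# Illustrative only: decide where to start the next VM instead of live-migrating.
BURST_THRESHOLD = 0.85  # assumed: burst once the enterprise DC passes 85% utilization

def placement_for_new_workload(enterprise_util: float,
                               enterprise_dc: str = "enterprise-dc",
                               provider_dc: str = "sp-dc") -> str:
    """Return the data center where the next VM should be started."""
    return provider_dc if enterprise_util >= BURST_THRESHOLD else enterprise_dc

# At 90% local utilization, new VMs are created in the provider DC and the
# load balancer steers new sessions there; nothing is live-migrated.
print(placement_for_new_workload(0.90))   # -> "sp-dc"
print(placement_for_new_workload(0.60))   # -> "enterprise-dc"
```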

Cloud Federation
In this scenario workloads are moved between SP data centers based on capacity availability. A data center might run short of compute resources, so some workloads are sent to other data centers using L2 data center interconnect and live migration. A variation of this use case proposes an intelligent method for deciding where to send the workloads; this is sometimes called data center federation. All of the issues mentioned above come into play if VMs are moved around in this way. Amazon, by contrast, deals with the data center capacity issue without moving virtual machines: it simply controls VM creation in a DC based on available capacity. This works because many VMs run batch jobs and have a short life span, so capacity quickly becomes available as they terminate. Alternatively, new VM creation can be redirected to another data center instead of moving VMs.
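A capacity-aware placement policy of that kind can be very simple. The sketch below is illustrative only; the data center names and free-slot counts are invented.

```python
# Place new VMs in whichever federated data center has the most free capacity,
# instead of moving running VMs between them.
def pick_datacenter(free_slots_by_dc: dict[str, int]) -> str:
    """Return the data center with the most free VM slots; fail if none have room."""
    dc, free = max(free_slots_by_dc.items(), key=lambda item: item[1])
    if free <= 0:
        raise RuntimeError("no capacity available in any data center")
    return dc

# Example: us-east is nearly full, so the next VM lands in eu-west.
capacity = {"us-east": 3, "eu-west": 120, "ap-south": 45}
print(pick_datacenter(capacity))   # -> "eu-west"
```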

Disaster Avoidance
In this model, if a disaster such as a severe storm is expected, the idea is to start moving the virtual machines to the backup data center while they are running. To do this you need to reserve capacity on a high-bandwidth, low-latency link to handle the traffic created once live migration starts. It is difficult to estimate how much bandwidth would be needed and how long the migration would take, since the VMs are constantly being updated by users and live migration must keep sending the deltas. It would also be expensive to keep that bandwidth available just in case a once-in-a-century disaster strikes that requires a total data center shutdown. An easier and cheaper alternative is a backup plan in which VMs are shut down, copied and restarted; the time and bandwidth required for that are relatively easy to estimate.
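A back-of-the-envelope calculation shows why. The sketch below uses a simple pre-copy model and made-up numbers (2 TB of VM state, a 1 Gbps link, memory dirtied at 0.4 Gbps); it is an approximation for illustration, not a vendor formula.

```python
def cold_copy_hours(data_gb: float, link_gbps: float) -> float:
    """Time to copy a fixed amount of data over the link, with no deltas to resend."""
    return (data_gb * 8) / (link_gbps * 3600)

def live_migration_hours(data_gb: float, link_gbps: float, dirty_gbps: float) -> float:
    """Approximate pre-copy time: it converges only if memory is dirtied more
    slowly than the link can send it, and the effective rate is the difference."""
    if dirty_gbps >= link_gbps:
        return float("inf")          # migration never converges
    return (data_gb * 8) / ((link_gbps - dirty_gbps) * 3600)

# Example: 2 TB of VM state, a 1 Gbps link, pages dirtied at 0.4 Gbps.
print(f"cold copy:      {cold_copy_hours(2000, 1.0):.1f} h")            # ~4.4 h
print(f"live migration: {live_migration_hours(2000, 1.0, 0.4):.1f} h")  # ~7.4 h
```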

Follow the Sun
In this model workloads are moved from data center to data center based on the time of day. This “follow the sun” model has latency and bandwidth requirements that might not be met over long distances, and especially transoceanic ones: VMware recommends less than 10 ms of latency, roughly 600 Mbps of bandwidth, and about 200 km at most. While organizations do successfully use a follow-the-sun model for routing telephone calls and user sessions to call centers, live migration of VM workloads has serious limitations over such distances. Imagine also the issues with replicating the data storage to follow the virtual machines, or with looping back over the WAN to reach storage that stayed behind.
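Speed-of-light math alone makes the point: light in fiber covers roughly 200,000 km per second, about 5 microseconds per kilometer one way, so a transoceanic path blows through a sub-10 ms budget before any equipment delay is counted. A quick sketch, with example distances:

```python
# Best-case propagation delay over fiber; distances below are just examples.
FIBER_KM_PER_SEC = 200_000   # roughly the speed of light in fiber

def round_trip_ms(distance_km: float) -> float:
    """Round-trip propagation delay over fiber, ignoring all equipment delay."""
    return (2 * distance_km / FIBER_KM_PER_SEC) * 1000

print(f"200 km metro link:     {round_trip_ms(200):.1f} ms RTT")    # ~2 ms
print(f"8,000 km transoceanic: {round_trip_ms(8000):.1f} ms RTT")   # ~80 ms
```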

Redirection using Server Load Balancing
You can create or clone a new VM in a second data center and forward new sessions to it using server load balancing. The load balancer stops forwarding new traffic to the original VM, and the original VM is taken out of service once all of its sessions have drained. This is a common and well-established method of scaling traffic and ensuring application availability that has no location or IP restrictions and does not require an L2 stretch.
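Conceptually, this is just connection draining. The sketch below is illustrative only; real load balancers provide this natively, and the backend names and session counts are invented.

```python
from dataclasses import dataclass

@dataclass
class Backend:
    name: str
    active_sessions: int = 0
    draining: bool = False

def pick_backend(backends: list[Backend]) -> Backend:
    """Send new sessions only to backends that are not being drained."""
    candidates = [b for b in backends if not b.draining]
    return min(candidates, key=lambda b: b.active_sessions)

def ready_to_decommission(backend: Backend) -> bool:
    """The original VM can be shut down once its last session finishes."""
    return backend.draining and backend.active_sessions == 0

old_vm = Backend("vm-dc1", active_sessions=12, draining=True)   # original data center
new_vm = Backend("vm-dc2", active_sessions=3)                   # clone in the new data center
print(pick_backend([old_vm, new_vm]).name)   # -> "vm-dc2" (new sessions go here)
print(ready_to_decommission(old_vm))         # -> False (existing sessions still open)
```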

Routed Live Migration
If you still want to do live migration, an alternative to L2 stretch that you could consider is routed L3 live migration, which several hypervisor platforms support, including Microsoft, KVM, and Citrix. This method allows the IP and MAC addresses to change and uses dynamic DHCP and DNS updates to make the change propagate faster. Implementations can also use session tunneling to keep existing sessions alive if needed.
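As one example of the DNS piece, a dynamic DNS update (RFC 2136) can repoint the application’s record at the VM’s new address after the move. Here is a minimal sketch using the dnspython library; the zone, record name, DNS server address, and new IP are placeholders.

```python
import dns.update
import dns.query

# Repoint the record at the VM's new address after an L3 move, with a short
# TTL so clients pick up the change quickly; all values here are placeholders.
update = dns.update.Update("example.com")
update.replace("app-vm-01", 60, "A", "10.2.0.25")

response = dns.query.tcp(update, "192.0.2.53", timeout=5)
print(response.rcode())   # 0 (NOERROR) means the zone accepted the update
```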

Cold Migration
The need for live migration assumes you must move VMs while they are running and maintain their sessions, which is what creates the complexity in the network. There is always the option to move virtual machines cold: shut them down, move them, and restart them. There are no real issues with doing this, and many organizations find it a great way to share work with global teams.
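For completeness, here is what a scripted cold move might look like with pyVmomi, assuming the connection and the find_by_name() helper from the earlier vMotion sketch; the VM name, target host, and datastore choice are placeholders.

```python
from pyVim.task import WaitForTask
from pyVmomi import vim

vm = find_by_name(vim.VirtualMachine, "app-vm-01")
dest = find_by_name(vim.HostSystem, "esxi-dc2-01.example.com")   # host in the other DC

WaitForTask(vm.PowerOffVM_Task())                 # shut down; existing sessions end here

spec = vim.vm.RelocateSpec(host=dest,
                           datastore=dest.datastore[0],          # any datastore the host sees
                           pool=dest.parent.resourcePool)
WaitForTask(vm.RelocateVM_Task(spec))             # move the VM and its disks

WaitForTask(vm.PowerOnVM_Task())                  # restart it in the new data center
```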

How You Do It Matters
Virtual machine mobility is a useful tool for maximizing server resource utilization and ensuring application performance. Being able to move virtual machines around has a lot of benefits; however, there are serious considerations and consequences with long distance live migration. I think it’s usually better to keep things simple, since adding complexity to the network can have unintended consequences. If you are planning to implement long distance live migration, then the methods used to implement it are extremely important to success. There are several technologies that make this possible, and they could be the topic of a future blog. In the meantime I am interested to know your plans for long distance vMotion. Please take this simple survey. This post first appeared on my Juniper blog site; see link.
