Category Archives: ESXi5

vSphere Hardening (Updated with General Availability)

One of the stronger cases in the VMware vs Hyper-V argument is the ‘footprint’ of the hypervisor: VMware wins in terms of sizing and claims a significant advantage from the smaller attack surface that comes with it. However, this is often taken as an excuse by admins to leave a default configuration on hosts and within the vSphere components used to make up a virtualisation solution. VMware has made things a little easier for those concerned that a vulnerability in even a single host can mean chaos for your virtual environment; they have now released an updated hardening guide covering the following components:

  • Virtual Machines
  • ESXi hosts
  • Virtual Network
  • vCenter Server plus its database and clients
  • vCenter Web Client
  • vCenter SSO Server
  • vCenter Virtual Appliance (VCSA) specific guidance
  • vCenter Update Manager

To download the guide directly you can use this link, and this one for the change log.

VMware Blog Details Here

Thanks to Mike Foley for confirming the version status of the published document.

Paper: VMware Horizon View Large-Scale Reference Architecture

From virtualization.info (it’s a good blog that you should read!)

VMware has released a paper titled “VMware Horizon View Large-Scale Reference Architecture”. The 30-page paper details a reference architecture based on real-world test scenarios, user workloads and infrastructure system configurations. The RA uses the VCE Vblock Specialized System for Extreme Applications, composed of Cisco UCS server blades and EMC XtremIO flash storage, to support a 7,000-user VMware Horizon View 5.2 deployment. Benchmarking was done using the Login VSI Max benchmarking suite.

The paper covers the following topics:

  • Executive Summary
  • Overview
    • VCE Vblock Specialized System for Extreme Applications
    • VMware Horizon View
    • Storage Components
    • Compute and Networking Components
    • Workload Generation and Measurement
  • Test Results
    • Login VSI
    • Timing Tests
    • Storage Capacity
  • System Configurations
    • vSphere Cluster Configurations
    • Networking Configurations
    • Storage Configurations
    • vSphere Configurations
    • Infrastructure and Management Servers
    • Horizon View Configuration
    • EMC XtremIO Storage Configurations

Conclusion:

Our test results demonstrate that it is possible to deliver an Ultrabook-quality user experience at scale for every desktop, with headroom for any desktop to burst to thousands of IOPS as required to drive user productivity, thanks to the EMC XtremIO storage platform, which provides considerably higher levels of application performance and lower virtual desktop costs than alternative platforms. The high performance and simplicity of the EMC XtremIO array and the value-added systems integration work provided by VCE as part of the Vblock design contributed significantly to the overall success of the project.

vSphere Multi-Core vCPU Clarifications

One of the most common misconfigurations I see in VMware environments is the use of multiple cores per socket. VMware has released a clarification post reminding people of the best-practice advice (see below) and clarifying the performance behaviour of multi-core vCPUs.

This complements a better post by the SANMAN (source of the original graphics).

#1 When creating a virtual machine, by default, vSphere will create as many virtual sockets as you’ve requested vCPUs and the cores per socket is equal to one. I think of this configuration as “wide” and “flat.” This will enable vNUMA to select and present the best virtual NUMA topology to the guest operating system, which will be optimal on the underlying physical topology.

#2 When you must change the cores per socket, commonly due to licensing constraints, ensure you mirror the physical server’s NUMA topology. This is because when a virtual machine is no longer configured by default as “wide” and “flat,” vNUMA will not automatically pick the best NUMA configuration based on the physical server, but will instead honor your configuration – right or wrong – potentially leading to a topology mismatch that does affect performance.
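
To make #2 concrete, here is a hedged sketch of the corresponding virtual machine configuration file (.vmx) entries; the host sizes below are hypothetical examples, and cpuid.coresPerSocket is the setting covered in the KB article referenced further down.

“Wide” and “flat” (the default) for a 20-vCPU virtual machine, i.e. 20 sockets by 1 core:

    numvcpus = "20"
    cpuid.coresPerSocket = "1"

Licensing-constrained, on a hypothetical host with two 10-core sockets where each socket is a single NUMA node, mirroring that topology as 2 sockets by 10 cores:

    numvcpus = "20"
    cpuid.coresPerSocket = "10"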

 

Full Content of the VMware Post is below:

Does corespersocket Affect Performance?

There is a lot of outdated information regarding the use of a vSphere feature that changes the presentation of logical processors for a virtual machine, into a specific socket and core configuration. This advanced setting is commonly known as corespersocket.

It was originally intended to address licensing issues where some operating systems had limitations on the number of sockets that could be used, but did not limit core count.

KB Reference: http://kb.vmware.com/kb/1010184

It’s often been said that this change of processor presentation does not affect performance, but it may impact performance by influencing the sizing and presentation of virtual NUMA to the guest operating system.

Reference: Performance Best Practices for VMware vSphere 5.5 (page 44): http://www.vmware.com/pdf/Perf_Best_Practices_vSphere5.5.pdf

Recommended Practices

#1 When creating a virtual machine, by default, vSphere will create as many virtual sockets as you’ve requested vCPUs and the cores per socket is equal to one. I think of this configuration as “wide” and “flat.” This will enable vNUMA to select and present the best virtual NUMA topology to the guest operating system, which will be optimal on the underlying physical topology.

#2 When you must change the cores per socket, commonly due to licensing constraints, ensure you mirror the physical server’s NUMA topology. This is because when a virtual machine is no longer configured by default as “wide” and “flat,” vNUMA will not automatically pick the best NUMA configuration based on the physical server, but will instead honor your configuration – right or wrong – potentially leading to a topology mismatch that does affect performance.

To demonstrate this, the following experiment was performed. Special thanks to Seongbeom for this test and the results.

Test Bed

Dell R815, an AMD Opteron 6174 based server with 4 physical sockets by 12 cores per processor = 48 logical processors.

The AMD Opteron 6174 (aka Magny-Cours) processor is essentially two 6-core Istanbul processors assembled into a single socket. This architecture means that each physical socket is actually two NUMA nodes, so this server actually has eight NUMA nodes and not four, as some may incorrectly assume.

Within esxtop, we can validate the total number of physical NUMA nodes that vSphere detects.

Test VM Configuration #1 – 24 sockets by 1 core per socket (“Wide” and “Flat”)

Since this virtual machine requires 24 logical processors, vNUMA automatically creates the smallest topology to support this requirement being 24 cores, which means 2 physical sockets, and therefore a total of 4 physical NUMA nodes.
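
As a quick illustrative check (plain Python, nothing VMware provides), the arithmetic for this host and VM size works out as follows:

    import math

    vcpus = 24
    cores_per_socket = 12          # AMD Opteron 6174 package
    numa_nodes_per_socket = 2      # Magny-Cours: two 6-core dies per package
    cores_per_numa_node = cores_per_socket // numa_nodes_per_socket   # 6

    sockets_spanned = math.ceil(vcpus / cores_per_socket)             # 2 physical sockets
    numa_nodes_presented = math.ceil(vcpus / cores_per_numa_node)     # 4 NUMA nodes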

Within the Linux based virtual machine used for our testing, we can validate what vNUMA presented to the guest operating system by using: numactl --hardware

Next, we ran an in-house micro-benchmark, which exercises processors and memory. For this configuration we see a total execution time of 45 seconds.

Next let’s alter the virtual sockets and cores per socket of this virtual machine to generate another result for comparison.

Test VM Configuration #2 – 2 sockets by 12 cores per socket

In this configuration, while the virtual machine is still configured to have a total of 24 logical processors, we manually intervened and configured 2 virtual sockets by 12 cores per socket. vNUMA will no longer automatically create the topology it thinks is best, but instead will respect this specific configuration and present only two virtual NUMA nodes as defined by our virtual socket count.

Within the Linux based virtual machine, we can validate what vNUMA presented to the guest operating system by using: numactl --hardware

Re-running the exact same micro-benchmark we get an execution time of 54 seconds.

This configuration, which resulted in a non-optimal virtual NUMA topology, incurred a 17% increase in execution time.

Test VM Configuration #3 – 1 socket by 24 cores per socket

In this configuration, while the virtual machine is again configured to have a total of 24 logical processors, we manually intervened and configured 1 virtual socket by 24 cores per socket. Again, vNUMA will no longer automatically create the topology it thinks is best, but will instead respect this specific configuration and present only one NUMA node as defined by our virtual socket count.

Within the Linux based virtual machine, we can validate what vNUMA presented to the guest operating system by using: numactl --hardware

Re-running the micro-benchmark one more time we get an execution time of 65 seconds.

This configuration, with yet a different non-optimal virtual NUMA topology, incurred a 31% increase in execution time.

To summarize, this test demonstrates that changing the corespersocket configuration of a virtual machine does indeed have an impact on performance in the case when the manually configured virtual NUMA topology does not optimally match the physical NUMA topology.

The Takeaway

Always spend a few minutes to understand your physical server’s NUMA topology and leverage that when right-sizing your virtual machines.

Other Great References:

The CPU Scheduler in VMware vSphere 5.1

Check out VSCS4811 Extreme Performance Series: Monster Virtual Machines in VMworld Barcelona

vSphere Blog Details Replication Improvements in 5.5

From this vSphere Blog entry by Ken Warneburg:

The changes they’ve made fall into three main areas:

  1. Improved buffering algorithms at the source hosts resulting in better read performance with less load on the host and better network transfer performance
  2. More efficient TCP algorithms at the source site resulting in better latency handling
  3. More efficient buffering algorithms at the target site resulting in better write performance with less load on the host

Let’s look at each of these in a bit more detail:

Source improvements

The way blocks are queued up for transfer has changed slightly from the past iterations where TransferDiskMaxBufferCount & TransferDiskMaxExtentCount were the primary throttling mechanisms for reading and sending changed blocks.

Now, we use a global heap on each host to hold the blocks that have been identified to be sent.  We create a heap of appropriate size to hold blocks for potentially the maximum number of VMs on a host.  This is dynamically sized according to the host memory size and the maximum number of VMs on the host, but roughly it sizes to a bit more than 3MB of memory.  We then load that heap with changed blocks identified by the agent that tracks the blocks as they change.

The blocks are read into the heap essentially in a round robin among the replicated disks, grabbing up to 16 of the changed blocks at a time, and a maximum of 1024 IOPS per host ensures this cyclical reading doesn’t overburden anything.

Blocks are then shipped from the heap to the VR Server on the remote site, with a maximum of 64 extents still “in-flight” that have not been acknowledged as written.  As those blocks come back acknowledged, the agent is free to send more from the heap.
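
The mechanics described above (one shared per-host buffer, round-robin reads of up to 16 changed blocks per disk, and a 64-extent unacknowledged window) can be pictured with a small sketch. This is purely illustrative Python, not VMware code, and the 1024 IOPS per-host cap is omitted for brevity:

    from collections import deque

    MAX_BLOCKS_PER_DISK_TURN = 16   # grab up to 16 changed blocks per disk per pass
    MAX_EXTENTS_IN_FLIGHT = 64      # unacknowledged extents allowed on the wire

    def replicate_pass(disks, heap, in_flight, send):
        """One scheduling pass. disks: list of deques of changed blocks per
        replicated disk. heap: the shared per-host buffer (deque). in_flight:
        deque of sent but not-yet-acknowledged extents. send(block): ships one
        extent to the remote VR server."""
        # Round-robin read phase: a bounded batch from each replicated disk.
        for disk in disks:
            for _ in range(min(MAX_BLOCKS_PER_DISK_TURN, len(disk))):
                heap.append(disk.popleft())
        # Ship phase: never exceed the in-flight window.
        while heap and len(in_flight) < MAX_EXTENTS_IN_FLIGHT:
            block = heap.popleft()
            send(block)
            in_flight.append(block)

    def acknowledge(in_flight, count):
        """The remote VR server acknowledged `count` extents, freeing window space."""
        for _ in range(min(count, len(in_flight))):
            in_flight.popleft()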

The net result is that this is a much more efficient mechanism as we can load and send from a global heap rather than treating each VM as its own object.  Fundamentally this leads to a greater overall efficiency of the VR resource manager, and allows getting data to the VR Server faster.

TCP Changes

The TCP algorithm at the source site has been changed to a CUBIC-based transport. This is a fairly minor change, but it has a very good impact on long fat networks, such as the higher-latency yet still high-bandwidth connections that people often use for replication. CUBIC uses much smarter means of determining factors like TCP window size, probing more aggressively over time and looking specifically at factors like the time since the last congestion event. It will also size the TCP window independently of ACKs.

All around this makes things much more efficient for data sends across higher latency networks, where bandwidth is less an issue than RTT.

Recipient VR Appliance Changes

Vast improvements have been made to the way the vSphere Replication Appliance receives and writes out the changed blocks, by making some small but very clever adaptations:

The biggest change is in the way the appliance sends its writes to the disk with Network File Copy (NFC). We now use buffered writes instead of direct IO. Direct IO requires opening the target disk, writing an extent, waiting for write acknowledgement, moving on to the next write, and so on. Instead, with buffered writes, the VRA will open the target disk in buffered mode and write using NFC with a single sync flag at the end of the write. In essence these are async writes with a sync write at the end of each ‘transaction’ with the disk. This is a considerably quicker way for VR to do NFC, with no penalty in host performance, while still maintaining data integrity. This gets things to disk much quicker and provides a huge leap in performance, as we can now acknowledge a whole bunch of writes with one transaction.
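
A hedged sketch of the difference, in plain Python rather than anything from the appliance itself: direct IO is approximated here by a flush after every extent, standing in for the per-write acknowledgement described above, while the buffered path issues all the writes and then a single sync at the end of the ‘transaction’.

    import os

    def write_extents_per_write_sync(path, extents):
        """Pre-5.5-style behaviour, approximated: wait for every single write."""
        fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
        try:
            for offset, data in extents:
                os.lseek(fd, offset, os.SEEK_SET)
                os.write(fd, data)
                os.fsync(fd)          # block until this extent is on disk
        finally:
            os.close(fd)

    def write_extents_buffered(path, extents):
        """5.5-style behaviour, approximated: buffered writes, one sync at the end."""
        fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
        try:
            for offset, data in extents:
                os.lseek(fd, offset, os.SEEK_SET)
                os.write(fd, data)    # lands in the page cache and returns quickly
            os.fsync(fd)              # single sync closes out the whole transaction
        finally:
            os.close(fd)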

A further change is the use of “coalescing buffers” to consolidate contiguous blocks on the appliance before performing a single NFC stream rather than doing each extent in isolation. In 5.1, for example, if there were 128 contiguous 8k writes they would be sent as one NFC transaction but issued as 128 writes to the kernel. In 5.5, if they are contiguous blocks, they are coalesced into a single write transaction that NFC issues to the kernel. This means less disk and host overhead, and again gets things to disk much quicker.
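
Again purely as an illustration (not the appliance’s code), coalescing contiguous extents before issuing them looks something like this; the example at the end mirrors the 128 contiguous 8k writes mentioned above collapsing into a single write:

    def coalesce(extents):
        """extents: list of (offset, data) pairs sorted by offset.
        Returns a shorter list in which adjacent extents are merged."""
        merged = []
        for offset, data in extents:
            if merged and merged[-1][0] + len(merged[-1][1]) == offset:
                prev_offset, prev_data = merged[-1]
                merged[-1] = (prev_offset, prev_data + data)
            else:
                merged.append((offset, data))
        return merged

    # 128 contiguous 8k extents become one 1 MB write transaction.
    extents = [(i * 8192, b"\x00" * 8192) for i in range(128)]
    assert len(coalesce(extents)) == 1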

Coupling coalesced buffers with buffered writes and a larger amount of cached data gets much faster writes from the VR Appliance to the host’s target disk.

So that’s what has been changed to improve performance, but what can we now expect in terms of throughput?  Coming up soon in another blog post, I’ll have some sample data from my labs, and a few warnings about the impact of this.  As an anecdotal tease though, I’m seeing roughly 40Mbps for a single VM…

vSphere & ESXi 5.5 Download Available Now

Thanks to the Yellow Bricks blog for letting me know that the binaries are now available; links below:

Core vSphere and automation/tools:

Suite components:

What’s New in vSphere 5.5 – Quick Reference

A VMware blogger has published a very useful PDF guide to the new features included in vSphere 5.5. A copy of his blog entry is reproduced below:

With all the new announcements at VMworld last week you would be excused for not knowing the technical advances that have been added to the vSphere 5.5 release, and specifically to the platform. With this in mind we wanted to create a quick technical overview of what’s new, so you are armed with the information you need as a technical person.

This “Quick Reference” was created as a single place to look up the new features that interest you and to give you some base information so you can quickly understand which areas you may want to explore in more detail. I have found it useful; I hope you will too.

Please note that this guide may need adjusting in the future, but all updated copies will be uploaded to this page.

Current Version 0.5

Download the vSphere 5.5 Quick Reference here and let us know if the format of this information is useful by adding a comment below.

For more information, make sure you check out the Summary of What’s New in vSphere 5.5 here by Kyle Gleed.