The changes they’ve made fall into three main areas:
- Improved buffering algorithms at the source hosts resulting in better read performance with less load on the host and better network transfer performance
- More efficient TCP algorithms at the source site resulting in better latency handling
- More efficient buffering algorithms at the target site resulting in better write performance with less load on the host
Let’s look at each of these in a bit more detail:
The way blocks are queued up for transfer has changed slightly from past releases, where TransferDiskMaxBufferCount and TransferDiskMaxExtentCount were the primary throttling mechanisms for reading and sending changed blocks.
Now, we use a global heap on each host to hold the blocks that have been identified to be sent. We create a heap of appropriate size to hold blocks for potentially the maximum number of VMs on a host. This is dynamically sized according to the host memory size and the maximum number of VMs on the host, but roughly it sizes to a bit more than 3MB of memory. We then load that heap with changed blocks identified by the agent that tracks the blocks as they change.
Blocks are read into the heap by what is in essence a round robin among the replicated disks, grabbing up to 16 changed blocks at a time. Reads are capped at 1024 IOPS per host to ensure this cyclical reading doesn’t overburden the host.
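As a rough illustration of that round-robin fill, here is a minimal Python sketch. It is not the actual agent code: the function name, the `dirty_by_disk` mapping and the `heap_capacity` parameter (counted in blocks) are all made up for the example, and the 1024 IOPS cap is not modelled. Only the batch size of 16 comes from the description above.

```python
from collections import deque

BATCH_PER_DISK = 16  # from the text: up to 16 changed blocks per disk per turn


def round_robin_fill(dirty_by_disk, heap_capacity):
    """Return (disk, block) pairs in the order they would be loaded.

    dirty_by_disk: hypothetical mapping of disk name -> list of changed blocks.
    heap_capacity: hypothetical heap size, counted in blocks.
    """
    # One queue per disk that actually has changed blocks.
    queues = deque(
        (disk, deque(blocks)) for disk, blocks in dirty_by_disk.items() if blocks
    )
    heap = []
    while queues and len(heap) < heap_capacity:
        disk, blocks = queues.popleft()
        room = heap_capacity - len(heap)
        # Take at most one batch from this disk, then move on to the next.
        for _ in range(min(BATCH_PER_DISK, len(blocks), room)):
            heap.append((disk, blocks.popleft()))
        if blocks:
            # Disk still has changed blocks: back of the line.
            queues.append((disk, blocks))
    return heap
```

With two disks, one holding 20 changed blocks and one holding 5, the first disk contributes 16 blocks, the second contributes its 5, and then the first disk's remaining 4 are picked up on the next pass. No single busy disk can monopolise the heap.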
Blocks are then shipped from the heap to the VR Server on the remote site, with a maximum of 64 extents still “in-flight” that have not been acknowledged as written. As those blocks come back acknowledged, the agent is free to send more from the heap.
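The in-flight limit behaves like a simple sliding window. The sketch below is only a model of that idea, with a hypothetical `InFlightWindow` class and integer extent IDs standing in for real extents. The limit of 64 is the only value taken from the text.

```python
from collections import deque

MAX_IN_FLIGHT = 64  # from the text: at most 64 unacknowledged extents


class InFlightWindow:
    """Hypothetical model of the send window between the heap and the VR Server."""

    def __init__(self, pending, limit=MAX_IN_FLIGHT):
        self.pending = deque(pending)  # extents waiting in the heap
        self.in_flight = set()         # sent but not yet acknowledged
        self.limit = limit

    def send_ready(self):
        """Send as many pending extents as the window allows; return them."""
        sent = []
        while self.pending and len(self.in_flight) < self.limit:
            extent = self.pending.popleft()
            self.in_flight.add(extent)
            sent.append(extent)
        return sent

    def acknowledge(self, extent):
        """The remote site confirmed the write; free a slot in the window."""
        self.in_flight.discard(extent)
```

Starting with 100 pending extents, the first call to `send_ready()` ships 64 and then stalls. Each acknowledgement frees exactly one slot, so two acknowledgements let two more extents go out.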
The net result is that this is a much more efficient mechanism as we can load and send from a global heap rather than treating each VM as its own object. Fundamentally this leads to a greater overall efficiency of the VR resource manager, and allows getting data to the VR Server faster.
The TCP algorithms at the source site have been changed to use a CUBIC-based transport. This is a fairly minor change, but it has a very good impact on long fat networks, the higher-latency yet still high-bandwidth connections that people often use for replication. CUBIC uses much smarter means of determining factors like TCP window size, based on accelerating probing over time and specifically looking at factors like time since the last congestion event. It also sizes the TCP window independently of ACKs.
All around this makes things much more efficient for data sends across higher latency networks, where bandwidth is less an issue than RTT.
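To make the "time since last congestion event" point concrete, here is the window growth function from the published CUBIC specification (RFC 8312), sketched in Python. The constants C = 0.4 and beta = 0.7 are the RFC defaults. Whether VR uses exactly these values is not something I can confirm, so treat them as an assumption.

```python
C = 0.4     # RFC 8312 default scaling constant (assumed, not confirmed for VR)
BETA = 0.7  # RFC 8312 default multiplicative decrease factor (assumed)


def cubic_window(t, w_max):
    """Congestion window (in segments) t seconds after the last congestion event.

    w_max is the window size just before that congestion event.
    """
    # K is the time the curve needs to climb back up to w_max.
    k = (w_max * (1 - BETA) / C) ** (1 / 3)
    return C * (t - k) ** 3 + w_max
```

Note that the function takes elapsed time as its input rather than a count of received ACKs, which is why growth does not slow down as RTT increases. Right after a loss (t = 0) the window is BETA times the old maximum. It then climbs quickly, flattens out as it approaches the old maximum, and probes beyond it with increasing speed.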
Recipient VR Appliance Changes
Vast improvements have been made to the way the vSphere Replication Appliance receives and writes out the changed blocks, by making some small but very clever adaptations:
The biggest change is in the way the appliance sends its writes to disk with Network File Copy (NFC). We now use buffered writes instead of direct IO. Direct IO requires opening the target disk, writing an extent, waiting for the write acknowledgement, moving on to the next write, and so on. With buffered writes, the VRA instead opens the target disk in buffered mode and writes using NFC with a single sync flag at the end of the write. In essence these are async writes with a sync write at the end of each ‘transaction’ with the disk. This is a considerably quicker way for VR to do NFC, with no penalty in host performance, while still maintaining data integrity. It gets things to disk much quicker and provides a huge leap in performance, as we can now acknowledge a whole batch of writes with one transaction.
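The difference in acknowledgement pattern can be shown with an ordinary file as a loose analogy. This is not NFC and not real direct IO. It is just a Python sketch, with made-up function names, that uses `os.fsync` to stand in for "wait until the write is acknowledged". It relies on `os.pwrite`, so it assumes a Unix-like system.

```python
import os


def write_sync_per_extent(path, extents):
    """Old pattern (analogy): wait for every extent to be acknowledged."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    syncs = 0
    try:
        for offset, data in extents:
            os.pwrite(fd, data, offset)
            os.fsync(fd)  # block until this one extent is on disk
            syncs += 1
    finally:
        os.close(fd)
    return syncs


def write_buffered_then_sync(path, extents):
    """New pattern (analogy): buffer all writes, one sync per transaction."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        for offset, data in extents:
            os.pwrite(fd, data, offset)  # lands in the cache, returns quickly
        os.fsync(fd)  # a single sync acknowledges the whole batch
    finally:
        os.close(fd)
    return 1
```

Both functions leave identical data on disk. The first pays for one round trip to stable storage per extent, while the second pays for one per transaction.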
A further change is the use of “coalescing buffers” to consolidate contiguous blocks on the appliance before performing a single NFC stream, rather than handling each extent in isolation. In 5.1, for example, 128 contiguous 8k writes would be sent as one NFC transaction but would still be issued as 128 writes to the kernel. In 5.5, contiguous blocks are coalesced into a single write transaction that NFC issues to the kernel. This means less disk and host overhead, and again gets things to disk much quicker.
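The coalescing idea itself is simple enough to sketch in a few lines of Python. The `(offset, data)` tuple representation and the `coalesce` function are hypothetical, chosen for the example rather than taken from the appliance.

```python
def coalesce(extents):
    """Merge contiguous (offset, data) extents into larger single writes."""
    merged = []
    for offset, data in sorted(extents, key=lambda e: e[0]):
        if merged and merged[-1][0] + len(merged[-1][1]) == offset:
            # This extent starts exactly where the previous one ends: extend it.
            prev_offset, prev_data = merged[-1]
            merged[-1] = (prev_offset, prev_data + data)
        else:
            # There is a gap (or this is the first extent): start a new write.
            merged.append((offset, data))
    return merged


# The 5.1 vs 5.5 example from the text: 128 contiguous 8k extents.
extents = [(i * 8192, b"x" * 8192) for i in range(128)]
writes = coalesce(extents)
print(len(extents), "extents ->", len(writes), "write")  # 128 extents -> 1 write
```

Extents separated by a gap are left as separate writes, so only genuinely contiguous runs are merged.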
Coupling coalesced buffers with buffered writes and a larger amount of cached data gets much faster writes from the VR Appliance to the host’s target disk.
So that’s what has been changed to improve performance, but what can we now expect in terms of throughput? Coming up soon in another blog post, I’ll have some sample data from my labs, and a few warnings about the impact of this. As an anecdotal tease though, I’m seeing roughly 40Mbps for a single VM…