Chinese web giant Alibaba has reduced network outages by 92 percent, cut load balancing costs by 18.9 percent, and found ways to improve SmartNIC performance by offloading workloads to idle infrastructure.
The company revealed those outcomes in papers it will present at the SIGCOMM conference next week.
The reduction in network outages came from a technology Alibaba calls “ZooRoute” that its researchers describe [PDF] as “a fast failure recovery service that ensures global bypass in large-scale cloud networks within seconds.”
The paper describing ZooRoute explains that cloud operators’ networks will inevitably fail from time to time, and that strategies like fast rerouting and traffic engineering can take seconds and minutes respectively to restore traffic flows – too slow for many users.
“As a result, tenants are forced to develop their own recovery solutions, which typically involve redundant resources or protocol stack modifications, thereby increasing capital and operating expenses,” the paper argues.
The company claims its own ZooRoute tech can “instantly reroute traffic to a working path” by constantly probing for viable routes. If a failure occurs, ZooRoute is therefore aware of a route that will work, and switches to it ASAP. The paper says Alibaba Cloud has used ZooRoute for 18 months, and it has “significantly improved network reliability, reducing cumulative outage time by 92.71 percent.”
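The probe-then-switch idea can be sketched in a few lines of Python. This is an illustration only, not Alibaba's implementation: the `PathTable` class, its method names, and the path labels are all hypothetical, and the paper's actual probing and path-selection logic is not reproduced here.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class PathTable:
    """Hypothetical table of candidate paths and their latest probe results."""
    paths: List[str]                      # candidate paths, in preference order
    active: Optional[str] = None          # path currently carrying traffic
    healthy: Dict[str, bool] = field(default_factory=dict)

    def record_probe(self, path: str, ok: bool) -> None:
        # Probes run continuously, so a backup is known before any failure.
        self.healthy[path] = ok

    def backup(self) -> Optional[str]:
        # First path, other than the active one, whose last probe succeeded.
        for path in self.paths:
            if path != self.active and self.healthy.get(path, False):
                return path
        return None

    def on_failure(self) -> Optional[str]:
        # Mark the active path as failed and switch to a known-good backup.
        if self.active is not None:
            self.healthy[self.active] = False
        self.active = self.backup()
        return self.active


table = PathTable(paths=["path-a", "path-b", "path-c"], active="path-a")
table.record_probe("path-a", True)
table.record_probe("path-b", False)
table.record_probe("path-c", True)
print(table.on_failure())  # path-c
```

The point of the sketch is that the lookup at failure time is cheap because the probing work was done beforehand.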
Alibaba Cloud has also deployed a tool called Hermes that it says “reduces daily worker hangs by 99.8 percent and lowers the unit cost of L7 LB infrastructure by 18.9 percent.”
A paper [PDF] describing Hermes explains that the layer 7 load balancers clouds use to keep their networks humming “rely on I/O event notification mechanisms such as epoll to dispatch connections from the kernel to userspace workers,” but that this approach sometimes creates bottlenecks.
Alibaba’s solution uses eBPF, a technology that lets sandboxed programs run inside the Linux kernel, to filter demands from workers, work out which deserve priority, and schedule tasks accordingly.
“Hermes is well suited for cloud L7 LBs facing diverse and rapidly changing traffic patterns, where no single scheduling policy can optimally handle all tenant workloads,” the paper states. It reports that in production at Alibaba Cloud, Hermes has reduced the standard deviation of per-worker CPU utilization and connection counts by 90 percent and 99.4 percent, respectively, cut average daily worker hangs by 99.8 percent, and lowered the unit cost of infrastructure for its L7 LBs by 18.9 percent.
A third paper from Alibaba describes [PDF] “Nezha”, a distributed vSwitch load sharing system that works on SmartNICs – the CPU-equipped network cards that hyperscalers use to run networking and storage plumbing workloads so that CPUs can run tenants’ applications.
In the paper about Nezha, Alibaba admits that some of the virtual switches running on its SmartNICs are maxed out. Its solution is to find under-used SmartNICs and shift workloads to them.
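As a rough illustration of that approach, the Python sketch below pairs each maxed-out SmartNIC with an under-used one. It is a guess at the general shape of the idea, not Nezha's algorithm: the `plan_offload` function, the utilization thresholds, and the NIC names are all assumptions made for the example.

```python
from typing import Dict


def plan_offload(
    utilization: Dict[str, float],
    high: float = 0.9,   # assumed threshold for "maxed out"
    low: float = 0.5,    # assumed threshold for "under-used"
) -> Dict[str, str]:
    """Map each overloaded SmartNIC to a distinct under-used helper."""
    # Busiest first, so the most overloaded card gets the idlest helper.
    overloaded = sorted(
        (nic for nic, u in utilization.items() if u >= high),
        key=lambda nic: -utilization[nic],
    )
    idle = sorted(
        (nic for nic, u in utilization.items() if u <= low),
        key=lambda nic: utilization[nic],
    )
    # zip stops at the shorter list, so spare helpers are simply left alone.
    return dict(zip(overloaded, idle))


plan = plan_offload(
    {"nic-1": 0.97, "nic-2": 0.20, "nic-3": 0.65, "nic-4": 0.95, "nic-5": 0.35}
)
print(plan)  # {'nic-1': 'nic-2', 'nic-4': 'nic-5'}
```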
“The deployment cost of Nezha is only a small fraction of that required to deploy new devices,” the paper states, adding that Nezha has significantly improved performance and moved bottlenecks from the vSwitch to the VM kernel stack.
SIGCOMM commences on September 8th, in Coimbra, Portugal.
One notable feature of this year’s event is a keynote by distinguished computer scientist (and Register columnist) Bruce Davie, to celebrate his being chosen as the recipient of the annual SIGCOMM Award, in recognition of his lifetime contributions to the field of communication networks.
Bruce is the first Australian to win the award, which The Register’s APAC desk thinks is bloody brilliant. ®