Rice CS team reveals augmented queues at SIGCOMM 2023

Eugene Ng and CS PhD students introduce scalable traffic queues for data center network sharing

In the lab of Rice University Computer Science professor T. S. Eugene Ng, a team of researchers is working on projects that enable a robust and manageable global networked infrastructure for the future, one that will likely be based primarily in the cloud. A recent project, the Augmented Queue (AQ), will be presented in September in New York City, where Columbia University is hosting this year’s flagship conference of the ACM Special Interest Group on Data Communication (SIGCOMM).

Ng said, “The success of public clouds is largely due to the fact that they are shared by millions of tenants, but this sharing inevitably sacrifices performance stability and predictability. Consequently, one of the most pressing problems now is how tenant applications of fundamentally different designs that govern their resource consumption can co-exist without interference.”

Xinyu Crystal Wu, a CS Ph.D. student who has already published eight papers with Ng and other Rice CS researchers, is one of the first authors of this paper. Wu found this particular project both intriguing and challenging and chose a water dispenser analogy to illustrate the problem.

“Imagine we are at a large cafeteria or gym where many people are trying to fill their water bottles at a single dispenser. Most people have regular-sized bottles, but occasionally some have giant containers that take a lot longer to fill. This type of delay results in a larger queue of waiting customers and everyone suffers from the growing congestion, even if their need is average or even small.

“If you prefer a road intersection analogy, then our Augmented Queue (AQ) paper shows how we can help intelligently control network traffic from different entities (types of vehicles) to ensure that large data transfers (buses) don’t monopolize the bandwidth. Regular traffic, like the cars of commuters, keeps flowing smoothly, and important entities such as emergency vehicles have higher priority and experience no congestion caused by the other entities.”

Zhuang Wang, a CS Ph.D. student who has published seven papers with Ng, shares the role of first author for this paper with Wu. He compared the problem AQ addresses to delivery fleets sending their vehicles out on the highway. Wang said adhering to a maximum quota (link capacity) for the number of cars that can drive on the highway simultaneously can avoid traffic congestion. Each company sending its fleet down the highway pays a toll for its car quota (specified bandwidth). The combined quotas of all the fleets must be equal to or smaller than the highway’s maximum quota.

“In previous solutions, the map app treats all cars from different fleets the same way. If one company floods the highway with too many vehicles, traffic congestion results, and all the companies are forced to slow down their fleets, which is unfair. With AQ, the map app can distinguish cars and monitor the number of vehicles from each company on the highway (the mathematical model). If the number of vehicles from one company exceeds its quota, the map app will notify only that company to slow down the departure rate of its vehicles,” said Wang.

For Weitao Wang, this AQ paper is the tenth research paper he has published with Ng. He is simultaneously wrapping up his Ph.D. and working as a student researcher for Google, but took time to explain how their tool attempts to isolate network traffic. “AQ isolates different types of traffic with a software solution rather than hardware solutions, and the unique benefits of a software solution are flexibility and scalability,” he said. 

“Physical queues are implemented as physical circuits on the chipset. More physical queues mean a more complex chip, higher temperatures, higher manufacturing costs, and a higher retail price, all of which prevent the physical queue solution from supporting datacenter-scale network systems.”

“With the help of an efficient and powerful mathematical model for network queueing, what we accomplished with AQ is providing a tool that enables flexible and scalable bandwidth allocation. But ‘how’ to use AQ to assign bandwidth resources to different entities is the duty of the operator who deploys AQ. Just like splitting a cake, the cloud operator is the one who splits the cake among millions of guests, while AQ is the knife and kitchen scale that get the job done precisely.”

Zhuang Wang took the explanation a step further. He said, “AQ uses a mathematical model to calculate the ‘augmented queue length’ based on the packet arrival rate and the specified bandwidth, regardless of the link capacity.”

“The augmented queue length of each tenant is calculated separately; the individual queue only depends on its own traffic. The augmented queue length begins to accumulate when the packet arrival rate exceeds the specified bandwidth and then begins to ‘drain’ when the arrival rate is smaller than the specified bandwidth. When a threshold is breached, AQ generates congestion signals as notifications for individual end hosts to control their traffic rates,” he said.

“These signals can be some bits tagged in packets or direct packet drops. After receiving these signals, the end hosts are aware of ‘network congestion’, which might not be real congestion in the network but reflect that their traffic rates exceed the specified bandwidth; they then take corresponding actions to limit the traffic rates according to the applied congestion control algorithm.”
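To make the mechanism Wang describes concrete, here is a minimal Python sketch of a per-tenant augmented queue. It is an illustration only, not the team's implementation: the class name `AugmentedQueue`, the fixed one-second update interval, the byte units, and the threshold values are all assumptions chosen for readability, and a real deployment would presumably do this bookkeeping per packet inside the switch.

```python
from dataclasses import dataclass


@dataclass
class AugmentedQueue:
    """Hypothetical per-tenant virtual queue (names and units are illustrative)."""
    specified_bandwidth: float  # bytes per second purchased by this tenant
    threshold: float            # virtual backlog in bytes that triggers a signal
    length: float = 0.0         # current augmented queue length in bytes

    def on_interval(self, arrived_bytes, interval_s=1.0):
        """Update the virtual backlog; return True if a congestion signal is due."""
        allowance = self.specified_bandwidth * interval_s
        # Accumulates when arrivals exceed the specified bandwidth, drains when
        # they fall below it, and never goes negative. Link capacity is not used.
        self.length = max(0.0, self.length + arrived_bytes - allowance)
        return self.length > self.threshold


def run(queue, arrivals):
    """Feed a list of per-interval arrival volumes through one tenant's queue."""
    return [queue.on_interval(a) for a in arrivals]


# Two tenants with the same specified bandwidth; each queue sees only its own traffic.
tenant_a = AugmentedQueue(specified_bandwidth=100.0, threshold=150.0)
tenant_b = AugmentedQueue(specified_bandwidth=100.0, threshold=150.0)

signals_a = run(tenant_a, [100, 200, 200, 50, 0, 0])   # bursts above its quota
signals_b = run(tenant_b, [50, 50, 50, 50, 50, 50])    # stays under its quota

print(signals_a)  # [False, False, True, False, False, False]
print(signals_b)  # [False, False, False, False, False, False]
```

In this toy run, tenant A is signaled only in the interval where its accumulated excess crosses the threshold, and its queue drains back to zero once it slows down. Tenant B never receives a signal, no matter what tenant A does. How an end host reacts to a signal (a marked bit or a dropped packet) is left to its congestion control algorithm and is not modeled here.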

More efficient management of bandwidth allocation among tenants is critical because increasing bandwidth is not possible or reasonable in many cases. Public clouds face the same resource consumption and cost concerns as other tech companies and as industries like energy, manufacturing, and supply chains.

Crystal Wu returned to her road intersection analogy to explain. She said, “The number of tenants in a data center is several orders of magnitude greater than the number of physical queues available in today’s switches. It is not practical to continue adding physical queues, just like it is impossible to fit another intersection on top of the one near Rice University at Main and Sunset. Instead, AQ can provide precise bandwidth guarantees to millions of tenants, regardless of the limited number of physical queues in switches.”

Ng is proud of the results and the team. He said, “Guiding my Ph.D. students has been and always will be the best part of my job. It's amazing to work with the team from an initial spark of a potential idea to a complete understanding of its properties and benefits to real applications. It is immensely gratifying to see the team taking more and more control of the research direction until they completely own the work.”

Carlyn Chatfield, contributing writer