Photo caption: Noah Harding Assistant Professor of Computer Science Anastasios Kyrillidis directs the Optimization Lab (OptimaLab) at Rice University.
Computer scientists solve problems by describing a challenge in mathematical terms and then devising an algorithm to compute a solution. The more complex the problem, the slower the algorithm typically runs. The Optimization Lab (OptimaLab), led by Anastasios Kyrillidis at Rice University, specializes in optimization: making very complicated calculations more efficient.
“My research is about efficiency in problem solving, focusing on the optimization part,” said Kyrillidis, director of the lab and the Noah Harding Assistant Professor of Computer Science at Rice. “For a given goal such as ‘minimize a cost’ or ‘maximize a gain,’ my aim has always been to determine if there is a more efficient way to find a ‘good’ solution.
“We can see the success of organizations like Google, OpenAI, DeepMind, FAIR, and others as they develop and apply new algorithmic solutions to problems that impact our daily lives. My passion has always been rethinking how existing algorithmic solutions have been used thus far, and then devising new ways to achieve the same or better performance, more efficiently.”
Optimization is an ongoing area of interest for many computer scientists in academia, industry, and national research centers. Kyrillidis said it is at the heart of recent computational advances in machine learning and artificial intelligence, and he believes it will be central to future developments in fields like quantum computing. Along with a team of Rice graduate and undergraduate students who share his vision, Kyrillidis is pushing the envelope with a variety of optimization tools and strategies.
“Recent outcomes from my lab all revolve around efficiency in optimization, but they follow different routes,” said Kyrillidis.
“For example, our work on Independent Subnet Training focuses on how we can further unlock higher efficiency when distributed computational resources are available; that work focuses on neural networks, yet we believe that it applies in many more machine learning problems and settings.
“Similarly, the work presented at conferences on the intersection of Machine Learning and Systems (MLSys) and the International Conference on Acoustics, Speech and Signal Processing (ICASSP) focuses on hyperparameter tuning, which is a major hurdle for machine learning and artificial intelligence (AI) practitioners. Algorithms do not always run smoothly in their default settings, and we often spend days or even weeks tuning parameters to get a good result, or to improve the result before applying the algorithm to a different problem. Implementing more universal rules could reduce the ‘idle’ time for the user, which would free up cycles for other work.”
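For readers who want a concrete picture of what that tuning looks like in practice, the following minimal Python sketch (illustrative only, not the lab's code) sweeps a few candidate learning rates on a toy least-squares problem and keeps the one that scores best; the toy objective simply stands in for a real training run.

```python
# A minimal, self-contained sketch of manual hyperparameter search: try a few
# learning rates on a toy least-squares problem and keep the best performer.
# The toy objective stands in for a real training run.
import numpy as np

def train_and_evaluate(lr: float, steps: int = 100) -> float:
    """Run gradient descent on a toy least-squares problem; return the negative final loss."""
    rng = np.random.default_rng(0)
    A = rng.normal(size=(50, 10))
    b = A @ rng.normal(size=10)               # targets from a known linear model
    x = np.zeros(10)
    for _ in range(steps):
        x -= lr * A.T @ (A @ x - b)           # gradient step on 0.5 * ||Ax - b||^2
    return -float(np.sum((A @ x - b) ** 2))   # higher score = lower loss

candidates = [1e-4, 1e-3, 1e-2]
best_lr = max(candidates, key=train_and_evaluate)
print("best learning rate on the toy problem:", best_lr)
```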
Two other approaches his team has explored in their mission to improve problem-solving efficiency are algorithmic acceleration (can the algorithm be changed to make it faster?) and efficiency through sparsity (do all the parameters require tuning, or will a small subset be sufficient to save computing resources and time?).
Independent Subnetwork Training for Distributed Computing
Improving the efficiency of neural networks (NNs) is a research area that has been prioritized with grant funding by the National Science Foundation (NSF) in collaboration with Intel to combat the increasing costs of larger and larger NNs. At Rice, Kyrillidis, Computer Science Department Chair Chris Jermaine, and Electrical and Computer Engineering Assistant Professor Yingyan Lin received an NSF-Intel grant to develop optimization strategies for neural networks deployed on smaller-capacity devices or in environments with limited stability, energy, or computational resources.
“Today’s smartphones, smartwatches, home appliances and drones all require machine learning to function. The software on these devices is constantly learning to provide better service and support. There is a real need for systems and algorithms that allow edge-based training and inference to be carried out in a timely fashion on these computation-efficient, communication-light, and energy-limited platforms,” said Kyrillidis.
“Already, one outcome of the NSF-Intel grant-funded research is a novel algorithmic way to carry out large-scale computations more efficiently, even on low-resource devices. Chen Dun and Cameron Wolfe — along with Binhang Yuan in Chris’ lab — took the initial concept we had proposed, put a lot of thought into how the algorithms and their implementation would look, and proved theoretically that the initial idea makes sense. Most of the subsequent efforts in my group will build on these foundational contributions, making their work indispensable.”
ResIST
Optimization first captivated Chen Dun’s attention when he was introduced to deep learning and NNs; he realized that a seemingly simple and straightforward optimization technique (simple gradient descent) could help a neural network solve really complex tasks.
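For context, “simple gradient descent” repeatedly nudges the parameters in the direction that decreases the loss. A bare-bones Python illustration on a toy function (not the lab's code) looks like this:

```python
# Plain gradient descent on a toy one-dimensional problem; neural-network
# training repeats the same update with gradients computed by backpropagation.
def gradient_descent(grad, x0, lr=0.1, steps=100):
    x = x0
    for _ in range(steps):
        x = x - lr * grad(x)      # step against the gradient
    return x

# Minimize f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
print(gradient_descent(lambda x: 2 * (x - 3), x0=0.0))   # approaches 3.0
```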
Dun said, “Two questions — ‘How does it work?’ and ‘Can we improve upon the simple method?’ — continue motivating me to further research optimization. In our ResIST paper, we describe an optimization method that could be used in edge-computation tasks such as training an image/photo classifier on mobile phones.”
“Due to privacy concerns, centralized training has been replaced by on-device training. While current mobile phones have increased computation power to support on-device training, the devices still have hard limits on memory and computation power. The same limits apply to the communication bandwidth of mobile phones. Optimizations like ResIST could fully utilize the limited computation power and communication bandwidth to more efficiently classify images with labels such as flower, building, etc.”
Independent Subnet Training
Cameron Wolfe was researching gradient-free optimization with the neural networks group at the University of Texas at Austin when he met Kyrillidis, a Simons Foundation postdoctoral researcher there. When Kyrillidis moved to Rice to launch the OptimaLab, Wolfe was invited to join the team. He said, “Working with Tasos on more practical approaches to optimization excites me because there are so many applications in industry and everyday life. Pretty much everything in the world can be modeled as some form of optimization problem.”
Wolfe helped develop a recent method for optimizing work, detailed in the paper Distributed Learning of Fully Connected Neural Networks using Independent Subnet Training (IST), that can be compared to job-sharing.
“One way of getting work done in an organization is using a sort of delegation or decentralized command structure,” said Wolfe. “When a team works together, different jobs are assigned to different members; the leader of the team cannot do everything on their own. Similarly, IST divides the neural network into several disjoint groups that are trained separately and in isolation. Thus, portions of the NN are delegated to different servers for training, allowing the high-level process of training the entire network to be split and assigned to separate compute sites.
“Right now, the scale of deep learning research is massive. As such, people typically use massive cloud computing clusters to train deep neural networks — which still takes a really long time. IST is a fundamental development that can reduce time and computation for this type of distributed training, thus resulting in less energy usage, acceleration in the research process, etc. IST can benefit any situation that trains neural networks in a distributed fashion.”
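A highly simplified NumPy sketch of that partitioning idea, not taken from the paper's implementation, splits the hidden neurons of a one-hidden-layer network into disjoint subnets, trains each in isolation as if on its own worker, and writes the updated pieces back into the full model. The full method also addresses how data is distributed across workers and how the trained pieces are aggregated, which this sketch omits.

```python
# A toy IST-style loop: hidden neurons are partitioned into disjoint subnets,
# each subnet is trained in isolation, and its weights are written back.
# Data sharding across workers and aggregation details are omitted for brevity.
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_workers = 8, 32, 4

W1 = rng.normal(scale=0.5, size=(n_in, n_hidden))   # input-to-hidden weights
W2 = rng.normal(scale=0.5, size=(n_hidden, 1))      # hidden-to-output weights

X = rng.normal(size=(256, n_in))                    # toy regression data
y = np.tanh(X @ rng.normal(size=(n_in, 1)))

def local_train(W1_part, W2_part, X, y, lr=0.05, steps=50):
    """Train one subnet (a slice of hidden neurons) in isolation."""
    for _ in range(steps):
        h = np.tanh(X @ W1_part)                    # activations of this subnet only
        err = h @ W2_part - y
        grad_W2 = h.T @ err / len(X)
        grad_W1 = X.T @ (err @ W2_part.T * (1 - h ** 2)) / len(X)
        W1_part -= lr * grad_W1
        W2_part -= lr * grad_W2
    return W1_part, W2_part

for _ in range(10):                                 # communication rounds
    perm = rng.permutation(n_hidden)                # re-partition neurons each round
    for part in np.array_split(perm, n_workers):    # one disjoint subnet per worker
        W1[:, part], W2[part, :] = local_train(W1[:, part].copy(),
                                               W2[part, :].copy(), X, y)
```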
Shallow Neural Network Training with Randomly Masked Neurons
“I was captivated by the power of algorithms, but I also tried to look for the underlying beauty of mathematical logic in them,” said Fangshuo Jasper Liao, explaining how his interest in AI and ML gradually grew throughout his undergraduate years as a double major in computer science and mathematics at Rice. When he expressed interest in the OptimaLab research, Kyrillidis invited Liao to participate in a project.
Now a second-year Ph.D. student in the OptimaLab, Liao is publishing a paper on the convergence of shallow neural network training with randomly masked neurons in the Transactions on Machine Learning Research (TMLR) journal. His research aligns with Kyrillidis’ NSF-Intel grant and aims to fill the gap in theoretical work for various NN training methods.
Liao said, “Broadly speaking, deep learning methods do achieve significant levels of performance, but the majority of these methods remain a black box: we don’t know why our methods work, and under what circumstances they work. Previous theoretical work has attempted to explain this mystery. The problem is, most theoretical work applies only to the simplest training method, gradient descent, while current advancements have produced many deep learning training methods to satisfy a variety of needs, e.g., distributed methods to train larger models, regularization techniques to improve generalization ability, etc.
“Our work identifies a central concept for many of those training methods — the idea of training the smaller subnetworks that lie within bigger neural networks. We propose a framework that generalizes many training methods and carry out our analysis on this framework. In this way, we can provide theoretical support to multiple existing methods.”
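A schematic NumPy sketch of the kind of training such an analysis covers, not drawn from the paper itself, samples a random subset of hidden neurons at every step and updates only that subnetwork:

```python
# A schematic example of training with randomly masked neurons (not the paper's
# analysis): at every step a random subset of hidden units is kept active, and
# only that subnetwork's weights are updated, much like dropout-style training.
# The fixed random output layer `a` is a simplification common in theory papers.
import numpy as np

rng = np.random.default_rng(1)
n_in, n_hidden = 4, 64
W = rng.normal(scale=1.0 / np.sqrt(n_in), size=(n_hidden, n_in))  # hidden-layer weights
a = rng.choice([-1.0, 1.0], size=n_hidden)                        # fixed output weights

X = rng.normal(size=(128, n_in))
y = np.sin(X[:, 0])                                               # toy regression target

keep_prob, lr = 0.5, 0.1
for step in range(500):
    mask = rng.random(n_hidden) < keep_prob            # random subnetwork for this step
    h = np.maximum(X @ W[mask].T, 0.0)                 # ReLU activations of kept neurons
    pred = h @ a[mask] / (keep_prob * n_hidden)
    err = pred - y
    # Gradient with respect to the kept rows of W only; masked neurons are untouched.
    grad = ((err[:, None] * (h > 0)) * a[mask]).T @ X / (keep_prob * n_hidden * len(X))
    W[mask] -= lr * grad

pred_full = np.maximum(X @ W.T, 0.0) @ a / n_hidden    # prediction of the full network
print("final training loss:", float(np.mean((pred_full - y) ** 2)))
```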
Kyrillidis said Liao’s work may serve as a first step for deep learning researchers to understand more about when, where, and why the methods that they use are working. Providing the rationale and mathematical logic behind these existing methods may also facilitate further research into identifying more advanced techniques.
Hyperparameter tuning in neural networks
Although the Kyrillidis team members are excited about developing new algorithms to solve known problems, they are also keen to improve existing algorithms — tuning hyperparameters to get the most performance out of the algorithm. John Chen and Cameron Wolfe are researching parameters at the meta-algorithmic level — the tunable parameters common to most algorithms that could be honed to better optimize multiple machine learning models.
REX: Revisiting budgeted training with an improved schedule
John Chen admires the technical challenges of increasingly complex forms of AI: machine learning (ML), deep neural networks (DNNs), and generative adversarial networks (GANs). He is fascinated by their ability to solve problems at scale, faster and better than humans.
“For example, there are implementations in agriculture which optimize yield and analyze ripeness, or in logistics to optimize planning, or in healthcare to aid physicians in efficiency. While only sensational or controversial advancements make the news, there are a lot of subtle, under-the-radar implementations that help make our everyday life better,” said Chen.
“But one of today’s biggest challenges in ML is the widening gap in computational resources. Some of the state-of-the-art models are trained with resources that would take the average lab many lifetimes to train! The issue here is not only how it is becoming more difficult for most ML folks to contribute to top-of-the-line research, but also how it is becoming more difficult for the average ML practitioner to implement these models. As a result, there is a growing need for efficient and faster training methods. The paper we presented at the MLSys conference introduces a resource, REX, to help address this need.”
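REX is a learning-rate schedule designed for this budgeted setting. The sketch below illustrates only the general idea of a budgeted schedule, where the rate is a function of how much of the training budget has been consumed; it uses a plain linear decay rather than the REX profile defined in the MLSys paper.

```python
# An illustrative budgeted learning-rate schedule (not the REX formula itself,
# which is defined in the paper): the rate depends only on the fraction of the
# training budget already used, so the schedule ends exactly when the budget does.
def budgeted_lr(step: int, total_budget: int, base_lr: float = 0.1) -> float:
    progress = step / total_budget        # fraction of the budget consumed, in [0, 1)
    return base_lr * (1.0 - progress)     # simple linear decay to zero at the budget

# Example: a 1,000-step training budget.
schedule = [budgeted_lr(t, 1000) for t in range(1000)]
print(schedule[0], schedule[500], schedule[-1])   # 0.1 at the start, 0.05 halfway, ~0 at the end
```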
Demon: Improved Neural Network Training with Momentum Decay
Like Chen, Wolfe enjoys tuning parameters such as scheduling to improve optimization for deep networks. He said that while all state-of-the-art results in deep learning are achieved using a curated learning rate schedule, choosing the correct schedule and how to implement it varies with the application. The learning rate schedule used for image classification will be very different from the schedule used for language modeling.
Tweaking the learning rate by adjusting its schedule helps direct the model’s focus by slowing it down at a certain point. Wolfe said, “The idea behind learning rate decay is to lower the neural network's learning rate as you get deeper into training. This controls the balance between exploration and exploitation. At first, you allow the model to explore the optimization landscape with a larger learning rate. Then, as it finds better solutions (later in the optimization process), you lower the learning rate so that the model can explore this portion of the optimization landscape more granularly.”
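A toy illustration of that trade-off, not tied to any particular schedule from the lab's papers, runs stochastic gradient steps on a noisy one-dimensional problem and decays the learning rate with a standard cosine schedule so that early steps explore widely and later steps settle down:

```python
# A toy view of the exploration/exploitation trade-off: stochastic gradient
# steps on a noisy one-dimensional problem, with a cosine-decayed learning rate
# so early steps roam widely and later steps settle near the minimizer.
import math
import random

def noisy_grad(x):
    return 2 * (x - 3) + random.gauss(0, 1)   # gradient of (x - 3)^2 plus noise

x, steps, base_lr = 0.0, 500, 0.5
for t in range(steps):
    lr = base_lr * 0.5 * (1 + math.cos(math.pi * t / steps))   # decays from 0.5 to ~0
    x -= lr * noisy_grad(x)

print(x)   # close to the minimizer 3.0; with a constant rate it keeps bouncing around it
```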
Wolfe then applied the idea behind learning rate decay to another parameter, momentum. In general, momentum performs smoothing on a neural network's optimization trajectory.
“Intuitively then, using high values of momentum early during training smooths the optimization trajectory and keeps the neural network from sporadically ‘bouncing around’ different solutions, while lower momentum during the later stages of training allows the neural network to update its parameters freely during the network training stages when the optimization landscape is more flat or well-behaved (because you are reaching a final solution),” said Wolfe.
“Improved neural network training with decaying momentum, or Demon, is work we presented in an ICASSP paper. Demon can be used pretty much anywhere. It is a fundamental proposal for how you should train deep networks. This is a somewhat common theme in our lab — we like to find simple, generally applicable ideas that are useful to practitioners in our field.”
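A simplified sketch of that decaying-momentum idea follows; it uses a plain linear decay rather than the exact Demon rule from the ICASSP paper, purely to show how the momentum coefficient can be driven from a high initial value toward zero over the course of training.

```python
# A simplified decaying-momentum loop: the momentum coefficient starts high to
# smooth the trajectory and is driven toward zero over training. A plain linear
# decay is used here; the exact Demon rule in the ICASSP paper differs.
def sgd_with_momentum_decay(grad, x0, base_lr=0.05, beta0=0.9, steps=400):
    x, velocity = x0, 0.0
    for t in range(steps):
        beta = beta0 * (1 - t / steps)        # high momentum early, near zero late
        velocity = beta * velocity - base_lr * grad(x)
        x = x + velocity
    return x

# Example: minimize (x - 3)^2, whose gradient is 2 * (x - 3).
print(sgd_with_momentum_decay(lambda x: 2 * (x - 3), x0=0.0))   # approaches 3.0
```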
Acceleration vs. Sparsity
One of the ways algorithms can be optimized is by speeding them up, which requires understanding the algorithmic motions that lead to acceleration. Another way to optimize an algorithm is to make it leaner; sparsity is the strategy for “killing” trainable parameters in order to reduce the complexity of the problem while retaining the overall performance. In the OptimaLab, Junhyung Lyle Kim is researching acceleration and Cameron Wolfe is focused on sparsity. Both presented papers at the Learning for Dynamics and Control (L4DC) Conference.
Accelerate with SPPAM
Kim said when he is studying optimization, he often feels as if he is learning a secret language that can be applied to almost any quantitative setting, such as minimizing the expenditures of a company or maximizing the return of a stock portfolio. For optimization algorithms to work properly, some parameters have to be determined based on function class, problem structure, and many more considerations. One of the parameters that can be tuned or customized is the “step size” — the part that determines how aggressively the optimization algorithm makes progress.
Humans who enjoy driving standard transmissions use a manual shift to customize their vehicle’s performance. Lower gears make the car easier to handle but the overall speed is slow. Speeding up by aggressively shifting to a higher gear too soon can stall the engine. An automatic transmission solves the problem by shifting gears at pre-scheduled points, and Kim believes a good optimization algorithm should function similarly.
“Users should be able to simply ‘plug-in’ an optimization algorithm and achieve reasonable results,” said Kim.
With his paper on SPPAM (stochastic proximal point algorithm with momentum), he addresses the challenge of identifying step size to better optimize and accelerate algorithm performance.
He said, “If the step size is too small, it can extend indefinitely the overall computation time for the optimization algorithm to converge. On the other hand, if the step size is too large, it can make the whole procedure diverge. In this regard, the users of optimization algorithms spend the majority of their time figuring out the right step size to use for their own problems at hand.
“SPPAM was an attempt to exactly address this challenge: it is a fast algorithm that works robustly with a wide range of step sizes, for some class of functions. Going forward, the same philosophy governs my optimization research: I believe a well-developed optimization algorithm should work reasonably well without requiring too much a-priori knowledge (e.g., properties of the function class) and without requiring too much fine-tuning of the hyperparameters.”
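The sensitivity Kim describes is easy to see on a toy problem. The sketch below runs plain gradient descent on f(x) = x² with three step sizes; SPPAM itself is a stochastic proximal point method with momentum and is not reproduced here.

```python
# Plain gradient descent on f(x) = x^2 (gradient 2x) with three step sizes,
# showing how a too-small step barely moves while a too-large step diverges.
def run_gd(lr, steps=50, x0=10.0):
    x = x0
    for _ in range(steps):
        x -= lr * 2 * x
    return x

for lr in (0.001, 0.1, 1.1):
    print(f"step size {lr}: final iterate {run_gd(lr):.3e}")
# 0.001 barely moves (too slow), 0.1 reaches ~0 (converged), 1.1 blows up (diverged)
```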
While Kim is developing fast and robust algorithms that can be applied to many models, Wolfe is improving the models by getting rid of excess baggage. He said pruning is a method for making a neural network sparse.
Sparsity through i-SpaSP
“You start with a big neural network, then you use pruning to eliminate portions of this network without damaging its performance. Thus, the result is a smaller, more efficient neural network that still performs similarly to the original, large neural network,” said Wolfe.
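As a generic illustration of structured pruning (not the i-SpaSP algorithm itself), the sketch below ranks the hidden neurons of a one-hidden-layer network by a simple importance score and keeps only the strongest ones, producing a smaller network of the same form:

```python
# A generic structured-pruning sketch (not the i-SpaSP algorithm): rank the
# hidden neurons of a one-hidden-layer network by a simple importance score and
# keep only the strongest ones, yielding a smaller network of the same form.
import numpy as np

rng = np.random.default_rng(2)
n_in, n_hidden, n_out = 16, 100, 1
W1 = rng.normal(size=(n_in, n_hidden))       # input-to-hidden weights
W2 = rng.normal(size=(n_hidden, n_out))      # hidden-to-output weights

def prune_neurons(W1, W2, keep):
    """Keep the `keep` hidden neurons whose outgoing weights have the largest magnitude."""
    scores = np.abs(W2).sum(axis=1)          # importance score per hidden neuron
    kept = np.argsort(scores)[-keep:]        # indices of the strongest neurons
    return W1[:, kept], W2[kept, :]

W1_small, W2_small = prune_neurons(W1, W2, keep=20)
print(W1.shape, "->", W1_small.shape)        # (16, 100) -> (16, 20)
```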
One pruning method, greedy forward selection (GFS), was inspired by the conditional gradient algorithm — a popular algorithm for constrained optimization. Wolfe learned about the paper that proposed GFS during a Rice colloquium talk and began wondering if other well-known optimization algorithms, such as CoSaMP, might also work well for pruning.
“I found that we could achieve pruning results similar to those of GFS by using a CoSaMP-inspired pruning algorithm,” said Wolfe. “Actually, the CoSaMP-inspired pruning algorithm is a lot faster than the conditional gradient-inspired GFS algorithm.”
Wolfe already presented his findings in the L4DC paper, i-SpaSP: Structured Neural Pruning via Sparse Signal Recovery, but his pruning work is far from over.
“Pruning has tons of real-life applications — any scenario where you would want to use less memory or computation with your neural network,” said Wolfe. “Two examples include on-device applications like running an app on a mobile phone and applications where latency is a big concern (e.g., for self-driving cars, smaller/sparse neural networks make getting predictions from them fast, so that the car can respond to external stimulus more quickly).”
Conclusion
Reflecting on the accomplishments of his team members over the last six months, Kyrillidis beams like a parent on graduation day. He said, “I am proud of my team. We have arrived at a level where we work harmonically.”
“It has been both exciting and energy-consuming to start building a group that will implement my vision. The pandemic hit, and things were difficult for everybody. But I can feel only pride and honor to be surrounded by these colleagues. They each stood by and with everyone else when needed and kept the group together.”
“Beyond the pandemic, other adversities challenged our progress, but these students did all the heavy lifting and arrived at a research maturity that only can make me truly proud to be advising them. We have spent tons of time in discussion and my students have spent lots of time trying to implement their/our ideas. Kudos to them!”