The increasing demand for on-device artificial intelligence, coupled with growing concerns about data privacy, drives the need for efficient deep learning on edge devices. Prashanthi S. K., Saisamarth Taluri, and Pranav Gupta, along with their colleagues, address this challenge by developing Fulcrum, a novel system for optimising concurrent deep learning training and inference on edge accelerators. Their work recognises that existing hardware often lacks the flexibility to simultaneously support both tasks efficiently, requiring careful management of resources and power consumption. Fulcrum intelligently interleaves training and inference, dynamically adjusting power settings and batch sizes to maximise performance while adhering to strict latency and power constraints, and crucially, it achieves this with minimal need for costly and time-consuming profiling. This advancement promises to unlock the full potential of edge devices for a wide range of applications, from autonomous vehicles to personalised healthcare.
Optimizing Deep Learning Resource Management and Power
This research focuses on efficiently managing hardware resources for deep learning tasks, during both training and inference. The overarching goal is to find the best configuration of resources, namely CPU core counts, CPU and GPU frequencies, and memory frequency, that meets performance requirements while minimizing power consumption. This presents a classic optimization challenge, particularly in dynamic environments where workloads and resource availability change constantly. To achieve this, scientists developed a Gradient-based Multi-Dimensional Search (GMD) algorithm for training workloads. GMD explores possible configurations by iteratively adjusting resource settings, guided by the trade-off between performance and power.
It begins with an initial configuration and then moves in the direction that improves performance while reducing power usage, pruning the search space to focus on the most likely candidates. For inference workloads, the team proposed Active Learning-based Sampling (ALS). This method uses machine learning to predict the performance and power consumption of different configurations, intelligently selecting those most likely to improve the accuracy of its predictive model. A neural network is trained to predict performance and power, and the algorithm iteratively samples configurations, profiles them, and updates the network, allowing for efficient exploration of the configuration space and accurate prediction of optimal settings.
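As a rough sketch of how such an active-learning loop can work, the snippet below uses a small scikit-learn regressor in place of the paper's neural network, a synthetic profile() stand-in for on-device measurement, and a simple "pick the most novel prediction" acquisition rule; all names, values, and the acquisition heuristic are illustrative assumptions, not Fulcrum's implementation.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Hypothetical configuration space: (cpu_cores, cpu_freq, gpu_freq, mem_freq) in GHz.
candidates = np.array([[c, cf, gf, mf]
                       for c in (2, 4, 6, 8)
                       for cf in (0.5, 1.0, 1.5, 2.0)
                       for gf in (0.3, 0.6, 0.9, 1.2)
                       for mf in (0.8, 1.3, 1.8)])

def profile(cfg):
    """Placeholder for on-device profiling: returns (latency_s, power_w). Synthetic."""
    cores, cf, gf, mf = cfg
    latency = 1.0 / (0.2 * cores + cf + 2 * gf + 0.5 * mf)
    power = 2.0 * cores + 3.0 * cf + 5.0 * gf + 1.0 * mf
    return latency, power

# Seed the predictor with a few randomly chosen configurations.
rng = np.random.default_rng(0)
sampled = list(rng.choice(len(candidates), size=5, replace=False))
X = candidates[sampled]
y = np.array([profile(c) for c in X])          # columns: latency, power
model = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=2000, random_state=0)
model.fit(X, y)

# Active-learning loop: profile the configuration whose predicted (latency, power)
# is farthest from anything observed so far, then retrain the predictor.
# (Feature scaling is ignored here for brevity.)
for _ in range(10):
    preds = model.predict(candidates)
    dists = np.min(np.linalg.norm(preds[:, None, :] - y[None, :, :], axis=2), axis=1)
    dists[sampled] = -np.inf                   # never re-profile a sampled mode
    nxt = int(np.argmax(dists))
    sampled.append(nxt)
    X = np.vstack([X, candidates[nxt]])
    y = np.vstack([y, profile(candidates[nxt])])
    model.fit(X, y)
```

The loop keeps the profiling budget small while steadily widening the range of observed time and power values the predictor is trained on.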
Intelligent Time-Slicing for Edge Neural Networks
This study tackles the challenge of running deep neural network training and inference simultaneously on edge devices, such as Nvidia Jetson platforms. These devices often lack native support for concurrent GPU utilization and present a complex landscape of power modes. To optimize performance under power and latency constraints, researchers developed an intelligent time-slicing approach, formulating an optimization problem to interleave training and inference minibatches while maximizing training throughput. A key aim of this work is to minimize the need for costly profiling to achieve optimal configurations.
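One way such an interleaving problem might be written down is sketched below; the notation (power mode m, inference minibatch size b, throughput T, power cap, latency bound, arrival rate) is illustrative rather than the paper's own.

```latex
\begin{aligned}
\max_{m \in \mathcal{M},\; b \in \mathcal{B}} \quad & T(m, b)
    && \text{training minibatches completed per unit time} \\
\text{subject to} \quad & P(m, b) \le P_{\max}
    && \text{device power budget} \\
& L(m, b) \le L_{\max}
    && \text{inference latency bound, given arrival rate } \lambda
\end{aligned}
```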
To solve this optimization problem, the team proposed two strategies: Active Learning Sampling (ALS) and Gradient-descent based Multi-Dimensional Search (GMD). GMD rapidly explores the solution space, profiling between 10 and 15 power modes to arrive at a solution within 5-10 minutes for each problem configuration. In contrast, ALS profiles a larger set of 50-150 power modes, taking approximately 1.5 hours, but offers the potential to generalize to other problem configurations with varying power, latency, and arrival rates. Both strategies are integrated within a scheduler called Fulcrum to execute workloads.
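A minimal sketch of what such time-slicing could look like inside a scheduler loop follows; the helper functions, the queueing policy, and the "serve inference when a batch is full or a deadline looms" rule are assumptions for illustration, not Fulcrum's actual interfaces.

```python
import time
from collections import deque, namedtuple

Request = namedtuple("Request", "arrival payload")

def run_training_minibatch(batch):
    time.sleep(0.05)                     # stand-in for one training step on the GPU

def run_inference_minibatch(batch):
    time.sleep(0.01 * len(batch))        # stand-in for a batched inference call

def time_sliced_loop(train_batches, requests, infer_batch_size, latency_bound_s):
    """Interleave training and inference minibatches: serve inference when a full
    batch is queued or the oldest request nears its latency bound, otherwise spend
    the slice on training."""
    queue = deque(requests)
    for batch in train_batches:
        while queue and (
            len(queue) >= infer_batch_size
            or time.monotonic() - queue[0].arrival > 0.5 * latency_bound_s
        ):
            size = min(infer_batch_size, len(queue))
            run_inference_minibatch([queue.popleft() for _ in range(size)])
        run_training_minibatch(batch)

# Example: 20 dummy training minibatches, 8 requests already queued.
now = time.monotonic()
time_sliced_loop(range(20), [Request(now, i) for i in range(8)],
                 infer_batch_size=4, latency_bound_s=0.2)
```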
GMD operates within a four-dimensional solution space defined by CPU/GPU/memory frequencies and CPU core counts, which collectively determine the power mode. The method begins by profiling an initial power mode and uses this knowledge to prune the search space, iteratively selecting and profiling subsequent power modes. Researchers demonstrated that GMD’s use of domain knowledge to guide the search direction avoids incorrectly pruning viable candidates. They also investigated the relationship between GPU frequency and training time, revealing a non-linear correlation in which training time initially drops sharply as frequency increases and then plateaus, while power consumption rises steadily. These insights informed the development of GMD, enabling it to efficiently navigate the complex interplay between performance and power consumption.
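The following is a minimal sketch of a gradient-guided search over such a discrete four-dimensional space, assuming each axis takes a handful of values and a hypothetical profile() call returns (training time, power); it illustrates the coordinate-wise, pruned descent idea under a power cap, not GMD's exact rules.

```python
import itertools

# Discrete axes of the power-mode space (values are illustrative).
AXES = {
    "cores":    [2, 4, 6, 8],
    "cpu_freq": [0.7, 1.2, 1.7, 2.2],   # GHz
    "gpu_freq": [0.3, 0.6, 0.9, 1.2],   # GHz
    "mem_freq": [0.8, 1.3, 1.8],        # GHz
}

def profile(mode):
    """Placeholder for profiling one power mode on the device.
    Returns (training_time_s, power_w); synthetic model for illustration."""
    t = 10.0 / (0.3 * mode["cores"] + mode["cpu_freq"]
                + 2 * mode["gpu_freq"] + 0.5 * mode["mem_freq"])
    p = (1.5 * mode["cores"] + 2.5 * mode["cpu_freq"]
         + 5.0 * mode["gpu_freq"] + 1.0 * mode["mem_freq"])
    return t, p

def gmd_like_search(power_cap_w, max_profiles=15):
    """Coordinate-wise descent: from the current mode, profile one-step neighbours
    along each axis, keep the move that reduces training time while staying under
    the power cap, and stop when no move helps or the profiling budget runs out."""
    idx = {k: len(v) // 2 for k, v in AXES.items()}          # start mid-range
    best_t, best_p = profile({k: AXES[k][i] for k, i in idx.items()})
    profiles, improved = 1, True
    while improved and profiles < max_profiles:
        improved = False
        for axis, step in itertools.product(AXES, (-1, +1)):
            j = idx[axis] + step
            if not 0 <= j < len(AXES[axis]):
                continue
            cand_idx = dict(idx, **{axis: j})
            t, p = profile({k: AXES[k][i] for k, i in cand_idx.items()})
            profiles += 1
            if p <= power_cap_w and t < best_t:    # prune: keep only feasible, faster modes
                idx, best_t, best_p, improved = cand_idx, t, p, True
    return {k: AXES[k][i] for k, i in idx.items()}, best_t, best_p, profiles

print(gmd_like_search(power_cap_w=25.0))
```

The power cap here stands in for the feasibility checks described above; an inference latency bound could be used to prune candidates in the same way.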
Concurrent DNN Training and Inference on Edge Devices
This research presents a novel approach to concurrently managing Deep Neural Network training and inference on edge devices, addressing the limitations of current systems in sharing GPU resources and navigating a wide range of power modes. The work focuses on intelligently time-slicing these concurrent workloads to maximize performance while adhering to strict power and latency constraints, minimizing the need for extensive profiling. The core of their system is an optimization problem that interleaves training and inference minibatches, dynamically adjusting device power modes and inference minibatch sizes. To solve this complex optimization, scientists proposed two key strategies: GMD, a gradient descent-based search that profiles only a small number of power modes, and ALS, an Active Learning technique that identifies reusable, optimal power modes while also minimizing profiling costs.
Experiments demonstrate that both ALS and GMD outperform both simpler and more complex baseline methods, achieving success in 97% of cases and delivering solutions within 0.5% of the theoretically optimal throughput. The team further refined their approach with ALS, leveraging Active Learning to reduce the number of power modes requiring profiling. This technique builds an initial Neural Network model using a small set of randomly selected power modes and then iteratively selects additional modes based on their potential to diversify the observed power and time values. The resulting system constructs a partial Pareto front, a curve representing the trade-off between performance and power consumption, directly from the profiled data, eliminating prediction errors in the optimization process.
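A small sketch of reading a partial Pareto front directly off the profiled samples is shown below, assuming each sample is a (power, time, mode) tuple; it illustrates only the non-dominated filtering step, not the authors' code.

```python
def pareto_front(samples):
    """Return the profiled power modes that are not dominated, i.e. no other sample
    is both lower-power and faster. Sorting by power lets a single pass suffice."""
    front = []
    best_time = float("inf")
    for power, time_s, mode in sorted(samples):
        if time_s < best_time:          # strictly faster than every lower-power sample
            front.append((power, time_s, mode))
            best_time = time_s
    return front

# Example: (power_w, time_s, mode_id) triples collected by profiling.
samples = [(10.0, 5.0, "m0"), (12.0, 4.2, "m1"), (12.5, 4.6, "m2"),
           (15.0, 3.9, "m3"), (18.0, 3.8, "m4")]
print(pareto_front(samples))   # m0, m1, m3, m4 survive; m2 is dominated by m1
```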
Jetson Power and Task Optimisation
This research presents a novel approach to efficiently managing concurrent deep neural network training and inference on edge devices, specifically Nvidia Jetson platforms. Recognizing the limitations of existing systems in sharing GPU resources and navigating a vast number of power modes, the team developed an intelligent time-slicing method to optimize performance while adhering to strict power and latency constraints. The core of this achievement lies in formulating an optimization problem that carefully interleaves training and inference tasks, dynamically adjusting both the device’s power mode and the inference minibatch size. To solve this complex problem, researchers designed two innovative optimization strategies, GMD and ALS. GMD efficiently searches for optimal power modes with minimal profiling, while ALS leverages Active Learning to identify reusable, optimal power modes and minimize profiling costs.