This repository contains the source code for the LoudVA thesis project, a system for orchestrating distributed inference workloads across a heterogeneous cluster of NVIDIA Jetson devices. The system schedules tasks intelligently to optimize performance and energy consumption while meeting latency constraints.
More information about the thesis can be found here.
This project makes the following key contributions:
- **Distributed Heterogeneous Inference System:** A complete, end-to-end system for managing inference tasks across a cluster of mixed NVIDIA Jetson devices (Nano, Xavier NX, AGX Xavier) using the Triton Inference Server.
- **Intelligent Energy-Aware Scheduling:** The core of the project is the `LoudScheduler`, a custom scheduler that dynamically assigns inference requests to the most suitable device. It optimizes for both energy efficiency and latency constraints, making real-time decisions based on device profiles and the incoming workload.
- **Performance and Energy Prediction:** A machine learning model (`LoudCostPredictor`) that predicts the latency and energy consumption of different device configurations (GPU frequency, batch size). This allows the system to make informed scheduling decisions even for configurations that have not been profiled, enabling generalization to new devices and workloads (see the sketch after this list).
- **Comprehensive Automation:** The entire system is managed through a combination of a `Makefile` and Ansible playbooks, allowing one-command setup, deployment, configuration, and operational control of the entire cluster.
- **Extensive Experimentation & Profiling Framework:** The repository includes a full suite of scripts for:
  - **Performance Profiling:** Systematically measuring performance, power, and energy across different hardware settings.
  - **Experimentation:** Running automated, repeatable experiments to compare the custom scheduler against other common strategies (e.g., Round Robin, Fixed Batch Size) under various simulated workloads.
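As a rough illustration of the prediction idea, the sketch below trains two regressors on profiled (GPU frequency, batch size) measurements, one for latency and one for energy. The class name and feature layout are assumptions made for illustration; the actual `LoudCostPredictor` lives in `LoudController/` and may differ.

```python
# Illustrative sketch of a cost predictor trained on offline profiling data.
# The feature layout [gpu_freq_hz, batch_size] is an assumption, not the
# exact LoudCostPredictor implementation.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

class CostPredictorSketch:
    """Predicts per-batch latency (s) and energy (J) for a device configuration."""

    def __init__(self):
        self.latency_model = GradientBoostingRegressor()
        self.energy_model = GradientBoostingRegressor()

    def fit(self, configs, latencies_s, energies_j):
        # configs: rows of [gpu_freq_hz, batch_size] from offline profiling
        X = np.asarray(configs, dtype=float)
        self.latency_model.fit(X, latencies_s)
        self.energy_model.fit(X, energies_j)

    def predict(self, gpu_freq_hz, batch_size):
        # Returns (predicted latency, predicted energy), including for
        # configurations that were never profiled directly.
        X = np.array([[gpu_freq_hz, batch_size]], dtype=float)
        return self.latency_model.predict(X)[0], self.energy_model.predict(X)[0]
```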
LoudVA uses a controller-worker architecture:
- **LoudController:** A central control node that receives inference requests via a Flask API (`LoudServer`) and uses an intelligent scheduler (`LoudScheduler`) to dispatch tasks.
- **Worker Nodes (Jetsons):** Edge devices running NVIDIA's Triton Inference Server and a lightweight `WorkerController` that allows dynamic control of hardware settings such as GPU frequency (a minimal sketch follows this list).
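For illustration, a minimal `WorkerController`-style endpoint for setting the GPU frequency could look like the sketch below. The route name and the devfreq sysfs path are assumptions (the path differs per Jetson model); see `WorkerController/` for the actual implementation.

```python
# Minimal sketch of a WorkerController-style Flask app (illustrative only).
# The /gpu_freq route and the devfreq sysfs path below are assumptions; the
# actual node differs per Jetson model (Nano, Xavier NX, AGX Xavier).
from flask import Flask, jsonify, request

app = Flask(__name__)

# Hypothetical device-specific devfreq node; adjust for the target Jetson.
GPU_FREQ_NODE = "/sys/class/devfreq/17000000.gv11b/max_freq"

@app.route("/gpu_freq", methods=["POST"])
def set_gpu_freq():
    freq_hz = int(request.get_json(force=True)["freq_hz"])
    with open(GPU_FREQ_NODE, "w") as f:  # writing sysfs requires root
        f.write(str(freq_hz))
    return jsonify({"status": "ok", "freq_hz": freq_hz})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```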
| Path | Description |
|---|---|
| `LoudController/` | The core application with the API server, scheduler, device models, and predictors. |
| `├─ altSchedulers/` | Alternative scheduling algorithms for comparison. |
| `├─ Experiments/` | Scripts for running various performance and scheduling experiments. |
| `WorkerController/` | Lightweight Flask app deployed on each worker to manage device settings. |
| `ansible/` | Ansible playbooks for automating the setup and management of worker nodes. |
| `scripts/` | Helper scripts for setup, data processing, and system management. |
| `measurements/` | Raw and processed data from performance, power, and energy profiling. |
| `results/` | Output directory for experiment logs and reports. |
| `model_repository/` | Machine learning models configured for the Triton Inference Server. |
| `makefile` | The main entry point for installing, running, and managing the entire system. |
This project appears in the following publications:
We present a novel scheduling framework for energy-efficient video analytics at the edge. The proposed approach dynamically adjusts GPU frequency and batch size to optimize inference execution while ensuring QoS compliance. It integrates a priority queue for QoS-aware scheduling and leverages offline profiling and ML-based estimation to select energy-optimal configurations. Based on these configurations, it employs a priority-based scheduler with greedy batching and adaptive waiting strategies to efficiently dispatch inference requests. Experimental evaluation conducted on an NVIDIA Jetson Xavier AGX device demonstrates a ~28% improvement in frames served per joule compared to a baseline with no scheduler. This outperforms a static batch scheduler (32 frames per batch) by ~15% and a time-triggered scheduler (batch formation every 200 ms) by ~29%, while introducing a minimal increase in QoS violations (5%) relative to the best-performing policy.
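For readers curious how the priority-based scheduling with greedy batching and adaptive waiting described above might fit together, the following is a simplified sketch. The queue layout, batch limit, and wait bound are assumed names and values for illustration, not the exact `LoudScheduler` logic.

```python
# Simplified sketch of priority-based greedy batching with adaptive waiting.
# Queue layout, MAX_BATCH, and MAX_WAIT_S are illustrative assumptions.
import heapq
import time

MAX_BATCH = 32    # assumed maximum batch size
MAX_WAIT_S = 0.2  # assumed upper bound on the batch-formation wait

def schedule_loop(queue, dispatch, predict_latency):
    """queue: heap of (deadline, arrival_time, request); most urgent first."""
    while True:
        if not queue:
            time.sleep(0.001)  # nothing queued; poll again shortly
            continue
        deadline, _, _ = queue[0]
        # Adaptive waiting: keep the batch open only while the most urgent
        # request can still meet its deadline after the predicted latency.
        slack = deadline - time.time() - predict_latency(min(len(queue), MAX_BATCH))
        if len(queue) < MAX_BATCH and slack > 0:
            time.sleep(min(slack, MAX_WAIT_S))  # let more requests arrive
        # Greedy batching: take as many queued requests as fit in one batch.
        batch = [heapq.heappop(queue)[2] for _ in range(min(len(queue), MAX_BATCH))]
        dispatch(batch)
```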
This project is licensed under the MIT License. See the LICENSE file for details.