The AI vWS Sizing Advisor is designed to help you set up the right virtual environment for diverse AI use cases, such as sizing NIM microservices and different LLMs, and preparing environments for workloads like inferencing, RAG, and fine-tuning. By leveraging a RAG architecture, the Advisor takes inputs describing your AI workload and translates them into an exact, tested vGPU configuration.
Please refer to this guide to verify that all required software and development tools are properly installed and configured prior to initiating the deployment process.
- Introduction
- Prerequisites
- Deployment Guide
- Overview
- Key Features
- Target Audience
- Software Components
- Technical Diagram
- Available Customizations
- Inviting the community to contribute
- License
- Terms of Use
Running this toolkit on Linux-based virtual workstations requires the following:
- NVIDIA vGPU Software: vGPU version 17.4 or later
- Hypervisor: vGPU supported hypervisors
- VM Operating System: Ubuntu 24.04 or Ubuntu 22.04
- Minimum system requirements: 16 vCPU, 24 GB system memory, 96 GB storage
- Recommended vGPU profile: 24Q
- Download Docker for Ubuntu here (v20.10+)
- Download Docker Compose Plugin here
- Activate, download, and install your RTX Virtual Workstation licenses
- Join the NVIDIA Developer Program to access NVIDIA NIM for Developers
Important: Don't have an NVIDIA vGPU license yet? Request a free 90-day evaluation license
NVIDIA-Certified systems with any supported GPU that offers a 24Q profile
Note: Although this guide uses vCenter, NVIDIA AI vWS can be deployed on any NVIDIA vGPU-supported hypervisor. It's assumed that all vWS requirements, including licensing, are already configured.
Advisor Download: Set up a Linux VM for creating the Advisor with the following configuration:
- vCPUs - 16
- Memory - 96 GB
- vGPU Profile - 24Q
Verification Step: Set up a Linux VM based on the Advisor's recommendation. To validate that the VM is properly configured, run the following command:
nvidia-smi
At this point, the VM setup is complete. The installation guide for Ubuntu can be found here.
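If the command returns a table, a slightly more specific check can confirm the driver version, available framebuffer, and the vGPU license state. This is a minimal sketch using standard nvidia-smi options; the exact output depends on your driver version:

```bash
# Confirm the vGPU, driver version, and framebuffer visible inside the guest
nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv
# Check the vGPU software license status reported by the guest driver
nvidia-smi -q | grep -i license
```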
GitHub Repository: https://github.com/NVIDIA-AI-Blueprints/rag
- Clone the repository in your IDE's terminal:

  git clone https://github.com/anpandacoding/vws-sizing
  cd vws-sizing

- Within the shell, run the following commands (make sure you are at the top level of the workspace):

  export NGC_API_KEY="nvapi-your-key-here"
  # Authenticate to NVIDIA NGC Registry
  echo "${NGC_API_KEY}" | docker login nvcr.io -u '$oauthtoken' --password-stdin
  source deploy/compose/.env
  # Start core service
  ./scripts/start_vgpu_rag.sh --skip-nims
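Before continuing, it can help to confirm the core services actually started. The expected container names depend on the compose files in the repository, so treat this as a generic check rather than an exhaustive list:

```bash
# List running containers with their status
docker ps --format 'table {{.Names}}\t{{.Status}}'
```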
To Obtain NVIDIA Developer Program Membership and a Personal API Key:
- Visit the NVIDIA Developer Program page, click on Join and sign up for an NVIDIA account.
- Use the NVIDIA Developer Program credential to log into NVIDIA NGC Catalog
- Click the account name at the top right. In the drop-down menu, select Setup.
- Click on "Generate API Key" then click on "+ Generate Personal Key"
- Enter the key name and expiration. Under Services Included, make sure NGC Catalog is selected.
- Once your personal API key is generated, save it; it is required for accessing NVIDIA NIM microservices during the subsequent deployment steps.
- Start the local web server:

  docker compose -f deploy/compose/docker-compose-rag-server.yaml up -d
- Open your browser to http://localhost:3000
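If the page does not load, a quick reachability check from the VM itself can rule out networking issues (this assumes the UI is bound to port 3000 as configured above):

```bash
# Verify the web UI is listening locally
curl -sf http://localhost:3000 > /dev/null && echo "UI reachable" || echo "UI not reachable"
```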
- In the UI, select the AI workload type (Inference, RAG, etc.)
- Fill in model details and any other parameters needed
- Once the build is completed, click on the Citations tab (bottom left). Expand any entry with "Click to view details" to see the exact docs or benchmarks the AI used for its recommendation.
- vGPU Profile: Suggested vGPU (e.g., 48Q on L40S) based on model memory and concurrency
- GPU Memory: Required memory (including model, KV cache, overhead)
- System RAM: Calculated based on inference load and user count (rule of thumb: Model Memory × 2.5 + overhead)
- vCPUs: Computed from system RAM (e.g., 1 vCPU per 4 GB)
- Expected TTFT: Estimated Time to First Token
- Latency: Predicted performance metrics under your config
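As an illustration of the rules of thumb above, the sketch below walks through the System RAM and vCPU arithmetic for a hypothetical 16 GB model; the model size and overhead figures are placeholders, not Advisor output:

```bash
# Hypothetical sizing sketch (numbers are illustrative assumptions)
MODEL_MEMORY_GB=16                                          # e.g., an 8B-parameter model at FP16
OVERHEAD_GB=8                                               # assumed KV cache + runtime overhead
SYSTEM_RAM_GB=$(( MODEL_MEMORY_GB * 5 / 2 + OVERHEAD_GB ))  # Model Memory x 2.5 + overhead
VCPUS=$(( (SYSTEM_RAM_GB + 3) / 4 ))                        # 1 vCPU per 4 GB of system RAM, rounded up
echo "System RAM: ${SYSTEM_RAM_GB} GB, vCPUs: ${VCPUS}"     # prints: System RAM: 48 GB, vCPUs: 12
```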
To Obtain HuggingFace API Key + Access Permissions:
- You must select the first two options in the User Permissions section:
  - 'public gated repositories'
  - 'repos under your personal namespace'
- In the 'Apply Configuration' section, locate the input field.
- Paste or enter your credentials.
- The VM IP address must point to a VM that meets the Advisor's recommendation.
- Start the Environment Container.
- Apply Configuration.
This spins up the sandbox that runs your model microservice on the VM. Once the service container is started, you will receive a detailed log.
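If you want to verify the sandbox from the target VM, a minimal sketch follows; the VM address and container name are placeholders, and the actual container name is shown in the deployment log:

```bash
# On the target VM (directly or over SSH), confirm the microservice container is running
ssh user@<vm-ip> "docker ps --format 'table {{.Names}}\t{{.Status}}'"
# Follow the log of the container reported in the deployment output
ssh user@<vm-ip> "docker logs -f <container-name>"
```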
This blueprint serves as a reference solution for a foundational Retrieval Augmented Generation (RAG) pipeline with an integrated AI vWS Sizing Advisor. It combines two key capabilities:
- Enterprise RAG Pipeline: Enable users to ask questions and receive answers based on their enterprise data corpus.
- AI vWS Sizing Advisor: Provide intelligent, validated recommendations for NVIDIA vGPU deployments, including profile validation, capacity calculations, and deployment strategies.
By default, this blueprint leverages locally-deployed NVIDIA NIM microservices to meet specific data governance and latency requirements. However, you can replace these models with NVIDIA-hosted models available in the NVIDIA API Catalog.
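As a rough sketch of the hosted alternative, an NVIDIA API Catalog model can be called directly with an OpenAI-style request; the endpoint, model name, and NVIDIA_API_KEY variable below are assumptions to confirm against the catalog entry for the model you choose:

```bash
# Call an NVIDIA-hosted model from the NVIDIA API Catalog instead of a local NIM
curl https://integrate.api.nvidia.com/v1/chat/completions \
  -H "Authorization: Bearer ${NVIDIA_API_KEY}" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta/llama-3.1-8b-instruct",
        "messages": [{"role": "user", "content": "Hello"}],
        "max_tokens": 64
      }'
```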
- Multimodal data extraction support with text, tables, charts, and infographics
- Hybrid search with dense and sparse search
- Multilingual and cross-lingual retrieval
- Reranking to further improve accuracy
- GPU-accelerated Index creation and search
- Multi-turn conversations with opt-in query rewriting
- Multi-session support
- Telemetry and observability
- OpenAI-compatible APIs
- Decomposable and customizable
- Automatic validation of vGPU profiles against NVIDIA specifications
- Accurate VM capacity calculations based on GPU inventory
- Support for heterogeneous GPU configurations
- Intelligent recommendations for vGPU vs. passthrough modes
- Cost-efficiency and performance trade-off analysis
- Integration with official NVIDIA vGPU documentation
- Single pre-loaded knowledge base for simplified operation
This blueprint is for:
- IT System Administrators: Looking for validated vGPU configuration recommendations
- DevOps Engineers: Deploying virtualized GPU environments
- Solution Architects: Designing GPU-accelerated infrastructure
The following are the default components included in this blueprint:
- NVIDIA NIM Microservices
  - Response Generation (Inference)
    - nvcr.io/nim/meta/llama-3.1-8b-instruct:1.8.6 - the primary NIM microservice used for the AI vWS Sizing Advisor
  - Retriever Models
  - Optional NIMs
- RAG Orchestrator server - LangChain based
- Milvus Vector Database - accelerated with NVIDIA cuVS
- Ingestion - NVIDIA-Ingest is leveraged for ingestion of files. NVIDIA-Ingest is a scalable, performance-oriented document content and metadata extraction microservice. With support for parsing PDF, Word, and PowerPoint documents, it uses specialized NVIDIA NIM microservices to find, contextualize, and extract text, tables, charts, and images for use in downstream generative applications.
- File Types: File types supported by NVIDIA-Ingest are supported by this blueprint, including .pdf, .pptx, and .docx files containing images. Image captioning support is turned off by default to improve latency, so questions about images in documents will yield poor accuracy. Files with the following extensions are supported: .bmp, .docx, .html (treated as text), .jpeg, .json (treated as text), .md (treated as text), .pdf, .png, .pptx, .sh (treated as text), .tiff, .txt
We provide Docker Compose scripts that deploy the microservices on a single node. When you are ready for a large-scale deployment, you can use the included Helm charts to deploy the necessary microservices. You can use the sample Jupyter notebooks with the JupyterLab service to interact with the code directly.
The Blueprint contains sample data from the NVIDIA Developer Blog and also some sample multimodal data. You can build on this blueprint by customizing the RAG application to your specific use case.
We also provide a sample user interface named rag-playground.
The diagram represents the architecture and workflow. Here's a step-by-step explanation of the workflow from an end-user perspective:
- User Interaction via RAG Playground or APIs:
  - Users interact through the RAG Playground UI or APIs, sending queries about vGPU configurations or general knowledge base questions
  - For vGPU queries, the system automatically validates profiles and calculates capacities
  - The POST /generate API handles both the RAG and vGPU advisor functionality (see the example request after this list)
- Query Processing with Enhanced vGPU Support:
  - The RAG Server processes queries using LangChain
  - For vGPU queries, additional validation and calculation modules ensure accurate recommendations
  - Optional components like the Query Rewriter and NeMo Guardrails enhance accuracy
- Intelligent Document Retrieval:
  - The system maintains a unified vgpu_knowledge_base collection
  - For vGPU queries, it automatically includes baseline documentation and relevant specialized collections
  - The Retriever module identifies the most relevant information using the Milvus Vector Database
- Enhanced Response Generation:
  - Responses are generated using NeMo LLM inference
  - For vGPU configurations, additional validation ensures only valid profiles are recommended
  - Capacity calculations and deployment recommendations are included where relevant
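As a sketch of calling the POST /generate API mentioned in the first step, the request below assumes the RAG server is reachable on localhost port 8081 and accepts an OpenAI-style message list; both are assumptions, so check the OpenAPI specification referenced later in this guide for the authoritative schema:

```bash
# Hypothetical request to the RAG server's generate endpoint
curl -X POST http://localhost:8081/generate \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [{"role": "user", "content": "Which vGPU profile fits Llama 3.1 8B for 10 concurrent users?"}],
        "use_knowledge_base": true
      }'
```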
- Ubuntu 22.04 OS
- GPU Driver - 530.30.02 or later
- CUDA version - 12.6 or later
By default, this blueprint deploys the referenced NIM microservices locally. For this, you will require a minimum of:
- 24Q profile
The blueprint can also be modified to use NIM microservices hosted by NVIDIA in the NVIDIA API Catalog.
Following are the hardware requirements for each component. The reference code in the solution (glue code) is referred to as the "pipeline".
The overall hardware requirements depend on whether you Deploy With Docker Compose or Deploy With Helm Chart.
The NIM and hardware requirements only need to be met if you are self-hosting the NIM microservices with the default RAG settings. See Using self-hosted NVIDIA NIM microservices.
- Pipeline operation: 1x L40 GPU or similar recommended. It is needed for the Milvus vector database, as GPU acceleration is enabled by default.
- LLM NIM: NVIDIA llama-3.3-nemotron-super-49b-v1
  - For improved parallel performance, we recommend 8x or more H100s/A100s for LLM inference.
- Embedding NIM: Llama-3.2-NV-EmbedQA-1B-v2 Support Matrix
  - The pipeline can share the GPU with the Embedding NIM, but a separate GPU for the Embedding NIM is recommended for optimal performance.
- Reranking NIM: llama-3_2-nv-rerankqa-1b-v2 Support Matrix
- NVIDIA NIM for Image OCR: baidu/paddleocr
- NVIDIA NIMs for Object Detection:
- Follow the Deployment Guide to set up the AI vWS Sizing Advisor
- See the OpenAPI Specifications
- Explore notebooks that demonstrate how to use the APIs here
- Explore observability support
- Explore best practices for enhancing accuracy or latency
- For detailed deployment options, see Get Started
The following are some of the customizations that you can make after you complete the deployment:
- Change the Inference or Embedding Model
- Customize Prompts
- Customize LLM Parameters at Runtime
- Support Multi-Turn Conversations
- Enable Self-Reflection to improve accuracy
- Enable Query rewriting to Improve accuracy of Multi-Turn Conversations
- Enable Image captioning support for ingested documents
- Enable NeMo Guardrails for guardrails at input/output
- Enable hybrid search for milvus
- Enable text-only ingestion of files
- Customize vGPU Advisor Settings
NVIDIA NIM provides containers to self-host GPU-accelerated inferencing microservices for pretrained and customized AI models across clouds and data centers. NIM microservices expose industry-standard APIs for simple integration into AI applications, development frameworks, and workflows. Built on pre-optimized inference engines from NVIDIA and the community, including NVIDIA® TensorRT™ and TensorRT-LLM, NIM microservices optimize response latency and throughput for each combination of foundation model and GPU. NVIDIA NIM for Developer is the edition used in this toolkit.
The NIM microservices used in this toolkit:
- Llama 3.1 8B Instruct - Primary model for generating vGPU sizing recommendations
We're posting these examples on GitHub to support the NVIDIA LLM community and facilitate feedback. We invite contributions! To open a GitHub issue or pull request, see the contributing guidelines.
This NVIDIA AI Blueprint is licensed under the Apache License, Version 2.0. This project will download and install additional third-party open source software projects and containers. Review the license terms of these open source projects before use.
Use of the models in this blueprint is governed by the NVIDIA AI Foundation Models Community License.
This blueprint is governed by the NVIDIA Agreements | Enterprise Software | NVIDIA Software License Agreement and the NVIDIA Agreements | Enterprise Software | Product Specific Terms for AI Product. The models are governed by the NVIDIA Agreements | Enterprise Software | NVIDIA Community Model License, and the NVIDIA RAG dataset is governed by the NVIDIA Asset License Agreement.
The following models that are built with Llama are governed by the Llama 3.2 Community License Agreement: llama-3.3-nemotron-super-49b-v1, nvidia/llama-3.2-nv-embedqa-1b-v2, and nvidia/llama-3.2-nv-rerankqa-1b-v2.
