
Direct parsing for Nvidia devices from NVML #297

Open

EyupKeremBas7 wants to merge 10 commits into HSF:main from EyupKeremBas7:fix-issue-242

Conversation

@EyupKeremBas7
Contributor

Fixes #242.

As discussed in the issue by @graeme-a-stewart, this PR replaces the nvidia-smi subprocess parsing with direct calls to the NVML C API.

Key Changes & Architecture:

  • Build Option: As requested, GPU monitoring is now an opt-in build feature. Added the ENABLE_NVIDIA_GPU CMake flag (defaults to OFF to prevent build failures on non-GPU systems).

  • Code Style & RAII: I strictly maintained the original RAII pattern of the nvidiamon class (initializing NVML in the constructor and shutting it down in the destructor). The logging style and overall class structure were kept as close to the original as possible (for example, retaining the read_path parameter in the update_stats signature to preserve the existing interface, even though it is no longer in use).

  • Note on Loops: You might notice the use of classic index-based for loops instead of range-based C++ iterators in update_stats. This was necessary to properly handle the C-style arrays and pointers required by the NVML functions.

  • Safe Fallback: If the binary is compiled with NVML enabled but runs on a machine without NVIDIA drivers/GPUs, it simply logs a warning and disables GPU monitoring without crashing (a minimal sketch of this initialization-and-fallback pattern follows this list).
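For illustration, here is a minimal sketch of the initialization, fallback, and index-based-loop pattern described in the list above. The class and member names are simplified placeholders rather than the actual nvidiamon code in this PR; only the NVML calls themselves are real API:

    #include <nvml.h>
    #include <iostream>

    // Hypothetical, simplified monitor illustrating the RAII + safe-fallback idea.
    class GpuMonitorSketch {
     public:
      GpuMonitorSketch() {
        // Initialize NVML in the constructor (RAII acquisition).
        if (nvmlInit_v2() != NVML_SUCCESS) {
          // Safe fallback: no driver/GPU present, so warn and disable monitoring.
          std::cerr << "Warning: NVML initialization failed, GPU monitoring disabled\n";
          return;
        }
        m_valid = true;
      }

      ~GpuMonitorSketch() {
        // Shut NVML down in the destructor (RAII release).
        if (m_valid) nvmlShutdown();
      }

      void update_stats() {
        if (!m_valid) return;
        unsigned int device_count = 0;
        if (nvmlDeviceGetCount_v2(&device_count) != NVML_SUCCESS) return;
        // Classic index-based loop: NVML hands out device handles by integer
        // index and fills plain C structs, so range-based iteration does not fit.
        for (unsigned int i = 0; i < device_count; ++i) {
          nvmlDevice_t device{};
          if (nvmlDeviceGetHandleByIndex_v2(i, &device) != NVML_SUCCESS) continue;
          nvmlUtilization_t util{};
          if (nvmlDeviceGetUtilizationRates(device, &util) == NVML_SUCCESS)
            std::cout << "GPU " << i << " utilization: " << util.gpu << "%\n";
        }
      }

     private:
      bool m_valid{false};
    };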

Testing:

To ensure zero regressions and verify the exact same output format for downstream tools (like prmon_plot.py), I created a custom automated test suite. I ran a matrix of 54 tests comparing the legacy nvidia-smi parser vs. the new NVML implementation. Scenarios included CPU-only, GPU-heavy, and Parent/Child forking across three environments:

  1. Google Colab (Tesla T4)

  2. Local Docker (Ubuntu 22.04)

  3. Local Bare Metal (RTX 4050 GPU)

Performance Validation Note: The plot below illustrates the resource tracking of the dummy_gpu process from the prmon-test repository. This specific benchmark was executed on a Google Colab T4 instance for 30 seconds with a 1-second sampling interval to verify NVML parity with legacy outputs.

[Figure: comparison_smi_vs_nvml_clean - resource-tracking comparison of the legacy nvidia-smi parser vs. the new NVML implementation]

AI-Assisted Contributions Disclosure:

In accordance with the guidelines introduced in Add section on AI contributions (#291) (commit d52e11d), I confirm that I take full responsibility for all code and results submitted in this PR. For full transparency regarding non-routine tasks: the core C++ and CMake implementations in this PR were written entirely by me (using AI strictly for routine tasks like code review). However, to accelerate the extensive benchmarking process, the auxiliary test scripts in the prmon-test repository, as well as the code comments and this documentation, were generated with AI assistance.

…o NVML C API

- Replace nvidia-smi text output parsing with direct NVML API calls
- Add proper NVML lifecycle management (init/shutdown) via RAII
- Add NVML_ERROR_INSUFFICIENT_SIZE warning for buffer overflow detection
- Move buffer allocation (resize) to constructor for zero-overhead updates
- Implement per-process GPU utilization tracking via nvmlDeviceGetProcessUtilization
- Implement per-process memory tracking via nvmlDeviceGetComputeRunningProcesses
- Add GPU hardware info collection (name, SM freq, total memory) via NVML
- Remove obsolete nvidia-smi parsing code, unused includes, and dead comments
- Remove NvidiamonValueTestFixed test (nvidia-smi specific)
- Fix indentation to 2-space consistent with codebase style
- Add Apache2 license headers
- Fix typos and comment accuracy
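To make the per-process items above concrete, here is a hedged sketch of how the two NVML calls can be combined with the NVML_ERROR_INSUFFICIENT_SIZE check; the buffer size and error handling here are illustrative assumptions, not a copy of the nvidiamon code in this PR:

    #include <nvml.h>
    #include <iostream>
    #include <vector>

    // Illustrative sketch: query per-process utilization and memory on one device,
    // assuming NVML is already initialized and 'device' is a valid handle.
    void query_per_process(nvmlDevice_t device) {
      // Per-process SM/memory utilization samples since the given timestamp.
      std::vector<nvmlProcessUtilizationSample_t> samples(64);
      unsigned int sample_count = static_cast<unsigned int>(samples.size());
      nvmlReturn_t ret = nvmlDeviceGetProcessUtilization(
          device, samples.data(), &sample_count, /*lastSeenTimeStamp=*/0);
      if (ret == NVML_ERROR_INSUFFICIENT_SIZE) {
        // Pre-allocated buffer too small: warn (as described above) and/or
        // resize the buffer and retry on the next update.
        std::cerr << "Warning: process utilization buffer too small\n";
      } else if (ret == NVML_SUCCESS) {
        for (unsigned int i = 0; i < sample_count; ++i)
          std::cout << "pid " << samples[i].pid << " smUtil " << samples[i].smUtil
                    << "% memUtil " << samples[i].memUtil << "%\n";
      }

      // Per-process GPU memory via the compute-process list.
      std::vector<nvmlProcessInfo_t> procs(64);
      unsigned int proc_count = static_cast<unsigned int>(procs.size());
      if (nvmlDeviceGetComputeRunningProcesses(device, &proc_count, procs.data()) ==
          NVML_SUCCESS) {
        for (unsigned int i = 0; i < proc_count; ++i)
          std::cout << "pid " << procs[i].pid << " GPU memory "
                    << procs[i].usedGpuMemory / (1024 * 1024) << " MB\n";
      }
    }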
@graeme-a-stewart changed the title from "Fix issue 242" to "Direct parsing for Nvidia devices from NVML" on Feb 16, 2026
add_test(NAME testFieldsAll COMMAND test_fields)
add_test(NAME testFieldsSomeDisabled COMMAND test_fields --disable netmon --disable cpumon)

# Link NVML to test targets (nvidiamon.cpp requires it)
Member

This should rather use a more modern approach to finding the CUDA libraries, which CMake supports properly from 3.17 onwards. See the comments below regarding the top-level CMakeLists.txt, but this should be more along the lines of:

    # Link NVML to test targets (nvidiamon.cpp requires it)                                                                                   
    if(CUDAToolkit_FOUND)
      target_sources(test_values PRIVATE ${CMAKE_CURRENT_SOURCE_DIR}/../src/nvidiamon.cpp)
      target_link_libraries(test_values PRIVATE CUDA::nvml)
      target_sources(test_fields PRIVATE ${CMAKE_CURRENT_SOURCE_DIR}/../src/nvidiamon.cpp)
      target_link_libraries(test_fields PRIVATE CUDA::nvml)
    endif()

# - Define the minimum CMake version
# HSF recommends 3.10 to support C/C++ compile features for C/C++11 across all
# platforms
cmake_minimum_required(VERSION 3.10)
Member

From 3.17 CMake introduced its "modern" way to support CUDA libraries, so I am fine with updating this to be the minimum version. CMake 3.17 was released in March 2020, so almost 6 years ago.

if(ENABLE_NVIDIA_GPU)
message(STATUS "Checking for NVIDIA Management Library (NVML)...")

find_library(NVML_LIBRARY
Member

This is not good. It's very much hand rolled and relies on too fragile a set of paths to find the CUDA libraries.

If we upgrade to CMake 3.17 this should be replicable with a simple

  find_package(CUDAToolkit REQUIRED)
  message(STATUS "CUDA Toolkit version: ${CUDAToolkit_VERSION}")
  message(STATUS "CUDA Toolkit include dir: ${CUDAToolkit_INCLUDE_DIRS}")

endif(BUILD_BENCHMARK_LOG)

#--- add NVML support ----------------------------------------------------------
if(ENABLE_NVIDIA_GPU AND NVML_FOUND)
Member

Update to

if(CUDAToolkit_FOUND)
  target_sources(prmon PRIVATE src/nvidiamon.cpp)
  target_link_libraries(prmon PRIVATE CUDA::nvml)
endif()

@graeme-a-stewart
Member

graeme-a-stewart commented Feb 16, 2026

Thank you for the patch here @EyupKeremBas7. It's really great that you contribute to the project, 🙏.

However, we have some issues that need to be addressed.

The way you find and use the CUDA library is very "old school" and we should update to a more modern approach, illustrated in the line-by-line comments above.

For my next comment, let me preface it by highlighting that, for the main use of prmon in WLCG, the experiments usually build the library themselves, deploying it to a shared filesystem called CVMFS (which is accessible worldwide). So there is one build and many executions on different machines. When prmon executes it needs to be able to dynamically decide if GPU monitoring should be enabled.

Currently, we would be in the following situation with this PR:

  • Scenario 1 - build machine has no CUDA, no support added to prmon, no monitoring possible even on WLCG GPU nodes
  • Scenario 2 - build machine has CUDA and we enable it; now prmon depends on libnvidia-ml.so and, if it cannot be found, it won't even run:
lxplus928 :: ~/build/prmon » ./package/prmon --help                                                                                     130 ↵
./package/prmon: error while loading shared libraries: libnvidia-ml.so.1: cannot open shared object file: No such file or directory

1 is bad, and 2 is really terrible!

To address this we would instead need to be able to dynamically load the Nvidia monitoring code for prmon as a module (dlopen()), aborting any GPU monitoring if that fails. This will also probably impact our ability to build statically - we would never be able to support GPU monitoring in that case (I tried, and I got a loader error even though libnvidia-ml.a does exist).
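For illustration, a minimal sketch of this dynamic-loading idea, assuming only nvmlInit_v2 is resolved at runtime; a real plugin would need to resolve every NVML symbol it uses and manage the handle's lifetime:

    #include <dlfcn.h>
    #include <iostream>

    // Illustrative sketch: load libnvidia-ml.so.1 at runtime and resolve
    // nvmlInit_v2; if anything fails, GPU monitoring is simply skipped.
    bool try_enable_gpu_monitoring() {
      void* handle = dlopen("libnvidia-ml.so.1", RTLD_NOW | RTLD_LOCAL);
      if (!handle) {
        std::cerr << "NVML not available (" << dlerror()
                  << "), GPU monitoring disabled\n";
        return false;
      }
      using nvml_init_t = int (*)();  // nvmlReturn_t is an enum; 0 == NVML_SUCCESS
      auto nvml_init = reinterpret_cast<nvml_init_t>(dlsym(handle, "nvmlInit_v2"));
      if (!nvml_init || nvml_init() != 0) {
        std::cerr << "NVML initialization failed, GPU monitoring disabled\n";
        dlclose(handle);
        return false;
      }
      // Every other NVML function would be resolved with dlsym in the same way.
      return true;
    }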

So it's tricky....

I also worry about compatibility with different CUDA versions - I don't know how stable the NVML interfaces are.

A better approach would be to enable the NVML monitoring as a separate monitoring plugin, and to keep the old-style nvidia-smi external monitoring for a transition period, until we are confident that we have addressed all of the deployment issues.

Please let @amete and me know if you would be able to work with us to address these issues.

@graeme-a-stewart
Member

P.S. Some of the suggested fixes are in a rough-and-ready state here:

graeme-a-stewart@c8e26df

They need polishing!

@EyupKeremBas7
Contributor Author

Hi @graeme-a-stewart and @amete, thanks for the detailed review! I completely agree with the dlopen() approach and with keeping the old plugin around for a transition period.
Regarding your concern about NVML stability: according to the official NVIDIA documentation, nvidia-smi is itself built directly on top of the NVML C API. As I understand it, NVIDIA maintains strict backward compatibility for these core functions, so the API should be highly stable across different versions. That said, keeping the old parser for the transition period is definitely the smartest way to go.
I'll start working on the new dynamic loading plugin and update the PR soon!

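For illustration, and purely as a suggestion rather than something this PR implements, the NVML and driver versions could be logged after initialization to help diagnose any compatibility mismatch:

    #include <nvml.h>
    #include <iostream>

    // Illustrative sketch, not part of this PR: report the NVML library and
    // driver versions (call after a successful nvmlInit_v2()).
    void log_nvml_versions() {
      char nvml_version[NVML_SYSTEM_NVML_VERSION_BUFFER_SIZE];
      char driver_version[NVML_SYSTEM_DRIVER_VERSION_BUFFER_SIZE];
      if (nvmlSystemGetNVMLVersion(nvml_version,
                                   NVML_SYSTEM_NVML_VERSION_BUFFER_SIZE) == NVML_SUCCESS)
        std::cout << "NVML version: " << nvml_version << "\n";
      if (nvmlSystemGetDriverVersion(driver_version,
                                     NVML_SYSTEM_DRIVER_VERSION_BUFFER_SIZE) == NVML_SUCCESS)
        std::cout << "Driver version: " << driver_version << "\n";
    }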

Linked issue: Convert Nvidia monitoring to use C API (#242)