
Direct parsing for Nvidia devices from NVML #297

Open

EyupKeremBas7 wants to merge 10 commits into HSF:main from EyupKeremBas7:fix-issue-242

Conversation

@EyupKeremBas7
Contributor

Fixes #242.

As discussed in the issue by @graeme-a-stewart, this PR replaces the nvidia-smi subprocess parsing with direct calls to the NVML C API.

Key Changes & Architecture:

  • Build Option: As requested, GPU monitoring is now an opt-in build feature. Added the ENABLE_NVIDIA_GPU CMake flag (defaults to OFF to prevent build failures on non-GPU systems).

  • Code Style & RAII: I strictly maintained the original RAII pattern of the nvidiamon class (initializing NVML in the constructor and shutting it down in the destructor). The logging style and overall class structure were kept as close to the original as possible (for example, retaining the read_path parameter in the update_stats signature to preserve the existing interface, even though it is no longer in use).

  • Note on Loops: You might notice the use of classic index-based for loops instead of range-based C++ iterators in update_stats. This was necessary to properly handle the C-style arrays and pointers required by the NVML functions.

  • Safe Fallback: If the binary is compiled with NVML enabled but runs on a machine without NVIDIA drivers/GPUs, it simply logs a warning and disables GPU monitoring without crashing (a minimal sketch of this initialization-and-fallback pattern follows this list).
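For illustration, here is a minimal sketch of the initialization, fallback, and index-based-loop pattern described in the list above. The class and member names are simplified placeholders rather than the actual nvidiamon code in this PR; only the NVML calls themselves are real API:

    #include <nvml.h>
    #include <iostream>

    // Hypothetical, simplified monitor illustrating the RAII + safe-fallback idea.
    class GpuMonitorSketch {
     public:
      GpuMonitorSketch() {
        // Initialize NVML in the constructor (RAII acquisition).
        if (nvmlInit_v2() != NVML_SUCCESS) {
          // Safe fallback: no driver/GPU present, so warn and disable monitoring.
          std::cerr << "Warning: NVML initialization failed, GPU monitoring disabled\n";
          return;
        }
        m_valid = true;
      }

      ~GpuMonitorSketch() {
        // Shut NVML down in the destructor (RAII release).
        if (m_valid) nvmlShutdown();
      }

      void update_stats() {
        if (!m_valid) return;
        unsigned int device_count = 0;
        if (nvmlDeviceGetCount_v2(&device_count) != NVML_SUCCESS) return;
        // Classic index-based loop: NVML hands out device handles by integer
        // index and fills plain C structs, so range-based iteration does not fit.
        for (unsigned int i = 0; i < device_count; ++i) {
          nvmlDevice_t device{};
          if (nvmlDeviceGetHandleByIndex_v2(i, &device) != NVML_SUCCESS) continue;
          nvmlUtilization_t util{};
          if (nvmlDeviceGetUtilizationRates(device, &util) == NVML_SUCCESS)
            std::cout << "GPU " << i << " utilization: " << util.gpu << "%\n";
        }
      }

     private:
      bool m_valid{false};
    };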

Testing:

To ensure zero regressions and verify the exact same output format for downstream tools (like prmon_plot.py), I created a custom automated test suite. I ran a matrix of 54 tests comparing the legacy nvidia-smi parser vs. the new NVML implementation. Scenarios included CPU-only, GPU-heavy, and Parent/Child forking across three environments:

  1. Google Colab (Tesla T4)

  2. Local Docker (Ubuntu 22.04)

  3. Local Bare Metal (RTX 4050 GPU)

Performance Validation Note: The plot below illustrates the resource tracking of the dummy_gpu process from the prmon-test repository. This specific benchmark was executed on a Google Colab T4 instance for 30 seconds with a 1-second sampling interval to verify NVML parity with legacy outputs.

[Figure: comparison_smi_vs_nvml_clean - resource-tracking comparison of the legacy nvidia-smi parser vs. the new NVML implementation]

AI-Assisted Contributions Disclosure:

In accordance with the guidelines introduced in Add section on AI contributions (#291) (commit d52e11d), I confirm that I take full responsibility for all code and results submitted in this PR. For full transparency regarding non-routine tasks: the core C++ and CMake implementations in this PR were written entirely by me (using AI strictly for routine tasks like code review). However, to accelerate the extensive benchmarking process, the auxiliary test scripts in the prmon-test repository, as well as the code comments and this documentation, were generated with AI assistance.

…o NVML C API

- Replace nvidia-smi text output parsing with direct NVML API calls
- Add proper NVML lifecycle management (init/shutdown) via RAII
- Add NVML_ERROR_INSUFFICIENT_SIZE warning for buffer overflow detection
- Move buffer allocation (resize) to constructor for zero-overhead updates
- Implement per-process GPU utilization tracking via nvmlDeviceGetProcessUtilization
- Implement per-process memory tracking via nvmlDeviceGetComputeRunningProcesses
- Add GPU hardware info collection (name, SM freq, total memory) via NVML
- Remove obsolete nvidia-smi parsing code, unused includes, and dead comments
- Remove NvidiamonValueTestFixed test (nvidia-smi specific)
- Fix indentation to 2-space consistent with codebase style
- Add Apache2 license headers
- Fix typos and comment accuracy
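To make the per-process items above concrete, here is a hedged sketch of how the two NVML calls can be combined with the NVML_ERROR_INSUFFICIENT_SIZE check; the buffer size and error handling here are illustrative assumptions, not a copy of the nvidiamon code in this PR:

    #include <nvml.h>
    #include <iostream>
    #include <vector>

    // Illustrative sketch: query per-process utilization and memory on one device,
    // assuming NVML is already initialized and 'device' is a valid handle.
    void query_per_process(nvmlDevice_t device) {
      // Per-process SM/memory utilization samples since the given timestamp.
      std::vector<nvmlProcessUtilizationSample_t> samples(64);
      unsigned int sample_count = static_cast<unsigned int>(samples.size());
      nvmlReturn_t ret = nvmlDeviceGetProcessUtilization(
          device, samples.data(), &sample_count, /*lastSeenTimeStamp=*/0);
      if (ret == NVML_ERROR_INSUFFICIENT_SIZE) {
        // Pre-allocated buffer too small: warn (as described above) and/or
        // resize the buffer and retry on the next update.
        std::cerr << "Warning: process utilization buffer too small\n";
      } else if (ret == NVML_SUCCESS) {
        for (unsigned int i = 0; i < sample_count; ++i)
          std::cout << "pid " << samples[i].pid << " smUtil " << samples[i].smUtil
                    << "% memUtil " << samples[i].memUtil << "%\n";
      }

      // Per-process GPU memory via the compute-process list.
      std::vector<nvmlProcessInfo_t> procs(64);
      unsigned int proc_count = static_cast<unsigned int>(procs.size());
      if (nvmlDeviceGetComputeRunningProcesses(device, &proc_count, procs.data()) ==
          NVML_SUCCESS) {
        for (unsigned int i = 0; i < proc_count; ++i)
          std::cout << "pid " << procs[i].pid << " GPU memory "
                    << procs[i].usedGpuMemory / (1024 * 1024) << " MB\n";
      }
    }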
@graeme-a-stewart changed the title from "Fix issue 242" to "Direct parsing for Nvidia devices from NVML" on Feb 16, 2026
add_test(NAME testFieldsAll COMMAND test_fields)
add_test(NAME testFieldsSomeDisabled COMMAND test_fields --disable netmon --disable cpumon)

# Link NVML to test targets (nvidiamon.cpp requires it)
Member

This should rather use a more modern approach to finding the CUDA libraries, which CMake supports properly from 3.17 onwards. See the comments below regarding the top-level CMakeLists.txt, but this should be more along the lines of:

    # Link NVML to test targets (nvidiamon.cpp requires it)                                                                                   
    if(CUDAToolkit_FOUND)
      target_sources(test_values PRIVATE ${CMAKE_CURRENT_SOURCE_DIR}/../src/nvidiamon.cpp)
      target_link_libraries(test_values PRIVATE CUDA::nvml)
      target_sources(test_fields PRIVATE ${CMAKE_CURRENT_SOURCE_DIR}/../src/nvidiamon.cpp)
      target_link_libraries(test_fields PRIVATE CUDA::nvml)
    endif()

# - Define the minimum CMake version
# HSF recommends 3.10 to support C/C++ compile features for C/C++11 across all
# platforms
cmake_minimum_required(VERSION 3.10)
Member

From 3.17 CMake introduced its "modern" way to support CUDA libraries, so I am fine with updating this to be the minimum version. CMake 3.17 was released in March 2020, so almost 6 years ago.

if(ENABLE_NVIDIA_GPU)
message(STATUS "Checking for NVIDIA Management Library (NVML)...")

find_library(NVML_LIBRARY
Member

This is not good. It's very much hand rolled and relies on too fragile a set of paths to find the CUDA libraries.

If we upgrade to CMake 3.17 this should be replicable with a simple

  find_package(CUDAToolkit REQUIRED)
  message(STATUS "CUDA Toolkit version: ${CUDAToolkit_VERSION}")
  message(STATUS "CUDA Toolkit include dir: ${CUDAToolkit_INCLUDE_DIRS}")

endif(BUILD_BENCHMARK_LOG)

#--- add NVML support ----------------------------------------------------------
if(ENABLE_NVIDIA_GPU AND NVML_FOUND)
Member

Update to

if(CUDAToolkit_FOUND)
  target_sources(prmon PRIVATE src/nvidiamon.cpp)
  target_link_libraries(prmon PRIVATE CUDA::nvml)
endif()

@graeme-a-stewart
Member

graeme-a-stewart commented Feb 16, 2026

Thank you for the patch here @EyupKeremBas7. It's really great that you contribute to the project, 🙏.

However, we have some issues that need to be addressed.

The way you find and use the CUDA library is very "old school" and we should update to a more modern approach, illustrated in the line-by-line comments above.

For my next comment, let me preface it by highlighting that, for the main use of prmon in WLCG, the experiments usually build the library themselves, deploying it to a shared filesystem called CVMFS (which is accessible worldwide). So there is one build and many executions on different machines. When prmon executes it needs to be able to dynamically decide if GPU monitoring should be enabled.

Currently, we would be in the following situation with this PR:

  • Scenario 1 - build machine has no CUDA, no support added to prmon, no monitoring possible even on WLCG GPU nodes
  • Scenario 2 - build machine has CUDA and we enable it; now prmon depends on libnvidia-ml.so and, if it cannot be found, it won't even run:
lxplus928 :: ~/build/prmon » ./package/prmon --help                                                                                     130 ↵
./package/prmon: error while loading shared libraries: libnvidia-ml.so.1: cannot open shared object file: No such file or directory

1 is bad, and 2 is really terrible!

To address this we would instead need to be able to dynamically load the Nvidia monitoring code for prmon as a module (dlopen()), aborting any GPU monitoring if that fails. This will also probably impact our ability to build statically - we would never be able to support GPU monitoring in that case (I tried, and I got a loader error even though libnvidia-ml.a does exist).
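For illustration, a minimal sketch of this dynamic-loading idea, assuming only nvmlInit_v2 is resolved at runtime; a real plugin would need to resolve every NVML symbol it uses and manage the handle's lifetime:

    #include <dlfcn.h>
    #include <iostream>

    // Illustrative sketch: load libnvidia-ml.so.1 at runtime and resolve
    // nvmlInit_v2; if anything fails, GPU monitoring is simply skipped.
    bool try_enable_gpu_monitoring() {
      void* handle = dlopen("libnvidia-ml.so.1", RTLD_NOW | RTLD_LOCAL);
      if (!handle) {
        std::cerr << "NVML not available (" << dlerror()
                  << "), GPU monitoring disabled\n";
        return false;
      }
      using nvml_init_t = int (*)();  // nvmlReturn_t is an enum; 0 == NVML_SUCCESS
      auto nvml_init = reinterpret_cast<nvml_init_t>(dlsym(handle, "nvmlInit_v2"));
      if (!nvml_init || nvml_init() != 0) {
        std::cerr << "NVML initialization failed, GPU monitoring disabled\n";
        dlclose(handle);
        return false;
      }
      // Every other NVML function would be resolved with dlsym in the same way.
      return true;
    }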

So it's tricky....

I also worry about compatibility with different CUDA versions - I don't know how stable the NVML interfaces are.

A better approach would be to enable the NVML monitoring as a separate monitoring plugin, and to keep the old-style nvidia-smi external monitoring for a transition period, until we are confident that we have addressed all of the deployment issues.

Please let @amete and me know if you would be able to work with us to address these issues.

@graeme-a-stewart
Member

P.S. Some of the suggested fixes are in a rough-and-ready state here:

graeme-a-stewart@c8e26df

They need polishing!

@EyupKeremBas7
Contributor Author

Hi @graeme-a-stewart and @amete, thanks for the detailed review! I completely agree with the dlopen() approach and with keeping the old plugin around for a transition period.
Regarding your concern about NVML stability: according to the official NVIDIA documentation, nvidia-smi is itself built directly on top of the NVML C API. As I understand it, NVIDIA maintains strict backward compatibility for these core functions, so the API should be highly stable across different versions. That said, keeping the old parser for the transition period is definitely the smartest way to go.
I'll start working on the new dynamic loading plugin and update the PR soon!

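For illustration, and purely as a suggestion rather than something this PR implements, the NVML and driver versions could be logged after initialization to help diagnose any compatibility mismatch:

    #include <nvml.h>
    #include <iostream>

    // Illustrative sketch, not part of this PR: report the NVML library and
    // driver versions (call after a successful nvmlInit_v2()).
    void log_nvml_versions() {
      char nvml_version[NVML_SYSTEM_NVML_VERSION_BUFFER_SIZE];
      char driver_version[NVML_SYSTEM_DRIVER_VERSION_BUFFER_SIZE];
      if (nvmlSystemGetNVMLVersion(nvml_version,
                                   NVML_SYSTEM_NVML_VERSION_BUFFER_SIZE) == NVML_SUCCESS)
        std::cout << "NVML version: " << nvml_version << "\n";
      if (nvmlSystemGetDriverVersion(driver_version,
                                     NVML_SYSTEM_DRIVER_VERSION_BUFFER_SIZE) == NVML_SUCCESS)
        std::cout << "Driver version: " << driver_version << "\n";
    }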

Linked issue: Convert Nvidia monitoring to use C API (#242)