Direct parsing for Nvidia devices from NVML #297

EyupKeremBas7 wants to merge 10 commits into HSF:main
Conversation
…o NVML C API

- Replace nvidia-smi text output parsing with direct NVML API calls
- Add proper NVML lifecycle management (init/shutdown) via RAII
- Add NVML_ERROR_INSUFFICIENT_SIZE warning for buffer overflow detection
- Move buffer allocation (resize) to constructor for zero-overhead updates
- Implement per-process GPU utilization tracking via nvmlDeviceGetProcessUtilization
- Implement per-process memory tracking via nvmlDeviceGetComputeRunningProcesses
- Add GPU hardware info collection (name, SM freq, total memory) via NVML
- Remove obsolete nvidia-smi parsing code, unused includes, and dead comments
- Remove NvidiamonValueTestFixed test (nvidia-smi specific)
- Fix indentation to 2-space, consistent with codebase style
- Add Apache2 license headers
- Fix typos and comment accuracy
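As a rough illustration of the RAII lifecycle pattern the commit list describes — not the actual nvidiamon code — a minimal sketch might look like this (the class name NvmlSession and its members are hypothetical):

#include <iostream>
#include <nvml.h>

// Minimal sketch: initialize NVML in the constructor, shut it down in the
// destructor. A failed init leaves the session inactive instead of crashing,
// matching the "Safe Fallback" behaviour described later in this PR.
class NvmlSession {
 public:
  NvmlSession() {
    nvmlReturn_t ret = nvmlInit_v2();
    active_ = (ret == NVML_SUCCESS);
    if (!active_)
      std::cerr << "NVML init failed: " << nvmlErrorString(ret)
                << " - GPU monitoring disabled\n";
  }
  ~NvmlSession() {
    if (active_) nvmlShutdown();  // release NVML only if init succeeded
  }
  bool active() const { return active_; }

 private:
  bool active_ = false;
};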
add_test(NAME testFieldsAll COMMAND test_fields)
add_test(NAME testFieldsSomeDisabled COMMAND test_fields --disable netmon --disable cpumon)

# Link NVML to test targets (nvidiamon.cpp requires it)
This should rather use a more modern approach to finding the CUDA libraries, which CMake supports properly from version 3.17 onwards. See the comments below regarding the top-level CMakeLists.txt, but this should be more along the lines of:
# Link NVML to test targets (nvidiamon.cpp requires it)
if(CUDAToolkit_FOUND)
  target_sources(test_values PRIVATE ${CMAKE_CURRENT_SOURCE_DIR}/../src/nvidiamon.cpp)
  target_link_libraries(test_values PRIVATE CUDA::nvml)
  target_sources(test_fields PRIVATE ${CMAKE_CURRENT_SOURCE_DIR}/../src/nvidiamon.cpp)
  target_link_libraries(test_fields PRIVATE CUDA::nvml)
endif()

# - Define the minimum CMake version
# HSF recommends 3.10 to support C/C++ compile features for C/C++11 across all
# platforms
cmake_minimum_required(VERSION 3.10)
From 3.17 CMake introduced its "modern" way of supporting CUDA libraries, so I am fine with updating this to be the minimum version. CMake 3.17 was released in May 2020, so almost 6 years ago.
if(ENABLE_NVIDIA_GPU)
  message(STATUS "Checking for NVIDIA Management Library (NVML)...")

  find_library(NVML_LIBRARY
This is not good. It's very much hand-rolled and relies on too fragile a set of paths to find the CUDA libraries.
If we upgrade to CMake 3.17 this should be replaceable with a simple:
find_package(CUDAToolkit REQUIRED)
message(STATUS "CUDA Toolkit version: ${CUDAToolkit_VERSION}")
message(STATUS "CUDA Toolkit include dir: ${CUDAToolkit_INCLUDE_DIRS}")

endif(BUILD_BENCHMARK_LOG)
|
|
#--- add NVML support ----------------------------------------------------------
if(ENABLE_NVIDIA_GPU AND NVML_FOUND)
Update to:

if(CUDAToolkit_FOUND)
  target_sources(prmon PRIVATE src/nvidiamon.cpp)
  target_link_libraries(prmon PRIVATE CUDA::nvml)
endif()
Thank you for the patch here @EyupKeremBas7. It's really great that you contribute to the project, 🙏. However, we have some issues that need to be addressed. The way you find and use the CUDA library is very "old school" and we should update to a more modern approach, illustrated in the line-by-line comments above.

For my next comment, let me prefix by highlighting that for the main use of prmon in WLCG the experiments usually build the library themselves, deploying it to a shared filesystem called CVMFS (which is accessible worldwide). So there is one build and many executions on different machines. When prmon executes it needs to be able to dynamically decide if GPU monitoring should be enabled. Currently, we would be in the following situation with this PR:

1. Built without NVML support (the default), prmon can never monitor GPUs, wherever it runs.
2. Built with NVML support, prmon fails to start at all on machines that lack the NVIDIA library:
lxplus928 :: ~/build/prmon » ./package/prmon --help                      130 ↵
./package/prmon: error while loading shared libraries: libnvidia-ml.so.1: cannot open shared object file: No such file or directory

1 is bad, and 2 is really terrible! To address this we would rather need to be able to dynamically load the Nvidia monitoring code for prmon as a module (e.g. via dlopen()). So it's tricky.... I also worry about compatibility with different CUDA versions - I don't know how stable the NVML interfaces are. A better approach is to enable the NVML monitoring as a different monitoring plugin, and to keep the old style nvidia-smi monitoring around for a transition period.

Please let @amete and me know if you would be able to work with us to address these issues.
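To make the single-build-many-machines scenario concrete, here is a rough sketch (not project code) of what runtime loading via dlopen() could look like; the helper name try_enable_nvml is hypothetical:

#include <dlfcn.h>
#include <iostream>

// Sketch: probe for the NVML shared library at runtime so one prmon binary
// can run everywhere and enable GPU monitoring only where a driver exists.
using nvml_init_fn = int (*)();  // nvmlReturn_t is an enum; 0 == NVML_SUCCESS

bool try_enable_nvml() {
  // RTLD_NOW resolves all symbols immediately; a null handle means the
  // NVIDIA driver stack is not installed on this machine.
  void* handle = dlopen("libnvidia-ml.so.1", RTLD_NOW);
  if (handle == nullptr) {
    std::cerr << "libnvidia-ml.so.1 not found - GPU monitoring disabled\n";
    return false;
  }
  auto init = reinterpret_cast<nvml_init_fn>(dlsym(handle, "nvmlInit_v2"));
  if (init == nullptr || init() != 0) {
    dlclose(handle);
    return false;
  }
  // Keep 'handle' open for the monitor's lifetime; further NVML entry
  // points would be resolved with dlsym in the same way.
  return true;
}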
P.S. Some of the suggested fixes are in a rough-and-ready state here: They need polishing!
|
Hi @graeme-a-stewart and @amete, thanks for the detailed review! I completely agree with the dlopen() approach and keeping the old plugin around for a transition period.
Force-pushed from 2af2d1d to a6de3cd
Fixes #242.

As discussed in the issue by @graeme-a-stewart, this PR replaces the nvidia-smi subprocess parsing with direct calls to the NVML C API.

Key Changes & Architecture:

Build Option: As requested, GPU monitoring is now an opt-in build feature. Added the ENABLE_NVIDIA_GPU CMake flag (defaults to OFF to prevent build failures on non-GPU systems).

Code Style & RAII: I strictly maintained the original RAII pattern of the nvidiamon class (initializing NVML in the constructor and shutting it down in the destructor). The logging style and overall class structure were kept as close to the original as possible (for example, retaining the read_path parameter in the update_stats signature to preserve the existing interface, even though it is no longer in use).

Note on Loops: You might notice the use of classic index-based for loops instead of range-based C++ iterators in update_stats. This was necessary to properly handle the C-style arrays and pointers required by the NVML functions (see the sketch after this list).

Safe Fallback: If the binary is compiled with NVML enabled but runs on a machine without NVIDIA drivers/GPUs, it simply logs a warning and disables the GPU monitoring without crashing.
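For illustration, a simplified sketch of the two-call buffer protocol that nvmlDeviceGetComputeRunningProcesses uses — this is not the PR's update_stats code, and error handling is reduced to the bare minimum:

#include <nvml.h>
#include <iostream>
#include <vector>

// Sketch: query per-process GPU memory. NVML fills a caller-provided C
// array, which is why index-based loops are the natural fit here.
void print_compute_processes(nvmlDevice_t device) {
  unsigned int count = 0;
  // First call with a null buffer: NVML answers NVML_ERROR_INSUFFICIENT_SIZE
  // and writes the required element count back into 'count'.
  nvmlReturn_t ret =
      nvmlDeviceGetComputeRunningProcesses(device, &count, nullptr);
  if (ret != NVML_SUCCESS && ret != NVML_ERROR_INSUFFICIENT_SIZE) return;
  if (count == 0) return;  // no compute processes on this device
  std::vector<nvmlProcessInfo_t> procs(count);
  ret = nvmlDeviceGetComputeRunningProcesses(device, &count, procs.data());
  if (ret != NVML_SUCCESS) return;  // e.g. a process appeared in between
  for (unsigned int i = 0; i < count; ++i)
    std::cout << "pid " << procs[i].pid << " uses "
              << procs[i].usedGpuMemory << " bytes of GPU memory\n";
}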
Testing:

To ensure zero regressions and verify the exact same output format for downstream tools (like prmon_plot.py), I created a custom automated test suite. I ran a matrix of 54 tests comparing the legacy nvidia-smi parser vs. the new NVML implementation. Scenarios included CPU-only, GPU-heavy, and parent/child forking across three environments:

- Google Colab (Tesla T4)
- Local Docker (Ubuntu 22.04)
- Local bare metal (RTX 4050 GPU)

The test scripts used are available here: https://github.com/EyupKeremBas7/prmon-test
All raw test data and benchmark logs are in this Drive folder: https://drive.google.com/file/d/1QwMiAY_u8rLbibOdik56bOhTKtAwWvpO/view?usp=sharing
Performance Validation Note: The plot below illustrates the resource tracking of the dummy_gpu process from the prmon-test repository. This specific benchmark was executed on a Google Colab T4 instance for 30 seconds with a 1-second sampling interval to verify NVML parity with legacy outputs.
AI Assisted Contributions Disclosure:

In accordance with the guidelines introduced in "Add section on AI contributions" (#291, commit d52e11d), I confirm that I take full responsibility for all code and results submitted in this PR. For full transparency regarding non-routine tasks: the core C++ and CMake implementations in this PR were written entirely by me (using AI strictly for routine tasks like code review). However, to accelerate the extensive benchmarking process, the auxiliary test scripts in the prmon-test repository, as well as the code comments and this documentation, were generated with AI assistance.