test: SKU customisations #50

Conversation
Reviewer's Guide

Adds Azure HPC SKU customisation support with scripts, topology files, and tests, wiring them into the Ansible role behind a new `hpc_sku_customisation` toggle and installing a systemd service to apply SKU-specific NCCL/topology tweaks at boot.

Sequence diagram for SKU customisation at boot via systemd service

sequenceDiagram
participant Systemd
participant SkuCustomisationService
participant SetupScript
participant AzureIMDS
participant SkuCustomisationHandler
participant NvidiaFabricManager
participant NvidiaDCGM
participant NCCL
Systemd->>SkuCustomisationService: start sku_customisation.service
SkuCustomisationService->>SetupScript: exec setup_sku_customisations.sh
SetupScript->>SetupScript: init NCCL_CONF / topology runtime dirs
alt sku mocked
SetupScript->>SetupScript: read env __MOCK_SKU as sku
else sku from IMDS
loop up to 5 retries
SetupScript->>AzureIMDS: GET /metadata/instance vmSize
AzureIMDS-->>SetupScript: vmSize or empty
end
SetupScript->>SetupScript: tolower(sku)
end
alt known SKU pattern
SetupScript->>SkuCustomisationHandler: run ncv4.sh | ndv4.sh | ndv5.sh | ndv2.sh | ncv5.sh | ndv6.sh | hbv4.sh
SkuCustomisationHandler->>SetupScript: configure TOPOLOGY_FILE / TOPOLOGY_GRAPH
opt NCCL tuning
SkuCustomisationHandler->>SetupScript: append NCCL_* to NCCL_CONF
end
opt fabric manager control
SkuCustomisationHandler->>NvidiaFabricManager: systemctl enable/start
NvidiaFabricManager-->>SkuCustomisationHandler: is-active status
end
opt NVLink workaround
SkuCustomisationHandler->>NvidiaDCGM: stop nvidia-dcgm.service
SkuCustomisationHandler->>NvidiaDCGM: reload NVIDIA kernel modules
SkuCustomisationHandler->>NvidiaDCGM: start nvidia-dcgm.service
end
SetupScript->>SetupScript: add NCCL_TOPO_FILE / NCCL_GRAPH_FILE to NCCL_CONF
else unknown SKU
SetupScript->>SetupScript: remove TOPOLOGY_FILE / TOPOLOGY_GRAPH / NCCL_CONF
end
SetupScript-->>SkuCustomisationService: exit status
SkuCustomisationService-->>Systemd: service result
Systemd-->>NCCL: NCCL_CONF ready for MPI jobs at runtime
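To make the detection step in the diagram concrete, here is a minimal shell sketch of the SKU lookup as described: honour the `__MOCK_SKU` override, otherwise query the Azure Instance Metadata Service (IMDS) with retries. The real `setup_sku_customisations.sh` may differ in details such as the `api-version` and retry delay.

```sh
# Sketch only: use the __MOCK_SKU test override if set, otherwise query
# IMDS for the VM size, retrying up to 5 times on an empty response.
if [ -n "$__MOCK_SKU" ]; then
    sku="$__MOCK_SKU"
else
    sku=""
    for attempt in 1 2 3 4 5; do
        sku=$(curl -s -H "Metadata:true" \
            "http://169.254.169.254/metadata/instance/compute/vmSize?api-version=2021-02-01&format=text")
        [ -n "$sku" ] && break
        sleep 1
    done
fi
# Normalise to lower case for the SKU pattern match.
sku=$(echo "$sku" | tr '[:upper:]' '[:lower:]')
```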
Sequence diagram for manual SKU setup test using mocked SKU

sequenceDiagram
actor Admin
participant TestScript
participant SetupScript
participant SkuCustomisationHandler
participant NvidiaFabricManager
Admin->>TestScript: sudo test-sku-setup.sh --manual
TestScript->>TestScript: select test SKU
loop for each test_sku
TestScript->>TestScript: export __MOCK_SKU=test_sku
TestScript->>SetupScript: run setup_sku_customisations.sh
SetupScript->>SetupScript: detect __MOCK_SKU and set sku
SetupScript->>SkuCustomisationHandler: dispatch per SKU
opt SKU requires fabric manager
SkuCustomisationHandler->>NvidiaFabricManager: enable/start
NvidiaFabricManager-->>SkuCustomisationHandler: status
end
SetupScript-->>TestScript: exit code
alt exit code 0
TestScript->>Admin: "Test Passed: test_sku"
else failure
TestScript->>Admin: log failure details
end
end
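The loop in this diagram corresponds roughly to the following shell pattern. This is a sketch only; the actual `test-sku-setup.sh` script path and SKU list may differ (the SKU names here are taken from the test output in the commit message below).

```sh
# Sketch of the manual-mode loop: mock each SKU, run the setup script,
# and report pass/fail based on its exit status.
for test_sku in standard_nd96asr_v4 standard_hb176rs_v4; do
    echo "Testing $test_sku"
    if __MOCK_SKU="$test_sku" ./setup_sku_customisations.sh; then
        echo "Test Passed: $test_sku"
    else
        echo "Test Failed: $test_sku"
    fi
done
```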
Hey - I've found 1 issue, and left some high level feedback:

- The systemd unit name used in the test script (`systemctl is-active --quiet sku-customisations`) does not match the installed service name (`sku_customisation.service`); aligning these (and choosing a single spelling/format) will avoid false negatives in service state checks.
- The IMDS SKU lookup and retry logic is duplicated between `setup_sku_customisations.sh` and `test-sku-setup.sh`; consider factoring this into a common helper or at least a shared function to keep the behavior consistent and easier to change.
- Several SKU customisation scripts (e.g. NDv4/NDv5) contain very similar `nvidia-fabricmanager` enable/start/error-handling blocks; extracting this into a common helper or function would reduce repetition and the risk of divergent behavior between SKUs (see the sketch after this list).
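For illustration, the shared helper suggested in the last bullet could look something like this sketch. The function name is hypothetical; the "NVIDIA Fabric Manager Inactive!" message matches the test output quoted later in this thread.

```sh
# Hypothetical shared helper; the per-SKU scripts could call this instead
# of each repeating the enable/start/error-handling block.
enable_fabric_manager() {
    systemctl enable nvidia-fabricmanager.service
    systemctl start nvidia-fabricmanager.service
    if ! systemctl is-active --quiet nvidia-fabricmanager.service; then
        echo "NVIDIA Fabric Manager Inactive!" >&2
        return 1
    fi
}
```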
## Individual Comments
### Comment 1
<location> `README.md:198` </location>
<code_context>
+Whether to install the hardware tuning files for different Azure VM types (SKUs).
+
+This will install definitions for optimal hardware configurations for the different types of high performance VMs that are typically used for HPC workloads in the Azure environment.
+These include Infiniband and GPU/NVLink and NCCL customisations, as well as any workarounds for specific hardware problems that may be needed.
+
+Default: `true`
</code_context>
<issue_to_address>
**suggestion (typo):** Use the standard spelling "InfiniBand" for the interconnect name.
Use the vendor-standard capitalization "InfiniBand" to match industry and Azure documentation conventions.
```suggestion
These include InfiniBand and GPU/NVLink and NCCL customisations, as well as any workarounds for specific hardware problems that may be needed.
```
</issue_to_address>
Force-pushed from abb3d95 to 23ce0c1
@dgchinner you'll need to rebase this and your other PRs since I just merged #48
Force-pushed from 23ce0c1 to 557e232
Force-pushed from db31d1d to 68ee2b4
@dgchinner just rebase on top of main branch and we should be good to go
Force-pushed from 68ee2b4 to 8ba89bd
Add a script to enable both manual and automated testing of the Azure SKU customisation scripts.

When running the tests manually, it will exercise all the different supported SKU types via mocking and check that the appropriate links are installed. It will not check that the customisation service is active and running, as manual mode is expected to be used on dev machines that are unsupported SKU types. Manual testing like this may throw some warnings or errors because the hardware is not directly supported. For example, testing on a VM type that does not have GPUs that are supported by the fabric manager will result in warnings that the service failed to start:

$ sudo /opt/hpc/azure/tests/test-sku-setup.sh --manual
Testing standard_nc96ads_a100_v4
Test Passed: standard_nc96ads_a100_v4
Testing standard_nd40rs_v2
Test Passed: standard_nd40rs_v2
Testing standard_nd96asr_v4
Job for nvidia-fabricmanager.service failed because the control process exited with error code.
See "systemctl status nvidia-fabricmanager.service" and "journalctl -xeu nvidia-fabricmanager.service" for details.
NVIDIA Fabric Manager Inactive!
Test Passed: standard_nd96asr_v4
Testing standard_hb176rs_v4
Test Passed: standard_hb176rs_v4
Testing standard_nc80adis_h100_v5
Check NVLink status after reloading NVIDIA kernel modules...
NVLink is Active.
Test Passed: standard_nc80adis_h100_v5
Testing standard_nd96isr_h200_v5
Job for nvidia-fabricmanager.service failed because the control process exited with error code.
See "systemctl status nvidia-fabricmanager.service" and "journalctl -xeu nvidia-fabricmanager.service" for details.
NVIDIA Fabric Manager Inactive!
Test Passed: standard_nd96isr_h200_v5
$

Such warnings are fine.

When not in manual mode, the test expects that it is running on a supported SKU VM (e.g. in the CI system) and will query the current SKU type. If the SKU is unsupported, it will check that no files are currently installed. It will fail in the case where stale config files are found:

$ sudo /opt/hpc/azure/tests/test-sku-setup.sh
Unknown SKU
Failed: Standard_NC8as_T4_v3: /etc/nccl.conf not empty
$

If the SKU is supported, it will check that the appropriate files are installed and the service is running.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Force-pushed from 8ba89bd to 78be778
How do we get a log file from the CI test that is failing quite frequently so we can fix it and get rid of the noise? It's failing because:

INFO:root:Running playbooks ['/home/runner/.cache/linux-system-roles/centos-9_setup.yml', '/home/runner/work/hpc/hpc/tests/tests_skip_toolkit.yml']

But "error 2" tells us nothing about what is actually going wrong....
At https://github.com/linux-system-roles/hpc/actions/runs/21456387995/job/61798143384?pr=50

Qemu result summary:
FAIL: tests_skip_toolkit.yml

It is the skip_toolkit test that is failing. The test log (tests_skip_toolkit.yml-FAIL.log) shows the failure is in "Install NVIDIA Fabric Manager". That is this task:

- name: Install NVIDIA Fabric Manager and enable service
  when: hpc_install_nvidia_fabric_manager
  block:
    - name: Install NVIDIA Fabric Manager
      package:
        name: "{{ __hpc_nvidia_fabric_manager_packages }}"
        state: present
        use: "{{ (__hpc_server_is_ostree | d(false)) |
          ternary('ansible.posix.rhel_rpm_ostree', omit) }}"
      register: __hpc_nvidia_fabric_manager_packages_install
      until: __hpc_nvidia_fabric_manager_packages_install is success

which is:

__hpc_nvidia_fabric_manager_packages:
  - nvidia-fabric-manager

Which repo provides the nvidia-fabric-manager package? In fact, if doing any sort of testing requires the customized images for hpc, then either we will need to greatly reduce the scope of testing, or we will need to change the test to add these packages to the image before running the role.

If you want to see the actual logs from the test run, "Upload test logs on failure" has a link.
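To answer the repo question above, one quick check on a dnf-based image is shown below; this assumes the relevant repos are already enabled.

```sh
# Show package details for nvidia-fabric-manager, including the
# repository that provides it.
dnf info nvidia-fabric-manager
```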
@dgchinner fixed the test - #55 - please rebase to pick up the fix
The nvidia repo we install all the other NVIDIA packages from. This repo is set up by the system role right at the start. However, I note that the kernel drivers are not installed because the VM is not running on Azure (i.e. system vendor != Microsoft), and the CUDA toolkit is skipped because the test sets the skip-toolkit flag.
Which is what I did in https://github.com/linux-system-roles/hpc/pull/55/changes#diff-63d118695b7459b06db5e120617cafd42267c62b9d4a994d902f5e026f5ff244R20
Enhancement: Test script for the SKU customisations feature in PR #49
This is dependent on the changes in PR #48 and PR #49; those commits are duplicated in the branch this PR is generated from.
Manual (mocked SKU) testing:
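As shown in the commit message above, this runs the test script with mocked SKUs:

```sh
sudo /opt/hpc/azure/tests/test-sku-setup.sh --manual
```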
Testing that the SKU customisations are installed correctly and the service is running on a given VM running the built image (e.g. via a CI system):
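Run without --manual so the script queries the real SKU and checks the installed files and service state:

```sh
sudo /opt/hpc/azure/tests/test-sku-setup.sh
```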
Issue Tracker Tickets (Jira or BZ if any): RHELHPC-126
Summary by Sourcery
Add Azure-specific SKU customisation support for NCCL/topology tuning on HPC VMs and provide tooling to manage and validate these configurations.