@dgchinner (Contributor) commented Jan 22, 2026

Enhancement: Test script for the SKU customisations feature in PR #49

This is dependent on the changes in PR #48 and PR #49; those commits are duplicated in the branch this PR is generated from.

Manual (mocked SKU) testing:

$ sudo /opt/hpc/azure/tests/test-sku-setup.sh --manual

Testing standard_nc96ads_a100_v4
Test Passed: standard_nc96ads_a100_v4

Testing standard_nd40rs_v2
Test Passed: standard_nd40rs_v2

Testing standard_nd96asr_v4
Job for nvidia-fabricmanager.service failed because the control process exited with error code.
See "systemctl status nvidia-fabricmanager.service" and "journalctl -xeu nvidia-fabricmanager.service" for details.
NVIDIA Fabric Manager Inactive!
Test Passed: standard_nd96asr_v4

Testing standard_hb176rs_v4
Test Passed: standard_hb176rs_v4

Testing standard_nc80adis_h100_v5
Check NVLink status after reloading NVIDIA kernel modules...
NVLink is Active.
Test Passed: standard_nc80adis_h100_v5

Testing standard_nd96isr_h200_v5
Job for nvidia-fabricmanager.service failed because the control process exited with error code.
See "systemctl status nvidia-fabricmanager.service" and "journalctl -xeu nvidia-fabricmanager.service" for details.
NVIDIA Fabric Manager Inactive!
Test Passed: standard_nd96isr_h200_v5

Testing standard_nd128isr_gb300_v6
Test Passed: standard_nd128isr_gb300_v6

Testing some_unknown_sku_for_testing
No SKU customization for some_unknown_sku_for_testing
Unknown SKU: some_unknown_sku_for_testing
Test Passed: some_unknown_sku_for_testing
$

Testing that the SKU customisations installed correctly and the service is running on a given VM running the built image (e.g. via a CI system):

$ sudo /opt/hpc/azure/tests/test-sku-setup.sh

Testing standard_nc8as_t4_v3
Unknown SKU: standard_nc8as_t4_v3
Test Passed: standard_nc8as_t4_v3
$

Issue Tracker Tickets (Jira or BZ if any): RHELHPC-126

Summary by Sourcery

Add Azure-specific SKU customisation support for NCCL/topology tuning on HPC VMs and provide tooling to manage and validate these configurations.

New Features:

  • Introduce configurable SKU customisation for Azure HPC VM types, including topology, NCCL, and hardware workaround scripts managed via a systemd service.
  • Add Azure HPC resource, tools, tests, and runtime directories to host SKU-specific configuration and runtime data.
  • Provide a test script to validate SKU customisation behaviour both via mocked SKUs and on real Azure VMs.

Enhancements:

  • Document the new hpc_sku_customisation boolean variable and default behaviour in the role README.

Tests:

  • Add SKU customisation test script and supporting files to verify correct installation and runtime behaviour across supported and unknown Azure SKUs.

@sourcery-ai sourcery-ai bot commented Jan 22, 2026

Reviewer's Guide

Adds Azure HPC SKU customisation support with scripts, topology files, and tests, wiring them into the Ansible role behind a new hpc_sku_customisation toggle and installing a systemd service to apply SKU-specific NCCL/topology tweaks at boot.

Sequence diagram for SKU customisation at boot via systemd service

sequenceDiagram
    participant Systemd
    participant SkuCustomisationService
    participant SetupScript
    participant AzureIMDS
    participant SkuCustomisationHandler
    participant NvidiaFabricManager
    participant NvidiaDCGM
    participant NCCL

    Systemd->>SkuCustomisationService: start sku_customisation.service
    SkuCustomisationService->>SetupScript: exec setup_sku_customisations.sh

    SetupScript->>SetupScript: init NCCL_CONF / topology runtime dirs
    alt sku mocked
        SetupScript->>SetupScript: read env __MOCK_SKU as sku
    else sku from IMDS
        loop up to 5 retries
            SetupScript->>AzureIMDS: GET /metadata/instance vmSize
            AzureIMDS-->>SetupScript: vmSize or empty
        end
        SetupScript->>SetupScript: tolower(sku)
    end

    alt known SKU pattern
        SetupScript->>SkuCustomisationHandler: run ncv4.sh | ndv4.sh | ndv5.sh | ndv2.sh | ncv5.sh | ndv6.sh | hbv4.sh
        SkuCustomisationHandler->>SetupScript: configure TOPOLOGY_FILE / TOPOLOGY_GRAPH
        opt NCCL tuning
            SkuCustomisationHandler->>SetupScript: append NCCL_* to NCCL_CONF
        end
        opt fabric manager control
            SkuCustomisationHandler->>NvidiaFabricManager: systemctl enable/start
            NvidiaFabricManager-->>SkuCustomisationHandler: is-active status
        end
        opt NVLink workaround
            SkuCustomisationHandler->>NvidiaDCGM: stop nvidia-dcgm.service
            SkuCustomisationHandler->>NvidiaDCGM: reload NVIDIA kernel modules
            SkuCustomisationHandler->>NvidiaDCGM: start nvidia-dcgm.service
        end
        SetupScript->>SetupScript: add NCCL_TOPO_FILE / NCCL_GRAPH_FILE to NCCL_CONF
    else unknown SKU
        SetupScript->>SetupScript: remove TOPOLOGY_FILE / TOPOLOGY_GRAPH / NCCL_CONF
    end

    SetupScript-->>SkuCustomisationService: exit status
    SkuCustomisationService-->>Systemd: service result

    Systemd-->>NCCL: NCCL_CONF ready for MPI jobs at runtime
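The IMDS lookup and retry loop in the diagram can be sketched as follows. This is a minimal illustration, not the PR's actual code: the function name `get_vm_sku` and the loop structure are assumptions; only the IMDS endpoint and the `Metadata: true` header follow Azure's documented metadata API.

```shell
#!/bin/bash
# Illustrative sketch of the SKU lookup shown in the diagram above.
# Function name and structure are assumptions, not the PR's actual code.
IMDS_URL="http://169.254.169.254/metadata/instance/compute/vmSize?api-version=2021-02-01&format=text"

get_vm_sku() {
    # A mocked SKU takes precedence, as used by test-sku-setup.sh --manual.
    if [ -n "${__MOCK_SKU:-}" ]; then
        echo "${__MOCK_SKU,,}"      # ${var,,} lower-cases (bash 4+)
        return 0
    fi

    local sku="" try
    for try in 1 2 3 4 5; do
        sku=$(curl -s --max-time 5 -H "Metadata: true" "$IMDS_URL" || true)
        if [ -n "$sku" ]; then
            break
        fi
        sleep 1
    done
    [ -n "$sku" ] || return 1
    echo "${sku,,}"                 # lower-cased, matching the tolower() step
}
```

A shared helper along these lines would also address the duplication between setup_sku_customisations.sh and test-sku-setup.sh noted in the review feedback later in this thread.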

Sequence diagram for manual SKU setup test using mocked SKU

sequenceDiagram
    actor Admin
    participant TestScript
    participant SetupScript
    participant SkuCustomisationHandler
    participant NvidiaFabricManager

    Admin->>TestScript: sudo test-sku-setup.sh --manual
    TestScript->>TestScript: select test SKU
    loop for each test_sku
        TestScript->>TestScript: export __MOCK_SKU=test_sku
        TestScript->>SetupScript: run setup_sku_customisations.sh
        SetupScript->>SetupScript: detect __MOCK_SKU and set sku
        SetupScript->>SkuCustomisationHandler: dispatch per SKU
        opt SKU requires fabric manager
            SkuCustomisationHandler->>NvidiaFabricManager: enable/start
            NvidiaFabricManager-->>SkuCustomisationHandler: status
        end
        SetupScript-->>TestScript: exit code
        alt exit code 0
            TestScript->>Admin: "Test Passed: test_sku"
        else failure
            TestScript->>Admin: log failure details
        end
    end

File-Level Changes

Introduce Azure HPC resource/runtime directory layout and default variables, and ensure directories are created during role execution.
  • Define __hpc_azure_resource_dir, __hpc_azure_tools_dir, __hpc_azure_tests_dir, and __hpc_azure_runtime_dir in vars/main.yml.
  • Add an Ansible task block to stat and create the resource and runtime directories with root ownership and 0755 permissions.
vars/main.yml
tasks/main.yml
Add an Ansible-controlled SKU customisation feature guarded by a new hpc_sku_customisation boolean variable.
  • Introduce hpc_sku_customisation default variable (true) and document it in README.md including its purpose for Azure SKU-specific tuning.
  • Add a task block that, when hpc_sku_customisation is true and not already installed, copies topology and customisation files, installs setup/removal scripts and tests, and installs/enables the sku_customisation systemd service.
defaults/main.yml
README.md
tasks/main.yml
Provide SKU setup, removal, and test scripts that drive SKU-specific NCCL and topology configuration using Azure IMDS or a mocked SKU for tests.
  • Implement setup_sku_customisations.sh to query Azure IMDS (or use __MOCK_SKU), select per-SKU customisation scripts, manage topology files under /var/hpc/azure/topology, and populate /etc/nccl.conf including NCCL_TOPO_FILE/NCCL_GRAPH_FILE.
  • Implement remove_sku_customisations.sh to stop/disable nvidia-fabricmanager, unload nvidia_peermem, remove runtime topology files, and clear /etc/nccl.conf.
  • Implement test-sku-setup.sh to run manual mode tests over a fixed SKU list (via __MOCK_SKU) or CI mode using the real VM SKU, asserting presence/absence of topology/graph files and nccl.conf contents, and verifying the sku customisation service is active for supported SKUs.
templates/sku/setup_sku_customisations.sh
templates/sku/remove_sku_customisations.sh
templates/sku/test-sku-setup.sh
Deliver per-SKU topology descriptions and customisation scripts for various Azure GPU/HPC VM types, including NVLink workaround for NCv5 and NCCL tuning/Fabric Manager enablement for NDv4/NDv5.
  • Add static topology XML files for NCv4, NDv2, NDv4, and NDv5 SKUs describing GPU and NIC PCI layout and, where applicable, NVLink relationships.
  • Add customisation scripts ncv4.sh, ndv2.sh, ndv4.sh, ndv5.sh, ndv6.sh, hbv4.sh, and ncv5.sh to configure topology symlinks, manage topology/graph removal when unused, set NCCL_IB_PCI_RELAXED_ORDERING where appropriate, and manage nvidia-fabricmanager and NVLink reinitialisation for NCv5.
  • Wire these SKU customisation scripts into setup_sku_customisations.sh’s case statement keyed by VM size patterns (e.g. standard_ndv4, standard_nc80adis_h100_v5, standard_nd128is_gb[2-3]00_v6).
files/sku/topology/ncv4-graph.xml
files/sku/topology/ncv4-topo.xml
files/sku/topology/ndv2-topo.xml
files/sku/topology/ndv4-topo.xml
files/sku/topology/ndv5-topo.xml
files/sku/customisations/ncv4.sh
files/sku/customisations/ndv2.sh
files/sku/customisations/ndv4.sh
files/sku/customisations/ndv5.sh
files/sku/customisations/ndv6.sh
files/sku/customisations/hbv4.sh
files/sku/customisations/ncv5.sh
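The case-statement dispatch described above might look roughly like this. The glob patterns and the `sku_to_handler` function name are illustrative guesses reconstructed from the SKU names exercised by the test log, not the patterns actually used in setup_sku_customisations.sh.

```shell
#!/bin/bash
# Hypothetical reconstruction of the per-SKU dispatch; the patterns are
# guesses based on the SKUs in the test output, not the PR's real code.
sku_to_handler() {
    case "$1" in
    standard_nc*_a100_v4)       echo "ncv4.sh" ;;
    standard_nd40rs_v2)         echo "ndv2.sh" ;;
    standard_nd96a*_v4)         echo "ndv4.sh" ;;
    standard_nd96isr_h*_v5)     echo "ndv5.sh" ;;
    standard_nd*_gb[23]00_v6)   echo "ndv6.sh" ;;
    standard_hb*_v4)            echo "hbv4.sh" ;;
    standard_nc80adis_h100_v5)  echo "ncv5.sh" ;;
    *)
        # Unknown SKUs fall through to the cleanup path.
        echo "unknown"
        return 1
        ;;
    esac
}
```

For example, `sku_to_handler standard_nd96isr_h200_v5` prints `ndv5.sh`, while an unrecognised SKU prints `unknown` and returns non-zero.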


@sourcery-ai sourcery-ai bot left a comment

Hey - I've found 1 issue, and left some high level feedback:

  • The systemd unit name used in the test script (systemctl is-active --quiet sku-customisations) does not match the installed service name (sku_customisation.service); aligning these (and choosing a single spelling/format) will avoid false negatives in service state checks.
  • The IMDS SKU lookup and retry logic is duplicated between setup_sku_customisations.sh and test-sku-setup.sh; consider factoring this into a common helper or at least a shared function to keep the behavior consistent and easier to change.
  • Several SKU customisation scripts (e.g. NDv4/NDv5) contain very similar nvidia-fabricmanager enable/start/error-handling blocks; extracting this into a common helper or function would reduce repetition and the risk of divergent behavior between SKUs.
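The shared fabric-manager helper suggested in the last bullet could be as small as the sketch below. The function name and the "Active" message are illustrative; the "NVIDIA Fabric Manager Inactive!" message mirrors the test output quoted in the PR description.

```shell
#!/bin/bash
# Sketch of a shared helper for the repeated nvidia-fabricmanager
# enable/start/error-handling blocks in the NDv4/NDv5 customisation
# scripts; function name and messages are illustrative.
start_fabric_manager() {
    systemctl enable nvidia-fabricmanager.service >/dev/null 2>&1
    if ! systemctl start nvidia-fabricmanager.service; then
        # Non-fatal: manual-mode tests run on dev machines whose GPUs
        # the fabric manager does not support, so only warn here.
        echo "NVIDIA Fabric Manager Inactive!"
        return 1
    fi
    echo "NVIDIA Fabric Manager Active."
}
```

Each per-SKU script could then source this helper instead of carrying its own copy of the enable/start block.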
Individual Comments

### Comment 1
<location> `README.md:198` </location>
<code_context>
+Whether to install the hardware tuning files for different Azure VM types (SKUs).
+
+This will install definitions for optimal hardware configurations for the different types of high performance VMs that are typically used for HPC workloads in the Azure environment.
+These include Infiniband and GPU/NVLink and NCCL customisations, as well as any workarounds for specific hardware problems that may be needed.
+
+Default: `true`
</code_context>

<issue_to_address>
**suggestion (typo):** Use the standard spelling "InfiniBand" for the interconnect name.

Use the vendor-standard capitalization "InfiniBand" to match industry and Azure documentation conventions.

```suggestion
These include InfiniBand and GPU/NVLink and NCCL customisations, as well as any workarounds for specific hardware problems that may be needed.
```
</issue_to_address>


@dgchinner dgchinner force-pushed the test-sku-customisations branch 2 times, most recently from abb3d95 to 23ce0c1 on January 27, 2026 22:46
@richm (Contributor) commented Jan 28, 2026

@dgchinner you'll need to rebase this and your other PRs since I just merged #48

@dgchinner dgchinner force-pushed the test-sku-customisations branch from 23ce0c1 to 557e232 on January 28, 2026 04:03
@dgchinner dgchinner changed the title from "Test SKU customisations" to "test: SKU customisations" Jan 28, 2026
@dgchinner dgchinner force-pushed the test-sku-customisations branch 3 times, most recently from db31d1d to 68ee2b4 on January 28, 2026 20:33
@richm (Contributor) commented Jan 28, 2026

@dgchinner just rebase on top of main branch and we should be good to go

@dgchinner dgchinner force-pushed the test-sku-customisations branch from 68ee2b4 to 8ba89bd on January 28, 2026 21:33
Add a script to enable both manual and automated testing of the
Azure SKU customisation scripts.

When running the tests manually, it will exercise all the different
supported SKU types via mocking and checking that appropriate links
are installed. It will not check that the customisation service is
active and running, as manual mode is expected to be used on dev
machines of unsupported SKU types.

Manual testing like this may throw some warnings or errors because
hardware is not directly supported. For example, testing on a VM
type that does not have GPUs that are supported by the fabric
manager will result in warnings that the service failed to start:

$ sudo /opt/hpc/azure/tests/test-sku-setup.sh --manual
Testing standard_nc96ads_a100_v4
Test Passed: standard_nc96ads_a100_v4
Testing standard_nd40rs_v2
Test Passed: standard_nd40rs_v2
Testing standard_nd96asr_v4
Job for nvidia-fabricmanager.service failed because the control process exited with error code.
See "systemctl status nvidia-fabricmanager.service" and "journalctl -xeu nvidia-fabricmanager.service" for details.
NVIDIA Fabric Manager Inactive!
Test Passed: standard_nd96asr_v4
Testing standard_hb176rs_v4
Test Passed: standard_hb176rs_v4
Testing standard_nc80adis_h100_v5
Check NVLink status after reloading NVIDIA kernel modules...
NVLink is Active.
Test Passed: standard_nc80adis_h100_v5
Testing standard_nd96isr_h200_v5
Job for nvidia-fabricmanager.service failed because the control process exited with error code.
See "systemctl status nvidia-fabricmanager.service" and "journalctl -xeu nvidia-fabricmanager.service" for details.
NVIDIA Fabric Manager Inactive!
Test Passed: standard_nd96isr_h200_v5
$

Such warnings are fine.

When not in manual mode, the test expects that it is running on a
supported SKU VM (e.g. in the CI system) and will query the current
SKU type.

If the SKU is unsupported, it will check that no files are currently
installed. It will fail in the case where stale config files are
found:

$ sudo /opt/hpc/azure/tests/test-sku-setup.sh
Unknown SKU
Failed: Standard_NC8as_T4_v3: /etc/nccl.conf not empty
$

If the SKU is supported, it will check that appropriate files are
installed and the service is running.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
@dgchinner dgchinner force-pushed the test-sku-customisations branch from 8ba89bd to 78be778 on January 28, 2026 21:36
@dgchinner (Contributor, Author)

How do we get a log file from the CI test that is failing quite frequently so we can fix it and get rid of the noise?

It's failing because:

INFO:root:Running playbooks ['/home/runner/.cache/linux-system-roles/centos-9_setup.yml', '/home/runner/work/hpc/hpc/tests/tests_skip_toolkit.yml']
ERROR:root:Playbook run failed with error 2

But "error 2" tells us nothing about what is actually going wrong....

@richm (Contributor) commented Jan 29, 2026

How do we get a log file from the CI test that is failing quite frequently so we can fix it and get rid of the noise?

It's failing because:

INFO:root:Running playbooks ['/home/runner/.cache/linux-system-roles/centos-9_setup.yml', '/home/runner/work/hpc/hpc/tests/tests_skip_toolkit.yml'] ERROR:root:Playbook run failed with error 2

But "error 2" tells us nothing about what is actually going wrong....

At https://github.com/linux-system-roles/hpc/actions/runs/21456387995/job/61798143384?pr=50

Qemu result summary

FAIL: tests_skip_toolkit.yml
PASS: tests_include_vars_from_parent.yml
PASS: tests_default.yml
PASS: tests_azure.yml

It is the skip_toolkit test that is failing


tests_skip_toolkit.yml-FAIL.log Install NVIDIA Fabric Manager
Ansible Version: 2.16.15
Task Path: /home/runner/work/hpc/hpc/tests/roles/linux-system-roles.hpc/tasks/main.yml:341
Url: tests/tests_skip_toolkit.yml-FAIL.log
RC: 1
Start: 2026-01-28T21:41:40.911992+00:00Z
End: 2026-01-28T21:42:00.798961+00:00Z
Host: centos-9.qcow2

Detail:
Failed to install some of the specified packages

That is this task

- name: Install NVIDIA Fabric Manager and enable service
  when: hpc_install_nvidia_fabric_manager
  block:
    - name: Install NVIDIA Fabric Manager
      package:
        name: "{{ __hpc_nvidia_fabric_manager_packages }}" 
        state: present
        use: "{{ (__hpc_server_is_ostree | d(false)) |
          ternary('ansible.posix.rhel_rpm_ostree', omit) }}"
      register: __hpc_nvidia_fabric_manager_packages_install
      until: __hpc_nvidia_fabric_manager_packages_install is success

which is

__hpc_nvidia_fabric_manager_packages:
  - nvidia-fabric-manager

Which repo provides the nvidia-fabric-manager package? If this is one of those packages which is supposed to be built-in to the image, then I guess we will need to skip this test.

In fact, if doing any sort of testing requires the customized images for hpc, then either we will need to greatly reduce the scope of testing, or we will need to change the test to add these packages to the image before running the role.

If you want to see the actual logs from the test run, Upload test logs on failure has a link Artifact download URL: https://github.com/linux-system-roles/hpc/actions/runs/21456387995/artifacts/5295075202

@richm (Contributor) commented Jan 29, 2026

@dgchinner fixed the test - #55 - please rebase to pick up the fix

@richm richm merged commit 970a16f into linux-system-roles:main Jan 29, 2026
21 of 22 checks passed
@dgchinner (Contributor, Author)

Which repo provides the nvidia-fabric-manager package? If this is one of those packages which is supposed to be built-in to the image, then I guess we will need to skip this test.

The NVIDIA repo we install all the other NVIDIA packages from. This repo is set up by the system role right at the start. However, I note that the kernel drivers are not installed because the VM is not running on Azure (i.e. system vendor != Microsoft), and the CUDA toolkit is skipped because the test sets hpc_install_cuda_toolkit: false. There's no NVIDIA hardware in the CI test VM, so we should probably set 'hpc_install_nvidia_fabric_manager: false' too, so that we don't try to install and start a service that cannot run on the test machine.

@richm (Contributor) commented Jan 30, 2026

Which repo provides the nvidia-fabric-manager package? If this is one of those packages which is supposed to be built-in to the image, then I guess we will need to skip this test.

The nvidia repo we install all the other NVidia packages from. This repo is set up by the system-role right at the start. However, I note that the kernel drivers are not installed because the VM is not running on Azure, (i.e. system vendor != Microsoft), the CUDA toolkit is skipped because the test sets hpc_install_cuda_toolkit: false. There's no nvidia hardware in the CI test VM, so we should probably set 'hpc_install_nvidia_fabric_manager: false', too, so that we don't try to install and start a service that cannot run on the test machine.

Which is what I did in https://github.com/linux-system-roles/hpc/pull/55/changes#diff-63d118695b7459b06db5e120617cafd42267c62b9d4a994d902f5e026f5ff244R20

