
HIVE-3067: infra_test: match machine pool e2e by Hive labels #2841

Open

huangmingxia wants to merge 1 commit into openshift:master from huangmingxia:HIVE-3067

Conversation

huangmingxia (Contributor) commented Jan 30, 2026:

Improve machinepool matching in e2e tests by using Hive labels instead of machine name prefixes

Changes:

  • infra_test.go: Replace machine name prefix matching with Hive label-based matching (sketched below)
  • e2e-test.sh: Add powerState check to installation verification logic
  • client.go: Add controller-runtime logger initialization (sketched below)
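
For illustration, a minimal sketch of what the label-based matching might look like; the label key, namespace, package, and helper names are assumptions for this sketch, not taken from the PR diff:

    // Sketch: match worker Machines by a Hive-applied label rather than by
    // machine-name prefix. The label key "hive.openshift.io/machine-pool"
    // and the openshift-machine-api namespace are assumed for illustration.
    package e2esketch

    import (
        "context"

        machinev1 "github.com/openshift/api/machine/v1beta1"
        "sigs.k8s.io/controller-runtime/pkg/client"
    )

    func machinesForPool(ctx context.Context, c client.Client, pool string) ([]machinev1.Machine, error) {
        list := &machinev1.MachineList{}
        err := c.List(ctx, list,
            client.InNamespace("openshift-machine-api"),
            client.MatchingLabels{"hive.openshift.io/machine-pool": pool},
        )
        return list.Items, err
    }

For the client.go bullet, controller-runtime expects a logger to be set before its clients start logging; a minimal initialization might look like:

    import (
        ctrl "sigs.k8s.io/controller-runtime"
        "sigs.k8s.io/controller-runtime/pkg/log/zap"
    )

    func init() {
        // Avoids the controller-runtime "SetLogger(...) was never called"
        // warning and makes client log output visible in e2e runs.
        ctrl.SetLogger(zap.New())
    }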

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jan 30, 2026
@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 30, 2026
openshift-ci-robot commented Jan 30, 2026:

@huangmingxia: This pull request references HIVE-3067 which is a valid jira issue.


In response to this:

/hold

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot requested review from 2uasimojo and suhanime January 30, 2026 13:19
openshift-ci bot commented Jan 30, 2026:

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: huangmingxia
Once this PR has been reviewed and has the lgtm label, please assign jstuever for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

codecov bot commented Jan 30, 2026:

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 50.31%. Comparing base (e33d703) to head (380a50a).
⚠️ Report is 2 commits behind head on master.


@@            Coverage Diff             @@
##           master    #2841      +/-   ##
==========================================
- Coverage   50.42%   50.31%   -0.12%     
==========================================
  Files         279      280       +1     
  Lines       34198    34274      +76     
==========================================
  Hits        17244    17244              
- Misses      15593    15669      +76     
  Partials     1361     1361              

see 4 files with indirect coverage changes


2uasimojo (Member) left a comment:

This seems sane. One non-blocking suggestion inline.

huangmingxia force-pushed the HIVE-3067 branch 3 times, most recently from 7fd8b42 to 45f5a2c on February 2, 2026 at 18:35
Comment on lines 307 to 309
clusterAutoscaler.Spec.IgnoreDaemonsetsUtilization = ptr.To(true)
clusterAutoscaler.Spec.SkipNodesWithLocalStorage = ptr.To(true)
clusterAutoscaler.Spec.BalanceSimilarNodeGroups = ptr.To(true)
2uasimojo (Member) commented:

Are you just experimenting here? This change seems orthogonal to the purpose of the PR.

With respect to the change itself, my question would be: why now? These knobs have been on the autoscaler for years and we've never needed them before.

huangmingxia (Contributor, Author) commented Feb 3, 2026:

@2uasimojo Thanks for your review!

This change seems orthogonal to the purpose of the PR.

Yes, this change itself should ideally live in a separate PR. If I understand correctly, it addresses the root cause of the scale-down test flakiness.

The original commit included a misunderstanding: the only change needed here is enabling IgnoreDaemonsetsUtilization; BalanceSimilarNodeGroups and SkipNodesWithLocalStorage are not related to the test failure, so I have removed them (the reduced change is sketched below).
Could you please review this and let me know if I should move this change into a separate commit? Thanks :)
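
A sketch of the reduced change, assuming the same clusterAutoscaler object and ptr helper (k8s.io/utils/ptr) as in the diff above:

    // Keep only the knob that affects the scale-down utilization calculation;
    // BalanceSimilarNodeGroups and SkipNodesWithLocalStorage stay at their defaults.
    clusterAutoscaler.Spec.IgnoreDaemonsetsUtilization = ptr.To(true)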

With respect to the change itself, my question would be: why now? These knobs have been on the autoscaler for years and we've never needed them before.

Since the e2e-weekly CI switched to 4.21, I checked the job history and it has never passed. When I ran the e2e tests locally, they continued to fail on 4.21 and 4.22 as well, but passed on 4.20.

So I checked the configuration on the cluster-autoscaler side; we have been using the default ClusterAutoscaler configuration:

  • IgnoreDaemonsetsUtilization is disabled by default.
  • BalanceSimilarNodeGroups was enabled by default when we set up MachinePool autoscaling.

I checked the IgnoreDaemonsetsUtilization configuration. On the spoke cluster, after removing the BusyBox pod, node utilization still exceeded the 50% threshold, which blocked scale-down.
Setting ignoreDaemonsetsUtilization: true excludes DaemonSet pods from the node utilization calculation, allowing the autoscaler to evaluate scale-down based on actual workloads rather than unavoidable system overhead (see the worked example below). With this change, the tests passed on 4.21 and 4.22.
It's possible that this issue has existed for some time and was only exposed after switching to 4.21. I'm not sure why DaemonSet resource requests increased in 4.21+; this could be due to changes in other OpenShift components and may need further investigation.
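
A worked example with illustrative numbers (not measurements from the CI run), showing how excluding DaemonSet requests moves utilization below the autoscaler's 50% scale-down threshold:

    package main

    import "fmt"

    func main() {
        // Hypothetical node: 4 CPUs allocatable, 1.5 CPUs requested by
        // DaemonSets, 0.7 CPUs requested by regular workloads.
        allocatable, daemonsets, workloads := 4.0, 1.5, 0.7

        withDS := (daemonsets + workloads) / allocatable // 0.55 > 0.50: node kept, scale-down blocked
        withoutDS := workloads / allocatable             // 0.175 < 0.50: node eligible for scale-down

        fmt.Printf("utilization with DaemonSets: %.3f, without: %.3f\n", withDS, withoutDS)
    }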

Later update: for this test failure on AWS/Azure/GCP, increasing the waitForNode / waitForMachines timeouts does not resolve it.

2uasimojo (Member) commented:

It's possible that this issue has existed for some time and was only exposed after switching to 4.21. I'm not sure why DaemonSet resource requests increased in 4.21+; this could be due to changes in other OpenShift components or may need further investigation.

Right, this is exactly my point. When a behavior changes without us doing something to cause it, we need to understand why. I'll start a conversation with the autoscaler team.

huangmingxia (Contributor, Author) commented:

@2uasimojo Thank you so much for your help. I have removed this workaround, and the scale-down issue for 4.21+ should be tracked under HIVE-3068.

huangmingxia force-pushed the HIVE-3067 branch 5 times, most recently from 31275fe to 21a27c6 on February 4, 2026 at 07:09
openshift-ci bot commented Feb 4, 2026:

@huangmingxia: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: ci/prow/security
Commit: 380a50a
Required: true
Rerun command: /test security



Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

openshift-ci-robot commented Feb 4, 2026:

@huangmingxia: This pull request references HIVE-3067 which is a valid jira issue.


In response to this:

Improve machinepool matching in e2e tests by using Hive labels instead of machine name prefixes

Changes:

  • infra_test.go: Replace machine name prefix matching with Hive label-based matching
  • e2e-test.sh: Add powerState check to installation verification logic
  • client.go: Add controller-runtime logger initialization

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

while [ $i -le ${max_cluster_deployment_status_checks} ]; do
    CD_JSON=$(oc get cd ${CLUSTER_NAME} -n ${CLUSTER_NAMESPACE} -o json)
-   if [[ $(jq .spec.installed <<<"${CD_JSON}") == "true" ]] ; then
+   if [[ $(jq .spec.installed <<<"${CD_JSON}") == "true" ]] && [[ $(jq -r .status.powerState <<<"${CD_JSON}") == "Running" ]] ; then
huangmingxia (Contributor, Author) commented:

Why add the powerState=Running check:
When we create a cluster using hiveutil, a worker MachinePool is also created. After the ClusterDeployment completes installation, cd.spec.installed=true, but .status.powerState may not have reached Running yet.
If there is an issue with the MachinePool, the MachineSets/Machines will keep syncing and the spoke cluster's cluster operators will also be impacted; in that case, the CD will not reach the Running state. A rough Go equivalent of the combined check is sketched below.
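
A rough Go equivalent of the combined shell condition (Spec.Installed and Status.PowerState are Hive v1 ClusterDeployment fields; comparing PowerState as a plain string is a simplification for this sketch):

    import hivev1 "github.com/openshift/hive/apis/hive/v1"

    // installedAndRunning mirrors the e2e-test.sh check: installation must be
    // complete AND the cluster must have reached the Running power state.
    func installedAndRunning(cd *hivev1.ClusterDeployment) bool {
        return cd.Spec.Installed && string(cd.Status.PowerState) == "Running"
    }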

2uasimojo (Member) commented:

The security test failure will be handled via HIVE-3069.

@huangmingxia is this ready or are we waiting for the outcome of the autoscaler issue?

huangmingxia (Contributor, Author) commented:

@2uasimojo Thanks.
I ran the test locally and it passed (scale-down.log), but that was with the ignoreDaemonsetsUtilization=true workaround in place; I removed the workaround when updating the PR.
I'm not sure when the autoscaler fix will land; could you please review this PR first? We can track the autoscaler fix under HIVE-3068.
