
HIVE-3067: infra_test: match machine pool e2e by Hive labels #2841

Open

huangmingxia wants to merge 1 commit into openshift:master from huangmingxia:HIVE-3067

Conversation

huangmingxia (Contributor) commented Jan 30, 2026:

Improve machinepool matching in e2e tests by using Hive labels instead of machine name prefixes

Changes:

  • infra_test.go: Replace machine name prefix matching with Hive label-based matching (sketched below)
  • e2e-test.sh: Add powerState check to installation verification logic
  • client.go: Add controller-runtime logger initialization (sketched below)
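
For illustration, a minimal sketch of what the label-based matching might look like; the label key, namespace, package, and helper names are assumptions for this sketch, not taken from the PR diff:

    // Sketch: match worker Machines by a Hive-applied label rather than by
    // machine-name prefix. The label key "hive.openshift.io/machine-pool"
    // and the openshift-machine-api namespace are assumed for illustration.
    package e2esketch

    import (
        "context"

        machinev1 "github.com/openshift/api/machine/v1beta1"
        "sigs.k8s.io/controller-runtime/pkg/client"
    )

    func machinesForPool(ctx context.Context, c client.Client, pool string) ([]machinev1.Machine, error) {
        list := &machinev1.MachineList{}
        err := c.List(ctx, list,
            client.InNamespace("openshift-machine-api"),
            client.MatchingLabels{"hive.openshift.io/machine-pool": pool},
        )
        return list.Items, err
    }

For the client.go bullet, controller-runtime expects a logger to be set before its clients start logging; a minimal initialization might look like:

    import (
        ctrl "sigs.k8s.io/controller-runtime"
        "sigs.k8s.io/controller-runtime/pkg/log/zap"
    )

    func init() {
        // Avoids the controller-runtime "SetLogger(...) was never called"
        // warning and makes client log output visible in e2e runs.
        ctrl.SetLogger(zap.New())
    }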

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jan 30, 2026
@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 30, 2026
openshift-ci-robot commented Jan 30, 2026:

@huangmingxia: This pull request references HIVE-3067 which is a valid jira issue.


In response to this:

/hold

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot requested review from 2uasimojo and suhanime January 30, 2026 13:19
openshift-ci bot commented Jan 30, 2026:

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: huangmingxia
Once this PR has been reviewed and has the lgtm label, please assign jstuever for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

codecov bot commented Jan 30, 2026:

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 50.31%. Comparing base (e33d703) to head (380a50a).
⚠️ Report is 2 commits behind head on master.


@@            Coverage Diff             @@
##           master    #2841      +/-   ##
==========================================
- Coverage   50.42%   50.31%   -0.12%     
==========================================
  Files         279      280       +1     
  Lines       34198    34274      +76     
==========================================
  Hits        17244    17244              
- Misses      15593    15669      +76     
  Partials     1361     1361              

see 4 files with indirect coverage changes


2uasimojo (Member) left a comment:

This seems sane. One non-blocking suggestion inline.

huangmingxia force-pushed the HIVE-3067 branch 3 times, most recently from 7fd8b42 to 45f5a2c on February 2, 2026 at 18:35
Comment on lines 307 to 309
clusterAutoscaler.Spec.IgnoreDaemonsetsUtilization = ptr.To(true)
clusterAutoscaler.Spec.SkipNodesWithLocalStorage = ptr.To(true)
clusterAutoscaler.Spec.BalanceSimilarNodeGroups = ptr.To(true)
2uasimojo (Member) commented:

Are you just experimenting here? This change seems orthogonal to the purpose of the PR.

With respect to the change itself, my question would be: why now? These knobs have been on the autoscaler for years and we've never needed them before.

huangmingxia (Contributor, Author) commented Feb 3, 2026:

@2uasimojo Thanks for your review!

This change seems orthogonal to the purpose of the PR.

Yes, this change itself should ideally live in a separate PR. If I understand correctly, it addresses the root cause of the scale-down test flakiness.

The original commit included a misunderstanding: the only change needed here is enabling IgnoreDaemonsetsUtilization; BalanceSimilarNodeGroups and SkipNodesWithLocalStorage are not related to the test failure, so I have removed them (the reduced change is sketched below).
Could you please review this and let me know if I should move this change into a separate commit? Thanks :)
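
A sketch of the reduced change, assuming the same clusterAutoscaler object and ptr helper (k8s.io/utils/ptr) as in the diff above:

    // Keep only the knob that affects the scale-down utilization calculation;
    // BalanceSimilarNodeGroups and SkipNodesWithLocalStorage stay at their defaults.
    clusterAutoscaler.Spec.IgnoreDaemonsetsUtilization = ptr.To(true)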

With respect to the change itself, my question would be: why now? These knobs have been on the autoscaler for years and we've never needed them before.

Since the e2e-weekly CI switched to 4.21, I checked the job history and it has never passed. When I ran the e2e tests locally, they continued to fail on 4.21 and 4.22 as well, but passed on 4.20.

So I checked the configuration on the cluster-autoscaler side; we have been using the default ClusterAutoscaler configuration:

  • IgnoreDaemonsetsUtilization is disabled by default.
  • BalanceSimilarNodeGroups was enabled by default when we set up MachinePool autoscaling.

I checked the IgnoreDaemonsetsUtilization configuration. On the spoke cluster, after removing the BusyBox pod, node utilization still exceeded the 50% threshold, which blocked scale-down.
Setting ignoreDaemonsetsUtilization: true excludes DaemonSet pods from the node utilization calculation, allowing the autoscaler to evaluate scale-down based on actual workloads rather than unavoidable system overhead (see the worked example below). With this change, the tests passed on 4.21 and 4.22.
It's possible that this issue has existed for some time and was only exposed after switching to 4.21. I'm not sure why DaemonSet resource requests increased in 4.21+; this could be due to changes in other OpenShift components and may need further investigation.
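
A worked example with illustrative numbers (not measurements from the CI run), showing how excluding DaemonSet requests moves utilization below the autoscaler's 50% scale-down threshold:

    package main

    import "fmt"

    func main() {
        // Hypothetical node: 4 CPUs allocatable, 1.5 CPUs requested by
        // DaemonSets, 0.7 CPUs requested by regular workloads.
        allocatable, daemonsets, workloads := 4.0, 1.5, 0.7

        withDS := (daemonsets + workloads) / allocatable // 0.55 > 0.50: node kept, scale-down blocked
        withoutDS := workloads / allocatable             // 0.175 < 0.50: node eligible for scale-down

        fmt.Printf("utilization with DaemonSets: %.3f, without: %.3f\n", withDS, withoutDS)
    }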

Later update: for this test failure on AWS/Azure/GCP, increasing the waitForNode / waitForMachines timeouts does not resolve it.

2uasimojo (Member) commented:

It's possible that this issue has existed for some time and was only exposed after switching to 4.21. I'm not sure why DaemonSet resource requests increased in 4.21+; this could be due to changes in other OpenShift components or may need further investigation.

Right, this is exactly my point. When a behavior changes without us doing something to cause it, we need to understand why. I'll start a conversation with the autoscaler team.

huangmingxia (Contributor, Author) commented:

@2uasimojo Thank you so much for your help. I have removed this workaround, and the scale-down issue for 4.21+ should be tracked under HIVE-3068.

huangmingxia force-pushed the HIVE-3067 branch 5 times, most recently from 31275fe to 21a27c6 on February 4, 2026 at 07:09
openshift-ci bot commented Feb 4, 2026:

@huangmingxia: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: ci/prow/security
Commit: 380a50a
Required: true
Rerun command: /test security



Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

openshift-ci-robot commented Feb 4, 2026:

@huangmingxia: This pull request references HIVE-3067 which is a valid jira issue.


In response to this:

Improve machinepool matching in e2e tests by using Hive labels instead of machine name prefixes

Changes:

  • infra_test.go: Replace machine name prefix matching with Hive label-based matching
  • e2e-test.sh: Add powerState check to installation verification logic
  • client.go: Add controller-runtime logger initialization

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

while [ $i -le ${max_cluster_deployment_status_checks} ]; do
    CD_JSON=$(oc get cd ${CLUSTER_NAME} -n ${CLUSTER_NAMESPACE} -o json)
-   if [[ $(jq .spec.installed <<<"${CD_JSON}") == "true" ]] ; then
+   if [[ $(jq .spec.installed <<<"${CD_JSON}") == "true" ]] && [[ $(jq -r .status.powerState <<<"${CD_JSON}") == "Running" ]] ; then
huangmingxia (Contributor, Author) commented:

Why add the powerState=Running check:
When we create a cluster using hiveutil, a worker MachinePool is also created. After the ClusterDeployment completes installation, cd.spec.installed=true, but .status.powerState may not have reached Running yet.
If there is an issue with the MachinePool, the MachineSets/Machines will keep syncing and the spoke cluster's cluster operators will also be impacted; in that case, the CD will not reach the Running state. A rough Go equivalent of the combined check is sketched below.
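
A rough Go equivalent of the combined shell condition (Spec.Installed and Status.PowerState are Hive v1 ClusterDeployment fields; comparing PowerState as a plain string is a simplification for this sketch):

    import hivev1 "github.com/openshift/hive/apis/hive/v1"

    // installedAndRunning mirrors the e2e-test.sh check: installation must be
    // complete AND the cluster must have reached the Running power state.
    func installedAndRunning(cd *hivev1.ClusterDeployment) bool {
        return cd.Spec.Installed && string(cd.Status.PowerState) == "Running"
    }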

2uasimojo (Member) commented:

The security test failure will be handled via HIVE-3069.

@huangmingxia is this ready or are we waiting for the outcome of the autoscaler issue?

huangmingxia (Contributor, Author) commented:

@2uasimojo Thanks.
I ran the test locally and it passed (scale-down.log), but that was with the ignoreDaemonsetsUtilization=true workaround in place; I removed the workaround when updating the PR.
I'm not sure when the autoscaler fix will land; could you please review this PR first? We can track the autoscaler fix under HIVE-3068.
