OCPBUGS-29894: Check if CRLs are downloaded when determining ready status #595
rfredette wants to merge 1 commit into openshift:master
@rfredette: This pull request references Jira Issue OCPBUGS-29894, which is invalid.
The bug has been updated to refer to the pull request using the external bug tracker.
/jira refresh
@rfredette: This pull request references Jira Issue OCPBUGS-29894, which is valid. 3 validation(s) were run on this bug.
Requesting review from the QA contact.
/retest
/assign
Issues go stale after 90d of inactivity. Mark the issue as fresh by commenting /remove-lifecycle stale. If this issue is safe to close now please do so with /close.
/lifecycle stale
/remove-lifecycle stale
Force-pushed from 645c9ea to e6243d4
Miciah left a comment
This makes a router pod start failing readiness checks if it has outdated CRLs, right?
To fix OCPBUGS-29894, it should be sufficient to fail readiness only for the initial synch, so that startup probes (which use the readiness endpoint) fail until the initial synch is done.
Once the router pod has done the initial synch, we want readiness checks to pass even if refresh fails, for two reasons:
- The expectation is to restore the behavior prior to openshift/cluster-ingress-operator#939 and #472, and that behavior was to prevent a router pod from serving traffic until it had CRLs, not to prevent a router pod from serving traffic if it had outdated CRLs.
- It is generally less bad to continue using outdated CRLs, rather than to stop serving traffic entirely when refresh fails.
This does make me realize that we need a Prometheus metric and an alert when refresh fails for a prolonged period. Failure to refresh has two nasty implications:
- Router pods are using outdated CRLs.
- The next rolling update of the router deployment (for an upgrade, configuration change, or whatever reason) could get stuck as presumably the new pods would fail on initial synch.
Ack, I'll update this so that the CRLs readiness check is only used for the initial sync.
That makes sense, although I think that's out of the scope of this bug. I'll open a Jira issue for that.
Force-pushed from e7b4fc2 to 4b7b65f
e2e-upgrade failed during bootstrap.
/test e2e-upgrade
/remove-lifecycle stale
/assign @alebedev87
alebedev87 left a comment
LGTM, just a nit question.
	return crlsUpdated
}

func SetCRLsUpdated(value bool) {
Would it make sense to remove the possibility to set updated to false? Taking into account the fact that we want to probe the fully present CRL list only at startup.
- func SetCRLsUpdated(value bool) {
+ func SetCRLsUpdated() {
I think that's reasonable. I've updated this to include that change 👍
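For reference, a one-way setter along the lines of this suggestion might look like the sketch below. Only `crlsUpdated` and `SetCRLsUpdated` appear in the diff; the getter name `CRLsUpdated` and the use of `atomic.Bool` for storage are assumptions made for illustration, not necessarily what the PR does.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// crlsUpdated records whether the initial CRL download has completed.
// Storing it in an atomic.Bool is an assumption made for this sketch.
var crlsUpdated atomic.Bool

// CRLsUpdated (hypothetical getter name) reports whether the initial
// CRL download has completed.
func CRLsUpdated() bool {
	return crlsUpdated.Load()
}

// SetCRLsUpdated marks the initial download as complete. It takes no
// argument, so callers cannot reset the flag to false after startup.
func SetCRLsUpdated() {
	crlsUpdated.Store(true)
}

func main() {
	fmt.Println(CRLsUpdated()) // false
	SetCRLsUpdated()
	fmt.Println(CRLsUpdated()) // true
}
```

With no parameter, a later failed refresh has no API for marking the router unready again, which matches the startup-only intent.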
Stale issues rot after 30d of inactivity. Mark the issue as fresh by commenting /remove-lifecycle rotten. If this issue is safe to close now please do so with /close.
/lifecycle rotten
/remove-lifecycle rotten
This fixes OCPBUGS-29894
Force-pushed from 4b7b65f to c81119b
[APPROVALNOTIFIER] This PR is NOT APPROVED
@rfredette: The following tests failed.
Require all CRLs to be downloaded before the router can report that it's ready. This prevents forwarding requests to a router until it's ready to handle mTLS.
This fixes OCPBUGS-29894