THREESCALE-12161 - Fix DB Migrations not executed on upgrade to 2.16#1138
THREESCALE-12161 - Fix DB Migrations not executed on upgrade to 2.16#1138urbanikb wants to merge 4 commits into3scale:masterfrom
Conversation
✅ Snyk checks have passed. No issues have been found so far.
💻 Catch issues earlier using the plugins for VS Code, JetBrains IDEs, Visual Studio, and Eclipse. |
pkg/helper/job_test.go
Outdated
| @@ -0,0 +1,14 @@ | |||
| package helper | |||
There was a problem hiding this comment.
In my opinion, this test is useless; we're comparing two strings, yet we're letting the test determine the value of the annotation?
There was a problem hiding this comment.
I have removed the test file, it wasn't very good. It was meant to nag developer if they change the value of the constant - sort of a question "was this intentional?".
Instead I've made the test helper function createCompletedJob use string literal instead of the constant. This is going to have same effect, but it will be asking "do you want to implement some handling for when there are completed jobs with old annotation?".
|
/retest |
|
/ok-to-test |
|
Thanks the code looks much better. I have perform the following tests:
During the upgrade, the system-app-pre hooks run AFTER system-app, which is the behavior we've seen in THREESCALE-12013THREESCALE-12013 I'll run the test again tomorrow. In the meantime, could you build the bundle and check the upgrade process? |
This commit fixes issue with system deployment reconciler happening on upgrade from 2.15 to 2.16 version. The "state" of the upgrade process is persisted in an annotation on the job that records the deployment version. The basic idea is that the job annotation shoudld always match the "version" of system-app deployment. The 2.15 used generation as the version. 2.16 uses revision. The reconciliation logic would only activate upgrade logic if the revision on the existing job matches exactly the deployment revision (it assumes the upgrade is in-flight otherwise). This condition has been causing issue on upgrade from 2.16 because as part of the upgrade the customers need to scale the deployment down, which would bump the generation - which would cause the upgrade logic to incorrectly assume that migration is in flight and fail to delete the old job. Reconciliation then simply updates metadata including the annotation without recreating the job and the process silently fails to execute migration. This PR also includes all edge cases I could think of... even though we only hit the one described above.
The function needed to deal with cases when job didn't exist or the annotation didn't exist. It could return an error but that would just make it more complicated in the calling code. The new function needAppHookJobRerun returns false if the job didn't exist or revision matches deployment revision. It returns true if the job revision is unknown or deployment revision has changed.
b85cb4e to
d5facb4
Compare
This should not be possible on the upgrade path - the deployment should update only after job with revision+1 has been scheduled. Having jobs run after is possible on the "manual trigger" path - on revision change - although it shouldn't take a minute to trigger, so even for that path it seems to be a little bit off. If you still have this instance, could you check the logs of the "pre" hook to check whether the migration was executed in that run or it was already a no-op?
I built the bundle but couldn't reproduce the state where the migration isn't run from the "image changed" path - the bundle I tested with is this one (contains CSV from 2.15 and "alpha" with quay.io images): |
No tests have been changed - this is mainly to change the flow of the Reconcile function to avoid calling ReconcileJob when the revision has not changed.
00a5d4e to
9f53af0
Compare
Co-Authored-By: Claude Sonnet 4.5 noreply@anthropic.com
What
Fix:
This PR fixes issue with system deployment reconciler happening on upgrade from 2.15 to 2.16 version.
The "state" of the upgrade process is persisted in an annotation on the job that records the deployment version.
The basic idea is that the job annotation shoudld always match the "version" of system-app deployment.
The 2.15 used generation as the version. 2.16 uses revision.
The reconciliation logic would only activate upgrade logic if the revision on the existing job matches exactly the deployment revision (it assumes the upgrade is in-flight otherwise).
This condition has been causing issue on upgrade from 2.16 because as part of the upgrade the customers need to scale the deployment down, which would bump the generation - which would cause the upgrade logic to incorrectly assume that migration is in flight and fail to delete the old job.
Reconciliation then simply updates metadata including the annotation without recreating the job and the process silently fails to execute migration.
This PR also includes some more edge cases, renamed annotation and some very minor cleanup of test suite.
Steps to reproduce
The deployment generation and job annotation should match. However the deployment generation should be ahead of the revision - this is needed to reproduce the bug. In my case I have generation 6 and revision 2:
Upgrade operator to 2.16... note the job image hasn't been updated but the deployment has.
The 2.16 operator also has updated the annotation to the current revision of the deployment - that is part of the bug.
Verification
The regression test case is implemented in test with test case name
"EDGE CASE: Both image AND revision changed - old job deleted and new one created".Due to this PR also renaming the annotation, the next upgrade will fall into test case
"UPGRADE: Job exists without annotation during upgrade - deleted and recreated".