
THREESCALE-12161 - Fix DB Migrations not executed on upgrade to 2.16 #1138

Open

urbanikb wants to merge 4 commits into 3scale:master from urbanikb:THREESCALE-12161

Conversation

urbanikb commented Jan 29, 2026

Co-Authored-By: Claude Sonnet 4.5 noreply@anthropic.com

What

Fix:

This PR fixes an issue with the system deployment reconciler that occurs on upgrade from 2.15 to 2.16.

The "state" of the upgrade process is persisted in an annotation on the job that records the deployment version.

The basic idea is that the job annotation should always match the "version" of the system-app deployment.

2.15 used the deployment generation as that version; 2.16 uses the revision.

The reconciliation logic only activates the upgrade logic if the revision on the existing job exactly matches the deployment revision (otherwise it assumes the upgrade is in flight).

This condition has been causing an issue on upgrade to 2.16, because as part of the upgrade customers need to scale the deployment down, which bumps the generation. The upgrade logic then incorrectly assumes that a migration is in flight and fails to delete the old job.

Reconciliation then simply updates the metadata, including the annotation, without recreating the job, and the process silently fails to execute the migration.

This PR also covers some additional edge cases, renames the annotation, and includes some very minor cleanup of the test suite.
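For context only (not part of the PR): metadata.generation is bumped by any spec change, including a replica-count change from a scale down/up, while the deployment.kubernetes.io/revision annotation only advances when the pod template changes and a new ReplicaSet is rolled out. That difference is why the 2.15 bookkeeping (generation) and the 2.16 bookkeeping (revision) drift apart. A minimal Go sketch of the two "version" sources (the helper name is hypothetical):

// Hypothetical helper showing where generation and revision come from.
package sketch

import (
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
)

func describeVersions(d *appsv1.Deployment) string {
	return fmt.Sprintf("generation=%d revision=%s",
		d.Generation,                                       // moves on scale up/down too
		d.Annotations["deployment.kubernetes.io/revision"], // moves only on a new rollout
	)
}
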

Steps to reproduce

  1. Install version 2.15 and create an APIManager deployment.
  2. Trigger a deployment scale down/up; this should result in a generation change, and the 2.15 operator will automatically re-create the job to match the generation:
% oc get job system-app-pre -o custom-columns=NAME:.metadata.name,NAMESPACE:.metadata.namespace,ANNOTATIONS:.metadata.annotations
NAME             NAMESPACE   ANNOTATIONS
system-app-pre   bu-v215     map[system-app-deployment-generation:5]

% oc scale deployment system-app --replicas 1
deployment.apps/system-app scaled

The deployment generation and job annotation should match. However, the deployment generation should be ahead of the revision; this is needed to reproduce the bug. In my case I have generation 6 and revision 2:

% oc get job system-app-pre -o custom-columns=NAME:.metadata.name,NAMESPACE:.metadata.namespace,ANNOTATIONS:.metadata.annotations
NAME             NAMESPACE   ANNOTATIONS
system-app-pre   bu-v215     map[system-app-deployment-generation:6]
% oc get deployment system-app -o custom-columns=NAME:.metadata.name,NAMESPACE:.metadata.namespace,ANNOTATIONS:.metadata.annotations,GENERATION:.metadata.generation
NAME         NAMESPACE   ANNOTATIONS                                GENERATION
system-app   bu-v215     map[deployment.kubernetes.io/revision:2]   6

Upgrade the operator to 2.16. Note that the job image hasn't been updated but the deployment has.

The 2.16 operator has also updated the annotation to the current revision of the deployment; that is part of the bug:

% oc get job system-app-pre -o 'custom-columns=NAME:.metadata.name,NAMESPACE:.metadata.namespace,ANNOTATIONS:.metadata.annotations,IMAGES:.spec.template.spec.containers[*].image'
NAME             NAMESPACE   ANNOTATIONS                               IMAGES
system-app-pre   bu-v215     map[system-app-deployment-generation:3]   registry.redhat.io/3scale-amp2/system-rhel8@sha256:4c72dfeb1919a1de4fbc47f4611d0c830ef89cf9f17255ad55a9b370618e0e8f
% oc get deployment system-app -o 'custom-columns=NAME:.metadata.name,NAMESPACE:.metadata.namespace,ANNOTATIONS:.metadata.annotations,GENERATION:.metadata.generation,IMAGES:.spec.template.spec.containers[*].image'
NAME         NAMESPACE   ANNOTATIONS                                GENERATION   IMAGES
system-app   bu-v215     map[deployment.kubernetes.io/revision:3]   7            registry.redhat.io/3scale-amp2/system-rhel9@sha256:3a01cbf4581099bad69c4fea7cf9182b275f5f60f38610de9a6c1f3038f1b949,registry.redhat.io/3scale-amp2/system-rhel9@sha256:3a01cbf4581099bad69c4fea7cf9182b275f5f60f38610de9a6c1f3038f1b949,registry.redhat.io/3scale-amp2/system-rhel9@sha256:3a01cbf4581099bad69c4fea7cf9182b275f5f60f38610de9a6c1f3038f1b949

Verification

The regression test case is implemented in the test suite under the name "EDGE CASE: Both image AND revision changed - old job deleted and new one created".

Because this PR also renames the annotation, the next upgrade will fall into the test case "UPGRADE: Job exists without annotation during upgrade - deleted and recreated".

urbanikb requested a review from a team as a code owner on January 29, 2026 18:25
briangallagher (Contributor) commented Jan 29, 2026

Snyk checks have passed. No issues have been found so far.

Scanner                 Critical  High  Medium  Low  Total (0)
Open Source Security    0         0     0       0    0 issues
Licenses                0         0     0       0    0 issues


@@ -0,0 +1,14 @@
package helper
A Contributor commented:

In my opinion, this test is useless; we're comparing two strings, yet we're letting the test determine the value of the annotation?

The Author replied:

I have removed the test file; it wasn't very good. It was meant to nag the developer if they change the value of the constant - sort of asking "was this intentional?".

Instead, I've made the test helper function createCompletedJob use a string literal instead of the constant. This will have the same effect, but it will be asking "do you want to implement some handling for when there are completed jobs with the old annotation?".
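For illustration, a minimal sketch of what such a helper could look like; the name createCompletedJob comes from the comment above, but the annotation key literal and the rest of the shape are assumptions, not copied from the PR:

// Builds a fake, already-completed job for the tests, stamping the annotation
// with a hard-coded literal rather than the production constant, so a rename of
// the constant deliberately changes which code path the test exercises.
package helper_test

import (
	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func createCompletedJob(name, namespace, revision string) *batchv1.Job {
	return &batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{
			Name:      name,
			Namespace: namespace,
			// Hypothetical key, written as a literal on purpose (see comment above).
			Annotations: map[string]string{"system-app-deployment-revision": revision},
		},
		Status: batchv1.JobStatus{
			Conditions: []batchv1.JobCondition{
				{Type: batchv1.JobComplete, Status: corev1.ConditionTrue},
			},
		},
	}
}
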

tkan145 (Contributor) commented Feb 2, 2026

/retest

tkan145 (Contributor) commented Feb 2, 2026

/ok-to-test

tkan145 (Contributor) commented Feb 4, 2026

Thanks, the code looks much better.

I have performed the following tests:

  • Fresh install
  • Rollout of system-app
  • Scaling up/down
  • Upgrade

During the upgrade, the system-app-pre hooks run AFTER system-app, which is the behavior we've seen in THREESCALE-12013:

 ▲ lib/3scale/3scale-operator oc get pod -o wide | grep system-app                                                                                         
system-app-84486b9f9c-btdzh                                       3/3     Running     0               8m25s   10.217.1.224   crc    <none>           <none>
system-app-post-zsjfw                                             0/1     Completed   0               6m54s   10.217.1.232   crc    <none>           <none>
system-app-pre-s6h25                                              0/1     Completed   0               7m24s   10.217.1.231   crc    <none>           <none>

I'll run the test again tomorrow. In the meantime, could you build the bundle and check the upgrade process?

This PR also includes all the edge cases I could think of, even though we only hit the one described above.

The function needed to deal with the cases where the job didn't exist or the annotation didn't exist. It could return an error, but that would just complicate the calling code.

The new function needAppHookJobRerun returns false if the job doesn't exist or its revision matches the deployment revision. It returns true if the job revision is unknown or the deployment revision has changed.
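Based purely on the behaviour stated above, a minimal sketch of what such a function might look like; the signature and the annotation key are assumptions, only the name needAppHookJobRerun and the stated return semantics come from the text:

// Decides whether the pre/post hook job must be re-created for the current
// deployment revision.
package sketch

import (
	appsv1 "k8s.io/api/apps/v1"
	batchv1 "k8s.io/api/batch/v1"
)

// jobRevisionAnnotation is a hypothetical key; the real (renamed) key lives in the PR.
const jobRevisionAnnotation = "system-app-deployment-revision"

func needAppHookJobRerun(job *batchv1.Job, d *appsv1.Deployment) bool {
	if job == nil {
		// No existing job: nothing to rerun (creation is handled elsewhere).
		return false
	}
	jobRev, ok := job.Annotations[jobRevisionAnnotation]
	if !ok {
		// The job's revision is unknown: play it safe and rerun.
		return true
	}
	// Rerun only when the deployment revision has moved past the job's revision.
	return jobRev != d.Annotations["deployment.kubernetes.io/revision"]
}
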
urbanikb (Author) commented Feb 5, 2026

During the upgrade, the system-app-pre hooks run AFTER system-app, which is the behavior we've seen in THREESCALE-12013:

 ▲ lib/3scale/3scale-operator oc get pod -o wide | grep system-app                                                                                         
system-app-84486b9f9c-btdzh                                       3/3     Running     0               8m25s   10.217.1.224   crc    <none>           <none>
system-app-post-zsjfw                                             0/1     Completed   0               6m54s   10.217.1.232   crc    <none>           <none>
system-app-pre-s6h25                                              0/1     Completed   0               7m24s   10.217.1.231   crc    <none>           <none>

This should not be possible on the upgrade path - the deployment should update only after the job with revision+1 has been scheduled.

Having the jobs run afterwards is possible on the "manual trigger" path - on a revision change - although it shouldn't take a minute to trigger, so even for that path it seems a little off. If you still have this instance, could you check the logs of the "pre" hook to see whether the migration was executed in that run or it was already a no-op?

I'll run the test again tomorrow. In the meantime, could you build the bundle and check the upgrade process?

I built the bundle but couldn't reproduce the state where the migration isn't run via the "image changed" path - the bundle I tested with is this one (it contains the CSV from 2.15 and an "alpha" with quay.io images):

apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: threescale-dev
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  image: quay.io/burbanik/3scale-operator-index:v0.2.0

No tests have been changed - this is mainly to change the flow
of the Reconcile function to avoid calling ReconcileJob when
the revision has not changed.
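As a rough illustration of that flow change, building on the needAppHookJobRerun sketch earlier; the reconciler shape and helper names here are invented, not taken from the diff:

// Skips the job reconcile entirely when no rerun is needed, so a metadata-only
// pass can no longer overwrite the annotation without re-creating the job.
package sketch

import (
	appsv1 "k8s.io/api/apps/v1"
	batchv1 "k8s.io/api/batch/v1"
)

// appHookReconciler is a stand-in for the real reconciler type.
type appHookReconciler struct {
	deleteJob    func(*batchv1.Job) error
	reconcileJob func(*batchv1.Job) error
}

func (r *appHookReconciler) reconcilePreHook(existing, desired *batchv1.Job, d *appsv1.Deployment) error {
	if !needAppHookJobRerun(existing, d) {
		// Revision unchanged: leave the existing job untouched.
		return nil
	}
	if existing != nil {
		if err := r.deleteJob(existing); err != nil {
			return err
		}
	}
	return r.reconcileJob(desired)
}
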
