fix: recent events touch refactor, cache buster worker name reference #2999
Merged
I will be merging this early to get it to prod.
Follow-up issues have been created on Linear relating to the percentage-based run strategy.
Two big fixes:
A high volume of SourceSup starts and stops across the clusters results in broadcast storms, with syn getting overloaded with messages about procs starting/stopping, which in turn leads to high message queue counts.
As conflict resolution is handled by a single process per scope, a backed-up message queue can result in bootloops, even if we always choose the original process with nanosecond resolution. Although clock time discrepancies are a possibility, I am doubtful that they are the root cause of the broadcast storm we are seeing on prod.
Previous partitioning of the recent events scope was done to mitigate this, but removing GenSingleton usage altogether will fix the root problem.
Furthermore, the fact that it starts to occur across all clusters in a synchronized manner leads me to believe strongly that it is related to the scheduled SourceSup shutdown, which happens every half hour across all nodes.
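For reference, the "always choose the original process" rule mentioned above can be sketched roughly as follows. This is only an illustration, not the code in this PR: the module name `ConflictSketch`, the function `keep_original/2` and the `{pid, meta, registered_at_ns}` tuple shape are assumptions made for the example. Note that the rule only decides which pid wins. It does nothing about the resolver's mailbox backing up, which is the failure mode described above.

```elixir
defmodule ConflictSketch do
  @moduledoc false

  # Hypothetical helper: each candidate is {pid, meta, registered_at_ns},
  # where registered_at_ns is the registration time in nanoseconds.
  # The process with the earliest timestamp (the "original") is kept.
  # Ties go to the first candidate so the result is deterministic.
  def keep_original({pid1, _meta1, t1}, {pid2, _meta2, t2}) do
    if t1 <= t2, do: pid1, else: pid2
  end
end

# Usage: two competing registrations for the same name.
older = spawn(fn -> :ok end)
newer = spawn(fn -> :ok end)

kept = ConflictSketch.keep_original({newer, nil, 2_000}, {older, nil, 1_000})
IO.inspect(kept == older)
```

Because the comparison relies on timestamps taken on different nodes, clock discrepancies could in principle flip the result, which is the caveat raised above.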
Given the above hypothesis, the refactoring achieves the following:
To achieve the global job scheduler, I added an additional config to Quantum to prevent the startup of the inbuilt Task.Supervisor, so the global run strategy now works without duplicating jobs.
Expected outcomes of this PR: