fix: recent events touch refactor, cache buster worker name reference #2999

Merged
Ziinc merged 14 commits into main from fix/recent-events-touch on Dec 4, 2025

Ziinc commented Dec 3, 2025

Two big fixes:

  • Fixes the partition supervisor name (discovered by @bblaszkow06 🙏)
  • Removes GenSingleton usage from non-stable processes

A high volume of SourceSup starts and stops across the clusters results in broadcast storms, with syn getting overloaded by messages about procs starting and stopping, which leads to high message queue counts.

As conflict resolution is handled by a single process per scope, a backed-up message queue can result in bootloops, even if we always choose the original process with nanosecond resolution. Although clock time discrepancies are a possibility, I am doubtful that they are the root cause of the broadcast storm we are seeing on prod.
The previous partitioning of the recent events scope was done to mitigate this, but removing GenSingleton usage altogether will fix the root problem.
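
The "always choose the original process" rule mentioned above can be sketched roughly as follows. This is a language-neutral illustration in Python, not syn's or this PR's actual code: the `Registration` type and `resolve_conflict` function are hypothetical names, and the registration time is assumed to be an integer nanosecond timestamp.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Registration:
    """Hypothetical record of one process registered under a name."""
    node: str
    pid: str
    registered_at_ns: int  # assumed: registration time in nanoseconds


def resolve_conflict(a: Registration, b: Registration) -> Registration:
    """Keep the registration that was made first (the "original" process).

    On an exact timestamp tie, the first argument is kept.
    """
    return a if a.registered_at_ns <= b.registered_at_ns else b
```

Under this rule the outcome depends only on the timestamps, so clock skew between nodes could pick the wrong "original". And because every conflict is still resolved by the single per-scope process, the rule itself does nothing to relieve a backed-up message queue.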

Furthermore, the fact that it starts occurring across all clusters in a synchronized manner leads me to strongly believe that it is related to the scheduled SourceSup shutdown, which happens every half hour across all nodes.

Given the above hypothesis, this refactor achieves the following:

  • moves UI procs to their own :syn scope
  • moves the general Quantum scheduler to its own global process using GenSingleton
  • runs the recent events touch every 5 minutes on at most 500 sources
  • performs at most 5 transactions every 5 minutes when updating sources
  • increases the auto-shutdown interval to hourly on all nodes
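
The two limits above (at most 500 sources per run, at most 5 transactions per run) imply that sources are updated in batches. Below is a minimal Python sketch of that batching arithmetic. It assumes one transaction per batch and an even split of the capped source list; the function name `plan_touch_batches` and the splitting strategy are illustrative assumptions, not taken from the PR's code.

```python
from typing import List, Sequence

MAX_SOURCES_PER_RUN = 500      # limit stated in the PR description
MAX_TRANSACTIONS_PER_RUN = 5   # limit stated in the PR description


def plan_touch_batches(source_ids: Sequence[int]) -> List[List[int]]:
    """Split one run's source ids into at most MAX_TRANSACTIONS_PER_RUN batches."""
    # Cap the run; sources beyond the cap are left for a later run.
    capped = list(source_ids[:MAX_SOURCES_PER_RUN])
    if not capped:
        return []
    # Ceiling division keeps the number of batches within the transaction cap.
    batch_size = -(-len(capped) // MAX_TRANSACTIONS_PER_RUN)
    return [capped[i:i + batch_size] for i in range(0, len(capped), batch_size)]
```

With a full run of 500 sources this yields 5 batches of 100, so the database sees at most 5 write transactions per 5-minute interval regardless of how many sources are active.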

To achieve the global job scheduler, I added an additional config option to Quantum that prevents the startup of the built-in Task.Supervisor, so the global run strategy now works without duplicating jobs.

  • An upstream PR has been opened for the new config option

Expected outcomes of this PR:

  • significant reduction in DB transactions per second
  • no message queue buildup on :syn_gen_scope for the :core scope
  • less frequent memory reclaiming from SourceSup auto-shutdown
  • cross-cluster cache busting working again

Ziinc requested review from a team, amokan and chasers on December 3, 2025 21:11

Ziinc commented Dec 4, 2025

Will be merging this early to get it to prod.

josevalim left a comment

Hi @Ziinc 👋 some comments inline!

Ziinc commented Dec 4, 2025

Follow-up issues have been created on Linear relating to a percentage-based run strategy.

Ziinc merged commit f625ba2 into main on Dec 4, 2025
8 checks passed
Ziinc deleted the fix/recent-events-touch branch on December 4, 2025 08:55
2 participants