Skip to content

TemporalWorkerOwnedResource: attach HPAs, PDBs, and other resources to versioned Deployments#215

Closed
carlydf wants to merge 22 commits intomainfrom
temporal-worker-owned-resource
Closed

TemporalWorkerOwnedResource: attach HPAs, PDBs, and other resources to versioned Deployments#215
carlydf wants to merge 22 commits intomainfrom
temporal-worker-owned-resource

Conversation

@carlydf
Copy link
Collaborator

@carlydf carlydf commented Mar 3, 2026

Summary

  • Adds a new TemporalWorkerOwnedResource CRD (shortname: twor) that lets users attach arbitrary namespaced Kubernetes resources (HPAs, PodDisruptionBudgets, KEDA ScaledObjects, etc.) to each worker version with running workers
  • The controller creates one copy per worker version with a running Deployment, auto-injects scaleTargetRef and selector.matchLabels when set to null, and cleans up via Kubernetes owner-reference GC when the versioned Deployment is sunset
  • Adds a validating webhook (always-on, requires cert-manager) that enforces resource kind restrictions and SubjectAccessReview checks against both the requesting user and the controller SA
  • Helm chart: new ownedResourceConfig values section, cert-manager Issuer+Certificate, updated RBAC, certmanager.caBundle for advanced TLS setups
  • Integration test covering the full HPA happy path
  • Full documentation: docs/owned-resources.md, examples/twor-hpa.yaml, demo README cert-manager setup, main README updated

Key design decisions

  • Auto-injection triggered by explicit null sentinel (absent = no injection; non-null = webhook rejects)
  • Owner reference points to the versioned Deployment, not the TWD, so k8s GC handles cleanup automatically
  • SSA field manager is twc/{namespace}/{name}, unique per TWOR instance
  • Resource names capped at 47 characters (safe for all types including Deployment pod-name constraints); 8-char hash suffix guarantees uniqueness per (twdName, tworName, buildID) triple even after prefix truncation
  • cert-manager required for webhook TLS (no untested BYO-cert path documented)

Test plan

  • go test ./... passes
  • make test-integration passes (28/28 tests including new twor-creates-hpa-per-build-id)
  • helm lint helm/temporal-worker-controller/ passes
  • Apply examples/twor-hpa.yaml against helloworld demo in minikube and verify one HPA per Build ID

🤖 Generated with Claude Code

carlydf and others added 22 commits February 26, 2026 15:53
…ned Deployments

Introduces a new `TemporalWorkerOwnedResource` (TWOR) CRD that lets users attach
arbitrary namespaced Kubernetes resources (HPA, PDB, WPA, custom CRDs, etc.) to
each per-Build-ID versioned Deployment managed by a TemporalWorkerDeployment.

Key design points:
- One copy of the attached resource is created per active Build ID, owned by the
  corresponding versioned Deployment — Kubernetes GC deletes it automatically when
  the Deployment is removed, requiring no explicit cleanup logic.
- Resources are applied via Server-Side Apply (create-or-update), so the controller
  is idempotent and co-exists safely with other field managers (e.g. the HPA controller).
- Two-layer auto-population for well-known fields:
    Layer 1: `scaleTargetRef: null` and `matchLabels: null` in spec.object are
             auto-injected with the versioned Deployment's identity and selector labels.
    Layer 2: Go template expressions (`{{ .DeploymentName }}`, `{{ .BuildID }}`,
             `{{ .Namespace }}`) are rendered in all string values before apply.
- Generated resource names use a hash-suffix scheme (`{prefix}-{8-char-hash}`) to
  guarantee uniqueness per (twdName, tworName, buildID) triple even when the prefix
  is truncated; the buildID is always represented in the hash regardless of name length.
- `ComputeSelectorLabels` is now the single source of truth for selector labels used
  both in Deployment creation and in owned-resource matchLabels injection.
- Partial-failure isolation: all owned resources are attempted on each reconcile even
  if some fail; errors are collected and surfaced together.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Extract getOwnedResourceApplies into planner package so it can be
  tested without a live API client
- Add OwnedResourceApply type and OwnedResourceApplies slice to Plan
- Thread twors []TemporalWorkerOwnedResource through GeneratePlan
- Add TestGetOwnedResourceApplies (8 cases: nil/empty inputs, N×M
  cartesian, nil Raw skipped, invalid template skipped)
- Add TestGetOwnedResourceApplies_ApplyContents (field manager, kind,
  owner reference, deterministic name)
- Add TestGetOwnedResourceApplies_FieldManagerDistinctPerTWOR
- Add two TWOR cases to TestGeneratePlan for end-to-end count check
- Add helpers: createTestTWOR, createDeploymentWithUID,
  createTestTWORWithInvalidTemplate

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Both the controller plan field and the planner Plan field now share the
same name, making the copy-assignment self-documenting:
  plan.ApplyOwnedResources = planResult.ApplyOwnedResources

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Users don't need to template the k8s namespace (they already know it
when creating their TWOR in that namespace). The Temporal namespace is
more useful since it configures where the worker connects to.

- TemplateData.Namespace → TemplateData.TemporalNamespace
- RenderOwnedResource gains a temporalNamespace string parameter
- getOwnedResourceApplies threads the value from
  spec.WorkerOptions.TemporalNamespace down to RenderOwnedResource
- Update all tests: {{ .Namespace }} → {{ .TemporalNamespace }}
- GoTemplateRendering test now uses distinct k8s ns ("k8s-production")
  and Temporal ns ("temporal-production") to make the difference clear

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Implements the admission webhook for TemporalWorkerOwnedResource with:
- Pure spec validation: apiVersion/kind required, metadata.name/namespace
  forbidden, banned kinds (Deployment/StatefulSet/Job/Pod/CronJob by default),
  minReplicas≠0, scaleTargetRef/matchLabels absent-or-null enforcement
- API checks: RESTMapper namespace-scope assertion, SubjectAccessReview for
  the requesting user and controller SA (with correct SA group memberships)
- ValidateUpdate enforces workerRef.name immutability and uses verb="update"
- ValidateDelete checks delete permissions on the underlying resource
- Helm chart: injects POD_NAMESPACE and SERVICE_ACCOUNT_NAME via downward API,
  BANNED_KINDS from ownedResources.bannedKinds values

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- cmd/main.go: register TemporalWorkerOwnedResourceValidator unconditionally
- webhook.yaml: rewrite to always create the webhook Service and TWOR
  ValidatingWebhookConfiguration; TWD validating webhook remains optional
  behind webhook.enabled
- certmanager.yaml: fix service DNS names, remove fail guard, default enabled
- manager.yaml: move cert volume mount and webhook port outside the
  webhook.enabled gate so the webhook server always starts
- values.yaml: default certmanager.enabled to true, clarify that
  webhook.enabled only controls the optional TWD webhook

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add helm/crds/temporal.io_temporalworkerownedresources.yaml so Helm
  installs the CRD before the controller starts
- Add temporalworkerownedresources get/list/watch/patch/update rules to
  the manager ClusterRole so the controller can watch and update status
- Add authorization.k8s.io/subjectaccessreviews create permission for
  the validating webhook's SubjectAccessReview checks
- Add editor and viewer ClusterRoles for end-user RBAC on TWOR objects

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
TemporalWorkerOwnedResource supports arbitrary user-defined resource
types (HPA, PDB, custom CRDs) that are not known at install time.
Add a wildcard rule to the manager ClusterRole so the controller can
create/get/patch/update/delete any namespaced resource on behalf of
TWOR objects.

Security note: the TWOR validating webhook is a required admission
control that verifies the requesting user has permission on the
embedded resource type before the TWOR is admitted, so the
controller's broad permissions act as executor, not gatekeeper.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace the wildcard ClusterRole rule with a configurable list of
explicit resource type rules. Default to HPA and PDB — the two primary
documented TWOR use cases. Wildcard mode is still available as an
opt-in via ownedResources.rbac.wildcard=true for development clusters
or when users attach many different custom CRD types.

Operators add entries to ownedResources.rbac.rules for each additional
API group their TWOR objects will use (e.g. keda.sh/scaledobjects).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Consolidates bannedKinds and rbac under a single top-level key for
clarity. Update all template references accordingly.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add TWORName, TWORNamespace, BuildID to OwnedResourceApply so the
executor knows which status entry to update after each apply attempt.

Refactor the apply loop in execplan.go to collect per-(TWOR, BuildID)
results (success or error) and then, after all applies complete, write
OwnedResourceVersionStatus entries back to each TWOR's status
subresource. This means:

- Applied=true + ResourceName set on success
- Applied=false + Message set on failure
- All Build IDs for a TWOR are written atomically in one status update
- Apply errors and status write errors are both returned via errors.Join
  so the reconcile loop retries on either kind of failure

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace the double-nested errors.Join with a single call over the
concatenated slice, which is equivalent and more readable.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When a TemporalWorkerDeployment is reconciled, ensure each
TemporalWorkerOwnedResource referencing it has an owner reference
pointing back to the TWD (controller: true). This lets Kubernetes
garbage-collect TWOR objects automatically when the TWD is deleted.

The patch is skipped when the reference is already present (checked via
metav1.IsControlledBy) to avoid a write on every reconcile loop.
Uses client.MergeFrom to avoid conflicts with concurrent modifications.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
genplan should only read state and build a plan — not perform writes.
Instead of patching TWORs directly in generatePlan, build (base,
patched) pairs in genplan.go (pure computation) and let executePlan
apply them, consistent with how all other writes are structured.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add TWOROwnerRefPatch type and EnsureTWOROwnerRefs to planner.Plan
- Add getTWOROwnerRefPatches to planner package, unit-tested in
  planner_test.go (TestGetTWOROwnerRefPatches)
- GeneratePlan now accepts a twdOwnerRef and populates
  EnsureTWOROwnerRefs; genplan.go builds the OwnerReference from the
  TWD object and passes it through
- Owner ref patch failures in execplan.go now log-and-continue so that
  a deleted TWOR (race between list and patch) cannot block the more
  important owned-resource apply step

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ing it pre-built

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…bhook validator

- Add +kubebuilder:object:generate=false to TemporalWorkerOwnedResourceValidator
  (client.Client interface field was blocking controller-gen)
- Regenerate zz_generated.deepcopy.go: adds GateInputSource, GateWorkflowConfig,
  OwnedResourceVersionStatus deepcopy that were missing from the manual edit
- Regenerate CRD manifests: adds type:object to spec.object in TWOR CRD, field
  ordering change in TWD CRD
- Remove now-unused metav1 import from genplan.go (was missed in prior commit)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Tests the full reconciliation loop: create TWOR with HPA spec → controller
applies one HPA per active Build ID via SSA → asserts scaleTargetRef is
auto-injected with the correct versioned Deployment name → asserts
TWOR.Status.Versions shows Applied: true → asserts TWD controller owner
reference is set on the TWOR.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…functions

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…etup

- docs/owned-resources.md: full TWOR reference (auto-injection, RBAC, webhook TLS, examples)
- examples/twor-hpa.yaml: ready-to-apply HPA example for the helloworld demo
- helm/webhook.yaml + values.yaml: add certmanager.caBundle for BYO TLS without cert-manager
- internal/demo/README.md: add cert-manager install step and TWOR demo walkthrough
- README.md + docs/README.md: add cert-manager prerequisite, TWOR feature bullet, and doc link

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Truncate owned resource names to 47 chars (safe for Deployments; avoids
  per-resource-type special cases if Deployment is ever un-banned)
- Fix docs: replace "active Build ID" with "worker version with running workers"
  throughout; "active" is reserved for Ramping/Current versions
- Fix docs: owned-resource deletion is due to versioned Deployment sunset, not
  a separate "version delete" operation
- Fix docs: scaleTargetRef injection applies to any resource type with that field,
  not just HPA; clarify webhook rejects non-null values because controller owns them
- Fix docs: remove undocumented/untested BYO TLS path; cert-manager is required
- Fix docs: expand TWOR abbreviation to full name throughout; remove ⏳ autoscaling
  bullet from README and clarify TWOR is the path for metric/backlog-based autoscaling
- Add note on how to inspect the banned kinds list (BANNED_KINDS env var)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@carlydf carlydf requested review from a team and jlegrone as code owners March 3, 2026 00:11
@carlydf carlydf closed this Mar 3, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant