Add pre/post lease hooks #140

Open
kirkbrauer wants to merge 95 commits into main from add-hooks

Conversation


@kirkbrauer kirkbrauer commented Jan 24, 2026

Summary

Implement hook scripts that can be configured in the exporter config to run j commands within a temporary lease environment. This PR combines changes across Python, Controller, and Protocol components.

Environment Variables

The following environment variables are exposed to hook scripts:

  • JUMPSTARTER_HOST - Path to the Unix socket for j CLI access
  • JMP_DRIVERS_ALLOW=UNSAFE - Allows all drivers for local access
  • LEASE_NAME - The name of the current lease
  • CLIENT_NAME - Name of the client that acquired the lease
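
For illustration, an executor might assemble this environment for the hook subprocess roughly as follows (the function name and signature are assumptions for the sketch, not the actual HookExecutor API):

```python
import os

def build_hook_env(lease_name: str, client_name: str, socket_path: str) -> dict:
    """Build the environment passed to a hook subprocess (illustrative)."""
    env = dict(os.environ)  # inherit the exporter's own environment
    env.update({
        "JUMPSTARTER_HOST": socket_path,  # Unix socket for j CLI access
        "JMP_DRIVERS_ALLOW": "UNSAFE",    # allow all drivers for local access
        "LEASE_NAME": lease_name,
        "CLIENT_NAME": client_name,
    })
    return env
```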

Example Configuration

apiVersion: jumpstarter.dev/v1alpha1
kind: ExporterConfig
metadata:
  namespace: default
  name: test-exporter
endpoint: grpc.jumpstarter.100.123.60.47.nip.io:8082
hooks:
  beforeLease:
    script: |
      echo "Starting lease setup for ${CLIENT_NAME}"
      j power on
      echo "Pre-lease setup completed!"
    timeout: 120
    onFailure: exit
  afterLease:
    script: |
      echo "Starting lease cleanup for ${CLIENT_NAME}"
      j power off
      echo "Post-lease cleanup completed!"
    timeout: 60
    onFailure: warn
tls:
  ca: ''
  insecure: true
token: abc123
export:
  power:
    type: "jumpstarter_driver_power.driver.MockPower"

Changes

Python (python/)

  • Hook executor with before/after lease lifecycle support
  • Configurable timeouts and failure modes (exit, warn)
  • Exporter status reporting and streaming
  • EndSession support for graceful session termination
  • Typed Protobuf/gRPC stubs generation
  • Status field exposed in CLI (jmp admin get exporter)
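
The onFailure modes from the example configuration (exit, warn) map naturally to a small dispatch; a hypothetical sketch, with names chosen for illustration rather than taken from the real HookExecutor:

```python
import logging

class HookFailure(Exception):
    """Raised when a hook fails and onFailure is set to 'exit'."""

def handle_hook_failure(hook_name: str, exit_code: int, on_failure: str) -> None:
    # 'exit' aborts the lease flow; 'warn' logs the failure and continues.
    if on_failure == "exit":
        raise HookFailure(f"{hook_name} hook failed with exit code {exit_code}")
    elif on_failure == "warn":
        logging.getLogger(__name__).warning(
            "%s hook failed with exit code %d, continuing", hook_name, exit_code
        )
    else:
        raise ValueError(f"unknown onFailure mode: {on_failure}")
```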

Controller (controller/)

  • Full ExporterStatus field support with state transitions
  • Status updates reflected from exporters in real-time
  • EndSession API implementation
  • Exporter-initiated lease release support
  • CRDs managed via Helm chart templates

Protocol (protocol/)

  • EndSession API for keeping connections open after lease ends
  • release_lease field for exporter-initiated lease release
  • status_version and previous_status fields for transition tracking

Related PRs

Summary by CodeRabbit

  • New Features

    • Rich exporter status (state, message, version); ReportStatus may request lease release; EndSession RPC; background status monitor; configurable before/after-lease hooks with timeouts, on-failure policies and live hook logs.
  • Bug Fixes

    • Improved online/offline detection, stricter driver-call readiness checks, connection retry/backoff, and safer lease-release sequencing.
  • Chores

    • CRDs and installer regenerated; printer columns for Status/Message added; Helm deploys skip CRD install; ArgoCD sync-wave annotation injected.
  • Tests

    • Expanded unit, integration and e2e coverage for hooks, status monitor, controller flows and logging.
  • Documentation

    • New Hooks docs, glossary and updated configuration/getting-started guides.


coderabbitai bot commented Jan 24, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

Adds exporter lifecycle hooks (beforeLease/afterLease), exporter status fields/enums and status_message, new ReportStatus release_lease flag and EndSession RPC, StatusMonitor and client status APIs, HookExecutor and LeaseContext, CRD/schema and printer-column updates, protobuf/grpc regeneration, CLI/status rendering, extensive tests and e2e hook configs, and small Helm/manifest build tweaks.

Changes

Cohort / File(s) Summary
Protocol & Generated Stubs
protocol/proto/jumpstarter/v1/jumpstarter.proto, python/packages/jumpstarter-protocol/.../jumpstarter_pb2*.py, .../*.pyi, ..._grpc.py, ..._grpc.pyi
Add ReportStatus.release_lease, EndSession RPC/messages, GetStatus fields (status_version, previous_status), LogStream.source; regenerate protobuf/grpc runtime and typing stubs.
CRDs & Manifests
controller/deploy/helm/jumpstarter/crds/*, controller/deploy/helm/jumpstarter/charts/.../templates/crds/*, controller/deploy/operator/config/crd/bases/*, controller/deploy/operator/bundle/manifests/*
Add CRDs (Exporter, Lease, Client, ExporterAccessPolicy); add exporterStatus and statusMessage to Exporter schema and printer columns; copy CRDs into chart templates and bundle.
Controller Makefile / Helm
controller/Makefile, controller/hack/deploy_with_helm.sh
Inject ArgoCD sync-wave annotation into generated role YAML, copy CRDs into chart templates, regenerate operator installer, and use Helm --skip-crds.
Controller API & Service
controller/api/v1alpha1/exporter_types.go, controller/api/v1alpha1/exporter_helpers.go, controller/internal/controller/exporter_controller.go, controller/internal/service/controller_service.go
Add ExporterStatus fields/constants and printer columns; map CRD↔proto status; adjust online/offline reconcile logic; add ControllerService.ReportStatus handler and lease-release helpers.
Exporter runtime & Hooks
python/packages/jumpstarter/jumpstarter/exporter/... (hooks.py, lease_context.py, session.py, logging*, exporter.py)
Implement HookExecutor (PTY subprocess execution, timeouts, onFailure modes), LeaseContext, multi-socket/session and EndSession handling, per-logger LogSource routing, and exporter lifecycle orchestration.
Status Monitor & Client
python/packages/jumpstarter/jumpstarter/client/status_monitor.py, .../core.py, .../grpc.py, .../lease.py
Add StatusMonitor (background GetStatus polling, waiters, callbacks); extend AsyncDriverClient with status APIs (get/check/wait/end_session), enforce status preconditions for driver calls, and add dial_timeout retry behavior.
Common types & enums
python/packages/jumpstarter/jumpstarter/common/enums.py, .../types.py, .../__init__.py
Add ExporterStatus and LogSource enums with proto conversions, gRPC stub type aliases, and export them from common.
CLI, K8s models & Rendering
python/packages/jumpstarter-cli/..., python/packages/jumpstarter-kubernetes/..., python/packages/jumpstarter-cli-admin/*
Plumb include_status/show_status through CLI and models, add STATUS column to rich tables, update serializers and tests to include exporterStatus/statusMessage.
Hooks Tests & E2E
python/packages/.../hooks_test.py, e2e/exporters/*, e2e/tests-hooks.bats
Add unit tests for HookExecutor, StatusMonitor, LeaseContext, session behavior, controller integration tests, many e2e configs and BATS tests exercising hook scenarios and failure modes.
Build tooling & Misc
python/buf.gen.yaml, controller/.gitignore, e2e/run-e2e.sh
Add mypy-protobuf plugins, ignore operator bin dir, update e2e runner to discover multiple .bats tests, and manifest generation tweaks (sync-wave injection, CRD copy, skip-crds).

Sequence Diagram(s)

sequenceDiagram
    rect rgba(200,230,255,0.5)
    participant Client
    participant Exporter
    participant HookExec as HookExecutor
    participant Controller
    end
    Client->>Exporter: Request lease / Dial
    Exporter->>Controller: ReportStatus(status=BEFORE_LEASE_HOOK)
    Exporter->>HookExec: execute_before_lease_hook(LeaseContext)
    HookExec-->>Exporter: result (success / failure)
    alt before-hook success
        Exporter->>Controller: ReportStatus(status=LEASE_READY)
        Client->>Exporter: Driver calls allowed
    else onFailure=endLease
        Exporter->>Controller: ReportStatus(..., release_lease=true)
        Controller->>Controller: mark lease.release = true
    end
    Client->>Exporter: End session
    Exporter->>HookExec: execute_after_lease_hook(...)
    HookExec-->>Exporter: after-hook done
    Exporter->>Controller: ReportStatus(status=OFFLINE)
sequenceDiagram
    rect rgba(230,255,200,0.5)
    participant AsyncClient
    participant StatusMon as StatusMonitor
    participant ExporterSvc as ExporterService
    end
    AsyncClient->>StatusMon: start()
    loop polling
      StatusMon->>ExporterSvc: GetStatus()
      ExporterSvc-->>StatusMon: status (+version, previous_status)
      StatusMon-->>AsyncClient: notify on change / wake waiters
    end
    AsyncClient->>StatusMon: wait_for_status(LEASE_READY)
    StatusMon-->>AsyncClient: signal when LEASE_READY observed

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related issues

  • [Proposal] Enhanced Lease Statuses for Hooks #104 — Implements exporter status enums, statusMessage, LogSource and hook-related status/log streaming; aligns with this PR’s CRD/proto and runtime changes.
  • jumpstarter-dev/jumpstarter#522 — Requested exporter lifecycle hooks; this PR adds HookConfig/HookInstanceConfig, HookExecutor and hook wiring.
  • jumpstarter-dev/jumpstarter-protocol#25 — Protocol-level request for ExporterStatus/LogSource and GetStatus enhancements; matches proto/stub updates here.

Possibly related PRs

  • jumpstarter-dev/jumpstarter#297 — Overlaps logging surface (LogHandler/LogStream) and log-source routing added here.
  • jumpstarter-dev/jumpstarter#451 — Related CLI logging / RichHandler adjustments present in this PR.
  • jumpstarter-dev/jumpstarter#589 — Related to process/signal and exit-code semantics adjusted in CLI run logic present here.

Suggested labels

backport release-0.7

Suggested reviewers

  • kirkbrauer

Poem

🐰 I dug a hook and tunneled a log,

Before the lease I danced with a cog.
Status bells chimed and monitors peeped,
Hooks ran, leases cleared, and none overslept.
Carrots for reviewers — hop on and leap!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage — ⚠️ Warning. Docstring coverage is 42.59%, below the required threshold of 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

  • Description Check — ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check — ✅ Passed. The title 'Add pre/post lease hooks' is concise and directly describes the main feature addition, which aligns with the extensive changes across controller, Python, protocol, and end-to-end test components.




@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 12

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
python/packages/jumpstarter/jumpstarter/config/client.py (1)

184-196: Add status field to exporter reconstruction to preserve it when include_leases=True.

The exporter objects lose their status field when leases are included. The original exporters have a status field that should be preserved during reconstruction, especially when include_status=True is requested. Update lines 187-193 to include status=exporter.status.

Code snippet
exporter_with_lease = Exporter(
    namespace=exporter.namespace,
    name=exporter.name,
    labels=exporter.labels,
    online=exporter.online,
    status=exporter.status,  # Add this line
    lease=lease,
)
python/packages/jumpstarter/jumpstarter/config/exporter.py (1)

224-246: Remove unused device_factory parameter from HookExecutor.

The device_factory parameter passed to HookExecutor at line 231 is never used. HookExecutor only executes shell commands via subprocess and accesses the session for logging; it has no need for driver instantiation. Remove the unused parameter from both the HookExecutor dataclass definition and the instantiation call.

🤖 Fix all issues with AI agents
In
`@controller/deploy/helm/jumpstarter/charts/jumpstarter-controller/templates/crds/jumpstarter.dev_clients.yaml`:
- Around line 20-47: Update the CRD description strings so they reference
"Client" consistently instead of "identities" or "Identity": change the top
description "Client is the Schema for the identities API" to a concise "Client
is the Schema for the Client API" (or simply "Client"), and update the
spec.description "ClientSpec defines the desired state of Identity" and
status.description "ClientStatus defines the observed state of Identity" to use
"Client" (e.g., "ClientSpec defines the desired state of Client" and
"ClientStatus defines the observed state of Client") so all descriptive fields
in the CRD (the description at the top and the
spec.description/status.description entries) no longer reference
Identity/identities.

In
`@controller/deploy/helm/jumpstarter/charts/jumpstarter-controller/templates/crds/jumpstarter.dev_leases.yaml`:
- Around line 30-31: Update the CRD description for the Lease resource: replace
the current "Lease is the Schema for the exporters API" text in the description
property of the Lease CRD with the corrected wording using "leases" (e.g.,
"Leases is the Schema for the exporters API" or "leases is the Schema for the
exporters API") so the description uses the plural form; locate the description
property under the Lease CRD (kind: Lease) in the CRD template and change that
string accordingly.

In `@python/packages/jumpstarter-cli-common/jumpstarter_cli_common/opt.py`:
- Line 19: basicConfig assignment line exceeds max line length; split the call
across multiple lines to satisfy Ruff E501. Locate the basicConfig variable
initialization (basicConfig = partial(logging.basicConfig, format="%(message)s
[%(name)s]", handlers=[RichHandler(show_path=False)]) ) and reformat the
arguments onto separate lines (e.g., place format=... and handlers=[...] each on
their own lines and close parens on its own line) so the call to
partial(logging.basicConfig, ...) fits within line-length limits while
preserving the same arguments and use of RichHandler(show_path=False).
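
A reformatted call satisfying E501 might look like the following sketch (a stdlib StreamHandler stands in for rich's RichHandler(show_path=False) so the snippet is self-contained; the real module would pass the RichHandler):

```python
import logging
from functools import partial

# Arguments placed one per line so the partial() call stays under the
# line-length limit while preserving the same format and handlers.
basicConfig = partial(
    logging.basicConfig,
    format="%(message)s [%(name)s]",
    handlers=[logging.StreamHandler()],
)
```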

In `@python/packages/jumpstarter-cli/jumpstarter_cli/shell.py`:
- Around line 80-89: Remove the unused synchronous helper _run_shell_with_lease:
delete the entire function definition (including its docstring and with-blocks)
since the async variant _run_shell_with_lease_async is now the only one used;
then search the repo for any remaining references to _run_shell_with_lease and
remove or update them, run the test suite/lint to confirm no import or name
errors, and ensure behavior relies solely on _run_shell_with_lease_async.

In `@python/packages/jumpstarter/jumpstarter/client/status_monitor.py`:
- Around line 395-401: The async context manager on StatusMonitor is incomplete:
__aenter__ currently just returns self and never starts the background polling;
update __aenter__ to initialize the task group and call self.start() (mirroring
client.status_monitor_async() behavior) so entering the context begins
monitoring, and keep __aexit__ awaiting self.stop(); alternatively remove both
methods if context-manager usage is unsupported—modify the methods on the
StatusMonitor class (__aenter__, __aexit__) to either perform full start/stop
lifecycle (create task group + await self.start() in __aenter__, await
self.stop() in __aexit__) or delete them to avoid silent failures.
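
As a sketch of the full start/stop lifecycle suggested above, an async context manager that owns a background polling task could look like this (PollingMonitor and its members are illustrative stand-ins, not the real StatusMonitor):

```python
import asyncio

class PollingMonitor:
    """Illustrative stand-in for StatusMonitor's context-manager protocol."""

    def __init__(self, poll):
        self._poll = poll  # coroutine function that fetches status once
        self._task = None
        self.last_status = None

    async def _run(self):
        while True:
            self.last_status = await self._poll()
            await asyncio.sleep(0.01)

    async def start(self):
        self._task = asyncio.create_task(self._run())

    async def stop(self):
        if self._task is not None:
            self._task.cancel()
            try:
                await self._task
            except asyncio.CancelledError:
                pass

    async def __aenter__(self):
        await self.start()  # entering the context begins monitoring
        return self

    async def __aexit__(self, *exc):
        await self.stop()
```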

In `@python/packages/jumpstarter/jumpstarter/common/enums.py`:
- Around line 26-27: The docstring for the enum value AFTER_LEASE_HOOK (mapped
to common_pb2.EXPORTER_STATUS_AFTER_LEASE_HOOK) contains a typo; update the
string from "Lease was releaseed, but exporter is executing after lease hook" to
"Lease was released, but exporter is executing after lease hook" so the
documentation reads correctly.

In `@python/packages/jumpstarter/jumpstarter/exporter/exporter.py`:
- Around line 541-613: The Listen retry task started with
tg.start_soon(self._retry_stream, "Listen",
self._listen_stream_factory(lease_name), listen_tx) is attached to the outer
task group so it keeps running after the lease ends; move that start call into
the lease-scoped task group (use conn_tg.start_soon(...) before starting
wait_for_lease_end/process_connections) so the Listen task is cancelled with the
lease, and in the enclosing finally block ensure the send stream is closed
(await listen_tx.aclose() or the appropriate close method) to signal listen_rx
termination; reference symbols: _retry_stream,
_listen_stream_factory(lease_name), listen_tx/listen_rx, tg -> conn_tg, and
conn_tg.start_soon.
- Around line 202-321: Change _get_controller_stub to be an async context
manager (e.g., async def _controller_stub(self):) that creates the channel via
self.channel_factory(), yields a
jumpstarter_pb2_grpc.ControllerServiceStub(channel) and always closes the
underlying channel with await channel.close() in a finally block; then update
all callers to use "async with self._controller_stub() as controller:" instead
of awaiting _get_controller_stub(), specifically in _retry_stream (where
controller is acquired for stream_factory), _register_with_controller (Register
call), and _report_status (ReportStatus call) so channels are closed after use
and no resources leak.
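
A minimal sketch of the suggested pattern, with FakeChannel and FakeStub standing in for the real grpc channel and ControllerServiceStub:

```python
import asyncio
from contextlib import asynccontextmanager

class FakeChannel:
    def __init__(self):
        self.closed = False
    async def close(self):
        self.closed = True

class FakeStub:
    def __init__(self, channel):
        self.channel = channel

@asynccontextmanager
async def controller_stub(channel_factory):
    channel = channel_factory()
    try:
        yield FakeStub(channel)   # caller uses the stub inside the block
    finally:
        await channel.close()     # channel is always closed, even on error
```

Callers then acquire the stub with `async with controller_stub(factory) as controller:` so the channel cannot leak.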

In `@python/packages/jumpstarter/jumpstarter/exporter/hooks.py`:
- Around line 299-307: wait_for_process currently uses
anyio.to_thread.run_sync(process.wait, abandon_on_cancel=True) which can leave
the child as a zombie if the wait is abandoned; modify wait_for_process to
ensure the subprocess is reaped on cancellation by adding a try/finally: in the
try await the run_sync as before, but in the finally check process.poll() and if
it is None try to terminate/kill (process.terminate()/process.kill()) then call
a non-abandoning run_sync(process.wait) or loop-poll with process.poll() until
exit to fully reap the child; reference the wait_for_process coroutine and the
process variable when making the changes.
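
The reap-on-cancellation logic described above can be sketched as plain synchronous code (the real implementation wraps this in anyio's thread helpers; the helper name here is illustrative):

```python
import subprocess

def reap_process(process: subprocess.Popen) -> int:
    """Terminate and reap a child process so no zombie remains (sketch of
    the suggested finally-block logic)."""
    if process.poll() is None:        # still running: ask it to stop
        process.terminate()
        try:
            process.wait(timeout=5)   # non-abandoning wait to reap it
        except subprocess.TimeoutExpired:
            process.kill()            # escalate if it ignored SIGTERM
            process.wait()            # final reap
    return process.returncode
```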
- Around line 257-260: The debug call inside read_pty_output exceeds the
120-char line limit; shorten the line by splitting the log message across
multiple strings or moving the format string to a short variable and then
calling logger.debug. For example, break the message into two adjacent string
literals or assign msg = "read_pty_output: heartbeat at %.1fs, iterations=%d"
and call logger.debug(msg, elapsed, read_count) so you keep the same
logger.debug call and the same format arguments (logger, elapsed, read_count)
but comply with the E501 line-length rule.
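
Either suggested reformatting keeps the same logger.debug call and format arguments; for example:

```python
import logging

logger = logging.getLogger(__name__)
elapsed, read_count = 1.5, 42  # sample values for the sketch

# Option 1: bind the format string to a short variable first.
msg = "read_pty_output: heartbeat at %.1fs, iterations=%d"
logger.debug(msg, elapsed, read_count)

# Option 2: rely on adjacent string literal concatenation.
logger.debug(
    "read_pty_output: heartbeat at %.1fs, "
    "iterations=%d",
    elapsed,
    read_count,
)
```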
- Around line 186-189: The test mock for logging_session must support the
context manager used by logging_session.context_log_source in hooks.py; update
the test fixture (e.g., mock_session) so mock_session.context_log_source returns
an object that implements __enter__ and __exit__ (for async usage you can set
mock_session.context_log_source.return_value.__enter__ and __exit__ to AsyncMock
or wrap with contextlib.nullcontext), ensuring the context manager protocol is
honored when hooks.py calls logging_session.context_log_source.
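
For the synchronous with-statement case, one way to satisfy the context-manager protocol in the fixture is the nullcontext approach mentioned above (a sketch; the actual fixture names may differ):

```python
from contextlib import nullcontext
from unittest.mock import MagicMock

mock_session = MagicMock()
# Make context_log_source return a real context manager so hooks.py can
# use it in a `with` block without tripping on __enter__/__exit__.
mock_session.context_log_source.return_value = nullcontext()

with mock_session.context_log_source("BEFORE_LEASE_HOOK"):
    pass  # body runs; the protocol is honored
```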
🧹 Nitpick comments (10)
controller/deploy/helm/jumpstarter/charts/jumpstarter-controller/templates/crds/jumpstarter.dev_exporteraccesspolicies.yaml (2)

149-152: Add schema validation for duration and priority.

Without constraints, malformed durations or negative priorities can be accepted and only fail at runtime. Consider adding format/bounds in the CRD.

♻️ Suggested schema tightening
                   maximumDuration:
                     type: string
+                    format: duration
                   priority:
                     type: integer
+                    minimum: 0

40-157: Require spec/policies to avoid empty CRs.

Right now an empty resource (missing spec or policies) is valid. If that’s not intended, enforce it in the schema to reduce no-op objects.

♻️ Suggested required fields
           spec:
             description: ExporterAccessPolicySpec defines the desired state of ExporterAccessPolicy.
             properties:
               exporterSelector:
                 description: |-
                   A label selector is a label query over a set of resources. The result of matchLabels and
                   matchExpressions are ANDed. An empty label selector matches all objects. A null
                   label selector matches no objects.
                 properties:
                   matchExpressions:
                     description: matchExpressions is a list of label selector requirements.
                       The requirements are ANDed.
                     items:
                       description: |-
                         A label selector requirement is a selector that contains values, a key, and an operator that
                         relates the key and values.
                       properties:
                         key:
                           description: key is the label key that the selector applies
                             to.
                           type: string
                         operator:
                           description: |-
                             operator represents a key's relationship to a set of values.
                             Valid operators are In, NotIn, Exists and DoesNotExist.
                           type: string
                         values:
                           description: |-
                             values is an array of string values. If the operator is In or NotIn,
                             the values array must be non-empty. If the operator is Exists or DoesNotExist,
                             the values array must be empty. This array is replaced during a strategic
                             merge patch.
                           items:
                             type: string
                           type: array
                           x-kubernetes-list-type: atomic
                       required:
                       - key
                       - operator
                       type: object
                     type: array
                     x-kubernetes-list-type: atomic
                   matchLabels:
                     additionalProperties:
                       type: string
                     description: |-
                       matchLabels is a map of {key,value} pairs. A single {key,value} in the matchLabels
                       map is equivalent to an element of matchExpressions, whose key field is "key", the
                       operator is "In", and the values array contains only "value". The requirements are ANDed.
                     type: object
                 type: object
                 x-kubernetes-map-type: atomic
               policies:
                 items:
                   properties:
                     from:
                       items:
                         properties:
                           clientSelector:
                             description: |-
                               A label selector is a label query over a set of resources. The result of matchLabels and
                               matchExpressions are ANDed. An empty label selector matches all objects. A null
                               label selector matches no objects.
                             properties:
                               matchExpressions:
                                 description: matchExpressions is a list of label selector
                                   requirements. The requirements are ANDed.
                                 items:
                                   description: |-
                                     A label selector requirement is a selector that contains values, a key, and an operator that
                                     relates the key and values.
                                   properties:
                                     key:
                                       description: key is the label key that the selector
                                         applies to.
                                       type: string
                                     operator:
                                       description: |-
                                         operator represents a key's relationship to a set of values.
                                         Valid operators are In, NotIn, Exists and DoesNotExist.
                                       type: string
                                     values:
                                       description: |-
                                         values is an array of string values. If the operator is In or NotIn,
                                         the values array must be non-empty. If the operator is Exists or DoesNotExist,
                                         the values array must be empty. This array is replaced during a strategic
                                         merge patch.
                                       items:
                                         type: string
                                       type: array
                                       x-kubernetes-list-type: atomic
                                   required:
                                   - key
                                   - operator
                                   type: object
                                 type: array
                                 x-kubernetes-list-type: atomic
                               matchLabels:
                                 additionalProperties:
                                   type: string
                                 description: |-
                                   matchLabels is a map of {key,value} pairs. A single {key,value} in the matchLabels
                                   map is equivalent to an element of matchExpressions, whose key field is "key", the
                                   operator is "In", and the values array contains only "value". The requirements are ANDed.
                                 type: object
                             type: object
                             x-kubernetes-map-type: atomic
                         type: object
                       type: array
                     maximumDuration:
                       type: string
                     priority:
                       type: integer
                     spotAccess:
                       type: boolean
                   type: object
                 type: array
+                minItems: 1
             type: object
+            required:
+            - policies
           status:
             description: ExporterAccessPolicyStatus defines the observed state of
               ExporterAccessPolicy.
             type: object
         type: object
+        required:
+        - spec

Also applies to: 162-162

python/packages/jumpstarter-cli/jumpstarter_cli/shell.py (1)

157-161: Redundant timeout configuration.

Both move_on_after(300) and wait_for_any_of(..., timeout=300.0) specify 300 seconds. Since wait_for_any_of has its own timeout, the outer move_on_after is redundant. Consider either removing the outer timeout or differentiating them if different semantics are intended (e.g., outer as a hard cap slightly longer than inner).

♻️ Suggested simplification
-                                        with anyio.move_on_after(300) as timeout_scope:  # 5 minute timeout
-                                            result = await monitor.wait_for_any_of(
-                                                [ExporterStatus.AVAILABLE, ExporterStatus.AFTER_LEASE_HOOK_FAILED],
-                                                timeout=300.0
-                                            )
+                                        result = await monitor.wait_for_any_of(
+                                            [ExporterStatus.AVAILABLE, ExporterStatus.AFTER_LEASE_HOOK_FAILED],
+                                            timeout=300.0
+                                        )
+                                        if result is None:
+                                            logger.warning("Timeout waiting for afterLease hook to complete")

Note: This would require adjusting the subsequent conditional logic (lines 162-172) to handle the timeout case inline rather than via timeout_scope.cancelled_caught.

controller/Makefile (1)

66-68: Make the CRD copy step more robust with mkdir.

If the templates directory is missing in a fresh checkout, cp will fail.

♻️ Suggested tweak
 	cp deploy/helm/jumpstarter/crds/* deploy/operator/config/crd/bases/
+	mkdir -p deploy/helm/jumpstarter/charts/jumpstarter-controller/templates/crds/
 	cp deploy/helm/jumpstarter/crds/* deploy/helm/jumpstarter/charts/jumpstarter-controller/templates/crds/
controller/internal/service/controller_service.go (1)

318-338: Consider logging unexpected status values in the default case.

The protoStatusToString function silently maps unknown proto status values to ExporterStatusUnspecified. While this provides a safe fallback, logging at debug/warning level would help diagnose protocol mismatches.

♻️ Optional: Add logging for unexpected status values
 func protoStatusToString(status pb.ExporterStatus) string {
 	switch status {
 	case pb.ExporterStatus_EXPORTER_STATUS_OFFLINE:
 		return jumpstarterdevv1alpha1.ExporterStatusOffline
 	// ... other cases ...
 	default:
+		// Log unexpected status values for debugging protocol mismatches
+		// Consider using a package-level logger if available
 		return jumpstarterdevv1alpha1.ExporterStatusUnspecified
 	}
 }
python/packages/jumpstarter/jumpstarter/common/enums.py (1)

38-45: Consider handling invalid protobuf values gracefully in from_proto.

The current implementation will raise ValueError if an unrecognized integer is passed to from_proto. For robustness against protocol version mismatches, consider returning UNSPECIFIED for unknown values.

♻️ Optional: Handle unknown values gracefully
    @classmethod
     def from_proto(cls, value: int) -> "ExporterStatus":
         """Convert from protobuf integer to enum."""
-        return cls(value)
+        try:
+            return cls(value)
+        except ValueError:
+            return cls.UNSPECIFIED
python/packages/jumpstarter/jumpstarter/exporter/logging.py (1)

31-39: Consider longest-prefix matching for hierarchical loggers.

The current prefix matching returns the first match found during iteration. With hierarchical logger names (e.g., driver.power, driver.power.usb), if both driver and driver.power are mapped, the match depends on iteration order. Consider sorting by key length descending to ensure the most specific (longest) prefix wins.

♻️ Suggested improvement for deterministic matching
     def get_source_for_record(self, record):
         """Determine the appropriate log source for a record."""
         with self._lock:
             # Check if this record comes from a logger with a specific source mapping
             logger_name = record.name
-            for mapped_logger, source in self._child_handlers.items():
+            # Sort by key length descending to match most specific prefix first
+            for mapped_logger, source in sorted(
+                self._child_handlers.items(), key=lambda x: len(x[0]), reverse=True
+            ):
                 if logger_name.startswith(mapped_logger):
                     return source
             return self.source
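The longest-prefix-wins behavior can be checked standalone. Handler names below are made up; only the sorting trick mirrors the suggested fix.

```python
def source_for(handlers: dict, logger_name: str, default: str = "system") -> str:
    # Sort mapped prefixes longest-first so "driver.power" beats "driver"
    # regardless of dict insertion order
    for prefix, source in sorted(handlers.items(), key=lambda kv: len(kv[0]), reverse=True):
        if logger_name.startswith(prefix):
            return source
    return default

handlers = {"driver": "DRIVER", "driver.power": "POWER"}
print(source_for(handlers, "driver.power.usb"))  # POWER
print(source_for(handlers, "driver.serial"))     # DRIVER
print(source_for(handlers, "exporter.core"))     # system
```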
python/packages/jumpstarter/jumpstarter/client/core.py (1)

440-510: Consider extracting helper functions to reduce complexity.

The static analysis correctly flags this method as too complex (C901: 14 > 10). The method handles multiple concerns: reconnection with backoff, log source routing, and filtering. While the current structure is readable, extracting helpers would improve maintainability.

♻️ Suggested structure for reduced complexity
def _route_log_to_logger(self, response, show_all_logs: bool) -> None:
    """Route a log response to the appropriate logger."""
    from jumpstarter.common import LogSource

    if response.HasField("source"):
        source = LogSource(response.source)
        is_hook = source in (LogSource.BEFORE_LEASE_HOOK, LogSource.AFTER_LEASE_HOOK)
    else:
        source = LogSource.SYSTEM
        is_hook = False

    if not (is_hook or show_all_logs):
        return

    severity = response.severity if response.severity else "INFO"
    log_level = getattr(logging, severity, logging.INFO)

    logger_names = {
        LogSource.BEFORE_LEASE_HOOK: "exporter:beforeLease",
        LogSource.AFTER_LEASE_HOOK: "exporter:afterLease",
        LogSource.DRIVER: "exporter:driver",
    }
    logger_name = logger_names.get(source, "exporter:system")
    logging.getLogger(logger_name).log(log_level, response.message)

Then use this helper in the main loop to reduce nesting.

python/packages/jumpstarter/jumpstarter/exporter/hooks.py (1)

496-498: TODO: Controller-side lease ending not yet implemented.

The comment indicates that when on_failure='endLease', there's no mechanism to actually end the lease from the exporter side. This should be tracked.

Would you like me to open an issue to track the controller-side lease ending mechanism for beforeLease hook failures with on_failure='endLease'?

python/packages/jumpstarter/jumpstarter/client/status_monitor.py (1)

197-279: Consider extracting common verification logic.

Both wait_for_status and wait_for_any_of share substantial code for connection recovery verification. This duplication contributes to the complexity flagged by static analysis.

♻️ Suggested refactor to reduce duplication
async def _verify_connection_recovery(self) -> ExporterStatus | None:
    """Verify connection and return current status if recovered, None if still lost."""
    if not self._connection_lost:
        return self._current_status
    
    logger.debug("Connection was marked as lost, verifying...")
    try:
        response = await self._stub.GetStatus(jumpstarter_pb2.GetStatusRequest())
        logger.info("Connection recovered during verification poll")
        self._connection_lost = False
        new_status = ExporterStatus.from_proto(response.status)
        self._current_status = new_status
        self._status_version = response.status_version
        return new_status
    except AioRpcError as e:
        if e.code() == StatusCode.UNAVAILABLE:
            logger.debug("Connection still lost (UNAVAILABLE)")
        else:
            logger.debug("GetStatus error during verification: %s", e.code())
        return None
    except Exception as e:
        logger.debug("GetStatus error during verification: %s", e)
        return None

Then use this helper in both wait_for_status and wait_for_any_of.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

🤖 Fix all issues with AI agents
In
`@python/packages/jumpstarter-driver-power/jumpstarter_driver_power/client_test.py`:
- Around line 8-20: The test_log_stream uses fixed time.sleep(1) which is flaky;
replace these sleeps with a short-polling loop that checks caplog.text until the
expected substring appears or a timeout expires. In test_log_stream (around
client.log_stream(), client.on(), client.off()) implement a small helper or
inline poll using time.monotonic() to loop with a tiny sleep (e.g., 0.01–0.1s)
and fail if timeout reached, then assert "power on" and "power off" only after
the poll succeeds; keep using caplog and the same logger scope so the
client.log_stream(), client.on(), and client.off() calls remain unchanged.

In `@python/packages/jumpstarter/jumpstarter/client/client_test.py`:
- Line 3: Remove the unused import OrderedDict from the imports in
client_test.py; locate the line "from collections import OrderedDict" and delete
it (or remove OrderedDict from the import list) so the lint warning is resolved
and no functionality is changed.

In `@python/packages/jumpstarter/jumpstarter/client/status_monitor_test.py`:
- Line 3: The import list in status_monitor_test.py includes an unused symbol
AsyncMock; remove AsyncMock from the import statement (keep MagicMock) so the
module only imports symbols actually used in the tests, ensuring static analysis
no longer flags the unused AsyncMock.

In `@python/packages/jumpstarter/jumpstarter/common/enums_test.py`:
- Line 3: Remove the unused top-level import "import pytest" from enums_test.py;
locate the import statement at the top of the test module and delete it so the
file no longer contains the unused pytest symbol.
🧹 Nitpick comments (4)
python/packages/jumpstarter/jumpstarter/exporter/lease_context_test.py (1)

186-200: Avoid time-based sleep to reduce test flakiness.

Line 195 uses a fixed sleep, which can be flaky on slow CI. Consider a handshake event so the test deterministically waits for wait_for_drivers() to start before setting the hook.

♻️ Proposed deterministic rewrite
     async def test_wait_for_drivers_blocks_until_set(self) -> None:
         """Test that wait_for_drivers() blocks until hook event is set."""
         import anyio
 
         before_hook = Event()
         ctx = LeaseContext(lease_name="test-lease", before_lease_hook=before_hook)
 
-        # Set the event after a short delay
-        async def set_after_delay():
-            await anyio.sleep(0.05)
-            before_hook.set()
+        waiter_started = Event()
+
+        async def wait_and_signal():
+            waiter_started.set()
+            await ctx.wait_for_drivers()
 
         async with anyio.create_task_group() as tg:
-            tg.start_soon(set_after_delay)
-            await ctx.wait_for_drivers()
+            tg.start_soon(wait_and_signal)
+            await waiter_started.wait()
+            before_hook.set()
 
         assert ctx.drivers_ready()
python/packages/jumpstarter/jumpstarter/client/core_test.py (2)

16-32: Consider extracting shared test utilities.

MockAioRpcError and create_mock_rpc_error are duplicated from status_monitor_test.py. Consider extracting these to a shared test utilities module (e.g., jumpstarter/client/test_utils.py or a conftest.py).


90-116: Move import to module level.

The anyio import inside wait_for_hook_status should be at the module level for consistency and clarity.

Proposed fix

Add to imports at top of file:

import anyio

Then update the method:

     async def wait_for_hook_status(self, target_status: ExporterStatus, timeout: float = 60.0) -> bool:
         """Wait for exporter to reach a target status using polling."""
-        import anyio
-
         poll_interval = 0.1  # Fast for testing
python/packages/jumpstarter/jumpstarter/client/client_test.py (1)

121-124: Consider adding assertion to fully verify topological order.

The comment states children should be built before parents, but the assertion only verifies grandparent is built after parent. Adding an assertion that child is built before parent would fully validate the topological ordering.

Suggested improvement
         # Children should be built before parents in topological order
-        # The order should be child, parent, grandparent (or parent, child, grandparent)
+        # The order should be child, parent, grandparent
         assert "grandparent" in build_order
         assert build_order.index("grandparent") > build_order.index("parent")
+        assert build_order.index("parent") > build_order.index("child")

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 6

🤖 Fix all issues with AI agents
In `@controller/internal/service/suite_test.go`:
- Around line 56-58: The test harness sets up testEnv as an envtest.Environment
with CRDDirectoryPaths and ErrorIfCRDPathMissing: true, which will abort the
suite if the CRD path is wrong; verify and update the CRDDirectoryPaths value to
the actual CRD directory (or add multiple fallback paths) so envtest can find
the CRDs, e.g., check and correct the path used in the testEnv initialization
(the envtest.Environment struct, the CRDDirectoryPaths slice, and the
ErrorIfCRDPathMissing flag) or set ErrorIfCRDPathMissing to false while you
confirm the correct location.
- Around line 29-31: The test currently ignores errors when creating a
Namespace; change the Namespace creation logic (the call that creates a
corev1.Namespace object, e.g., client.CoreV1().Namespaces().Create or similar in
the test setup) to explicitly handle errors: import
k8s.io/apimachinery/pkg/api/errors as apierrors, then after the Create call
check if err != nil and if apierrors.IsAlreadyExists(err) continue/ignore,
otherwise fail the test (t.Fatalf or return the error). Apply the same pattern
for the other namespace-creation site referenced in the comment (the second
block around lines 81-87) so only AlreadyExists is tolerated and all other
errors cause the test to fail.
- Around line 19-24: Update the test setup in suite_test.go to avoid hard-coding
BinaryAssetsDirectory: first respect the KUBEBUILDER_ASSETS environment variable
if set (use it to set BinaryAssetsDirectory), then fall back to existing helper
GetFirstFoundEnvTestBinaryDir() or a directory-existence check for the current
hard-coded path (bin/k8s/1.30.0-<os>-<arch>), and only assign the hard-coded
path as a last resort; modify the code that sets BinaryAssetsDirectory to
perform conditional checks (os.Getenv("KUBEBUILDER_ASSETS"), os.Stat on
candidate dirs, and GetFirstFoundEnvTestBinaryDir()) so tests run correctly
outside the Makefile and when the version/path differs.

In `@python/packages/jumpstarter/jumpstarter/client/client_test.py`:
- Around line 120-124: The test's topological-order check only ensures
"grandparent" comes after "parent" but doesn't verify that "child" precedes
"parent"; update the assertions to explicitly assert that
build_order.index("child") < build_order.index("parent") (and keep/optionally
assert both "child" and "parent" are before "grandparent") so the test enforces
children-before-parents ordering; reference the existing build_order list and
the node names "child", "parent", and "grandparent" when adding the
assertion(s).

In `@python/packages/jumpstarter/jumpstarter/client/status_monitor.py`:
- Around line 31-38: The docstring usage example in status_monitor.py
incorrectly calls client.start_status_monitor(); update the example to call the
actual method client.status_monitor_async() (or add a thin alias
start_status_monitor -> status_monitor_async in the client if you prefer) so the
docstring matches the real API; ensure the rest of the example still references
monitor.current_status and monitor.wait_for_status(ExporterStatus.LEASE_READY).

In `@python/packages/jumpstarter/jumpstarter/exporter/hooks.py`:
- Around line 426-500: The before-lease failure branch doesn't actually end the
lease for on_failure='endLease'; modify run_before_lease_hook to accept an async
lease-release callback (e.g., end_lease: Callable[[LeaseContext],
Awaitable[None]]) or reuse an existing controller-release method, and in the
HookExecutionError else-branch (the on_failure='endLease' path) invoke await
end_lease(lease_scope) (and still set lease_scope.before_lease_hook
event/unblock connections and report status) so the lease is cleanly terminated;
update all call sites of run_before_lease_hook to pass the new end_lease
callback (mirroring how run_after_lease_hook is wired).
♻️ Duplicate comments (2)
python/packages/jumpstarter-cli/jumpstarter_cli/shell.py (1)

80-90: Unused synchronous function.

This function is no longer called anywhere in the codebase since line 213 now uses _run_shell_with_lease_async. Consider removing it to reduce maintenance burden.

python/packages/jumpstarter/jumpstarter/client/status_monitor.py (1)

397-403: Async context manager won't start monitoring.

The __aenter__ method returns self without calling start(). Direct usage as an async context manager (async with StatusMonitor(...) as monitor:) will silently fail to start background polling. Users must use client.status_monitor_async() instead. Either fix __aenter__ to properly start monitoring or remove the context manager methods to prevent misuse.
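A minimal sketch of the fix, using asyncio in place of the project's anyio so it runs standalone; the class is a toy, not the real StatusMonitor API.

```python
import asyncio

class DemoMonitor:
    """Toy async context manager whose __aenter__ actually starts
    the background poll, as the fix suggests."""
    def __init__(self):
        self._task = None
        self.polls = 0

    async def _poll_loop(self):
        while True:
            self.polls += 1
            await asyncio.sleep(0.01)

    async def start(self):
        self._task = asyncio.create_task(self._poll_loop())

    async def stop(self):
        if self._task is not None:
            self._task.cancel()
            try:
                await self._task
            except asyncio.CancelledError:
                pass

    async def __aenter__(self):
        await self.start()  # the missing call: begin polling on entry
        return self

    async def __aexit__(self, *exc_info):
        await self.stop()

async def main():
    async with DemoMonitor() as monitor:
        await asyncio.sleep(0.05)
    return monitor.polls

print(asyncio.run(main()) > 0)  # True: polling ran inside the context
```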

🧹 Nitpick comments (9)
python/packages/jumpstarter-cli/jumpstarter_cli/shell.py (2)

157-161: Redundant double timeout.

Both anyio.move_on_after(300) and timeout=300.0 in wait_for_any_of set 5-minute timeouts. The inner timeout parameter makes the outer cancel scope redundant. Consider removing one:

♻️ Proposed fix
-                                        with anyio.move_on_after(300) as timeout_scope:  # 5 minute timeout
-                                            result = await monitor.wait_for_any_of(
-                                                [ExporterStatus.AVAILABLE, ExporterStatus.AFTER_LEASE_HOOK_FAILED],
-                                                timeout=300.0
-                                            )
+                                        result = await monitor.wait_for_any_of(
+                                            [ExporterStatus.AVAILABLE, ExporterStatus.AFTER_LEASE_HOOK_FAILED],
+                                            timeout=300.0
+                                        )
+                                        if result is None:
+                                            logger.warning("Timeout waiting for afterLease hook to complete")

Then adjust the subsequent conditional logic accordingly, removing the timeout_scope.cancelled_caught check at line 171.


133-136: Consider checking connection_lost for beforeLease hook.

The afterLease handling (line 166) checks monitor.connection_lost to distinguish between timeout and connection loss. The beforeLease handling treats both cases as "Timeout" which could be misleading in logs.

♻️ Proposed enhancement
                             if result == ExporterStatus.BEFORE_LEASE_HOOK_FAILED:
                                 logger.warning("beforeLease hook failed")
                             elif result is None:
-                                logger.warning("Timeout waiting for beforeLease hook")
+                                if monitor.connection_lost:
+                                    logger.warning("Connection lost during beforeLease hook")
+                                else:
+                                    logger.warning("Timeout waiting for beforeLease hook")
python/packages/jumpstarter/jumpstarter/client/client_test.py (1)

58-58: Ensure ExitStack is closed after each test.

ExitStack() is never closed, which can leak cleanup callbacks if client_from_channel registers any. Prefer with ExitStack() as mock_stack: to guarantee teardown.

♻️ Example pattern (apply to each test)
-        mock_stack = ExitStack()
+        with ExitStack() as mock_stack:
             with patch(
                 "jumpstarter.client.client.MultipathExporterStub",
                 return_value=mock_stub
             ), patch(
                 "jumpstarter.client.client.import_class",
                 return_value=MockDriverClient
             ):
                 from jumpstarter.client.client import client_from_channel

                 client = await client_from_channel(
                     mock_channel, mock_portal, mock_stack, allow=[], unsafe=True
                 )

Also applies to: 98-98, 143-143, 177-177, 211-211, 245-245

python/packages/jumpstarter/jumpstarter/client/status_monitor_test.py (1)

16-32: Consider extracting shared mock utilities to a common test module.

MockAioRpcError and create_mock_rpc_error are duplicated here and in core_test.py (per the relevant snippets). Extracting these to a shared test utilities module (e.g., conftest.py or a dedicated testing.py) would reduce duplication.

python/packages/jumpstarter/jumpstarter/client/core.py (1)

332-351: Asymmetric behavior when monitor is not started.

wait_for_lease_ready_monitored raises RuntimeError if the monitor isn't started, while wait_for_hook_complete_monitored silently returns True. This asymmetry could be surprising. Consider documenting this behavior difference more explicitly or using consistent error handling.
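One way to make the two paths symmetric is a shared guard; the sketch below is illustrative only, with names echoing the review text rather than the real client API.

```python
class MonitoredWaits:
    """Sketch: one guard so both wait helpers fail the same way
    when the monitor was never started."""
    def __init__(self, monitor=None):
        self._status_monitor = monitor

    def _require_monitor(self):
        if self._status_monitor is None:
            raise RuntimeError(
                "status monitor not started; call status_monitor_async() first"
            )
        return self._status_monitor

    def wait_for_lease_ready_monitored(self):
        return self._require_monitor()

    def wait_for_hook_complete_monitored(self):
        # Same guard here, instead of silently returning True
        return self._require_monitor()

client = MonitoredWaits()
for method in (client.wait_for_lease_ready_monitored,
               client.wait_for_hook_complete_monitored):
    try:
        method()
    except RuntimeError as exc:
        print(type(exc).__name__)  # RuntimeError, printed twice
```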

python/packages/jumpstarter/jumpstarter/client/status_monitor.py (1)

309-309: Use %-style formatting in logger calls instead of f-strings.

F-strings in logging calls evaluate the string even when the log level is disabled, causing unnecessary overhead. Use %-style formatting for deferred evaluation.

Proposed fix
-                    logger.warning(f"Missed {missed} status transition(s)")
+                    logger.warning("Missed %d status transition(s)", missed)
-                    logger.info(f"Status changed: {old_status} -> {new_status} (version={new_version})")
+                    logger.info("Status changed: %s -> %s (version=%d)", old_status, new_status, new_version)
-                            logger.error(f"Status change callback error: {e}")
+                            logger.error("Status change callback error: %s", e)
-                logger.debug(f"GetStatus poll error: {e.code()}")
+                logger.debug("GetStatus poll error: %s", e.code())
-                logger.debug(f"GetStatus poll error: {e}")
+                logger.debug("GetStatus poll error: %s", e)

Also applies to: 320-320, 337-337, 352-354

controller/internal/service/suite_test.go (1)

90-93: Guard teardown if setup failed.

If testEnv never initialized, Stop() can panic in AfterSuite. A simple nil check avoids cascading failures.

🛡️ Suggested guard
 	By("tearing down the test environment")
-	err := testEnv.Stop()
-	Expect(err).NotTo(HaveOccurred())
+	if testEnv != nil {
+		err := testEnv.Stop()
+		Expect(err).NotTo(HaveOccurred())
+	}
controller/Makefile (1)

66-75: Guard against duplicate metadata.annotations blocks.

Line 66–71 unconditionally inserts a new annotations: map. If the generated role ever includes annotations (now or later), this can create duplicate keys and potentially drop the sync-wave or other annotations depending on YAML parsing behavior. Consider merging into an existing annotations map or using a YAML-aware tool for the insert.

python/packages/jumpstarter/jumpstarter/exporter/hooks.py (1)

341-356: Use anyio.to_thread.run_sync() with timeout for blocking process.wait() after process.kill().

The direct process.wait() call at line 354 blocks the event loop if the process doesn't exit promptly. The same block already demonstrates the correct pattern at line 346–347 with anyio.to_thread.run_sync() and anyio.move_on_after(). Apply the same pattern here for consistency and robustness:

Safer non‑blocking wait
                 if process.poll() is None:
                     process.kill()
                     try:
-                        process.wait()
+                        with anyio.move_on_after(5):
+                            await anyio.to_thread.run_sync(process.wait)
                     except Exception:
                         pass

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@controller/Makefile`:
- Line 74: The Makefile currently runs the cp command copying CRDs to
deploy/helm/jumpstarter/charts/jumpstarter-controller/templates/crds/, which
will fail if that target directory doesn't exist; update the Makefile to create
the directory first by running a mkdir -p for
deploy/helm/jumpstarter/charts/jumpstarter-controller/templates/crds/
immediately before the cp command so the cp in the Makefile succeeds even when
the templates/crds directory is missing.
🧹 Nitpick comments (1)
controller/Makefile (1)

66-71: Idempotency risk: duplicate annotations: keys if file already contains annotations.

The AWK script unconditionally inserts an annotations: block after the name: line. If controller-gen ever outputs annotations in the generated role.yaml, or if the manifests target runs without a fresh regeneration, this will produce invalid YAML with duplicate annotations: keys.

Consider using a more robust approach that checks for existing annotations before inserting:

♻️ Suggested safer AWK approach
 	# Add ArgoCD sync-wave annotation to RBAC role for proper deployment ordering (PR #207)
-	@awk '/^  name: jumpstarter-manager-role$$/{print; print "  annotations:"; print "    argocd.argoproj.io/sync-wave: \"-1\""; next}1' \
+	@awk ' \
+		/^  name: jumpstarter-manager-role$$/ { print; found_name=1; next } \
+		found_name && /^  annotations:/ { print; found_annotations=1; next } \
+		found_name && found_annotations && /^  [a-zA-Z]/ && !/^    / { \
+			print "    argocd.argoproj.io/sync-wave: \"-1\""; found_name=0; found_annotations=0 \
+		} \
+		found_name && !found_annotations && /^  [a-zA-Z]/ { \
+			print "  annotations:"; print "    argocd.argoproj.io/sync-wave: \"-1\""; found_name=0 \
+		} \
+		{ print }' \
 		deploy/helm/jumpstarter/charts/jumpstarter-controller/templates/rbac/role.yaml > \
 		deploy/helm/jumpstarter/charts/jumpstarter-controller/templates/rbac/role.yaml.tmp && \
 		mv deploy/helm/jumpstarter/charts/jumpstarter-controller/templates/rbac/role.yaml.tmp \
 		deploy/helm/jumpstarter/charts/jumpstarter-controller/templates/rbac/role.yaml

Alternatively, consider using a tool like yq for safer YAML manipulation.

Also, the comment references "PR #207" but this appears to be PR #140.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@python/packages/jumpstarter-cli/jumpstarter_cli/shell.py`:
- Around line 125-143: The code currently launches the shell unconditionally
after monitor.wait_for_any_of; change it to treat
ExporterStatus.BEFORE_LEASE_HOOK_FAILED and a None (timeout) as hard stops by
not calling _run_shell_only. After the wait_for_any_of call
(monitor.wait_for_any_of), check result and if it equals
ExporterStatus.BEFORE_LEASE_HOOK_FAILED or is None, log the warning and exit the
function (or raise an appropriate exception) instead of proceeding to call
anyio.to_thread.run_sync(_run_shell_only, lease, config, command, path); only
call _run_shell_only when result == ExporterStatus.LEASE_READY.
♻️ Duplicate comments (5)
python/packages/jumpstarter/jumpstarter/exporter/exporter.py (4)

202-204: Controller channels are not closed after use.

This creates a new gRPC channel per call but never closes it. Multiple calls during status updates and stream retries will leak channels. The pattern in _unregister_with_controller() (lines 357-369) demonstrates proper cleanup with explicit await channel.close().


414-458: Task may hang when lease ends without EndSession.

This task only waits on end_session_requested (line 426). If the lease ends without the client calling EndSession, the task continues waiting indefinitely on the old lease context. Consider racing against lease_ended or using a task group scoped to the lease lifetime.


546-553: Listen task should be scoped to the lease lifetime.

The _retry_stream task for Listen is started on the outer tg (line 548), causing it to continue running after handle_lease exits. This results in the task persisting across lease transitions. Move this to conn_tg so it's automatically cancelled when the lease ends, and close listen_tx in the finally block to signal proper stream termination.


578-579: Consider scoping _handle_end_session to the lease task group.

Starting this task on the outer tg means it survives the lease ending. If combined with the fact that it only waits on end_session_requested, the task can be orphaned. This is related to the hang issue noted in the _handle_end_session method.

python/packages/jumpstarter-cli/jumpstarter_cli/shell.py (1)

80-89: Consider removing _run_shell_with_lease if it’s now unused.

The async path appears to be the only entry point, so this sync helper likely adds maintenance overhead.

#!/bin/bash
# Verify remaining usages of the sync helper
rg -nP '\b_run_shell_with_lease\s*\(' --type py

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

🤖 Fix all issues with AI agents
In `@python/packages/jumpstarter/jumpstarter/client/status_monitor.py`:
- Around line 8-18: The import block is unsorted per Ruff I001; reorder imports
into standard-library first (logging, collections.abc.Awaitable,
collections.abc.Callable, contextlib.AsyncExitStack), then third-party packages
(anyio, anyio.Event, anyio.create_task_group, grpc.StatusCode,
grpc.aio.AioRpcError, jumpstarter_protocol.jumpstarter_pb2), and finally
local/project imports (jumpstarter.common.ExporterStatus) so the import order
satisfies the linter and keeps the existing imported symbols intact.
- Around line 196-201: The timeout checks in wait_for_status() and
wait_for_any_of() incorrectly use "if timeout:" which treats 0 as falsy; change
both checks to "if timeout is not None:" so a timeout of 0 is handled as an
immediate timeout; update the branches around anyio.move_on_after in
wait_for_status() and wait_for_any_of() (the blocks that call wait_loop() and
return False) to use the new condition.
- Around line 396-400: stop() currently sets _running = False and signals
_stop_event but does not wake waiters blocked on _any_change_event; update
stop() to also set _any_change_event (and optionally _any_change_event.set()
followed by _any_change_event.clear() if you want a one-shot wake) so waiters in
wait_for_status and wait_for_any_of that are awaiting _any_change_event.wait()
will wake and observe _running == False and exit their loops; reference the stop
method and the _any_change_event, wait_for_status, and wait_for_any_of functions
when applying this change.

In `@python/packages/jumpstarter/jumpstarter/exporter/exporter.py`:
- Around line 168-191: The stop() method currently ignores the caller's
should_unregister when wait_for_lease_exit is True, causing later calls (e.g.,
in serve()) to override intent; update stop() so that whenever a deferred stop
is requested (the branch setting self._stop_requested = True) you also set
self._unregister = should_unregister (and likewise when a stop has already been
requested you should update/preserve self._unregister accordingly) so the
original caller preference is retained; refer to the stop() method, the
attributes self._unregister, self._stop_requested, and self._tg.cancel_scope to
implement this change.
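The `timeout=0` pitfall flagged for wait_for_status()/wait_for_any_of() above can be demonstrated in isolation; `takes_bounded_path` is an invented helper that compares the two checks.

```python
def takes_bounded_path(timeout, *, fixed: bool) -> bool:
    """Compare the buggy truthiness check with the suggested identity check."""
    if fixed:
        return timeout is not None   # `if timeout is not None:` - 0 is bounded
    return bool(timeout)             # `if timeout:` - 0 falls through to an unbounded wait

print(takes_bounded_path(0, fixed=False))    # False (bug: waits forever)
print(takes_bounded_path(0, fixed=True))     # True  (immediate timeout)
print(takes_bounded_path(None, fixed=True))  # False (explicit unbounded wait)
```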
🧹 Nitpick comments (1)
python/packages/jumpstarter-cli/jumpstarter_cli/shell.py (1)

147-162: Remove redundant timeout wrapper and improve timeout detection logic.

The StatusMonitor.connection_lost property is valid and well-tested. However, the outer anyio.move_on_after(30) is redundant—wait_for_any_of() already has an identical timeout=30.0 parameter. When both are equal, the inner timeout fires first, making the outer scope's cancelled_caught flag never True. This means timeout cases currently fall through to "Hook completion not confirmed" rather than the timeout warning.

The suggested refactor properly addresses this by removing the outer timeout wrapper and using result is None to detect timeouts, which disambiguates between connection loss and timeout.

♻️ Suggested simplification
-                                        with anyio.move_on_after(30) as timeout_scope:  # 30 second timeout
-                                            result = await monitor.wait_for_any_of(
-                                                [ExporterStatus.AVAILABLE, ExporterStatus.AFTER_LEASE_HOOK_FAILED],
-                                                timeout=30.0
-                                            )
-                                            if result == ExporterStatus.AVAILABLE:
-                                                logger.info("afterLease hook completed")
-                                            elif result == ExporterStatus.AFTER_LEASE_HOOK_FAILED:
-                                                logger.warning("afterLease hook failed")
-                                            elif monitor.connection_lost:
-                                                # Connection lost - exporter closed, hook likely completed
-                                                logger.info("Exporter connection closed (hook completed)")
-                                            else:
-                                                logger.debug("Hook completion not confirmed")
-                                        if timeout_scope.cancelled_caught:
-                                            logger.warning("Timeout waiting for afterLease hook to complete")
+                                        result = await monitor.wait_for_any_of(
+                                            [ExporterStatus.AVAILABLE, ExporterStatus.AFTER_LEASE_HOOK_FAILED],
+                                            timeout=30.0
+                                        )
+                                        if result == ExporterStatus.AVAILABLE:
+                                            logger.info("afterLease hook completed")
+                                        elif result == ExporterStatus.AFTER_LEASE_HOOK_FAILED:
+                                            logger.warning("afterLease hook failed")
+                                        elif result is None and monitor.connection_lost:
+                                            # Connection lost - exporter closed, hook likely completed
+                                            logger.info("Exporter connection closed (hook completed)")
+                                        elif result is None:
+                                            logger.warning("Timeout waiting for afterLease hook to complete")
+                                        else:
+                                            logger.debug("Hook completion not confirmed")

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@python/packages/jumpstarter/jumpstarter/client/lease.py`:
- Around line 237-257: The fixed-count retry loop around self.controller.Dial
(using max_retries/base_delay/max_delay) should be replaced with a time‑based
retry bounded by a configurable timeout (use the existing acquisition_timeout if
available or add a new dial_timeout attribute/parameter) so Dial keeps retrying
until time.monotonic() exceeds the deadline; compute deadline = now + timeout,
then loop attempting await self.controller.Dial(...), on AioRpcError with
FAILED_PRECONDITION and "not ready" apply the same exponential/backoff delay
(min(base_delay*2**attempt, max_delay)) but stop retrying and re-raise once
monotonic() > deadline; update the retry log to show remaining time or attempts
left and ensure the new timeout is plumbed through the Lease acquisition
call/site that calls this Dial logic.
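The deadline-based retry described above can be sketched generically. This is a minimal sketch, not the project's actual Dial logic: the `dial_timeout` parameter and the transient-error predicate are assumptions for illustration.

```python
import time


def retry_until_deadline(op, is_transient, dial_timeout=30.0,
                         base_delay=0.5, max_delay=5.0, sleep=time.sleep):
    """Retry op() until it succeeds or the deadline passes.

    Transient failures back off exponentially; once time.monotonic()
    exceeds the deadline the last error is re-raised to the caller.
    """
    deadline = time.monotonic() + dial_timeout
    attempt = 0
    while True:
        try:
            return op()
        except Exception as e:
            if not is_transient(e) or time.monotonic() > deadline:
                raise
            delay = min(base_delay * 2 ** attempt, max_delay)
            # Never sleep past the deadline itself
            delay = min(delay, max(0.0, deadline - time.monotonic()))
            sleep(delay)
            attempt += 1
```

The key difference from a fixed-count loop is that the bound is wall-clock time, so slow backoff steps cannot extend the total wait beyond the configured timeout.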

In `@python/packages/jumpstarter/jumpstarter/client/status_monitor.py`:
- Around line 343-347: In _poll_loop (status_monitor.py) the AioRpcError
UNIMPLEMENTED branch currently breaks without clearing _running or notifying
waiters; update that branch to mirror the UNAVAILABLE handling by setting
self._running = False and calling self._any_change_event.set() (and any other
cleanup done in the UNAVAILABLE branch) before breaking so wait_for_status() and
wait_for_any_of() don't hang; also add a unit test that triggers UNIMPLEMENTED
while a waiter (using wait_for_status() or wait_for_any_of()) is active and
assert the waiter returns promptly.
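The waiter-wakeup pattern this fix asks for can be sketched with a minimal stand-in monitor (class and field names here are illustrative, not the real `StatusMonitor` API): any branch that exits the poll loop must clear `_running` *and* set the change event, or `wait_for_status()` hangs forever.

```python
import asyncio


class MiniMonitor:
    """Minimal sketch of the waiter-wakeup pattern the review asks for."""

    def __init__(self):
        self._running = True
        self._status = None
        self._any_change_event = asyncio.Event()

    def _exit_poll_loop(self):
        # Mirror of the UNAVAILABLE/UNIMPLEMENTED handling: stop polling
        # and wake everyone currently blocked in wait_for_status().
        self._running = False
        self._any_change_event.set()

    async def wait_for_status(self, target, timeout=None):
        async def loop():
            while True:
                if self._status == target:
                    return True
                if not self._running:
                    return False  # monitor stopped; don't hang
                await self._any_change_event.wait()
                self._any_change_event = asyncio.Event()
        return await asyncio.wait_for(loop(), timeout)
```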
♻️ Duplicate comments (3)
python/packages/jumpstarter/jumpstarter/client/status_monitor.py (3)

8-18: Reorder imports to satisfy Ruff (I001).

Current ordering may still fail Ruff I001.

🔧 Proposed import order
-import logging
-from collections.abc import Awaitable, Callable
-from contextlib import AsyncExitStack
+from collections.abc import Awaitable, Callable
+from contextlib import AsyncExitStack
+import logging

Please verify with:

#!/bin/bash
ruff check python/packages/jumpstarter/jumpstarter/client/status_monitor.py

Based on learnings, Ruff is the formatter/linter for Python files.


196-201: Handle timeout=0 explicitly.

if timeout: treats 0 as falsy, so an immediate timeout becomes an infinite wait (Line 196 and Line 280).

🐛 Proposed fix
-        if timeout:
+        if timeout is not None:
             with anyio.move_on_after(timeout):
                 return await wait_loop()
             return False
-        if timeout:
+        if timeout is not None:
             with anyio.move_on_after(timeout):
                 return await wait_for_first()
             return None

Optional verification:

#!/bin/bash
rg -n "wait_for_status|wait_for_any_of|timeout\s*=\s*0" --type=py

Also applies to: 280-285


396-400: Wake waiting coroutines on stop().

stop() doesn’t signal _any_change_event, so wait_for_* can hang after shutdown.

🛠️ Proposed fix
     async def stop(self):
         """Stop the background polling task."""
         self._running = False
         self._stop_event.set()
+        # Wake any waiters so they can observe _running=False
+        self._any_change_event.set()
+        self._any_change_event = Event()

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@e2e/dex.values.yaml`:
- Around line 37-40: The userID value for the test user "test-exporter-hooks" is
not a valid UUID because it contains the non-hex character 'g' (value:
c6ed6f40-6689-6g7e-c64c-55d0a9a5871f); replace this with a valid UUID (only 0-9
and a-f), e.g., generate a new v4 UUID and set userID to that new value so the
entry for username "test-exporter-hooks" and its userID field are valid.
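A valid replacement value can be generated with the standard library; `uuid4()` output only ever contains hex digits and hyphens, so a character like `g` cannot appear:

```python
import re
import uuid

# Generate a fresh v4 UUID suitable as a userID value
new_id = str(uuid.uuid4())

# Sanity-check the shape: 8-4-4-4-12 groups of lowercase hex
UUID_RE = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"
)
assert UUID_RE.fullmatch(new_id)
```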
🧹 Nitpick comments (5)
e2e/tests-hooks.bats (5)

9-10: Unused variable declaration.

CURRENT_HOOKS_CONFIG is declared but never referenced anywhere in the file. Consider removing it to avoid confusion.

Suggested fix
-# Track which config is currently active
-CURRENT_HOOKS_CONFIG=""

51-54: Consider graceful termination before SIGKILL.

Using kill -9 immediately doesn't give the process a chance to clean up. Consider sending SIGTERM first with a brief grace period before escalating to SIGKILL.

Suggested improvement
         if ps -p "$pid" > /dev/null 2>&1; then
           echo "  Killing PID $pid" >&2
-          kill -9 "$pid" 2>/dev/null || true
+          kill -TERM "$pid" 2>/dev/null || true
+          sleep 0.5
+          ps -p "$pid" > /dev/null 2>&1 && kill -9 "$pid" 2>/dev/null || true
         fi

104-110: Fixed sleep may introduce flakiness.

The 2-second sleep on line 107 is intended to avoid catching pre-connect state, but this is timing-dependent. If the exporter connects faster or slower than expected, tests could become flaky. Consider polling for a specific pre-condition instead, or documenting why 2 seconds is sufficient.


205-208: Add safeguards for empty grep results.

If the grep commands fail to find matches (e.g., due to unexpected output format), before_pos or after_pos could be empty, causing the comparison on line 208 to fail with a shell error. Consider adding validation or using || true with a fallback.

Suggested improvement
   # Verify order: BEFORE should appear before AFTER in output
   local before_pos=$(echo "$output" | grep -n "BEFORE_HOOK:" | head -1 | cut -d: -f1)
   local after_pos=$(echo "$output" | grep -n "AFTER_HOOK:" | head -1 | cut -d: -f1)

+  # Ensure both markers were found
+  [ -n "$before_pos" ] && [ -n "$after_pos" ]
   [ "$before_pos" -lt "$after_pos" ]

315-333: Tests lack verification of actual j command effects.

Tests E1 and E2 only verify that the hook scripts complete (via output markers), but don't verify that j power on/off commands actually executed or had the expected effect. Consider adding assertions that verify the power state changed, or at minimum checking for command output in the hook logs.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@python/packages/jumpstarter/jumpstarter/client/status_monitor.py`:
- Around line 358-365: The UNAVAILABLE branch breaks the poll loop but doesn't
update the monitor's running state, so set self._running = False when handling
StatusCode.UNAVAILABLE; specifically, inside the elif e.code() ==
StatusCode.UNAVAILABLE block update self._running to False alongside setting
self._connection_lost, trigger and recreate self._any_change_event as shown, and
ensure start() can restart the monitor after this exit.
♻️ Duplicate comments (2)
python/packages/jumpstarter/jumpstarter/client/status_monitor.py (2)

199-204: Handle timeout=0 explicitly.

if timeout: treats 0 as falsy, so timeout=0 waits indefinitely instead of timing out immediately.

🛠️ Proposed fix
-        if timeout:
+        if timeout is not None:
             with anyio.move_on_after(timeout):
                 return await wait_loop()
             return False
-        if timeout:
+        if timeout is not None:
             with anyio.move_on_after(timeout):
                 return await wait_for_first()
             return None

Also applies to: 283-286


416-420: Wake waiters on stop to avoid hangs.

stop() only signals _stop_event, but waiters block on _any_change_event, so they can hang indefinitely.

🛠️ Proposed fix
     async def stop(self):
         """Stop the background polling task."""
         self._running = False
         self._stop_event.set()
+        # Wake waiters so they can observe _running=False
+        self._any_change_event.set()
+        self._any_change_event = Event()

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
python/packages/jumpstarter/jumpstarter/config/client.py (1)

162-177: Thread include_status through to the ListExporters request.

The include_status parameter is only used to set a display flag on the response (line 176), but it's never passed to ClientService.ListExporters. If the server requires this flag in the request to include status data, status information won't be fetched. Update ClientService.ListExporters to accept and propagate include_status to the protobuf request.

Suggested update
-exporters_response = await svc.ListExporters(page_size=page_size, page_token=page_token, filter=filter)
+exporters_response = await svc.ListExporters(
+    page_size=page_size,
+    page_token=page_token,
+    filter=filter,
+    include_status=include_status,
+)

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 6

🤖 Fix all issues with AI agents
In `@controller/internal/service/controller_service_integration_test.go`:
- Around line 76-86: The Lease objects in the tests (the Lease struct created
and passed to k8sClient.Create in controller_service_integration_test.go) are
missing the required Selector field in LeaseSpec which causes API validation to
fail; update the Lease initializations (the variable lease used with
k8sClient.Create and its LeaseSpec) to include Selector: metav1.LabelSelector{}
or an appropriate label selector (e.g., Selector:
&metav1.LabelSelector{MatchLabels: map[string]string{"app":"test"}}) so the
CRD-required selector is present when Create(ctx, lease) is called.

In `@e2e/tests.bats`:
- Around line 190-196: Two occurrences of the yq load expression still reference
the old path load("e2e/exporter.yaml"); update those commands that match the
pattern go run ... -i ". * load(\"e2e/exporter.yaml\")" to use
load("e2e/exporters/exporter.yaml") instead so all yq invocations point to the
moved exporter.yaml file (ensure the same replacement is applied for both
remaining occurrences).

In `@python/packages/jumpstarter-cli/jumpstarter_cli/run.py`:
- Around line 79-94: The parent currently interprets any non-zero child exit as
a signal and adds 128, which double-encodes real exit codes; change the protocol
so the child (exporter) encodes signals as 128 + signal number and the parent in
run.py should stop adding 128 — i.e., use exporter_exit_code directly when set
(replace any logic that converts non-zero exporter_exit_code into 128+value),
and keep received_signal only for locally-caught signals; update handling around
exporter_exit_code and received_signal (and cancellation via
signal_tg.cancel_scope.cancel()) so the parent returns exporter_exit_code as-is
when present.
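The proposed protocol can be sketched in a few lines (function names are illustrative, not the actual `run.py` code): the child encodes signal deaths as the shell convention 128+N, and the parent passes the child's code through unchanged, only encoding signals it caught itself.

```python
import signal


def encode_signal_exit(signum: int) -> int:
    """Child side: encode 'killed by signal N' as the shell convention 128+N."""
    return 128 + int(signum)


def parent_exit_code(exporter_exit_code, received_signal):
    """Parent side: trust the child's already-encoded code; only encode
    signals the parent itself caught locally."""
    if exporter_exit_code is not None:
        return exporter_exit_code        # already 128+N if signal-caused
    if received_signal is not None:
        return encode_signal_exit(received_signal)
    return 0
```

With this split, a child killed by SIGTERM reports 143 once, rather than the parent re-encoding an already-encoded value.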

In `@python/packages/jumpstarter/jumpstarter/client/core.py`:
- Around line 458-485: Guard creation of LogSource from response.source with a
try/except to handle unknown enum values: when evaluating source =
LogSource(response.source) (and computing is_hook), catch ValueError and set
source to LogSource.SYSTEM (and is_hook = False) so the stream continues instead
of raising; keep the existing routing logic that maps source to logger_name and
logging through source_logger.log(log_level, response.message). Also ensure
severity fallback (severity = response.severity if response.severity else
"INFO") remains unchanged.
- Around line 127-187: The wait_for_lease_ready method currently logs a warning
on timeout but returns silently so callers cannot detect failure; modify
wait_for_lease_ready to raise DriverError (or return False if you prefer an API
change) when the timeout elapses instead of only logging: replace the final
self.logger.warning(...) with raising DriverError(f"Timeout waiting for
beforeLease hook to complete after {timeout:.1f}s and {poll_count} polls")
(ensure DriverError is imported/available) so callers of wait_for_lease_ready
can programmatically handle timeouts.
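The enum-fallback guard in the first core.py item above can be sketched with a stand-in enum (the real `LogSource` lives in the jumpstarter protocol; the values here are illustrative):

```python
from enum import IntEnum


class LogSource(IntEnum):
    # Hypothetical values; the real enum is defined by the protocol
    SYSTEM = 0
    DRIVER = 1
    BEFORE_HOOK = 2


def parse_source(raw: int) -> LogSource:
    """Map a wire value to LogSource, falling back to SYSTEM for unknown
    values so the log stream keeps flowing instead of raising ValueError."""
    try:
        return LogSource(raw)
    except ValueError:
        return LogSource.SYSTEM
```

This keeps older clients forward-compatible: a newer exporter can emit sources the client's enum does not know yet without killing the stream.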

In `@python/packages/jumpstarter/jumpstarter/exporter/exporter.py`:
- Around line 168-191: The stop method currently ignores the caller's
should_unregister value when deferring stop; update stop(self, ...) to persist
should_unregister into an instance field (e.g. set self._unregister =
should_unregister) whenever wait_for_lease_exit is true and you set
self._stop_requested = True so the original preference is preserved, and ensure
the lease-exit path that later invokes stop uses that stored self._unregister
rather than unconditionally passing True; update references to _unregister and
_stop_requested so deferred stops respect the saved preference.
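The deferred-stop bookkeeping described above can be sketched as follows (field and method names are illustrative, not the exporter's real attributes):

```python
from dataclasses import dataclass, field


@dataclass
class StopState:
    """Sketch of persisting the caller's unregister preference."""
    _stop_requested: bool = False
    _unregister: bool = True
    last_unregister: bool = field(default=True, init=False)

    def stop(self, should_unregister: bool, wait_for_lease_exit: bool):
        if wait_for_lease_exit:
            # Persist the caller's preference for the deferred stop path
            self._unregister = should_unregister
            self._stop_requested = True
            return
        self._do_stop(should_unregister)

    def on_lease_exit(self):
        if self._stop_requested:
            # Use the saved preference instead of unconditionally True
            self._do_stop(self._unregister)

    def _do_stop(self, unregister: bool):
        self.last_unregister = unregister
```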
🧹 Nitpick comments (11)
python/packages/jumpstarter-cli/jumpstarter_cli/shell.py (1)

147-162: Redundant timeout pattern may cause confusion.

The code uses both anyio.move_on_after(30) and timeout=30.0 in wait_for_any_of. Since both timeouts are identical, the inner wait_for_any_of will return None on timeout before move_on_after can trigger, meaning timeout_scope.cancelled_caught at line 161 will likely never be True.

Consider removing one timeout mechanism:

♻️ Suggested simplification
-                                        with anyio.move_on_after(30) as timeout_scope:  # 30 second timeout
-                                            result = await monitor.wait_for_any_of(
-                                                [ExporterStatus.AVAILABLE, ExporterStatus.AFTER_LEASE_HOOK_FAILED],
-                                                timeout=30.0
-                                            )
-                                            if result == ExporterStatus.AVAILABLE:
-                                                logger.info("afterLease hook completed")
-                                            elif result == ExporterStatus.AFTER_LEASE_HOOK_FAILED:
-                                                logger.warning("afterLease hook failed")
-                                            elif monitor.connection_lost:
-                                                # Connection lost - exporter closed, hook likely completed
-                                                logger.info("Exporter connection closed (hook completed)")
-                                            else:
-                                                logger.debug("Hook completion not confirmed")
-                                        if timeout_scope.cancelled_caught:
-                                            logger.warning("Timeout waiting for afterLease hook to complete")
+                                        result = await monitor.wait_for_any_of(
+                                            [ExporterStatus.AVAILABLE, ExporterStatus.AFTER_LEASE_HOOK_FAILED],
+                                            timeout=30.0
+                                        )
+                                        if result == ExporterStatus.AVAILABLE:
+                                            logger.info("afterLease hook completed")
+                                        elif result == ExporterStatus.AFTER_LEASE_HOOK_FAILED:
+                                            logger.warning("afterLease hook failed")
+                                        elif monitor.connection_lost:
+                                            logger.info("Exporter connection closed (hook completed)")
+                                        else:
+                                            logger.warning("Timeout waiting for afterLease hook to complete")
e2e/tests-hooks.bats (2)

75-76: Consider caching yq installation for faster test execution.

Using go run github.com/mikefarah/yq/v4@latest fetches from the network on each invocation. For repeated test runs, this adds latency. Consider installing yq once in the test environment setup or using a pre-installed binary.


78-82: Clarify the 3>&- redirection purpose.

The 3>&- closes file descriptor 3 before running the background loop. While this works, a brief comment explaining why this is needed (likely to prevent fd leaks from bats internals) would help maintainability.

python/packages/jumpstarter/jumpstarter/client/core_test.py (1)

16-53: Optional: consolidate gRPC error/status helpers.

These helpers look similar to those in status_monitor_test.py; a shared test util could reduce duplication.

python/packages/jumpstarter/jumpstarter/client/core.py (1)

105-126: Consider allowing ExporterStatus.UNSPECIFIED for backward compatibility.

If GetStatus returns the default/unspecified value, Line 124 currently raises ExporterNotReady, which can block driver calls for exporters that haven’t adopted explicit status reporting yet. Adding UNSPECIFIED keeps behavior more permissive without impacting the ready path.

♻️ Suggested tweak
-        ALLOWED_STATUSES = {
-            ExporterStatus.LEASE_READY,
-            ExporterStatus.BEFORE_LEASE_HOOK,
-            ExporterStatus.AFTER_LEASE_HOOK,
-        }
+        ALLOWED_STATUSES = {
+            ExporterStatus.LEASE_READY,
+            ExporterStatus.BEFORE_LEASE_HOOK,
+            ExporterStatus.AFTER_LEASE_HOOK,
+            ExporterStatus.UNSPECIFIED,
+        }
python/packages/jumpstarter/jumpstarter/common/types.py (1)

16-18: Consider removing the empty TYPE_CHECKING block.

The TYPE_CHECKING guard and its comment suggest async stub type aliases were planned but the block is empty. This is dead code that could be removed for clarity, or the intended type aliases should be added.

🧹 Proposed cleanup
-# Async stub type aliases are only available for type checking (defined in .pyi files)
-if TYPE_CHECKING:
-    pass
-

Also update the import:

-from typing import TYPE_CHECKING, TypeAlias
+from typing import TypeAlias
python/packages/jumpstarter/jumpstarter/exporter/lease_context.py (1)

54-57: Consider using explicit validation instead of assert statements.

assert statements can be disabled when Python runs with optimization flags (-O), which would skip these critical validations in production. Consider raising ValueError explicitly.

♻️ Proposed fix
     def __post_init__(self):
         """Validate that required resources are present."""
-        assert self.before_lease_hook is not None, "LeaseScope requires a before_lease_hook event"
-        assert self.lease_name, "LeaseScope requires a non-empty lease_name"
+        if self.before_lease_hook is None:
+            raise ValueError("LeaseContext requires a before_lease_hook event")
+        if not self.lease_name:
+            raise ValueError("LeaseContext requires a non-empty lease_name")
python/packages/jumpstarter/jumpstarter/exporter/hooks.py (1)

621-629: Consider making the 1-second delay configurable or documenting its purpose more prominently.

The delay at Line 624 allows clients to poll and see the AVAILABLE status before the lease is released. While the comment explains this, users might wonder why there's a hardcoded delay. Consider extracting this to a constant or making it configurable.

python/packages/jumpstarter/jumpstarter/client/status_monitor_test.py (1)

426-522: Good coverage of wait_for_any_of behavior.

The tests cover the key scenarios for multi-target waiting. The parallel structure to wait_for_status tests ensures consistent behavior.

Consider adding a test for DEADLINE_EXCEEDED handling in wait_for_any_of for completeness (similar to test_wait_for_status_deadline_exceeded_returns_promptly), though the UNAVAILABLE test provides reasonable coverage of the connection-loss path.

python/packages/jumpstarter/jumpstarter/exporter/session.py (1)

270-280: Consider reducing GetStatus log level.

The GetStatus RPC may be polled frequently by clients (e.g., StatusMonitor polls every 0.1-0.3s). Using logger.info could result in excessive log output.

Consider using DEBUG level
     async def GetStatus(self, request, context):
         """Get the current exporter status with transition tracking."""
-        logger.info("GetStatus() -> %s (version=%d)", self._current_status, self._status_version)
+        logger.debug("GetStatus() -> %s (version=%d)", self._current_status, self._status_version)
         response = jumpstarter_pb2.GetStatusResponse(
python/packages/jumpstarter/jumpstarter/exporter/exporter.py (1)

360-382: Consider using _controller_stub in _unregister_with_controller.

This method manages the channel manually instead of using the _controller_stub context manager. While the explicit channel.close() with CancelScope(shield=True) is correct, using the context manager would be more consistent with the rest of the codebase.

Optional refactor for consistency
     async def _unregister_with_controller(self):
         """Safely unregister from controller with timeout and error handling."""
         if not (self._registered and self._unregister):
             return

         logger.info("Unregistering exporter with controller")
         try:
             with move_on_after(10):  # 10 second timeout
-                channel = await self.channel_factory()
-                try:
-                    controller = jumpstarter_pb2_grpc.ControllerServiceStub(channel)
+                async with self._controller_stub() as controller:
                     await self._report_status(ExporterStatus.OFFLINE, "Exporter shutting down")
                     await controller.Unregister(
                         jumpstarter_pb2.UnregisterRequest(
                             reason="Exporter shutdown",
                         )
                     )
                     logger.info("Controller unregistration completed successfully")
-                finally:
-                    with CancelScope(shield=True):
-                        await channel.close()
         except Exception as e:
             logger.error("Error during controller unregistration: %s", e, exc_info=True)

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@e2e/tests-hooks.bats`:
- Around line 150-171: The test "hooks: create client and exporter" currently
provisions shared state inline (jmp admin create client/exporter and jmp login)
which other tests depend on; make provisioning idempotent and runnable in
isolation by moving those create/login steps into a setup_file or setup function
or by extracting helper functions (e.g., ensure_client_exists,
ensure_exporter_exists, ensure_logged_in) that each test calls; implement the
helpers to check for existence and only create/login when missing (using the
same jmp admin create client/exporter and jmp login commands referenced) so
every test explicitly prepares and resets required state.
- Around line 74-76: Before merging per-test overlays, restore a pristine base
exporter YAML into the exporter target file to avoid cross-test leakage: copy
the committed clean fixture (a base exporter YAML checked into the repo) to the
exporter output path for the test, then run the existing go run
github.com/mikefarah/yq/v4... ". * load(\"e2e/exporters/${config_file}\")"
merge; apply the same change to the other occurrence (the block around lines
94-96) so each test starts from a deterministic base rather than mutating a
shared file.
🧹 Nitpick comments (6)
python/packages/jumpstarter-cli/jumpstarter_cli/shell.py (2)

151-155: Redundant double timeout on afterLease hook wait.

Both anyio.move_on_after(30) and monitor.wait_for_any_of(..., timeout=30.0) enforce the same 30-second limit. The inner timeout parameter should be sufficient; the outer move_on_after is redundant.

♻️ Proposed simplification
-                                        with anyio.move_on_after(30) as timeout_scope:  # 30 second timeout
-                                            result = await monitor.wait_for_any_of(
-                                                [ExporterStatus.AVAILABLE, ExporterStatus.AFTER_LEASE_HOOK_FAILED],
-                                                timeout=30.0
-                                            )
-                                            if result == ExporterStatus.AVAILABLE:
-                                                logger.info("afterLease hook completed")
-                                            elif result == ExporterStatus.AFTER_LEASE_HOOK_FAILED:
-                                                logger.warning("afterLease hook failed")
-                                            elif monitor.connection_lost:
-                                                # Connection lost - exporter closed, hook likely completed
-                                                logger.info("Exporter connection closed (hook completed)")
-                                            else:
-                                                logger.debug("Hook completion not confirmed")
-                                        if timeout_scope.cancelled_caught:
-                                            logger.warning("Timeout waiting for afterLease hook to complete")
+                                        result = await monitor.wait_for_any_of(
+                                            [ExporterStatus.AVAILABLE, ExporterStatus.AFTER_LEASE_HOOK_FAILED],
+                                            timeout=30.0
+                                        )
+                                        if result == ExporterStatus.AVAILABLE:
+                                            logger.info("afterLease hook completed")
+                                        elif result == ExporterStatus.AFTER_LEASE_HOOK_FAILED:
+                                            logger.warning("afterLease hook failed")
+                                        elif monitor.connection_lost:
+                                            # Connection lost - exporter closed, hook likely completed
+                                            logger.info("Exporter connection closed (hook completed)")
+                                        else:
+                                            logger.warning("Timeout waiting for afterLease hook to complete")

259-283: Consider adding a default case for unexpected config types.

The match statement handles ClientConfigV1Alpha1 and ExporterConfigV1Alpha1, but lacks a default case. If an unexpected config type is passed, the function silently returns None without calling sys.exit(), which could lead to confusing behavior.

♻️ Proposed fix to add a fallback case
         case ExporterConfigV1Alpha1():
             with config.serve_unix() as path:
                 # SAFETY: the exporter config is local thus considered trusted
                 launch_shell(
                     path,
                     "local",
                     allow=[],
                     unsafe=True,
                     use_profiles=False,
                     command=command,
                 )
+
+        case _:
+            raise click.ClickException(f"Unsupported config type: {type(config).__name__}")
e2e/tests-hooks.bats (1)

17-22: Enable xtrace for easier E2E debugging.

Please add set -o xtrace in setup() so test runs are always traceable.
Based on learnings, keep tracing enabled (-x) unconditionally: there are no secrets in the bats tests, and tracing makes failures easier to debug with more understandable logs.

Proposed diff
 setup() {
+  set -o xtrace
   bats_load_library bats-support
   bats_load_library bats-assert
python/packages/jumpstarter/jumpstarter/exporter/hooks.py (3)

240-241: Consider moving import time to method level.

The import time statement is inside the nested read_pty_output function, which means it's executed on every call to that function. While the import is cached by Python, moving it to line 174-175 alongside the other local imports (pty, subprocess) would be more consistent and slightly more efficient.

🔧 Suggested fix
     def _execute_hook_process(  # noqa: C901
         ...
     ) -> None:
         ...
         import pty
         import subprocess
+        import time

         ...
                 async def read_pty_output() -> None:  # noqa: C901
                     """Read from PTY master fd line by line using non-blocking I/O."""
                     logger.debug("read_pty_output task started")
                     buffer = b""
                     read_count = 0
                     last_heartbeat = 0
-                    import time
                     start_time = time.monotonic()

625-633: Consider making the status propagation delay configurable.

The 1.0 second delay before requesting lease release is well-documented, but this value may need tuning based on network conditions. Consider exposing this as a configurable parameter or documenting the rationale for this specific value in the code comments.


49-55: Remove unused device_factory field from HookExecutor.

The device_factory field is declared in the dataclass but is never used in any class method (_create_hook_env, _execute_hook, or the public hook execution methods). Since it serves no purpose within the class, it should be removed to keep the code clean and reduce confusion.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

🤖 Fix all issues with AI agents
In `@e2e/tests-hooks.bats`:
- Around line 80-84: The restart loop around `jmp run --exporter
test-exporter-hooks` is missing backoff and can spin if the exporter exits
immediately; modify the loop in e2e/tests-hooks.bats (the `while true; do ...
done` that runs `jmp run`) to add a short sleep between iterations (e.g., sleep
1) or implement simple exponential backoff to avoid rapid respawns when startup
fails.

In `@python/packages/jumpstarter-cli-common/jumpstarter_cli_common/oidc.py`:
- Around line 22-27: The discovery SSL context respects SSL_CERT_FILE via
_get_ssl_context(), but token endpoint calls using
authlib.requests_client.OAuth2Session in token_exchange_grant, password_grant,
refresh_token_grant, and authorization_code_grant do not — causing cert
verification failures with private CAs. Resolve the CA bundle path the same way
(os.environ.get("SSL_CERT_FILE") or certifi.where()) and pass it to
OAuth2Session/fetch_token via the verify parameter (either when instantiating
OAuth2Session or on each fetch_token call) so token requests use the same custom
CA bundle as discovery; reference the _get_ssl_context() logic and update the
four grant functions named above to accept/use that resolved ca_file.

In `@python/packages/jumpstarter/jumpstarter/client/lease.py`:
- Around line 239-270: The AioRpcError handler in the Dial retry loop (around
self.controller.Dial in lease.py) currently treats every
non-FAILED_PRECONDITION/"not ready" error as "exporter went offline" and
returns, which silently swallows important errors; change the except AioRpcError
block so that only terminal/gone-like status codes (e.g.,
grpc.StatusCode.UNAVAILABLE, grpc.StatusCode.NOT_FOUND,
grpc.StatusCode.CANCELLED) are handled by logging a warning and returning, while
all other status codes (e.g., UNAUTHENTICATED, PERMISSION_DENIED, INTERNAL) are
re-raised so they surface; retain informative logging using e.code() and
e.details() in the warning and when re-raising include the original exception.
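The selective handling can be sketched with a stubbed status-code enum (so the sketch is self-contained; the real code would use `grpc.StatusCode` and `AioRpcError`):

```python
from enum import Enum


class StatusCode(Enum):
    # Subset of grpc.StatusCode, stubbed for illustration
    UNAVAILABLE = "unavailable"
    NOT_FOUND = "not_found"
    CANCELLED = "cancelled"
    UNAUTHENTICATED = "unauthenticated"
    INTERNAL = "internal"


# Codes that mean "exporter is gone"; everything else should surface
TERMINAL_GONE_CODES = {
    StatusCode.UNAVAILABLE,
    StatusCode.NOT_FOUND,
    StatusCode.CANCELLED,
}


def handle_dial_error(code, warnings):
    """Swallow (with a warning) only gone-like codes; re-raise anything
    else so auth/internal failures are not silently hidden."""
    if code in TERMINAL_GONE_CODES:
        warnings.append(f"exporter went offline: {code.name}")
        return True
    raise RuntimeError(f"Dial failed: {code.name}")
```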

In `@python/packages/jumpstarter/jumpstarter/exporter/hooks.py`:
- Around line 366-380: The timeout block that calls
anyio.to_thread.run_sync(process.wait) inside anyio.move_on_after(5) can hang
because run_sync defaults to abandon_on_cancel=False; change the call in the
cleanup routine (the block that checks process.poll() is None and currently uses
anyio.move_on_after(5) / anyio.to_thread.run_sync(process.wait)) to pass
abandon_on_cancel=True (i.e., anyio.to_thread.run_sync(process.wait,
abandon_on_cancel=True)) so the thread is abandoned when the 5s deadline fires
and the code can proceed to the forced kill; keep the surrounding try/except
behavior unchanged.
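A stdlib analogue of the bounded-wait-then-kill cleanup shows the intended shape (anyio users get the non-hanging wait via `abandon_on_cancel=True`; this sketch uses `subprocess.Popen.wait(timeout=...)` instead):

```python
import subprocess


def stop_process(process: subprocess.Popen, grace: float = 5.0) -> int:
    """Give the process `grace` seconds to exit, then SIGKILL it so the
    cleanup path can never hang on an unkillable wait."""
    if process.poll() is None:
        try:
            process.wait(timeout=grace)
        except subprocess.TimeoutExpired:
            process.kill()
            process.wait()
    return process.returncode
```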
🧹 Nitpick comments (12)
e2e/tests-hooks.bats (2)

9-10: Unused variable CURRENT_HOOKS_CONFIG.

This variable is initialized but never referenced anywhere in the file. Remove it to avoid confusion.

-# Track which config is currently active
-CURRENT_HOOKS_CONFIG=""

74-78: go run …@latest is invoked twice per test start — consider caching the binary.

Each go run github.com/mikefarah/yq/v4@latest may re-resolve and potentially re-compile the tool. With two calls per helper and many tests, this adds up. Consider installing yq once (e.g., in setup_file) and invoking the binary directly.

Example
# In setup_file():
GOBIN="${BATS_RUN_TMPDIR}" go install github.com/mikefarah/yq/v4@latest
YQ="${BATS_RUN_TMPDIR}/yq"

# Then in helpers:
"$YQ" -i 'del(.hooks)' /etc/jumpstarter/exporters/test-exporter-hooks.yaml
"$YQ" -i ". * load(\"e2e/exporters/${config_file}\")" /etc/jumpstarter/exporters/test-exporter-hooks.yaml

Also applies to: 96-100

python/packages/jumpstarter-cli/jumpstarter_cli/shell.py (2)

157-178: Redundant double timeout: move_on_after(30) wraps wait_for_any_of(..., timeout=30.0).

Both enforce the same 30-second limit. The wait_for_any_of already returns None on timeout (handled at line 168+), making the outer move_on_after scope and its cancelled_caught check (line 177) unreachable in practice. Consider removing one layer.

♻️ Simplified version removing the outer scope
-                                        with anyio.move_on_after(30) as timeout_scope:  # 30 second timeout
-                                            result = await monitor.wait_for_any_of(
-                                                [ExporterStatus.AVAILABLE, ExporterStatus.AFTER_LEASE_HOOK_FAILED],
-                                                timeout=30.0
-                                            )
-                                            if result == ExporterStatus.AVAILABLE:
-                                                logger.info("afterLease hook completed")
-                                            elif result == ExporterStatus.AFTER_LEASE_HOOK_FAILED:
-                                                from jumpstarter.common.exceptions import ExporterOfflineError
-                                                reason = monitor.status_message or "afterLease hook failed"
-                                                raise ExporterOfflineError(reason)
-                                            elif monitor.connection_lost:
-                                                from jumpstarter.common.exceptions import ExporterOfflineError
-                                                reason = (
-                                                    monitor.status_message
-                                                    or "Connection to exporter lost during afterLease hook"
-                                                )
-                                                raise ExporterOfflineError(reason)
-                                            else:
-                                                logger.debug("Hook completion not confirmed")
-                                        if timeout_scope.cancelled_caught:
-                                            logger.warning("Timeout waiting for afterLease hook to complete")
+                                        result = await monitor.wait_for_any_of(
+                                            [ExporterStatus.AVAILABLE, ExporterStatus.AFTER_LEASE_HOOK_FAILED],
+                                            timeout=30.0
+                                        )
+                                        if result == ExporterStatus.AVAILABLE:
+                                            logger.info("afterLease hook completed")
+                                        elif result == ExporterStatus.AFTER_LEASE_HOOK_FAILED:
+                                            from jumpstarter.common.exceptions import ExporterOfflineError
+                                            reason = monitor.status_message or "afterLease hook failed"
+                                            raise ExporterOfflineError(reason)
+                                        elif monitor.connection_lost:
+                                            from jumpstarter.common.exceptions import ExporterOfflineError
+                                            reason = (
+                                                monitor.status_message
+                                                or "Connection to exporter lost during afterLease hook"
+                                            )
+                                            raise ExporterOfflineError(reason)
+                                        else:
+                                            logger.warning("Timeout waiting for afterLease hook to complete")

94-97: Consolidate repeated inline imports of ExporterOfflineError.

ExporterOfflineError is imported inline at lines 122, 126, 150, 165, and 169 — all within the same function. Move this import to the top of the function alongside the other local imports (lines 94-97) to reduce clutter.

Also applies to: 122-126, 150-150, 165-165, 169-169

python/packages/jumpstarter/jumpstarter/client/status_monitor.py (2)

321-321: Use %-style formatting in logging calls instead of f-strings.

f-strings are evaluated eagerly regardless of log level. While the impact is negligible for info/warning, using %-style defers formatting to the logging framework and is the idiomatic Python logging pattern.

For example:

-                    logger.warning(f"Missed {missed} status transition(s)")
+                    logger.warning("Missed %d status transition(s)", missed)
-                    logger.info(f"Status changed: {old_status} -> {new_status} (version={new_version})")
+                    logger.info("Status changed: %s -> %s (version=%d)", old_status, new_status, new_version)

Same pattern applies to lines 349, 379, and 381.

Also applies to: 332-332, 349-349, 379-379, 381-381
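A small runnable illustration of the lazy-formatting point (the logger name and function are arbitrary, not project code):

```python
import logging

logger = logging.getLogger("status_monitor_demo")


def log_transition(old_status: str, new_status: str, version: int) -> None:
    # %-style arguments are only interpolated when a handler actually emits
    # the record; an f-string would be built unconditionally, even if the
    # logger's level suppresses the message.
    logger.info("Status changed: %s -> %s (version=%d)", old_status, new_status, version)
```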


127-204: Significant duplication between wait_for_status and wait_for_any_of.

The connection-recovery verification block (lines 139-162 vs 218-241) and the inner wait-loop structure are nearly identical, differing only in whether the check is == target or in targets. Consider extracting the shared logic (connection verification, wait-loop scaffolding with active-waiter tracking and event capture) into a private helper to reduce the maintenance surface.

Also applies to: 206-288

python/packages/jumpstarter/jumpstarter/exporter/hooks.py (3)

472-486: Polling loop with floating-point accumulation is fine for the 30s scale but could be cleaner.

The elapsed += interval pattern accumulates small FP errors over iterations, though it's negligible for a 30-second timeout. A monotonic clock comparison would be more robust and is already used elsewhere in this file (line 241).

Optional: use monotonic clock
-            timeout = 30  # seconds
-            interval = 0.1  # seconds
-            elapsed = 0.0
-            while not lease_scope.is_ready():
-                if elapsed >= timeout:
+            import time
+            deadline = time.monotonic() + 30  # seconds
+            interval = 0.1  # seconds
+            while not lease_scope.is_ready():
+                if time.monotonic() >= deadline:
                     error_msg = "Timeout waiting for lease scope to be ready"
                     logger.error(error_msg)
                     await report_status(ExporterStatus.BEFORE_LEASE_HOOK_FAILED, error_msg)
                     lease_scope.before_lease_hook.set()
                     return
                 await anyio.sleep(interval)
-                elapsed += interval
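The monotonic-deadline pattern generalizes; a hypothetical stand-alone helper (not project code) shows the shape without the floating-point accumulation:

```python
import time


def wait_until(predicate, timeout: float = 30.0, interval: float = 0.1) -> bool:
    """Poll `predicate` until it is truthy or a monotonic deadline passes."""
    deadline = time.monotonic() + timeout
    while not predicate():
        if time.monotonic() >= deadline:
            return False  # timed out without the condition becoming true
        time.sleep(interval)
    return True
```

Because the deadline is computed once from `time.monotonic()`, per-iteration drift in `sleep(interval)` cannot stretch the total wait the way `elapsed += interval` can.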

627-635: finally block unconditionally sleeps 1.0s even when no hook ran.

When after_lease is not configured, run_after_lease_hook returns early at line 579, but the finally block still executes the 1.0s sleep and lease release call. This adds unnecessary latency to every lease teardown for exporters without hooks.

Consider guarding the sleep:

Proposed change
         finally:
-            # Brief delay to allow client's status monitor to poll and see AVAILABLE
-            # before we request lease release (which closes the session)
-            await anyio.sleep(1.0)
-
-            # Request lease release from controller after hook completes (success or failure)
-            # This ensures the lease is always released even if the client disconnects
-            if request_lease_release:
-                await request_lease_release()
+            # Only delay and release if a hook was actually executed
+            if self.config.after_lease:
+                # Brief delay to allow client's status monitor to poll and see AVAILABLE
+                # before we request lease release (which closes the session)
+                await anyio.sleep(1.0)
+
+            # Request lease release from controller after hook completes (success or failure)
+            # This ensures the lease is always released even if the client disconnects
+            if request_lease_release:
+                await request_lease_release()

160-168: Consider extracting _execute_hook_process into smaller helpers.

At ~250 lines with nested async closures, this method carries significant cognitive load (acknowledged by noqa: C901). The PTY setup, output reading, process waiting, and timeout handling could each be separate methods. Not blocking, but worth considering for future maintainability.

python/packages/jumpstarter/jumpstarter/exporter/exporter.py (3)

329-332: Use lazy % formatting instead of f-strings in logger calls.

Lines 330 and 332 use f-strings which evaluate the string unconditionally, even when the log level is disabled. Use %-style formatting for consistency with the rest of the file and to avoid unnecessary string construction.

Proposed fix
-            logger.info(f"Updated status to {status}: {message}")
+            logger.info("Updated status to %s: %s", status, message)
         except Exception as e:
-            logger.error(f"Failed to update status: {e}")
+            logger.error("Failed to update status: %s", e)

As per coding guidelines, "Use Ruff for code formatting and linting on all Python files except jumpstarter-protocol" — Ruff's G rules flag f-strings in logging calls.


629-636: Client connection tasks spawned on outer tg may outlive the session.

_handle_client_conn tasks are started on the outer tg (line 629), but the session socket is closed when session_for_lease exits (after the finally block at line 670). Any in-flight client connections will get socket errors. This is likely benign since _handle_client_conn catches exceptions, but spawning on conn_tg instead would give cleaner lifecycle alignment.

Suggested change
-                            tg.start_soon(
+                            conn_tg.start_soon(
                                 self._handle_client_conn,
                                 lease_scope.socket_path,
                                 request.router_endpoint,

360-382: _unregister_with_controller bypasses _controller_stub context manager.

This method manually creates and closes a channel (lines 368-380) rather than using _controller_stub(). This is intentional — it needs the raw channel for both _report_status and Unregister — but note that _report_status at line 371 internally opens another channel via _controller_stub. This means two channels are open simultaneously during unregistration.

Consider passing the existing stub to _report_status or inlining the status report here to avoid the extra channel.

Contributor

@coderabbitai coderabbitai bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
python/packages/jumpstarter/jumpstarter/exporter/exporter.py (1)

683-701: ⚠️ Potential issue | 🟡 Minor

Status stream retry exhaustion would crash serve() with an unhandled exception.

If the Status stream fails permanently (retries exhausted in _retry_stream), the exception propagates into tg, cancelling the entire exporter. The Listen stream is handled similarly but is scoped to a lease. Consider whether the Status stream should have unlimited retries or a more graceful degradation path, since it's the exporter's lifeline to the controller.

🧹 Nitpick comments (5)
python/packages/jumpstarter/jumpstarter/client/lease.py (1)

304-304: Prefer e.code() == grpc.StatusCode.UNAVAILABLE for consistency.

handle_async (line 252) compares status codes directly via grpc.StatusCode, but here it compares via e.code().name == "UNAVAILABLE". Using the enum directly is more robust and consistent.

Suggested fix
-                if e.code().name == "UNAVAILABLE":
+                if e.code() == grpc.StatusCode.UNAVAILABLE:
python/packages/jumpstarter/jumpstarter/exporter/exporter.py (4)

242-258: Fragile transient-error detection via string matching.

The substring checks "Stream removed" and "UNAVAILABLE" on line 247 are brittle — they depend on exact gRPC error message wording that can change across versions. Consider checking grpc.StatusCode from RpcError.code() instead.

Suggested approach
-                    error_str = str(e)
-                    is_transient = "Stream removed" in error_str or "UNAVAILABLE" in error_str
+                    is_transient = (
+                        isinstance(e, grpc.RpcError)
+                        and e.code() == grpc.StatusCode.UNAVAILABLE
+                    )

330-332: Use %-style formatting in logger calls instead of f-strings.

f-strings are evaluated eagerly regardless of log level. The rest of this file consistently uses %-style formatting for logger calls — these two lines are the only exceptions.

Suggested fix
-            logger.info(f"Updated status to {status}: {message}")
+            logger.info("Updated status to %s: %s", status, message)
         except Exception as e:
-            logger.error(f"Failed to update status: {e}")
+            logger.error("Failed to update status: %s", e)

366-388: Minor: _report_status opens a separate channel inside _unregister_with_controller.

Line 377 calls self._report_status() which opens its own channel via _controller_stub, while the enclosing code already has a channel open at line 374. This creates two channels when one would suffice. Consider passing the existing stub to avoid the extra channel.


786-788: Magic sleep to prevent SSL corruption is fragile.

The sleep(0.2) with a comment about SSL corruption prevention is a timing-based workaround. If the root cause is overlapping gRPC connections on the same socket, the dual-socket approach in session_for_lease should already address that. Consider whether this sleep is still necessary, or if it can be replaced with a more deterministic signal (e.g., confirming the previous session's server has fully stopped).

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
python/packages/jumpstarter/jumpstarter/client/lease.py (1)

287-321: ⚠️ Potential issue | 🟡 Minor

_wait_for_ready_connection logic is sound overall; one concern about _last_dial_error timing.

The method correctly surfaces a stored _last_dial_error on the first AioRpcError, which works because handle_async (triggered by the incoming connection from client_from_path) runs and sets the field before returning—closing the server-side connection and causing the client to receive the error.

However, this relies on handle_async completing its Dial call (including the full retry loop up to dial_timeout) before _wait_for_ready_connection exhausts its 5 retries × 1s sleep = ~5s budget. With dial_timeout defaulting to 30s, if the exporter takes a long time to reject with "hook failed", the readiness check may exhaust retries before _last_dial_error is populated, yielding a generic "Max retries reached" error instead of the hook failure message.

Consider either:

  • Making _wait_for_ready_connection's total wait budget at least dial_timeout (e.g., deriving retries or total timeout from self.dial_timeout), or
  • Having _wait_for_ready_connection check _last_dial_error even in the max-retries-exhausted path.
Minimal fix: check _last_dial_error on retry exhaustion
                 if retries_left > 1:
                     retries_left -= 1
                 else:
                     logger.error("Max retries reached while waiting for ready connection at %s", path)
+                    if self._last_dial_error:
+                        raise ConnectionError(self._last_dial_error) from e
                     raise ConnectionError("Max retries reached while waiting for ready connection at %s" % path) from e
python/packages/jumpstarter-cli/jumpstarter_cli/shell.py (1)

300-310: ⚠️ Potential issue | 🟡 Minor

Local exporter shell exit code is silently discarded.

In the ExporterConfigV1Alpha1 branch, launch_shell(...) returns an exit code, but it's not passed to sys.exit() like the ClientConfigV1Alpha1 path (line 298). The CLI will always exit 0 regardless of the command's actual result.

🔧 Proposed fix
         case ExporterConfigV1Alpha1():
             with config.serve_unix() as path:
                 # SAFETY: the exporter config is local thus considered trusted
-                launch_shell(
+                exit_code = launch_shell(
                     path,
                     "local",
                     allow=[],
                     unsafe=True,
                     use_profiles=False,
                     command=command,
                 )
+                sys.exit(exit_code)
🤖 Fix all issues with AI agents
In `@python/packages/jumpstarter-cli/jumpstarter_cli/shell.py`:
- Around line 157-183: The outer anyio.move_on_after(30) is redundant and can
mask error-handling because it may cancel before monitor.wait_for_any_of(...,
timeout=30.0) returns; remove the with anyio.move_on_after(30) block and call
result = await monitor.wait_for_any_of([ExporterStatus.AVAILABLE,
ExporterStatus.AFTER_LEASE_HOOK_FAILED], timeout=30.0) directly, then keep the
existing branches that raise ExporterOfflineError for AFTER_LEASE_HOOK_FAILED
and connection_lost (inspect monitor.current_status and monitor.status_message
as currently done); replace the timeout_scope.cancelled_caught check by treating
result is None as the timeout case and emit the same logger.warning("Timeout
waiting for afterLease hook to complete") when result is None.
🧹 Nitpick comments (6)
python/packages/jumpstarter/jumpstarter/client/status_monitor.py (2)

321-321: Prefer lazy %-formatting in logger calls instead of f-strings.

f-strings are eagerly evaluated even when the log level is suppressed. Lines 321, 332, 349, 379, and 381 all use f-strings. Consider logger.warning("Missed %d status transition(s)", missed) style for consistency with the rest of the file (e.g., lines 374-377 already use %-formatting).


218-241: Connection-recovery verification logic is duplicated.

The recovery-poll block in wait_for_any_of (lines 218–241) is nearly identical to the one in wait_for_status (lines 139–162). Consider extracting a shared helper like _verify_connection() -> ExporterStatus | None to reduce duplication.

python/packages/jumpstarter/jumpstarter/client/lease.py (2)

253-253: Fragile string matching on gRPC error details.

"not ready" in str(e.details()) (line 253) and "hook failed" in str(e.details()) (line 270) couple the client to the exact wording of server-side error messages. If the server changes the phrasing (e.g., "hook execution failed"), these checks silently stop matching and the error handling degrades.

Consider using gRPC metadata or structured error details (e.g., google.rpc.Status with error details) to communicate the error type, or at minimum define these sentinel strings as shared constants between the client and server.

Also applies to: 270-270
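At minimum, shared sentinel constants keep the client and server wording in sync. A hypothetical shared module (names are illustrative, not the project's actual constants):

```python
# Imported by both the exporter (when aborting a Dial) and the client (when
# matching error details), so the wording cannot silently drift apart.
DIAL_NOT_READY = "not ready"
DIAL_HOOK_FAILED = "hook failed"


def is_hook_failure(details: str) -> bool:
    """True if the Dial error details indicate a lease-hook failure."""
    return DIAL_HOOK_FAILED in details


def is_not_ready(details: str) -> bool:
    """True if the Dial error details indicate the exporter is not yet ready."""
    return DIAL_NOT_READY in details
```

Structured error details (e.g. `google.rpc.Status`) remain the more robust option, but constants are a cheap first step.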


312-315: Comparing status codes by .name string instead of enum identity.

Line 312 uses e.code().name == "UNAVAILABLE" which works but is less idiomatic and more fragile than comparing the enum directly:

-                if e.code().name == "UNAVAILABLE":
+                if e.code() == grpc.StatusCode.UNAVAILABLE:

The rest of this file (line 253) already uses the enum comparison pattern.

python/packages/jumpstarter-cli/jumpstarter_cli/shell.py (2)

116-119: Hardcoded 300s timeout for beforeLease hook wait.

The exporter config supports per-hook timeout, but the client-side wait uses a hardcoded 300 seconds. If a hook is configured with a longer timeout on the exporter side, the client will give up prematurely. Consider making this configurable or deriving it from the lease/exporter metadata.


94-97: Consolidate repeated inline imports of ExporterOfflineError.

ExporterOfflineError is imported five times inside this function (lines 122, 126, 150, 165, 169). Move it to a single import alongside the other imports at the top of the function body (line 94–97) to reduce noise and help with the C901 complexity.

  from contextlib import ExitStack
  from jumpstarter.client.client import client_from_path
  from jumpstarter.common import ExporterStatus
+ from jumpstarter.common.exceptions import ExporterOfflineError

Also applies to: 122-126, 150-150, 165-165, 169-169

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@python/packages/jumpstarter/jumpstarter/client/lease.py`:
- Line 64: The _last_dial_error attribute on the Lease dataclass is dead state
(set but never read); either remove the field declaration "_last_dial_error: str
| None = field(default=None, init=False)" from the class, or make its value
observable to callers by surfacing it (for example add a read-only property
get_last_dial_error or include it in the Lease.__repr__/to_dict output and
update any Dial/hook failure handling in the method that sets _last_dial_error
to also document/propagate the error). Locate references to _last_dial_error and
the method that assigns it (the Dial/hook failure path) and implement one
consistent approach: delete the unused field and related assignments, or expose
it via a public accessor and update callers to read that accessor where hook
failure information is needed.

In `@python/packages/jumpstarter/jumpstarter/client/status_monitor.py`:
- Around line 374-384: The DEADLINE_EXCEEDED branch that handles 20 consecutive
timeouts sets self._connection_lost and resets self._any_change_event but
forgets to stop the monitor loop; set self._running = False in the same block
where deadline_retries >= 20 (alongside self._connection_lost = True and
resetting self._any_change_event) so the monitor actually stops; make this
change in the method that references deadline_retries and _running (the
GetStatus/monitor loop function) to mirror the UNAVAILABLE handling.
🧹 Nitpick comments (3)
python/packages/jumpstarter/jumpstarter/client/status_monitor.py (1)

390-392: Unhandled gRPC errors and generic exceptions fall through to the sleep-and-retry loop silently.

Line 390 catches any AioRpcError that isn't UNIMPLEMENTED/UNAVAILABLE/DEADLINE_EXCEEDED and only does logger.debug. Line 391-392 catches all other exceptions with logger.debug. For unexpected errors (e.g., PERMISSION_DENIED, INTERNAL), the monitor will keep retrying forever with only debug-level logging. Consider logging at warning level for unexpected gRPC codes so operators can diagnose issues.

python/packages/jumpstarter/jumpstarter/client/status_monitor_test.py (2)

35-51: HasField mock returns a fixed boolean regardless of field name.

response.HasField.return_value is set to a single boolean, so response.HasField("any_field_name") returns the same value. This works because the production code only calls HasField("previous_status"), but it's worth noting as a fragility point if the production code evolves.


607-618: test_context_manager_stops_on_exit doesn't fully exercise the context manager lifecycle.

Setting _running = True manually before entering the context manager causes start() (inside __aenter__) to early-return (since it checks if self._running: return). This test only validates that stop() (called from __aexit__) sets _running = False, not that __aenter__ properly initializes the polling loop. Consider removing the manual _running = True to let the full lifecycle execute.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@python/packages/jumpstarter/jumpstarter/client/lease.py`:
- Around line 59-64: Remove the unused dataclass field _last_dial_error from the
Lease class and delete any places that assign to it (e.g., the assignment in the
Dial/_dial code path that sets self._last_dial_error); ensure no remaining
references remain (search for "_last_dial_error") and if those assignments were
intended to communicate errors replace them with a direct log or exception
handling in the same function (e.g., log via self.logger or raise) so behavior
and tests remain unchanged.

In `@python/packages/jumpstarter/jumpstarter/exporter/exporter.py`:
- Around line 717-729: The current logic creates a LeaseContext and starts
handle_lease whenever self._lease_context is None and status.lease_name != "",
which can start unnecessary handlers for stale lease names; update the condition
to also require status.leased to be truthy before creating LeaseContext and
calling tg.start_soon. Specifically, guard the block that creates LeaseContext
(symbol: LeaseContext) and starts the task (tg.start_soon(self.handle_lease,
...)) by checking status.leased along with self._lease_context and
status.lease_name so handle_lease is only spawned for active leases.
🧹 Nitpick comments (5)
python/packages/jumpstarter/jumpstarter/client/lease.py (1)

245-248: Mixing time.monotonic() with anyio sleep is fine, but consider consistency with anyio's clock.

Minor nit: anyio's current_time() returns the same monotonic clock and keeps the code within the anyio abstraction. Not a functional issue — time.monotonic() works correctly here.

python/packages/jumpstarter/jumpstarter/exporter/exporter.py (4)

242-258: Prefer gRPC status code checks over string matching for transient error detection.

String matching on str(e) (line 247) is fragile and locale/version-dependent. gRPC errors carry structured status codes that are more reliable:

Suggested approach
-                    error_str = str(e)
-                    is_transient = "Stream removed" in error_str or "UNAVAILABLE" in error_str
+                    is_transient = (
+                        isinstance(e, grpc.aio.AioRpcError)
+                        and e.code() == grpc.StatusCode.UNAVAILABLE
+                    )

This also covers "Stream removed" cases since the server-side removal typically surfaces as UNAVAILABLE. If you need to handle additional codes (e.g., INTERNAL), extend the condition.


330-332: Use lazy %-formatting in logger calls instead of f-strings.

F-strings are evaluated eagerly even when the log level is disabled, adding unnecessary overhead on hot paths like status reporting.

-            logger.info(f"Updated status to {status}: {message}")
+            logger.info("Updated status to %s: %s", status, message)
         except Exception as e:
-            logger.error(f"Failed to update status: {e}")
+            logger.error("Failed to update status: %s", e)

636-645: Client connection tasks spawned on outer tg can outlive the session.

_handle_client_conn tasks are started on the outer tg (line 638), so they survive conn_tg cancellation and may still be running when session_for_lease tears down the Unix socket. The except in _handle_client_conn catches the resulting errors, so this isn't a crash risk, but it means connections aren't drained gracefully.

Consider spawning client connections on conn_tg instead, so they are cancelled together with the listen stream when the lease ends. If graceful drain is needed, you could add a brief move_on_after delay before conn_tg cancellation.


366-388: Minor: Double channel creation during unregistration.

_unregister_with_controller opens a channel explicitly (line 374), then _report_status at line 377 opens a second channel via _controller_stub. You could call controller.ReportStatus(...) directly on the already-open stub to avoid the extra channel.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@python/packages/jumpstarter/jumpstarter/exporter/hooks.py`:
- Around line 536-626: In run_after_lease_hook, wrap the final
request_lease_release() call in a try/except to defensively handle any
exceptions raised during cleanup: catch Exception around the await
request_lease_release() call, log the error with context (e.g., using
logger.error with exc_info=True) and do not re-raise so cleanup failures don’t
propagate out of the finally block; keep the existing delay/ shutdown_called
behavior unchanged.
🧹 Nitpick comments (6)
python/packages/jumpstarter-cli/jumpstarter_cli/shell.py (2)

121-135: Redundant repeated imports of ExporterOfflineError within the same function.

ExporterOfflineError is imported at lines 122, 126, 150, 164, and 168 — all inside the same function body. A single import at the top of the function (or at module level) would suffice and reduce clutter.

Also applies to: 148-186


226-238: Exception group handling silently drops non-timeout exceptions when a TimeoutError is present.

Lines 227-229 iterate the group and raise the first TimeoutError, discarding any other exceptions (including potential ExporterOfflineError). If both can co-occur, the ExporterOfflineError would be lost. This might be intentional, but worth noting.

python/packages/jumpstarter/jumpstarter/config/exporter.py (1)

166-168: Pre-existing @classmethod uses self instead of cls — confusing but functional.

dump_yaml is decorated with @classmethod but names its first parameter self rather than cls. The added by_alias=True change is correct and necessary for preserving camelCase keys in YAML output. The self/cls naming issue pre-dates this PR.

python/packages/jumpstarter/jumpstarter/client/core.py (1)

476-487: Per-source loggers lack explicit handler configuration.

Loggers like "exporter:beforeLease" and "exporter:afterLease" are created via logging.getLogger() but don't have handlers attached. They rely on propagation to the root logger. If the root logger hasn't been configured with a handler (common in library code), hook log messages at INFO/DEBUG level may be silently dropped.

Consider attaching a RichHandler (or at minimum a NullHandler) to these loggers on first use, or documenting that the consuming application must configure root-level logging.
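The NullHandler option can be sketched as a small accessor, following the stdlib's guidance for library loggers (`get_hook_logger` is an illustrative name):

```python
import logging


def get_hook_logger(source: str) -> logging.Logger:
    """Return the per-source hook logger, e.g. 'exporter:beforeLease'."""
    log = logging.getLogger(f"exporter:{source}")
    if not any(isinstance(h, logging.NullHandler) for h in log.handlers):
        # NullHandler suppresses the last-resort "no handlers" behavior while
        # still propagating records to whatever the application configures.
        log.addHandler(logging.NullHandler())
    return log
```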

python/packages/jumpstarter/jumpstarter/exporter/hooks.py (1)

448-534: request_lease_release parameter is accepted but never used in run_before_lease_hook.

The parameter is passed on line 453 and accepted from callers (exporter.py line 744), but the method body never invokes it. The endLease failure path (line 522) explicitly comments "No request_lease_release." If this is intentional for API symmetry with run_after_lease_hook, consider documenting it in the docstring or removing the parameter to avoid confusion.

python/packages/jumpstarter/jumpstarter/exporter/exporter.py (1)

313-332: Consider using %-style formatting in logger calls instead of f-strings.

Lines 330 and 332 use f-strings in logger calls (logger.info(f"Updated status to {status}: {message}")). F-strings are evaluated eagerly regardless of log level, while %-style formatting (logger.info("Updated status to %s: %s", status, message)) defers evaluation. This pattern appears throughout the PR. Minor performance concern but aligns with Python logging best practices.
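A small self-contained demonstration of the difference (the `Expensive` class is illustrative; it counts how often formatting actually runs):

```python
import logging

logger = logging.getLogger("demo")
logger.setLevel(logging.WARNING)  # INFO records are filtered out

class Expensive:
    calls = 0
    def __str__(self):
        Expensive.calls += 1  # track how often formatting actually happens
        return "expensive"

obj = Expensive()
logger.info(f"status: {obj}")   # f-string: str(obj) runs even though INFO is disabled
after_fstring = Expensive.calls
logger.info("status: %s", obj)  # %-style: the record is dropped before formatting
after_lazy = Expensive.calls
```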

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@python/packages/jumpstarter/jumpstarter/exporter/hooks.py`:
- Around line 372-378: The current sync call process.wait() after process.kill()
blocks the async event loop; change it to the async form (await process.wait())
so it matches the earlier async wait usage and doesn't block. Locate the block
around process.kill() / process.wait() in hooks.py (the code using the process
variable) and replace the synchronous wait with an awaited wait, keeping the
surrounding try/except behavior (catching exceptions from the awaited wait) so
the kill-and-wait path remains non-blocking.
🧹 Nitpick comments (1)
python/packages/jumpstarter/jumpstarter/exporter/hooks.py (1)

448-454: request_lease_release parameter is accepted but never used.

The request_lease_release parameter on run_before_lease_hook is never called within the method body. If this is intentionally kept for signature consistency with run_after_lease_hook, consider documenting that intent. Otherwise, it's dead code.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@python/packages/jumpstarter/jumpstarter/exporter/hooks.py`:
- Around line 390-399: The finally block unconditionally closes master_fd and
slave_fd which can double-close an fd if slave_fd was already closed earlier;
instead consult the fds_open tracking dict (used earlier in this module) to only
close descriptors still marked open—check fds_open.get(master_fd) and
fds_open.get(slave_fd) (or remove keys after close) before calling os.close, and
update fds_open accordingly to avoid closing a reused unrelated fd; adjust the
cleanup in the finally block around the existing master_fd/slave_fd close calls
to use fds_open checks and safe removal.
- Around line 118-120: The call passes lease_scope.session (typed Session |
None) into _execute_hook_process which requires a non-optional logging_session,
so add an explicit guard in the public entry points (execute_before_lease_hook
and execute_after_lease_hook) to validate lease_scope.session is not None (or
call lease_scope.is_ready() where appropriate) and raise a clear ValueError (or
return early) if missing; alternatively, coalesce to an explicit default Session
only after ensuring availability and then pass that non-None session into
_execute_hook_process, keeping references to _execute_hook_process,
lease_scope.session, execute_before_lease_hook, execute_after_lease_hook,
run_before_lease_hook and is_ready() to locate the change.
🧹 Nitpick comments (5)
python/packages/jumpstarter/jumpstarter/client/status_monitor.py (3)

127-204: Significant duplication between wait_for_status and wait_for_any_of.

The connection-recovery preamble (lines 141–162 vs 220–241), the inner wait loop with double-check pattern, and the timeout wrapper are nearly identical between these two methods. Consider extracting:

  1. A shared _verify_connection helper for the recovery block.
  2. A generic wait loop parameterized by a match predicate (lambda status: status == target vs lambda status: status in targets).

This would halve the surface area for future bugs in the wait/recovery logic.

Also applies to: 206-288
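The shape of that refactor, sketched synchronously with a plain iterable standing in for the gRPC status stream (all names here are illustrative, not the real `status_monitor` API):

```python
from typing import Callable, Iterable, Optional

def wait_until(statuses: Iterable[str], match: Callable[[str], bool]) -> Optional[str]:
    # Generic loop: consume status updates until the predicate matches.
    for status in statuses:
        if match(status):
            return status
    return None  # stream ended (in the real code: timeout or disconnect)

def wait_for_status(statuses: Iterable[str], target: str) -> Optional[str]:
    return wait_until(statuses, lambda s: s == target)

def wait_for_any_of(statuses: Iterable[str], targets: Iterable[str]) -> Optional[str]:
    wanted = frozenset(targets)
    return wait_until(statuses, lambda s: s in wanted)

hit = wait_for_any_of(["OFFLINE", "BEFORE_LEASE_HOOK", "AVAILABLE"], {"AVAILABLE", "LEASED"})
```

Both public methods become thin wrappers, so the recovery preamble and timeout wrapper only need to exist once.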


321-321: Use lazy %-formatting in logger calls instead of f-strings.

Lines 321, 332, 349, 392, 394 use f-strings inside logger.warning/logger.info/logger.debug/logger.error. This eagerly interpolates the string even when the log level is disabled.

Example fix for a couple of lines
-                    logger.warning(f"Missed {missed} status transition(s)")
+                    logger.warning("Missed %d status transition(s)", missed)
...
-                    logger.info(f"Status changed: {old_status} -> {new_status} (version={new_version})")
+                    logger.info("Status changed: %s -> %s (version=%d)", old_status, new_status, new_version)
...
-                            logger.error(f"Status change callback error: {e}")
+                            logger.error("Status change callback error: %s", e)
...
-                logger.debug(f"GetStatus poll error: {e.code()}")
+                logger.debug("GetStatus poll error: %s", e.code())
...
-                logger.debug(f"GetStatus poll error: {e}")
+                logger.debug("GetStatus poll error: %s", e)

Also applies to: 332-332, 349-349, 392-394


393-394: Bare except Exception silently swallows unexpected errors at DEBUG level.

Programming errors (e.g., TypeError, AttributeError) in the poll loop will be logged at debug and silently retried forever. Consider logging at warning with exc_info=True so unexpected failures are visible in normal operation.

             except Exception as e:
-                logger.debug(f"GetStatus poll error: {e}")
+                logger.warning("Unexpected GetStatus poll error: %s", e, exc_info=True)
python/packages/jumpstarter/jumpstarter/exporter/exporter.py (2)

338-340: Use lazy %s formatting in logger calls instead of f-strings.

f-strings are evaluated eagerly even when the log level is disabled. Minor performance concern on hot paths like status reporting.

-            logger.info(f"Updated status to {status}: {message}")
+            logger.info("Updated status to %s: %s", status, message)
         except Exception as e:
-            logger.error(f"Failed to update status: {e}")
+            logger.error("Failed to update status: %s", e)

640-653: Client connection tasks outlive the session scope.

Tasks spawned at line 646 via tg.start_soon (outer task group) will continue running after handle_lease exits and session_for_lease closes the Unix socket. These tasks will encounter connection errors and exit via the warning at line 437, which is graceful but not clean.

Consider spawning them on conn_tg instead of tg so they're cancelled with the lease, or document this as intentional behavior.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@e2e/tests-hooks.bats`:
- Around line 367-411: The tests "hooks F2: .py file auto-detects Python and
uses driver API" and "hooks F3: script as file path executes the file" create
temp files (/tmp/jumpstarter-e2e-hook-python.py and
/tmp/jumpstarter-e2e-hook-script.sh) that can leak when assertions fail because
rm -f at the end won't run; fix by writing those files into $BATS_TEST_TMPDIR
instead of /tmp and updating the corresponding exporter overlays
(exporter-hooks-exec-python.yaml and exporter-hooks-exec-script-file.yaml) to
reference the test-temp paths, or add a global teardown() that removes both
/tmp/jumpstarter-e2e-hook-python.py and /tmp/jumpstarter-e2e-hook-script.sh to
guarantee cleanup even on failures.
🧹 Nitpick comments (7)
e2e/tests-hooks.bats (2)

9-10: Unused variable CURRENT_HOOKS_CONFIG.

This variable is declared but never read or written elsewhere in the file. Remove it to avoid confusion.

-# Track which config is currently active
-CURRENT_HOOKS_CONFIG=""

68-106: Extract shared config-merge logic to reduce duplication.

Lines 74-78 and 96-100 are identical. A small helper would keep both callers DRY and reduce the chance of them drifting apart.

Proposed refactor
+# Helper: Prepare exporter config by clearing hooks and merging overlay
+prepare_hooks_config() {
+  local config_file="$1"
+  go run github.com/mikefarah/yq/v4@latest -i 'del(.hooks)' \
+    /etc/jumpstarter/exporters/test-exporter-hooks.yaml
+  go run github.com/mikefarah/yq/v4@latest -i ". * load(\"e2e/exporters/${config_file}\")" \
+    /etc/jumpstarter/exporters/test-exporter-hooks.yaml
+}
+
 start_hooks_exporter() {
   local config_file="$1"
   stop_hooks_exporter
-  go run github.com/mikefarah/yq/v4@latest -i 'del(.hooks)' \
-    /etc/jumpstarter/exporters/test-exporter-hooks.yaml
-  go run github.com/mikefarah/yq/v4@latest -i ". * load(\"e2e/exporters/${config_file}\")" \
-    /etc/jumpstarter/exporters/test-exporter-hooks.yaml
+  prepare_hooks_config "$config_file"
   ...
 }
 
 start_hooks_exporter_single() {
   local config_file="$1"
   stop_hooks_exporter
-  go run github.com/mikefarah/yq/v4@latest -i 'del(.hooks)' \
-    /etc/jumpstarter/exporters/test-exporter-hooks.yaml
-  go run github.com/mikefarah/yq/v4@latest -i ". * load(\"e2e/exporters/${config_file}\")" \
-    /etc/jumpstarter/exporters/test-exporter-hooks.yaml
+  prepare_hooks_config "$config_file"
   ...
 }
python/packages/jumpstarter/jumpstarter/config/exporter.py (2)

176-177: Line likely exceeds 120-char limit (Ruff E501).

Line 177 is a single long expression. Consider breaking it for readability and linting compliance.

Suggested fix
    @classmethod
     def dump_yaml(self, config: Self) -> str:
-        return yaml.safe_dump(config.model_dump(mode="json", by_alias=True, exclude={"alias", "path"}), sort_keys=False)
+        return yaml.safe_dump(
+            config.model_dump(mode="json", by_alias=True, exclude={"alias", "path"}),
+            sort_keys=False,
+        )

As per coding guidelines, python/**/*.py: Use Ruff for code formatting and linting.


187-188: Same line-length issue in save.

Suggested fix
         with config.path.open(mode="w") as f:
-            yaml.safe_dump(config.model_dump(mode="json", by_alias=True, exclude={"alias", "path"}), f, sort_keys=False)
+            yaml.safe_dump(
+                config.model_dump(mode="json", by_alias=True, exclude={"alias", "path"}),
+                f,
+                sort_keys=False,
+            )

As per coding guidelines, python/**/*.py: Use Ruff for code formatting and linting.

python/packages/jumpstarter/jumpstarter/exporter/hooks_test.py (1)

20-37: Fixture lease_scope doesn't set hook_socket_path, which triggers a warning log in every test.

Every test using this fixture will hit the logger.warning("No dedicated hook socket available...") path in _create_hook_env. This is fine for testing the fallback, but if you want cleaner test output or want to also cover the dedicated hook socket path, consider adding a test or parameterizing the fixture.

python/packages/jumpstarter/jumpstarter/exporter/hooks.py (2)

503-517: Polling loop uses floating-point accumulation which can drift.

The elapsed += interval pattern accumulates floating-point rounding errors over 300 iterations (30s / 0.1s). While unlikely to cause issues at this scale, using time.monotonic() would be more robust and consistent with the pattern used in read_pty_output.

Suggested improvement
-            timeout = 30  # seconds
-            interval = 0.1  # seconds
-            elapsed = 0.0
+            import time
+            deadline = time.monotonic() + 30  # seconds
+            interval = 0.1  # seconds
             while not lease_scope.is_ready():
-                if elapsed >= timeout:
+                if time.monotonic() >= deadline:
                     error_msg = "Timeout waiting for lease scope to be ready"
                     logger.error(error_msg)
                     await report_status(ExporterStatus.BEFORE_LEASE_HOOK_FAILED, error_msg)
                     lease_scope.before_lease_hook.set()
                     return
                 await anyio.sleep(interval)
-                elapsed += interval

166-236: Inline imports of pty, subprocess, fcntl, sys, time inside the method body.

These standard-library modules are imported inside the method. While pty and fcntl are Linux-only (justifying lazy import for portability), subprocess, sys, and time are universal. Consider moving the universal ones to the top of the file for clarity.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@e2e/tests-hooks.bats`:
- Around line 280-296: In the test "@test \"hooks C2: afterLease onFailure=exit
shuts down exporter\"" replace the strict assert_output --partial "afterLease
hook failed" with a regexp alternation to handle the race (e.g. assert_output
--regexp "(afterLease hook failed|Connection to exporter lost)"); update the
assertion in that test block so it accepts either the hook-failed message or the
connection-drop message to avoid flaky failures when the exporter exits before
the status reaches the client.

In `@python/packages/jumpstarter/jumpstarter/exporter/hooks.py`:
- Around line 636-652: The code reports ExporterStatus.AFTER_LEASE_HOOK_FAILED
but comments say the exporter remains available; fix the inconsistency by either
(A) updating the comments/logs around the afterLease error handling (the except
blocks that call report_status with ExporterStatus.AFTER_LEASE_HOOK_FAILED) to
state the exporter is now offline/unavailable (matching enums.py, exporter.py
terminal list and client behavior), or (B) if the exporter should remain
available, change the reported status to a non-terminal/recoverable status
(e.g., add a new ExporterStatus value like AFTER_LEASE_HOOK_FAILED_RECOVERABLE
in enums.py, update exporter.py terminal-status checks and shell.py client
handling, and use that new status in the report_status calls); locate and update
the two places calling report_status(...) with
ExporterStatus.AFTER_LEASE_HOOK_FAILED in hooks.py and make corresponding
enum/terminal-list/client changes if choosing option B.
🧹 Nitpick comments (7)
e2e/tests-hooks.bats (4)

9-10: Unused variable CURRENT_HOOKS_CONFIG.

This variable is declared but never read or written anywhere in the file. Remove it to avoid confusion.

-# Track which config is currently active
-CURRENT_HOOKS_CONFIG=""

12-33: Consider adding idempotency guards to setup_file provisioning.

If a previous run was interrupted (e.g., Ctrl-C skipping teardown_file), the jmp admin create commands will fail because the resources already exist. Appending || true (or using an ensure_* pattern that checks existence first) would make re-runs safer.

-  jmp admin create client -n "${JS_NAMESPACE}" test-client-hooks --unsafe --out "${BATS_RUN_TMPDIR}/test-client-hooks.yaml" \
-    --oidc-username dex:test-client-hooks
+  jmp admin create client -n "${JS_NAMESPACE}" test-client-hooks --unsafe --out "${BATS_RUN_TMPDIR}/test-client-hooks.yaml" \
+    --oidc-username dex:test-client-hooks || true
 
-  jmp admin create exporter -n "${JS_NAMESPACE}" test-exporter-hooks --out "${BATS_RUN_TMPDIR}/test-exporter-hooks.yaml" \
-    --oidc-username dex:test-exporter-hooks \
-    --label example.com/board=hooks
+  jmp admin create exporter -n "${JS_NAMESPACE}" test-exporter-hooks --out "${BATS_RUN_TMPDIR}/test-exporter-hooks.yaml" \
+    --oidc-username dex:test-exporter-hooks \
+    --label example.com/board=hooks || true

Based on learnings: E2E tests in bats should be runnable in isolation. Each test (or setup/teardown blocks) should explicitly prepare its prerequisites, reset global state, and ensure deterministic results.


92-131: Extract shared config-reset logic into a helper to reduce duplication.

start_hooks_exporter and start_hooks_exporter_single duplicate the same 4-line yq clear-and-merge block. A small helper would keep changes in one place.

Suggested refactor
+# Helper: Reset exporter config and merge overlay
+apply_hooks_config() {
+  local config_file="$1"
+  go run github.com/mikefarah/yq/v4@latest -i 'del(.hooks)' \
+    /etc/jumpstarter/exporters/test-exporter-hooks.yaml
+  go run github.com/mikefarah/yq/v4@latest -i ". * load(\"e2e/exporters/${config_file}\")" \
+    /etc/jumpstarter/exporters/test-exporter-hooks.yaml
+}
+
 start_hooks_exporter() {
   local config_file="$1"
   stop_hooks_exporter
-  go run github.com/mikefarah/yq/v4@latest -i 'del(.hooks)' \
-    /etc/jumpstarter/exporters/test-exporter-hooks.yaml
-  go run github.com/mikefarah/yq/v4@latest -i ". * load(\"e2e/exporters/${config_file}\")" \
-    /etc/jumpstarter/exporters/test-exporter-hooks.yaml
+  apply_hooks_config "$config_file"
   ...
 }
 
 start_hooks_exporter_single() {
   local config_file="$1"
   stop_hooks_exporter
-  go run github.com/mikefarah/yq/v4@latest -i 'del(.hooks)' \
-    /etc/jumpstarter/exporters/test-exporter-hooks.yaml
-  go run github.com/mikefarah/yq/v4@latest -i ". * load(\"e2e/exporters/${config_file}\")" \
-    /etc/jumpstarter/exporters/test-exporter-hooks.yaml
+  apply_hooks_config "$config_file"
   ...
 }

339-350: E3: client= regex doesn't verify a non-empty value.

[[ "$output" =~ client= ]] would pass even if CLIENT_NAME is empty. Consider tightening:

-  [[ "$output" =~ client= ]]
+  [[ "$output" =~ client=[^[:space:]]+ ]]
python/packages/jumpstarter/jumpstarter/exporter/hooks.py (3)

22-46: Exception.__init__ is never called, so e.args is empty.

@dataclass generates __init__ but doesn't call super().__init__(message). This means e.args will be (), which can confuse logging frameworks and traceback formatters that rely on args[0] for the error message.

Suggested fix
 @dataclass
 class HookExecutionError(Exception):
     ...
     message: str
     on_failure: Literal["endLease", "exit"]
     hook_type: Literal["before_lease", "after_lease"]
 
+    def __post_init__(self):
+        super().__init__(self.message)
+
     def __str__(self) -> str:
         return self.message
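A quick check of the behavior with the fix applied (fields simplified to just `message`; `HookError` is a toy stand-in for `HookExecutionError`):

```python
from dataclasses import dataclass

@dataclass
class HookError(Exception):
    message: str

    def __post_init__(self):
        # Forward the message so Exception.args is populated, as logging
        # frameworks and traceback formatters expect.
        super().__init__(self.message)

    def __str__(self) -> str:
        return self.message

err = HookError("hook failed")
```

Without the `__post_init__`, `err.args` would be `()` because the dataclass-generated `__init__` never calls `Exception.__init__`.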

335-364: wait_for_process finally block's async cleanup is unreliable under cancellation.

When move_on_after(timeout) fires, the task group is cancelled, and wait_for_process's finally block runs. However, await anyio.sleep(0.1) (line 356) and await anyio.to_thread.run_sync(...) (line 362) will raise Cancelled immediately in anyio unless shielded. The graceful terminate→poll→kill→reap sequence won't execute.

This isn't a bug because lines 396–411 duplicate the cleanup for the timeout path, but it means the finally block in wait_for_process only works for non-timeout cancellations (e.g., parent task cancellation). Consider wrapping the finally body in CancelScope(shield=True) so the cleanup is reliable in all cancellation scenarios, or add a comment documenting that timeout cleanup is handled externally.


509-520: Busy-wait polling loop could be replaced with an anyio.Event.

The 0.1s polling loop to wait for lease_scope.is_ready() works but is a busy-wait pattern. If LeaseContext could expose an anyio.Event that's set when it becomes ready, this would be cleaner and more efficient. Low priority since 0.1s interval is acceptable.
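A sketch of the event-based alternative, using `asyncio.Event` as a stdlib stand-in for `anyio.Event` (`LeaseScope` here is a toy, not the real `LeaseContext`):

```python
import asyncio

class LeaseScope:
    def __init__(self) -> None:
        self._ready = asyncio.Event()

    def mark_ready(self) -> None:
        # Called once the session is established; wakes all waiters at once.
        self._ready.set()

    async def wait_ready(self, timeout: float) -> bool:
        # Blocks on the event instead of polling every 0.1s.
        try:
            await asyncio.wait_for(self._ready.wait(), timeout)
            return True
        except asyncio.TimeoutError:
            return False

async def main() -> bool:
    scope = LeaseScope()
    asyncio.get_running_loop().call_later(0.01, scope.mark_ready)
    return await scope.wait_ready(timeout=1.0)

ok = asyncio.run(main())
```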

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@python/packages/jumpstarter/jumpstarter/exporter/hooks.py`:
- Around line 388-439: The parent_fd is being left unclosed because
fds_open["parent"] is used both to signal read_pty_output to stop and to track
whether parent_fd should be closed; when you set fds_open["parent"]=False to
stop the reader the finally block skips closing parent_fd. Fix by introducing a
separate stop flag (e.g., reader_stop or reader_stop["flag"]) used only by
read_pty_output/read loop and keep fds_open["parent"] solely for tracking
whether parent_fd is open; update read_pty_output to check reader_stop instead
of fds_open, set reader_stop["flag"]=True where currently fds_open["parent"] is
set to False, and ensure the finally block always checks fds_open["parent"] to
close parent_fd (and child_fd) if still open.
🧹 Nitpick comments (3)
e2e/tests-hooks.bats (3)

9-10: Unused variable CURRENT_HOOKS_CONFIG.

This variable is declared but never read or written anywhere else in the file. Remove it to avoid confusion.

Proposed fix
-# Track which config is currently active
-CURRENT_HOOKS_CONFIG=""

92-131: Duplicated config-preparation logic between start_hooks_exporter and start_hooks_exporter_single.

The del(.hooks) + merge sequence (lines 98–102 and 121–125) is identical. Extract it into a small helper to keep changes in one place.

Proposed refactor
+# Helper: Prepare exporter config by clearing hooks and merging overlay
+prepare_hooks_config() {
+  local config_file="$1"
+  go run github.com/mikefarah/yq/v4@latest -i 'del(.hooks)' \
+    /etc/jumpstarter/exporters/test-exporter-hooks.yaml
+  go run github.com/mikefarah/yq/v4@latest -i ". * load(\"e2e/exporters/${config_file}\")" \
+    /etc/jumpstarter/exporters/test-exporter-hooks.yaml
+}
+
 start_hooks_exporter() {
   local config_file="$1"
   stop_hooks_exporter
-  go run github.com/mikefarah/yq/v4@latest -i 'del(.hooks)' \
-    /etc/jumpstarter/exporters/test-exporter-hooks.yaml
-  go run github.com/mikefarah/yq/v4@latest -i ". * load(\"e2e/exporters/${config_file}\")" \
-    /etc/jumpstarter/exporters/test-exporter-hooks.yaml
+  prepare_hooks_config "$config_file"
   ...
 }

 start_hooks_exporter_single() {
   local config_file="$1"
   stop_hooks_exporter
-  go run github.com/mikefarah/yq/v4@latest -i 'del(.hooks)' \
-    /etc/jumpstarter/exporters/test-exporter-hooks.yaml
-  go run github.com/mikefarah/yq/v4@latest -i ". * load(\"e2e/exporters/${config_file}\")" \
-    /etc/jumpstarter/exporters/test-exporter-hooks.yaml
+  prepare_hooks_config "$config_file"
   ...
 }

12-33: setup_file create commands will fail if resources already exist from a prior aborted run.

If a previous test run didn't clean up (e.g., CI killed mid-run), the jmp admin create commands will fail since the resources already exist, blocking the entire suite. Consider adding || true or a pre-cleanup step.

Proposed fix — delete before create for idempotency
 setup_file() {
   # Initialize the PIDs file at the start of all tests
   echo "" > "$HOOKS_EXPORTER_PIDS_FILE"

+  # Clean up any leftover resources from prior runs
+  jmp admin delete client --namespace "${JS_NAMESPACE}" test-client-hooks --delete 2>/dev/null || true
+  jmp admin delete exporter --namespace "${JS_NAMESPACE}" test-exporter-hooks --delete 2>/dev/null || true
+
   # Create client and exporter for hooks tests
   jmp admin create client -n "${JS_NAMESPACE}" test-client-hooks --unsafe --out "${BATS_RUN_TMPDIR}/test-client-hooks.yaml" \
     --oidc-username dex:test-client-hooks

Based on learnings: E2E tests in bats should be runnable in isolation. Each test (or setup/teardown blocks) should explicitly prepare its prerequisites, reset global state, and ensure deterministic results.

@kirkbrauer kirkbrauer self-assigned this Feb 9, 2026
@kirkbrauer kirkbrauer added enhancement New feature or request go Pull requests that update go code labels Feb 9, 2026
…imeout, "Running afterLease hook" causing a freeze.
…s waiting for a lease on the same exporter and the first lease ends
…the client errors and kills the lease prematurely
…lient disconnects prematurely. Set timeout to 300s
@github-actions

Container Images

The following container images have been built for this PR:

Image URI
jumpstarter-controller quay.io/jumpstarter-dev/jumpstarter-controller:pr-140
jumpstarter-operator quay.io/jumpstarter-dev/jumpstarter-operator:pr-140
jumpstarter-operator-bundle quay.io/jumpstarter-dev/jumpstarter-operator-bundle:pr-140
jumpstarter quay.io/jumpstarter-dev/jumpstarter:pr-140
jumpstarter-utils quay.io/jumpstarter-dev/jumpstarter-utils:pr-140
jumpstarter-dev quay.io/jumpstarter-dev/jumpstarter-dev:pr-140
jumpstarter-devspace quay.io/jumpstarter-dev/jumpstarter-devspace:pr-140

Images expire after 7 days.

@kirkbrauer kirkbrauer requested a review from mangelajo February 25, 2026 05:01
@kirkbrauer
Member Author

@evakhoni I have taken the test cases you presented above and added them to the e2e tests. Hopefully all of those issues are resolved now that the e2e tests are passing.


Labels

build-pr-images enhancement New feature or request go Pull requests that update go code protocol python Pull requests that update python code

3 participants