📝 Walkthrough

Adds exporter lifecycle hooks (beforeLease/afterLease), exporter status fields/enums and status_message, a new ReportStatus release_lease flag and EndSession RPC, StatusMonitor and client status APIs, HookExecutor and LeaseContext, CRD/schema and printer-column updates, protobuf/grpc regeneration, CLI/status rendering, extensive tests and e2e hook configs, and small Helm/manifest build tweaks.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    rect rgba(200,230,255,0.5)
        participant Client
        participant Exporter
        participant HookExec as HookExecutor
        participant Controller
    end
    Client->>Exporter: Request lease / Dial
    Exporter->>Controller: ReportStatus(status=BEFORE_LEASE_HOOK)
    Exporter->>HookExec: execute_before_lease_hook(LeaseContext)
    HookExec-->>Exporter: result (success / failure)
    alt before-hook success
        Exporter->>Controller: ReportStatus(status=LEASE_READY)
        Client->>Exporter: Driver calls allowed
    else onFailure=endLease
        Exporter->>Controller: ReportStatus(..., release_lease=true)
        Controller->>Controller: mark lease.release = true
    end
    Client->>Exporter: End session
    Exporter->>HookExec: execute_after_lease_hook(...)
    HookExec-->>Exporter: after-hook done
    Exporter->>Controller: ReportStatus(status=OFFLINE)
```
```mermaid
sequenceDiagram
    rect rgba(230,255,200,0.5)
        participant AsyncClient
        participant StatusMon as StatusMonitor
        participant ExporterSvc as ExporterService
    end
    AsyncClient->>StatusMon: start()
    loop polling
        StatusMon->>ExporterSvc: GetStatus()
        ExporterSvc-->>StatusMon: status (+version, previous_status)
        StatusMon-->>AsyncClient: notify on change / wake waiters
    end
    AsyncClient->>StatusMon: wait_for_status(LEASE_READY)
    StatusMon-->>AsyncClient: signal when LEASE_READY observed
```
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes
🚥 Pre-merge checks: ❌ 1 failed (1 warning) | ✅ 2 passed
Actionable comments posted: 12
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
python/packages/jumpstarter/jumpstarter/config/client.py (1)

184-196: Add `status` field to exporter reconstruction to preserve it when `include_leases=True`.

The exporter objects lose their `status` field when leases are included. The original exporters have a `status` field that should be preserved during reconstruction, especially when `include_status=True` is requested. Update lines 187-193 to include `status=exporter.status`.

Code snippet

```python
exporter_with_lease = Exporter(
    namespace=exporter.namespace,
    name=exporter.name,
    labels=exporter.labels,
    online=exporter.online,
    status=exporter.status,  # Add this line
    lease=lease,
)
```

python/packages/jumpstarter/jumpstarter/config/exporter.py (1)
224-246: Remove unused `device_factory` parameter from `HookExecutor`.

The `device_factory` parameter passed to `HookExecutor` at line 231 is never used. `HookExecutor` only executes shell commands via subprocess and accesses the session for logging; it has no need for driver instantiation. Remove the unused parameter from both the `HookExecutor` dataclass definition and the instantiation call.
🤖 Fix all issues with AI agents
In
`@controller/deploy/helm/jumpstarter/charts/jumpstarter-controller/templates/crds/jumpstarter.dev_clients.yaml`:
- Around line 20-47: Update the CRD description strings so they reference
"Client" consistently instead of "identities" or "Identity": change the top
description "Client is the Schema for the identities API" to a concise "Client
is the Schema for the Client API" (or simply "Client"), and update the
spec.description "ClientSpec defines the desired state of Identity" and
status.description "ClientStatus defines the observed state of Identity" to use
"Client" (e.g., "ClientSpec defines the desired state of Client" and
"ClientStatus defines the observed state of Client") so all descriptive fields
in the CRD (the description at the top and the
spec.description/status.description entries) no longer reference
Identity/identities.
In
`@controller/deploy/helm/jumpstarter/charts/jumpstarter-controller/templates/crds/jumpstarter.dev_leases.yaml`:
- Around line 30-31: Update the CRD description for the Lease resource: replace
the current "Lease is the Schema for the exporters API" text in the description
property of the Lease CRD with the corrected wording using "leases" (e.g.,
"Leases is the Schema for the exporters API" or "leases is the Schema for the
exporters API") so the description uses the plural form; locate the description
property under the Lease CRD (kind: Lease) in the CRD template and change that
string accordingly.
In `@python/packages/jumpstarter-cli-common/jumpstarter_cli_common/opt.py`:
- Line 19: basicConfig assignment line exceeds max line length; split the call
across multiple lines to satisfy Ruff E501. Locate the basicConfig variable
initialization (basicConfig = partial(logging.basicConfig, format="%(message)s
[%(name)s]", handlers=[RichHandler(show_path=False)]) ) and reformat the
arguments onto separate lines (e.g., place format=... and handlers=[...] each on
their own lines and close parens on its own line) so the call to
partial(logging.basicConfig, ...) fits within line-length limits while
preserving the same arguments and use of RichHandler(show_path=False).
In `@python/packages/jumpstarter-cli/jumpstarter_cli/shell.py`:
- Around line 80-89: Remove the unused synchronous helper _run_shell_with_lease:
delete the entire function definition (including its docstring and with-blocks)
since the async variant _run_shell_with_lease_async is now the only one used;
then search the repo for any remaining references to _run_shell_with_lease and
remove or update them, run the test suite/lint to confirm no import or name
errors, and ensure behavior relies solely on _run_shell_with_lease_async.
In `@python/packages/jumpstarter/jumpstarter/client/status_monitor.py`:
- Around line 395-401: The async context manager on StatusMonitor is incomplete:
__aenter__ currently just returns self and never starts the background polling;
update __aenter__ to initialize the task group and call self.start() (mirroring
client.status_monitor_async() behavior) so entering the context begins
monitoring, and keep __aexit__ awaiting self.stop(); alternatively remove both
methods if context-manager usage is unsupported—modify the methods on the
StatusMonitor class (__aenter__, __aexit__) to either perform full start/stop
lifecycle (create task group + await self.start() in __aenter__, await
self.stop() in __aexit__) or delete them to avoid silent failures.
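A minimal sketch of the suggested lifecycle, using plain asyncio in place of the project's anyio task groups; `Monitor` and its polling body are illustrative stand-ins, not the real `StatusMonitor`:

```python
import asyncio


class Monitor:
    """Illustrative stand-in for StatusMonitor."""

    def __init__(self, poll: float = 0.01) -> None:
        self._poll = poll
        self._task = None
        self.polls = 0

    async def _run(self) -> None:
        while True:
            self.polls += 1  # stand-in for a GetStatus() poll
            await asyncio.sleep(self._poll)

    async def start(self) -> None:
        self._task = asyncio.ensure_future(self._run())

    async def stop(self) -> None:
        if self._task is not None:
            self._task.cancel()
            try:
                await self._task
            except asyncio.CancelledError:
                pass

    async def __aenter__(self) -> "Monitor":
        await self.start()  # the missing piece: begin monitoring on enter
        return self

    async def __aexit__(self, *exc) -> None:
        await self.stop()


async def demo() -> int:
    async with Monitor() as monitor:
        await asyncio.sleep(0.05)
    return monitor.polls
```

With `start()` wired into `__aenter__`, `async with Monitor()` actually runs the background poll instead of silently doing nothing.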
In `@python/packages/jumpstarter/jumpstarter/common/enums.py`:
- Around line 26-27: The docstring for the enum value AFTER_LEASE_HOOK (mapped
to common_pb2.EXPORTER_STATUS_AFTER_LEASE_HOOK) contains a typo; update the
string from "Lease was releaseed, but exporter is executing after lease hook" to
"Lease was released, but exporter is executing after lease hook" so the
documentation reads correctly.
In `@python/packages/jumpstarter/jumpstarter/exporter/exporter.py`:
- Around line 541-613: The Listen retry task started with
tg.start_soon(self._retry_stream, "Listen",
self._listen_stream_factory(lease_name), listen_tx) is attached to the outer
task group so it keeps running after the lease ends; move that start call into
the lease-scoped task group (use conn_tg.start_soon(...) before starting
wait_for_lease_end/process_connections) so the Listen task is cancelled with the
lease, and in the enclosing finally block ensure the send stream is closed
(await listen_tx.aclose() or the appropriate close method) to signal listen_rx
termination; reference symbols: _retry_stream,
_listen_stream_factory(lease_name), listen_tx/listen_rx, tg -> conn_tg, and
conn_tg.start_soon.
- Around line 202-321: Change _get_controller_stub to be an async context
manager (e.g., async def _controller_stub(self):) that creates the channel via
self.channel_factory(), yields a
jumpstarter_pb2_grpc.ControllerServiceStub(channel) and always closes the
underlying channel with await channel.close() in a finally block; then update
all callers to use "async with self._controller_stub() as controller:" instead
of awaiting _get_controller_stub(), specifically in _retry_stream (where
controller is acquired for stream_factory), _register_with_controller (Register
call), and _report_status (ReportStatus call) so channels are closed after use
and no resources leak.
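The suggested shape can be sketched as below; the fake channel and stub classes are stand-ins for the grpc.aio channel and `jumpstarter_pb2_grpc.ControllerServiceStub`, which keeps the sketch dependency-free:

```python
import asyncio
from contextlib import asynccontextmanager


class FakeChannel:
    def __init__(self) -> None:
        self.closed = False

    async def close(self) -> None:
        self.closed = True


class FakeStub:
    def __init__(self, channel: FakeChannel) -> None:
        self.channel = channel


class Exporter:
    def __init__(self) -> None:
        self.last_channel = None

    def channel_factory(self) -> FakeChannel:
        self.last_channel = FakeChannel()
        return self.last_channel

    @asynccontextmanager
    async def _controller_stub(self):
        channel = self.channel_factory()
        try:
            yield FakeStub(channel)
        finally:
            await channel.close()  # channel is always closed, even on error


async def demo() -> bool:
    exp = Exporter()
    async with exp._controller_stub() as stub:
        assert not stub.channel.closed  # open while in use
    return exp.last_channel.closed
```

Each caller then acquires the stub with `async with self._controller_stub() as controller:` and the channel cannot leak.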
In `@python/packages/jumpstarter/jumpstarter/exporter/hooks.py`:
- Around line 299-307: wait_for_process currently uses
anyio.to_thread.run_sync(process.wait, abandon_on_cancel=True) which can leave
the child as a zombie if the wait is abandoned; modify wait_for_process to
ensure the subprocess is reaped on cancellation by adding a try/finally: in the
try await the run_sync as before, but in the finally check process.poll() and if
it is None try to terminate/kill (process.terminate()/process.kill()) then call
a non-abandoning run_sync(process.wait) or loop-poll with process.poll() until
exit to fully reap the child; reference the wait_for_process coroutine and the
process variable when making the changes.
- Around line 257-260: The debug call inside read_pty_output exceeds the
120-char line limit; shorten the line by splitting the log message across
multiple strings or moving the format string to a short variable and then
calling logger.debug. For example, break the message into two adjacent string
literals or assign msg = "read_pty_output: heartbeat at %.1fs, iterations=%d"
and call logger.debug(msg, elapsed, read_count) so you keep the same
logger.debug call and the same format arguments (logger, elapsed, read_count)
but comply with the E501 line-length rule.
- Around line 186-189: The test mock for logging_session must support the
context manager used by logging_session.context_log_source in hooks.py; update
the test fixture (e.g., mock_session) so mock_session.context_log_source returns
an object that implements __enter__ and __exit__ (for async usage you can set
mock_session.context_log_source.return_value.__enter__ and __exit__ to AsyncMock
or wrap with contextlib.nullcontext), ensuring the context manager protocol is
honored when hooks.py calls logging_session.context_log_source.
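Both options from the instruction can be sketched as follows; `mock_session` mirrors the test fixture described above and is an assumption about its shape:

```python
from contextlib import nullcontext
from unittest.mock import MagicMock

mock_session = MagicMock()

# Option A: have context_log_source return a real no-op context manager.
mock_session.context_log_source.return_value = nullcontext()
with mock_session.context_log_source("beforeLease"):
    pass  # code under test would run here

# Option B: configure __enter__/__exit__ on the returned MagicMock directly.
cm = MagicMock()
cm.__enter__.return_value = None
cm.__exit__.return_value = False
mock_session.context_log_source.return_value = cm
with mock_session.context_log_source("afterLease"):
    pass
assert cm.__exit__.called  # the context-manager protocol was honored
```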
🧹 Nitpick comments (10)
controller/deploy/helm/jumpstarter/charts/jumpstarter-controller/templates/crds/jumpstarter.dev_exporteraccesspolicies.yaml (2)
149-152: Add schema validation for duration and priority.

Without constraints, malformed durations or negative priorities can be accepted and only fail at runtime. Consider adding format/bounds in the CRD.

♻️ Suggested schema tightening

```diff
   maximumDuration:
     type: string
+    format: duration
   priority:
     type: integer
+    minimum: 0
```
40-157: Require spec/policies to avoid empty CRs.

Right now an empty resource (missing spec or policies) is valid. If that’s not intended, enforce it in the schema to reduce no-op objects.

♻️ Suggested required fields (unchanged schema context abbreviated)

```diff
             spotAccess:
               type: boolean
           type: object
         type: array
+        minItems: 1
       type: object
+      required:
+      - policies
     status:
       description: ExporterAccessPolicyStatus defines the observed state of ExporterAccessPolicy.
       type: object
   type: object
+  required:
+  - spec
```

Also applies to: 162-162
python/packages/jumpstarter-cli/jumpstarter_cli/shell.py (1)

157-161: Redundant timeout configuration.

Both `move_on_after(300)` and `wait_for_any_of(..., timeout=300.0)` specify 300 seconds. Since `wait_for_any_of` has its own timeout, the outer `move_on_after` is redundant. Consider either removing the outer timeout or differentiating them if different semantics are intended (e.g., outer as a hard cap slightly longer than inner).

♻️ Suggested simplification

```diff
-    with anyio.move_on_after(300) as timeout_scope:  # 5 minute timeout
-        result = await monitor.wait_for_any_of(
-            [ExporterStatus.AVAILABLE, ExporterStatus.AFTER_LEASE_HOOK_FAILED],
-            timeout=300.0
-        )
+    result = await monitor.wait_for_any_of(
+        [ExporterStatus.AVAILABLE, ExporterStatus.AFTER_LEASE_HOOK_FAILED],
+        timeout=300.0
+    )
+    if result is None:
+        logger.warning("Timeout waiting for afterLease hook to complete")
```

Note: This would require adjusting the subsequent conditional logic (lines 162-172) to handle the timeout case inline rather than via `timeout_scope.cancelled_caught`.

controller/Makefile (1)
66-68: Make the CRD copy step more robust with mkdir.

If the templates directory is missing in a fresh checkout, `cp` will fail.

♻️ Suggested tweak

```diff
 	cp deploy/helm/jumpstarter/crds/* deploy/operator/config/crd/bases/
+	mkdir -p deploy/helm/jumpstarter/charts/jumpstarter-controller/templates/crds/
 	cp deploy/helm/jumpstarter/crds/* deploy/helm/jumpstarter/charts/jumpstarter-controller/templates/crds/
```

controller/internal/service/controller_service.go (1)
318-338: Consider logging unexpected status values in the default case.

The `protoStatusToString` function silently maps unknown proto status values to `ExporterStatusUnspecified`. While this provides a safe fallback, logging at debug/warning level would help diagnose protocol mismatches.

♻️ Optional: Add logging for unexpected status values

```diff
 func protoStatusToString(status pb.ExporterStatus) string {
 	switch status {
 	case pb.ExporterStatus_EXPORTER_STATUS_OFFLINE:
 		return jumpstarterdevv1alpha1.ExporterStatusOffline
 	// ... other cases ...
 	default:
+		// Log unexpected status values for debugging protocol mismatches
+		// Consider using a package-level logger if available
 		return jumpstarterdevv1alpha1.ExporterStatusUnspecified
 	}
 }
```

python/packages/jumpstarter/jumpstarter/common/enums.py (1)
38-45: Consider handling invalid protobuf values gracefully in `from_proto`.

The current implementation will raise `ValueError` if an unrecognized integer is passed to `from_proto`. For robustness against protocol version mismatches, consider returning `UNSPECIFIED` for unknown values.

♻️ Optional: Handle unknown values gracefully

```diff
 @classmethod
 def from_proto(cls, value: int) -> "ExporterStatus":
     """Convert from protobuf integer to enum."""
-    return cls(value)
+    try:
+        return cls(value)
+    except ValueError:
+        return cls.UNSPECIFIED
```

python/packages/jumpstarter/jumpstarter/exporter/logging.py (1)
31-39: Consider longest-prefix matching for hierarchical loggers.

The current prefix matching returns the first match found during iteration. With hierarchical logger names (e.g., `driver.power`, `driver.power.usb`), if both `driver` and `driver.power` are mapped, the match depends on iteration order. Consider sorting by key length descending to ensure the most specific (longest) prefix wins.

♻️ Suggested improvement for deterministic matching

```diff
 def get_source_for_record(self, record):
     """Determine the appropriate log source for a record."""
     with self._lock:
         # Check if this record comes from a logger with a specific source mapping
         logger_name = record.name
-        for mapped_logger, source in self._child_handlers.items():
+        # Sort by key length descending to match most specific prefix first
+        for mapped_logger, source in sorted(
+            self._child_handlers.items(), key=lambda x: len(x[0]), reverse=True
+        ):
             if logger_name.startswith(mapped_logger):
                 return source
         return self.source
```

python/packages/jumpstarter/jumpstarter/client/core.py (1)
440-510: Consider extracting helper functions to reduce complexity.

The static analysis correctly flags this method as too complex (C901: 14 > 10). The method handles multiple concerns: reconnection with backoff, log source routing, and filtering. While the current structure is readable, extracting helpers would improve maintainability.

♻️ Suggested structure for reduced complexity

```python
def _route_log_to_logger(self, response, show_all_logs: bool) -> None:
    """Route a log response to the appropriate logger."""
    from jumpstarter.common import LogSource

    if response.HasField("source"):
        source = LogSource(response.source)
        is_hook = source in (LogSource.BEFORE_LEASE_HOOK, LogSource.AFTER_LEASE_HOOK)
    else:
        source = LogSource.SYSTEM
        is_hook = False

    if not (is_hook or show_all_logs):
        return

    severity = response.severity if response.severity else "INFO"
    log_level = getattr(logging, severity, logging.INFO)
    logger_names = {
        LogSource.BEFORE_LEASE_HOOK: "exporter:beforeLease",
        LogSource.AFTER_LEASE_HOOK: "exporter:afterLease",
        LogSource.DRIVER: "exporter:driver",
    }
    logger_name = logger_names.get(source, "exporter:system")
    logging.getLogger(logger_name).log(log_level, response.message)
```

Then use this helper in the main loop to reduce nesting.
python/packages/jumpstarter/jumpstarter/exporter/hooks.py (1)

496-498: TODO: Controller-side lease ending not yet implemented.

The comment indicates that when `on_failure='endLease'`, there's no mechanism to actually end the lease from the exporter side. This should be tracked.

Would you like me to open an issue to track the controller-side lease ending mechanism for `beforeLease` hook failures with `on_failure='endLease'`?

python/packages/jumpstarter/jumpstarter/client/status_monitor.py (1)
197-279: Consider extracting common verification logic.

Both `wait_for_status` and `wait_for_any_of` share substantial code for connection recovery verification. This duplication contributes to the complexity flagged by static analysis.

♻️ Suggested refactor to reduce duplication

```python
async def _verify_connection_recovery(self) -> ExporterStatus | None:
    """Verify connection and return current status if recovered, None if still lost."""
    if not self._connection_lost:
        return self._current_status

    logger.debug("Connection was marked as lost, verifying...")
    try:
        response = await self._stub.GetStatus(jumpstarter_pb2.GetStatusRequest())
        logger.info("Connection recovered during verification poll")
        self._connection_lost = False
        new_status = ExporterStatus.from_proto(response.status)
        self._current_status = new_status
        self._status_version = response.status_version
        return new_status
    except AioRpcError as e:
        if e.code() == StatusCode.UNAVAILABLE:
            logger.debug("Connection still lost (UNAVAILABLE)")
        else:
            logger.debug("GetStatus error during verification: %s", e.code())
        return None
    except Exception as e:
        logger.debug("GetStatus error during verification: %s", e)
        return None
```

Then use this helper in both `wait_for_status` and `wait_for_any_of`.
Actionable comments posted: 4
🤖 Fix all issues with AI agents
In
`@python/packages/jumpstarter-driver-power/jumpstarter_driver_power/client_test.py`:
- Around line 8-20: The test_log_stream uses fixed time.sleep(1) which is flaky;
replace these sleeps with a short-polling loop that checks caplog.text until the
expected substring appears or a timeout expires. In test_log_stream (around
client.log_stream(), client.on(), client.off()) implement a small helper or
inline poll using time.monotonic() to loop with a tiny sleep (e.g., 0.01–0.1s)
and fail if timeout reached, then assert "power on" and "power off" only after
the poll succeeds; keep using caplog and the same logger scope so the
client.log_stream(), client.on(), and client.off() calls remain unchanged.
In `@python/packages/jumpstarter/jumpstarter/client/client_test.py`:
- Line 3: Remove the unused import OrderedDict from the imports in
client_test.py; locate the line "from collections import OrderedDict" and delete
it (or remove OrderedDict from the import list) so the lint warning is resolved
and no functionality is changed.
In `@python/packages/jumpstarter/jumpstarter/client/status_monitor_test.py`:
- Line 3: The import list in status_monitor_test.py includes an unused symbol
AsyncMock; remove AsyncMock from the import statement (keep MagicMock) so the
module only imports symbols actually used in the tests, ensuring static analysis
no longer flags the unused AsyncMock.
In `@python/packages/jumpstarter/jumpstarter/common/enums_test.py`:
- Line 3: Remove the unused top-level import "import pytest" from enums_test.py;
locate the import statement at the top of the test module and delete it so the
file no longer contains the unused pytest symbol.
🧹 Nitpick comments (4)
python/packages/jumpstarter/jumpstarter/exporter/lease_context_test.py (1)
186-200: Avoid time-based sleep to reduce test flakiness.

Line 195 uses a fixed sleep, which can be flaky on slow CI. Consider a handshake event so the test deterministically waits for `wait_for_drivers()` to start before setting the hook.

♻️ Proposed deterministic rewrite

```diff
     async def test_wait_for_drivers_blocks_until_set(self) -> None:
         """Test that wait_for_drivers() blocks until hook event is set."""
         import anyio

         before_hook = Event()
         ctx = LeaseContext(lease_name="test-lease", before_lease_hook=before_hook)

-        # Set the event after a short delay
-        async def set_after_delay():
-            await anyio.sleep(0.05)
-            before_hook.set()
+        waiter_started = Event()
+
+        async def wait_and_signal():
+            waiter_started.set()
+            await ctx.wait_for_drivers()

         async with anyio.create_task_group() as tg:
-            tg.start_soon(set_after_delay)
-            await ctx.wait_for_drivers()
+            tg.start_soon(wait_and_signal)
+            await waiter_started.wait()
+            before_hook.set()

         assert ctx.drivers_ready()
```

python/packages/jumpstarter/jumpstarter/client/core_test.py (2)
16-32: Consider extracting shared test utilities.

`MockAioRpcError` and `create_mock_rpc_error` are duplicated from `status_monitor_test.py`. Consider extracting these to a shared test utilities module (e.g., `jumpstarter/client/test_utils.py` or a `conftest.py`).

90-116: Move import to module level.

The `anyio` import inside `wait_for_hook_status` should be at the module level for consistency and clarity.

Proposed fix

Add to imports at top of file:

```python
import anyio
```

Then update the method:

```diff
 async def wait_for_hook_status(self, target_status: ExporterStatus, timeout: float = 60.0) -> bool:
     """Wait for exporter to reach a target status using polling."""
-    import anyio
-
     poll_interval = 0.1  # Fast for testing
```

python/packages/jumpstarter/jumpstarter/client/client_test.py (1)
121-124: Consider adding assertion to fully verify topological order.

The comment states children should be built before parents, but the assertion only verifies `grandparent` is built after `parent`. Adding an assertion that `child` is built before `parent` would fully validate the topological ordering.

Suggested improvement

```diff
     # Children should be built before parents in topological order
-    # The order should be child, parent, grandparent (or parent, child, grandparent)
+    # The order should be child, parent, grandparent
     assert "grandparent" in build_order
     assert build_order.index("grandparent") > build_order.index("parent")
+    assert build_order.index("parent") > build_order.index("child")
```
Actionable comments posted: 6
🤖 Fix all issues with AI agents
In `@controller/internal/service/suite_test.go`:
- Around line 56-58: The test harness sets up testEnv as an envtest.Environment
with CRDDirectoryPaths and ErrorIfCRDPathMissing: true, which will abort the
suite if the CRD path is wrong; verify and update the CRDDirectoryPaths value to
the actual CRD directory (or add multiple fallback paths) so envtest can find
the CRDs, e.g., check and correct the path used in the testEnv initialization
(the envtest.Environment struct, the CRDDirectoryPaths slice, and the
ErrorIfCRDPathMissing flag) or set ErrorIfCRDPathMissing to false while you
confirm the correct location.
- Around line 29-31: The test currently ignores errors when creating a
Namespace; change the Namespace creation logic (the call that creates a
corev1.Namespace object, e.g., client.CoreV1().Namespaces().Create or similar in
the test setup) to explicitly handle errors: import
k8s.io/apimachinery/pkg/api/errors as apierrors, then after the Create call
check if err != nil and if apierrors.IsAlreadyExists(err) continue/ignore,
otherwise fail the test (t.Fatalf or return the error). Apply the same pattern
for the other namespace-creation site referenced in the comment (the second
block around lines 81-87) so only AlreadyExists is tolerated and all other
errors cause the test to fail.
- Around line 19-24: Update the test setup in suite_test.go to avoid hard-coding
BinaryAssetsDirectory: first respect the KUBEBUILDER_ASSETS environment variable
if set (use it to set BinaryAssetsDirectory), then fall back to existing helper
GetFirstFoundEnvTestBinaryDir() or a directory-existence check for the current
hard-coded path (bin/k8s/1.30.0-<os>-<arch>), and only assign the hard-coded
path as a last resort; modify the code that sets BinaryAssetsDirectory to
perform conditional checks (os.Getenv("KUBEBUILDER_ASSETS"), os.Stat on
candidate dirs, and GetFirstFoundEnvTestBinaryDir()) so tests run correctly
outside the Makefile and when the version/path differs.
In `@python/packages/jumpstarter/jumpstarter/client/client_test.py`:
- Around line 120-124: The test's topological-order check only ensures
"grandparent" comes after "parent" but doesn't verify that "child" precedes
"parent"; update the assertions to explicitly assert that
build_order.index("child") < build_order.index("parent") (and keep/optionally
assert both "child" and "parent" are before "grandparent") so the test enforces
children-before-parents ordering; reference the existing build_order list and
the node names "child", "parent", and "grandparent" when adding the
assertion(s).
In `@python/packages/jumpstarter/jumpstarter/client/status_monitor.py`:
- Around line 31-38: The docstring usage example in status_monitor.py
incorrectly calls client.start_status_monitor(); update the example to call the
actual method client.status_monitor_async() (or add a thin alias
start_status_monitor -> status_monitor_async in the client if you prefer) so the
docstring matches the real API; ensure the rest of the example still references
monitor.current_status and monitor.wait_for_status(ExporterStatus.LEASE_READY).
In `@python/packages/jumpstarter/jumpstarter/exporter/hooks.py`:
- Around line 426-500: The before-lease failure branch doesn't actually end the
lease for on_failure='endLease'; modify run_before_lease_hook to accept an async
lease-release callback (e.g., end_lease: Callable[[LeaseContext],
Awaitable[None]]) or reuse an existing controller-release method, and in the
HookExecutionError else-branch (the on_failure='endLease' path) invoke await
end_lease(lease_scope) (and still set lease_scope.before_lease_hook
event/unblock connections and report status) so the lease is cleanly terminated;
update all call sites of run_before_lease_hook to pass the new end_lease
callback (mirroring how run_after_lease_hook is wired).
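The suggested wiring can be sketched as follows; `HookExecutionError`, `LeaseContext`, and `run_before_lease_hook` below are simplified stand-ins for the real symbols in hooks.py, and only the shape matters: the `on_failure='endLease'` branch invokes an injected `end_lease` callback so the lease is actually terminated.

```python
import asyncio
from typing import Awaitable, Callable


class HookExecutionError(Exception):
    """Stand-in for the real hook failure exception."""


class LeaseContext:
    def __init__(self, lease_name: str) -> None:
        self.lease_name = lease_name
        self.released = False


async def run_before_lease_hook(
    hook: Callable[[LeaseContext], Awaitable[None]],
    lease_scope: LeaseContext,
    on_failure: str,
    end_lease: Callable[[LeaseContext], Awaitable[None]],
) -> None:
    try:
        await hook(lease_scope)
    except HookExecutionError:
        if on_failure == "endLease":
            await end_lease(lease_scope)  # cleanly terminate the lease
        # other on_failure modes: report failed status and continue


async def demo() -> bool:
    async def failing_hook(scope: LeaseContext) -> None:
        raise HookExecutionError("hook exited non-zero")

    async def end_lease(scope: LeaseContext) -> None:
        scope.released = True  # stand-in for ReportStatus(release_lease=True)

    ctx = LeaseContext("test-lease")
    await run_before_lease_hook(failing_hook, ctx, "endLease", end_lease)
    return ctx.released
```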
♻️ Duplicate comments (2)
python/packages/jumpstarter-cli/jumpstarter_cli/shell.py (1)
80-90: Unused synchronous function.

This function is no longer called anywhere in the codebase since line 213 now uses `_run_shell_with_lease_async`. Consider removing it to reduce maintenance burden.

python/packages/jumpstarter/jumpstarter/client/status_monitor.py (1)
397-403: Async context manager won't start monitoring.

The `__aenter__` method returns `self` without calling `start()`. Direct usage as an async context manager (`async with StatusMonitor(...) as monitor:`) will silently fail to start background polling. Users must use `client.status_monitor_async()` instead. Either fix `__aenter__` to properly start monitoring or remove the context manager methods to prevent misuse.
🧹 Nitpick comments (9)
python/packages/jumpstarter-cli/jumpstarter_cli/shell.py (2)
157-161: Redundant double timeout.

Both `anyio.move_on_after(300)` and `timeout=300.0` in `wait_for_any_of` set 5-minute timeouts. The inner `timeout` parameter makes the outer cancel scope redundant. Consider removing one:

♻️ Proposed fix

```diff
-    with anyio.move_on_after(300) as timeout_scope:  # 5 minute timeout
-        result = await monitor.wait_for_any_of(
-            [ExporterStatus.AVAILABLE, ExporterStatus.AFTER_LEASE_HOOK_FAILED],
-            timeout=300.0
-        )
+    result = await monitor.wait_for_any_of(
+        [ExporterStatus.AVAILABLE, ExporterStatus.AFTER_LEASE_HOOK_FAILED],
+        timeout=300.0
+    )
+    if result is None:
+        logger.warning("Timeout waiting for afterLease hook to complete")
```

Then adjust the subsequent conditional logic accordingly, removing the `timeout_scope.cancelled_caught` check at line 171.
133-136: Consider checking `connection_lost` for the beforeLease hook. The afterLease handling (line 166) checks `monitor.connection_lost` to distinguish between timeout and connection loss. The beforeLease handling treats both cases as "Timeout", which could be misleading in logs.
♻️ Proposed enhancement
```diff
  if result == ExporterStatus.BEFORE_LEASE_HOOK_FAILED:
      logger.warning("beforeLease hook failed")
  elif result is None:
-     logger.warning("Timeout waiting for beforeLease hook")
+     if monitor.connection_lost:
+         logger.warning("Connection lost during beforeLease hook")
+     else:
+         logger.warning("Timeout waiting for beforeLease hook")
```
python/packages/jumpstarter/jumpstarter/client/client_test.py (1)
58-58: Ensure `ExitStack` is closed after each test.
`ExitStack()` is never closed, which can leak cleanup callbacks if `client_from_channel` registers any. Prefer `with ExitStack() as mock_stack:` to guarantee teardown.
♻️ Example pattern (apply to each test)
```diff
- mock_stack = ExitStack()
+ with ExitStack() as mock_stack:
      with patch(
          "jumpstarter.client.client.MultipathExporterStub", return_value=mock_stub
      ), patch(
          "jumpstarter.client.client.import_class", return_value=MockDriverClient
      ):
          from jumpstarter.client.client import client_from_channel
          client = await client_from_channel(
              mock_channel, mock_portal, mock_stack, allow=[], unsafe=True
          )
```
Also applies to: 98-98, 143-143, 177-177, 211-211, 245-245
python/packages/jumpstarter/jumpstarter/client/status_monitor_test.py (1)
16-32: Consider extracting shared mock utilities to a common test module.
`MockAioRpcError` and `create_mock_rpc_error` are duplicated here and in `core_test.py` (per the relevant snippets). Extracting these to a shared test utilities module (e.g., `conftest.py` or a dedicated `testing.py`) would reduce duplication.
python/packages/jumpstarter/jumpstarter/client/core.py (1)
332-351: Asymmetric behavior when monitor is not started.
`wait_for_lease_ready_monitored` raises `RuntimeError` if the monitor isn't started, while `wait_for_hook_complete_monitored` silently returns `True`. This asymmetry could be surprising. Consider documenting this behavior difference more explicitly or using consistent error handling.
python/packages/jumpstarter/jumpstarter/client/status_monitor.py (1)
309-309: Use %-style formatting in logger calls instead of f-strings.
F-strings in logging calls evaluate the string even when the log level is disabled, causing unnecessary overhead. Use %-style formatting for deferred evaluation.
Proposed fix
```diff
- logger.warning(f"Missed {missed} status transition(s)")
+ logger.warning("Missed %d status transition(s)", missed)

- logger.info(f"Status changed: {old_status} -> {new_status} (version={new_version})")
+ logger.info("Status changed: %s -> %s (version=%d)", old_status, new_status, new_version)

- logger.error(f"Status change callback error: {e}")
+ logger.error("Status change callback error: %s", e)

- logger.debug(f"GetStatus poll error: {e.code()}")
+ logger.debug("GetStatus poll error: %s", e.code())

- logger.debug(f"GetStatus poll error: {e}")
+ logger.debug("GetStatus poll error: %s", e)
```
Also applies to: 320-320, 337-337, 352-354
controller/internal/service/suite_test.go (1)
90-93: Guard teardown if setup failed. If `testEnv` was never initialized, `Stop()` can panic in `AfterSuite`. A simple nil check avoids cascading failures.
🛡️ Suggested guard
```diff
  By("tearing down the test environment")
- err := testEnv.Stop()
- Expect(err).NotTo(HaveOccurred())
+ if testEnv != nil {
+     err := testEnv.Stop()
+     Expect(err).NotTo(HaveOccurred())
+ }
```
controller/Makefile (1)
66-75: Guard against duplicate `metadata.annotations` blocks. Lines 66–71 unconditionally insert a new `annotations:` map. If the generated role ever includes annotations (now or later), this can create duplicate keys and potentially drop the sync-wave or other annotations depending on YAML parsing behavior. Consider merging into an existing annotations map or using a YAML-aware tool for the insert.
python/packages/jumpstarter/jumpstarter/exporter/hooks.py (1)
341-356: Use `anyio.to_thread.run_sync()` with a timeout for the blocking `process.wait()` after `process.kill()`. The direct `process.wait()` call at line 354 blocks the event loop if the process doesn't exit promptly. The same block already demonstrates the correct pattern at lines 346–347 with `anyio.to_thread.run_sync()` and `anyio.move_on_after()`. Apply the same pattern here for consistency and robustness:
Safer non-blocking wait
```diff
  if process.poll() is None:
      process.kill()
      try:
-         process.wait()
+         with anyio.move_on_after(5):
+             await anyio.to_thread.run_sync(process.wait)
      except Exception:
          pass
```
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@controller/Makefile`:
- Line 74: The Makefile currently runs the cp command copying CRDs to
deploy/helm/jumpstarter/charts/jumpstarter-controller/templates/crds/, which
will fail if that target directory doesn't exist; update the Makefile to create
the directory first by running a mkdir -p for
deploy/helm/jumpstarter/charts/jumpstarter-controller/templates/crds/
immediately before the cp command so the cp in the Makefile succeeds even when
the templates/crds directory is missing.
🧹 Nitpick comments (1)
controller/Makefile (1)
66-71: Idempotency risk: duplicate `annotations:` keys if the file already contains annotations. The AWK script unconditionally inserts an `annotations:` block after the `name:` line. If controller-gen ever outputs annotations in the generated role.yaml, or if the manifests target runs without a fresh regeneration, this will produce invalid YAML with duplicate `annotations:` keys. Consider using a more robust approach that checks for existing annotations before inserting:
♻️ Suggested safer AWK approach
```diff
  # Add ArgoCD sync-wave annotation to RBAC role for proper deployment ordering (PR #207)
- @awk '/^  name: jumpstarter-manager-role$$/{print; print "  annotations:"; print "    argocd.argoproj.io/sync-wave: \"-1\""; next}1' \
+ @awk ' \
+   /^  name: jumpstarter-manager-role$$/ { print; found_name=1; next } \
+   found_name && /^  annotations:/ { print; found_annotations=1; next } \
+   found_name && found_annotations && /^  [a-zA-Z]/ { \
+     print "    argocd.argoproj.io/sync-wave: \"-1\""; found_name=0; found_annotations=0 \
+   } \
+   found_name && !found_annotations && /^  [a-zA-Z]/ { \
+     print "  annotations:"; print "    argocd.argoproj.io/sync-wave: \"-1\""; found_name=0 \
+   } \
+   { print }' \
  deploy/helm/jumpstarter/charts/jumpstarter-controller/templates/rbac/role.yaml > \
  deploy/helm/jumpstarter/charts/jumpstarter-controller/templates/rbac/role.yaml.tmp && \
  mv deploy/helm/jumpstarter/charts/jumpstarter-controller/templates/rbac/role.yaml.tmp \
  deploy/helm/jumpstarter/charts/jumpstarter-controller/templates/rbac/role.yaml
```
Alternatively, consider using a tool like `yq` for safer YAML manipulation.
Also, the comment references "PR #207" but this appears to be PR #140.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@python/packages/jumpstarter-cli/jumpstarter_cli/shell.py`:
- Around line 125-143: The code currently launches the shell unconditionally
after monitor.wait_for_any_of; change it to treat
ExporterStatus.BEFORE_LEASE_HOOK_FAILED and a None (timeout) as hard stops by
not calling _run_shell_only. After the wait_for_any_of call
(monitor.wait_for_any_of), check result and if it equals
ExporterStatus.BEFORE_LEASE_HOOK_FAILED or is None, log the warning and exit the
function (or raise an appropriate exception) instead of proceeding to call
anyio.to_thread.run_sync(_run_shell_only, lease, config, command, path); only
call _run_shell_only when result == ExporterStatus.LEASE_READY.
♻️ Duplicate comments (5)
python/packages/jumpstarter/jumpstarter/exporter/exporter.py (4)
202-204: Controller channels are not closed after use. This creates a new gRPC channel per call but never closes it. Multiple calls during status updates and stream retries will leak channels. The pattern in `_unregister_with_controller()` (lines 357-369) demonstrates proper cleanup with an explicit `await channel.close()`.
414-458: Task may hang when lease ends without EndSession. This task only waits on `end_session_requested` (line 426). If the lease ends without the client calling EndSession, the task continues waiting indefinitely on the old lease context. Consider racing against `lease_ended` or using a task group scoped to the lease lifetime.
546-553: Listen task should be scoped to the lease lifetime. The `_retry_stream` task for Listen is started on the outer `tg` (line 548), causing it to continue running after `handle_lease` exits. This results in the task persisting across lease transitions. Move this to `conn_tg` so it's automatically cancelled when the lease ends, and close `listen_tx` in the finally block to signal proper stream termination.
578-579: Consider scoping `_handle_end_session` to the lease task group. Starting this task on the outer `tg` means it survives the lease ending. Combined with the fact that it only waits on `end_session_requested`, the task can be orphaned. This is related to the hang issue noted in the `_handle_end_session` method.
python/packages/jumpstarter-cli/jumpstarter_cli/shell.py (1)
80-89: Consider removing `_run_shell_with_lease` if it's now unused. The async path appears to be the only entry point, so this sync helper likely adds maintenance overhead.
```bash
#!/bin/bash
# Verify remaining usages of the sync helper
rg -nP '\b_run_shell_with_lease\s*\(' --type py
```
Actionable comments posted: 4
🤖 Fix all issues with AI agents
In `@python/packages/jumpstarter/jumpstarter/client/status_monitor.py`:
- Around line 8-18: The import block is unsorted per Ruff I001; reorder imports
into standard-library first (logging, collections.abc.Awaitable,
collections.abc.Callable, contextlib.AsyncExitStack), then third-party packages
(anyio, anyio.Event, anyio.create_task_group, grpc.StatusCode,
grpc.aio.AioRpcError, jumpstarter_protocol.jumpstarter_pb2), and finally
local/project imports (jumpstarter.common.ExporterStatus) so the import order
satisfies the linter and keeps the existing imported symbols intact.
- Around line 196-201: The timeout checks in wait_for_status() and
wait_for_any_of() incorrectly use "if timeout:" which treats 0 as falsy; change
both checks to "if timeout is not None:" so a timeout of 0 is handled as an
immediate timeout; update the branches around anyio.move_on_after in
wait_for_status() and wait_for_any_of() (the blocks that call wait_loop() and
return False) to use the new condition.
- Around line 396-400: stop() currently sets _running = False and signals
_stop_event but does not wake waiters blocked on _any_change_event; update
stop() to also set _any_change_event (and optionally _any_change_event.set()
followed by _any_change_event.clear() if you want a one-shot wake) so waiters in
wait_for_status and wait_for_any_of that are awaiting _any_change_event.wait()
will wake and observe _running == False and exit their loops; reference the stop
method and the _any_change_event, wait_for_status, and wait_for_any_of functions
when applying this change.
In `@python/packages/jumpstarter/jumpstarter/exporter/exporter.py`:
- Around line 168-191: The stop() method currently ignores the caller's
should_unregister when wait_for_lease_exit is True, causing later calls (e.g.,
in serve()) to override intent; update stop() so that whenever a deferred stop
is requested (the branch setting self._stop_requested = True) you also set
self._unregister = should_unregister (and likewise when a stop has already been
requested you should update/preserve self._unregister accordingly) so the
original caller preference is retained; refer to the stop() method, the
attributes self._unregister, self._stop_requested, and self._tg.cancel_scope to
implement this change.
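The deferred-stop bookkeeping described in that last instruction can be sketched as follows. This is a hypothetical reduction of the exporter's stop path (attribute names mirror the review's description; the real class carries task groups and gRPC state this sketch omits):

```python
class ExporterStopSketch:
    """Hypothetical exporter that defers stop until the lease exits."""

    def __init__(self):
        self._stop_requested = False
        self._unregister = True
        self._lease_active = True
        self.stopped = False

    def stop(self, should_unregister=True, wait_for_lease_exit=False):
        if wait_for_lease_exit and self._lease_active:
            # Defer the stop, but record the caller's unregister preference
            self._stop_requested = True
            self._unregister = should_unregister
            return
        self._unregister = should_unregister
        self._do_stop()

    def on_lease_exit(self):
        self._lease_active = False
        if self._stop_requested:
            # Reuse the stored preference instead of forcing unregister=True
            self._do_stop()

    def _do_stop(self):
        self.stopped = True

exp = ExporterStopSketch()
exp.stop(should_unregister=False, wait_for_lease_exit=True)
exp.on_lease_exit()
print(exp.stopped, exp._unregister)  # True False
```

The key point is that the deferred path stores `should_unregister` at the time the caller expressed it, so a later internal `stop()` cannot silently override it.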
🧹 Nitpick comments (1)
python/packages/jumpstarter-cli/jumpstarter_cli/shell.py (1)
147-162: Remove redundant timeout wrapper and improve timeout detection logic. The `StatusMonitor.connection_lost` property is valid and well-tested. However, the outer `anyio.move_on_after(30)` is redundant: `wait_for_any_of()` already has an identical `timeout=30.0` parameter. When both are equal, the inner timeout fires first, making the outer scope's `cancelled_caught` flag never True. This means timeout cases currently fall through to "Hook completion not confirmed" rather than the timeout warning.
The suggested refactor properly addresses this by removing the outer timeout wrapper and using `result is None` to detect timeouts, which disambiguates between connection loss and timeout.
♻️ Suggested simplification
```diff
- with anyio.move_on_after(30) as timeout_scope:  # 30 second timeout
-     result = await monitor.wait_for_any_of(
-         [ExporterStatus.AVAILABLE, ExporterStatus.AFTER_LEASE_HOOK_FAILED],
-         timeout=30.0
-     )
- if result == ExporterStatus.AVAILABLE:
-     logger.info("afterLease hook completed")
- elif result == ExporterStatus.AFTER_LEASE_HOOK_FAILED:
-     logger.warning("afterLease hook failed")
- elif monitor.connection_lost:
-     # Connection lost - exporter closed, hook likely completed
-     logger.info("Exporter connection closed (hook completed)")
- else:
-     logger.debug("Hook completion not confirmed")
- if timeout_scope.cancelled_caught:
-     logger.warning("Timeout waiting for afterLease hook to complete")
+ result = await monitor.wait_for_any_of(
+     [ExporterStatus.AVAILABLE, ExporterStatus.AFTER_LEASE_HOOK_FAILED],
+     timeout=30.0
+ )
+ if result == ExporterStatus.AVAILABLE:
+     logger.info("afterLease hook completed")
+ elif result == ExporterStatus.AFTER_LEASE_HOOK_FAILED:
+     logger.warning("afterLease hook failed")
+ elif result is None and monitor.connection_lost:
+     # Connection lost - exporter closed, hook likely completed
+     logger.info("Exporter connection closed (hook completed)")
+ elif result is None:
+     logger.warning("Timeout waiting for afterLease hook to complete")
+ else:
+     logger.debug("Hook completion not confirmed")
```
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@python/packages/jumpstarter/jumpstarter/client/lease.py`:
- Around line 237-257: The fixed-count retry loop around self.controller.Dial
(using max_retries/base_delay/max_delay) should be replaced with a time‑based
retry bounded by a configurable timeout (use the existing acquisition_timeout if
available or add a new dial_timeout attribute/parameter) so Dial keeps retrying
until time.monotonic() exceeds the deadline; compute deadline = now + timeout,
then loop attempting await self.controller.Dial(...), on AioRpcError with
FAILED_PRECONDITION and "not ready" apply the same exponential/backoff delay
(min(base_delay*2**attempt, max_delay)) but stop retrying and re-raise once
monotonic() > deadline; update the retry log to show remaining time or attempts
left and ensure the new timeout is plumbed through the Lease acquisition
call/site that calls this Dial logic.
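The deadline-bounded retry described above can be sketched independently of gRPC. Here `ConnectionError` stands in for the `FAILED_PRECONDITION` "not ready" case, and the names (`dial_with_deadline`, `flaky_dial`) are hypothetical:

```python
import time

def dial_with_deadline(dial, timeout=10.0, base_delay=0.01, max_delay=0.05,
                       sleep=time.sleep):
    """Retry `dial` until it succeeds or the monotonic deadline passes."""
    deadline = time.monotonic() + timeout
    attempt = 0
    while True:
        try:
            return dial()
        except ConnectionError:  # stand-in for FAILED_PRECONDITION "not ready"
            if time.monotonic() > deadline:
                raise  # out of time: surface the last error
            # Exponential backoff, capped at max_delay
            sleep(min(base_delay * 2 ** attempt, max_delay))
            attempt += 1

calls = {"n": 0}

def flaky_dial():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("exporter not ready")
    return "connected"

print(dial_with_deadline(flaky_dial, timeout=1.0))  # connected
```

Because the loop is bounded by `time.monotonic()` rather than an attempt count, a slow beforeLease hook simply consumes more retries until the configured timeout elapses.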
In `@python/packages/jumpstarter/jumpstarter/client/status_monitor.py`:
- Around line 343-347: In _poll_loop (status_monitor.py) the AioRpcError
UNIMPLEMENTED branch currently breaks without clearing _running or notifying
waiters; update that branch to mirror the UNAVAILABLE handling by setting
self._running = False and calling self._any_change_event.set() (and any other
cleanup done in the UNAVAILABLE branch) before breaking so wait_for_status() and
wait_for_any_of() don't hang; also add a unit test that triggers UNIMPLEMENTED
while a waiter (using wait_for_status() or wait_for_any_of()) is active and
assert the waiter returns promptly.
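The "wake waiters before breaking" pattern that both the UNIMPLEMENTED and UNAVAILABLE branches need can be sketched with stdlib asyncio. The class below is a hypothetical reduction of the monitor; the point is that the exit path sets the change event so a blocked waiter returns promptly instead of sleeping out its full timeout:

```python
import asyncio

class WakeOnExitMonitor:
    """Hypothetical monitor whose poll-loop exit wakes blocked waiters."""

    def __init__(self):
        self._running = True
        self._any_change = asyncio.Event()

    async def wait_for_status(self, timeout=5.0):
        loop = asyncio.get_running_loop()
        deadline = loop.time() + timeout
        while self._running:
            remaining = deadline - loop.time()
            if remaining <= 0:
                return False
            try:
                await asyncio.wait_for(self._any_change.wait(), remaining)
            except asyncio.TimeoutError:
                return False
            self._any_change.clear()
        return False  # monitor stopped (e.g., UNIMPLEMENTED): give up promptly

    def mark_stopped(self):
        # Clear the running flag, then wake waiters so they observe it
        self._running = False
        self._any_change.set()

async def main():
    mon = WakeOnExitMonitor()
    waiter = asyncio.create_task(mon.wait_for_status(timeout=5.0))
    await asyncio.sleep(0.01)
    mon.mark_stopped()
    # The waiter must return well before its 5s timeout
    return await asyncio.wait_for(waiter, 0.5)

print(asyncio.run(main()))  # False
```

The ordering matters: set the flag first, then the event, so a woken waiter always sees the final state.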
♻️ Duplicate comments (3)
python/packages/jumpstarter/jumpstarter/client/status_monitor.py (3)
8-18: Reorder imports to satisfy Ruff (I001). Current ordering may still fail Ruff I001.
🔧 Proposed import order
```diff
- import logging
- from collections.abc import Awaitable, Callable
- from contextlib import AsyncExitStack
+ from collections.abc import Awaitable, Callable
+ from contextlib import AsyncExitStack
+ import logging
```
Please verify with:
```bash
#!/bin/bash
ruff check python/packages/jumpstarter/jumpstarter/client/status_monitor.py
```
Based on learnings, Ruff is the formatter/linter for Python files.
196-201: Handle `timeout=0` explicitly.
`if timeout:` treats `0` as falsy, so an immediate timeout becomes an infinite wait (lines 196 and 280).
🐛 Proposed fix
```diff
- if timeout:
+ if timeout is not None:
      with anyio.move_on_after(timeout):
          return await wait_loop()
  return False

- if timeout:
+ if timeout is not None:
      with anyio.move_on_after(timeout):
          return await wait_for_first()
  return None
```
Optional verification:
```bash
#!/bin/bash
rg -n "wait_for_status|wait_for_any_of|timeout\s*=\s*0" --type=py
```
Also applies to: 280-285
396-400: Wake waiting coroutines on stop().
`stop()` doesn't signal `_any_change_event`, so `wait_for_*` can hang after shutdown.
🛠️ Proposed fix
```diff
  async def stop(self):
      """Stop the background polling task."""
      self._running = False
      self._stop_event.set()
+     # Wake any waiters so they can observe _running=False
+     self._any_change_event.set()
+     self._any_change_event = Event()
```
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@e2e/dex.values.yaml`:
- Around line 37-40: The userID value for the test user "test-exporter-hooks" is
not a valid UUID because it contains the non-hex character 'g' (value:
c6ed6f40-6689-6g7e-c64c-55d0a9a5871f); replace this with a valid UUID (only 0-9
and a-f), e.g., generate a new v4 UUID and set userID to that new value so the
entry for username "test-exporter-hooks" and its userID field are valid.
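A replacement value can be generated rather than hand-edited; `uuid.uuid4()` only emits hex digits (0-9, a-f), so a stray character like `g` cannot recur:

```python
import uuid

# Generate a valid v4 UUID for the test user's userID field
new_user_id = str(uuid.uuid4())
assert all(c in "0123456789abcdef-" for c in new_user_id)
print(len(new_user_id))  # 36
```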
🧹 Nitpick comments (5)
e2e/tests-hooks.bats (5)
9-10: Unused variable declaration.
`CURRENT_HOOKS_CONFIG` is declared but never referenced anywhere in the file. Consider removing it to avoid confusion.
Suggested fix
```diff
- # Track which config is currently active
- CURRENT_HOOKS_CONFIG=""
```
51-54: Consider graceful termination before SIGKILL. Using `kill -9` immediately doesn't give the process a chance to clean up. Consider sending `SIGTERM` first with a brief grace period before escalating to `SIGKILL`.
Suggested improvement
```diff
  if ps -p "$pid" > /dev/null 2>&1; then
      echo "  Killing PID $pid" >&2
-     kill -9 "$pid" 2>/dev/null || true
+     kill -TERM "$pid" 2>/dev/null || true
+     sleep 0.5
+     ps -p "$pid" > /dev/null 2>&1 && kill -9 "$pid" 2>/dev/null || true
  fi
```
104-110: Fixed sleep may introduce flakiness. The 2-second sleep on line 107 is intended to avoid catching pre-connect state, but this is timing-dependent. If the exporter connects faster or slower than expected, tests could become flaky. Consider polling for a specific pre-condition instead, or documenting why 2 seconds is sufficient.
205-208: Add safeguards for empty grep results. If the grep commands fail to find matches (e.g., due to unexpected output format), `before_pos` or `after_pos` could be empty, causing the comparison on line 208 to fail with a shell error. Consider adding validation or using `|| true` with a fallback.
Suggested improvement
```diff
  # Verify order: BEFORE should appear before AFTER in output
  local before_pos=$(echo "$output" | grep -n "BEFORE_HOOK:" | head -1 | cut -d: -f1)
  local after_pos=$(echo "$output" | grep -n "AFTER_HOOK:" | head -1 | cut -d: -f1)
+ # Ensure both markers were found
+ [ -n "$before_pos" ] && [ -n "$after_pos" ]
  [ "$before_pos" -lt "$after_pos" ]
```
315-333: Tests lack verification of actual `j` command effects. Tests E1 and E2 only verify that the hook scripts complete (via output markers), but don't verify that `j power on/off` commands actually executed or had the expected effect. Consider adding assertions that verify the power state changed, or at minimum checking for command output in the hook logs.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@python/packages/jumpstarter/jumpstarter/client/status_monitor.py`:
- Around line 358-365: The UNAVAILABLE branch breaks the poll loop but doesn't
update the monitor's running state, so set self._running = False when handling
StatusCode.UNAVAILABLE; specifically, inside the elif e.code() ==
StatusCode.UNAVAILABLE block update self._running to False alongside setting
self._connection_lost, trigger and recreate self._any_change_event as shown, and
ensure start() can restart the monitor after this exit.
♻️ Duplicate comments (2)
python/packages/jumpstarter/jumpstarter/client/status_monitor.py (2)
199-204: Handle `timeout=0` explicitly.
`if timeout:` treats `0` as falsy, so `timeout=0` waits indefinitely instead of timing out immediately.
🛠️ Proposed fix
```diff
- if timeout:
+ if timeout is not None:
      with anyio.move_on_after(timeout):
          return await wait_loop()
  return False

- if timeout:
+ if timeout is not None:
      with anyio.move_on_after(timeout):
          return await wait_for_first()
  return None
```
Also applies to: 283-286
416-420: Wake waiters on stop to avoid hangs.
`stop()` only signals `_stop_event`, but waiters block on `_any_change_event`, so they can hang indefinitely.
🛠️ Proposed fix
```diff
  async def stop(self):
      """Stop the background polling task."""
      self._running = False
      self._stop_event.set()
+     # Wake waiters so they can observe _running=False
+     self._any_change_event.set()
+     self._any_change_event = Event()
```
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
python/packages/jumpstarter/jumpstarter/config/client.py (1)
162-177: Thread `include_status` through to the ListExporters request. The `include_status` parameter is only used to set a display flag on the response (line 176), but it's never passed to `ClientService.ListExporters`. If the server requires this flag in the request to include status data, status information won't be fetched. Update `ClientService.ListExporters` to accept and propagate `include_status` to the protobuf request.
Suggested update
```diff
- exporters_response = await svc.ListExporters(page_size=page_size, page_token=page_token, filter=filter)
+ exporters_response = await svc.ListExporters(
+     page_size=page_size,
+     page_token=page_token,
+     filter=filter,
+     include_status=include_status,
+ )
```
Actionable comments posted: 6
🤖 Fix all issues with AI agents
In `@controller/internal/service/controller_service_integration_test.go`:
- Around line 76-86: The Lease objects in the tests (the Lease struct created
and passed to k8sClient.Create in controller_service_integration_test.go) are
missing the required Selector field in LeaseSpec which causes API validation to
fail; update the Lease initializations (the variable lease used with
k8sClient.Create and its LeaseSpec) to include Selector: metav1.LabelSelector{}
or an appropriate label selector (e.g., Selector:
&metav1.LabelSelector{MatchLabels: map[string]string{"app":"test"}}) so the
CRD-required selector is present when Create(ctx, lease) is called.
In `@e2e/tests.bats`:
- Around line 190-196: Two occurrences of the yq load expression still reference
the old path load("e2e/exporter.yaml"); update those commands that match the
pattern go run ... -i ". * load(\"e2e/exporter.yaml\")" to use
load("e2e/exporters/exporter.yaml") instead so all yq invocations point to the
moved exporter.yaml file (ensure the same replacement is applied for both
remaining occurrences).
In `@python/packages/jumpstarter-cli/jumpstarter_cli/run.py`:
- Around line 79-94: The parent currently interprets any non-zero child exit as
a signal and adds 128, which double-encodes real exit codes; change the protocol
so the child (exporter) encodes signals as 128 + signal number and the parent in
run.py should stop adding 128 — i.e., use exporter_exit_code directly when set
(replace any logic that converts non-zero exporter_exit_code into 128+value),
and keep received_signal only for locally-caught signals; update handling around
exporter_exit_code and received_signal (and cancellation via
signal_tg.cancel_scope.cancel()) so the parent returns exporter_exit_code as-is
when present.
In `@python/packages/jumpstarter/jumpstarter/client/core.py`:
- Around line 458-485: Guard creation of LogSource from response.source with a
try/except to handle unknown enum values: when evaluating source =
LogSource(response.source) (and computing is_hook), catch ValueError and set
source to LogSource.SYSTEM (and is_hook = False) so the stream continues instead
of raising; keep the existing routing logic that maps source to logger_name and
logging through source_logger.log(log_level, response.message). Also ensure
severity fallback (severity = response.severity if response.severity else
"INFO") remains unchanged.
- Around line 127-187: The wait_for_lease_ready method currently logs a warning
on timeout but returns silently so callers cannot detect failure; modify
wait_for_lease_ready to raise DriverError (or return False if you prefer an API
change) when the timeout elapses instead of only logging: replace the final
self.logger.warning(...) with raising DriverError(f"Timeout waiting for
beforeLease hook to complete after {timeout:.1f}s and {poll_count} polls")
(ensure DriverError is imported/available) so callers of wait_for_lease_ready
can programmatically handle timeouts.
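The raise-on-timeout behavior requested in that instruction can be sketched with stdlib asyncio. `DriverError`, `poll`, and the parameter names are stand-ins mirroring the review's wording, not the project's actual signatures:

```python
import asyncio

class DriverError(Exception):
    pass

async def wait_for_lease_ready(poll, timeout=0.1, interval=0.02):
    """Poll until ready; raise instead of returning silently on timeout."""
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout
    poll_count = 0
    while loop.time() < deadline:
        poll_count += 1
        if await poll():
            return
        await asyncio.sleep(interval)
    raise DriverError(
        f"Timeout waiting for beforeLease hook to complete "
        f"after {timeout:.1f}s and {poll_count} polls"
    )

async def never_ready():
    return False

try:
    asyncio.run(wait_for_lease_ready(never_ready))
    outcome = "returned"
except DriverError:
    outcome = "raised"
print(outcome)  # raised
```

Raising makes the failure visible to callers that previously had no way to distinguish "hook completed" from "gave up waiting".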
In `@python/packages/jumpstarter/jumpstarter/exporter/exporter.py`:
- Around line 168-191: The stop method currently ignores the caller's
should_unregister value when deferring stop; update stop(self, ...) to persist
should_unregister into an instance field (e.g. set self._unregister =
should_unregister) whenever wait_for_lease_exit is true and you set
self._stop_requested = True so the original preference is preserved, and ensure
the lease-exit path that later invokes stop uses that stored self._unregister
rather than unconditionally passing True; update references to _unregister and
_stop_requested so deferred stops respect the saved preference.
🧹 Nitpick comments (11)
python/packages/jumpstarter-cli/jumpstarter_cli/shell.py (1)
147-162: Redundant timeout pattern may cause confusion. The code uses both `anyio.move_on_after(30)` and `timeout=30.0` in `wait_for_any_of`. Since both timeouts are identical, the inner `wait_for_any_of` will return `None` on timeout before `move_on_after` can trigger, meaning `timeout_scope.cancelled_caught` at line 161 will likely never be `True`.
Consider removing one timeout mechanism:
♻️ Suggested simplification
```diff
- with anyio.move_on_after(30) as timeout_scope:  # 30 second timeout
-     result = await monitor.wait_for_any_of(
-         [ExporterStatus.AVAILABLE, ExporterStatus.AFTER_LEASE_HOOK_FAILED],
-         timeout=30.0
-     )
- if result == ExporterStatus.AVAILABLE:
-     logger.info("afterLease hook completed")
- elif result == ExporterStatus.AFTER_LEASE_HOOK_FAILED:
-     logger.warning("afterLease hook failed")
- elif monitor.connection_lost:
-     # Connection lost - exporter closed, hook likely completed
-     logger.info("Exporter connection closed (hook completed)")
- else:
-     logger.debug("Hook completion not confirmed")
- if timeout_scope.cancelled_caught:
-     logger.warning("Timeout waiting for afterLease hook to complete")
+ result = await monitor.wait_for_any_of(
+     [ExporterStatus.AVAILABLE, ExporterStatus.AFTER_LEASE_HOOK_FAILED],
+     timeout=30.0
+ )
+ if result == ExporterStatus.AVAILABLE:
+     logger.info("afterLease hook completed")
+ elif result == ExporterStatus.AFTER_LEASE_HOOK_FAILED:
+     logger.warning("afterLease hook failed")
+ elif monitor.connection_lost:
+     logger.info("Exporter connection closed (hook completed)")
+ else:
+     logger.warning("Timeout waiting for afterLease hook to complete")
```
e2e/tests-hooks.bats (2)
75-76: Consider caching the yq installation for faster test execution. Using `go run github.com/mikefarah/yq/v4@latest` fetches from the network on each invocation. For repeated test runs, this adds latency. Consider installing yq once in the test environment setup or using a pre-installed binary.
78-82: Clarify the `3>&-` redirection purpose. The `3>&-` closes file descriptor 3 before running the background loop. While this works, a brief comment explaining why this is needed (likely to prevent fd leaks from bats internals) would help maintainability.
python/packages/jumpstarter/jumpstarter/client/core_test.py (1)
16-53: Optional: consolidate gRPC error/status helpers. These helpers look similar to those in `status_monitor_test.py`; a shared test util could reduce duplication.
python/packages/jumpstarter/jumpstarter/client/core.py (1)
105-126: Consider allowing `ExporterStatus.UNSPECIFIED` for backward compatibility. If `GetStatus` returns the default/unspecified value, line 124 currently raises `ExporterNotReady`, which can block driver calls for exporters that haven't adopted explicit status reporting yet. Adding `UNSPECIFIED` keeps behavior more permissive without impacting the ready path.
♻️ Suggested tweak
```diff
- ALLOWED_STATUSES = {
-     ExporterStatus.LEASE_READY,
-     ExporterStatus.BEFORE_LEASE_HOOK,
-     ExporterStatus.AFTER_LEASE_HOOK,
- }
+ ALLOWED_STATUSES = {
+     ExporterStatus.LEASE_READY,
+     ExporterStatus.BEFORE_LEASE_HOOK,
+     ExporterStatus.AFTER_LEASE_HOOK,
+     ExporterStatus.UNSPECIFIED,
+ }
```
python/packages/jumpstarter/jumpstarter/common/types.py (1)
16-18: Consider removing the empty `TYPE_CHECKING` block. The `TYPE_CHECKING` guard and its comment suggest async stub type aliases were planned, but the block is empty. This is dead code that could be removed for clarity, or the intended type aliases should be added.
🧹 Proposed cleanup
```diff
- # Async stub type aliases are only available for type checking (defined in .pyi files)
- if TYPE_CHECKING:
-     pass
```
Also update the import:
```diff
- from typing import TYPE_CHECKING, TypeAlias
+ from typing import TypeAlias
```
python/packages/jumpstarter/jumpstarter/exporter/lease_context.py (1)
54-57: Consider using explicit validation instead of `assert` statements.
`assert` statements can be disabled when Python runs with optimization flags (`-O`), which would skip these critical validations in production. Consider raising `ValueError` explicitly.
♻️ Proposed fix
```diff
  def __post_init__(self):
      """Validate that required resources are present."""
-     assert self.before_lease_hook is not None, "LeaseScope requires a before_lease_hook event"
-     assert self.lease_name, "LeaseScope requires a non-empty lease_name"
+     if self.before_lease_hook is None:
+         raise ValueError("LeaseContext requires a before_lease_hook event")
+     if not self.lease_name:
+         raise ValueError("LeaseContext requires a non-empty lease_name")
```
621-629: Consider making the 1-second delay configurable or documenting its purpose more prominently. The delay at line 624 allows clients to poll and see the AVAILABLE status before the lease is released. While the comment explains this, users might wonder why there's a hardcoded delay. Consider extracting this to a constant or making it configurable.
python/packages/jumpstarter/jumpstarter/client/status_monitor_test.py (1)
426-522: Good coverage of wait_for_any_of behavior. The tests cover the key scenarios for multi-target waiting. The parallel structure to the `wait_for_status` tests ensures consistent behavior.
Consider adding a test for `DEADLINE_EXCEEDED` handling in `wait_for_any_of` for completeness (similar to `test_wait_for_status_deadline_exceeded_returns_promptly`), though the `UNAVAILABLE` test provides reasonable coverage of the connection-loss path.
python/packages/jumpstarter/jumpstarter/exporter/session.py (1)
270-280: Consider reducing the GetStatus log level. The `GetStatus` RPC may be polled frequently by clients (e.g., `StatusMonitor` polls every 0.1-0.3s). Using `logger.info` could result in excessive log output.
Consider using DEBUG level
```diff
  async def GetStatus(self, request, context):
      """Get the current exporter status with transition tracking."""
-     logger.info("GetStatus() -> %s (version=%d)", self._current_status, self._status_version)
+     logger.debug("GetStatus() -> %s (version=%d)", self._current_status, self._status_version)
      response = jumpstarter_pb2.GetStatusResponse(
```
python/packages/jumpstarter/jumpstarter/exporter/exporter.py (1)
360-382: Consider using `_controller_stub` in `_unregister_with_controller`. This method manages the channel manually instead of using the `_controller_stub` context manager. While the explicit `channel.close()` with `CancelScope(shield=True)` is correct, using the context manager would be more consistent with the rest of the codebase.
Optional refactor for consistency
async def _unregister_with_controller(self): """Safely unregister from controller with timeout and error handling.""" if not (self._registered and self._unregister): return logger.info("Unregistering exporter with controller") try: with move_on_after(10): # 10 second timeout - channel = await self.channel_factory() - try: - controller = jumpstarter_pb2_grpc.ControllerServiceStub(channel) + async with self._controller_stub() as controller: await self._report_status(ExporterStatus.OFFLINE, "Exporter shutting down") await controller.Unregister( jumpstarter_pb2.UnregisterRequest( reason="Exporter shutdown", ) ) logger.info("Controller unregistration completed successfully") - finally: - with CancelScope(shield=True): - await channel.close() except Exception as e: logger.error("Error during controller unregistration: %s", e, exc_info=True)
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@e2e/tests-hooks.bats`:
- Around line 150-171: The test "hooks: create client and exporter" currently
provisions shared state inline (jmp admin create client/exporter and jmp login)
which other tests depend on; make provisioning idempotent and runnable in
isolation by moving those create/login steps into a setup_file or setup function
or by extracting helper functions (e.g., ensure_client_exists,
ensure_exporter_exists, ensure_logged_in) that each test calls; implement the
helpers to check for existence and only create/login when missing (using the
same jmp admin create client/exporter and jmp login commands referenced) so
every test explicitly prepares and resets required state.
- Around line 74-76: Before merging per-test overlays, restore a pristine base
exporter YAML into the exporter target file to avoid cross-test leakage: copy
the committed clean fixture (a base exporter YAML checked into the repo) to the
exporter output path for the test, then run the existing go run
github.com/mikefarah/yq/v4... ". * load(\"e2e/exporters/${config_file}\")"
merge; apply the same change to the other occurrence (the block around lines
94-96) so each test starts from a deterministic base rather than mutating a
shared file.
🧹 Nitpick comments (6)
python/packages/jumpstarter-cli/jumpstarter_cli/shell.py (2)
151-155: Redundant double timeout on afterLease hook wait.

Both `anyio.move_on_after(30)` and `monitor.wait_for_any_of(..., timeout=30.0)` enforce the same 30-second limit. The inner `timeout` parameter should be sufficient; the outer `move_on_after` is redundant.

♻️ Proposed simplification:

```diff
-        with anyio.move_on_after(30) as timeout_scope:  # 30 second timeout
-            result = await monitor.wait_for_any_of(
-                [ExporterStatus.AVAILABLE, ExporterStatus.AFTER_LEASE_HOOK_FAILED],
-                timeout=30.0
-            )
-            if result == ExporterStatus.AVAILABLE:
-                logger.info("afterLease hook completed")
-            elif result == ExporterStatus.AFTER_LEASE_HOOK_FAILED:
-                logger.warning("afterLease hook failed")
-            elif monitor.connection_lost:
-                # Connection lost - exporter closed, hook likely completed
-                logger.info("Exporter connection closed (hook completed)")
-            else:
-                logger.debug("Hook completion not confirmed")
-        if timeout_scope.cancelled_caught:
-            logger.warning("Timeout waiting for afterLease hook to complete")
+        result = await monitor.wait_for_any_of(
+            [ExporterStatus.AVAILABLE, ExporterStatus.AFTER_LEASE_HOOK_FAILED],
+            timeout=30.0
+        )
+        if result == ExporterStatus.AVAILABLE:
+            logger.info("afterLease hook completed")
+        elif result == ExporterStatus.AFTER_LEASE_HOOK_FAILED:
+            logger.warning("afterLease hook failed")
+        elif monitor.connection_lost:
+            # Connection lost - exporter closed, hook likely completed
+            logger.info("Exporter connection closed (hook completed)")
+        else:
+            logger.warning("Timeout waiting for afterLease hook to complete")
```
259-283: Consider adding a default case for unexpected config types.

The `match` statement handles `ClientConfigV1Alpha1` and `ExporterConfigV1Alpha1`, but lacks a default case. If an unexpected config type is passed, the function silently returns `None` without calling `sys.exit()`, which could lead to confusing behavior.

♻️ Proposed fix to add a fallback case:

```diff
         case ExporterConfigV1Alpha1():
             with config.serve_unix() as path:
                 # SAFETY: the exporter config is local thus considered trusted
                 launch_shell(
                     path,
                     "local",
                     allow=[],
                     unsafe=True,
                     use_profiles=False,
                     command=command,
                 )
+
+        case _:
+            raise click.ClickException(f"Unsupported config type: {type(config).__name__}")
```

e2e/tests-hooks.bats (1)
17-22: Enable xtrace for easier E2E debugging.

Please add `set -o xtrace` in `setup()` so test runs are always traceable.

Based on learnings: keep tracing enabled (`-x`) unconditionally because there are no secrets in the bats tests and it makes debugging easier with more understandable logs.

Proposed diff:

```diff
 setup() {
+    set -o xtrace
     bats_load_library bats-support
     bats_load_library bats-assert
```

python/packages/jumpstarter/jumpstarter/exporter/hooks.py (3)
240-241: Consider moving `import time` to method level.

The `import time` statement is inside the nested `read_pty_output` function, which means it's executed on every call to that function. While the import is cached by Python, moving it to lines 174-175 alongside the other local imports (`pty`, `subprocess`) would be more consistent and slightly more efficient.

🔧 Suggested fix:

```diff
 def _execute_hook_process(  # noqa: C901
     ...
 ) -> None:
     ...
     import pty
     import subprocess
+    import time
     ...
     async def read_pty_output() -> None:  # noqa: C901
         """Read from PTY master fd line by line using non-blocking I/O."""
         logger.debug("read_pty_output task started")
         buffer = b""
         read_count = 0
         last_heartbeat = 0
-        import time
         start_time = time.monotonic()
```
625-633: Consider making the status propagation delay configurable.

The 1.0 second delay before requesting lease release is well-documented, but this value may need tuning based on network conditions. Consider exposing it as a configurable parameter or documenting the rationale for this specific value in the code comments.
49-55: Remove unused `device_factory` field from `HookExecutor`.

The `device_factory` field is declared in the dataclass but is never used in any class method (`_create_hook_env`, `_execute_hook`, or the public hook execution methods). Since it serves no purpose within the class, it should be removed to keep the code clean and reduce confusion.
Actionable comments posted: 4
🤖 Fix all issues with AI agents
In `@e2e/tests-hooks.bats`:
- Around line 80-84: The restart loop around `jmp run --exporter
test-exporter-hooks` is missing backoff and can spin if the exporter exits
immediately; modify the loop in e2e/tests-hooks.bats (the `while true; do ...
done` that runs `jmp run`) to add a short sleep between iterations (e.g., sleep
1) or implement simple exponential backoff to avoid rapid respawns when startup
fails.
In `@python/packages/jumpstarter-cli-common/jumpstarter_cli_common/oidc.py`:
- Around line 22-27: The discovery SSL context respects SSL_CERT_FILE via
_get_ssl_context(), but token endpoint calls using
authlib.requests_client.OAuth2Session in token_exchange_grant, password_grant,
refresh_token_grant, and authorization_code_grant do not — causing cert
verification failures with private CAs. Resolve the CA bundle path the same way
(os.environ.get("SSL_CERT_FILE") or certifi.where()) and pass it to
OAuth2Session/fetch_token via the verify parameter (either when instantiating
OAuth2Session or on each fetch_token call) so token requests use the same custom
CA bundle as discovery; reference the _get_ssl_context() logic and update the
four grant functions named above to accept/use that resolved ca_file.
In `@python/packages/jumpstarter/jumpstarter/client/lease.py`:
- Around line 239-270: The AioRpcError handler in the Dial retry loop (around
self.controller.Dial in lease.py) currently treats every
non-FAILED_PRECONDITION/"not ready" error as "exporter went offline" and
returns, which silently swallows important errors; change the except AioRpcError
block so that only terminal/gone-like status codes (e.g.,
grpc.StatusCode.UNAVAILABLE, grpc.StatusCode.NOT_FOUND,
grpc.StatusCode.CANCELLED) are handled by logging a warning and returning, while
all other status codes (e.g., UNAUTHENTICATED, PERMISSION_DENIED, INTERNAL) are
re-raised so they surface; retain informative logging using e.code() and
e.details() in the warning and when re-raising include the original exception.
In `@python/packages/jumpstarter/jumpstarter/exporter/hooks.py`:
- Around line 366-380: The timeout block that calls
anyio.to_thread.run_sync(process.wait) inside anyio.move_on_after(5) can hang
because run_sync defaults to abandon_on_cancel=False; change the call in the
cleanup routine (the block that checks process.poll() is None and currently uses
anyio.move_on_after(5) / anyio.to_thread.run_sync(process.wait)) to pass
abandon_on_cancel=True (i.e., anyio.to_thread.run_sync(process.wait,
abandon_on_cancel=True)) so the thread is abandoned when the 5s deadline fires
and the code can proceed to the forced kill; keep the surrounding try/except
behavior unchanged.
🧹 Nitpick comments (12)
e2e/tests-hooks.bats (2)
9-10: Unused variable `CURRENT_HOOKS_CONFIG`.

This variable is initialized but never referenced anywhere in the file. Remove it to avoid confusion.

```diff
-# Track which config is currently active
-CURRENT_HOOKS_CONFIG=""
```
74-78: `go run …@latest` is invoked twice per test start — consider caching the binary.

Each `go run github.com/mikefarah/yq/v4@latest` may re-resolve and potentially re-compile the tool. With two calls per helper and many tests, this adds up. Consider installing `yq` once (e.g., in `setup_file`) and invoking the binary directly.

Example:

```bash
# In setup_file():
GOBIN="${BATS_RUN_TMPDIR}" go install github.com/mikefarah/yq/v4@latest
YQ="${BATS_RUN_TMPDIR}/yq"

# Then in helpers:
"$YQ" -i 'del(.hooks)' /etc/jumpstarter/exporters/test-exporter-hooks.yaml
"$YQ" -i ". * load(\"e2e/exporters/${config_file}\")" /etc/jumpstarter/exporters/test-exporter-hooks.yaml
```

Also applies to: 96-100
python/packages/jumpstarter-cli/jumpstarter_cli/shell.py (2)
157-178: Redundant double timeout: `move_on_after(30)` wraps `wait_for_any_of(..., timeout=30.0)`.

Both enforce the same 30-second limit. `wait_for_any_of` already returns `None` on timeout (handled at line 168+), making the outer `move_on_after` scope and its `cancelled_caught` check (line 177) unreachable in practice. Consider removing one layer.

♻️ Simplified version removing the outer scope:

```diff
-        with anyio.move_on_after(30) as timeout_scope:  # 30 second timeout
-            result = await monitor.wait_for_any_of(
-                [ExporterStatus.AVAILABLE, ExporterStatus.AFTER_LEASE_HOOK_FAILED],
-                timeout=30.0
-            )
-            if result == ExporterStatus.AVAILABLE:
-                logger.info("afterLease hook completed")
-            elif result == ExporterStatus.AFTER_LEASE_HOOK_FAILED:
-                from jumpstarter.common.exceptions import ExporterOfflineError
-                reason = monitor.status_message or "afterLease hook failed"
-                raise ExporterOfflineError(reason)
-            elif monitor.connection_lost:
-                from jumpstarter.common.exceptions import ExporterOfflineError
-                reason = (
-                    monitor.status_message
-                    or "Connection to exporter lost during afterLease hook"
-                )
-                raise ExporterOfflineError(reason)
-            else:
-                logger.debug("Hook completion not confirmed")
-        if timeout_scope.cancelled_caught:
-            logger.warning("Timeout waiting for afterLease hook to complete")
+        result = await monitor.wait_for_any_of(
+            [ExporterStatus.AVAILABLE, ExporterStatus.AFTER_LEASE_HOOK_FAILED],
+            timeout=30.0
+        )
+        if result == ExporterStatus.AVAILABLE:
+            logger.info("afterLease hook completed")
+        elif result == ExporterStatus.AFTER_LEASE_HOOK_FAILED:
+            from jumpstarter.common.exceptions import ExporterOfflineError
+            reason = monitor.status_message or "afterLease hook failed"
+            raise ExporterOfflineError(reason)
+        elif monitor.connection_lost:
+            from jumpstarter.common.exceptions import ExporterOfflineError
+            reason = (
+                monitor.status_message
+                or "Connection to exporter lost during afterLease hook"
+            )
+            raise ExporterOfflineError(reason)
+        else:
+            logger.warning("Timeout waiting for afterLease hook to complete")
```
94-97: Consolidate repeated inline imports of `ExporterOfflineError`.

`ExporterOfflineError` is imported inline at lines 122, 126, 150, 165, and 169 — all within the same function. Move this import to the top of the function alongside the other local imports (lines 94-97) to reduce clutter.

Also applies to: 122-126, 150-150, 165-165, 169-169
python/packages/jumpstarter/jumpstarter/client/status_monitor.py (2)
321-321: Use %-style formatting in logging calls instead of f-strings.

f-strings are evaluated eagerly regardless of log level. While the impact is negligible for `info`/`warning`, using %-style defers formatting to the logging framework and is the idiomatic Python logging pattern.

For example:

```diff
-        logger.warning(f"Missed {missed} status transition(s)")
+        logger.warning("Missed %d status transition(s)", missed)
```

```diff
-        logger.info(f"Status changed: {old_status} -> {new_status} (version={new_version})")
+        logger.info("Status changed: %s -> %s (version=%d)", old_status, new_status, new_version)
```

Same pattern applies to lines 349, 379, and 381.
Also applies to: 332-332, 349-349, 379-379, 381-381
127-204: Significant duplication between `wait_for_status` and `wait_for_any_of`.

The connection-recovery verification block (lines 139-162 vs 218-241) and the inner wait-loop structure are nearly identical, differing only in whether the check is `== target` or `in targets`. Consider extracting the shared logic (connection verification, wait-loop scaffolding with active-waiter tracking and event capture) into a private helper to reduce the maintenance surface.

Also applies to: 206-288
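A sketch of what the extracted recovery helper could look like; the class and method names below are stand-ins, and the real `StatusMonitor` would back `_poll_status` with the `GetStatus` RPC:

```python
import asyncio


class MonitorSketch:
    """Minimal stand-in demonstrating a shared _verify_connection helper."""

    def __init__(self, next_status):
        self._next_status = next_status
        self._connection_lost = True

    async def _poll_status(self):
        # Stand-in for a single status poll; returns None on failure.
        return self._next_status

    async def _verify_connection(self):
        # The duplicated recovery block, written once: poll the exporter and
        # clear the connection-lost flag only when a status actually comes back.
        status = await self._poll_status()
        if status is not None:
            self._connection_lost = False
        return status
```

Both `wait_for_status` and `wait_for_any_of` would then call `await self._verify_connection()` and apply their own `== target` / `in targets` check to the result.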
python/packages/jumpstarter/jumpstarter/exporter/hooks.py (3)
472-486: Polling loop with floating-point accumulation is fine for the 30s scale but could be cleaner.

The `elapsed += interval` pattern accumulates small FP errors over iterations, though it's negligible for a 30-second timeout. A monotonic clock comparison would be more robust and is already used elsewhere in this file (line 241).

Optional: use monotonic clock:

```diff
-        timeout = 30  # seconds
-        interval = 0.1  # seconds
-        elapsed = 0.0
-        while not lease_scope.is_ready():
-            if elapsed >= timeout:
+        import time
+        deadline = time.monotonic() + 30  # seconds
+        interval = 0.1  # seconds
+        while not lease_scope.is_ready():
+            if time.monotonic() >= deadline:
                 error_msg = "Timeout waiting for lease scope to be ready"
                 logger.error(error_msg)
                 await report_status(ExporterStatus.BEFORE_LEASE_HOOK_FAILED, error_msg)
                 lease_scope.before_lease_hook.set()
                 return
             await anyio.sleep(interval)
-            elapsed += interval
```
627-635: `finally` block unconditionally sleeps 1.0s even when no hook ran.

When `after_lease` is not configured, `run_after_lease_hook` returns early at line 579, but the `finally` block still executes the 1.0s sleep and lease release call. This adds unnecessary latency to every lease teardown for exporters without hooks.

Consider guarding the sleep. Proposed change:

```diff
     finally:
-        # Brief delay to allow client's status monitor to poll and see AVAILABLE
-        # before we request lease release (which closes the session)
-        await anyio.sleep(1.0)
-
-        # Request lease release from controller after hook completes (success or failure)
-        # This ensures the lease is always released even if the client disconnects
-        if request_lease_release:
-            await request_lease_release()
+        # Only delay and release if a hook was actually executed
+        if self.config.after_lease:
+            # Brief delay to allow client's status monitor to poll and see AVAILABLE
+            # before we request lease release (which closes the session)
+            await anyio.sleep(1.0)
+
+            # Request lease release from controller after hook completes (success or failure)
+            # This ensures the lease is always released even if the client disconnects
+            if request_lease_release:
+                await request_lease_release()
```
160-168: Consider extracting `_execute_hook_process` into smaller helpers.

At ~250 lines with nested async closures, this method carries significant cognitive load (acknowledged by `noqa: C901`). The PTY setup, output reading, process waiting, and timeout handling could each be separate methods. Not blocking, but worth considering for future maintainability.

python/packages/jumpstarter/jumpstarter/exporter/exporter.py (3)
329-332: Use lazy `%` formatting instead of f-strings in logger calls.

Lines 330 and 332 use f-strings, which evaluate the string unconditionally, even when the log level is disabled. Use %-style formatting for consistency with the rest of the file and to avoid unnecessary string construction.

Proposed fix:

```diff
-            logger.info(f"Updated status to {status}: {message}")
+            logger.info("Updated status to %s: %s", status, message)
         except Exception as e:
-            logger.error(f"Failed to update status: {e}")
+            logger.error("Failed to update status: %s", e)
```

As per coding guidelines, "Use Ruff for code formatting and linting on all Python files except `jumpstarter-protocol`" — Ruff's `G` rules flag f-strings in logging calls.
629-636: Client connection tasks spawned on outer `tg` may outlive the session.

`_handle_client_conn` tasks are started on the outer `tg` (line 629), but the session socket is closed when `session_for_lease` exits (after the finally block at line 670). Any in-flight client connections will get socket errors. This is likely benign since `_handle_client_conn` catches exceptions, but spawning on `conn_tg` instead would give cleaner lifecycle alignment.

Suggested change:

```diff
-                    tg.start_soon(
+                    conn_tg.start_soon(
                         self._handle_client_conn,
                         lease_scope.socket_path,
                         request.router_endpoint,
```
360-382: `_unregister_with_controller` bypasses `_controller_stub` context manager.

This method manually creates and closes a channel (lines 368-380) rather than using `_controller_stub()`. This is intentional — it needs the raw channel for both `_report_status` and `Unregister` — but note that `_report_status` at line 371 internally opens another channel via `_controller_stub`. This means two channels are open simultaneously during unregistration.

Consider passing the existing stub to `_report_status` or inlining the status report here to avoid the extra channel.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
python/packages/jumpstarter/jumpstarter/exporter/exporter.py (1)
683-701: ⚠️ Potential issue | 🟡 Minor

Status stream retry exhaustion would crash `serve()` with an unhandled exception.

If the Status stream fails permanently (retries exhausted in `_retry_stream`), the exception propagates into `tg`, cancelling the entire exporter. The Listen stream is handled similarly but is scoped to a lease. Consider whether the Status stream should have unlimited retries or a more graceful degradation path, since it's the exporter's lifeline to the controller.
🧹 Nitpick comments (5)
python/packages/jumpstarter/jumpstarter/client/lease.py (1)
304-304: Prefer `e.code() == grpc.StatusCode.UNAVAILABLE` for consistency.

`handle_async` (line 252) compares status codes directly via `grpc.StatusCode`, but here it compares via `e.code().name == "UNAVAILABLE"`. Using the enum directly is more robust and consistent.

Suggested fix:

```diff
-        if e.code().name == "UNAVAILABLE":
+        if e.code() == grpc.StatusCode.UNAVAILABLE:
```

python/packages/jumpstarter/jumpstarter/exporter/exporter.py (4)
242-258: Fragile transient-error detection via string matching.

The substring checks `"Stream removed"` and `"UNAVAILABLE"` on line 247 are brittle — they depend on exact gRPC error message wording that can change across versions. Consider checking `grpc.StatusCode` from `RpcError.code()` instead.

Suggested approach:

```diff
-            error_str = str(e)
-            is_transient = "Stream removed" in error_str or "UNAVAILABLE" in error_str
+            is_transient = (
+                isinstance(e, grpc.RpcError)
+                and e.code() == grpc.StatusCode.UNAVAILABLE
+            )
```
330-332: Use %-style formatting in logger calls instead of f-strings.

f-strings are evaluated eagerly regardless of log level. The rest of this file consistently uses %-style formatting for logger calls — these two lines are the only exceptions.

Suggested fix:

```diff
-            logger.info(f"Updated status to {status}: {message}")
+            logger.info("Updated status to %s: %s", status, message)
         except Exception as e:
-            logger.error(f"Failed to update status: {e}")
+            logger.error("Failed to update status: %s", e)
```
366-388: Minor: `_report_status` opens a separate channel inside `_unregister_with_controller`.

Line 377 calls `self._report_status()`, which opens its own channel via `_controller_stub`, while the enclosing code already has a channel open at line 374. This creates two channels when one would suffice. Consider passing the existing stub to avoid the extra channel.
786-788: Magic sleep to prevent SSL corruption is fragile.

The `sleep(0.2)` with a comment about SSL corruption prevention is a timing-based workaround. If the root cause is overlapping gRPC connections on the same socket, the dual-socket approach in `session_for_lease` should already address that. Consider whether this sleep is still necessary, or if it can be replaced with a more deterministic signal (e.g., confirming the previous session's server has fully stopped).
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
python/packages/jumpstarter/jumpstarter/client/lease.py (1)
287-321: ⚠️ Potential issue | 🟡 Minor

`_wait_for_ready_connection` logic is sound overall; one concern about `_last_dial_error` timing.

The method correctly surfaces a stored `_last_dial_error` on the first `AioRpcError`, which works because `handle_async` (triggered by the incoming connection from `client_from_path`) runs and sets the field before returning — closing the server-side connection and causing the client to receive the error.

However, this relies on `handle_async` completing its `Dial` call (including the full retry loop up to `dial_timeout`) before `_wait_for_ready_connection` exhausts its 5 retries × 1s sleep = ~5s budget. With `dial_timeout` defaulting to 30s, if the exporter takes a long time to reject with "hook failed", the readiness check may exhaust retries before `_last_dial_error` is populated, yielding a generic "Max retries reached" error instead of the hook failure message.

Consider either:
- Making
`_wait_for_ready_connection`'s total wait budget at least `dial_timeout` (e.g., deriving retries or total timeout from `self.dial_timeout`), or
`_wait_for_ready_connection` check `_last_dial_error` even in the max-retries-exhausted path.

Minimal fix: check `_last_dial_error` on retry exhaustion:
```diff
             if retries_left > 1:
                 retries_left -= 1
             else:
                 logger.error("Max retries reached while waiting for ready connection at %s", path)
+                if self._last_dial_error:
+                    raise ConnectionError(self._last_dial_error) from e
                 raise ConnectionError("Max retries reached while waiting for ready connection at %s" % path) from e
```

python/packages/jumpstarter-cli/jumpstarter_cli/shell.py (1)
300-310: ⚠️ Potential issue | 🟡 Minor

Local exporter shell exit code is silently discarded.
In the
`ExporterConfigV1Alpha1` branch, `launch_shell(...)` returns an exit code, but it's not passed to `sys.exit()` like the `ClientConfigV1Alpha1` path (line 298). The CLI will always exit 0 regardless of the command's actual result.

🔧 Proposed fix:

```diff
         case ExporterConfigV1Alpha1():
             with config.serve_unix() as path:
                 # SAFETY: the exporter config is local thus considered trusted
-                launch_shell(
+                exit_code = launch_shell(
                     path,
                     "local",
                     allow=[],
                     unsafe=True,
                     use_profiles=False,
                     command=command,
                 )
+                sys.exit(exit_code)
```
🤖 Fix all issues with AI agents
In `@python/packages/jumpstarter-cli/jumpstarter_cli/shell.py`:
- Around line 157-183: The outer anyio.move_on_after(30) is redundant and can
mask error-handling because it may cancel before monitor.wait_for_any_of(...,
timeout=30.0) returns; remove the with anyio.move_on_after(30) block and call
result = await monitor.wait_for_any_of([ExporterStatus.AVAILABLE,
ExporterStatus.AFTER_LEASE_HOOK_FAILED], timeout=30.0) directly, then keep the
existing branches that raise ExporterOfflineError for AFTER_LEASE_HOOK_FAILED
and connection_lost (inspect monitor.current_status and monitor.status_message
as currently done); replace the timeout_scope.cancelled_caught check by treating
result is None as the timeout case and emit the same logger.warning("Timeout
waiting for afterLease hook to complete") when result is None.
🧹 Nitpick comments (6)
python/packages/jumpstarter/jumpstarter/client/status_monitor.py (2)
321-321: Prefer lazy %-formatting in logger calls instead of f-strings.

f-strings are eagerly evaluated even when the log level is suppressed. Lines 321, 332, 349, 379, and 381 all use f-strings. Consider `logger.warning("Missed %d status transition(s)", missed)` style for consistency with the rest of the file (e.g., lines 374-377 already use %-formatting).
218-241: Connection-recovery verification logic is duplicated.

The recovery-poll block in `wait_for_any_of` (lines 218–241) is nearly identical to the one in `wait_for_status` (lines 139–162). Consider extracting a shared helper like `_verify_connection() -> ExporterStatus | None` to reduce duplication.

python/packages/jumpstarter/jumpstarter/client/lease.py (2)
253-253: Fragile string matching on gRPC error details.

`"not ready" in str(e.details())` (line 253) and `"hook failed" in str(e.details())` (line 270) couple the client to the exact wording of server-side error messages. If the server changes the phrasing (e.g., "hook execution failed"), these checks silently stop matching and the error handling degrades.

Consider using gRPC metadata or structured error details (e.g., `google.rpc.Status` with error details) to communicate the error type, or at minimum define these sentinel strings as shared constants between the client and server.

Also applies to: 270-270
312-315: Comparing status codes by `.name` string instead of enum identity.

Line 312 uses `e.code().name == "UNAVAILABLE"`, which works but is less idiomatic and more fragile than comparing the enum directly:

```diff
-        if e.code().name == "UNAVAILABLE":
+        if e.code() == grpc.StatusCode.UNAVAILABLE:
```

The rest of this file (line 253) already uses the enum comparison pattern.
python/packages/jumpstarter-cli/jumpstarter_cli/shell.py (2)
116-119: Hardcoded 300s timeout for beforeLease hook wait.

The exporter config supports a per-hook `timeout`, but the client-side wait uses a hardcoded 300 seconds. If a hook is configured with a longer timeout on the exporter side, the client will give up prematurely. Consider making this configurable or deriving it from the lease/exporter metadata.
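A sketch of deriving the wait from configuration instead; the attribute name and the fallback default are assumptions, not existing API:

```python
# Hypothetical helper: prefer a hook timeout carried on exporter/lease
# metadata, falling back to the current hardcoded 300s default.
DEFAULT_BEFORE_LEASE_WAIT = 300.0


def before_lease_wait(metadata) -> float:
    configured = getattr(metadata, "before_lease_timeout", None)
    return float(configured) if configured else DEFAULT_BEFORE_LEASE_WAIT
```

The client-side wait would then use `timeout=before_lease_wait(lease)` instead of the literal `300`.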
94-97: Consolidate repeated inline imports of `ExporterOfflineError`.

`ExporterOfflineError` is imported five times inside this function (lines 122, 126, 150, 165, 169). Move it to a single import alongside the other imports at the top of the function body (lines 94–97) to reduce noise and help with the C901 complexity.

```diff
     from contextlib import ExitStack

     from jumpstarter.client.client import client_from_path
     from jumpstarter.common import ExporterStatus
+    from jumpstarter.common.exceptions import ExporterOfflineError
```

Also applies to: 122-126, 150-150, 165-165, 169-169
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@python/packages/jumpstarter/jumpstarter/client/lease.py`:
- Line 64: The _last_dial_error attribute on the Lease dataclass is dead state
(set but never read); either remove the field declaration "_last_dial_error: str
| None = field(default=None, init=False)" from the class, or make its value
observable to callers by surfacing it (for example add a read-only property
get_last_dial_error or include it in the Lease.__repr__/to_dict output and
update any Dial/hook failure handling in the method that sets _last_dial_error
to also document/propagate the error). Locate references to _last_dial_error and
the method that assigns it (the Dial/hook failure path) and implement one
consistent approach: delete the unused field and related assignments, or expose
it via a public accessor and update callers to read that accessor where hook
failure information is needed.
In `@python/packages/jumpstarter/jumpstarter/client/status_monitor.py`:
- Around line 374-384: The DEADLINE_EXCEEDED branch that handles 20 consecutive
timeouts sets self._connection_lost and resets self._any_change_event but
forgets to stop the monitor loop; set self._running = False in the same block
where deadline_retries >= 20 (alongside self._connection_lost = True and
resetting self._any_change_event) so the monitor actually stops; make this
change in the method that references deadline_retries and _running (the
GetStatus/monitor loop function) to mirror the UNAVAILABLE handling.
🧹 Nitpick comments (3)
python/packages/jumpstarter/jumpstarter/client/status_monitor.py (1)
390-392: Unhandled gRPC errors and generic exceptions fall through to the sleep-and-retry loop silently.

Line 390 catches any `AioRpcError` that isn't UNIMPLEMENTED/UNAVAILABLE/DEADLINE_EXCEEDED and only does `logger.debug`. Lines 391-392 catch all other exceptions with `logger.debug`. For unexpected errors (e.g., `PERMISSION_DENIED`, `INTERNAL`), the monitor will keep retrying forever with only debug-level logging. Consider logging at `warning` level for unexpected gRPC codes so operators can diagnose issues.

python/packages/jumpstarter/jumpstarter/client/status_monitor_test.py (2)
35-51: `HasField` mock returns a fixed boolean regardless of field name.

`response.HasField.return_value` is set to a single boolean, so `response.HasField("any_field_name")` returns the same value. This works because the production code only calls `HasField("previous_status")`, but it's worth noting as a fragility point if the production code evolves.
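If the production code does grow more `HasField` checks, the mock can dispatch per field name via `side_effect`, for example (the field set here mirrors the current response shape and is an assumption):

```python
from unittest.mock import MagicMock

# Sketch: answer HasField per field name instead of a single fixed boolean,
# so the mock stays correct when new optional fields are checked.
present_fields = {"previous_status"}

response = MagicMock()
response.HasField.side_effect = lambda name: name in present_fields
```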
607-618: `test_context_manager_stops_on_exit` doesn't fully exercise the context manager lifecycle.

Setting `_running = True` manually before entering the context manager causes `start()` (inside `__aenter__`) to early-return (since it checks `if self._running: return`). This test only validates that `__aexit__` → `stop()` sets `_running = False`, not that `__aenter__` properly initializes the polling loop. Consider removing the manual `_running = True` to let the full lifecycle execute.
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@python/packages/jumpstarter/jumpstarter/client/lease.py`:
- Around line 59-64: Remove the unused dataclass field _last_dial_error from the
Lease class and delete any places that assign to it (e.g., the assignment in the
Dial/_dial code path that sets self._last_dial_error); ensure no remaining
references remain (search for "_last_dial_error") and if those assignments were
intended to communicate errors replace them with a direct log or exception
handling in the same function (e.g., log via self.logger or raise) so behavior
and tests remain unchanged.
In `@python/packages/jumpstarter/jumpstarter/exporter/exporter.py`:
- Around line 717-729: The current logic creates a LeaseContext and starts
handle_lease whenever self._lease_context is None and status.lease_name != "",
which can start unnecessary handlers for stale lease names; update the condition
to also require status.leased to be truthy before creating LeaseContext and
calling tg.start_soon. Specifically, guard the block that creates LeaseContext
(symbol: LeaseContext) and starts the task (tg.start_soon(self.handle_lease,
...)) by checking status.leased along with self._lease_context and
status.lease_name so handle_lease is only spawned for active leases.
🧹 Nitpick comments (5)
python/packages/jumpstarter/jumpstarter/client/lease.py (1)
245-248: Mixing `time.monotonic()` with anyio `sleep` is fine, but consider consistency with anyio's clock.

Minor nit: anyio's `current_time()` returns the same monotonic clock and keeps the code within the anyio abstraction. Not a functional issue — `time.monotonic()` works correctly here.

python/packages/jumpstarter/jumpstarter/exporter/exporter.py (4)
242-258: Prefer gRPC status code checks over string matching for transient error detection. String matching on `str(e)` (line 247) is fragile and locale/version-dependent. gRPC errors carry structured status codes that are more reliable:

Suggested approach

```diff
-error_str = str(e)
-is_transient = "Stream removed" in error_str or "UNAVAILABLE" in error_str
+is_transient = (
+    isinstance(e, grpc.aio.AioRpcError)
+    and e.code() == grpc.StatusCode.UNAVAILABLE
+)
```

This also covers "Stream removed" cases since the server-side removal typically surfaces as `UNAVAILABLE`. If you need to handle additional codes (e.g., `INTERNAL`), extend the condition.
330-332: Use lazy `%`-formatting in logger calls instead of f-strings. F-strings are evaluated eagerly even when the log level is disabled, adding unnecessary overhead on hot paths like status reporting.

```diff
-logger.info(f"Updated status to {status}: {message}")
+logger.info("Updated status to %s: %s", status, message)
 except Exception as e:
-    logger.error(f"Failed to update status: {e}")
+    logger.error("Failed to update status: %s", e)
```
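As a standalone illustration of the difference (the `Probe` class below is hypothetical, not code from this PR): the f-string argument is formatted even though the record is discarded, while the `%`-style call defers formatting until a handler actually needs it.

```python
import logging

calls = []

class Probe:
    """Records every time it is formatted into a string."""
    def __str__(self) -> str:
        calls.append("formatted")
        return "probe"

logger = logging.getLogger("lazy-demo")
logger.setLevel(logging.WARNING)        # INFO records are discarded

logger.info(f"status is {Probe()}")     # f-string is evaluated anyway
logger.info("status is %s", Probe())    # never formatted: level is disabled

print(calls)  # → ['formatted']
```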
636-645: Client connection tasks spawned on the outer `tg` can outlive the session. `_handle_client_conn` tasks are started on the outer `tg` (line 638), so they survive `conn_tg` cancellation and may still be running when `session_for_lease` tears down the Unix socket. The `except` in `_handle_client_conn` catches the resulting errors, so this isn't a crash risk, but it means connections aren't drained gracefully. Consider spawning client connections on `conn_tg` instead, so they are cancelled together with the listen stream when the lease ends. If graceful drain is needed, you could add a brief `move_on_after` delay before `conn_tg` cancellation.
366-388: Minor: Double channel creation during unregistration. `_unregister_with_controller` opens a channel explicitly (line 374), then `_report_status` at line 377 opens a second channel via `_controller_stub`. You could call `controller.ReportStatus(...)` directly on the already-open stub to avoid the extra channel.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@python/packages/jumpstarter/jumpstarter/exporter/hooks.py`:
- Around line 536-626: In run_after_lease_hook, wrap the final
request_lease_release() call in a try/except to defensively handle any
exceptions raised during cleanup: catch Exception around the await
request_lease_release() call, log the error with context (e.g., using
logger.error with exc_info=True) and do not re-raise so cleanup failures don’t
propagate out of the finally block; keep the existing delay/ shutdown_called
behavior unchanged.
🧹 Nitpick comments (6)
python/packages/jumpstarter-cli/jumpstarter_cli/shell.py (2)
121-135: Redundant repeated imports of `ExporterOfflineError` within the same function. `ExporterOfflineError` is imported at lines 122, 126, 150, 164, and 168, all inside the same function body. A single import at the top of the function (or at module level) would suffice and reduce clutter.

Also applies to: 148-186
226-238: Exception group handling silently drops non-timeout exceptions when a `TimeoutError` is present. Lines 227-229 iterate the group and raise the first `TimeoutError`, discarding any other exceptions (including a potential `ExporterOfflineError`). If both can co-occur, the `ExporterOfflineError` would be lost. This might be intentional, but worth noting.

python/packages/jumpstarter/jumpstarter/config/exporter.py (1)
166-168: Pre-existing `@classmethod` uses `self` instead of `cls`: confusing but functional. `dump_yaml` is decorated with `@classmethod` but names its first parameter `self` rather than `cls`. The added `by_alias=True` change is correct and necessary for preserving camelCase keys in YAML output. The `self`/`cls` naming issue pre-dates this PR.

python/packages/jumpstarter/jumpstarter/client/core.py (1)
476-487: Per-source loggers lack explicit handler configuration. Loggers like `"exporter:beforeLease"` and `"exporter:afterLease"` are created via `logging.getLogger()` but don't have handlers attached. They rely on propagation to the root logger. If the root logger hasn't been configured with a handler (common in library code), hook log messages at `INFO`/`DEBUG` level may be silently dropped. Consider attaching a `RichHandler` (or at minimum a `NullHandler`) to these loggers on first use, or documenting that the consuming application must configure root-level logging.

python/packages/jumpstarter/jumpstarter/exporter/hooks.py (1)
448-534: `request_lease_release` parameter is accepted but never used in `run_before_lease_hook`. The parameter is passed on line 453 and accepted from callers (exporter.py line 744), but the method body never invokes it. The `endLease` failure path (line 522) explicitly comments "No request_lease_release." If this is intentional for API symmetry with `run_after_lease_hook`, consider documenting it in the docstring or removing the parameter to avoid confusion.

python/packages/jumpstarter/jumpstarter/exporter/exporter.py (1)
313-332: Consider using `%`-style formatting in logger calls instead of f-strings. Lines 330 and 332 use f-strings in logger calls (`logger.info(f"Updated status to {status}: {message}")`). F-strings are evaluated eagerly regardless of log level, while `%`-style formatting (`logger.info("Updated status to %s: %s", status, message)`) defers evaluation. This pattern appears throughout the PR. Minor performance concern but aligns with Python logging best practices.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@python/packages/jumpstarter/jumpstarter/exporter/hooks.py`:
- Around line 372-378: The current sync call process.wait() after process.kill()
blocks the async event loop; change it to the async form (await process.wait())
so it matches the earlier async wait usage and doesn't block. Locate the block
around process.kill() / process.wait() in hooks.py (the code using the process
variable) and replace the synchronous wait with an awaited wait, keeping the
surrounding try/except behavior (catching exceptions from the awaited wait) so
the kill-and-wait path remains non-blocking.
🧹 Nitpick comments (1)
python/packages/jumpstarter/jumpstarter/exporter/hooks.py (1)
448-454: `request_lease_release` parameter is accepted but never used. The `request_lease_release` parameter on `run_before_lease_hook` is never called within the method body. If this is intentionally kept for signature consistency with `run_after_lease_hook`, consider documenting that intent. Otherwise, it's dead code.
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@python/packages/jumpstarter/jumpstarter/exporter/hooks.py`:
- Around line 390-399: The finally block unconditionally closes master_fd and
slave_fd which can double-close an fd if slave_fd was already closed earlier;
instead consult the fds_open tracking dict (used earlier in this module) to only
close descriptors still marked open—check fds_open.get(master_fd) and
fds_open.get(slave_fd) (or remove keys after close) before calling os.close, and
update fds_open accordingly to avoid closing a reused unrelated fd; adjust the
cleanup in the finally block around the existing master_fd/slave_fd close calls
to use fds_open checks and safe removal.
- Around line 118-120: The call passes lease_scope.session (typed Session |
None) into _execute_hook_process which requires a non-optional logging_session,
so add an explicit guard in the public entry points (execute_before_lease_hook
and execute_after_lease_hook) to validate lease_scope.session is not None (or
call lease_scope.is_ready() where appropriate) and raise a clear ValueError (or
return early) if missing; alternatively, coalesce to an explicit default Session
only after ensuring availability and then pass that non-None session into
_execute_hook_process, keeping references to _execute_hook_process,
lease_scope.session, execute_before_lease_hook, execute_after_lease_hook,
run_before_lease_hook and is_ready() to locate the change.
🧹 Nitpick comments (5)
python/packages/jumpstarter/jumpstarter/client/status_monitor.py (3)
127-204: Significant duplication between `wait_for_status` and `wait_for_any_of`. The connection-recovery preamble (lines 141–162 vs 220–241), the inner wait loop with double-check pattern, and the timeout wrapper are nearly identical between these two methods. Consider extracting:

- A shared `_verify_connection` helper for the recovery block.
- A generic wait loop parameterized by a match predicate (`lambda status: status == target` vs `lambda status: status in targets`).

This would halve the surface area for future bugs in the wait/recovery logic.

Also applies to: 206-288
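A standalone sketch of that refactor, using `asyncio.Event` in place of the monitor's real primitives; `StatusWaiter` and `set_status` are illustrative names, not the PR's API:

```python
import asyncio

class StatusWaiter:
    """One predicate-driven wait loop; the public waiters are thin wrappers."""

    def __init__(self) -> None:
        self._status = "OFFLINE"
        self._changed = asyncio.Event()

    def set_status(self, status: str) -> None:
        self._status = status
        self._changed.set()
        self._changed = asyncio.Event()  # fresh event for the next transition

    async def _wait_until(self, match) -> str:
        # double-check pattern: re-test the status after every wake-up
        while not match(self._status):
            await self._changed.wait()
        return self._status

    async def wait_for_status(self, target: str) -> str:
        return await self._wait_until(lambda s: s == target)

    async def wait_for_any_of(self, targets: set[str]) -> str:
        return await self._wait_until(lambda s: s in targets)

async def demo() -> str:
    waiter = StatusWaiter()
    task = asyncio.ensure_future(waiter.wait_for_status("LEASE_READY"))
    await asyncio.sleep(0)                   # let the waiter block
    waiter.set_status("BEFORE_LEASE_HOOK")   # no match, keeps waiting
    await asyncio.sleep(0)
    waiter.set_status("LEASE_READY")
    return await task

print(asyncio.run(demo()))  # → LEASE_READY
```

The connection-recovery preamble would become a similar shared helper called at the top of `_wait_until`.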
321-321: Use lazy `%`-formatting in logger calls instead of f-strings. Lines 321, 332, 349, 392, 394 use f-strings inside `logger.warning`/`logger.info`/`logger.debug`/`logger.error`. This eagerly interpolates the string even when the log level is disabled.

Example fix for a couple of lines

```diff
-logger.warning(f"Missed {missed} status transition(s)")
+logger.warning("Missed %d status transition(s)", missed)
 ...
-logger.info(f"Status changed: {old_status} -> {new_status} (version={new_version})")
+logger.info("Status changed: %s -> %s (version=%d)", old_status, new_status, new_version)
 ...
-logger.error(f"Status change callback error: {e}")
+logger.error("Status change callback error: %s", e)
 ...
-logger.debug(f"GetStatus poll error: {e.code()}")
+logger.debug("GetStatus poll error: %s", e.code())
 ...
-logger.debug(f"GetStatus poll error: {e}")
+logger.debug("GetStatus poll error: %s", e)
```

Also applies to: 332-332, 349-349, 392-394
393-394: Bare `except Exception` silently swallows unexpected errors at DEBUG level. Programming errors (e.g., `TypeError`, `AttributeError`) in the poll loop will be logged at `debug` and silently retried forever. Consider logging at `warning` with `exc_info=True` so unexpected failures are visible in normal operation.

```diff
 except Exception as e:
-    logger.debug(f"GetStatus poll error: {e}")
+    logger.warning("Unexpected GetStatus poll error: %s", e, exc_info=True)
```

python/packages/jumpstarter/jumpstarter/exporter/exporter.py (2)
338-340: Use lazy `%s` formatting in logger calls instead of f-strings. F-strings are evaluated eagerly even when the log level is disabled. Minor performance concern on hot paths like status reporting.

```diff
-logger.info(f"Updated status to {status}: {message}")
+logger.info("Updated status to %s: %s", status, message)
 except Exception as e:
-    logger.error(f"Failed to update status: {e}")
+    logger.error("Failed to update status: %s", e)
```
640-653: Client connection tasks outlive the session scope. Tasks spawned at line 646 via `tg.start_soon` (outer task group) will continue running after `handle_lease` exits and `session_for_lease` closes the Unix socket. These tasks will encounter connection errors and exit via the warning at line 437, which is graceful but not clean. Consider spawning them on `conn_tg` instead of `tg` so they're cancelled with the lease, or document this as intentional behavior.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@e2e/tests-hooks.bats`:
- Around line 367-411: The tests "hooks F2: .py file auto-detects Python and
uses driver API" and "hooks F3: script as file path executes the file" create
temp files (/tmp/jumpstarter-e2e-hook-python.py and
/tmp/jumpstarter-e2e-hook-script.sh) that can leak when assertions fail because
rm -f at the end won't run; fix by writing those files into $BATS_TEST_TMPDIR
instead of /tmp and updating the corresponding exporter overlays
(exporter-hooks-exec-python.yaml and exporter-hooks-exec-script-file.yaml) to
reference the test-temp paths, or add a global teardown() that removes both
/tmp/jumpstarter-e2e-hook-python.py and /tmp/jumpstarter-e2e-hook-script.sh to
guarantee cleanup even on failures.
🧹 Nitpick comments (7)
e2e/tests-hooks.bats (2)
9-10: Unused variable `CURRENT_HOOKS_CONFIG`. This variable is declared but never read or written elsewhere in the file. Remove it to avoid confusion.

```diff
-# Track which config is currently active
-CURRENT_HOOKS_CONFIG=""
```
68-106: Extract shared config-merge logic to reduce duplication. Lines 74-78 and 96-100 are identical. A small helper would keep both callers DRY and reduce the chance of them drifting apart.

Proposed refactor

```diff
+# Helper: Prepare exporter config by clearing hooks and merging overlay
+prepare_hooks_config() {
+    local config_file="$1"
+    go run github.com/mikefarah/yq/v4@latest -i 'del(.hooks)' \
+        /etc/jumpstarter/exporters/test-exporter-hooks.yaml
+    go run github.com/mikefarah/yq/v4@latest -i ". * load(\"e2e/exporters/${config_file}\")" \
+        /etc/jumpstarter/exporters/test-exporter-hooks.yaml
+}
+
 start_hooks_exporter() {
     local config_file="$1"
     stop_hooks_exporter
-    go run github.com/mikefarah/yq/v4@latest -i 'del(.hooks)' \
-        /etc/jumpstarter/exporters/test-exporter-hooks.yaml
-    go run github.com/mikefarah/yq/v4@latest -i ". * load(\"e2e/exporters/${config_file}\")" \
-        /etc/jumpstarter/exporters/test-exporter-hooks.yaml
+    prepare_hooks_config "$config_file"
     ...
 }

 start_hooks_exporter_single() {
     local config_file="$1"
     stop_hooks_exporter
-    go run github.com/mikefarah/yq/v4@latest -i 'del(.hooks)' \
-        /etc/jumpstarter/exporters/test-exporter-hooks.yaml
-    go run github.com/mikefarah/yq/v4@latest -i ". * load(\"e2e/exporters/${config_file}\")" \
-        /etc/jumpstarter/exporters/test-exporter-hooks.yaml
+    prepare_hooks_config "$config_file"
     ...
 }
```
176-177: Line likely exceeds 120-char limit (Ruff E501). Line 177 is a single long expression. Consider breaking it for readability and linting compliance.

Suggested fix

```diff
 @classmethod
 def dump_yaml(self, config: Self) -> str:
-    return yaml.safe_dump(config.model_dump(mode="json", by_alias=True, exclude={"alias", "path"}), sort_keys=False)
+    return yaml.safe_dump(
+        config.model_dump(mode="json", by_alias=True, exclude={"alias", "path"}),
+        sort_keys=False,
+    )
```

As per coding guidelines, `python/**/*.py`: Use Ruff for code formatting and linting.
187-188: Same line-length issue in `save`.

Suggested fix

```diff
 with config.path.open(mode="w") as f:
-    yaml.safe_dump(config.model_dump(mode="json", by_alias=True, exclude={"alias", "path"}), f, sort_keys=False)
+    yaml.safe_dump(
+        config.model_dump(mode="json", by_alias=True, exclude={"alias", "path"}),
+        f,
+        sort_keys=False,
+    )
```

As per coding guidelines, `python/**/*.py`: Use Ruff for code formatting and linting.

python/packages/jumpstarter/jumpstarter/exporter/hooks_test.py (1)
20-37: Fixture `lease_scope` doesn't set `hook_socket_path`, which triggers a warning log in every test. Every test using this fixture will hit the `logger.warning("No dedicated hook socket available...")` path in `_create_hook_env`. This is fine for testing the fallback, but if you want cleaner test output or want to also cover the dedicated hook socket path, consider adding a test or parameterizing the fixture.
503-517: Polling loop uses floating-point accumulation which can drift. The `elapsed += interval` pattern accumulates floating-point rounding errors over 300 iterations (30s / 0.1s). While unlikely to cause issues at this scale, using `time.monotonic()` would be more robust and consistent with the pattern used in `read_pty_output`.

Suggested improvement

```diff
-timeout = 30  # seconds
-interval = 0.1  # seconds
-elapsed = 0.0
+import time
+deadline = time.monotonic() + 30  # seconds
+interval = 0.1  # seconds
 while not lease_scope.is_ready():
-    if elapsed >= timeout:
+    if time.monotonic() >= deadline:
         error_msg = "Timeout waiting for lease scope to be ready"
         logger.error(error_msg)
         await report_status(ExporterStatus.BEFORE_LEASE_HOOK_FAILED, error_msg)
         lease_scope.before_lease_hook.set()
         return
     await anyio.sleep(interval)
-    elapsed += interval
```
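The deadline pattern generalizes; here is a tiny standalone sketch (names are illustrative, not from the PR) of polling against a monotonic deadline instead of accumulating `elapsed += interval`:

```python
import time

def poll_until(predicate, timeout: float = 1.0, interval: float = 0.01) -> bool:
    """Poll predicate() until it is true or the monotonic deadline passes."""
    deadline = time.monotonic() + timeout
    while not predicate():
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
    return True

start = time.monotonic()
print(poll_until(lambda: time.monotonic() - start > 0.05))  # → True
print(poll_until(lambda: False, timeout=0.05))              # → False
```

Because the deadline is computed once, no rounding error accumulates regardless of how many iterations run.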
166-236: Inline imports of `pty`, `subprocess`, `fcntl`, `sys`, `time` inside the method body. These standard-library modules are imported inside the method. While `pty` and `fcntl` are Linux-only (justifying lazy import for portability), `subprocess`, `sys`, and `time` are universal. Consider moving the universal ones to the top of the file for clarity.
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@e2e/tests-hooks.bats`:
- Around line 280-296: In the test "@test \"hooks C2: afterLease onFailure=exit
shuts down exporter\"" replace the strict assert_output --partial "afterLease
hook failed" with a regexp alternation to handle the race (e.g. assert_output
--regexp "(afterLease hook failed|Connection to exporter lost)"); update the
assertion in that test block so it accepts either the hook-failed message or the
connection-drop message to avoid flaky failures when the exporter exits before
the status reaches the client.
In `@python/packages/jumpstarter/jumpstarter/exporter/hooks.py`:
- Around line 636-652: The code reports ExporterStatus.AFTER_LEASE_HOOK_FAILED
but comments say the exporter remains available; fix the inconsistency by either
(A) updating the comments/logs around the afterLease error handling (the except
blocks that call report_status with ExporterStatus.AFTER_LEASE_HOOK_FAILED) to
state the exporter is now offline/unavailable (matching enums.py, exporter.py
terminal list and client behavior), or (B) if the exporter should remain
available, change the reported status to a non-terminal/recoverable status
(e.g., add a new ExporterStatus value like AFTER_LEASE_HOOK_FAILED_RECOVERABLE
in enums.py, update exporter.py terminal-status checks and shell.py client
handling, and use that new status in the report_status calls); locate and update
the two places calling report_status(...) with
ExporterStatus.AFTER_LEASE_HOOK_FAILED in hooks.py and make corresponding
enum/terminal-list/client changes if choosing option B.
🧹 Nitpick comments (7)
e2e/tests-hooks.bats (4)
9-10: Unused variable `CURRENT_HOOKS_CONFIG`. This variable is declared but never read or written anywhere in the file. Remove it to avoid confusion.

```diff
-# Track which config is currently active
-CURRENT_HOOKS_CONFIG=""
```
12-33: Consider adding idempotency guards to `setup_file` provisioning. If a previous run was interrupted (e.g., Ctrl-C skipping `teardown_file`), the `jmp admin create` commands will fail because the resources already exist. Appending `|| true` (or using an `ensure_*` pattern that checks existence first) would make re-runs safer.

```diff
-    jmp admin create client -n "${JS_NAMESPACE}" test-client-hooks --unsafe --out "${BATS_RUN_TMPDIR}/test-client-hooks.yaml" \
-        --oidc-username dex:test-client-hooks
+    jmp admin create client -n "${JS_NAMESPACE}" test-client-hooks --unsafe --out "${BATS_RUN_TMPDIR}/test-client-hooks.yaml" \
+        --oidc-username dex:test-client-hooks || true
-    jmp admin create exporter -n "${JS_NAMESPACE}" test-exporter-hooks --out "${BATS_RUN_TMPDIR}/test-exporter-hooks.yaml" \
-        --oidc-username dex:test-exporter-hooks \
-        --label example.com/board=hooks
+    jmp admin create exporter -n "${JS_NAMESPACE}" test-exporter-hooks --out "${BATS_RUN_TMPDIR}/test-exporter-hooks.yaml" \
+        --oidc-username dex:test-exporter-hooks \
+        --label example.com/board=hooks || true
```

Based on learnings: E2E tests in bats should be runnable in isolation. Each test (or setup/teardown blocks) should explicitly prepare its prerequisites, reset global state, and ensure deterministic results.
92-131: Extract shared config-reset logic into a helper to reduce duplication. `start_hooks_exporter` and `start_hooks_exporter_single` duplicate the same 4-line yq clear-and-merge block. A small helper would keep changes in one place.

Suggested refactor

```diff
+# Helper: Reset exporter config and merge overlay
+apply_hooks_config() {
+    local config_file="$1"
+    go run github.com/mikefarah/yq/v4@latest -i 'del(.hooks)' \
+        /etc/jumpstarter/exporters/test-exporter-hooks.yaml
+    go run github.com/mikefarah/yq/v4@latest -i ". * load(\"e2e/exporters/${config_file}\")" \
+        /etc/jumpstarter/exporters/test-exporter-hooks.yaml
+}
+
 start_hooks_exporter() {
     local config_file="$1"
     stop_hooks_exporter
-    go run github.com/mikefarah/yq/v4@latest -i 'del(.hooks)' \
-        /etc/jumpstarter/exporters/test-exporter-hooks.yaml
-    go run github.com/mikefarah/yq/v4@latest -i ". * load(\"e2e/exporters/${config_file}\")" \
-        /etc/jumpstarter/exporters/test-exporter-hooks.yaml
+    apply_hooks_config "$config_file"
     ...
 }

 start_hooks_exporter_single() {
     local config_file="$1"
     stop_hooks_exporter
-    go run github.com/mikefarah/yq/v4@latest -i 'del(.hooks)' \
-        /etc/jumpstarter/exporters/test-exporter-hooks.yaml
-    go run github.com/mikefarah/yq/v4@latest -i ". * load(\"e2e/exporters/${config_file}\")" \
-        /etc/jumpstarter/exporters/test-exporter-hooks.yaml
+    apply_hooks_config "$config_file"
     ...
 }
```
339-350: E3: `client=` regex doesn't verify a non-empty value. `[[ "$output" =~ client= ]]` would pass even if `CLIENT_NAME` is empty. Consider tightening:

```diff
-    [[ "$output" =~ client= ]]
+    [[ "$output" =~ client=[^[:space:]]+ ]]
```

python/packages/jumpstarter/jumpstarter/exporter/hooks.py (3)
22-46: `Exception.__init__` is never called, so `e.args` is empty. `@dataclass` generates `__init__` but doesn't call `super().__init__(message)`. This means `e.args` will be `()`, which can confuse logging frameworks and traceback formatters that rely on `args[0]` for the error message.

Suggested fix

```diff
 @dataclass
 class HookExecutionError(Exception):
     ...
     message: str
     on_failure: Literal["endLease", "exit"]
     hook_type: Literal["before_lease", "after_lease"]

+    def __post_init__(self):
+        super().__init__(self.message)
+
     def __str__(self) -> str:
         return self.message
```
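A minimal standalone reproduction of the gotcha (field list trimmed; not the PR's actual class), with the exceptions constructed via keyword arguments as dataclasses normally are:

```python
from dataclasses import dataclass

@dataclass
class BrokenError(Exception):
    # dataclass-generated __init__ never calls Exception.__init__,
    # so args stays empty
    message: str

@dataclass
class FixedError(Exception):
    message: str

    def __post_init__(self):
        # forward the message so args[0] is populated
        super().__init__(self.message)

print(BrokenError(message="boom").args)  # → ()
print(FixedError(message="boom").args)   # → ('boom',)
```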
335-364: `wait_for_process` finally block's async cleanup is unreliable under cancellation. When `move_on_after(timeout)` fires, the task group is cancelled, and `wait_for_process`'s `finally` block runs. However, `await anyio.sleep(0.1)` (line 356) and `await anyio.to_thread.run_sync(...)` (line 362) will raise `Cancelled` immediately in anyio unless shielded. The graceful terminate→poll→kill→reap sequence won't execute. This isn't a bug because lines 396–411 duplicate the cleanup for the timeout path, but it means the finally block in `wait_for_process` only works for non-timeout cancellations (e.g., parent task cancellation). Consider wrapping the finally body in `CancelScope(shield=True)` so the cleanup is reliable in all cancellation scenarios, or add a comment documenting that timeout cleanup is handled externally.
509-520: Busy-wait polling loop could be replaced with an `anyio.Event`. The 0.1s polling loop to wait for `lease_scope.is_ready()` works but is a busy-wait pattern. If `LeaseContext` could expose an `anyio.Event` that's set when it becomes ready, this would be cleaner and more efficient. Low priority since 0.1s interval is acceptable.
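Sketch of the event-driven alternative, shown with `asyncio.Event`, which has the same `set()`/`wait()` surface as `anyio.Event`; `ReadySignal` is an illustrative stand-in, not the PR's `LeaseContext`:

```python
import asyncio

class ReadySignal:
    def __init__(self) -> None:
        self._ready = asyncio.Event()

    def mark_ready(self) -> None:
        self._ready.set()

    async def wait_ready(self, timeout: float = 30.0) -> bool:
        # event-driven wait replaces the 0.1s polling loop
        try:
            await asyncio.wait_for(self._ready.wait(), timeout)
            return True
        except asyncio.TimeoutError:
            return False

async def demo() -> tuple[bool, bool]:
    signal = ReadySignal()
    asyncio.get_running_loop().call_later(0.01, signal.mark_ready)
    became_ready = await signal.wait_ready(timeout=1.0)
    timed_out = await ReadySignal().wait_ready(timeout=0.01)
    return became_ready, timed_out

print(asyncio.run(demo()))  # → (True, False)
```

Waiters wake exactly when readiness flips, with no latency/CPU trade-off to tune.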
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@python/packages/jumpstarter/jumpstarter/exporter/hooks.py`:
- Around line 388-439: The parent_fd is being left unclosed because
fds_open["parent"] is used both to signal read_pty_output to stop and to track
whether parent_fd should be closed; when you set fds_open["parent"]=False to
stop the reader the finally block skips closing parent_fd. Fix by introducing a
separate stop flag (e.g., reader_stop or reader_stop["flag"]) used only by
read_pty_output/read loop and keep fds_open["parent"] solely for tracking
whether parent_fd is open; update read_pty_output to check reader_stop instead
of fds_open, set reader_stop["flag"]=True where currently fds_open["parent"] is
set to False, and ensure the finally block always checks fds_open["parent"] to
close parent_fd (and child_fd) if still open.
🧹 Nitpick comments (3)
e2e/tests-hooks.bats (3)
9-10: Unused variable `CURRENT_HOOKS_CONFIG`. This variable is declared but never read or written anywhere else in the file. Remove it to avoid confusion.

Proposed fix

```diff
-# Track which config is currently active
-CURRENT_HOOKS_CONFIG=""
```
92-131: Duplicated config-preparation logic between `start_hooks_exporter` and `start_hooks_exporter_single`. The `del(.hooks)` + merge sequence (lines 98–102 and 121–125) is identical. Extract it into a small helper to keep changes in one place.

Proposed refactor

```diff
+# Helper: Prepare exporter config by clearing hooks and merging overlay
+prepare_hooks_config() {
+    local config_file="$1"
+    go run github.com/mikefarah/yq/v4@latest -i 'del(.hooks)' \
+        /etc/jumpstarter/exporters/test-exporter-hooks.yaml
+    go run github.com/mikefarah/yq/v4@latest -i ". * load(\"e2e/exporters/${config_file}\")" \
+        /etc/jumpstarter/exporters/test-exporter-hooks.yaml
+}
+
 start_hooks_exporter() {
     local config_file="$1"
     stop_hooks_exporter
-    go run github.com/mikefarah/yq/v4@latest -i 'del(.hooks)' \
-        /etc/jumpstarter/exporters/test-exporter-hooks.yaml
-    go run github.com/mikefarah/yq/v4@latest -i ". * load(\"e2e/exporters/${config_file}\")" \
-        /etc/jumpstarter/exporters/test-exporter-hooks.yaml
+    prepare_hooks_config "$config_file"
     ...
 }

 start_hooks_exporter_single() {
     local config_file="$1"
     stop_hooks_exporter
-    go run github.com/mikefarah/yq/v4@latest -i 'del(.hooks)' \
-        /etc/jumpstarter/exporters/test-exporter-hooks.yaml
-    go run github.com/mikefarah/yq/v4@latest -i ". * load(\"e2e/exporters/${config_file}\")" \
-        /etc/jumpstarter/exporters/test-exporter-hooks.yaml
+    prepare_hooks_config "$config_file"
     ...
 }
```
12-33: `setup_file` create commands will fail if resources already exist from a prior aborted run. If a previous test run didn't clean up (e.g., CI killed mid-run), the `jmp admin create` commands will fail since the resources already exist, blocking the entire suite. Consider adding `|| true` or a pre-cleanup step.

Proposed fix: delete before create for idempotency

```diff
 setup_file() {
     # Initialize the PIDs file at the start of all tests
     echo "" > "$HOOKS_EXPORTER_PIDS_FILE"

+    # Clean up any leftover resources from prior runs
+    jmp admin delete client --namespace "${JS_NAMESPACE}" test-client-hooks --delete 2>/dev/null || true
+    jmp admin delete exporter --namespace "${JS_NAMESPACE}" test-exporter-hooks --delete 2>/dev/null || true
+
     # Create client and exporter for hooks tests
     jmp admin create client -n "${JS_NAMESPACE}" test-client-hooks --unsafe --out "${BATS_RUN_TMPDIR}/test-client-hooks.yaml" \
         --oidc-username dex:test-client-hooks
```

Based on learnings: E2E tests in bats should be runnable in isolation. Each test (or setup/teardown blocks) should explicitly prepare its prerequisites, reset global state, and ensure deterministic results.
Container Images

The following container images have been built for this PR. Images expire after 7 days.

@evakhoni I have taken the test cases you presented above and added them to the e2e tests. Hopefully all of those issues are resolved now that the e2e tests are passing.
Summary
Implement hook scripts that can be configured in the exporter config to run `j` commands within a temporary lease environment. This PR combines changes across Python, Controller, and Protocol components.

Environment Variables

The following environment variables are exposed to hook scripts:

- `JUMPSTARTER_HOST` - Path to the Unix socket for `j` CLI access
- `JMP_DRIVERS_ALLOW=UNSAFE` - Allows all drivers for local access
- `LEASE_NAME` - The name of the current lease
- `CLIENT_NAME` - Name of the client that acquired the lease

Example Configuration
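The example block itself did not survive extraction; as a purely hypothetical sketch reconstructed from fields the review and e2e tests mention (`hooks`, `beforeLease`/`afterLease`, `onFailure` values `endLease`, `exit`, `warn`), an exporter config might look like:

```yaml
# Hypothetical sketch only: key names below are inferred from the review
# text and e2e overlay names, not copied from the real schema.
hooks:
  beforeLease:
    # runs before the client gains driver access
    script: |
      j power on
    onFailure: endLease
  afterLease:
    # runs after the session ends
    script: |
      j power off
    onFailure: warn
```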
Changes
Python (`python/`)

- … (`exit`, `warn`)
- … (`jmp admin get exporter`)

Controller (`controller/`)

Protocol (`protocol/`)

- `release_lease` field for exporter-initiated lease release
- `status_version` and `previous_status` fields for transition tracking

Related PRs
Summary by CodeRabbit
New Features
Bug Fixes
Chores
Tests
Documentation