-
-
Notifications
You must be signed in to change notification settings - Fork 52
feat: Implement Smart Retry Logic with Exponential Backoff for Installations #658
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
Note Other AI code review bot(s) detectedCodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review. 📝 WalkthroughWalkthroughAdds a SmartRetry utility and integrates retry-enabled command execution into InstallationCoordinator (new max_retries parameter, default 5). CLI passes max_retries=5 for sequential installs. Tests and docs updated; tests patch time.sleep. Changes
Sequence Diagram(s)sequenceDiagram
participant CLI as CLI (cortex/cli.py)
participant Coordinator as InstallationCoordinator
participant SmartRetry as SmartRetry
participant Subprocess as subprocess.run
participant Logger as Logger
CLI->>Coordinator: instantiate(..., max_retries=5)
Coordinator->>SmartRetry: create(max_retries, backoff_factor, status_callback)
loop for each command
Coordinator->>SmartRetry: run(execute_command)
SmartRetry->>Subprocess: execute command (subprocess.run)
alt returncode == 0
Subprocess-->>SmartRetry: success result
SmartRetry-->>Coordinator: return result
else transient error (_should_retry → true)
Subprocess-->>SmartRetry: failed result
SmartRetry->>Logger: log retry attempt / status_callback
SmartRetry->>SmartRetry: sleep(backoff_factor * 2^(attempt-1))
SmartRetry->>Subprocess: retry execute command
else permanent error (_should_retry → false)
Subprocess-->>SmartRetry: failed result
SmartRetry-->>Coordinator: return failed result
end
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Summary of ChangesHello @KrishnaShuk, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances the robustness of the installation process by introducing a smart retry mechanism with exponential backoff. This system is designed to automatically re-attempt installation commands that fail due to temporary issues, such as network glitches or resource contention, while immediately failing on permanent errors. This will lead to more reliable and self-healing installations, reducing manual intervention. Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
|
Working on it! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces a robust smart retry mechanism with exponential backoff for installation processes, significantly improving the resilience of command execution. A new SmartRetry class has been added, which intelligently distinguishes between transient and permanent errors using ErrorParser to decide whether to retry. The InstallationCoordinator has been updated to integrate this new retry logic, and corresponding unit tests for SmartRetry have been added. While the core retry logic is well-implemented, some existing tests for InstallationCoordinator need to be updated to fully reflect and verify the new retry behavior. Additionally, the handling of subprocess.TimeoutExpired could be more explicit within the retry logic to prevent unnecessary retries.
CLA Verification PassedAll contributors have signed the CLA.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR implements smart retry logic with exponential backoff for installation commands to handle transient errors more gracefully. The implementation adds a new SmartRetry class that uses error classification to distinguish between retryable transient errors (network issues, locks) and permanent failures (permission denied, missing packages).
Changes:
- Added
SmartRetryclass with exponential backoff and error categorization - Integrated retry logic into
InstallationCoordinatorfor command execution - Added comprehensive test coverage for retry scenarios
- Added
max_retriesparameter to coordinator initialization
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| cortex/utils/retry.py | New retry handler with exponential backoff and error classification |
| cortex/coordinator.py | Integration of retry logic into command execution pipeline |
| tests/test_retry.py | Comprehensive test suite for retry behavior |
| tests/test_coordinator.py | Updated tests to mock time.sleep for retry integration |
| cortex/cli.py | Added max_retries parameter to coordinator instantiation |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
cortex/cli.py (1)
1037-1044: Add--max-retriesCLI flag to make retry attempts user-configurable.The CLI hard-codes
max_retries=5when instantiatingInstallationCoordinator, and there is no existing--max-retriesflag or configuration mechanism to override it. Users cannot customize retry behavior for installation steps. Add a--max-retriesargument toinstall_parser(default 5) and wire it through the install method toInstallationCoordinator.🛠️ Suggested wiring
- coordinator = InstallationCoordinator( + coordinator = InstallationCoordinator( commands=commands, descriptions=[f"Step {i + 1}" for i in range(len(commands))], timeout=300, stop_on_error=True, progress_callback=progress_callback, - max_retries=5, + max_retries=max_retries, )Update install method signature:
def install( self, software: str, execute: bool = False, dry_run: bool = False, parallel: bool = False, + max_retries: int = 5, json_output: bool = False, ):Add to install_parser:
+install_parser.add_argument( + "--max-retries", + type=int, + default=5, + help="Maximum retry attempts for installation steps (default: 5)", +)Update dispatch call:
elif args.command == "install": return cli.install( args.software, execute=args.execute, dry_run=args.dry_run, parallel=args.parallel, + max_retries=args.max_retries, )
🤖 Fix all issues with AI agents
In `@cortex/coordinator.py`:
- Around line 191-199: status_callback currently only calls self._log (which
writes to optional log_file) so retry updates are silent in the default CLI;
modify status_callback used for SmartRetry to also emit CLI-visible output
(e.g., call an existing CLI progress callback or print to stdout) so retry
progress is visible to users. Locate status_callback and update it to call both
self._log(msg) and the CLI-facing notifier (e.g., self._progress_callback(msg)
or a console print) before passing it into SmartRetry.
In `@cortex/utils/retry.py`:
- Around line 17-26: Validate inputs in the __init__ method: ensure max_retries
is an int >= 0 and backoff_factor is a float >= 0.0, and raise a clear
ValueError if either is negative or of the wrong type to fail fast (this
prevents negative durations passed to time.sleep later). Set self.max_retries
and self.backoff_factor only after validation; keep self.status_callback and
self.error_parser initialization unchanged.
- Around line 94-115: The current logic retries by default for unlisted
categories causing permanent errors (e.g., DISK_SPACE) to be retried; update the
retry decision in the function that calls error_parser.parse_error (uses
analysis = self.error_parser.parse_error(error_message) and reads
analysis.primary_category) to treat non-fixable errors as permanent: first check
analysis.is_fixable and immediately return False when is_fixable is False, then
keep the existing retry-on-transient list (ErrorCategory.NETWORK_ERROR,
LOCK_ERROR, UNKNOWN) and the permanent list (add DISK_SPACE plus other permanent
categories like PERMISSION_DENIED, PACKAGE_NOT_FOUND, CONFIGURATION_ERROR,
DEPENDENCY_MISSING, CONFLICT). This ensures permanent categories are not retried
while preserving current transient retry behavior.
- Around line 63-66: In the except block that sets last_exception and calls
self._should_retry (the block capturing "except Exception as e"), replace "raise
e" with a bare "raise" so the original traceback is preserved; locate the
handler that assigns last_exception and invokes self._should_retry(str(e)) and
change the re-raise to a bare raise to keep the original stack trace for
debugging.
🧹 Nitpick comments (1)
tests/test_retry.py (1)
9-13: Add explicit return type hints on test methods.
Type hints are required in Python files; please add-> NonetosetUpand eachtest_*method to keep the new test module compliant. As per coding guidelines, type hints are required in Python code.♻️ Suggested update
- def setUp(self): + def setUp(self) -> None: self.retry = SmartRetry(max_retries=3, backoff_factor=0.01) - def test_success_first_try(self): + def test_success_first_try(self) -> None:
Anshgrover23
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@KrishnaShuk Kindly address all coderabbit comments.
|
will do it under a day. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@cortex/coordinator.py`:
- Around line 182-194: Add explicit type annotations to the inner helpers:
annotate run_cmd with a return type of subprocess.CompletedProcess[str] (i.e.,
def run_cmd() -> subprocess.CompletedProcess[str]:) and annotate status_callback
to return None (i.e., def status_callback(msg: str) -> None:). Ensure
subprocess.CompletedProcess is referenced via the subprocess module (or
imported) so the annotation resolves.
♻️ Duplicate comments (1)
cortex/utils/retry.py (1)
68-88: Preserve the original traceback when re-raising after retries.
Raisinglast_exceptionat the end loses the original traceback, which makes debugging harder after multiple retries. Consider storingexc_infoand re-raising with its original traceback.🛠️ Proposed fix
+import sys @@ - last_exception = None + last_exc_info = None last_result = None @@ - except Exception as e: - last_exception = e + except Exception as e: + last_exc_info = sys.exc_info() if not self._should_retry(str(e)): raise @@ - if last_exception: - raise last_exception + if last_exc_info: + _, exc, tb = last_exc_info + raise exc.with_traceback(tb) return last_result
🧹 Nitpick comments (1)
cortex/coordinator.py (1)
55-66: Document the newmax_retriesparameter in public docstrings.
Both__init__andfrom_planare public entry points; updating their docstrings keeps the API change discoverable.Also applies to: 86-98
|
PR passes all acceptance criteria. @Anshgrover23 PR is ready to be merged! |
|
Anshgrover23
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@KrishnaShuk added some comments, also there is missing test for ErrorCategory integration, add them.
Also, two things are missing from requirements:
- Different strategies for different error types
- Documentation with configuration examples
| from collections.abc import Callable | ||
| from typing import Any | ||
|
|
||
| from cortex.error_parser import ErrorCategory, ErrorParser |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ErrorParser class is unused, remove that.
| timeout=300, | ||
| stop_on_error=True, | ||
| progress_callback=progress_callback, | ||
| max_retries=5, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do not use magic numbers add a constant instead something like = DEFAULT_MAX_RETRIES = 5
| if not error_message: | ||
| # If no error message, assume it's a generic failure that might be transient | ||
| return True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is risky. A command could fail with returncode != 0 and empty stderr/stdout for permanent reasons. Consider logging a warning here.
| if hasattr(result, "stderr") and result.stderr: | ||
| error_msg = result.stderr | ||
| elif hasattr(result, "stdout") and result.stdout: | ||
| error_msg = result.stdout |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Stdout may contain success output, not error info. This could cause false retry classification. So fix it.
| attempt += 1 | ||
| sleep_time = self.backoff_factor * (2 ** (attempt - 1)) | ||
|
|
||
| msg = f"⚠️ Transient error detected. Retrying in {sleep_time}s... (Attempt {attempt}/{self.max_retries})" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On first retry, attempt=1, so message shows "Attempt 1/5" but it's actually the 2nd execution. Consider clarifying: "Retry 1/5" instead.
| def status_callback(msg: str) -> None: | ||
| self._log(msg) | ||
| # Also print to stdout so the user sees the retry happening | ||
| print(msg) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If progress_callback is also set, user may see duplicate output. Consider checking if a callback exists before printing.
| ): | ||
| if not isinstance(max_retries, int) or max_retries < 0: | ||
| raise ValueError("max_retries must be a non-negative integer") | ||
| if not isinstance(backoff_factor, (int, float)) or backoff_factor < 0: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should reject backoff_factor=0 as well since it defeats the purpose of backoff.



Related Issue
Closes #43
Summary
This PR implements a robust "Smart Retry" mechanism for the Cortex CLI to improve reliability during installations by automatically handling transient failures.
and updated the status callback to print visible warnings ("
AI Disclosure
Google antigravity AI coding assistant(Claude Opus 4.5 ) was used for framing better tests and creating the command file.
Demonstration
I have manipulated coordinator.py for producing below behaviour.
Screencast.from.2026-01-22.12-59-37.webm
Checklist
type(scope): descriptionor[scope] descriptionpytest tests/)Summary by CodeRabbit
✏️ Tip: You can customize this high-level summary in your review settings.