[runtime] Fix Python awaitable lost during checkpoint restore #509
When a Python async action fails mid-execution (e.g., LLM timeout), the job restarts from checkpoint. During restore, the Python awaitable (coroutine) reference is None because Python coroutines cannot be serialized. Previously this caused:

AttributeError: 'NoneType' object has no attribute 'send'

This fix:
- Detects when awaitable is None in PythonActionExecutor.callPythonAwaitable()
- Throws AwaitableLostException to signal the awaitable was lost
- PythonGeneratorActionTask catches this and re-executes the action from beginning
- Durable execution cache ensures already-completed calls are skipped
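The recovery flow described above can be sketched in plain Python. This is only an illustration, not the code in this PR: the real detection lives in the Java-side PythonActionExecutor.callPythonAwaitable(), and `Suspend`, `durable_call`, `run_action`, the `executed` list, and the dict-based cache are hypothetical stand-ins invented for this example.

```python
class AwaitableLostException(Exception):
    """Signals that the awaitable did not survive a checkpoint restore."""


class Suspend:
    """Minimal awaitable that yields control once (stand-in for waiting on an LLM)."""

    def __await__(self):
        yield


executed = []  # hypothetical log of the calls that actually ran


def durable_call(cache, key, fn):
    """Run fn once per key; later attempts reuse the cached result."""
    if key not in cache:
        executed.append(key)
        cache[key] = fn()
    return cache[key]


async def action(cache):
    first = durable_call(cache, "step1", lambda: 1)
    await Suspend()  # a checkpoint may be taken while suspended here
    second = durable_call(cache, "step2", lambda: 2)
    return first + second


def call_python_awaitable(awaitable):
    """Advance the awaitable one step, or signal explicitly that it was lost."""
    if awaitable is None:
        raise AwaitableLostException("awaitable lost during checkpoint restore")
    return awaitable.send(None)


def run_action(awaitable, cache):
    """Drive the action to completion, restarting it if the awaitable is lost."""
    while True:
        try:
            call_python_awaitable(awaitable)
        except AwaitableLostException:
            awaitable = action(cache)  # re-execute the action from the beginning
        except StopIteration as done:
            return done.value


cache = {}
in_flight = action(cache)
in_flight.send(None)  # runs step1, then suspends
in_flight.close()     # the job fails here and the coroutine is discarded

# Simulated restore: the cache survived, the awaitable did not.
result = run_action(None, cache)
print(result, executed)
```

In this sketch `step1` runs only once even though the action body is entered twice, which is the role the durable execution cache plays in the fix.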
Hey @Sxnan, wondering if there’s a way to re-run the failed CI? It seems I don’t have access to trigger it. This looks like a pre-existing flaky test, not related to the PR.
@weiqingy, thanks for identifying this bug and submitting the fix! The issue you found is real and important.
Thanks for the quick review and changes, @Sxnan! It LGTM. |
@xintongsong Could you take a look at this PR? |
xintongsong left a comment
LGTM
[runtime] Fix Python awaitable lost during checkpoint restore Co-authored-by: sxnan <suxuannan95@gmail.com>
Ported to release-0.2 in 5d9fb77
Linked issue: #508
Purpose of change
Fix a crash that occurs when restoring a Python async action from checkpoint.
Root cause: Python coroutines cannot be serialized. When a checkpoint captures state while an async action is in progress (e.g., waiting for LLM response), the coroutine object is lost. On restore, the awaitable reference is None, causing:

AttributeError: 'NoneType' object has no attribute 'send'
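A minimal sketch of the root cause, assuming the standard `pickle` module as a stand-in for whatever serializer the checkpoint actually uses (`call_llm` is a hypothetical placeholder for an async action):

```python
import pickle


async def call_llm():
    """Hypothetical stand-in for an async action awaiting an LLM response."""
    return "response"


coro = call_llm()  # a live coroutine object, not yet started
pickle_error = None
try:
    pickle.dumps(coro)
except TypeError as exc:
    # Coroutine objects carry interpreter frame state and cannot be pickled.
    pickle_error = exc
finally:
    coro.close()  # avoid a "coroutine was never awaited" warning

# After restore the reference is None, so driving it fails.
restored_awaitable = None
send_error = None
try:
    restored_awaitable.send(None)
except AttributeError as exc:
    send_error = exc

print(type(pickle_error).__name__)
print(send_error)
```

The second failure is the same AttributeError quoted above: nothing checks that the awaitable survived the restore before calling `send` on it.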
Fix:
- Detect None awaitable in PythonActionExecutor.callPythonAwaitable()
- Throw AwaitableLostException to signal the awaitable was lost
- PythonGeneratorActionTask catches this and re-executes the action from the beginning

Tests
react_agent_example.py, trigger LLM timeout, verify job recovers instead of crashing

API
No public API changes.
Documentation
- doc-needed
- doc-not-needed
- doc-included